Similar literature
 Found 20 similar documents; search took 15 ms
1.
A large number of web pages contain data structured in the form of "lists". Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be challenging. The lists are manually generated and hence need not have well-defined templates; they have inconsistent delimiters (if any) and often have missing information. We propose a novel technique for extracting tables from lists. The technique is domain independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments. In particular, we exploit a corpus of HTML tables, also extracted from the web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table's quality. We conducted an extensive experimental study using both real web lists and lists derived from tables on the web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique to a large sample of about 100,000 lists crawled from the web. The analysis of the extracted tables has led us to believe that there are likely to be tens of millions of useful and queryable relational tables extractable from lists on the web.

2.
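Ontology technology is key to exchanging data at the semantic level, and representing the many kinds of existing data as ontology knowledge bases is an important problem. To address it, this paper builds on a correctness-satisfying algorithm that converts a relational schema into a semantically extended ER model, and proposes an algorithm that translates databases into OWL DL ontologies through database reverse engineering. It shows that the resulting translation satisfies correctness and validates the algorithm with an experimental implementation.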

3.
Transformation of sedimentary organic matter (OM) to hydrocarbons is best modeled by assuming that the total reaction suite consists of parallel degradations of ‘i’ hypothetical components following the Arrhenius equation and first-order kinetics. A kerogen can be defined by characterizing each constituent component by its activation energy (Ei) and initial potential (Xio), together with a single frequency factor (A). We present a user-friendly Lotus 1-2-3 program to determine A and the Ei and Xio distribution of OMs using a multiple linear regression utility and programmed macros. Rock Eval (RE) S2 curves at three heating rates are required. Equally spaced time/temperature and peak height data for S2 curves of ‘n’ temperature steps, in increasing order of heating rate, are the inputs for the program. The fraction of hydrocarbon generated (f) from 19 hypothetical components of Ei 30, 32, 34…78 kcal/mol for ‘n’ temperature/time steps, using the frequency factor (A) value and assuming Xios = 1, is calculated and set up in an ‘n×19’ matrix (matrix M). The fraction of total hydrocarbon generated (f) at ‘n’ temperature steps, obtained from the observed peak heights, is set up in an ‘n×1’ matrix (matrix L). The program suitably reduces matrix M to an ‘n×k’ matrix (matrix N), where ‘k’ is a variable, facilitating matrix inversion. Regressing matrix N against matrix L gives the Xios for ‘k’ Ei components along with a standard error (ERR) of the Y estimates and R². Xios and A are then optimized iteratively by varying A and selecting the solution associated with the lowest ERR value. Results of applying the program to data sets of two widely different types of samples from Indian basins are shown. They match the results obtained from more sophisticated proprietary software.
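
The construction of matrix M described in this abstract can be illustrated with a short Python sketch. It is not the authors' Lotus 1-2-3 program. It assumes a constant heating rate and uses the standard first-order result f = 1 − exp(−∫ A·exp(−E/RT) dt), with the integral approximated by the trapezoid rule. The function names, the example energies, and the units (A in 1/s, E in kcal/mol, temperature in K) are assumptions made for illustration.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def fraction_generated(A, E, temps_K, heating_rate_K_per_s):
    """Cumulative converted fraction of one component at each temperature step.

    Assumes first-order kinetics, an Arrhenius rate constant, a constant
    heating rate, and an initial potential of 1 (hypothetical helper).
    """
    fractions = [0.0]
    integral = 0.0
    for t_prev, t_next in zip(temps_K, temps_K[1:]):
        k_prev = A * math.exp(-E / (R_KCAL * t_prev))
        k_next = A * math.exp(-E / (R_KCAL * t_next))
        dt = (t_next - t_prev) / heating_rate_K_per_s
        # Trapezoid rule for the time integral of the rate constant.
        integral += 0.5 * (k_prev + k_next) * dt
        fractions.append(1.0 - math.exp(-integral))
    return fractions


def build_matrix_m(A, energies, temps_K, heating_rate_K_per_s):
    """Return an n x len(energies) matrix: one row per step, one column per Ei."""
    columns = [
        fraction_generated(A, E, temps_K, heating_rate_K_per_s) for E in energies
    ]
    return [list(row) for row in zip(*columns)]


# Illustrative inputs only: 1 K steps from 573 K to 873 K at 25 K/min.
temps = [573.0 + i for i in range(301)]
matrix_m = build_matrix_m(1e14, [50.0, 54.0, 58.0], temps, 25.0 / 60.0)
```

Regressing such a matrix against the observed fractions (matrix L) would then estimate the initial potentials, which is the step the abstract assigns to the spreadsheet's regression utility.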

4.
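Storing XML in relational databases is an important problem in XML research. Building on a survey of several mapping methods, this paper proposes an approach that parses multiple similar XML documents, generates a relational schema for each according to the mapping rules, and derives a single integrated relational schema from them. It then creates a relational database and, based on the mapping, extracts the XML document data and stores it in that database. The approach preserves the data of the XML documents in a fairly compact structure, and its most notable feature is that it does not need the documents' schema information (DTD, XML Schema). A concrete experimental result illustrates the effectiveness of the approach.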

5.
6.
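Converting unstructured data in Word document tables with XML   Cited by: 1 (self-citations: 0, citations by others: 1)
Unstructured data is widespread in application systems, so managing it and converting it into structured data is very important. XML is well suited to data storage and data exchange. This paper uses Microsoft Visual Studio 2005 to develop an XML-based tool for converting unstructured data. The tool converts the text data in Microsoft Word tables into plain-text data files that can be imported into a database, and it can be used for similar unstructured-data conversion tasks.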

7.
In this paper we present a generic model for the automatic generation of basic multi-partite graphs obtained from collections of arbitrary input data following user indications. The paper also presents GraphGen, a tool that implements this model. The input data is a collection of complex objects, each composed of a set or list of heterogeneous elements. Our tool provides a simple interface for the user to specify the types of nodes that are relevant for the application domain in each case. The nodes and the relationships between them are derived from the input data by applying a set of derivation rules specified by the user. The resulting graph can be exported in the standard GraphML format so that it can be further processed with other graph management and mining systems. We end by giving some examples in real scenarios that show the usefulness of this model.

8.
Data streams are long, relatively unstructured sequences of characters that contain information such as electronic mail or a tape backup of various documents and reports created in an office. A conceptual framework is presented, using relational algebra and relational databases, within which data streams may be queried. As information is extracted from the data streams, it is put into a relational database that may be queried in the usual manner. The database schema evolves as the user's knowledge of the content of the data stream changes. Operators are defined in terms of relational algebra that can be used to extract data from a specially defined relation that contains all or part of the data stream. This approach to querying data streams permits the integration of unstructured data with structured data. The operators defined extend the functionality of relational algebra in much the same way that the join does relative to the basic operators select, project, union, difference, and Cartesian product.

9.
Temporal data mining remains an important research topic, since several application areas need knowledge from temporal data such as sequential patterns, similar time sequences, and cyclic and temporal association rules. Although there are many studies of temporal data mining, they do not deal with discovering knowledge from temporal interval data such as patient histories, purchaser histories, and web logs. We propose a new temporal data mining technique that extracts temporal interval relation rules from temporal interval data by using Allen's theory. It consists of a preprocessing algorithm designed for the generalization of temporal interval data and a temporal relation algorithm for mining temporal relation rules from the generalized temporal interval data. This technique can provide more useful knowledge than conventional data mining techniques.
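
Allen's interval theory, on which this technique relies, distinguishes thirteen possible relations between two intervals. The Python sketch below classifies a pair of intervals into one of them. It illustrates the relations only and is not the paper's mining algorithm. The function name and the representation of an interval as a `(start, end)` pair with `start < end` are assumptions.

```python
def allen_relation(a, b):
    """Return the Allen relation that interval a has to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must satisfy start < end")

    # Disjoint or touching intervals.
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"

    # Intervals sharing at least one endpoint.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"

    # Strict containment, then partial overlap.
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    return "overlaps" if a_start < b_start else "overlapped-by"
```

Applied to interval data such as hospital stays or purchase periods, a function like this would supply the pairwise relation labels from which relation rules could then be counted.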

10.
The tremendous success of the World Wide Web is offset by the effort needed to search for and find relevant information. For tabular structures embedded in HTML documents, typical keyword-based or link-analysis-based search fails. The Semantic Web relies on annotating resources such as documents by means of ontologies and aims to overcome the bottleneck of finding relevant information. Turning the current Web into a Semantic Web requires automatic approaches for annotation, since manual approaches will not scale in general. Most efforts have been devoted to the automatic generation of ontologies from text, but with quite limited success. Tabular structures require additional effort, mainly because understanding table contents requires comprehending the logical structure of the table on the one hand and its semantic interpretation on the other. The focus of this paper is on the automatic transformation and generation of semantic (F-Logic) frames from table-like structures. The presented work consists of a methodology, an accompanying implementation (called TARTAR) and a thorough evaluation. It is based on a grounded cognitive table model which is instantiated stepwise by the methodology. A typical application scenario is the automatic population of ontologies to enable query answering over arbitrary tables (e.g. HTML tables).

11.

We describe our winning solution to the 2017 Soccer Prediction Challenge, organized in conjunction with the MLJ special issue on Machine Learning for Soccer. The goal of the challenge was to predict the outcomes of future matches within a selected time frame from different leagues around the world. A dataset of over 200,000 past match outcomes was provided to the contestants. We experimented with both relational and feature-based methods to learn predictive models from the provided data. We employed relevant latent variables computable from the data, namely so-called pi-ratings and a rating based on the PageRank method. A method based on manually constructed features and the gradient boosted tree algorithm performed best on both the validation set and the challenge test set. We also discuss the validity of the assumption, underlying the RPS measure of prediction quality, that probability predictions on the three ordinal match outcomes should be monotone.
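
For reference, the ranked probability score (RPS) named in this abstract is usually defined for r ordered outcomes as RPS = 1/(r − 1) · Σ_{i=1}^{r−1} (Σ_{j=1}^{i} (p_j − e_j))², where p_j is the predicted probability of outcome j and e_j is 1 for the observed outcome and 0 otherwise. The Python sketch below computes it. The function name and the outcome ordering (home win, draw, away win) are assumptions; the abstract does not state how the challenge implemented the measure.

```python
def ranked_probability_score(probs, outcome_index):
    """Ranked probability score for one prediction over ordered outcomes.

    probs: predicted probabilities in outcome order, summing to 1.
    outcome_index: position of the outcome that actually occurred.
    Lower is better; 0 means a certain and correct prediction.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    if not 0 <= outcome_index < len(probs):
        raise ValueError("outcome_index out of range")

    r = len(probs)
    cumulative_diff = 0.0
    total = 0.0
    # Only the first r - 1 cumulative differences count; the last is always 0.
    for i in range(r - 1):
        observed = 1.0 if i == outcome_index else 0.0
        cumulative_diff += probs[i] - observed
        total += cumulative_diff ** 2
    return total / (r - 1)
```

Because the score compares cumulative distributions, probability placed on an outcome adjacent to the observed one is penalized less than probability placed on a distant one, which is why the ordering of the three match outcomes matters.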


12.
Temporal relational data model   Cited by: 3 (self-citations: 0, citations by others: 3)
This paper incorporates a temporal dimension into nested relations. It combines research in temporal databases and nested relations for managing temporal data in nontraditional database applications. A temporal data value is represented as a temporal atom; a temporal atom consists of two parts: a temporal set and a value. The temporal atom asserts that the value is valid over the time duration represented by its temporal set. The data model allows relations with arbitrary levels of nesting and can represent the histories of objects and their relationships. Temporal relational algebra and calculus languages are formulated and their equivalence is proved. Temporal relational algebra includes operations to manipulate temporal data and to restructure nested temporal relations. Additionally, we define operations to generate a power set of a relation, a set membership test, and a set inclusion test, which are all derived from the other operations of temporal relational algebra. To obtain a concise representation of temporal data (temporal reduction), collapsed versions of the set-theoretic operations are defined. Procedures to express collapsed operations by the regular operations of temporal relational algebra are included. The paper also develops procedures to completely flatten a nested temporal relation into an equivalent 1NF relation and back to its original form, thus providing a basis for the semantics of the collapsed operations by the traditional operations on 1NF relations.

13.
This paper approaches the relation classification problem within an information extraction framework using different machine learning strategies, from strictly supervised to weakly supervised. A number of learning algorithms are presented and empirically evaluated on a standard data set. We show that a supervised SVM classifier using various lexical and syntactic features can achieve competitive classification accuracy. Furthermore, a variety of weakly supervised learning algorithms can be applied to take advantage of large amounts of unlabeled data when labeling is expensive. Newly introduced random-subspace-based algorithms demonstrate their empirical advantage over competitors in the context of both active learning and bootstrapping.

14.
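Pei Song, Wu Tong. Microcomputer & Its Applications (《微型机与应用》), 2013, 32(17): 56-59
To extract meaningful data from the semi-structured XML data produced on enterprise production lines, this paper analyzes the characteristics of semi-structured XML data and of structured data in relational databases, as well as methods for storing semi-structured XML data in relational databases. For practical use, it proposes an extended Huffman prefix coding method that uniquely encodes the XML document tree, thereby mapping XML documents to a relational database. It also gives a longest-prefix matching strategy that supports data queries and improves query efficiency.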
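
The general idea of prefix-based tree labels with longest-prefix matching can be sketched in Python as follows. The abstract does not give the details of the extended Huffman prefix coding, so this sketch substitutes simple Dewey-style dotted labels, in which each node's label extends its parent's label. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def label_tree(root):
    """Map each element to a unique label that has its parent's label as a prefix."""
    labels = {}

    def visit(node, label):
        labels[label] = node.tag
        for position, child in enumerate(node, start=1):
            visit(child, f"{label}.{position}")

    visit(root, "1")
    return labels


def is_ancestor(ancestor_label, label):
    """True if ancestor_label labels a proper ancestor of the node with label."""
    return label.startswith(ancestor_label + ".")


def longest_prefix_match(label, stored_labels):
    """Return the longest stored label equal to label or to one of its ancestors."""
    parts = label.split(".")
    for end in range(len(parts), 0, -1):
        candidate = ".".join(parts[:end])
        if candidate in stored_labels:
            return candidate
    return None


# Hypothetical production-line document used only for illustration.
doc = ET.fromstring("<line><station><sensor/><sensor/></station><station/></line>")
node_labels = label_tree(doc)
```

Stored as a column in a relational table, such labels let ancestor and descendant queries be answered by string-prefix comparison instead of repeated joins, which is the kind of query support the abstract attributes to its matching strategy.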

15.
Web services offer a more reliable and efficient way to access online data than scraping web pages. However, interacting with web services to retrieve data often requires people to write a lot of code. Moreover, many web services return data in complex hierarchical structures that make it difficult for people to perform any further data manipulation. We developed Gneiss, a tool that extends the familiar spreadsheet metaphor to support using structured web service data. Gneiss lets users retrieve or stream arbitrary JSON data returned from web services to a spreadsheet using interaction techniques, without writing any code. It introduces a novel visualization that represents hierarchies in data using nested spreadsheet cells and allows users to easily reshape and regroup the extracted structured data. Data flow is two-way between the spreadsheet and the web services, enabling people to easily make a new web service call and retrieve new data by modifying spreadsheet cells. We report results from a user study showing that Gneiss helped spreadsheet users use and analyze structured data more efficiently than Excel and even outperform professional programmers writing code. We further use a set of examples to demonstrate our tool's ability to create reusable data extraction and manipulation programs that work with complex web service data.

16.
Rights protection for relational data   Cited by: 1 (self-citations: 0, citations by others: 1)
We introduce a solution for relational database content rights protection through watermarking. Rights protection for relational data is of ever-increasing interest, especially considering areas where sensitive, valuable content is to be outsourced. A good example is a data mining application, where data is sold in pieces to parties specialized in mining it. Different avenues are available, each with its own advantages and drawbacks. Enforcement by legal means is usually ineffective in preventing theft of copyrighted works, unless augmented by a digital counterpart, for example, watermarking. While being able to handle higher-level semantic constraints, such as classification preservation, our solution also addresses important attacks, such as subset selection and random and linear data changes. We introduce wmdb.*, a proof-of-concept implementation, and its application to real-life data, namely, in watermarking the outsourced Wal-Mart sales data that we have available at our institute.

17.
According to the soundness and completeness of information in databases, the expressive form and the semantics of incomplete information are discussed in this paper. On the basis of this discussion, the current studies on incomplete data in relational databases are reviewed. In order to represent stochastic uncertainty in the most general sense in the real world, probabilistic data are introduced into relational databases. An extended relational data model is presented to express and manipulate probabilistic data, and the operations of relational algebra based on the extended model are defined.

18.
Data incompleteness is one of the most important data quality problems in enterprise information systems. Most existing data imputation techniques just deduce approximate values for the incomplete attributes by means of specific data quality rules or mathematical methods. Unfortunately, the approximation may be far from the truth. Furthermore, when the observed data is inadequate, these techniques do not work well. The World Wide Web (WWW) has become the most important and most widely used information source. Several recent works have shown that using Web data can improve the quality of databases. In this paper, we propose a Web-based relational data imputation framework, which tries to automatically retrieve real values from the WWW for the incomplete attributes. We try to take full advantage of relations among different kinds of objects, based on the idea that the same kind of things must have the same kind of relations with their relatives in a specific world. Our proposed techniques consist of two automatic query formulation algorithms and one graph-based candidate extraction model. Several evaluations are carried out on two high-quality real datasets and one poor-quality real dataset to demonstrate the effectiveness of our approaches.

19.
Abstract: The paper is concerned with the creation of predictive models from data within the framework of the variable precision rough set model. It focuses on two aspects of model derivation: computing rules, which are in general uncertain, from information contained in probabilistic decision tables, and forming hierarchies of decision tables with the objective of reducing or eliminating decision boundaries in the resulting classifiers. A new technique for creating a linearly structured hierarchy of decision tables is introduced and compared to the tree-structured hierarchy. It is argued that the linearly structured hierarchy has significant advantages over the tree-structured hierarchy.

20.