Similar Articles
1.
Much of the world’s quantitative data reside in scattered web tables. For a meaningful role in Big Data analytics, the facts reported in these tables must be brought into a uniform framework. Based on a formalization of header-indexed tables, we proffer an algorithmic solution to end-to-end table processing for a large class of human-readable tables. The proposed algorithms transform header-indexed tables to a category table format that maps easily to a variety of industry-standard data stores for query processing. The algorithms segment table regions based on the unique indexing of the data region by header paths, classify table cells, and factor header category structures of two-dimensional as well as the less common multidimensional tables. Experimental evaluations substantiate the algorithmic approach to processing heterogeneous tables. As demonstrable results, the algorithms generate queryable relational database tables and semantic-web triple stores. Application of our algorithms to 400 web tables randomly selected from diverse sources shows that the algorithmic solution automates end-to-end table processing.
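The category-table output described here can be approximated in a few lines of pandas for the simple two-dimensional case: a table whose data region is uniquely indexed by header paths is flattened into one (category, value) row per cell. This is a minimal sketch with made-up column names, not the authors' segmentation or cell-classification algorithms.

```python
import pandas as pd

# A small header-indexed table: two stacked column-header rows (Year, Quarter)
# and a row header (Region). Each data cell is uniquely indexed by its header
# paths, which is the property the segmentation step relies on.
columns = pd.MultiIndex.from_product([[2020, 2021], ["Q1", "Q2"]],
                                     names=["Year", "Quarter"])
table = pd.DataFrame([[10, 12, 11, 14],
                      [20, 22, 21, 24]],
                     index=pd.Index(["North", "South"], name="Region"),
                     columns=columns)

# Factor the header categories out into ordinary columns: one row per
# (Region, Year, Quarter, Value). This "category table" maps directly to a
# relational table or to one RDF triple per data cell.
category_table = (table.stack(["Year", "Quarter"])
                       .rename("Value")
                       .reset_index())
print(category_table)
```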

2.
In documents, tables are important structured objects that present statistical and relational information. In this paper, we present a robust system which is capable of detecting tables in free-style online ink notes and extracting their structure so that they can be further edited in multiple ways. First, the primitive structure of tables, i.e., candidates for ruling lines and table bounding boxes, is detected among drawing strokes. Second, the logical structure of tables is determined by normalizing the table skeletons, identifying the skeleton structure, and extracting the cell contents. The detection process is organized like a decision tree so that invalid candidates can be ruled out quickly. Experimental results suggest that our system is robust and accurate in dealing with tables that have complex structure or are drawn in complex situations.

3.
In this paper, we present a temporal web data model designed for warehousing historical data from the World Wide Web (WWW). As the Web is now populated with a large volume of information, it has become necessary to capture selected portions of web information in a data warehouse that supports further information processing such as data extraction, data classification, and data mining. Nevertheless, due to the unstructured and dynamic nature of the Web, the traditional relational model and its temporal variants cannot be used to build such a data warehouse. In this paper, we therefore propose a temporal web data model that represents web documents and their connectivities in the form of temporal web tables. To represent web data that evolve with time, a visible time interval is associated with each web document. To manipulate temporal web tables, we have defined a set of web operators with capabilities ranging from extracting WWW information into web tables to merging information from different web tables. We further illustrate the use of our temporal web data model using some realistic motivating examples.
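As a rough illustration of the model (not the paper's actual operator algebra), a temporal web table can be sketched as a collection of tuples that each carry a visible time interval, plus a time-slice selection and a duplicate-free merge. The class and function names below are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass(frozen=True)
class WebTuple:
    """A web document (URL plus extracted fields) with a visible time interval."""
    url: str
    fields: tuple          # extracted attribute values
    visible_from: date
    visible_to: date

# A temporal web table is simply a collection of such tuples.
WebTable = List[WebTuple]

def select_at(table: WebTable, t: date) -> WebTable:
    """Web select operator: tuples whose visible interval contains time t."""
    return [w for w in table if w.visible_from <= t <= w.visible_to]

def merge(a: WebTable, b: WebTable) -> WebTable:
    """Web merge operator: union of two web tables, dropping exact duplicates."""
    return list({*a, *b})

# Example usage
t1 = [WebTuple("http://example.org/a", ("News",), date(2001, 1, 1), date(2001, 6, 30))]
t2 = [WebTuple("http://example.org/b", ("Sports",), date(2001, 3, 1), date(2001, 12, 31))]
print(select_at(merge(t1, t2), date(2001, 4, 1)))
```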

4.
Identifying and extracting relational information from Web lists
A large amount of relational information resides in a wide variety of Web lists, yet it is hard to find with current search engines. This paper proposes a method based on semantics and data features for identifying and extracting relational information from Web lists. We first build a model describing the desired relational information, then locate lists on the Web and estimate whether they contain that information; when the estimate is sufficiently high, the relational information is extracted from the list.
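A toy sketch of the idea, under assumed attribute patterns: the "model" of the wanted relation is a set of per-attribute regular expressions, each Web list is scored by how many of its items match the whole model, and tuples are extracted only when that estimate is high enough. The patterns and threshold are invented for illustration.

```python
import re

# A hypothetical model of the wanted relation: one pattern per attribute.
MODEL = {
    "name":  re.compile(r"^[A-Z][a-zA-Z .]+"),
    "phone": re.compile(r"\d{3}-\d{4}"),
}

def score(list_items):
    """Estimate how well a list matches the model: the fraction of items
    in which every attribute pattern is found."""
    hits = sum(all(p.search(item) for p in MODEL.values()) for item in list_items)
    return hits / len(list_items)

def extract(list_items, threshold=0.6):
    """Extract relation tuples only when the estimate is high enough."""
    if score(list_items) < threshold:
        return []
    return [{a: p.search(item).group(0)
             for a, p in MODEL.items() if p.search(item)}
            for item in list_items]

items = ["Alice Smith 555-1234", "Bob Jones 555-9876", "About this site"]
print(score(items))
print(extract(items))
```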

5.
Aggregate keyword search on large relational databases
Keyword search has been recently extended to relational databases to retrieve information from text-rich attributes. However, all the existing methods focus on finding individual tuples matching a set of query keywords from one table or the join of multiple tables. In this paper, we motivate a novel problem of aggregate keyword search: finding minimal group-bys covering a set of query keywords well, which is useful in many applications. We develop two interesting approaches to tackle the problem. We further extend our methods to allow partial matches and matches using a keyword ontology. An extensive empirical evaluation using both real data sets and synthetic data sets is reported to verify the effectiveness of aggregate keyword search and the efficiency of our methods.
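The core notion of a minimal group-by covering the query keywords can be sketched for an in-memory table: group-by attribute sets are enumerated from smallest to largest, and a group qualifies when its tuples jointly contain every keyword. This is a brute-force illustration, not the paper's efficient algorithms or its partial-match and ontology extensions.

```python
from itertools import combinations

def aggregate_keyword_search(rows, dims, text_attrs, keywords):
    """Find minimal group-bys whose groups jointly cover every query keyword.

    rows: list of dicts; dims: candidate group-by attributes;
    text_attrs: text-rich attributes searched for keywords.
    Returns (group_by_attrs, group_key) pairs for the smallest covering group-bys.
    """
    keywords = {k.lower() for k in keywords}
    answers = []
    for size in range(1, len(dims) + 1):
        for attrs in combinations(dims, size):
            groups = {}
            for r in rows:
                key = tuple(r[a] for a in attrs)
                text = " ".join(str(r[a]) for a in text_attrs).lower()
                groups.setdefault(key, set()).update(k for k in keywords if k in text)
            answers += [(attrs, key) for key, hits in groups.items() if hits == keywords]
        if answers:          # stop at the smallest covering group-bys
            return answers
    return answers

rows = [
    {"city": "Rome",  "year": 2020, "review": "great pizza"},
    {"city": "Rome",  "year": 2020, "review": "lovely museum"},
    {"city": "Paris", "year": 2020, "review": "great museum"},
]
print(aggregate_keyword_search(rows, ["city", "year"], ["review"], ["pizza", "museum"]))
```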

6.
7.
Sets and bags are closely related structures and have been studied in relational databases. A bag differs from a set in that it is sensitive to the number of times an element occurs, while a set is not. In this paper, we introduce the concept of a web bag in the context of a web warehouse called Whoweda (Warehouse Of Web Data) which we are currently building. Informally, a web bag is a web table that allows multiple occurrences of identical web tuples. Web bags help to discover useful knowledge from a web table, such as visible documents (or web sites), luminous documents, and luminous paths. In this paper, we perform a cost-benefit analysis with respect to the storage, transmission, and operational cost of web bags and discuss the issues and implications of materializing web bags as opposed to web tables containing distinct web tuples. We have computed analytically the upper and lower bounds for the parameters that affect the cost of materializing web bags.

8.
We consider verification of programs manipulating dynamic linked data structures such as various forms of singly and doubly-linked lists or trees. We consider important properties for this kind of system, such as the absence of null-pointer dereferences, absence of garbage, shape properties, etc. We develop a verification method based on a novel use of tree automata to represent heap configurations. A heap is split into several "separated" parts such that each of them can be represented by a tree automaton. The automata can refer to each other, allowing the different parts of the heap to mutually refer to their boundaries. Moreover, we allow for a hierarchical representation of heaps by allowing alphabets of the tree automata to contain other, nested tree automata. Program instructions can be easily encoded as operations on our representation structure. This allows verification of programs based on symbolic state-space exploration together with refinable abstraction within the so-called abstract regular tree model checking framework. A motivation for the approach is to combine advantages of automata-based approaches (higher generality and flexibility of the abstraction) with some advantages of separation-logic-based approaches (efficiency). We have implemented our approach and tested it successfully on multiple non-trivial case studies.

9.
We describe a novel technique for the simultaneous visualization of multiple scalar fields, e.g. representing the members of an ensemble, based on their contour trees. Using tree alignments, a graph-theoretic concept similar to edit distance mappings, we identify commonalities across multiple contour trees and leverage these to obtain a layout that can represent all trees simultaneously in an easy-to-interpret, minimally-cluttered manner. We describe a heuristic algorithm to compute tree alignments for a given similarity metric, and give an algorithm to compute a joint layout of the resulting aligned contour trees. We apply our approach to the visualization of scalar field ensembles, discuss basic visualization and interaction possibilities, and demonstrate results on several analytic and real-world examples.

10.
This paper exposes problems of the commonly used technique of splitting the available data into training, validation, and test sets that are held fixed, warns about drawing too-strong conclusions from such static splits, and shows potential pitfalls of ignoring variability across splits. Using a bootstrap or resampling method, we compare the uncertainty in the solution stemming from the data splitting with neural-network-specific uncertainties (parameter initialization, choice of number of hidden units, etc.). We present two results on data from the New York Stock Exchange. First, the variation due to different resamplings is significantly larger than the variation due to different network conditions. This result implies that it is important not to over-interpret a model (or an ensemble of models) estimated on one specific split of the data. Second, on each split, the neural-network solution with early stopping is very close to a linear model; no significant nonlinearities are extracted.
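The comparison can be reproduced in miniature on synthetic data (not the NYSE series): hold the network seed fixed while resampling the split, then hold the split fixed while re-initializing the network, and compare the spread of test errors. scikit-learn's MLPRegressor stands in for the original networks; all data and parameters below are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + 0.1 * rng.normal(size=400)

def test_error(split_seed, init_seed):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=split_seed)
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000,
                       random_state=init_seed)
    net.fit(X_tr, y_tr)
    return mean_squared_error(y_te, net.predict(X_te))

# Variation across resampled splits (fixed initialization) ...
across_splits = [test_error(split_seed=s, init_seed=0) for s in range(10)]
# ... versus variation across network initializations (fixed split).
across_inits = [test_error(split_seed=0, init_seed=s) for s in range(10)]

print("std across splits:", np.std(across_splits))
print("std across inits: ", np.std(across_inits))
```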

11.
Dynamic web sites commonly return information in the form of lists and tables. Hand-crafting an extraction program for a specific template is straightforward but time-consuming, so it is desirable to automatically generate template extraction programs from examples of lists and tables in HTML documents. Supervised approaches have been shown to achieve high accuracy, but they require manual labelling of training examples, which is also time-consuming. Fully unsupervised approaches, which extract rows and columns by detecting regularities in the data, cannot provide sufficient accuracy for practical domains. We describe a novel technique, Post-supervised Learning, which exploits unsupervised learning to avoid the need for training examples while minimally involving the user to achieve high accuracy. We have developed unsupervised algorithms to extract the number of rows and adopted a dynamic programming algorithm for extracting columns. Our method achieves high performance with minimal user input compared to fully supervised techniques.

12.
Starting from fuzzy binary data represented as tables in a fuzzy relational database, in this paper we use fuzzy formal concept analysis to reduce the size of the tables, keeping only the minimal rows in each table without losing knowledge (i.e., the association rules extracted from the reduced databases are identical at a given precision level). More specifically, we develop a fuzzy extension of a previously proposed algorithm for crisp data reduction without loss of knowledge. The fuzzy Galois connection based on the Lukasiewicz implication is mainly used in the definition of the closure operator according to a precision level, which makes data reduction sensitive to the variation of this precision level.
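A minimal numeric sketch of the underlying machinery, assuming a tiny invented fuzzy context: the Lukasiewicz implication and the pair of fuzzy derivation operators that form the Galois connection, whose composition gives the closure operator. The paper's reduction algorithm and the precision-level parameter are omitted.

```python
import numpy as np

def luka_impl(a, b):
    """Lukasiewicz implication I(a, b) = min(1, 1 - a + b)."""
    return np.minimum(1.0, 1.0 - a + b)

# A small fuzzy context: rows are objects, columns are attributes,
# entries are membership degrees taken from the fuzzy relational table.
R = np.array([[1.0, 0.8, 0.2],
              [1.0, 0.8, 0.2],   # same profile as row 0 -> redundant row
              [0.4, 1.0, 0.6]])

def up(A):
    """Fuzzy derivation: degree to which every object in A has each attribute."""
    return np.min(luka_impl(A[:, None], R), axis=0)

def down(B):
    """Fuzzy derivation: degree to which each object has every attribute in B."""
    return np.min(luka_impl(B[None, :], R), axis=1)

# Closure of the fuzzy set containing only object 0: object 1 enters the
# closure with degree 1, so keeping one of the two identical rows suffices.
A = np.array([1.0, 0.0, 0.0])
print(down(up(A)))
```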

13.
The analysis of large volumes of unordered multidimensional data is a problem confronted by scientists and data analysts every day. Often, it involves searching for data alignments that emerge as well-defined structures or geometric patterns in datasets. For example, straight lines, circles, and ellipses represent meaningful structures in data collected from electron backscatter diffraction, particle accelerators, and clonogenic assays. Also, customers with similar behavior describe linear correlations in e-commerce databases. We describe a general approach for detecting data alignments in large unordered noisy multidimensional datasets. In contrast to classical techniques such as the Hough transform, which are designed for detecting a specific type of alignment on a given type of input, our approach is independent of the geometric properties of the alignments to be detected, as well as of the type of input data. Thus, it allows concurrent detection of multiple kinds of data alignments in datasets containing multiple types of data. Given its general nature, optimizations developed for our technique immediately benefit all its applications, regardless of the type of input data.
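For contrast with the classical baseline the paper generalizes, here is a standard Hough voting scheme for straight lines in 2-D point data; the paper's contribution is precisely to remove this dependence on one fixed alignment type and input type. All parameters and data below are illustrative.

```python
import numpy as np

def hough_lines(points, n_theta=180, n_rho=200):
    """Classical Hough voting for straight lines (rho = x*cos(theta) + y*sin(theta))."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    max_rho = np.max(np.hypot(points[:, 0], points[:, 1]))
    rho_edges = np.linspace(-max_rho, max_rho, n_rho + 1)
    votes = np.zeros((n_theta, n_rho), dtype=int)
    for x, y in points:
        rho = x * np.cos(thetas) + y * np.sin(thetas)   # one sinusoid per point
        bins = np.clip(np.digitize(rho, rho_edges) - 1, 0, n_rho - 1)
        votes[np.arange(n_theta), bins] += 1
    t, r = np.unravel_index(np.argmax(votes), votes.shape)
    return thetas[t], 0.5 * (rho_edges[r] + rho_edges[r + 1]), votes.max()

# Noisy points on the line y = x plus background clutter.
rng = np.random.default_rng(1)
line = np.column_stack([np.linspace(0, 10, 50),
                        np.linspace(0, 10, 50)]) + 0.05 * rng.normal(size=(50, 2))
noise = rng.uniform(0, 10, size=(50, 2))
theta, rho, support = hough_lines(np.vstack([line, noise]))
print(f"theta={theta:.2f} rad, rho={rho:.2f}, votes={support}")
```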

14.
The paper discusses issues of rule-based data transformation from arbitrary spreadsheet tables to a canonical (relational) form. We present a novel table object model and a rule-based language for table analysis and interpretation. The model is intended to represent the physical (cellular) and logical (semantic) structure of an arbitrary table in the transformation process. The language allows this process to be expressed as consecutive steps of table understanding, i.e., recovering implicit semantics. Both are implemented in our tool for spreadsheet data canonicalization. The presented case study demonstrates the use of the tool for developing a task-specific rule-set to convert data from arbitrary tables of the same genre (government statistical websites) to flat file databases. The performance evaluation confirms the applicability of the implemented rule-set in accomplishing the stated objectives of the application.
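A toy stand-in for one step of such a rule-set (not the paper's table object model or rule language): a single hard-coded rule that treats the first row as column labels and the first column as row labels, and emits canonical (row, column, value) records while skipping non-numeric cells.

```python
def canonicalize(grid):
    """One toy rule: first row holds column labels, first column holds row
    labels, remaining numeric cells become canonical flat-file records."""
    header = grid[0][1:]
    records = []
    for row in grid[1:]:
        row_label, cells = row[0], row[1:]
        for col_label, cell in zip(header, cells):
            try:
                value = float(cell)
            except (TypeError, ValueError):
                continue                      # skip notes, footnotes, blanks
            records.append({"row": row_label, "column": col_label, "value": value})
    return records

grid = [["Region", "2020", "2021"],
        ["North",  "10",   "12"],
        ["South",  "20",   "n/a"]]
for rec in canonicalize(grid):
    print(rec)
```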

15.
During layout analysis, tables are sometimes misclassified as figures and figures as tables. To avoid the erroneous recognition results produced by such misclassifications, this paper proposes a method for distinguishing tables from figures based on information about ruling lines and table cells. Drawing on the structural characteristics of tables, the method formulates a set of constraints that the ruling lines and table cells, as essential components of a table, must satisfy, and distinguishes tables from figures by verifying whether each constraint holds. Experiments show that the method effectively separates the vast majority of tables from figures and greatly reduces the misclassification rate between the two.

16.
Transduction is an inference mechanism adopted by several classification algorithms that are capable of exploiting both labeled and unlabeled data and make predictions only for the given set of unlabeled data. Several transductive learning methods have been proposed in the literature to learn transductive classifiers from examples represented as rows of a classical double-entry table (or relational table). In this work we consider the case of examples represented as a set of multiple tables of a relational database and propose a new relational classification algorithm, named TRANSC, that works in a transductive setting and employs a probabilistic approach to classification. Knowledge of the data model, i.e., foreign keys, is used to guide the search process. The transductive learning strategy iterates over a k-NN-based re-classification of labeled and unlabeled examples in order to identify borderline examples, and uses the relational probabilistic classifier Mr-SBC to bootstrap the transductive algorithm. Experimental results confirm that TRANSC outperforms its inductive counterpart (Mr-SBC).
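A single-table sketch of the transductive loop, with scikit-learn's Gaussian Naive Bayes standing in for Mr-SBC as the bootstrap classifier; TRANSC itself works over multiple joined tables and exploits foreign-key knowledge, which is not modeled here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
labeled = np.arange(60)            # examples with known labels
unlabeled = np.arange(60, 300)     # examples to be predicted transductively

# Bootstrap: an inductive probabilistic classifier (Naive Bayes here,
# standing in for Mr-SBC) gives the unlabeled examples initial labels.
pseudo = GaussianNB().fit(X[labeled], y[labeled]).predict(X[unlabeled])

# Transductive refinement: repeatedly re-classify every unlabeled example
# with k-NN computed over labeled plus currently-labeled unlabeled examples.
for _ in range(10):
    knn = KNeighborsClassifier(n_neighbors=5).fit(
        np.vstack([X[labeled], X[unlabeled]]),
        np.concatenate([y[labeled], pseudo]))
    new_pseudo = knn.predict(X[unlabeled])
    if np.array_equal(new_pseudo, pseudo):   # fixed point reached
        break
    pseudo = new_pseudo

print("transductive accuracy:", (pseudo == y[unlabeled]).mean())
```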

17.
陈吉荣  乐嘉锦 《计算机应用》2013,33(9):2486-2489
To address the two main problems Sqoop exhibits when importing large tables, namely instability and low efficiency, a new MapReduce-based programming model for large-table import was designed and implemented. The model's splitting algorithm for a large table is as follows: the total number of records is divided by the number of mappers to obtain a stride, which yields the starting row and interval length (equal to the stride) of the SQL query for each split, so that every mapper is assigned exactly the same amount of import work. The model's map strategy is as follows: the key of the key-value pair entering the map function is the SQL statement corresponding to a split, and the query itself is executed inside the map function, so that each mapper calls the map function only once. Comparative experiments show that two large tables with the same number of records take essentially the same time to import regardless of how their record ranges are distributed, and that importing the same table with different split columns also takes the same time; moreover, for the same large table, the model imports significantly more efficiently than Sqoop.
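The splitting rule described in the abstract is easy to express directly. The sketch below (with a hypothetical table and row-number column) computes the stride from the total record count and the mapper count and emits one SQL range statement per split; in the described model each statement becomes the key handed to a single map() call.

```python
def make_splits(total_records, num_mappers, table="big_table", row_col="row_num"):
    """Stride-based split: every mapper gets a SQL range of identical length."""
    stride = total_records // num_mappers          # interval length per split
    splits = []
    for i in range(num_mappers):
        start = i * stride + 1
        # The last split absorbs the remainder so no record is dropped.
        end = total_records if i == num_mappers - 1 else start + stride - 1
        splits.append(f"SELECT * FROM {table} "
                      f"WHERE {row_col} BETWEEN {start} AND {end}")
    return splits

# Each SQL statement becomes the key passed to one map() call, which then
# runs the query and writes the retrieved rows to the distributed file system.
for sql in make_splits(total_records=1_000_000, num_mappers=4):
    print(sql)
```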

18.
Euler diagrams are a popular technique to visualize set-typed data. However, creating diagrams using simple shapes remains a challenging problem for many complex, real-life datasets. To solve this, we propose RectEuler: a flexible, fully-automatic method using rectangles to create Euler-like diagrams. We use an efficient mixed-integer optimization scheme to place set labels and element representatives (e.g., text or images) in conjunction with rectangles describing the sets. By defining appropriate constraints, we adhere to well-formedness properties and aesthetic considerations. If a diagram for a dataset cannot be created within a reasonable time, or at all, we iteratively split the diagram into multiple components until a drawable solution is found. Redundant encoding of set membership using dots and set lines improves the readability of the diagram. Our web tool lets users see how the layout changes throughout the optimization process and provides interactive explanations. For evaluation, we perform quantitative and qualitative analysis across different datasets and compare our method to state-of-the-art Euler diagram generation methods.

19.
Classification trees with bivariate splits
We extend the recursive partitioning approach to classifier learning to use more complex types of split at each decision node. The new split types we permit are bivariate and can thus be interpreted visually in plots and tables. In order to find optimal splits of these new types, a new split criterion is introduced that allows the development of divide-and-conquer type algorithms. Two experiments are presented in which the bivariate trees, both with the Gini split criterion and with the new split criterion, are compared to a traditional tree-growing procedure. With the Gini criterion, the bivariate trees show a slight improvement in predictive accuracy and a considerable improvement in tree size over univariate trees. Under the new split criterion, accuracy is also improved, but there is no consistent improvement in tree size. (Much of this work was completed while the author was an employee of AT&T Bell Laboratories.)
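A brute-force sketch of scoring bivariate splits with the Gini criterion (not the authors' new criterion or their divide-and-conquer search): candidate splits of the form x_i - x_j <= t are enumerated and the feature pair and threshold with the lowest weighted impurity are returned. Data and split form are illustrative.

```python
import numpy as np
from itertools import combinations

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_bivariate_split(X, y):
    """Search splits of the form x[:, i] - x[:, j] <= t and score them with
    the weighted Gini criterion; return the best (i, j, t, score)."""
    best = (None, None, None, np.inf)
    for i, j in combinations(range(X.shape[1]), 2):
        z = X[:, i] - X[:, j]                     # the bivariate split variable
        for t in np.unique(z):
            left, right = y[z <= t], y[z > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[3]:
                best = (i, j, t, score)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > X[:, 1]).astype(int)               # class depends on two features jointly
print(best_bivariate_split(X, y))                 # expect features (0, 1), t near 0
```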

20.
Feature selection is an essential data processing step to remove irrelevant and redundant attributes for shorter learning time, better accuracy, and better comprehensibility. A number of algorithms have been proposed in both the data mining and machine learning areas. These algorithms are usually used in a single-table environment, where data are stored in one relational table or one flat file. They are not suitable for a multi-relational environment, where data are stored in multiple tables joined to one another by semantic relationships. To address this problem, in this article we propose a novel approach called FARS to conduct both Feature And Relation Selection for efficient multi-relational classification. Through this approach, we not only extend the traditional feature selection method to select relevant features from multiple relations, but also develop a new method to reconstruct the multi-relational database schema and eliminate irrelevant tables to improve classification performance further. The results of experiments conducted on both real and synthetic databases show that FARS can effectively choose a small set of relevant features, thereby enhancing classification efficiency and prediction accuracy significantly.
