Similar Documents
Found 20 similar documents (search time: 46 ms)
1.
We have established a preprocessing method for determining the meaningfulness of a table to allow for information extraction from tables on the Internet. A table offers a preeminent clue in text mining because it contains meaningful data displayed in rows and columns. However, tables are used on the Internet both for knowledge structuring and for document design. We were therefore interested in determining whether a table has meaningfulness related to the structural information provided at the abstraction level of the table head. Accordingly, we: 1) investigated the types of tables present in HTML documents; 2) established the features that distinguish meaningful tables from others; 3) constructed a training data set using the established features after filtering out obvious decorative tables; and 4) constructed a classification model using a decision tree. Based on these features, we set up heuristics for table-head extraction from meaningful tables, and obtained an F-measure of 95.0 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from meaningful tables.
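The pipeline in steps 2 to 4 can be sketched as follows. The specific features and the hand-written branch standing in for the trained decision tree are assumptions for illustration, not the paper's actual feature set or model:

```python
from html.parser import HTMLParser

class TableFeatureExtractor(HTMLParser):
    """Collect simple structural features from an HTML <table> fragment."""
    def __init__(self):
        super().__init__()
        self.rows = 0
        self.cells = 0
        self.header_cells = 0   # <th> count: decorative tables rarely use them
        self.nested_tables = 0  # layout tables often nest further tables
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth > 1:
                self.nested_tables += 1
        elif tag == "tr":
            self.rows += 1
        elif tag in ("td", "th"):
            self.cells += 1
            if tag == "th":
                self.header_cells += 1

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1

def table_features(html):
    parser = TableFeatureExtractor()
    parser.feed(html)
    return {"rows": parser.rows, "cells": parser.cells,
            "th": parser.header_cells, "nested": parser.nested_tables}

def is_meaningful(feats):
    # Stand-in for the trained decision tree: one hand-written branch.
    if feats["nested"] > 0:   # nesting usually signals page layout
        return False
    if feats["th"] > 0:       # explicit headers signal data tables
        return True
    return feats["rows"] >= 2 and feats["cells"] >= 4

feats = table_features(
    "<table><tr><th>city</th><th>pop</th></tr>"
    "<tr><td>Oslo</td><td>0.7M</td></tr></table>")
print(is_meaningful(feats))  # True
```

In a real system the branch would be replaced by a decision tree fitted on the labeled training set described in step 3.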

2.
Much of the world’s quantitative data reside in scattered web tables. For a meaningful role in Big Data analytics, the facts reported in these tables must be brought into a uniform framework. Based on a formalization of header-indexed tables, we proffer an algorithmic solution to end-to-end table processing for a large class of human-readable tables. The proposed algorithms transform header-indexed tables to a category table format that maps easily to a variety of industry-standard data stores for query processing. The algorithms segment table regions based on the unique indexing of the data region by header paths, classify table cells, and factor header category structures of two-dimensional as well as the less common multidimensional tables. Experimental evaluations substantiate the algorithmic approach to processing heterogeneous tables. As demonstrable results, the algorithms generate queryable relational database tables and semantic-web triple stores. Application of our algorithms to 400 web tables randomly selected from diverse sources shows that the algorithmic solution automates end-to-end table processing.
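For the simplest class of header-indexed tables (one header row, one header column), the transformation to a category-table format can be sketched as follows; the triple layout is a minimal stand-in for the paper's target format, and the sample grid is invented:

```python
def to_category_table(grid):
    """Flatten a header-indexed 2-D table into (row header, column header,
    value) triples, a minimal stand-in for the category-table format.
    Assumes the simplest table class: one header row on top and one
    header column on the left."""
    col_headers = grid[0][1:]
    triples = []
    for row in grid[1:]:
        row_header, values = row[0], row[1:]
        for col_header, value in zip(col_headers, values):
            triples.append((row_header, col_header, value))
    return triples

# Illustrative table, not from the paper's corpus.
grid = [
    ["",        "2021", "2022"],
    ["Revenue",  100,    120],
    ["Costs",     80,     90],
]
triples = to_category_table(grid)
print(triples[0])  # ('Revenue', '2021', 100)
```

Each triple maps directly onto a row of a relational table or a semantic-web triple, which is why this format loads easily into standard data stores.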

3.
Tabular data refers to data organized in a table with rows and columns. This format is widely used on the Web and within enterprise data repositories. Tables potentially contain rich semantic information that still needs to be interpreted. The process of extracting meaningful information from tabular data with respect to a semantic artefact, such as an ontology or a knowledge graph, is often referred to as Semantic Table Interpretation (STI) or Semantic Table Annotation. In this survey paper, we aim to provide a comprehensive and up-to-date review of the state of the art in the tasks and methods proposed so far to perform STI. First, we propose a new categorization that reflects the heterogeneity of table types one can encounter, revealing the different challenges that need to be addressed. Next, we define five major sub-tasks that STI deals with, even though the literature has mostly focused on three of them so far. We review and group the many proposed approaches into three macro families, and we discuss their performance and limitations with respect to the various datasets and benchmarks proposed by the community. Finally, we detail the remaining scientific barriers to truly automatic interpretation of any type of table found in the wild Web.

4.
Automatic Extraction of Drawing Information Using Cells and Feature Points (cited 2 times: 0 self-citations, 2 by others)
The title block and parts list of an engineering drawing are important data sources for centralized product data management. With the reuse of CAD data in mind, this paper proposes an effective method for extracting part information from engineering drawings. By analyzing the forms of title blocks and parts lists, the position and shape features of these tables are characterized from both the macroscopic layout and the microscopic structure. An automatic drawing-data extraction strategy based on cells and feature points is proposed, and the underlying algorithm and its implementation steps are described in detail. A practical program has been developed and applied in engineering projects.

5.
Constructing a Network Representation of HowNet Relations (cited 9 times: 2 self-citations, 7 by others)
This paper presents a network-style representation for HowNet relations and its implementation. By building three data tables (a concept table, a feature table, and a relation table) and establishing bidirectional many-to-many links between their records, all of HowNet's knowledge (concepts, features, and the various relations between them) can conveniently be integrated, laying a solid foundation for further HowNet-based information retrieval and knowledge reasoning.
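The three-table scheme can be sketched as follows; the identifiers and sample entries are invented for illustration and are not actual HowNet data:

```python
# Minimal sketch of the three-table scheme: a concept table, a feature
# table, and a relation table whose rows link records in the other two.
concepts = {1: "doctor", 2: "hospital"}
features = {10: "human", 11: "place"}
relations = [  # (concept_id, relation_name, feature_id)
    (1, "category", 10),
    (2, "category", 11),
    (1, "location", 11),   # a doctor works at a place
]

def features_of(cid):
    """Forward direction of the bidirectional link: concept to features."""
    return [(r, features[f]) for c, r, f in relations if c == cid]

def concepts_with(fid):
    """Reverse direction: feature to the concepts carrying it."""
    return [concepts[c] for c, _, f in relations if f == fid]

print(features_of(1))      # [('category', 'human'), ('location', 'place')]
print(concepts_with(11))   # ['hospital', 'doctor']
```

Because the relation table stores only record identifiers, both directions of lookup stay cheap, which is what makes the integrated knowledge convenient for retrieval and reasoning.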

6.
We present a method for structuring a document according to the information present in its different organizational tables: table of contents, tables of figures, etc. This method is based on a two-step approach that leverages functional and formal (layout-based) kinds of knowledge. The functional definition of organizational table, based on five properties, is used to provide a first solution, which is improved in a second step by automatically learning the form of the table of contents. We also report on the robustness and performance of the method and we illustrate its use in a real conversion case.

7.
Towards Ontology Generation from Tables (cited 3 times: 0 self-citations, 3 by others)
At the heart of today's information-explosion problems are issues involving semantics, mutual understanding, concept matching, and interoperability. Ontologies and the Semantic Web are offered as a potential solution, but creating ontologies for real-world knowledge is nontrivial. If we could automate the process, we could significantly improve our chances of making the Semantic Web a reality. While understanding natural language is difficult, tables and other structured information make it easier to interpret new items and relations. In this paper we introduce an approach to generating ontologies based on table analysis. We thus call our approach TANGO (Table ANalysis for Generating Ontologies). Based on conceptual modeling extraction techniques, TANGO attempts to (i) understand a table's structure and conceptual content; (ii) discover the constraints that hold between concepts extracted from the table; (iii) match the recognized concepts with ones from a more general specification of related concepts; and (iv) merge the resulting structure with other similar knowledge representations. TANGO is thus a formalized method of processing the format and content of tables that can serve to incrementally build a relevant reusable conceptual ontology.

8.
In documents, tables are important structured objects that present statistical and relational information. In this paper, we present a robust system that can detect tables in free-style online ink notes and extract their structure so that they can be further edited in multiple ways. First, the primitive structure of tables, i.e., candidate ruling lines and table bounding boxes, is detected among drawing strokes. Second, the logical structure of tables is determined by normalizing the table skeletons, identifying the skeleton structure, and extracting the cell contents. The detection process resembles a decision tree, so invalid candidates can be ruled out quickly. Experimental results suggest that our system is robust and accurate in dealing with tables that have complex structure or are drawn under complex conditions.

9.
Knowledge extraction from Web tables is an important way to acquire high-quality knowledge, with broad research significance and application value in knowledge graphs, Web mining, and related areas. Traditional Web-table knowledge extraction methods rely mainly on well-formed table structures and sufficient prior knowledge, but they break down when table structures are complex or prior knowledge is scarce. To address these problems, this paper fully exploits the structural characteristics of tables themselves and proposes a Web-table knowledge extraction method for large-scale data based on fast clustering with equivalence compression: tables with similar structural forms are grouped in an unsupervised manner, and their semantic structure is then inferred in order to extract knowledge. Experimental results show that, while maintaining the same level of clustering accuracy, the fast clustering algorithm based on equivalence compression greatly outperforms traditional methods in running time: clustering 5,000 tables takes 20 minutes instead of 72 hours. The accuracy of the knowledge triples extracted with table templates after clustering is also satisfactory.
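One way to read the idea of clustering by compressed structure: reduce each table to a structural signature and hash tables into buckets in near-linear time, so no pairwise comparison is needed. The particular signature below (per-row cell-type strings with consecutive duplicate rows collapsed) is an invented illustration of the flavor of such a compression, not the paper's algorithm:

```python
def cell_type(value):
    # Reduce each cell to a coarse type token.
    try:
        float(value)
        return "N"   # numeric
    except ValueError:
        return "S"   # string

def signature(table):
    """Compress a table's form: one type string per row, with runs of
    identical consecutive rows collapsed. Tables with equal signatures
    fall into one structural cluster."""
    sig = []
    for row in table:
        r = "".join(cell_type(c) for c in row)
        if not sig or sig[-1] != r:
            sig.append(r)
    return tuple(sig)

def cluster(tables):
    """Single pass over all tables: bucket by signature."""
    buckets = {}
    for t in tables:
        buckets.setdefault(signature(t), []).append(t)
    return buckets

t1 = [["name", "age"], ["ann", "34"], ["bob", "51"]]
t2 = [["city", "pop"], ["oslo", "0.7"]]
print(signature(t1))          # ('SS', 'SN')
print(len(cluster([t1, t2]))) # 1: same header-then-data shape
```

Because bucketing is a dictionary insertion per table, the cost grows linearly with the number of tables, which is consistent with the large speed-up the abstract reports over pairwise clustering.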

10.
An Autonomous Extraction Method for Web Information (cited 12 times: 0 self-citations, 12 by others)
许建潮, 侯锟. 《计算机工程与应用》 (Computer Engineering and Applications), 2005, 41(14): 185-189, 198
This paper proposes an autonomous method for extracting information from Web pages based on table structures and list structures. Information can be extracted autonomously from relevant pages according to the user's needs and reorganized under a relational model for storage in a database. For table-structured information sources, annotating a single page is enough to acquire the extraction knowledge, and self-learning allows the method to adapt well to dynamic changes in page content, achieving automatic extraction. For list-structured information sources, the path of an information block within the DOM hierarchy is obtained dynamically by analyzing the DOM tree structure, and the values of information objects are then obtained from basic extraction knowledge about those objects; self-learning is again used to adapt to dynamic changes in page content.

11.
The purpose of establishing a temporary relation is to make the record pointer of a child table move together with the record pointer of its parent table, so that data in several tables can be browsed at the same time. This paper first briefly introduces the basic concepts of table relations, data sessions, and temporary relations, and then uses examples to explain the steps and methods for establishing relations between tables and carrying out queries with a data session.

12.
A large number of web pages contain data structured in the form of "lists". Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be a challenging task. The lists are manually generated and hence need not have well-defined templates: they have inconsistent delimiters (if any) and often have missing information. We propose a novel technique for extracting tables from lists. The technique is domain independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments. In particular, we exploit a corpus of HTML tables, also extracted from the web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table's quality. We conducted an extensive experimental study using both real web lists and lists derived from tables on the web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique to a large sample of about 100,000 lists crawled from the web. The analysis of the extracted tables has led us to believe that there are likely to be tens of millions of useful and query-able relational tables extractable from lists on the web.
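A toy version of the cross-line alignment idea: try several candidate delimiters and keep the split whose column counts agree best across lines. The delimiter set and scoring rule are invented for illustration; the actual technique additionally consults a corpus of HTML tables to judge field likelihood:

```python
import re

def split_lines(lines, delimiters=(",", ";", "|", "  ")):
    """Pick the delimiter whose splits are most consistent across lines.
    Score = fraction of lines matching the most common column count,
    weighted by that column count, so trivial 1-column splits lose."""
    best, best_score = None, -1.0
    for d in delimiters:
        splits = [[f.strip() for f in re.split(re.escape(d) + r"+", ln)
                   if f.strip()]
                  for ln in lines]
        counts = [len(s) for s in splits]
        mode = max(set(counts), key=counts.count)
        score = counts.count(mode) / len(lines) * mode
        if mode > 1 and score > best_score:
            best, best_score = splits, score
    return best

lines = ["Alice, 34, Oslo", "Bob, 51, Bergen", "Carol, 29, Tromso"]
table = split_lines(lines)
print(table[0])  # ['Alice', '34', 'Oslo']
```

Lines whose column count deviates from the mode are exactly the "incorrect splits and bad alignments" the second pass of the real technique goes back to repair.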

13.
The well-known marching cubes method is used to generate isosurfaces from volume data or data on a 3D rectilinear grid. To do so, it refers to a lookup table to decide on the possible configurations of the isosurface within a given cube, assuming we know whether each vertex lies inside or outside the surface. However, the vertex values alone do not uniquely determine how the isosurface may pass through the cube, and in particular how it cuts each face of the cube. Earlier lookup tables are deficient in various respects. The possible combinations of the different configurations of such ambiguous faces are used in this paper to find a complete and correct lookup table. Isosurfaces generated using the new lookup table are guaranteed to be watertight.
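The lookup-table machinery the abstract builds on can be sketched as follows; the vertex-to-bit numbering and the convention that "inside" means below the isovalue are assumptions, as conventions differ between implementations:

```python
def cube_index(vertex_values, isovalue):
    """Pack the inside/outside state of a cube's 8 vertices into the
    8-bit index used to address a marching-cubes lookup table
    (256 possible cases)."""
    index = 0
    for bit, v in enumerate(vertex_values):
        if v < isovalue:   # convention: 'inside' vertices set their bit
            index |= 1 << bit
    return index

# One vertex inside: one of the simplest of the 256 cases.
values = [0.2, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
print(cube_index(values, 0.5))  # 1

# Two diagonal vertices of one face inside, the other two outside:
# the vertex states alone cannot say how the isosurface cuts that
# face. This is the face ambiguity the paper's table resolves.
print(cube_index([0.2, 0.9, 0.2, 0.9] + [0.9] * 4, 0.5))  # 5
```

The paper's contribution is the table addressed by this index: by enumerating the combinations of ambiguous-face configurations, every one of the 256 cases gets a triangulation consistent with its neighbors, which is what makes the output watertight.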

14.
Industrial tabular information extraction and its semantic fusion with text (ITIESF) is of great significance in converting and fusing industrial unstructured data into structured knowledge to guide cognitive intelligence analysis in the manufacturing industry. A novel end-to-end ITIESF approach is proposed to integrate tabular information and construct a tabular-information-oriented causality event evolutionary knowledge graph (TCEEKG). Specifically, an end-to-end joint learning strategy is presented to mine the semantic information in tables. A definition and modeling method for the intrinsic relationships between tables and their rows and columns in engineering documents is provided to model the tabular information. On this basis, an end-to-end joint entity-relationship extraction method for textual and tabular information from engineering documents is proposed to construct text-based knowledge graphs (KG) and tabular-information-based causality event evolutionary graphs (CEEG). Then, a novel entity alignment based on a neighborhood-sample graph convolution network (NSGCN) is proposed to fuse the two graphs into a unified knowledge base. Furthermore, a translation-based, graph-structure-driven question-answering (Q&A) approach is designed to support cause analysis and problem tracing. Our models can be easily integrated into a prototype system to provide joint information processing and cognitive analysis. Finally, the approach is evaluated on aerospace machining documents, illustrating that the TCEEKG can considerably help workers strengthen their skills in the cause-and-effect analysis of machining quality issues from a global perspective.

15.
This paper points out an error in the decomposition method for extracting decision rules from inconsistent decision tables given in reference [2], and analyzes the cause of the error. On this basis, the decomposition method is examined further, and several relationships and properties concerning core attributes and attribute reducts between the original inconsistent decision table and the sub-decision-tables obtained after decomposition are given.

16.
A table is a well-organized and summarized knowledge expression for a domain. Therefore, it is of great importance to extract information from tables. However, many tables in Web pages are used not to transfer information but to decorate pages. One of the most critical tasks in Web table mining is thus to discriminate meaningful tables from decorative ones. The main obstacle of this task comes from the difficulty of generating relevant features for discrimination. This paper proposes a novel discrimination method using a composite kernel which combines parse tree kernels and a linear kernel. Because a Web table is represented as a parse tree by an HTML parser, it is natural to represent the structural information of a table as a parse tree. In this paper, two types of parse trees are used to represent structural information within and around a table. These two trees define the structure kernel that handles the structural information of tables. The contents of a Web table are manipulated by a linear kernel with content features. Support vector machines with the composite kernel distinguish meaningful tables from decorative ones with high accuracy. A series of experiments show that the proposed method achieves state-of-the-art performance.
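The kernel combination can be sketched as follows, with a crude shared-subtree count standing in for a real parse tree kernel; in the paper the combined kernel feeds a support vector machine rather than being evaluated directly, and the weighting scheme here is an assumption:

```python
def subtrees(t):
    """Enumerate all subtrees of a parse tree given as nested tuples,
    e.g. ('table', ('tr', 'th'), ('tr', 'td'))."""
    out = [t]
    if isinstance(t, tuple):
        for child in t[1:]:
            out.extend(subtrees(child))
    return out

def tree_kernel(t1, t2):
    # Count t1's subtrees that also occur in t2: a crude stand-in
    # for a proper parse tree kernel.
    s2 = subtrees(t2)
    return sum(1 for s in subtrees(t1) if s in s2)

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def composite_kernel(a, b, alpha=0.5):
    """Weighted sum of the structure and content kernels. A sum of
    valid kernels with non-negative weights is again a valid kernel,
    so an SVM can use it directly."""
    (tree_a, feats_a), (tree_b, feats_b) = a, b
    return (alpha * tree_kernel(tree_a, tree_b)
            + (1 - alpha) * linear_kernel(feats_a, feats_b))

# Toy table parse trees plus invented content-feature vectors.
a = (("table", ("tr", "th"), ("tr", "td")), [1.0, 0.0])
b = (("table", ("tr", "th"), ("tr", "td")), [1.0, 1.0])
print(composite_kernel(a, b))  # 0.5 * 5 shared subtrees + 0.5 * 1 = 3.0
```

Keeping structure and content in separate kernels is what lets each information source be weighted independently, which is the point of the composite design.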

17.
Decision tables are widely used in many knowledge-based and decision support systems. They allow relatively complex logical relationships to be represented in an easily understood form and processed efficiently. This paper describes second-order decision tables (decision tables that contain rows whose components have sets of atomic values) and their role in knowledge engineering to: (1) support efficient management and enhance comprehensibility of tabular knowledge acquired by knowledge engineers, and (2) automatically generate knowledge from a tabular set of examples. We show how second-order decision tables can be used to restructure acquired tabular knowledge into a condensed but logically equivalent second-order table. We then present the results of experiments with such restructuring. Next, we describe SORCER, a learning system that induces second-order decision tables from a given database. We compare SORCER with IDTM, a system that induces standard decision tables, and a state-of-the-art decision tree learner, C4.5. Results show that in spite of its simple induction methods, on the average over the data sets studied, SORCER has the lowest error rate.
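The condensation of a first-order table into a second-order one can be sketched as follows. Note that naively merging rows by decision forms a cross-product of value sets, which happens to coincide with the original rows in this toy table; a faithful restructuring must verify that logical equivalence in general, so this is an illustration of the data structure, not the paper's method:

```python
def condense(rows):
    """Merge rows sharing a decision into one second-order row whose
    condition components are sets of atomic values."""
    merged = {}
    for *conditions, decision in rows:
        if decision not in merged:
            merged[decision] = [set() for _ in conditions]
        for slot, value in zip(merged[decision], conditions):
            slot.add(value)
    return [(conds, d) for d, conds in merged.items()]

# (outlook, windy) -> play; illustrative values, not a real data set.
first_order = [
    ("sunny",    "no",  "yes"),
    ("overcast", "no",  "yes"),
    ("rain",     "yes", "no"),
]
for conds, decision in condense(first_order):
    print(conds, "->", decision)
# [{'sunny', 'overcast'}, {'no'}] -> yes
# [{'rain'}, {'yes'}] -> no
```

Three first-order rows become two second-order rows; on real tables this condensation is what improves comprehensibility without changing the encoded logic.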

18.
The paper discusses issues of rule-based data transformation from arbitrary spreadsheet tables to a canonical (relational) form. We present a novel table object model and rule-based language for table analysis and interpretation. The model is intended to represent a physical (cellular) and logical (semantic) structure of an arbitrary table in the transformation process. The language allows drawing up this process as consecutive steps of table understanding, i.e. recovering implicit semantics. Both are implemented in our tool for spreadsheet data canonicalization. The presented case study demonstrates the use of the tool for developing a task-specific rule-set to convert data from arbitrary tables of the same genre (government statistical websites) to flat file databases. The performance evaluation confirms the applicability of the implemented rule-set in accomplishing the stated objectives of the application.

19.
In this paper we present the design, implementation and evaluation of SOBA, a system for ontology-based information extraction from heterogeneous data resources, including plain text, tables and image captions. SOBA is capable of processing structured information, text and image captions to extract information and integrate it into a coherent knowledge base. To establish coherence, SOBA interlinks the information extracted from different sources and detects duplicate information. The knowledge base produced by SOBA can then be used to query for information contained in the different sources in an integrated and seamless manner. Overall, this allows for advanced retrieval functionality by which questions can be answered precisely. A further distinguishing feature of the SOBA system is that it straightforwardly integrates deep and shallow natural language processing to increase robustness and accuracy. We discuss the implementation and application of the SOBA system within the SmartWeb multimodal dialog system. In addition, we present a thorough evaluation of the different components of the system. However, an end-to-end evaluation of the whole SmartWeb system is out of the scope of this paper and has been presented elsewhere by the SmartWeb consortium.

20.
An information table or a training/designing sample set is all that can be obtained to infer the underlying generation mechanism (distribution) of tuples or samples. However, how an information table is made available in representation, in treatment, and in interpretation can still be discussed. In this paper, these matters are discussed on the basis of "granularity." First, an explanation is given of why different goals and treatments of information tables exist in different research fields; at this stage, it is emphasized that the "granularity concept" plays an important role. Next, a framework of information tables is reformulated in terms of attribute sets and tuple sets, where a "Galois connection" helps in understanding their relationship. Then, the use of "closed subsets" is proposed instead of given tuples, for efficiency and for interpretability. With a special type of closed subset, the traditional logical DNF expression framework can be naturally extended to multivalued and continuous values. Last, several concepts of rough sets are reformulated using a "variable granularity" connected to closed subsets. This paper determines how, and in what respects, granularity can give flexibility in dealing with several problems. Through the concepts defined in this paper, some intuitions toward the development of data exploration and data mining are given.
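The Galois connection between tuple sets and attribute sets, and the closed subsets built from it, can be sketched on a toy information table; the data and attribute tokens are invented for illustration:

```python
# Toy information table: tuple id -> set of attribute-value tokens.
table = {
    "t1": {"color=red",  "size=big"},
    "t2": {"color=red",  "size=small"},
    "t3": {"color=blue", "size=big"},
}

def common_attrs(tuples):
    """One side of the Galois connection: attributes shared by all
    the given tuples."""
    sets = [table[t] for t in tuples]
    return set.intersection(*sets) if sets else set()

def tuples_with(attrs):
    """The other side: all tuples possessing every given attribute."""
    return {t for t, a in table.items() if attrs <= a}

def closure(tuples):
    """Round trip through the connection. Its fixed points are the
    closed subsets proposed as units of representation."""
    return tuples_with(common_attrs(tuples))

print(sorted(closure({"t1"})))        # ['t1']: already closed
print(sorted(closure({"t1", "t3"})))  # ['t1', 't3']: share size=big
print(sorted(closure({"t1", "t2"})))  # ['t1', 't2']: share color=red
```

Working with closed subsets rather than raw tuples is what buys the efficiency and interpretability the abstract mentions: each closed subset is the largest group of tuples describable by one conjunction of attribute values.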
