Similar literature
1.
A table is a well-organized and summarized knowledge expression for a domain. Therefore, it is of great importance to extract information from tables. However, many tables in Web pages are used not to transfer information but to decorate pages. One of the most critical tasks in Web table mining is thus to discriminate meaningful tables from decorative ones. The main obstacle of this task comes from the difficulty of generating relevant features for discrimination. This paper proposes a novel discrimination method using a composite kernel which combines parse tree kernels and a linear kernel. Because a Web table is represented as a parse tree by an HTML parser, it is natural to represent the structural information of a table as a parse tree. In this paper, two types of parse trees are used to represent structural information within and around a table. These two trees define the structure kernel that handles the structural information of tables. The contents of a Web table are manipulated by a linear kernel with content features. Support vector machines with the composite kernel distinguish meaningful tables from decorative ones with high accuracy. A series of experiments show that the proposed method achieves state-of-the-art performance.

2.
A large number of web pages contain data structured in the form of “lists”. Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be a challenging task. The lists are manually generated and hence need not have well-defined templates: they have inconsistent delimiters (if any) and often have missing information. We propose a novel technique for extracting tables from lists. The technique is domain independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments. In particular, we exploit a corpus of HTML tables, also extracted from the web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table's quality. We conducted an extensive experimental study using both real web lists and lists derived from tables on the web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique on a large sample of about 100,000 lists crawled from the web. The analysis of the extracted tables has led us to believe that there are likely to be tens of millions of useful and queryable relational tables extractable from lists on the web.

3.
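To improve the accuracy of Web data table recognition, a recognition method based on support vector machines and a hybrid kernel function is proposed. Structural features, content features, and row (column) similarity features of tables are defined, and a polynomial kernel and a linear kernel are combined into a hybrid kernel, which is then used to recognize Web data tables automatically. Experimental results show that, across 7 sites, the method achieves an average precision of 95.14% and an average recall of 95.69%.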
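
The hybrid kernel in abstract 3 can be pictured as a weighted sum of a polynomial kernel and a linear kernel. The following is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say how the two kernels are combined, so the convex weight `lam`, the polynomial `degree` and `coef0`, and all function names here are hypothetical.

```python
from typing import Sequence


def linear_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain dot product <x, y>."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree: int = 2, coef0: float = 1.0) -> float:
    """(<x, y> + coef0) ** degree; degree and coef0 are illustrative defaults."""
    return (linear_kernel(x, y) + coef0) ** degree


def hybrid_kernel(x, y, lam: float = 0.5, degree: int = 2, coef0: float = 1.0) -> float:
    """Convex combination of the two kernels (lam is an assumed mixing weight).

    A non-negative weighted sum of valid kernels is itself a valid kernel.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (lam * polynomial_kernel(x, y, degree, coef0)
            + (1.0 - lam) * linear_kernel(x, y))


def gram_matrix(rows, kernel=hybrid_kernel):
    """Pairwise kernel values, the form an SVM with a precomputed kernel expects."""
    return [[kernel(a, b) for b in rows] for a in rows]
```

The Gram matrix returned by `gram_matrix` could be passed to any SVM trainer that accepts a precomputed kernel; how the weight would be tuned is left open here.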

4.
The TABLE tags in HTML (Hypertext Markup Language) documents are widely used for formatting the layout of Web documents as well as for describing genuine tables with relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute–value relations extraction. During preprocessing, some genuine and non-genuine tables are filtered out using a set of rules, which are devised based on careful examination of general characteristics of various HTML tables. The remaining tables are detected at the attribute–value relations extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between an attribute area and a value area of a table. Experimental results with 11,477 TABLE tags from 1393 HTML documents show that the method performs better than previous work, with a precision of 97.54% and a recall of 99.22%.

5.
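Design and Implementation of a Web Table Information Extraction Model   Total citations: 1; self-citations: 0; citations by others: 1
As a concise and effective way of expressing data, Web tables are widely used in Web pages. This paper proposes a Web table information extraction model based on table structure. The model consists of three modules: a table locating module, a table structure preprocessing module, and a table information extraction and reconstruction module. It extracts table information according to the structural markup of Web tables and user-defined heuristic rules. Experimental results show that the model works well for extracting information from Web tables.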

6.
Tables are a ubiquitous form of communication. While everyone seems to know what a table is, a precise, analytical definition of “tabularity” remains elusive because some bureaucratic forms, multicolumn text layouts, and schematic drawings share many characteristics of tables. There are significant differences between typeset tables, electronic files designed for display of tables, and tables in symbolic form intended for information retrieval. Most past research has addressed the extraction of low-level geometric information from raster images of tables scanned from printed documents, although there is growing interest in the processing of tables in electronic form as well. Recent research on table composition and table analysis has improved our understanding of the distinction between the logical and physical structures of tables, and has led to improved formalisms for modeling tables. This review, which is structured in terms of generalized paradigms for table processing, indicates that progress on half-a-dozen specific research issues would open the door to using existing paper and electronic tables for database update, tabular browsing, structured information retrieval through graphical and audio interfaces, multimedia table editing, and platform-independent display.

7.
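Knowledge extraction from Web tables is an important way to acquire high-quality knowledge, with broad research significance and application value in areas such as knowledge graphs and Web page mining. Traditional Web table knowledge extraction methods rely mainly on well-formed table structure and sufficient prior knowledge, and they struggle when table structure is complex or prior knowledge is insufficient. To address these problems, this paper makes full use of the structural characteristics of the tables themselves and proposes a Web table knowledge extraction method, based on fast clustering with equivalence compression, that scales to large data sets. Unsupervised clustering groups tables with similar formal structure, from which their semantic structure is inferred and knowledge is extracted. Experimental results show that the fast clustering algorithm based on equivalence compression, while maintaining the same level of clustering accuracy, is far faster than traditional methods: the clustering time for 5,000 tables drops from 72 hours to 20 minutes. The knowledge triples extracted with table templates after clustering also reach satisfactory accuracy.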

8.
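Quantitative Analysis of the Differences Between the Algebraic View and the Information View of Rough Set Theory   Total citations: 5; self-citations: 1; citations by others: 5
Decision tables are the objects processed by Rough set theory, and computing their core attributes is often the starting point and key step of information reduction. The algebraic view and the information view are the two main theoretical perspectives and methods in Rough set research. Focusing on the computation of core attributes of decision tables, this paper examines the relationship between the algebraic and information views of Rough set theory. Through simulation experiments, it obtains the statistically quantified differences between the two views on the core-attribute problem, and finds that the difference reaches its extreme in decision table systems containing a large amount of inconsistent information.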
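
The gap between the two views in abstract 8 can be illustrated on a tiny inconsistent decision table. This is a minimal sketch assuming one common formulation, which the abstract itself does not spell out: under the algebraic view an attribute is taken to be core if dropping it changes the positive region, and under the information view if dropping it changes the conditional entropy of the decision. The table `rows` and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def partition(rows, attrs):
    """Group row indices by their values on attrs (indiscernibility classes)."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def positive_region(rows, cond, dec):
    """Rows whose condition class maps to a single decision value."""
    pos = set()
    for block in partition(rows, cond):
        if len({rows[i][dec] for i in block}) == 1:
            pos.update(block)
    return pos


def conditional_entropy(rows, cond, dec):
    """H(dec | cond) in bits."""
    n = len(rows)
    h = 0.0
    for block in partition(rows, cond):
        counts = Counter(rows[i][dec] for i in block)
        for c in counts.values():
            h -= (c / n) * log2(c / len(block))
    return h


def algebraic_core(rows, cond, dec):
    """Attributes whose removal changes the positive region."""
    full = positive_region(rows, cond, dec)
    return {a for a in cond
            if positive_region(rows, [b for b in cond if b != a], dec) != full}


def information_core(rows, cond, dec, tol=1e-12):
    """Attributes whose removal changes H(dec | cond)."""
    full = conditional_entropy(rows, cond, dec)
    return {a for a in cond
            if abs(conditional_entropy(rows, [b for b in cond if b != a], dec) - full) > tol}


# Hypothetical table: rows 0-1 and rows 2-4 are inconsistent classes.
rows = [
    {"a": 0, "b": 0, "d": 0}, {"a": 0, "b": 0, "d": 1},
    {"a": 1, "b": 0, "d": 0}, {"a": 1, "b": 0, "d": 0}, {"a": 1, "b": 0, "d": 1},
    {"a": 0, "b": 1, "d": 1}, {"a": 1, "b": 1, "d": 1},
]
```

On this table, dropping `a` merges two inconsistent classes with different decision distributions: the positive region is unchanged, but the conditional entropy is not, so `a` belongs to the information-view core only, while `b` belongs to both.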

9.
Much of the world’s quantitative data reside in scattered web tables. For a meaningful role in Big Data analytics, the facts reported in these tables must be brought into a uniform framework. Based on a formalization of header-indexed tables, we proffer an algorithmic solution to end-to-end table processing for a large class of human-readable tables. The proposed algorithms transform header-indexed tables to a category table format that maps easily to a variety of industry-standard data stores for query processing. The algorithms segment table regions based on the unique indexing of the data region by header paths, classify table cells, and factor header category structures of two-dimensional as well as the less common multidimensional tables. Experimental evaluations substantiate the algorithmic approach to processing heterogeneous tables. As demonstrable results, the algorithms generate queryable relational database tables and semantic-web triple stores. Application of our algorithms to 400 web tables randomly selected from diverse sources shows that the algorithmic solution automates end-to-end table processing.

10.
Previous research is equivocal regarding the most effective methods of presenting quantitative information displays. The differences in results may be due to numerous reasons including the display and inquiry type. This study examines several methods of displaying quantitative information (e.g., line graphs, line grables, bar charts, bar grables, tables, pie charts and pie grables) that were factorially crossed with different kinds of data extraction inquiries (i.e., questions about exact numerical quantities, comparisons, and trends). Grables are displays that combine features of graphs and tables, including specific numerical information with each graphically presented category. Results showed that tables, bar grables and line grables produced the fewest errors, and line graphs and bar charts produced the fastest responses across question types. Error rates combining accuracy and time (i.e., errors/s) were lowest for the three grables and the table. Results are discussed with respect to prior theoretical work and the potential benefits of hybrid forms of quantitative displays for multiple kinds of data extraction inquiries.

Relevance to industry

Choosing the best method of displaying information is important for effective decision making. This study evaluates seven types of graphical displays to answer three types of inquiries. Results indicate that, in general, the most efficient data extraction (fewest errors per unit time) was produced using grable or table displays across question types. The appropriate display fosters better communication of information.


11.
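Tables appear widely in documents such as scientific literature, financial statements, newspapers, and magazines, where they store and present data compactly and carry a great deal of useful information. Table recognition is the foundation for reusing table information; it has significant practical value and has long been a research focus in pattern recognition. With the development of deep learning, new studies and methods for table recognition have emerged rapidly. However, because tables are used in many scenarios, come in many styles, and appear in images of uneven quality, many problems in table recognition remain to be solved. To summarize previous work and support future research, this paper surveys the history and latest progress of the field in China and abroad, organized around three subtasks of table recognition (table region detection, structure recognition, and content recognition) and covering both traditional and deep learning methods. It reviews the datasets and evaluation criteria for table recognition and, based on mainstream datasets and criteria, compares the performance of representative methods for table region detection, structure recognition, and table information extraction. It then compares research progress and level in China with those abroad. Finally, in light of the main difficulties and challenges currently facing the field, it discusses future research trends and technical goals.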

12.
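A Heuristic Algorithm for Attribute Reduction Based on Rough Set Theory   Total citations: 9; self-citations: 1; citations by others: 9
Attribute reduction is one of the key problems in knowledge discovery. To obtain the minimal relative reduct of the attributes in a decision table effectively, a new operator is constructed on the basis of Rough set theory. It uses attribute significance, defined from an information-theoretic perspective, as heuristic information to describe how the knowledge provided by the condition attributes affects the decision attributes. Using a breadth-first search strategy, a new heuristic attribute reduction algorithm is proposed. Starting from the original condition attribute set and applying the operator, the algorithm approaches the attribute core by successive reduction and obtains the minimal relative reduct. A worked example shows that the algorithm reduces decision table attributes effectively.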

13.
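With the rapid growth of the Internet, large amounts of text are shared online, and extracting reliable relations between persons from this mass of network information has become an important research topic in information extraction. To study person relation recognition for Chinese in more depth, a block-based person relation recognition system using multiple features is proposed. A fairly complete feature pool is designed, including bag-of-words features, relevance frequency features, dependency tree (DT) features, and named entity recognition (NER) features. For each relation, the best-performing feature set is selected from the pool, and several supervised machine learning classification algorithms are tested. On the data sets of the two tasks held by the 2015 China Conference on Machine Learning competition (CCML Competition), where Task1 determines the relation between given persons from a single news headline and Task2 determines it from multiple news headlines, the system achieves MacroF1 scores of 75.68% and 76.58% respectively, ranking first among the entries in both.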

14.
Two key features in the Icon programming language are tables and sets. An Icon program may use one large set or table, or thousands of small ones. To improve space and time performance for these diverse uses, their hashed data structures were reimplemented to dynamically resize during execution, reducing the minimum space requirement and achieving constant-time access to any element for virtually any size set or table. The implementation is adapted from Per-Åke Larson's dynamic hashing technique by using well-known base-2 arithmetic techniques to decrease the space required for small tables without degrading the performance of large tables. Also presented are techniques to prevent dynamic hashing from interfering with other Icon language features. Performance measurements are included to support the results.
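
The idea of a hash table that resizes gradually during execution, as in abstract 14, can be sketched with a toy linear-hashing table that splits one bucket at a time. This is an illustration under stated assumptions, not the Icon implementation: the class name, the starting bucket count, and the load threshold are hypothetical, and the scheme shown is a generic linear-hashing variant rather than the exact technique the abstract adapts.

```python
class LinearHashTable:
    """Toy linear-hashing table: grows one bucket at a time, no full rehash."""

    def __init__(self, initial_buckets=2, max_load=2.0):
        self.n0 = initial_buckets      # bucket count at level 0 (assumed default)
        self.level = 0                 # number of completed doubling rounds
        self.split = 0                 # next bucket to split in this round
        self.max_load = max_load       # average items per bucket that triggers a split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.size = 0

    def _index(self, key):
        h = hash(key)
        span = self.n0 << self.level   # n0 * 2**level, via a base-2 shift
        i = h % span
        if i < self.split:             # this bucket was already split this round
            i = h % (span << 1)
        return i

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for j, (k, _) in enumerate(bucket):
            if k == key:               # overwrite an existing key in place
                bucket[j] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > self.max_load * len(self.buckets):
            self._split_one()

    def _split_one(self):
        span = self.n0 << self.level
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])        # new bucket lands at index split + span
        self.split += 1
        for k, v in old:               # each item stays put or moves to the new bucket
            self.buckets[hash(k) % (span << 1)].append((k, v))
        if self.split == span:         # round finished: the table has doubled
            self.level += 1
            self.split = 0
```

Because each insertion splits at most one bucket, the cost of growth is spread across operations, and a small table starts with only `initial_buckets` buckets.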

15.
Computer Networks, 2007, 51(6): 1444-1458
Soft-state is a well established approach to designing robust network protocols and applications. However, it is unclear how to apply the soft-state approach to protocols that must maintain a large amount of state information in a scalable way. For example, the Border Gateway Protocol (BGP) is used to maintain the global routing tables at core Internet routers, and the table size is typically above 180,000 entries and continues to grow over time. In this paper, we propose a novel approach, Persistent Detection and Recovery (PDR), to enable large-state protocols and applications to maintain state consistency using a soft-state approach. PDR uses state compression and receiver participation mechanisms to avoid per-state refresh overhead. We evaluate PDR’s effectiveness and scalability by applying its mechanisms to maintain the consistency of BGP routing tables between routers. Our results show that the proposed PDR mechanisms are effective and efficient in detecting and correcting route insertion, modification, and removal errors. Moreover, they eliminate the need for routers to exchange full routing tables after a session reset, thus enabling routers to recover quickly from transient session failures.

16.
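By studying the characteristics of aircraft Quick Access Recorder (QAR) data and of rough set theory, and drawing on knowledge of information decision tables, abnormal data in QAR data are detected and mined to assist aircraft fault detection and troubleshooting. The main work is as follows: rough set theory is applied to discretize the QAR data, a decision table is built from the discretized data, and attribute reduction and rule extraction are then performed on the decision table. Given the particular nature of QAR data, improved algorithms for data discretization and decision table attribute reduction are presented. Finally, comparison with data from project experiments and from experts demonstrates the feasibility and effectiveness of the approach, which improves the efficiency of aircraft troubleshooting and is of practical significance.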

17.
A key element of bioinformatics research is the extraction of meaningful information from large experimental data sets. Various approaches, including statistical and graph theoretical methods, data mining, and computational pattern recognition, have been applied to this task with varying degrees of success. Using a novel classifier based on the Bayes discriminant function, we present a hybrid algorithm that employs feature selection and extraction to isolate salient features from large medical and other biological data sets. We have previously shown that a genetic algorithm coupled with a k-nearest-neighbors classifier performs well in extracting information about protein-water binding from X-ray crystallographic protein structure data. The effectiveness of the hybrid EC-Bayes classifier is demonstrated to distinguish the features of this data set that are the most statistically relevant and to weight these features appropriately to aid in the prediction of solvation sites.

18.
The next evolutionary step in wireless Internet information management is to provide support for tasks, which may be collaborative and may include multiple target devices, from desktop to handheld. This means that the information architecture supports the processes of the task, recognizes group interaction, and lets users migrate seamlessly among internet-compatible devices without losing the thread of the session. If users are free to migrate amongst devices during the course of a session then intelligent transformation of data is required to exploit the screen size and input characteristics of the target appliance with minimal loss of task effectiveness. In this paper we first review general characteristics related to the performance of users on small screens and then examine the navigation of full tables on small screens for users in multi-device scenarios. We examine the methodologies available for access to full tables in environments where the full table cannot be viewed in its entirety. In particular, we examine the situation where users are collaborating across platforms and referring to the same table of data. We ask three basic questions: Does screen size affect the performance of table lookup tasks? Does a search function improve performance of table lookup based tasks on reduced screen sizes? Does including context information improve the performance of table lookup based tasks on reduced screen sizes? The answers to these questions are important as individual and intuitive responses are used by the designers of small screen interfaces for use with large tables of data. We report on the results of a user study that examines factors that may affect the use of large tables on small display devices. The use of large tables on small devices in their native state becomes important in at least two circumstances. First, when collaboration involves two or more users sharing a view of data when the individual screen sizes are different. Second, when the exact table structure replication may be critical as a user moves quickly from a larger to a smaller screen or back again mid-task. Performance is measured by both effectiveness (correctness of result) and efficiency (effort to reach a result).
