Found 19 similar documents (search time: 150 ms)
1.
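Web information extraction has given rise to large-scale applications. Wrapper-based Web information extraction has two research areas: wrapper generation and wrapper maintenance. This paper proposes a new automatic wrapper maintenance algorithm. It is based on the following observation: although pages change in many different ways, many important page features are preserved in the new pages, such as text patterns, annotation information, and hyperlinks. The new algorithm makes full use of these preserved features to locate the target information in the changed pages, and it can automatically repair failed wrappers. Experiments on information extraction from real Web sites show that the new algorithm maintains wrappers effectively, so that information is extracted more accurately.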
2.
3.
4.
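The core of developing a Web information extraction system is constructing a wrapper for each Web information source, and the key to constructing a wrapper is the rule learner. Because traditional rule learners are generally based on a single learning strategy, this paper combines the advantages of inductive learning and analytical learning and proposes a rule learner based on explanation-based learning. Wrappers are generated with this learner at the core, and the approach has been applied in a working wrapper generation system.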
5.
6.
7.
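Design Techniques for a Metasearch Engine with Customizable Source Search Engines  Total citations: 1, self-citations: 0, citations by others: 1
This paper presents implementation techniques for a metasearch engine whose source search engines can be customized. It describes the overall architecture of the system and focuses on the format and implementation of the wrappers and extractors. Customizing source search engines through wrappers and extractors makes it easy to add, modify, and remove the source search engines integrated into the system.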
8.
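Web wrappers convert Web page content into XML for system integration. XSLT, the technology for transforming XML, supports the information extraction and organization performed by wrappers well. Starting from a wrapper description file (XML) that contains the query interface, the result schema, and the mapping rules, this paper gives a technical scheme for automatically generating executable code. Both wrapper execution and wrapper generation are based entirely on XSLT, so the system is highly portable. A "metadata alignment" method is proposed to assist content location, which improves tolerance to page changes. The implementation of a prototype system verifies the feasibility of these techniques.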
9.
10.
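Information extraction is a data mining technique widely used on the Internet. Its goal is to extract meaningful, valuable data and information from the massive volume of Internet data so that Internet resources can be used more effectively. This paper uses a method based on Web page feature statistics to extract the main text of Chinese Web pages. The method first represents the page as an XML-based DOM tree, then uses node statistics to filter noise nodes out of the tree, and finally selects the main-text nodes. Compared with traditional wrapper-based extraction, the method is simple and practical. Experimental results show that its extraction accuracy exceeds 90%, which makes it of practical value.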
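The filter-then-select idea in this abstract can be illustrated with a minimal sketch. The abstract does not say which node statistics the authors use, so the link-density threshold, the set of container tags, and the names `BlockStats` and `extract_main_text` below are all assumptions made for illustration; the sketch also assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "td", "article", "section"}  # assumed container tags
SKIP_TAGS = {"script", "style"}                   # never counted as text


class BlockStats(HTMLParser):
    """Attribute each text run to its nearest enclosing block element."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # every block seen, in document order
        self.open_blocks = []  # stack of blocks that are currently open
        self.link_depth = 0    # > 0 while inside an <a> element
        self.skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            block = {"text": [], "chars": 0, "link_chars": 0}
            self.blocks.append(block)
            self.open_blocks.append(block)
        elif tag == "a":
            self.link_depth += 1
        elif tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.open_blocks:
            self.open_blocks.pop()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth or not self.open_blocks:
            return
        block = self.open_blocks[-1]
        block["text"].append(text)
        block["chars"] += len(text)
        if self.link_depth:
            block["link_chars"] += len(text)


def extract_main_text(html, max_link_density=0.5):
    """Drop link-heavy blocks (assumed noise), then keep the block with the most plain text."""
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    candidates = [
        b for b in parser.blocks
        if b["chars"] and b["link_chars"] / b["chars"] <= max_link_density
    ]
    if not candidates:
        return ""
    best = max(candidates, key=lambda b: b["chars"] - b["link_chars"])
    return " ".join(best["text"])


page = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<div id="content"><p>Information extraction pulls useful data out of large collections of web pages.</p>
<p>This page is only an illustration.</p></div>
<div id="footer"><a href="/contact">Contact</a> Copyright notice</div>
</body></html>
"""
main_text = extract_main_text(page)
print(main_text)
```

On this toy page the navigation block is rejected because all of its text sits inside links, and the content block wins over the footer because it carries far more non-link text. A real page would likely need more statistics than this single ratio.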
11.
Hierarchical Wrapper Induction for Semistructured Information Sources  Total citations: 16, self-citations: 0, citations by others: 16
Ion Muslea, Steven Minton, Craig A. Knoblock 《Autonomous Agents and Multi-Agent Systems》2001,4(1-2):93-114
With the tremendous amount of information that becomes available on the Web on a daily basis, the ability to quickly develop information agents has become a crucial problem. A vital component of any Web-based information agent is a set of wrappers that can extract the relevant data from semistructured information sources. Our novel approach to wrapper induction is based on the idea of hierarchical information extraction, which turns the hard problem of extracting data from an arbitrarily complex document into a series of simpler extraction tasks. We introduce an inductive algorithm, STALKER, that generates high-accuracy extraction rules based on user-labeled training examples. Labeling the training data represents the major bottleneck in using wrapper induction techniques, and our experimental results show that STALKER requires up to two orders of magnitude fewer examples than other algorithms. Furthermore, STALKER can wrap information sources that could not be wrapped by existing inductive techniques.
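To make "a series of simpler extraction tasks" concrete, here is a minimal sketch of hierarchical extraction with hand-written landmark rules. It shows only how such rules might be applied level by level; the inductive step that learns rules from labeled examples is not shown. The helper names (`skip_to`, `extract`, `extract_all`), the rule format, and the sample document are all hypothetical.

```python
def skip_to(text, landmarks, start=0):
    """Consume the landmarks in order; return the index just past the last one, or -1."""
    pos = start
    for landmark in landmarks:
        idx = text.find(landmark, pos)
        if idx == -1:
            return -1
        pos = idx + len(landmark)
    return pos


def extract(text, start_rule, end_landmark):
    """Return the first span between the start rule and the end landmark, or None."""
    begin = skip_to(text, start_rule)
    if begin == -1:
        return None
    end = text.find(end_landmark, begin)
    return None if end == -1 else text[begin:end]


def extract_all(text, start_rule, end_landmark):
    """Apply the same rule repeatedly to collect every matching span."""
    items, pos = [], 0
    while True:
        begin = skip_to(text, start_rule, pos)
        if begin == -1:
            break
        end = text.find(end_landmark, begin)
        if end == -1:
            break
        items.append(text[begin:end])
        pos = end + len(end_landmark)
    return items


doc = (
    "<h1>Restaurants</h1><ul>"
    "<li><b>Luigi's</b> Phone: <i>555-0100</i></li>"
    "<li><b>Thai Hut</b> Phone: <i>555-0199</i></li>"
    "</ul>"
)

# Level 1: isolate the list. Level 2: split it into items. Level 3: pull fields per item.
listing = extract(doc, ["<ul>"], "</ul>")
records = [
    {
        "name": extract(item, ["<b>"], "</b>"),
        "phone": extract(item, ["Phone:", "<i>"], "</i>"),
    }
    for item in extract_all(listing, ["<li>"], "</li>")
]
print(records)
```

Each rule only has to be correct within the fragment handed to it by the level above, which is what keeps the individual tasks simple.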
12.
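A Web Data Semantic Annotation Method Based on Ensemble Learning and Two-Dimensional Correlated-Edge Conditional Random Fields  Total citations: 2, self-citations: 0, citations by others: 2
Large-scale Web information extraction requires accurate, automatic extraction of Web data objects from many related Web sites. Existing Web information extraction methods mainly process a single Web site and cannot meet the needs of large-scale extraction. Studies indicate that effective automatic semantic annotation of Web data, combined with existing wrapper generation techniques, can satisfy the requirements of large-scale Web information extraction. This paper proposes an automatic semantic annotation method for Web data based on ensemble learning and two-dimensional correlated-edge conditional random fields. First, multiple classifiers are built from already-extracted information and from features present in the training pages of the target Web site, and their results are merged with Dempster's rule of combination to distinguish attribute labels from data elements in the training pages. Then, a two-dimensional correlated-edge conditional random field models both the long-distance and the short-distance dependencies among Web data elements to annotate the data elements automatically. Experiments on real data sets from several domains show that the proposed method solves the automatic semantic annotation problem efficiently and meets the needs of large-scale Web information extraction.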
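The classifier-fusion step can be illustrated with a small sketch of Dempster's rule of combination. The rule itself is standard, but the mass values and the two-hypothesis frame ("label" versus "data") below are invented for illustration and do not come from the paper; how the authors turn classifier outputs into mass functions is not stated in the abstract.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (frozenset of hypotheses -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    norm = 1.0 - conflict
    return {hypothesis: mass / norm for hypothesis, mass in combined.items()}


LABEL = frozenset({"label"})   # the page element is an attribute label
DATA = frozenset({"data"})     # the page element is a data value
EITHER = LABEL | DATA          # the classifier is undecided

# Hypothetical outputs of two classifiers for the same page element.
classifier_a = {LABEL: 0.6, DATA: 0.1, EITHER: 0.3}
classifier_b = {LABEL: 0.5, DATA: 0.2, EITHER: 0.3}

fused = dempster_combine(classifier_a, classifier_b)
decision = max(fused, key=fused.get)
print(fused, decision)
```

With these numbers the conflicting mass is 0.17, so the agreeing evidence is renormalized by 0.83 and the fused belief in "label" rises to about 0.76, higher than either classifier reported on its own.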
13.
14.
Bettina Fazzinga, Sergio Flesca, Andrea Tagarelli 《Knowledge and Information Systems》2011,26(1):127-173
An effective solution to automate information extraction from Web pages is represented by wrappers. A wrapper associates a Web page with an XML document that represents part of the information in that page in a machine-readable format. Most existing wrapping approaches have traditionally focused on how to generate extraction rules, while they have ignored potential benefits deriving from the use of the schema of the information being extracted in the wrapper evaluation. In this paper, we investigate how the schema of extracted information can be effectively used in both the design and evaluation of a Web wrapper. We define a clean declarative semantics for schema-based wrappers by introducing the notion of (preferred) extraction model, which is essential to compute a valid XML document containing the information extracted from a Web page. We developed the SCRAP (SChema-based wRAPper for web data) system for the proposed schema-based wrapping approach, which also provides visual support tools to the wrapper designer. Moreover, we present a wrapper generalization framework to profitably speed up the design of schema-based wrappers. Experimental evaluation has shown that SCRAP wrappers are not only able to successfully extract the required data, but are also robust to changes that may occur in the source Web pages.
15.
Bernd Thomas 《Journal of Intelligent Information Systems》2000,14(2-3):241-261
We present a general framework for information extraction from web pages based on a special wrapper language, called token-templates. By using token-templates in conjunction with logic programs we are able to reason about web page contents, search and collect facts, and derive new facts from various web pages. We give a formal definition for the semantics of logic programs extended by token-templates and define a general answer-complete calculus for these extended programs. These methods and techniques are used to build intelligent mediators and web information systems.
16.
An XML-enabled data extraction toolkit for web sources  Total citations: 7, self-citations: 0, citations by others: 7
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats are not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, web users and applications need a smart way of extracting data from these web sources. One of the popular approaches is to write wrappers around the sources, either manually or with software assistance, to bring the web data within the reach of more sophisticated query tools and general mediator-based information integration systems. In this paper, we describe the methodology and the software development of an XML-enabled wrapper construction system, XWRAP, for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates tasks of building wrappers that are specific to a web source from the tasks that are repetitive for any source, and uses a component library to provide basic building blocks for wrapper programs. Second, it provides inductive learning algorithms that derive or discover wrapper patterns by reasoning about sample pages or sample specifications. Third and most importantly, we introduce and develop a two-phase code generation framework. The first phase utilizes an interactive interface facility to encode the source-specific metadata knowledge identified by individual wrapper developers as declarative information extraction rules. The second phase combines the information extraction rules generated at the first phase with the XWRAP component library to construct an executable wrapper program for the given web source.
17.
Semantic integration of heterogeneous information sources  Total citations: 15, self-citations: 0, citations by others: 15
Sonia Bergamaschi, Silvana Castano, Maurizio Vincini, Domenico Beneventano 《Data & Knowledge Engineering》2001,36(3):215-249
18.
19.
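To address the shortcomings of existing methods for extracting information from online forums, this paper proposes a suffix-tree-based forum information extraction method. The normalized HTML document is converted into a suffix tree, the repeated patterns in it are found and used to generate a wrapper, and the wrapper is converted into an NFA (nondeterministic finite automaton) to extract the forum information. By using suffix-tree construction to extract forum information, the method largely overcomes the poor accuracy and limited generality of existing extraction methods. Experimental results show that the method is accurate and practical.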