
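DOM-Based Web Information Extraction (基于DOM的Web信息提取)

Cite this article:	LI Xiao Dong, GU Yu-Qing. DOM-based Web information extraction[J]. Chinese Journal of Computers (计算机学报), 2002, 25(5): 526-533

Authors:	李效东 (LI Xiao Dong), 顾毓清 (GU Yu-Qing)

Affiliation:	Institute of Software, Chinese Academy of Sciences, Beijing 100080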

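Abstract:	The Web has become one of the main channels through which people obtain information. However, HTML, the language used to express the information on Web pages, has inherent shortcomings. HTML tags only tell browser software how to display the information they delimit and carry no semantics. As a result, a Web page written in HTML, once parsed by a browser, is suitable only for human reading and not for machine processing as a data-exchange format. Building on the Document Object Model (DOM), this paper uses the path of the target information within the DOM hierarchy as the "coordinates" for extraction. On this principle it designs an inductive learning algorithm that semi-automatically generates extraction rules, and then generates Java classes from those rules. The generated Java classes can serve as important building blocks of wrappers for Web data sources.
Keywords:	DOM; Web; information extraction; inductive learning; document object model; path expression; XML; Internet
Revised manuscript received:	2001-02-12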
DOM-based Information Extraction for the Web Sources

LI Xiao Dong, GU Yu-Qing. DOM-based Information Extraction for the Web Sources[J]. Chinese Journal of Computers, 2002, 25(5): 526-533

Authors:	LI Xiao Dong, GU Yu-Qing

Abstract:	The Web has become a major channel through which people obtain information. However, the HTML language used to represent the information on Web pages has inherent drawbacks. HTML tags only tell the browser how to display information on the screen and carry no semantics. An HTML document is therefore poorly suited to serve as a data-exchange format for machine processing. Based on DOM and inductive learning, the paper presents a novel approach to semi-automatically generating Java classes that can form the main part of a wrapper for Web sources. This work is an important part of research on integrated query processing over heterogeneous data sources.

Keywords:	inductive learning; document object model; path expression; XML
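
The central idea in the abstract, using an element's path in the DOM hierarchy as the "coordinates" of the data to extract, can be illustrated with a small Java sketch. This is not the paper's algorithm or its generated code. The class name `DomPathSketch`, the `/tag[n]` path syntax, and the `generalize` step (which turns indices that differ between two marked examples into a `*` wildcard) are all assumptions made for illustration, and the input is assumed to be well-formed markup rather than arbitrary HTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration: DOM paths as extraction coordinates.
final class DomPathSketch {

    /** Parses a well-formed (XHTML-like) string into a DOM tree. */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Element children of parent with the given tag name, in document order. */
    static List<Element> children(Node parent, String tag) {
        List<Element> out = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element && c.getNodeName().equals(tag)) {
                out.add((Element) c);
            }
        }
        return out;
    }

    /** Path of an element from the root, e.g. /html[1]/body[1]/table[1]/tr[2]/td[1]. */
    static String pathOf(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node n = e; n instanceof Element; n = n.getParentNode()) {
            // 1-based position among same-named siblings
            int pos = children(n.getParentNode(), n.getNodeName()).indexOf(n) + 1;
            sb.insert(0, "/" + n.getNodeName() + "[" + pos + "]");
        }
        return sb.toString();
    }

    /** Follows a path from the document node; a step index of * matches every same-named child. */
    static List<String> extract(Document doc, String path) {
        List<Node> current = new ArrayList<>();
        current.add(doc);
        for (String step : path.substring(1).split("/")) {
            String tag = step.substring(0, step.indexOf('['));
            String idx = step.substring(step.indexOf('[') + 1, step.length() - 1);
            List<Node> next = new ArrayList<>();
            for (Node n : current) {
                List<Element> kids = children(n, tag);
                if (idx.equals("*")) {
                    next.addAll(kids);
                } else {
                    int i = Integer.parseInt(idx);
                    if (i >= 1 && i <= kids.size()) next.add(kids.get(i - 1));
                }
            }
            current = next;
        }
        List<String> texts = new ArrayList<>();
        for (Node n : current) texts.add(n.getTextContent().trim());
        return texts;
    }

    /** Generalizes two example paths of the same shape: steps that differ get a * index. */
    static String generalize(String a, String b) {
        String[] sa = a.substring(1).split("/");
        String[] sb = b.substring(1).split("/");
        if (sa.length != sb.length) throw new IllegalArgumentException("paths differ in depth");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < sa.length; i++) {
            String tagA = sa[i].substring(0, sa[i].indexOf('['));
            String tagB = sb[i].substring(0, sb[i].indexOf('['));
            if (!tagA.equals(tagB)) throw new IllegalArgumentException("paths differ in tags");
            out.append('/').append(sa[i].equals(sb[i]) ? sa[i] : tagA + "[*]");
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><table>"
                + "<tr><td>Title A</td><td>10.00</td></tr>"
                + "<tr><td>Title B</td><td>12.50</td></tr>"
                + "<tr><td>Title C</td><td>8.75</td></tr>"
                + "</table></body></html>");
        // Two cells a user might mark as examples: first column of rows 1 and 2.
        String rule = generalize("/html[1]/body[1]/table[1]/tr[1]/td[1]",
                                 "/html[1]/body[1]/table[1]/tr[2]/td[1]");
        System.out.println(rule);               // /html[1]/body[1]/table[1]/tr[*]/td[1]
        System.out.println(extract(doc, rule)); // [Title A, Title B, Title C]
    }
}
```

In this sketch the generalized path plays the role of an extraction rule: two marked examples are enough to cover the third row as well. A real wrapper would also need an HTML clean-up step before DOM parsing, which is omitted here.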
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
