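Web Information Extraction Based on HTML Pattern Algebra
Cite this article: Li Shijun, Yu Junqing, Ou Weijie. Web Information Extraction Based on HTML Pattern Algebra[J]. Journal of Computer Research and Development, 2006, 43(9): 1644-1650.
Authors: Li Shijun  Yu Junqing  Ou Weijie
Affiliations: 1(School of Computer Science, Wuhan University, Wuhan 430072) 2(Key Laboratory of Computer Science, Chinese Academy of Sciences, Beijing 100080) 3(School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430072) (shjli@public.wh.hb.cn)
Funding: National Natural Science Foundation of China; Natural Science Foundation of Hubei Province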
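Abstract: Efficiently generating wrappers for extracting Web information has broad application prospects, yet it remains a difficult problem that has not been solved effectively. To address it, a pattern algebra over HTML documents is proposed, which includes key concepts such as the consistent pattern set as well as an addition operation on patterns. Building on this algebra, a new method for extracting Web information is proposed: it learns, from the entire set of training examples, a consistent pattern set that represents the extraction rules for each attribute, and then extracts data using consistent pattern sets composed of multiple patterns. The method is suited to tabular and hierarchical Web pages that have missing attributes, multi-valued attributes, or attributes appearing in several different orders, and its effectiveness has been verified experimentally in a prototype system.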

Keywords: Web information extraction  wrapper induction learning  Web mining
Received: 2005-05-11
Revised: 2006-03-14

Web Information Extraction Based on HTML Pattern Algebra
Li Shijun, Yu Junqing, Ou Weijie. Web Information Extraction Based on HTML Pattern Algebra[J]. Journal of Computer Research and Development, 2006, 43(9): 1644-1650.
Authors:Li Shijun  Yu Junqing  Ou Weijie
Affiliation:1(School of Computer Science, Wuhan University, Wuhan 430072) 2(Laboratory of Computer Science, Chinese Academy of Sciences, Beijing 100080) 3(School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430072)
Abstract: Efficiently generating wrappers for extracting Web data has broad application prospects, but remains a difficult problem that has not yet been solved effectively. To tackle this problem, a pattern algebra for HTML documents is introduced, which includes key concepts such as the consistent pattern set and an addition operation on patterns. Based on this algebra, a new approach to extracting Web information is presented. It induces, from the whole set of training samples, the consistent pattern set that represents the extraction rules of each attribute, and then extracts data using consistent pattern sets made up of multiple patterns. The approach applies to Web pages with tabular or hierarchical structure in which attributes may be missing, multi-valued, or arranged in different orders, and it has been validated experimentally in a prototype system.
Keywords:Web information extraction   wrapper induction   Web mining
This article is indexed in CNKI, VIP (维普), Wanfang Data, and other databases.