Web information extraction algorithm based on Web page segmentation

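Cite this article:	Hou Mingyan, Yang Tianqi. Web information extraction algorithm based on Web page segmentation[J]. Microcomputer & its Applications, 2011, 30(5): 54-56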

Authors:	Hou Mingyan, Yang Tianqi

Affiliation:	Department of Computer Science, Jinan University, Guangzhou, Guangdong 510632, China

Funding:	Guangdong Province Soft Science Research Project (2009B070300052)

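Abstract:	To address the high complexity of extracting unstructured information from Web pages, this paper proposes a Web information extraction algorithm based on Web page segmentation. Page noise is first removed in a preprocessing step. Tag paths are then clustered according to the page's Document Object Model (DOM) tree structure. An automatically trained threshold, combined with a Web page segmentation algorithm, quickly identifies the key part of the page, and a text extraction template is derived from the nested structure within the data blocks. Experiments on different types of websites show that the algorithm runs fast and achieves high accuracy.
Keywords:	Web page segmentation; information extraction; clustering; threshold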
Web information extraction algorithm based on Web page segmentation

Hou Mingyan, Yang Tianqi. Web information extraction algorithm based on Web page segmentation[J]. Microcomputer & its Applications, 2011, 30(5): 54-56

Authors:	Hou Mingyan, Yang Tianqi

Affiliation:	Hou Mingyan, Yang Tianqi (Department of Computer Science, Jinan University, Guangzhou 510632, China)

Abstract:	This paper proposes a Web information extraction algorithm based on Web page segmentation to address the high complexity of unstructured information extraction. The method first preprocesses Web page noise and then clusters tag paths according to the page's document object model (DOM) tree structure. The key part of the page is determined rapidly using an automatically trained threshold and a Web page segmentation algorithm, and Web text extraction templates are obtained according to the nesting structure in the data ...

Keywords:	Web page segmentation; information extraction; clustering; threshold
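
The abstract names noise preprocessing, tag path clustering over the DOM tree, and a threshold that selects the key part of the page, but gives no implementation detail. The following is only a minimal sketch of that general idea, not the authors' algorithm. The names `TagPathCollector`, `key_paths`, `NOISE_TAGS`, and `VOID_TAGS` are hypothetical, the noise tag list is an assumption, and the threshold is a fixed parameter here, whereas the paper trains it automatically.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags treated as noise in this sketch (an assumption, not the paper's list).
NOISE_TAGS = {"script", "style", "nav", "footer", "form"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Groups visible text by the tag path that leads to it in the DOM tree."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # currently open tags
        self.noise_depth = 0               # > 0 while inside a noise element
        self.clusters = defaultdict(list)  # tag path -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag, ignore it
        # Pop until the matching tag so unclosed children are tolerated.
        while self.stack:
            popped = self.stack.pop()
            if popped in NOISE_TAGS:
                self.noise_depth -= 1
            if popped == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.clusters["/".join(self.stack)].append(text)


def key_paths(html, threshold=0.5):
    """Return tag paths whose share of the visible text is >= threshold."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    sizes = {p: sum(len(t) for t in ts) for p, ts in collector.clusters.items()}
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {p: collector.clusters[p] for p, n in sizes.items() if n / total >= threshold}


if __name__ == "__main__":
    page = """
    <html><body>
      <nav><a href="/">Home</a></nav>
      <div><p>First paragraph of the article body.</p>
           <p>Second paragraph of the article body.</p></div>
      <footer>Copyright</footer>
    </body></html>
    """
    for path, fragments in key_paths(page).items():
        print(path, fragments)
```

Grouping text by identical tag path is the simplest possible form of clustering. A fuller treatment would presumably also compare similar but non-identical paths and derive an extraction template from the repeated nested structure, which this sketch does not attempt.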
This article is indexed in CNKI, Wanfang Data, and other databases.
