
Automatic Web Information Extraction Based on Page Clustering

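Cite this article:	Qiu Taofen, Yang Tianqi, Zeng Hongbo. Automatic Web information extraction based on page clustering[J]. Microcomputer & its Applications, 2011, 30(4): 71-74.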

Authors:	Qiu Taofen, Yang Tianqi, Zeng Hongbo

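Affiliation:	Department of Computer Science, College of Information Science and Technology, Jinan University, Guangzhou, Guangdong 510632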

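Abstract:	Dynamic Web pages are now widespread, exist in huge numbers, carry high-value data, and have highly templated structure. To exploit these characteristics, an automatic Web information extraction system based on page clustering is designed. Building on DOM-based extraction, the system uses page clustering to find highly similar clusters and introduces column similarity and global self-similarity measures to improve the accuracy of the clustering results. In the extraction template, optional nodes are used to correct and adjust the template so that content nodes are identified more accurately. Experimental results show that the method can automatically locate and extract the main information of Web pages, achieving high precision and recall.
Keywords:	Web information extraction; page clustering; wrapper generation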
Automatic Web information extraction based on page clustering

Qiu Taofen, Yang Tianqi, Zeng Hongbo. Automatic Web information extraction based on page clustering[J]. Microcomputer & its Applications, 2011, 30(4): 71-74.

Authors:	Qiu Taofen, Yang Tianqi, Zeng Hongbo

Affiliation:	Dept. of Computer, College of Information Science and Technology, Jinan University, Guangzhou 510632, China

Abstract:	Dynamic Web sites contain a large number of pages, carry high-value data, and have highly templated structure. Based on these features, this paper develops an automatic Web information extraction system based on page clustering. Building on DOM extraction techniques, it uses page clustering to find highly similar clusters and improves the accuracy of the clustering results with a column similarity measure and a global self-similarity measure. The extraction template applies optional nodes to modify and adjust the templ...

Keywords:	Web information extraction; page clustering; wrapper generation
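
The abstract describes grouping structurally similar pages before a template is derived. As a rough illustration of that general idea only, the sketch below clusters pages by the Jaccard similarity of their DOM tag paths. It is not the paper's method: the column similarity and global self-similarity measures are not defined in this record, so plain Jaccard similarity is swapped in, and the names `tag_paths`, `jaccard`, `cluster_pages`, and the `threshold` value are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in one HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths (e.g. 'html/body/div') in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.8):
    """Greedy single-pass clustering (hypothetical stand-in for the paper's
    clustering step): a page joins the first cluster whose representative
    page is at least `threshold` similar, otherwise it starts a new cluster."""
    clusters = []  # list of (representative_paths, member_indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    pages = [
        "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
        "<html><body><div><h1>B</h1><p>y</p></div></body></html>",
        "<html><body><table><tr><td>1</td></tr></table></body></html>",
    ]
    print(cluster_pages(pages))  # [[0, 1], [2]]
```

The first two sample pages share every tag path and fall into one cluster, while the table-based page shares only `html` and `html/body` with them and starts its own. A real system would likely need a finer similarity measure, which is presumably what the paper's additional measures address.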
This article is indexed in CNKI, Wanfang Data, and other databases.
