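Main-Text Extraction from Flexibly Structured Web Pages
Cite this article: YIN Bin, YANG Hui-zhi. Main-Text Extraction from Flexibly Structured Web Pages [J]. Microcomputer Development (微机发展), 2011(9): 111-113, 117.
Authors: YIN Bin (殷彬), YANG Hui-zhi (杨会志)
Affiliation: Zhongshan Institute, University of Electronic Science and Technology of China, Zhongshan, Guangdong 528400
Funding: Zhongshan Science and Technology Plan Project (20092A210)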
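Abstract: In Web data mining, most web pages contain noise such as hyperlinks pointing to other pages. To reduce the effect of this noise on mining results, pages need to be cleaned and their main text extracted. At the same time, the code structure of many real-world pages is not particularly well formed. To address this, a main-text extraction algorithm suited to flexibly structured web pages is proposed. The page is split into nodes by its HTML tags, one node containing main-text content is located, and the remaining main text is then extracted by searching forward and backward from that node. Experimental results show that the algorithm is widely applicable and achieves relatively high accuracy.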

Keywords: Web data mining  web page content extraction  main-text node  hyperlink node  node weight  link density

Content Extraction Based on Unknown Structure Web
YIN Bin, YANG Hui-zhi. Content Extraction Based on Unknown Structure Web [J]. Microcomputer Development, 2011(9): 111-113, 117.
Authors: YIN Bin, YANG Hui-zhi
Affiliation: Zhongshan Institute, University of Electronic Science and Technology of China, Zhongshan 528400, China
Abstract: Web pages often contain useless information, such as hyperlinks and copyright notices, which reduces the accuracy of Web data mining results, so the useful text content must be extracted from a page before mining. In addition, the HTML code of many pages is not standard. To address this, the paper proposes a content extraction approach for Web pages of unknown structure. It splits a Web page into nodes using the HTML tags, finds one node that contains valuable content, and then searches for other informative content nodes before and after that node. Finally, it extracts the article by concatenating the contents of all the nodes found. Experiments show that the algorithm can handle unstructured Web pages and is effective.
Keywords: Web data mining  Web information extraction  content node  hyperlink node  node weight  link density
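
The steps named in the abstract (split the page into nodes by HTML tags, pick one content node, then extend forward and backward) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define node weight or link density, so the formulas here (link density as the share of characters inside `<a>` tags, weight as text length scaled by `1 - link density`), the inline-tag list, the thresholds `max_link_density` and `min_chars`, and the rule of stopping at the first non-content neighbour are all hypothetical choices.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumption: these tags do not start a new node; every other tag does.
INLINE_TAGS = {"a", "b", "i", "u", "em", "strong", "span", "font"}
SKIP_TAGS = {"script", "style"}  # their contents are never page text


@dataclass
class Node:
    text: str
    link_chars: int  # characters that appeared inside <a>...</a>

    @property
    def link_density(self) -> float:
        # Hypothetical definition: share of the node's text that is link text.
        return self.link_chars / len(self.text) if self.text else 1.0

    @property
    def weight(self) -> float:
        # Hypothetical definition: long, link-poor nodes score highest.
        return len(self.text) * (1.0 - self.link_density)


class NodeSplitter(HTMLParser):
    """Cuts a page into text nodes at every non-inline tag boundary."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._parts = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        text = " ".join(self._parts)
        if text:
            self.nodes.append(Node(text, self._link_chars))
        self._parts, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if not text or self._skip:
            return
        self._parts.append(text)
        if self._in_link:
            self._link_chars += len(text)


def extract_main_text(html, max_link_density=0.5, min_chars=20):
    splitter = NodeSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()  # emit any text left after the last tag
    nodes = splitter.nodes
    if not nodes:
        return ""

    # Seed: the node with the highest weight.
    seed = max(range(len(nodes)), key=lambda i: nodes[i].weight)

    def is_content(node):
        return node.link_density <= max_link_density and len(node.text) >= min_chars

    # Grow the span backward, then forward, until a non-content node is met.
    start = seed
    while start > 0 and is_content(nodes[start - 1]):
        start -= 1
    end = seed
    while end + 1 < len(nodes) and is_content(nodes[end + 1]):
        end += 1
    return "\n".join(node.text for node in nodes[start:end + 1])
```

Because `html.parser` does not require balanced or well-formed markup, the sketch tolerates the non-standard HTML that the abstract mentions. On a real page the two thresholds would need tuning against labelled examples.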
This article is indexed in databases including VIP (维普).