
Main-Text Extraction from Flexibly Structured Web Pages

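Cite this article:	YIN Bin, YANG Hui-zhi. Main-Text Extraction from Flexibly Structured Web Pages[J]. Microcomputer Development, 2011(9): 111-113, 117.

Authors:	YIN Bin, YANG Hui-zhi

Affiliation:	Zhongshan Institute, University of Electronic Science and Technology of China, Zhongshan, Guangdong 528400

Funding:	Zhongshan Science and Technology Plan Project (20092A210)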

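Abstract:	In Web data mining, most Web pages contain noise such as hyperlinks to other pages. To reduce the effect of this noise on mining results, pages need to be cleaned and their main text extracted. In practice, moreover, the markup of many pages is not well structured. This paper therefore proposes a main-text extraction algorithm suited to flexibly structured Web pages. The page is split into nodes at its HTML tags, one node containing main-text content is located, and the remaining main text is extracted by working forward and backward from that node. Experimental results show that the algorithm is widely applicable and achieves relatively high accuracy.
Keywords:	Web data mining; Web page content extraction; content node; hyperlink node; node weight; link density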
Content Extraction Based on Unknown Structure Web

YIN Bin, YANG Hui-zhi. Content Extraction Based on Unknown Structure Web[J]. Microcomputer Development, 2011(9): 111-113, 117.

Authors:	YIN Bin, YANG Hui-zhi

Affiliation:	Zhongshan Institute, University of Electronic Science and Technology of China, Zhongshan 528400, China

Abstract:	Web pages often contain information that is useless for mining, such as hyperlinks and copyright notices, which reduces the accuracy of Web data mining results. Extracting the useful text content from a Web page before mining is therefore necessary. In addition, the HTML of some pages is not standard. To address this, this paper proposes a Web information extraction approach for Web pages of unknown structure. It splits a Web page into many nodes using HTML tags, finds one node that contains valuable information, searches for other informative content nodes before and after that node, and finally extracts the article by concatenating the contents of all the nodes found. Experiments show that the algorithm can handle unstructured Web pages and is effective.

Keywords:	Web data mining; Web information extraction; content node; hyperlink node; node weight; link density
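
The abstract describes the approach only in outline: split the page into nodes at HTML tags, find one node that holds main text, then extend forward and backward from it. The following is a minimal Python sketch of that outline, not the authors' implementation. The set of boundary tags, the `weight` formula (text length scaled by one minus link density), and the `max_density` and `min_chars` thresholds are all assumptions made for illustration; the abstract names "node weight" and "link density" as keywords but does not define them.

```python
from html.parser import HTMLParser

# Tags treated as node boundaries (an assumption; the abstract only says "HTML tags").
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "table", "tr", "br", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class NodeSplitter(HTMLParser):
    """Split a page into flat text nodes and count the characters inside <a>."""

    def __init__(self):
        super().__init__()
        self.nodes = []        # list of (text, link_chars) pairs
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current node; whitespace-only nodes are dropped.
        text = " ".join("".join(self._text).split())
        if text:
            self.nodes.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            # Count non-whitespace characters only, to match link_density().
            self._link_chars += len("".join(data.split()))

    def close(self):
        super().close()
        self._flush()


def link_density(node):
    """Fraction of a node's non-whitespace characters that sit inside links."""
    text, link_chars = node
    total = len(text.replace(" ", ""))
    return link_chars / total if total else 1.0


def weight(node):
    # Hypothetical node weight: long text with few links scores high.
    text, _ = node
    return len(text) * (1.0 - link_density(node))


def extract_main_text(html, max_density=0.3, min_chars=20):
    """Pick the heaviest node as the seed, then grow the span in both directions."""
    splitter = NodeSplitter()
    splitter.feed(html)
    splitter.close()
    nodes = splitter.nodes
    if not nodes:
        return ""
    seed = max(range(len(nodes)), key=lambda i: weight(nodes[i]))

    def is_content(i):
        return link_density(nodes[i]) <= max_density and len(nodes[i][0]) >= min_chars

    start = end = seed
    while start > 0 and is_content(start - 1):
        start -= 1
    while end < len(nodes) - 1 and is_content(end + 1):
        end += 1
    return "\n".join(text for text, _ in nodes[start:end + 1])
```

Calling `extract_main_text(page_html)` on a page whose article paragraphs sit between a navigation bar and a footer made of links should return the paragraphs and drop the link-only nodes, because those nodes have a link density near 1. Using only the standard-library `html.parser` keeps the sketch tolerant of the non-standard markup the abstract mentions, since that parser does not require well-formed input.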
This article is indexed in VIP (维普) and other databases.
