
Web Content Information Extraction Method Based on Tag Window* (基于标记窗的网页正文信息提取方法)

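Cite this article:	ZHAO Xinxin (赵欣欣), SUO Hongguang (索红光), LIU Yushu (刘玉树). Web Content Information Extraction Method Based on Tag Window*[J]. Application Research of Computers (计算机应用研究), 2007, 24(3): 144-145.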

Authors:	ZHAO Xinxin (赵欣欣), SUO Hongguang (索红光), LIU Yushu (刘玉树)

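Affiliations:	1. Institute of Computer Application Technology, China Ordnance Industry, Beijing 100089; 2. Department of Computer Science and Engineering, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081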

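Abstract:	A method for extracting the main text of a Web page based on tag windows is proposed. The method handles pages in which all of the main text sits in a single td, pages in which the main text is spread across several td elements, and pages whose main text is so short that its length is comparable to that of the other text on the page (such as advertisements, navigation bars, and copyright notices). Most importantly, it can extract the main text from pages that are not built on a Table structure. Experiments show that the method improves the accuracy of main-text extraction and is widely applicable.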
Keywords:	tag window; extraction; Document Object Model (DOM)
Article ID:	1001-3695(2007)03-0144-02
Revision dates:	2006-01-18; 2006-04-05
Web Content Information Extraction Method Based on Tag Window

ZHAO Xinxin, SUO Hongguang, LIU Yushu. Web Content Information Extraction Method Based on Tag Window[J]. Application Research of Computers, 2007, 24(3): 144-145.

Authors:	ZHAO Xinxin, SUO Hongguang, LIU Yushu

Abstract:	This paper proposes a Web content information extraction method based on the tag window. The method handles several special cases: pages where all of the content is placed in one td, pages where the content is spread over several tds, and pages where the content contains no more characters than the other information on the page (navigation bars, advertisements, copyright notices, etc.). Most importantly, it can extract Web content that is not laid out in a Table structure. Experiments show that the method improves the accuracy of Web content extraction and is widely applicable.

Keywords:	tag window; extraction; DOM
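
The abstract does not spell out the algorithm, so the following is only a rough sketch of one way a tag-window style extractor might look, not the authors' method. It assumes that a "window" is the run of text between two block-level tags, and that a window's score is its character count minus the characters that sit inside links. The names `TagWindowParser`, `extract_content`, `BLOCK_TAGS`, and the `min_ratio` threshold are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags mark the boundary between two windows.
BLOCK_TAGS = {"td", "th", "tr", "table", "div", "p", "li", "ul", "ol",
              "h1", "h2", "h3", "br", "body"}
SKIP_TAGS = {"script", "style"}  # their text is never page content


class TagWindowParser(HTMLParser):
    """Split a page into windows: runs of text between block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.windows = []      # (total_chars, link_chars, text) per window
        self._text = []        # text pieces of the window being built
        self._link_chars = 0   # characters seen inside <a> in this window
        self._in_link = 0
        self._skip = 0

    def _close_window(self):
        # Collapse whitespace, store the window if it holds any text.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.windows.append((len(text), self._link_chars, text))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._close_window()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._close_window()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._close_window()  # flush any text after the last tag


def extract_content(html, min_ratio=0.3):
    """Return the windows whose non-link text is close to the best window."""
    parser = TagWindowParser()
    parser.feed(html)
    parser.close()
    scored = [(total - links, text) for total, links, text in parser.windows]
    best = max((score for score, _ in scored), default=0)
    if best <= 0:
        return ""
    return "\n".join(text for score, text in scored
                     if score >= min_ratio * best)
```

Because windows are delimited by any block-level tag rather than by td alone, the same code runs on table-based and table-free pages, and discounting link text is one plausible way to separate a short article body from a navigation bar of similar length. Both choices are assumptions of this sketch.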
This article is indexed in databases including CNKI, VIP (维普), and Wanfang Data (万方数据).