
Web Page Topic Information Extraction Method Based on Topic Text Feature (基于正文特征的网页正文信息提取方法)

Citation:	SUN Gui-huang, LIU Fa-sheng. Web Page Topic Information Extraction Method Based on Topic Text Feature [J]. Modern Computer (现代计算机), 2008(9).

Authors:	SUN Gui-huang (孙桂煌), LIU Fa-sheng (刘发升)

Affiliation:	Faculty of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000

Funding:	Science and Technology Research Project of the Jiangxi Provincial Department of Science and Technology

Abstract:	Body text on a web page tends to have two features: a high character count and many punctuation marks. Based on these two features, this paper proposes a method for extracting the body text of a web page. The method uses HTML tags to segment the page content into blocks, keeps the blocks that exhibit the body-text features, and discards the blocks that do not, thereby accurately obtaining body text with a high degree of completeness. Experimental results show that the method is effective and general.

Keywords:	Topical Text Feature; Information Extraction; Block Identification
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.
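
The block-filtering idea summarized in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the set of block-level tags, the punctuation set, the thresholds `min_chars` and `min_punct`, and the names `BlockSplitter` and `extract_body_text` are all assumptions made for the example, since the abstract does not specify them.

```python
from html.parser import HTMLParser

# Assumed segmentation boundaries; the paper's actual tag set is not given in the abstract.
BLOCK_TAGS = {"div", "p", "td", "tr", "table", "li", "ul", "ol", "h1", "h2", "h3", "br"}
# Tags whose content is never visible text.
SKIP_TAGS = {"script", "style"}
# Assumed punctuation set covering common Chinese and Western marks.
PUNCTUATION = set("，。、；：？！,.;:?!")


class BlockSplitter(HTMLParser):
    """Split a page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Close the current block and store it if it holds any text.
        text = "".join(self._buf).strip()
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Ignore text inside <script> and <style>.
        if self._skip_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_body_text(html, min_chars=30, min_punct=2):
    """Keep blocks with many characters and many punctuation marks (hypothetical thresholds)."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    kept = [
        block
        for block in splitter.blocks
        if len(block) >= min_chars
        and sum(ch in PUNCTUATION for ch in block) >= min_punct
    ]
    return "\n".join(kept)
```

Under these assumptions, short navigation bars and copyright lines fall below both thresholds and are dropped, while paragraph-like blocks pass. Suitable threshold values would need tuning on real pages.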
