
A Main-Text Information Extraction Method Based on Web Page Block Segmentation

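Cite this article:	HUANG Ling, CHEN Long. A main-text information extraction method based on Web page block segmentation [J]. Journal of Computer Applications (计算机应用), 2008, 28(Z2).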

Author names:	黄玲 (HUANG Ling), 陈龙 (CHEN Long)

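Author affiliation:	Institute of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065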

Funding:	Supported by the Natural Science Foundation of Chongqing

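Abstract:	The topical information of a Web page is usually buried in large amounts of irrelevant text and HTML markup, which makes it harder for applications to obtain that information quickly. This paper proposes a method for extracting main-text information based on Web page block segmentation. The method first identifies and extracts the content blocks that hold the main text of the page, then uses regular expressions and simple discrimination rules to filter out the HTML tags and irrelevant text inside those blocks. Experiments show that the method extracts the main text of Web pages accurately, generalizes well, and is easy to implement.
Keywords:	Web information extraction; topical content block; main text of Web page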
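
The two steps named in the abstract (locate the main content block, then clean it with regular expressions and simple rules) can be illustrated with a minimal Python sketch. The record gives no details of the authors' segmentation or discrimination rules, so everything specific here is an assumption made for illustration: the set of tags treated as block boundaries, the link-density score in `score_block`, and the function names themselves. It should not be read as the paper's algorithm.

```python
import html
import re

# Hypothetical choices, not taken from the paper: which tags delimit a block
# and how a block is scored.
BLOCK_SPLIT = re.compile(r"(?is)</?(?:div|td|table|section|article)\b[^>]*>")
SCRIPT_STYLE = re.compile(r"(?is)<(script|style)\b.*?</\1\s*>")
COMMENT = re.compile(r"(?s)<!--.*?-->")
LINK = re.compile(r"(?is)<a\b[^>]*>(.*?)</a\s*>")
TAG = re.compile(r"(?s)<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def strip_tags(fragment: str) -> str:
    """Remove remaining tags, decode entities, and collapse whitespace."""
    text = TAG.sub(" ", fragment)
    return WHITESPACE.sub(" ", html.unescape(text)).strip()


def score_block(fragment: str) -> float:
    """Assumed rule: favor long text with little of it inside links."""
    text_len = len(strip_tags(fragment))
    if text_len == 0:
        return 0.0
    link_len = sum(len(strip_tags(anchor)) for anchor in LINK.findall(fragment))
    return text_len * (1.0 - min(link_len / text_len, 1.0))


def extract_main_text(page: str) -> str:
    """Step 1: pick the highest-scoring block. Step 2: strip its markup."""
    # Drop scripts, styles, and comments first so their contents
    # are never mistaken for text or for block boundaries.
    page = COMMENT.sub(" ", SCRIPT_STYLE.sub(" ", page))
    blocks = BLOCK_SPLIT.split(page)
    best = max(blocks, key=score_block, default="")
    return strip_tags(best)
```

Calling `extract_main_text` on a page with a navigation bar, an article body, and a footer should return only the article text, because link-heavy blocks score low under the assumed rule. A regex-based splitter like this ignores nesting, so a production implementation would more likely segment on a parsed DOM tree.
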
Web information extraction based on visual block segmentation

HUANG Ling, CHEN Long. Web information extraction based on visual block segmentation [J]. Journal of Computer Applications, 2008, 28(Z2).

Authors:	HUANG Ling, CHEN Long

Affiliation:	HUANG Ling, CHEN Long (Institute of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China)

Abstract:	Besides their informative content, Web pages usually contain large amounts of irrelevant text and HTML tags, which makes it harder to extract the informative content quickly. A method for extracting the main text of a Web page based on block segmentation is proposed. Experimental results show that the method extracts the main text accurately, generalizes well across pages, and is easy to implement.

Keywords:	Web information extraction; informative content block; main text of Web page
This article is indexed in CNKI, Wanfang Data, and other databases.
