
A General Method for Extracting Web Topic Text

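Cite this article:	PU Qiang, LI Xin, LIU Qi-he, YANG Guo-wei. A general method for extracting Web topic text [J]. Journal of Computer Applications (计算机应用), 2007, 27(6): 1394-1396.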

Authors:	PU Qiang, LI Xin, LIU Qi-he, YANG Guo-wei

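Affiliation:	School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 610051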

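Funding:	National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program)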

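Abstract:	To support the construction of a large-scale Chinese text corpus, this paper proposes a simple, effective, and general method for extracting topic text from Chinese Web pages. The method exploits the length of Chinese text runs and the sequence of punctuation marks, together with a small number of discrimination rules, to extract the topic text from a page accurately. Because it does not analyze specific HTML tags, the method generalizes well. Experimental results show that the extraction is fast and accurate and meets the requirements of building a large-scale Chinese text corpus.
Keywords:	Web text; text extraction; text corpus
Article ID:	1001-9081(2007)06-1394-03
Received:	2006-12-04
Revised:	2006-12-04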
Study on general extracting method of Web topic text

PU Qiang, LI Xin, LIU Qi-he, YANG Guo-wei. Study on general extracting method of Web topic text [J]. Journal of Computer Applications, 2007, 27(6): 1394-1396.

Authors:	PU Qiang, LI Xin, LIU Qi-he, YANG Guo-wei

Affiliation:	College of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China

Abstract:	This paper proposes a simple, efficient, and general method for extracting Chinese topic text from Web pages in order to build a large Chinese text corpus. The method relies only on the length of Chinese text and the sequence of punctuation marks, together with a few discrimination rules, to extract the required text from Web pages accurately without analyzing HTML tags. Experiments show that the extraction is fast and accurate enough to meet the requirements of constructing a large Chinese text corpus.

Keywords:	Web text; text extraction; text corpus
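
The abstract describes the idea only at a high level: keep text whose Chinese length and punctuation pattern look like running prose, and ignore the meaning of individual HTML tags. A minimal Python sketch of that idea follows. It is not the authors' algorithm; the thresholds `MIN_CHINESE_CHARS` and `MIN_PUNCTUATION`, the punctuation set, and the function name `extract_topic_text` are all assumptions made for illustration, and the paper's actual discrimination rules are not given in this record.

```python
import re

# Hypothetical thresholds; the abstract does not state the paper's values.
MIN_CHINESE_CHARS = 20
MIN_PUNCTUATION = 2

CJK = re.compile(r"[\u4e00-\u9fff]")      # common Chinese characters
PUNCT = re.compile(r"[，。；：？！、]")      # assumed set of Chinese punctuation
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.S | re.I)
TAG = re.compile(r"<[^>]+>")


def extract_topic_text(html: str) -> str:
    """Return the blocks of a page that look like running Chinese prose."""
    # Drop script/style bodies, then turn every remaining tag into a line
    # break. Tags are used only as separators; none is interpreted.
    text = TAG.sub("\n", SCRIPT_STYLE.sub("\n", html))
    kept = []
    for block in text.split("\n"):
        block = block.strip()
        # Assumed rule: enough Chinese characters AND enough punctuation.
        if (len(CJK.findall(block)) >= MIN_CHINESE_CHARS
                and len(PUNCT.findall(block)) >= MIN_PUNCTUATION):
            kept.append(block)
    return "\n".join(kept)
```

Under these assumed rules, short navigation links and copyright lines are discarded because they contain few Chinese characters and almost no punctuation, while body paragraphs pass both checks. One known limitation of this sketch is that inline tags inside a paragraph split it into smaller blocks, which may then fall below the thresholds.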
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
