基于统计分词的中文网页分类 (Chinese Web Page Classification Based on Statistical Word Segmentation)

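Cite this article:	HUANG Ke, MA Shao-ping. 基于统计分词的中文网页分类 [Chinese Web Page Classification Based on Statistical Word Segmentation][J]. 中文信息学报 (Journal of Chinese Information Processing), 2002, 16(6): 26-32.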

Authors:	黄科 (HUANG Ke), 马少平 (MA Shao-ping)

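Affiliation:	State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University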

Funding:	National Key Basic Research Program of China (973 Program, G1998030509); National High-Tech Program (863 Program, 2001AA114082)

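Abstract:	This paper applies a statistics-based bigram word segmentation method to Chinese web page classification. Without any word list prepared in advance, a list of two-character words is constructed statistically; the text in each web page is then segmented with this list, and the pages are classified from the segmented text. Text on the Internet differs considerably in wording style and vocabulary across types and sources, new words appear constantly, and large amounts of text of the same type are easy to obtain as training corpora. Together these conditions make statistical word segmentation feasible. The paper tests experimentally how well a statistically constructed two-character word list works for Chinese web page classification. The experiments show that, when the statistical threshold is chosen appropriately, segmenting with the constructed word list before classification effectively improves classification precision. The paper also analyzes the different effects that single characters and segmented words have on text classification, and the reasons for the difference.
Keywords:	text categorization; statistical word segmentation; machine learning; computer network
Revised manuscript received:	May 7, 2002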
Chinese Web Page Classification Based On Statistical Word Segmentation

HUANG Ke, MA Shao-ping. Chinese Web Page Classification Based On Statistical Word Segmentation[J]. Journal of Chinese Information Processing, 2002, 16(6): 26-32.

Authors:	HUANG Ke, MA Shao-ping

Affiliation:	National Key Lab of Intelligent Technology and System, Department of Computer Science and Technology, Tsinghua University

Abstract:	Word segmentation is an important step in Chinese natural language processing. This paper explores the problem of classifying Chinese web pages based on statistical word segmentation. We first automatically construct a list of two-character Chinese words from the training web pages. The text in the test web pages is then segmented with this word list, and the pages are classified from the segmentation results. Experiments show that statistical word segmentation can effectively improve classification precision. Based on the experimental results, we analyze the influence of statistical word segmentation on Chinese web page classification. Single Chinese characters and words play different roles in web page classification, and the reasons for the difference are also analyzed.

Keywords:	text categorization; statistical word segmentation; machine learning; computer network
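
The two steps named in the abstract (building a two-character word list from training text, then segmenting with it) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which statistic or matching strategy the authors used, so the raw frequency count, the `threshold` parameter, the greedy left-to-right matching, and the function names `build_bigram_word_list` and `segment` are all hypothetical choices made here.

```python
import re
from collections import Counter

# Runs of CJK unified ideographs; punctuation and Latin text act as boundaries.
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")


def build_bigram_word_list(texts, threshold):
    """Keep adjacent character pairs whose corpus count reaches `threshold`.

    Assumption: a plain frequency cutoff stands in for the paper's
    unspecified statistical criterion.
    """
    counts = Counter()
    for text in texts:
        for run in HAN_RUN.findall(text):
            for i in range(len(run) - 1):
                counts[run[i:i + 2]] += 1
    return {pair for pair, n in counts.items() if n >= threshold}


def segment(text, word_list):
    """Greedy left-to-right split into listed two-character words,
    falling back to single characters (an assumed matching strategy)."""
    tokens = []
    for run in HAN_RUN.findall(text):
        i = 0
        while i < len(run):
            pair = run[i:i + 2]
            if len(pair) == 2 and pair in word_list:
                tokens.append(pair)
                i += 2
            else:
                tokens.append(run[i])
                i += 1
    return tokens


# Toy training corpus, invented for illustration only.
training = ["网页分类很重要", "中文网页分类", "网页很多"]
words = build_bigram_word_list(training, threshold=2)
print(segment("网页分类", words))
```

With this toy corpus and `threshold=2`, the segmenter returns the two tokens 网页 and 分类; raising the threshold to 3 leaves only 网页 in the word list, so 分 and 类 fall back to single characters. That sensitivity is a small-scale picture of why the abstract stresses choosing the threshold appropriately. The resulting tokens would then serve as features for a classifier, which the abstract does not specify.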
This article is indexed in databases including CNKI, VIP, and Wanfang Data.