A New Statistical-based Method in Automatic Text Classification
Citation: LIU Bin, HUANG Tiejun, CHENG Jun, GAO Wen. A New Statistical-based Method in Automatic Text Classification[J]. Journal of Chinese Information Processing, 2002, 16(6): 19-25.
Authors: LIU Bin (刘斌), HUANG Tiejun (黄铁军), CHENG Jun (程军), GAO Wen (高文)
Affiliations: 1. Institute of Computing Technology, Chinese Academy of Sciences; 2. Graduate School of the Chinese Academy of Sciences; 3. The Library of Chinese Academy of Sciences
Funding: Major special project of the National Science Digital Library (CSDL2002-18)
Abstract: Automatic text classification is the task of having a computer assign a text to its relevant categories, based on the text's content, under a given classification scheme. To improve classification performance, this paper proposes a multi-level feature extraction method for Chinese text and a kernel-based distance-weighted KNN algorithm. The multi-level feature extraction method extracts statistical features of a document at three levels (Chinese characters, a common word list, and a specialized word list), which better reflects the document's statistical distribution. The kernel-based distance-weighted KNN algorithm addresses multi-peak sample distributions, overlapping class boundaries, and the classifier's need for precise classification decisions. In practice, the Internet and text databases provide large numbers of coarsely pre-classified training texts, but sample quality is often poor; this paper handles the problem with sample importance analysis. An experimental system demonstrates the effectiveness of the new method.

Keywords: automatic text classification; multi-level feature extraction; kernel-based distance-weighted KNN algorithm; sample importance analysis
Revised manuscript received: August 29, 2002
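
The abstract names a kernel-based distance-weighted KNN algorithm but gives no formula, so the following is only a minimal sketch of the general idea rather than the authors' method. It assumes a Gaussian kernel over Euclidean distance; the function names, the `bandwidth` parameter, and the optional `sample_weights` argument (a rough stand-in for down-weighting low-quality training samples) are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gaussian_kernel(distance, bandwidth):
    """Assumed kernel: weight decays smoothly as distance grows."""
    return math.exp(-(distance ** 2) / (2.0 * bandwidth ** 2))


def kernel_weighted_knn(query, samples, labels, k=3, bandwidth=1.0,
                        sample_weights=None):
    """Classify `query` by a kernel-weighted vote among its k nearest samples.

    Each neighbour contributes sample_weight * kernel(distance) to its own
    class, so close neighbours count for more than distant ones.
    Returns (predicted_label, votes_per_label).
    """
    if sample_weights is None:
        sample_weights = [1.0] * len(samples)

    # Pair every training sample with its distance, keep the k closest.
    nearest = sorted(
        (euclidean(query, s), i) for i, s in enumerate(samples)
    )[:k]

    votes = defaultdict(float)
    for dist, i in nearest:
        votes[labels[i]] += sample_weights[i] * gaussian_kernel(dist, bandwidth)

    return max(votes, key=votes.get), dict(votes)


# Toy data (hypothetical): three "a" samples far from the query, two "b" near it.
samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b"]

predicted, votes = kernel_weighted_knn((4, 4), samples, labels, k=5)
print(predicted, votes)
```

With `k=5` an unweighted majority vote on this toy data would favour the class with more samples, whereas the kernel weighting lets the two nearby samples dominate. Setting a sample's weight to zero removes its influence entirely, which is one simple way to model discarding an unreliable training text.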
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.