Quantitative Analysis of Impact Factors in Text Categorization
Cite this article: GAO Ying-fan, MA Run-bo, LIU Yu-shu. Quantitative Analysis of Impact Factors in Text Categorization[J]. Computer Engineering, 2008, 34(9): 222-224.
Authors: GAO Ying-fan, MA Run-bo, LIU Yu-shu
Affiliations: 1. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081
2. College of Physics and Electronics, Shanxi University, Taiyuan 030006
Abstract: Using a category-feature database that contains all features of the training set, this study applies the distance-based Rocchio and Fast TC algorithms and the probabilistic NB (Naive Bayes) algorithm to quantify how four factors affect text categorization accuracy: stopwords, stemming, digits, and test document length. The experiments show that stopword filtering is a lossless form of feature compression, and that stemming slightly lowers accuracy but remains a viable way to compress features. The semantic association between digits and other words raises the accuracy of Rocchio and Fast TC but lowers that of NB, which treats features as mutually independent. The way the accuracy of the three algorithms changes as different numbers of keywords are taken from each test document shows how the useful information and the noise carried by features affect accuracy. These widely used feature selection methods are thus closely tied to categorization results.

Keywords: category-feature database; impact factors; categorization effectiveness
Article ID: 1000-3428(2008)09-0222-03
Revised: May 15, 2007
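
The four factors named in the abstract can be pictured as switches on a preprocessing step that feeds a classifier. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the stopword list, the suffix-stripping `crude_stem`, the function names, and the positive-only centroid variant of Rocchio with cosine similarity are all hypothetical choices made here. Fast TC and NB are not sketched, since the abstract does not describe their details.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}


def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, drop_stopwords=True, stem=True, keep_digits=True, max_terms=None):
    """Turn text into term counts, with the four factors exposed as switches."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if not keep_digits:
        tokens = [t for t in tokens if not t.isdigit()]
    if stem:
        tokens = [t if t.isdigit() else crude_stem(t) for t in tokens]
    counts = Counter(tokens)
    if max_terms is not None:
        # Keep only the most frequent terms, mimicking a shorter test document.
        counts = Counter(dict(counts.most_common(max_terms)))
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(labeled_docs, **options):
    """Sum term counts per class to form one prototype vector per category."""
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(preprocess(text, **options))
    return centroids


def classify(text, centroids, max_terms=None, **options):
    """Assign the category whose prototype is most similar to the document."""
    vector = preprocess(text, max_terms=max_terms, **options)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


training = [
    ("sports", "The team won the match in 2008 scoring 3 goals"),
    ("sports", "Players trained for the championship match"),
    ("finance", "Stocks gained 3 percent and markets rallied"),
    ("finance", "The bank reported rising profits in 2008"),
]
centroids = train_centroids(training)
prediction = classify("the players are scoring goals", centroids)
```

Re-running `train_centroids` and `classify` with `drop_stopwords=False`, `stem=False`, `keep_digits=False`, or different `max_terms` values on a held-out set is one way to measure each factor's effect on accuracy in the spirit of the study.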
This article is indexed in VIP, Wanfang Data, and other databases.