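Improved Feature Weighting Algorithm for Text Categorization
Cite this article:TAI Deyi (台德艺), WANG Jun (王俊). Improved Feature Weighting Algorithm for Text Categorization[J]. Computer Engineering (计算机工程), 2010, 36(9): 197-199,
Authors:TAI Deyi (台德艺)  WANG Jun (王俊)
Affiliation:(Key Laboratory of Machine Vision and Intelligence Control Technology, Hefei University, Hefei 230601)
Abstract:TF-IDF is a feature term weighting algorithm widely used in text categorization. It emphasizes term frequency and inverse document frequency, but it cannot capture how a feature term is distributed among classes and within a class. To raise the weight of representative feature terms that occur frequently in one class and are evenly distributed within that class, a distribution concentration coefficient is introduced to improve the IDF function, a dispersion coefficient is applied as an additional weight, and the TF-IIDF-DIC weighting function is proposed. Experimental results show that the macro-averaged F1 of K-NN text categorization based on the TF-IIDF-DIC weighting algorithm is 6.79% higher than that of the TF-IDF algorithm.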

Keywords:vector space model  text categorization  feature weighting  feature distribution

Improved Feature Weighting Algorithm for Text Categorization
Affiliation:(Key Laboratory of Machine Vision and Intelligence Control Technology, Hefei University, Hefei 230601)
Abstract:TF-IDF (Term Frequency/Inverse Document Frequency), one of the feature weighting schemes for the Vector Space Model (VSM), is widely used and performs well in text categorization. Although it accounts for term frequency and inverse document frequency, TF-IDF ignores how a term is distributed among classes and within a class. A new feature weighting algorithm, TF-IIDF-DIC, based on an improved IDF and a distribution coefficient, is proposed to raise the weight of terms that occur frequently and are evenly distributed within the same class. Experimental results show that, compared with the conventional TF-IDF algorithm, the F1 value obtained with TF-IIDF-DIC increases by 6.79%.
Keywords:Vector Space Model(VSM)  text categorization  feature weighting  feature distribution
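
The limitation described in the abstract can be illustrated with a minimal sketch of the classic TF-IDF baseline. This is not the TF-IIDF-DIC function proposed in the paper, whose formula is not given in this abstract. The variant used here (raw term count times log(N / df)), the function name `tf_idf`, and the toy corpus with its class labels are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Classic TF-IDF weights for tokenized documents.

    docs: list of token lists. Returns one {term: weight} dict per document.
    Assumed variant: raw term count multiplied by log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus. The class labels appear only in comments,
# because the TF-IDF computation never looks at them.
docs = [
    ["team", "match"],   # class: sports
    ["team", "news"],    # class: sports
    ["stock", "news"],   # class: finance
    ["stock", "bank"],   # class: finance
]
w = tf_idf(docs)

# "team" occurs only in sports documents, "news" is spread over both classes,
# yet both have df = 2 and therefore receive the same weight in document 1.
print(w[1]["team"], w[1]["news"])
```

Both terms receive the same weight because their document frequencies are equal, even though "team" is concentrated in one class and "news" is not. A class-aware weighting scheme of the kind the abstract describes would be expected to separate the two.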