
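A feature weighting algorithm based on document class density*
Cite this article:Zhou Pengcheng, Liu Xumin, Xu Weixiang. A feature weighting algorithm based on document class density*[J]. Application Research of Computers, 2018, 35(11).
Authors:Zhou Pengcheng  Liu Xumin  Xu Weixiang
Affiliations:College of Information Engineering, Capital Normal University; College of Information Engineering, Capital Normal University; School of Traffic and Transportation, Beijing Jiaotong University
Funding:National Natural Science Foundation of China (General Program, Key Program, Major Program)
Abstract:Automatic text classification is indispensable for managing and analyzing massive amounts of data. Feature weighting is the foundation of the text classification process, and a good feature weighting algorithm can markedly improve classification performance. This paper compares several feature weighting algorithms and, to address the shortcomings of earlier algorithms, proposes a feature weighting algorithm based on document class density (tf-idcd). Besides the traditional term-frequency measure, the algorithm introduces a new concept, document class density, which is measured as the ratio of the number of documents in a class that contain the feature to the total number of documents in that class. Finally, five algorithms are compared experimentally on two common Chinese datasets. The results show that the proposed algorithm achieves considerably higher macro-averaged F1 and micro-averaged F1 than the other feature weighting algorithms.

Keywords:feature weighting  document class density  text classification  support vector machine
Received:2017/6/7
Revised:2018/9/26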

A feature weighting algorithm based on document class density
Zhou Pengcheng, Liu Xumin and Xu Weixiang. A feature weighting algorithm based on document class density[J]. Application Research of Computers, 2018, 35(11).
Authors:Zhou Pengcheng  Liu Xumin and Xu Weixiang
Affiliation:College of Information Engineering, Capital Normal University, Beijing; School of Traffic and Transportation, Beijing Jiaotong University
Abstract:Automatic text classification is indispensable for managing and analyzing huge amounts of data. Feature weighting is the basis of the text classification process, and a good feature weighting algorithm can significantly improve classification performance. This paper compares several feature weighting algorithms and, to address the shortcomings of earlier algorithms, proposes a feature weighting algorithm based on document class density (tf-idcd). In addition to the traditional term-frequency measure, the algorithm introduces a new concept, document class density, defined as the ratio of the number of documents in a class that contain the feature to the total number of documents in that class. Finally, five algorithms are compared experimentally on two common Chinese datasets. The experimental results show that the proposed algorithm achieves considerably higher macro-averaged F1 and micro-averaged F1 than the other feature weighting algorithms.
Keywords:feature weighting  document class density  text classification  SVM
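
The document class density described in the abstract can be illustrated with a minimal Python sketch. It assumes documents are already tokenized into lists of terms and each document has a single class label; the function name `document_class_density` and the toy corpus are hypothetical. The abstract does not state how tf-idcd combines this density with term frequency, so the sketch computes the density only and is not the authors' full weighting formula.

```python
from collections import Counter, defaultdict


def document_class_density(docs, labels):
    """Return {class: {term: density}}.

    density = (documents in the class that contain the term)
              / (total documents in the class)
    This follows the ratio described in the abstract; the final
    tf-idcd weight is NOT computed here because the abstract does
    not give its formula.
    """
    class_sizes = Counter(labels)       # total documents per class
    containing = defaultdict(Counter)   # per class: term -> document count
    for tokens, label in zip(docs, labels):
        # set() so a term is counted once per document
        containing[label].update(set(tokens))
    return {
        label: {term: n / class_sizes[label] for term, n in counts.items()}
        for label, counts in containing.items()
    }


# Hypothetical toy corpus, for illustration only.
docs = [
    ["stock", "market", "rise"],
    ["stock", "bank"],
    ["match", "goal", "stock"],
    ["match", "team"],
]
labels = ["finance", "finance", "sports", "sports"]

density = document_class_density(docs, labels)
print(density["finance"]["stock"])  # both finance documents contain "stock"
print(density["sports"]["stock"])   # one of two sports documents contains "stock"
```

In this toy corpus, "stock" appears in every finance document but in only half of the sports documents, so its density is higher in the finance class. That per-class contrast is the kind of signal the abstract attributes to document class density.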