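Research on Text Categorization Based on Improved TFIDF Algorithm (基于改进TFIDF算法的文本分类研究)
Cite this article: ZHENG Lin, XU De-hua. Research on Text Categorization Based on Improved TFIDF Algorithm [J]. Computer and Modernization (计算机与现代化), 2014(9): 6-9.
Authors: 郑霖 (ZHENG Lin)  徐德华 (XU De-hua)
Affiliation: School of Economics and Management, Tongji University (同济大学经济与管理学院), Shanghai 200092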
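Abstract: Text classification has broad applications in information retrieval, email filtering, web page classification, personalized recommendation, and other fields, so it has attracted wide attention from researchers since the concept was first proposed. Among the many methods researchers have applied, TFIDF is one of the most commonly used algorithms for computing document feature weights. The traditional TFIDF algorithm, however, ignores how feature terms are distributed within and across classes, so many terms with little discriminating power are assigned large weights. To address this shortcoming, this paper modifies the IDF calculation: the proportion of documents that contain a term, within a class and across classes, is used to account for the term's within-class and between-class distribution. In the experiments, text vectors are represented with the improved weighting algorithm, and the classification results verify the effectiveness of the improved algorithm.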

Keywords: TFIDF algorithm  feature selection  text classification

Research on Text Categorization Based on Improved TFIDF Algorithm
ZHENG Lin, XU De-hua. Research on Text Categorization Based on Improved TFIDF Algorithm [J]. Computer and Modernization, 2014(9): 6-9.
Authors: ZHENG Lin  XU De-hua
Affiliation: School of Economics and Management, Tongji University, Shanghai 200092, China
Abstract: Text categorization is widely applied in information retrieval, email filtering, Web page classification, personalized recommendation, and other fields, and it has attracted extensive attention from scholars since the concept was introduced. Scholars have adopted many methods in text classification research, and TFIDF is one of the most commonly used algorithms for calculating the weights of feature items. However, the traditional TFIDF algorithm ignores the distribution of feature items within and among classes, so many items with little discriminating power receive high weights. To address this weakness, this paper modifies the calculation of IDF by adding factors that reflect the distribution of feature items within and among classes. In the experiments, the improved TFIDF algorithm is applied to text categorization, and the classification results verify that the improved algorithm is effective.
Keywords: TFIDF algorithm  feature selection  text categorization
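
The abstract describes the idea behind the modified IDF but does not state the formula. As a rough illustration only, the Python sketch below contrasts the traditional IDF with one possible class-aware variant that compares a term's document share inside a class with its share in the other classes. The function names, the `smoothing` constant, the toy corpus, and the ratio-based formula are all assumptions made for this example; they are not taken from the paper.

```python
import math

# Toy labeled corpus: (class label, tokenized document). Purely illustrative.
DOCS = [
    ("sports", ["match", "team", "win"]),
    ("sports", ["team", "score", "the"]),
    ("finance", ["stock", "market", "the"]),
    ("finance", ["bank", "stock", "rise"]),
]

def doc_share(term, docs):
    """Fraction of the given documents that contain the term (0.0 if empty)."""
    if not docs:
        return 0.0
    return sum(term in tokens for tokens in docs) / len(docs)

def classic_idf(term, corpus):
    """Traditional IDF, log(N / df). Class labels are ignored entirely."""
    df = sum(term in tokens for _, tokens in corpus)
    return math.log(len(corpus) / df) if df else 0.0

def class_aware_idf(term, cls, corpus, smoothing=0.01):
    """Hypothetical variant (NOT the paper's formula): compare the term's
    document share inside `cls` with its share in all other classes."""
    inside = [tokens for label, tokens in corpus if label == cls]
    outside = [tokens for label, tokens in corpus if label != cls]
    p_in = doc_share(term, inside)
    p_out = doc_share(term, outside)
    # `smoothing` (an assumed constant) avoids division by zero when the
    # term never appears outside the class.
    return math.log(1.0 + p_in / (p_out + smoothing))

def weight(term, tokens, cls, corpus):
    """Term frequency times the class-aware IDF for one term in one document."""
    tf = tokens.count(term) / len(tokens) if tokens else 0.0
    return tf * class_aware_idf(term, cls, corpus)

# Compare the two IDF values for a class-specific term and a common term.
for term in ("team", "the"):
    print(term,
          round(classic_idf(term, DOCS), 3),
          round(class_aware_idf(term, "sports", DOCS), 3))
```

In this toy corpus, "team" and "the" each occur in two of the four documents, so the traditional IDF cannot tell them apart. The class-aware sketch assigns "team" a larger value for the "sports" class because it appears only in sports documents, which is the kind of within-class and between-class distinction the abstract says the traditional algorithm misses.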
This article is indexed in CNKI, VIP (维普), and other databases.