
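A Text Classification Method Based on Term Frequency Classifier Ensemble
Cite this article: Jiang Yuan, Zhou Zhihua. A Text Classification Method Based on Term Frequency Classifier Ensemble[J]. Journal of Computer Research and Development, 2006, 43(10): 1681-1687.
Authors: Jiang Yuan, Zhou Zhihua
Affiliations: 1. National Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093
2. Department of Computer Science and Technology, Nanjing University, Nanjing 210093
Funding: National Natural Science Foundation of China; Natural Science Foundation of Jiangsu Province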
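Abstract: This paper proposes a text classification method based on an ensemble of term frequency classifiers. A term frequency classifier is a simple classifier obtained by counting the words in the texts and the frequency with which each word occurs in each text. Although a term frequency classifier by itself generalizes weakly, its computational cost is low and it is easy to update when training samples, or even classes, are added, while the generalization ability of the overall learning system can be raised by the ensemble learning mechanism. Term frequency classifiers are therefore well suited to serve as base classifiers for ensemble learning. The ensemble is built with an improved AdaBoost algorithm that adds a forced weight redistribution mechanism to keep the algorithm from stopping prematurely, which makes it better suited to text classification. Experimental results on the standard corpus Reuters-21578 show that the method achieves good results.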

Keywords: text classification; machine learning; ensemble learning; term frequency classifier
Received: 2006-04-29
Revised: 2006-04-29; 2006-05-29
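
The abstract describes a term frequency classifier only in outline: a simple classifier built from word counts that is cheap to compute and easy to update when samples or classes are added. The following Python sketch illustrates that idea under stated assumptions. The class name `TermFrequencyClassifier`, its methods, and the scoring rule (sum of per-class relative term frequencies) are hypothetical choices made for illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict


class TermFrequencyClassifier:
    """Hypothetical count-based classifier; not the paper's exact formulation."""

    def __init__(self):
        # class label -> Counter of (optionally weighted) term counts
        self.term_counts = defaultdict(Counter)

    def update(self, tokens, label, weight=1.0):
        """Add one labelled document; an unseen label simply creates a new class."""
        for term, count in Counter(tokens).items():
            self.term_counts[label][term] += weight * count

    def predict(self, tokens):
        """Return the class whose relative term frequencies best match the document."""
        doc = Counter(tokens)
        best_label, best_score = None, float("-inf")
        for label, counts in self.term_counts.items():
            total = sum(counts.values())
            # Assumed scoring rule: sum of the class's relative frequencies
            # for every term occurrence in the document.
            score = sum(n * counts[t] / total for t, n in doc.items()) if total else 0.0
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = TermFrequencyClassifier()
clf.update("wheat corn harvest wheat".split(), "grain")
clf.update("bank rate interest rate".split(), "money")
first = clf.predict("wheat harvest".split())

# Adding a class later needs no retraining: only the new counts are stored.
clf.update("oil crude barrel oil".split(), "crude")
second = clf.predict("crude oil".split())
print(first, second)
```

Because the model is just a table of counts, an update touches only the entries for the new document's terms, which is one plausible reading of why the abstract calls such classifiers easy to update.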

A Text Classification Method Based on Term Frequency Classifier Ensemble
Jiang Yuan, Zhou Zhihua. A Text Classification Method Based on Term Frequency Classifier Ensemble[J]. Journal of Computer Research and Development, 2006, 43(10): 1681-1687.
Authors: Jiang Yuan, Zhou Zhihua
Affiliation: 1. National Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093; 2. Department of Computer Science and Technology, Nanjing University, Nanjing 210093
Abstract: This paper proposes a text classification method based on an ensemble of term frequency classifiers. A term frequency classifier is a simple classifier obtained by computing the frequencies of terms in the texts of the corpus. Although its generalization ability is not strong, it is a suitable base learner for an ensemble because of its low computational cost, its ease of updating when new samples and classes arrive, and the fact that its generalization can be improved through ensemble learning. An improved AdaBoost algorithm is used to build the ensemble; it employs a forced weight-updating scheme to avoid stopping early, which makes it more suitable for text classification. Experimental results on the Reuters-21578 corpus show that the proposed method achieves good performance on text classification tasks.
Keywords: AdaBoost
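
The abstract states that the improved AdaBoost forces a weight update to avoid stopping early, but it does not say how. The sketch below shows one possible reading in Python. Standard AdaBoost.M1 aborts when a base learner's weighted error reaches 0.5; here, as an assumption, the loop instead draws a fresh random weight distribution and continues, and it does the same when the error is 0. The functions `train_tf`, `boost`, and `classify`, the random redraw, and the error floor `1e-10` are all hypothetical and not taken from the paper.

```python
import math
import random
from collections import Counter, defaultdict


def train_tf(docs, labels, weights):
    """Weighted term-count base learner (assumed stand-in for the paper's classifier)."""
    counts = defaultdict(Counter)
    for tokens, label, w in zip(docs, labels, weights):
        for term in tokens:
            counts[label][term] += w

    def predict(tokens):
        def score(label):
            total = sum(counts[label].values())
            return sum(counts[label][t] for t in tokens) / total if total else 0.0
        # sorted() makes tie-breaking deterministic
        return max(sorted(counts), key=score)

    return predict


def boost(docs, labels, rounds=10, seed=0):
    """AdaBoost.M1-style loop with an assumed forced weight redistribution."""
    rng = random.Random(seed)
    n = len(docs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (vote weight, predict function)
    for _ in range(rounds):
        h = train_tf(docs, labels, weights)
        wrong = [h(d) != y for d, y in zip(docs, labels)]
        err = sum(w for w, bad in zip(weights, wrong) if bad)
        if err < 0.5:
            e = max(err, 1e-10)  # floor keeps the vote weight finite when err == 0
            beta = e / (1 - e)
            ensemble.append((math.log(1 / beta), h))
        if err == 0 or err >= 0.5:
            # AdaBoost.M1 would stop here; assumed variant: redraw weights, go on.
            raw = [rng.random() + 1e-3 for _ in range(n)]
            total = sum(raw)
            weights = [r / total for r in raw]
            continue
        # Usual update: shrink the weights of correctly classified samples.
        weights = [w * (1.0 if bad else beta) for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def classify(ensemble, tokens):
    """Weighted vote over the base learners; None if the ensemble is empty."""
    votes = defaultdict(float)
    for alpha, h in ensemble:
        votes[h(tokens)] += alpha
    return max(sorted(votes), key=votes.get) if votes else None


docs = [d.split() for d in ("wheat corn harvest", "corn wheat export",
                            "bank interest rate", "rate bank loan")]
labels = ["grain", "grain", "money", "money"]
ensemble = boost(docs, labels, rounds=5)
print(classify(ensemble, ["wheat", "export"]), classify(ensemble, ["bank", "loan"]))
```

On this toy data every base learner is already perfect, so each round takes the redistribution branch. On a real corpus such as Reuters-21578 the redraw would matter mainly when a weak base learner's weighted error reaches 0.5.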
This article is indexed in databases including CNKI, VIP, and Wanfang Data.