Compensation strategy of unseen feature words in naive Bayes text classification
Cite this article: PANG Xiu-li, FENG Yu-qiang, JIANG Wei. Compensation strategy of unseen feature words in naive Bayes text classification[J]. Journal of Harbin Institute of Technology, 2008, 40(6): 956-960.
Authors: PANG Xiu-li  FENG Yu-qiang  JIANG Wei
Affiliations: 1. School of Management, Harbin Institute of Technology, Harbin 150001, China
2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Funding: National Natural Science Foundation of China; Natural Science Foundation of Heilongjiang Province
Abstract: Naive Bayes classifiers applied to text classification often suffer from the problem of unseen (missing) feature words: because the distribution of word occurrences in a corpus follows Zipf's law, simply adding more training data cannot resolve the feature-word gaps caused by data sparseness. To address this, the data smoothing algorithms of statistical language modeling are introduced: a certain amount of probability is "discounted" from seen words and redistributed to unseen words, yielding a compensation probability for the missing feature words and thus mitigating the effect of data sparseness. On the evaluation data, in the open test with stop words removed, classification with the Good-Turing algorithm performs 3.05% better than with the Laplace rule and 1.00% better than with the Lidstone method. When feature words are selected by cross entropy, the naive Bayes classifier with Good-Turing smoothing outperforms the maximum entropy classifier by 1.95%. These data smoothing algorithms help overcome the unseen feature word problem caused by data sparseness.
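The discounting step described above can be written out concretely. As a sketch in standard notation (not necessarily the paper's exact formulation), let C(w,c) be the count of feature word w in class c, N_c the total token count of class c, |V| the vocabulary size, and n_r the number of word types seen exactly r times in class c. The Laplace, Lidstone, and Good-Turing estimates of the class-conditional probability are then:

P_{\mathrm{Laplace}}(w \mid c) = \frac{C(w,c)+1}{N_c+|V|}, \qquad
P_{\mathrm{Lidstone}}(w \mid c) = \frac{C(w,c)+\lambda}{N_c+\lambda|V|}

r^{*} = (r+1)\,\frac{n_{r+1}}{n_r}, \qquad
P_{\mathrm{GT}}(w \mid c) = \frac{r^{*}}{N_c} \quad (r = C(w,c) > 0), \qquad
\sum_{w:\,C(w,c)=0} P_{\mathrm{GT}}(w \mid c) = \frac{n_1}{N_c}

The total mass n_1/N_c freed by the Good-Turing discount is shared among the unseen feature words; this is the compensation probability assigned to words missing from a class.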

Keywords: text classification  naive Bayes classification  unseen feature words  data smoothing

Compensation strategy of unseen feature words in naive Bayes text classification
PANG Xiu-li, FENG Yu-qiang, JIANG Wei. Compensation strategy of unseen feature words in naive Bayes text classification[J]. Journal of Harbin Institute of Technology, 2008, 40(6): 956-960.
Authors: PANG Xiu-li  FENG Yu-qiang  JIANG Wei
Affiliation: 1. School of Management, Harbin Institute of Technology, Harbin 150001, China; 2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Abstract: When applied to text classification, naive Bayes often suffers from the unseen feature word problem. Because the distribution of words in a corpus follows Zipf's law, the underlying data sparseness can hardly be resolved simply by expanding the corpus. Inspired by statistical language modeling, a novel approach is proposed that applies smoothing algorithms to naive Bayes text classification to overcome the unseen feature word problem. The experimental corpora come from the National 863 Evaluation on text classification. In the open test with stop words removed, naive Bayes with the Good-Turing algorithm performs 3.05% better than with Laplace and 1.00% better than with Lidstone. In the experiment where feature words are extracted by cross entropy, naive Bayes with the Good-Turing algorithm even outperforms the maximum entropy model by 1.95%. The smoothing algorithm is thus helpful for solving the unseen feature word problem caused by sparse data.
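For concreteness, the following is a minimal Python sketch of a multinomial naive Bayes classifier with a simple Good-Turing discount for unseen feature words. It is an illustration under assumed data formats and a toy corpus, not the authors' implementation or evaluation setup.

# Minimal sketch (not the paper's implementation): multinomial naive Bayes with
# Good-Turing discounting for unseen feature words. The corpus format, class
# labels and tokenization are illustrative assumptions.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (token_list, label). Returns per-class word counts and document counts."""
    counts = defaultdict(Counter)   # counts[label][word] = frequency of word in class
    priors = Counter()              # number of training documents per class
    for tokens, label in docs:
        counts[label].update(tokens)
        priors[label] += 1
    return counts, priors

def good_turing_logprob(word, word_counts, vocab_size):
    """Good-Turing estimate of log P(word | class).

    Seen words get the discounted count r* = (r + 1) * n_{r+1} / n_r;
    the mass n_1 / N released by the discount is shared equally among
    the unseen word types of this class."""
    n_r = Counter(word_counts.values())          # n_r = number of word types seen exactly r times
    total = sum(word_counts.values())            # N = total tokens observed in this class
    r = word_counts.get(word, 0)
    if r == 0:
        unseen_types = max(vocab_size - len(word_counts), 1)
        p = n_r[1] / (total * unseen_types)      # redistribute the n_1 / N mass to unseen words
    else:
        # fall back to the raw count when n_{r+1} is empty (very sparse high counts)
        r_star = (r + 1) * n_r[r + 1] / n_r[r] if n_r[r + 1] else r
        p = r_star / total
    return math.log(max(p, 1e-12))               # clamp to avoid log(0)

def classify(tokens, counts, priors, vocab_size):
    total_docs = sum(priors.values())
    best, best_score = None, float("-inf")
    for label, word_counts in counts.items():
        score = math.log(priors[label] / total_docs)
        score += sum(good_turing_logprob(w, word_counts, vocab_size) for w in tokens)
        if score > best_score:
            best, best_score = label, score
    return best

if __name__ == "__main__":
    train_docs = [
        ("stock market share price rise".split(), "finance"),
        ("bank loan interest rate market".split(), "finance"),
        ("match team goal player score".split(), "sports"),
        ("coach team win season player".split(), "sports"),
    ]
    counts, priors = train(train_docs)
    vocab = {w for tokens, _ in train_docs for w in tokens}
    print(classify("team price goal".split(), counts, priors, len(vocab)))

With Laplace or Lidstone smoothing the unseen words of every class receive the same additive pseudo-count, whereas the Good-Turing discount ties the compensation probability to how many rare words each class actually contains, which is the behavior the abstract credits for the performance gain.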
Keywords:text classification  naive Bayes classification  unseen feature words  data smoothing