Text Classification Method Based on TF-IDF and Cosine Similarity

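Cite this article:	WU Yongliang, ZHAO Shuliang, LI Changjing, WEI Nadi, WANG Ziyan. Text Classification Method Based on TF-IDF and Cosine Similarity[J]. Journal of Chinese Information Processing, 2017, 31(5): 138-145.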

Authors:	WU Yongliang, ZHAO Shuliang, LI Changjing, WEI Nadi, WANG Ziyan

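Affiliations:	1. College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang, Hebei 050024, China; 2. Hebei Key Laboratory of Computational Mathematics and Applications, Shijiazhuang, Hebei 050024, China; 3. Huihua College of Hebei Normal University, Shijiazhuang, Hebei 050091, China; 4. College of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230022, China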

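Funding:	National Natural Science Foundation of China (71271067); Major Program of the National Social Science Foundation of China (13&ZD091); Science and Technology Research Project of Hebei Higher Education Institutions (QN2014196); Hebei Province Science and Technology Plan Project (15210403D)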

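Abstract:	Text classification is a basic task in text processing. The arrival of the big data era poses new challenges for text classification. Researchers have proposed many text classification algorithms for different situations, such as KNN, naive Bayes, support vector machines, and a series of improved algorithms. The performance of these algorithms depends on a fixed data set, and they have no self-learning capability. This paper proposes a new text classification method consisting of three steps: extracting category keywords with the TF-IDF method; classifying texts by the similarity between the category keywords and the keywords of the text to be classified; and updating the category keywords during classification to improve classifier performance. Simulation results show that the proposed method is considerably more accurate than commonly used methods, reaching a classification accuracy of 90% on the experimental data set and up to 95% when the volume of text data is large. The algorithm needs a certain amount of training samples and training time when first used, but its classification time can drop to one tenth of that of other algorithms. The method has a self-learning module that automatically updates the category keywords from classification experience during classification, which maintains classifier accuracy and makes the method highly practical.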
Keywords:	text classification; big data; TF-IDF; cosine similarity; category keywords
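
For reference, the two quantities named in the title have the following standard textbook forms. The abstract does not say which TF-IDF weighting variant the authors use, so the unsmoothed form below is an assumption, and the symbols (t, c, N, n, df, a, b) are introduced here for illustration only.

```latex
% n_{t,c}: number of occurrences of term t in document (or category) c
% N: number of documents (or categories); df(t): how many of them contain t
\[
\mathrm{tf}(t, c) = \frac{n_{t,c}}{\sum_{t'} n_{t',c}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, c) = \mathrm{tf}(t, c) \cdot \mathrm{idf}(t)
\]

% Cosine similarity between two term-weight vectors a and b
\[
\cos(\mathbf{a}, \mathbf{b}) =
\frac{\sum_{t} a_t \, b_t}{\sqrt{\sum_{t} a_t^{2}} \; \sqrt{\sum_{t} b_t^{2}}}
\]
```

A term that occurs often in one category but in few others receives a high TF-IDF weight, which is what makes it a plausible category keyword.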
Text Classification Method Based on TF-IDF and Cosine Similarity

WU Yongliang, ZHAO Shuliang, LI Changjing, WEI Nadi, WANG Ziyan. Text Classification Method Based on TF-IDF and Cosine Similarity[J]. Journal of Chinese Information Processing, 2017, 31(5): 138-145.

Authors:	WU Yongliang, ZHAO Shuliang, LI Changjing, WEI Nadi, WANG Ziyan

Affiliation:	1. College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang, Hebei 050024, China; 2. Hebei Key Laboratory of Computational Mathematics and Applications, Shijiazhuang, Hebei 050024, China; 3. Huihua College of Hebei Normal University, Shijiazhuang, Hebei 050091, China; 4. College of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230022, China

Abstract:	Text classification is a fundamental task in text mining. Many text classification algorithms have been presented in the literature, such as KNN, Naive Bayes, Support Vector Machine, and various improved algorithms. The performance of these algorithms depends on the data set, and they have no self-learning function. This paper proposes an effective approach for text classification. The three key points of the approach are: 1) extracting the keywords of category (KWC) from labeled texts with the TF-IDF approach, 2) classifying unlabeled text by the relevancy between each category and the unlabeled text, and 3) improving the performance of the approach by updating the KWC during classification. Simulation results show that the new approach raises text classification accuracy to 90%, and up to 95% when the data volume is large enough. The method automatically updates the keywords of category to improve the classification accuracy of the classifier.

Keywords:	text classification; big data; TF-IDF; cosine similarity; category keywords
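
The three steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no algorithmic detail, so the class name `KeywordClassifier`, the whitespace tokenizer, the smoothed IDF formula, the `top_k` cutoff, and the update rule (folding each classified text into the predicted category's term counts) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization. Chinese text would need a word segmenter.
    return text.lower().split()


class KeywordClassifier:
    """Hypothetical sketch: category keywords via TF-IDF, matching via cosine."""

    def __init__(self, top_k=20):
        self.top_k = top_k
        self.counts = {}    # category -> Counter of term counts
        self.keywords = {}  # category -> {term: TF-IDF weight}

    def fit(self, labeled_texts):
        # Step 1: pool the labeled texts per category and extract keywords.
        for text, category in labeled_texts:
            self.counts.setdefault(category, Counter()).update(tokenize(text))
        self._rebuild_keywords()

    def _rebuild_keywords(self):
        n = len(self.counts)
        df = Counter()  # number of categories in which each term occurs
        for counter in self.counts.values():
            df.update(counter.keys())
        self.keywords = {}
        for category, counter in self.counts.items():
            total = sum(counter.values())
            # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
            weights = {
                term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1.0)
                for term, count in counter.items()
            }
            top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
            self.keywords[category] = dict(top[: self.top_k])

    @staticmethod
    def cosine(a, b):
        # Cosine similarity between two sparse vectors stored as dicts.
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def classify(self, text, update=True):
        # Step 2: pick the category whose keyword vector is closest to the text.
        tokens = tokenize(text)
        vector = dict(Counter(tokens))
        best = max(self.keywords, key=lambda c: self.cosine(vector, self.keywords[c]))
        if update:
            # Step 3 (assumed rule): fold the text into the predicted category.
            self.counts[best].update(tokens)
            self._rebuild_keywords()
        return best


if __name__ == "__main__":
    clf = KeywordClassifier(top_k=10)
    clf.fit([
        ("the team won the football match", "sports"),
        ("the bank raised interest rates", "finance"),
    ])
    print(clf.classify("the football team scored"))
```

Updating on every prediction means a misclassified text also feeds back into the keywords, so a practical system would likely gate the update on a similarity threshold; the abstract does not say how the authors handle this.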
