
A Research on Text Classification Method Based on Improved TF-IDF Algorithm
Cite this article:HE Ke-da,ZHU Zheng-tao,CHENG Yu.A Research on Text Classification Method Based on Improved TF-IDF Algorithm[J].Journal of Guangdong University of Technology,2016,33(5):49-53.
Authors:HE Ke-da  ZHU Zheng-tao  CHENG Yu
Affiliation:School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
Funding:National Natural Science Foundation of China (11204043)
Abstract:Establishing category keywords is the first problem to solve in text classification. Building on work that classifies text with category keywords and the TF-IDF algorithm, this paper proposes an improved TF-IDF algorithm. A category keyword library is first built, then expanded and deduplicated, which overcomes the vector space model's inability to adjust weights well. A document-length weight is then added to correct the weight of each keyword in a document, which effectively addresses the insufficient class-discriminating ability of the original feature terms. Experiments with a Bayesian classifier verify the effectiveness of the algorithm and show improved text classification accuracy.
Keywords:keyword extraction  feature selection  text classification  preprocessing
Received:2015-09-22
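
The pipeline described in the abstract (category keywords, TF-IDF weights corrected by document length, Bayesian classification) can be illustrated with a small Python sketch. This is not the authors' algorithm: the abstract does not give the length correction formula, so the sketch assumes the simplest form, dividing each keyword count by the document's token count. The smoothed IDF, the multinomial Naive Bayes over fractional weights, and all function names (`fit_idf`, `length_weighted_tfidf`, `train_nb`, `predict_nb`) are likewise assumptions made for illustration, as is the toy data.

```python
import math
from collections import Counter


def fit_idf(docs, keywords):
    """Smoothed IDF for each category keyword over tokenized training docs."""
    n_docs = len(docs)
    idf = {}
    for kw in keywords:
        doc_freq = sum(1 for doc in docs if kw in doc)
        idf[kw] = math.log((n_docs + 1) / (doc_freq + 1)) + 1.0
    return idf


def length_weighted_tfidf(doc, idf):
    """Keyword weights for one tokenized document.

    ASSUMPTION: the length correction is count / document length; the
    paper's actual correction term is not stated in the abstract.
    """
    counts = Counter(doc)
    length = max(len(doc), 1)  # guard against empty documents
    return {kw: (counts[kw] / length) * idf[kw] for kw in idf}


def train_nb(vectors, labels, alpha=1.0):
    """Multinomial Naive Bayes on fractional weights with additive smoothing."""
    keywords = list(vectors[0])
    prior, loglik = {}, {}
    for cls in sorted(set(labels)):
        prior[cls] = math.log(labels.count(cls) / len(labels))
        totals = {kw: alpha for kw in keywords}
        for vec, label in zip(vectors, labels):
            if label == cls:
                for kw in keywords:
                    totals[kw] += vec[kw]
        norm = sum(totals.values())
        loglik[cls] = {kw: math.log(totals[kw] / norm) for kw in keywords}
    return prior, loglik


def predict_nb(vec, prior, loglik):
    """Return the class with the highest log posterior score."""
    scores = {
        cls: prior[cls] + sum(w * loglik[cls][kw] for kw, w in vec.items())
        for cls in prior
    }
    return max(scores, key=scores.get)


# Toy data (hypothetical): tokenized documents and a tiny keyword library.
train_docs = [
    ["team", "wins", "match", "goal"],
    ["goal", "match", "player"],
    ["stock", "market", "bank"],
    ["bank", "rate", "stock", "fund"],
]
train_labels = ["sports", "sports", "finance", "finance"]
category_keywords = ["match", "goal", "stock", "bank"]

idf = fit_idf(train_docs, category_keywords)
vectors = [length_weighted_tfidf(doc, idf) for doc in train_docs]
prior, loglik = train_nb(vectors, train_labels)
print(predict_nb(length_weighted_tfidf(["bank", "stock", "news"], idf), prior, loglik))
```

Restricting the vocabulary to the keyword library keeps the feature space small, and dividing by document length keeps a long document from outweighing a short one merely because it repeats a keyword more often.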