
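An Improved Chinese Text Classification Algorithm Based on Multiple Feature Factors
Cite this article:YE Min,TANG Shiping,NIU Zhendong.An Improved Chinese Text Classification Algorithm Based on Multiple Feature Factors[J].Journal of Chinese Information Processing (中文信息学报),2017,31(4):132-137.
Authors:YE Min  TANG Shiping  NIU Zhendong
Affiliation:School of Computer Science, Beijing Institute of Technology, Beijing 100081
Abstract:Web page text is represented with the vector space model (VSM). Four feature factors (frequency, concentration, dispersion, and position information) are introduced into the CHI (Chi-Square) feature selection algorithm, and the TF-IDF weighting formula is improved by taking word length and position factors into account. The resulting PCHI-PTFIDF (promoted CHI-promoted TF-IDF) algorithm is proposed for Chinese text classification. The improved algorithm reduces dimensionality to a feature set with stronger classification ability and reflects the weight distribution of the feature terms more accurately. The results show that, compared with a text classification algorithm using traditional CHI and traditional TF-IDF, the PCHI-PTFIDF algorithm improves the macro-F1 value by 10% on average.

Keywords:text classification  χ2 statistic  feature selection  TF-IDF weight calculation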

An Improved Chinese Text Classification Algorithm Based On Multiple Feature Factors
YE Min,TANG Shiping,NIU Zhendong.An Improved Chinese Text Classification Algorithm Based On Multiple Feature Factors[J].Journal of Chinese Information Processing,2017,31(4):132-137.
Authors:YE Min  TANG Shiping  NIU Zhendong
Affiliation:Department of Computer Science, Beijing Institute of Technology, Beijing 100081, China
Abstract:Within the framework of the vector space model (VSM), a new PCHI-PTFIDF (promoted CHI-promoted TFIDF) method based on feature selection and weight calculation is proposed. First, the factors of frequency, concentration, dispersion and location are introduced into Chi-Square based feature selection. Then, the TF-IDF weight is optimized using the length and location factors of text terms. The proposed method reduces the feature dimensionality while yielding features with better classification ability, and produces a better estimate of the weight distribution. The experimental results show that, compared with the algorithm using the traditional CHI and traditional TFIDF, the PCHI-PTFIDF method achieves a 10% improvement in Macro-F1 on average.
Keywords:text classification  χ2 statistic  feature selection  TF-IDF feature weighting  
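
For orientation, the baseline that the abstract compares against (traditional CHI feature selection, traditional TF-IDF weighting, and the Macro-F1 metric) can be sketched as follows. This is a minimal illustration using the common textbook definitions, not the authors' implementation; the promoted PCHI and PTFIDF formulas are not stated in the abstract and are not reproduced here. The function names, the token-list input format, and the TF normalization by document length are assumptions made for this sketch.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, category):
    """Textbook chi-square score of `term` with respect to `category`.

    docs: list of token lists; labels: parallel list of category names.
    (Assumed input format; the abstract does not specify one.)
    """
    n = len(docs)
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def tf_idf(docs):
    """Textbook TF-IDF: (count / document length) * log(N / document frequency)."""
    n = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: (cnt / len(tokens)) * math.log(n / df[t])
                        for t, cnt in tf.items()})
    return weights


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 over the categories in y_true."""
    scores = []
    for cat in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cat and p == cat)
        fp = sum(1 for t, p in pairs if t != cat and p == cat)
        fn = sum(1 for t, p in pairs if t == cat and p != cat)
        denom = 2 * tp + fp + fn
        scores.append(0.0 if denom == 0 else 2 * tp / denom)
    return sum(scores) / len(scores)


# Toy corpus (hypothetical data, for illustration only).
docs = [["ball", "game"], ["ball", "team"], ["stock", "market"], ["stock", "game"]]
labels = ["sports", "sports", "finance", "finance"]
ball_score = chi_square(docs, labels, "ball", "sports")
game_score = chi_square(docs, labels, "game", "sports")
weights = tf_idf(docs)
```

In this toy corpus, "ball" appears only in the sports documents and so scores higher than "game", which is spread evenly across both categories. A selector in this style would keep the highest-scoring terms per category before computing weights.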