
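A Text Classification Method Based on Class Distribution Difference and VPRS Feature Selection
Cite this article: Wu Di, Zhang Ya-ping, Yin Fu-liang, Li Ming. A Text Classification Method Based on Class Distribution Difference and VPRS Feature Selection[J]. Journal of Electronics & Information Technology, 2007, 29(12): 2880-2884.
Authors: Wu Di  Zhang Ya-ping  Yin Fu-liang  Li Ming
Affiliations: 1. Department of Computer Science, Dalian University of Technology, Dalian 116024; Shenyang Aircraft Design & Research Institute, China Aviation Industry Corporation I, Shenyang 110035
2. Department of Computer Science, Dalian University of Technology, Dalian 116024
3. Shenyang Aircraft Design & Research Institute, China Aviation Industry Corporation I, Shenyang 110035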
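Abstract: Weight calculation and feature dimensionality reduction are two important steps that affect the accuracy and efficiency of text classification. This paper first filters features according to the difference in each feature term's distribution across categories. It then analyzes the shortcomings of the traditional TF-IDF weighting formula and adopts an improved weighting formula, abbreviated TF-CDF, which is used to compute the weight of each feature term and to generate the vector space model (VSM) of the document collection. Next, a feature selection method based on variable precision rough set (VPRS) theory is proposed to further select the features that contribute most to classification, and it is implemented in SQL. Finally, experiments are carried out with the LibSVM support vector machine classifier; the results show that the feature filtering and selection methods and the TF-CDF weighting formula help improve classification accuracy and efficiency.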

Keywords: Text classification  Feature filtering  Weight calculation  Feature selection  Variable precision rough set
Article ID: 1009-5896(2007)12-2880-05
Received: 2006-12-28
Revised: 2007-07-01

Feature Selection Based on Class Distribution Difference and VPRS for Text Classification
Wu Di, Zhang Ya-ping, Yin Fu-liang, Li Ming. Feature Selection Based on Class Distribution Difference and VPRS for Text Classification[J]. Journal of Electronics & Information Technology, 2007, 29(12): 2880-2884.
Authors:Wu Di  Zhang Ya-ping  Yin Fu-liang  Li Ming
Affiliation: Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China; Shenyang Aircraft Design & Research Institute, China Aviation Industry Corporation I, Shenyang 110035, China
Abstract: Weight calculation and feature reduction are key preprocessing steps in text classification. First, features that are useless for classifying texts are filtered out according to the difference in each feature's document-frequency distribution across categories. Then, to overcome the limitations of the TF-IDF weighting formula, a new weighting formula called TF-CDF is presented; the weight of each feature is calculated with TF-CDF, and a Vector Space Model (VSM) is built for the entire corpus. To select significant features, a feature selection approach based on the Variable Precision Rough Set (VPRS) model is also proposed and implemented in SQL statements, combining the definitions of VPRS with the advantages of SQL. Finally, experiments with different weighting formulas and feature selection methods are conducted using LibSVM as the text classifier. The experimental results show that the proposed feature filtering, weighting formula, and feature selection method improve the performance of text classification.
Keywords:Text classification  Feature filtering  Weight calculating  Feature selection  Variable Precision Rough Set
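
The abstract does not give the VPRS selection procedure itself, so the following is only a minimal sketch of the general idea, using the standard VPRS notion of a beta-positive region. It is written in Python rather than the SQL used in the paper. The function names `beta_dependency` and `greedy_select`, the threshold `beta=0.8`, the greedy forward search, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict


def beta_dependency(rows, labels, attrs, beta=0.8):
    """Share of documents lying in the beta-positive region.

    Documents are grouped into blocks by their values on `attrs`
    (discretized features). A block counts as positive when a single
    class covers at least `beta` of it (assumed 0.5 < beta <= 1).
    """
    class_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        class_counts[key][label] += 1
    positive = 0
    for counts in class_counts.values():
        size = sum(counts.values())
        if max(counts.values()) / size >= beta:
            positive += size
    return positive / len(rows)


def greedy_select(rows, labels, beta=0.8):
    """Hypothetical forward search: repeatedly add the feature that
    raises the beta-dependency most; stop when nothing improves it."""
    remaining = list(range(len(rows[0])))
    selected, best = [], 0.0
    while remaining:
        gains = [(beta_dependency(rows, labels, selected + [a], beta), a)
                 for a in remaining]
        score = max(g for g, _ in gains)
        if score <= best:
            break
        attr = min(a for g, a in gains if g == score)  # lowest index on ties
        selected.append(attr)
        remaining.remove(attr)
        best = score
    return selected, best


# Toy corpus: three binary term-presence features per document.
rows = [(1, 0, 1), (1, 0, 0), (1, 1, 1), (0, 1, 1), (0, 1, 0), (0, 0, 1)]
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]
print(greedy_select(rows, labels))
```

In this toy data only feature 0 separates the two classes at `beta=0.8`, so the search keeps that feature alone. Lowering `beta` tolerates noisier blocks, which is the property that distinguishes VPRS from the classical rough set model.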
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.