
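A Feature Selection Method Based on Maximal Marginal Relevance
Cite this article: Liu He, Zhang Xianghong, Liu Dayou, Li Yanjun, Yin Lijun. A Feature Selection Method Based on Maximal Marginal Relevance[J]. Journal of Computer Research and Development, 2012, 49(2): 354-360.
Authors: Liu He  Zhang Xianghong  Liu Dayou  Li Yanjun  Yin Lijun
Affiliations: 1. College of Computer Science and Technology, Jilin University, Changchun 130012; Quartermaster Equipment Institute of the General Logistics Department of the CPLA, Beijing 100010
2. Quartermaster Equipment Institute of the General Logistics Department of the CPLA, Beijing 100010
3. College of Computer Science and Technology, Jilin University, Changchun 130012
Funding: National Natural Science Foundation of China; Fundamental Research Funds for the Central Universities (Jilin University)
Abstract: Text categorization is characterized by a high-dimensional feature space and a high degree of feature redundancy. To address these two characteristics, the χ² statistic is used to handle the high-dimensional feature space, and the idea of information novelty is used to handle the feature redundancy. Following the definition of maximal marginal relevance, the two are combined into a feature selection method based on maximal marginal relevance. The method can eliminate a large number of redundant features during feature selection. Experiments were carried out on two text data sets, Reuters-21578 Top10 and OHSCAL. The results show that the proposed method is more efficient than the χ² statistic and information gain, and that it improves the performance of three different classifiers: naive Bayes, Rocchio, and kNN.

Keywords: text categorization  feature selection  maximal marginal relevance  CHI  information novelty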

A Feature Selection Method Based on Maximal Marginal Relevance
Liu He, Zhang Xianghong, Liu Dayou, Li Yanjun, Yin Lijun. A Feature Selection Method Based on Maximal Marginal Relevance[J]. Journal of Computer Research and Development, 2012, 49(2): 354-360.
Authors:Liu He  Zhang Xianghong  Liu Dayou  Li Yanjun  Yin Lijun
Affiliation: 1 (College of Computer Science and Technology, Jilin University, Changchun 130012); 2 (Quartermaster Equipment Institute of General Logistics Department of CPLA, Beijing 100010)
Abstract: With the rapid growth of textual information on the Internet, text categorization has become one of the key research directions in data mining. Text categorization is a supervised learning process, defined as automatically assigning free text to one or more predefined categories. At present, text categorization is necessary for managing textual information and has been applied in many fields. However, text categorization has two characteristics: high dimensionality of the feature space and a high level of feature redundancy. To address them, χ² is used to deal with the high dimensionality of the feature space, and information novelty is used to deal with the high level of feature redundancy. According to the definition of maximal marginal relevance, a feature selection method based on maximal marginal relevance is proposed, which can reduce redundancy between features in the process of feature selection. Experiments are carried out on two text data sets, namely Reuters-21578 Top10 and OHSCAL. The results indicate that the feature selection method based on maximal marginal relevance is more efficient than χ² and information gain. Moreover, it can improve the performance of three different categorizers, namely naive Bayes, Rocchio, and kNN.
Keywords:text categorization  feature selection  maximal marginal relevance  CHI  information novelty
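
The abstract describes the idea only in outline, so the following is a minimal sketch of how a χ² relevance score and a redundancy penalty might be combined in a greedy, MMR-style selection loop. It is not the authors' implementation. Several pieces are assumptions made purely for illustration: the function names (`chi2_score`, `cosine`, `mmr_select`), the trade-off weight `lam`, the normalization of χ² by its maximum, and the use of cosine similarity between term-presence columns as a stand-in for the paper's information novelty measure, which the abstract does not define.

```python
import math

def chi2_score(column, labels):
    """Chi-square statistic of a binary term-presence column against binary class labels."""
    n = len(labels)
    a = sum(1 for x, y in zip(column, labels) if x and y)          # term present, in class
    b = sum(1 for x, y in zip(column, labels) if x and not y)      # term present, not in class
    c = sum(1 for x, y in zip(column, labels) if not x and y)      # term absent, in class
    d = n - a - b - c                                              # term absent, not in class
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def cosine(u, v):
    """Cosine similarity; an assumed stand-in for the paper's redundancy measure."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def mmr_select(columns, labels, k, lam=0.5):
    """Greedily pick k feature indices, trading relevance against redundancy."""
    scores = [chi2_score(col, labels) for col in columns]
    top = max(scores) or 1.0                      # avoid dividing by zero
    relevance = [s / top for s in scores]         # assumed normalization to [0, 1]
    selected = []
    candidates = list(range(len(columns)))
    while candidates and len(selected) < k:
        def mmr(i):
            # Redundancy is the highest similarity to any feature already chosen.
            redundancy = max(
                (cosine(columns[i], columns[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data: six documents, three binary features; feature 1 duplicates feature 0.
labels = [1, 1, 1, 0, 0, 0]
features = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
]
print(mmr_select(features, labels, k=2, lam=0.5))  # -> [0, 2]
print(mmr_select(features, labels, k=2, lam=1.0))  # -> [0, 1]
```

With `lam=1.0` the loop reduces to plain χ² ranking and keeps the duplicated feature. With `lam=0.5` the duplicate is penalized and the less relevant but non-redundant feature is chosen instead, which is the kind of redundancy reduction the abstract attributes to the method.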
This article is indexed in CNKI, Wanfang Data, and other databases.