
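A Feature Selection Method Based on Class Feature Domains for Text Categorization
Cite this article:ZHAO Shi-qi, ZHANG Yu, LIU Ting, CHEN Yi-heng, HUANG Yong-guang, LI Sheng. A Feature Selection Method Based on Class Feature Domains for Text Categorization[J]. Journal of Chinese Information Processing (中文信息学报), 2005, 19(6): 23-29.
Authors:ZHAO Shi-qi  ZHANG Yu  LIU Ting  CHEN Yi-heng  HUANG Yong-guang  LI Sheng
Affiliation:Information Retrieval Laboratory, Harbin Institute of Technology, Harbin, Heilongjiang 150001
Abstract:Feature selection is one of the key problems in text categorization, and noise and data sparseness are the main obstacles encountered during feature selection. This paper introduces a feature selection method based on class feature domains. The method first uses "combined feature extraction" [1] to remove noise from the original feature space and extract candidate features. Here, "combined feature extraction" means first using document frequency (DF) to discard some low-frequency words and then using mutual information to select the candidate features. Next, the method constructs a class feature domain for each class in the classification scheme, and merges and strengthens the candidate features that appear in the class feature domains, thereby addressing data sparseness. Experiments show that the new method clearly improves feature selection compared with various traditional methods and can significantly improve the performance of text categorization systems.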

Keywords:computer application  Chinese information processing  text categorization  feature selection  class feature domains
Article ID:1003-0077(2005)06-0021-07
Received:2004-11-24
Revised:2005-06-20

A Feature Selection Method Based on Class Feature Domains for Text Categorization
ZHAO Shi-qi, ZHANG Yu, LIU Ting, CHEN Yi-heng, HUANG Yong-guang, LI Sheng. A Feature Selection Method Based on Class Feature Domains for Text Categorization[J]. Journal of Chinese Information Processing, 2005, 19(6): 23-29.
Authors:ZHAO Shi-qi  ZHANG Yu  LIU Ting  CHEN Yi-heng  HUANG Yong-guang  LI Sheng
Affiliation:Information Retrieval Laboratory, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
Abstract:Feature selection is one of the key problems in text categorization. The chief obstacles to feature selection are noise and data sparseness. This paper presents a novel feature selection method based on class feature domains. First, the combined feature selection method [1] is used to remove noisy features from the original feature space and extract candidate features; that is, low-frequency words are removed using the Document Frequency (DF) method, and candidate features are then selected using the Mutual Information method. Next, a class feature domain is constructed for each class, and the sparseness of the training data is overcome by merging and strengthening the candidate features that appear in the class feature domains. Experiments show that the method clearly outperforms various traditional feature selection methods and markedly improves the performance of text categorization systems.
Keywords:computer application  Chinese information processing  text categorization  feature selection  class feature domains
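
The two-stage candidate extraction described in the abstract (a document-frequency cutoff followed by mutual-information ranking) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `combined_feature_extraction`, the parameters `min_df` and `top_k`, the pointwise form of mutual information, and the choice to score a term by its maximum over classes are all hypothetical, since the abstract does not give the exact formula or thresholds. The later step of building class feature domains and merging or strengthening features is not specified in the abstract and is therefore not shown.

```python
import math
from collections import Counter, defaultdict


def combined_feature_extraction(docs, labels, min_df=2, top_k=1000):
    """Sketch: DF filtering followed by mutual-information ranking.

    docs   -- list of token lists, one per document
    labels -- parallel list of class labels
    min_df -- assumed cutoff: minimum number of documents containing a term
    top_k  -- assumed number of candidate features to keep
    """
    n_docs = len(docs)
    df = Counter()                      # document frequency per term
    df_in_class = defaultdict(Counter)  # document frequency per (class, term)
    class_size = Counter(labels)        # number of documents per class

    for tokens, label in zip(docs, labels):
        for term in set(tokens):        # count each term once per document
            df[term] += 1
            df_in_class[label][term] += 1

    # Stage 1 (DF): discard terms that occur in fewer than min_df documents.
    kept = [term for term, count in df.items() if count >= min_df]

    # Stage 2 (MI): pointwise mutual information between a term and a class,
    # log( P(t, c) / (P(t) * P(c)) ), estimated from document counts.
    def mi(term, cls):
        joint = df_in_class[cls][term]
        if joint == 0:
            return float("-inf")
        return math.log(joint * n_docs / (df[term] * class_size[cls]))

    # Assumption: a term's score is its best MI over all classes.
    scores = {t: max(mi(t, c) for c in class_size) for t in kept}

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(kept, key=lambda t: (-scores[t], t))
    return ranked[:top_k]
```

Calling the function on a list of tokenized documents and their labels returns the retained terms in ranked order. Terms spread evenly across classes (such as function words) receive a score near zero and fall to the bottom of the ranking, while rare terms are removed by the DF stage before MI can over-reward them.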
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.