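A Feature Selection Method for Text Classification Integrating NER
Cite this article: SHI De-ming, LIN Yang-gang, CHEN En-hong. A feature selection method for text classification integrating NER[J]. Computer Engineering & Science, 2007, 29(11): 152-156.
Authors: SHI De-ming  LIN Yang-gang  CHEN En-hong
Affiliation: Department of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China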
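Abstract: Text classification is the task of automatically assigning free text to a set of predefined categories, and it plays an important role in information retrieval and related fields. Selecting effective text features is a key step that affects classifier performance. In many applications, the text to be processed contains many named entities, such as well-known figures in a particular industry, which can strongly influence the category a document belongs to. However, current text feature selection methods use only the statistical significance of keywords and do not consider the classification information that a keyword carries as a named entity. To address this problem, this paper proposes a method that integrates named entity recognition (NER) into feature selection for text classification: in addition to retaining the statistical features of keywords, it retains the classification features of words as named entities. Experimental results show that, compared with other feature selection methods, the proposed method improves classification accuracy to some extent.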

Keywords: named entity recognition  named entity  feature selection  text classification  hidden Markov model
Article ID: 1007-130X(2007)11-0152-05
Received: 2006-06-01
Revised: 2006-10-23

A NER-Based Feature Selection Method for Text Classification
SHI De-ming, LIN Yang-gang, CHEN En-hong. A NER-Based Feature Selection Method for Text Classification[J]. Computer Engineering & Science, 2007, 29(11): 152-156.
Authors: SHI De-ming  LIN Yang-gang  CHEN En-hong
Abstract: Text classification (TC) is the process of automatically assigning predefined categories to free-text documents, and it is important to information retrieval and other areas. A key step in TC is selecting features that effectively represent the class information of the original documents. In some TC applications, documents contain many named entities, e.g., organization names in specific areas, which may significantly influence how the documents are classified. In recent research, however, feature selection has focused mainly on the orthography of words, disregarding the information a word carries as a named entity. To solve this problem, this paper proposes a feature selection method for text classification based on named entity recognition (NER). The method uses the category information of a word as a named entity as well as its orthographic characteristics. In the experiments, the method improves classification accuracy compared with classic feature selection methods.
Keywords: named entity recognition (NER)  named entity  feature selection  text classification  hidden Markov model (HMM)
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.