
Text Feature Selection and Text Representation for Short Essays
Cite this article: MA Jian-hong, LIU Guang-sen, YAO Shuang, YANG Zhi. Text Feature Selection and Text Representation for Short Essays[J]. Computer and Modernization, 2019, 0(3): 95. DOI: 10.3969/j.issn.1006-2475.2019.03.018
Authors: MA Jian-hong  LIU Guang-sen  YAO Shuang  YANG Zhi
Affiliation (all authors): School of Computer Science and Software, Hebei University of Technology, Tianjin 300401, China
Funding: Service procurement project for building the computer-aided innovation design public service platform, China Science and Technology Consulting Service Center (HSZT2015FD/254)
Abstract: Short texts are sparse, real-time, and non-standard, which causes problems in text feature selection and text representation and in turn lowers text classification accuracy. To address the high feature dimensionality (the curse of dimensionality) in feature selection, a two-stage text feature selection algorithm is proposed. In the first stage, five indicators (a balance factor, frequency, concentration, part of speech, and the position of the word in the text) are introduced into the mutual information computation. In the second stage, the top-ranked features are used to initialize a genetic algorithm, which is then trained to obtain the optimal feature set. Because TFIDF is computed over the whole corpus and does not account for uneven distribution between classes, variance is introduced into the IDF formula, and the improved TFIDF is used to weight Word2Vec word vectors to represent each text. The improved algorithms are evaluated on a manually constructed corpus of encyclopedic short texts. Experiments show that the improved feature selection and text representation algorithms raise classification performance by 2% to 5%.

Keywords: text feature selection  text representation  genetic algorithm  text classification
Received: 2019-04-10
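
Note: the abstract does not give the paper's formulas, so the following is only a minimal sketch of the two general ideas it names, not the authors' method. It ranks terms by plain mutual information (without the five extra indicators and without the genetic algorithm stage), then builds a text vector as a TFIDF-weighted average of word vectors. The function names, the way variance enters the IDF (a multiplier of one plus the variance of a term's per-class document rates), and the toy two-dimensional vectors standing in for Word2Vec output are all assumptions made for illustration.

```python
import math
from collections import Counter


def count_terms(docs, labels):
    """Return per-term and per-(term, class) document frequencies."""
    df, joint = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per text
            df[term] += 1
            joint[(term, label)] += 1
    return df, joint


def mutual_information(docs, labels):
    """Score each term by its largest pointwise mutual information with any class."""
    n, class_size = len(docs), Counter(labels)
    df, joint = count_terms(docs, labels)
    scores = {}
    for term in df:
        scores[term] = max(
            math.log(joint[(term, c)] * n / (df[term] * class_size[c]))
            for c in class_size
            if joint[(term, c)] > 0
        )
    return scores


def variance_weighted_idf(docs, labels):
    """Assumed form: smoothed IDF scaled by (1 + variance of per-class document rates)."""
    n, class_size = len(docs), Counter(labels)
    df, joint = count_terms(docs, labels)
    idf = {}
    for term in df:
        rates = [joint[(term, c)] / class_size[c] for c in class_size]
        mean = sum(rates) / len(rates)
        variance = sum((r - mean) ** 2 for r in rates) / len(rates)
        idf[term] = math.log(1 + n / df[term]) * (1 + variance)
    return idf


def text_vector(tokens, idf, word_vectors):
    """Average the word vectors of known tokens, weighted by TF times IDF."""
    known = [t for t in tokens if t in idf and t in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for term, count in Counter(known).items():
        weight = (count / len(tokens)) * idf[term]
        weight_sum += weight
        for i, x in enumerate(word_vectors[term]):
            total[i] += weight * x
    return [x / weight_sum for x in total] if weight_sum else total


# Toy corpus: "good" occurs in both classes, every other term in only one.
docs = [["fast", "car", "good"], ["fast", "engine"],
        ["sweet", "cake", "good"], ["sweet", "fruit"]]
labels = ["auto", "auto", "food", "food"]

scores = mutual_information(docs, labels)
top_terms = sorted(scores, key=scores.get, reverse=True)
idf = variance_weighted_idf(docs, labels)
toy_vectors = {"fast": [1.0, 0.0], "good": [0.0, 1.0]}  # stand-in for Word2Vec
vec = text_vector(["fast", "good"], idf, toy_vectors)
print(top_terms[-1], vec)
```

In this toy corpus the class-neutral term "good" gets the lowest mutual information score and a smaller IDF than the class-specific term "fast", so the resulting text vector leans toward the "fast" direction. In a real pipeline the toy vectors would be replaced by trained word embeddings, and the top-ranked terms would seed the second-stage search.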
This article is indexed by Wanfang Data and other databases.