
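Text feature selection method based on Word2Vec word embedding and genetic algorithm for biomarker selection in high-dimensional omics
Cite this article: ZHANG Yang, WANG Xiaoning. Text feature selection method based on Word2Vec word embedding and genetic algorithm for biomarker selection in high-dimensional omics[J]. Journal of Computer Applications, 2021, 41(11): 3151-3155.
Authors: ZHANG Yang  WANG Xiaoning
Affiliations: School of Data Science and Intelligent Media, Communication University of China, Beijing 100024, China
State Key Laboratory of Media Convergence and Communication (Communication University of China), Beijing 100024, China
Funding: General Program of the Beijing Natural Science Foundation (9202018); Fundamental Research Funds for the Central Universities (CUC200F08)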
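Abstract (translated from the Chinese): Text features are a key part of natural language processing. To address the high dimensionality and sparsity of current text features, a text feature selection method based on Word2Vec word embedding and the Genetic AlgoRithm for Biomarker selection in high-dimensional Omics (GARBO) is proposed to support subsequent text classification tasks. First, the input form of the data is optimized: the Word2Vec word embedding method converts the text into word vectors that resemble gene representations. Then, the high-dimensional word vectors are evolved iteratively in a way that simulates gene expression. Finally, a random forest classifier classifies the text after feature selection. Experiments on a Chinese review dataset show that the optimized GARBO feature selection method is effective for text feature selection: it reduces 300-dimensional features to 50 more valuable dimensions and reaches a classification accuracy of 88%. Compared with other filter-based text feature selection methods, it effectively reduces the dimensionality of text features and improves text classification performance.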

Keywords: text classification  genetic algorithm  feature dimensionality reduction  Word2Vec  text feature
Received: 2020-12-24
Revised: 2021-07-30

Text feature selection method based on Word2Vec word embedding and genetic algorithm for biomarker selection in high-dimensional omics
ZHANG Yang, WANG Xiaoning. Text feature selection method based on Word2Vec word embedding and genetic algorithm for biomarker selection in high-dimensional omics[J]. Journal of Computer Applications, 2021, 41(11): 3151-3155.
Authors:ZHANG Yang  WANG Xiaoning
Affiliation:School of Data Science and Intelligent Media,Communication University of China,Beijing 100024,China
State Key Laboratory of Media Convergence and Communication (Communication University of China),Beijing 100024,China
Abstract: Text features are a key part of natural language processing. To address the high dimensionality and sparseness of text features, a text feature selection method based on Word2Vec word embedding and the Genetic AlgoRithm for Biomarker selection in high-dimensional Omics (GARBO) was proposed to facilitate subsequent text classification tasks. Firstly, the data input form was optimized: the Word2Vec word embedding method was used to transform the text into word vectors similar to gene expression data. Then, the high-dimensional word vectors, treated as simulated gene expression, were iteratively evolved. Finally, a random forest classifier was used to classify the text after feature selection. Experiments were conducted on a Chinese comment dataset to verify the proposed method. The experimental results show that the optimized GARBO feature selection method is effective in text feature selection, reducing 300-dimensional features to 50 more valuable dimensions and reaching a classification accuracy of 88%. Compared with other filter-type text feature selection methods, the proposed method can effectively reduce the dimension of text features and improve text classification performance.
Keywords:text classification  genetic algorithm  feature dimensionality reduction  Word2Vec  text feature  
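
The selection step described in the abstract (choose 50 of 300 embedding dimensions by iterative evolution) can be illustrated with a plain genetic algorithm. This is a minimal sketch, not GARBO itself: the abstract does not give GARBO's operators or parameters, so `select_features`, its truncation selection, union crossover, swap mutation, and all default values are assumptions made for illustration. In the paper's setting the fitness would presumably be the accuracy of a random forest trained on the selected Word2Vec dimensions; here `toy_fitness` is a hypothetical stand-in so that the example runs with the standard library only.

```python
import random


def select_features(fitness, n_features, k, pop_size=20, generations=30,
                    mutation_rate=0.1, seed=0):
    """Generic GA over fixed-size feature subsets (illustrative, not GARBO)."""
    rng = random.Random(seed)
    # Each individual is a sorted tuple of k distinct feature indices.
    population = [tuple(sorted(rng.sample(range(n_features), k)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]  # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw k indices from the union of both parents.
            pool = sorted(set(a) | set(b))
            child = set(rng.sample(pool, k))
            # Mutation: swap one selected index for one outside the subset.
            if rng.random() < mutation_rate:
                child.remove(rng.choice(sorted(child)))
                outside = [i for i in range(n_features) if i not in child]
                child.add(rng.choice(outside))
            children.append(tuple(sorted(child)))
        population = parents + children  # parents survive (elitism)
    return max(population, key=fitness)


def toy_fitness(subset):
    # Hypothetical stand-in for classifier accuracy: pretend that only
    # dimensions 0..49 are informative and count how many were selected.
    return sum(1 for i in subset if i < 50)


best = select_features(toy_fitness, n_features=300, k=50)
print(len(best), toy_fitness(best))
```

Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next. Replacing `toy_fitness` with a cross-validated classifier score would make each evaluation far more expensive, so the population size and generation count would need to be chosen with that cost in mind.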
This article is indexed in Wanfang Data and other databases.