
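Text Classification Model Based on Width and Word Vector Features
Cite this article: LI Xue-Song. Text Classification Model Based on Width and Word Vector Feature[J]. Computer Systems & Applications, 2021, 30(3): 177-183.
Author: LI Xue-Song
Affiliation: Digital Personal Banking Department, Bank of China Head Office, Beijing 100818
Abstract: To address the weak memorization ability of word-vector text classification models and their lack of global word feature information, this paper proposes a text classification model based on width and word vector features (WideText). The text is first cleaned, segmented into words, and token-encoded, and a dictionary is defined. The term frequency-inverse document frequency (TF-IDF) of the global tokens is then computed and each text is vectorized. The words of the input text are mapped through their encoding into the word embedding matrix; after embedding and averaging, the word vector features are concatenated with the TF-IDF-based text vector features and passed to the output layer, which computes the probability of each class. On top of low-dimensional word vectors, the model draws on the expressive power of text vector features, and it has good generalization and memorization ability. Experimental results show that, once width features are introduced, WideText not only classifies clearly better than the word-vector text classification model but is also slightly better than a feedforward neural network classifier.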
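
The abstract's "width" feature is a TF-IDF vector computed per text over the global vocabulary. The following is a minimal sketch of that step, not the paper's implementation: the function name `tfidf_vectors`, the length-normalized term frequency, and the smoothed IDF formula are all assumptions chosen for illustration, since the abstract does not state which TF-IDF variant is used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized texts into TF-IDF vectors over a shared vocabulary.

    docs: list of token lists (already cleaned and segmented).
    Returns (vocab, vectors), where vectors[i][j] is the weight of
    vocab[j] in docs[i].
    """
    # Dictionary definition: a sorted global vocabulary with fixed indices.
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)

    # Document frequency: number of texts that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))

    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = [0.0] * len(vocab)
        for tok, count in counts.items():
            # Term frequency normalized by text length, scaled by IDF.
            vec[index[tok]] = (count / len(doc)) * idf[tok]
        vectors.append(vec)
    return vocab, vectors
```

Each text ends up with one fixed-length vector whose dimension equals the vocabulary size, which is what makes this feature "wide" compared with a low-dimensional averaged embedding.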

Keywords: Word2Vec  FastText  WideText  text classification
Received: 2020/6/15
Revised: 2020/7/14

Text Classification Model Based on Width and Word Vector Feature
LI Xue-Song. Text Classification Model Based on Width and Word Vector Feature[J]. Computer Systems & Applications, 2021, 30(3): 177-183.
Authors:LI Xue-Song
Affiliation:Digital Personal Banking Department, Bank of China, Beijing 100818, China
Abstract: Word-vector-based text classification models have weak memorization ability and lack global word feature information. To address these issues, we propose a text classification model (WideText) based on width and word vector features. First, the text is cleaned, segmented into words, and token-encoded, and a dictionary is defined. Second, the Term Frequency-Inverse Document Frequency (TF-IDF) of the global tokens is calculated and each text is vectorized. The words in the input text are then mapped to the word embedding matrix through their encoding. After the word vector features are embedded and averaged, they are concatenated with the TF-IDF-based text vector features and passed to the output layer, which computes the probability that the text belongs to each category. The proposed model adds the expressive power of text vector features to low-dimensional word vectors and has good generalization and memorization abilities. The experimental results show that, after the width feature is introduced, the classification performance of WideText is significantly better than that of the word-vector-based text classification model and slightly better than that of the feedforward neural network classifier.
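
The forward pass described in the abstract (average the word embeddings, concatenate with the TF-IDF vector, apply an output layer) can be sketched roughly as below. This is an illustrative sketch only: the names `widetext_forward` and `softmax`, the single linear output layer, and the softmax output are assumptions, since the abstract does not specify the exact layer configuration or training procedure.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def widetext_forward(token_ids, tfidf_vec, embedding, weights, bias):
    """Compute class probabilities for one text.

    token_ids: encoded tokens of the text (row indices into `embedding`).
    tfidf_vec: TF-IDF vector of the same text (the "wide" feature).
    embedding: word embedding matrix, one row per vocabulary entry.
    weights:   output-layer matrix, one row per class, each of length
               embedding_dim + len(tfidf_vec) (assumed linear layer).
    bias:      one bias term per class.
    """
    dim = len(embedding[0])

    # Average the embeddings of all tokens in the text.
    averaged = [0.0] * dim
    for tok in token_ids:
        for j in range(dim):
            averaged[j] += embedding[tok][j]
    if token_ids:
        averaged = [v / len(token_ids) for v in averaged]

    # Concatenate the dense (embedding) and wide (TF-IDF) features.
    features = averaged + list(tfidf_vec)

    # Linear output layer followed by softmax.
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(scores)
```

In this arrangement the averaged embedding supplies the generalization the abstract attributes to word vectors, while the TF-IDF part gives the output layer direct access to individual word weights, which is one plausible reading of the "memorization" the abstract refers to.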
Keywords:Word2Vec  FastText  WideText  text classification