Funding: National Natural Science Foundation of China (61975124); Natural Science Foundation of Shanghai (20ZR1438500); Science and Technology Action Plan of the Shanghai Municipal Science and Technology Commission (20DZ2308700)
Received: 2021-08-26
Revised: 2022-02-18

Research on short text classification model combined with word vector for sparse data
Citation: Yang Yang, Liu Enbo, Gu Chunhua, Pei Songwen. Research on short text classification model combined with word vector for sparse data [J]. Application Research of Computers, 2022, 39(3): 711-715+750.
Authors: Yang Yang, Liu Enbo, Gu Chunhua, Pei Songwen
Affiliation: School of Optical-Electrical & Computer Engineering, University of Shanghai for Science & Technology, Shanghai 200082, China
Abstract: Short texts lack sufficient co-occurrence information, which leaves only weak connections between words and makes topic words hard to obtain; as a result, short-text classification requires large amounts of manually labeled training data and suffers from sparse features and dimension explosion. This paper proposed a word co-occurrence short-text classification model based on an attention mechanism and a label graph (WGA-BERT). Firstly, it used the pretrained BERT model to compute context-aware text representations, and used WNTM to model the latent word-group distribution of each word to obtain topic-expanded feature vectors. Secondly, it proposed a label-graph construction method to capture the structure and relevance of topic words. Finally, it proposed an attention mechanism to establish the relations among topic words and between topic words and the text, which addressed data sparsity and topic-text heterogeneity. Experimental results show that, on short-text classification of news comments, the WGA-BERT model improves classification accuracy by an average of 3% over traditional machine learning models.
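The abstract describes the attention step only at a high level (it links topic words to each other and to the BERT text representation). As one illustrative sketch — not the paper's actual WGA-BERT implementation, with made-up toy vectors standing in for BERT and WNTM outputs — a scaled dot-product attention that fuses topic-word vectors into a single topic feature conditioned on the text vector might look like:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def topic_attention(text_vec, topic_vecs):
    """Attend from a text representation over topic-word vectors.

    Returns the attention-weighted topic feature and the weights,
    using dot-product scores scaled by sqrt(d), as in standard attention.
    """
    d = len(text_vec)
    scores = [sum(t * x for t, x in zip(topic, text_vec)) / math.sqrt(d)
              for topic in topic_vecs]
    weights = softmax(scores)
    fused = [sum(w * topic[i] for w, topic in zip(weights, topic_vecs))
             for i in range(d)]
    return fused, weights

# Toy example: a 4-dim "text" vector and three hypothetical topic-word vectors.
text = [0.2, 0.9, -0.1, 0.4]
topics = [
    [0.1, 1.0, 0.0, 0.3],    # topic word similar to the text
    [-0.8, 0.0, 0.9, -0.2],  # unrelated topic word
    [0.3, 0.5, 0.1, 0.2],
]
fused, weights = topic_attention(text, topics)
print([round(w, 3) for w in weights])
```

In a real pipeline the text vector would come from BERT's contextual output and the topic vectors from WNTM's word-group distributions; the fused topic feature would then be concatenated with the text representation before classification.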
Keywords: short text classification; word embedding; word network topic model (WNTM); attention mechanism
Indexed in: VIP (Weipu), Wanfang Data, and other databases.