Research on Embedded Text Topic Model Based on BERT (基于BERT的嵌入式文本主题模型研究)

Citation:	WANG Yuhan, LIN Min, LI Yanling, ZHAO Jiapeng. Research on Embedded Text Topic Model Based on BERT[J]. Computer Engineering and Applications, 2023, 59(1): 169-179.

Authors:	WANG Yuhan, LIN Min, LI Yanling, ZHAO Jiapeng

Affiliation:	1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100089, China; 3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100089, China

Funding:	National Natural Science Foundation of China (61806103, 61562068); Natural Science Foundation of Inner Mongolia (2017MS0607); Science and Technology Program of Inner Mongolia Autonomous Region (JH20180175)

Abstract:	Topic models can mine semantically rich topic words from massive text data and play an important role in text analysis tasks. Because the traditional LDA topic model represents text with a bag-of-words model, it cannot capture the semantic and sequential relationships between words, and it ignores stop words and low-frequency words. The embedded topic model (ETM) addresses these problems by representing words with Word2Vec vectors, but it typically assigns a polysemous word the same vector in every context and therefore cannot reflect context-dependent differences in meaning. To address this, a BERT-based embedded topic model, BERT-ETM, is designed for topic mining. Its effectiveness is verified on general-purpose Chinese and international datasets and on a software engineering text corpus. The experimental results show that the method overcomes the shortcomings of traditional topic models, clearly improves topic coherence and diversity, and handles polysemy well. In particular, WoBERT-ETM, which incorporates Chinese word segmentation, mines high-quality, fine-grained topic words and is very effective on large-scale text.

Keywords:	topic model; BERT model; word embedding; word vector; visualization
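
The abstract does not give formulas, so the following is only a minimal sketch of two ideas it relies on, under stated assumptions. It assumes the standard ETM formulation, in which each topic's word distribution is the softmax of the inner products between the word embeddings and a topic embedding, and the usual topic diversity metric, defined as the fraction of unique words among the top words of all topics. Whether BERT-ETM keeps exactly this form with BERT-derived embeddings is an assumption. The vocabulary, the two-dimensional embeddings, and all function names are hypothetical toy values chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_word_distribution(word_embeddings, topic_embedding):
    """Standard ETM form (assumed here): softmax of word-topic inner products."""
    scores = [
        sum(w * t for w, t in zip(vec, topic_embedding))
        for vec in word_embeddings
    ]
    return softmax(scores)


def top_words(beta_k, vocab, n):
    """Return the n most probable words of one topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: beta_k[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


def topic_diversity(topics_top_words):
    """Fraction of unique words among the top words of all topics."""
    all_words = [w for topic in topics_top_words for w in topic]
    return len(set(all_words)) / len(all_words)


# Toy data (hypothetical): six words, two-dimensional embeddings, two topics.
vocab = ["model", "topic", "word", "vector", "test", "code"]
word_embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],
]
topic_embeddings = [[2.0, 0.0], [0.0, 2.0]]

betas = [topic_word_distribution(word_embeddings, a) for a in topic_embeddings]
tops = [top_words(b, vocab, 2) for b in betas]
diversity = topic_diversity(tops)

print(tops)
print(diversity)
```

In this formulation the embeddings are the only place where word meaning enters the model, which is why replacing static Word2Vec vectors with contextual BERT vectors can change which words each topic ranks highest.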
