
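Research on Embedded Text Topic Model Based on BERT
Cite this article: WANG Yuhan, LIN Min, LI Yanling, ZHAO Jiapeng. Research on Embedded Text Topic Model Based on BERT[J]. Computer Engineering and Applications, 2023, 59(1): 169-179.
Authors: WANG Yuhan  LIN Min  LI Yanling  ZHAO Jiapeng
Affiliation: 1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100089, China 3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100089, China
Funding: National Natural Science Foundation of China (61806103, 61562068); Natural Science Foundation of Inner Mongolia (2017MS0607); Science and Technology Program of Inner Mongolia Autonomous Region (JH20180175)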
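Abstract: Topic models can mine semantically rich topic words from massive text data and play an important role in text analysis tasks. The traditional LDA topic model represents text with a bag-of-words model, so it cannot model the semantic and sequential relationships between words, and it ignores stop words and low-frequency words. The embedded topic model (ETM) addresses these problems by using Word2Vec word vectors to represent text, but it usually assigns a polysemous word the same vector in every context and therefore cannot reflect contextual differences in word meaning. To address this, a BERT-based embedded topic model, BERT-ETM, is designed for topic mining, and the effectiveness of the proposed method is verified on general-purpose datasets from China and abroad and on a text corpus from the "Software Engineering" domain. Experimental results show that the method overcomes the shortcomings of traditional topic models, clearly improves topic coherence and diversity, and performs well in modeling polysemy. In particular, WoBERT-ETM, which incorporates Chinese word segmentation, can mine high-quality, fine-grained topic words and is very effective for large-scale text.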

Keywords: topic model  BERT model  word embedding  word vector visualization

Research on Embedded Text Topic Model Based on BERT
WANG Yuhan, LIN Min, LI Yanling, ZHAO Jiapeng. Research on Embedded Text Topic Model Based on BERT[J]. Computer Engineering and Applications, 2023, 59(1): 169-179.
Authors: WANG Yuhan  LIN Min  LI Yanling  ZHAO Jiapeng
Affiliation: 1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100089, China 3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100089, China
Abstract: Topic models can mine semantically rich topic words from massive text data and play an important role in text analysis tasks. When the traditional LDA topic model uses the bag-of-words model to represent text, it cannot model the semantic and sequential relationships between words, and it ignores stop words and low-frequency words. Although the embedded topic model (ETM) solves these problems by using the Word2Vec model to represent the word vectors of text, it usually represents a polysemous word by the same vector in different contexts, which cannot reflect the contextual semantic differences of words. To solve this problem, a BERT-based ETM named BERT-ETM is designed for topic mining. The effectiveness of the proposed method is verified on general datasets from China and abroad and on a text corpus from the software engineering domain. The experimental results show that the method overcomes the shortcomings of traditional topic models, clearly improves topic coherence and diversity, and performs well in modeling polysemy. In particular, WoBERT-ETM, which is combined with Chinese word segmentation, can mine high-quality and fine-grained topic words and is very effective for large vocabularies.
Keywords: topic model  BERT model  word embedding  word vector visualization