BERT Mongolian Word Embedding Learning
Cite this article: WANG Yurong, LIN Min, LI Yanling. BERT Mongolian Word Embedding Learning[J]. Computer Engineering and Applications, 2023, 59(2): 129-134. DOI: 10.3778/j.issn.1002-8331.2107-0102
Authors: WANG Yurong, LIN Min, LI Yanling
Affiliation: College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
Funding: National Natural Science Foundation of China (61806103, 61562068); Natural Science Foundation of Inner Mongolia (2017MS0607); Science and Technology Program of Inner Mongolia Autonomous Region (JH20180175)
Abstract: Static Mongolian word-embedding learning methods, with Word2Vec as the representative, compress the multiple senses a word takes in different contexts into a single vector; such context-independent text representations offer only limited improvement on downstream tasks. This paper proposes a new dynamic Mongolian word-embedding learning method that further trains (second-stage training) the multilingual BERT pre-trained model in combination with a CRF and adopts two subword-fusion strategies. To verify the method's effectiveness, synonym comparison experiments with different models are conducted on education-domain and literature-domain datasets built from Mongolian master's and doctoral theses of Inner Mongolia Normal University, Mongolian words are cluster-analyzed with the K-means algorithm, and the method is finally validated on an embedded topic-word mining task. The experimental results show that the word vectors learned by BERT are of higher quality than those of Word2Vec: vectors of similar words lie very close together in the vector space while those of dissimilar words lie far apart, and the topic words obtained in the topic-word mining task are closely related.

Keywords: Mongolian; word embedding; bidirectional encoder representations from transformers (BERT); conditional random field
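
The abstract outlines the core pipeline: encode text with multilingual BERT, fuse each word's subword (WordPiece) vectors back into one word vector under two fusion strategies, then evaluate the resulting vectors, e.g. with K-means clustering. As an illustration only, the following is a minimal, hypothetical Python sketch of that pipeline, not the authors' implementation. It assumes the Hugging Face transformers and scikit-learn libraries and the public bert-base-multilingual-cased checkpoint as the multilingual BERT model; the function name word_vectors, the fusion options "mean"/"first", the placeholder sentence, and the cluster count are this sketch's assumptions, not the paper's specification.

# Minimal sketch (not the authors' implementation): dynamic word vectors
# from multilingual BERT with two subword-fusion strategies, followed by
# K-means clustering of the fused vectors.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def word_vectors(words, fusion="mean"):
    """Encode one pre-tokenized sentence and fuse each word's subword
    vectors into a single word vector ("mean" or "first" fusion)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (num_subwords, 768)
    fused = []
    for idx in range(len(words)):
        # Positions of the WordPiece tokens that belong to word `idx`.
        span = [i for i, w in enumerate(enc.word_ids(0)) if w == idx]
        if fusion == "mean":
            fused.append(hidden[span].mean(dim=0))  # average all pieces
        else:
            fused.append(hidden[span[0]])           # first piece only
    return torch.stack(fused).numpy()

# Placeholder sentence and cluster count, for demonstration only.
vecs = word_vectors(["a", "placeholder", "sentence", "for", "clustering"])
print(KMeans(n_clusters=2, n_init=10).fit_predict(vecs))

With fusion="mean" every WordPiece vector of a word is averaged; with fusion="first" only the leading piece represents the word. These two options merely stand in for the paper's two subword-fusion strategies, whose exact definitions are given in the paper itself.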

