BERT Mongolian Word Embedding Learning
Cite this article: WANG Yurong, LIN Min, LI Yanling. BERT Mongolian Word Embedding Learning[J]. Computer Engineering and Applications, 2023, 59(2): 129-134. DOI: 10.3778/j.issn.1002-8331.2107-0102
Authors: WANG Yurong, LIN Min, LI Yanling
Affiliation: College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
Funding: National Natural Science Foundation of China (61806103, 61562068); Natural Science Foundation of Inner Mongolia (2017MS0607); Science and Technology Program of Inner Mongolia Autonomous Region (JH20180175)
Abstract: Static Mongolian word-embedding learning methods, with Word2Vec as the representative, compress the multiple senses a word takes in different contexts into a single vector; such context-independent text representations offer only limited improvement on downstream tasks. This paper proposes a new dynamic Mongolian word-embedding learning method that further trains (second-stage training) the multilingual BERT pre-trained model in combination with a CRF and adopts two subword-fusion strategies. To verify the method's effectiveness, synonym comparison experiments with different models are conducted on education-domain and literature-domain datasets built from Mongolian master's and doctoral theses of Inner Mongolia Normal University, Mongolian words are cluster-analyzed with the K-means algorithm, and the method is finally validated on an embedded topic-word mining task. The experimental results show that the word vectors learned by BERT are of higher quality than those of Word2Vec: vectors of similar words lie very close together in the vector space while those of dissimilar words lie far apart, and the topic words obtained in the topic-word mining task are closely related.

Keywords: Mongolian; word embedding; bidirectional encoder representations from transformers (BERT); conditional random field
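
The abstract outlines the core pipeline: encode text with multilingual BERT, fuse each word's subword (WordPiece) vectors back into one word vector under two fusion strategies, then evaluate the resulting vectors, e.g. with K-means clustering. As an illustration only, the following is a minimal, hypothetical Python sketch of that pipeline, not the authors' implementation. It assumes the Hugging Face transformers and scikit-learn libraries and the public bert-base-multilingual-cased checkpoint as the multilingual BERT model; the function name word_vectors, the fusion options "mean"/"first", the placeholder sentence, and the cluster count are this sketch's assumptions, not the paper's specification.

# Minimal sketch (not the authors' implementation): dynamic word vectors
# from multilingual BERT with two subword-fusion strategies, followed by
# K-means clustering of the fused vectors.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def word_vectors(words, fusion="mean"):
    """Encode one pre-tokenized sentence and fuse each word's subword
    vectors into a single word vector ("mean" or "first" fusion)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (num_subwords, 768)
    fused = []
    for idx in range(len(words)):
        # Positions of the WordPiece tokens that belong to word `idx`.
        span = [i for i, w in enumerate(enc.word_ids(0)) if w == idx]
        if fusion == "mean":
            fused.append(hidden[span].mean(dim=0))  # average all pieces
        else:
            fused.append(hidden[span[0]])           # first piece only
    return torch.stack(fused).numpy()

# Placeholder sentence and cluster count, for demonstration only.
vecs = word_vectors(["a", "placeholder", "sentence", "for", "clustering"])
print(KMeans(n_clusters=2, n_init=10).fit_predict(vecs))

With fusion="mean" every WordPiece vector of a word is averaged; with fusion="first" only the leading piece represents the word. These two options merely stand in for the paper's two subword-fusion strategies, whose exact definitions are given in the paper itself.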

