
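融合主题及上下文特征的汉缅双语词汇抽取方法 (A Chinese-Burmese Bilingual Vocabulary Extraction Method Fusing Topic and Context Features)
Cite as: LI Yue, MAO Cun-li, YU Zheng-tao, GAO Sheng-xiang, WANG Zhen-han, ZHANG Ya-fei. 融合主题及上下文特征的汉缅双语词汇抽取方法 [J]. 小型微型计算机系统, 2021(1): 91-95.
Authors: LI Yue (李越), MAO Cun-li (毛存礼), YU Zheng-tao (余正涛), GAO Sheng-xiang (高盛祥), WANG Zhen-han (王振晗), ZHANG Ya-fei (张亚飞)
Affiliation: Faculty of Information Engineering and Automation, Kunming University of Science and Technology; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology
Funding: Supported by the Key Program of the National Natural Science Foundation of China (61732005); the National Natural Science Foundation of China (61662041, 61761026, 6186019, 61972186); the Yunnan Province Reserve Talent Program for Young and Middle-aged Academic and Technical Leaders (2019HB006); and the Key Program of the Yunnan Natural Science Foundation (2019FA023).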
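Abstract (translated from the Chinese): Burmese is a low-resource language. Obtaining large-scale Chinese-Burmese bilingual vocabulary from the web can, to some extent, alleviate the shortage of sentence-level aligned corpora in Chinese-Burmese machine translation. To this end, this paper proposes a Chinese-Burmese bilingual vocabulary extraction method that fuses topic and context features. First, an LDA topic model is used to obtain the topic distributions of Chinese and Burmese documents; the cross-lingual topic vectors are mapped into a shared semantic space through bilingual word vector representations, and words with high similarity under the same topic are extracted as candidate Chinese-Burmese bilingual vocabulary. Next, BERT is used to obtain lexical semantic representations of the contexts of the candidate words, from which context vectors are built. Finally, the candidates are weighted by the similarity of their context vectors to obtain higher-quality Chinese-Burmese translation pairs. Experimental results show that, compared with the bilingual-dictionary-based method and the bilingual LDA+CBW method, the proposed method improves accuracy by 11.07% and 3.82%, respectively.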

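Keywords (translated from the Chinese): Chinese-Burmese bilingual vocabulary; topic features; context features; BERT; bilingual word vectors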

Method of Chinese Burmese Bilingual Vocabulary Extraction Based on Subject and Context Features
LI Yue,MAO Cun-li,YU Zheng-tao,GAO Sheng-xiang,WANG Zhen-han,ZHANG Ya-fei.Method of Chinese Burmese Bilingual Vocabulary Extraction Based on Subject and Context Features[J].Mini-micro Systems,2021(1):91-95.
Authors:LI Yue  MAO Cun-li  YU Zheng-tao  GAO Sheng-xiang  WANG Zhen-han  ZHANG Ya-fei
Affiliation:(Faculty of Information Engineering and Automation,Kunming University of Science and Technology,Kunming 650500,China;Yunnan Key Laboratory of Artificial Intelligence,Kunming University of Science and Technology,Kunming 650500,China)
Abstract: Burmese is a low-resource language. Obtaining large-scale Chinese-Burmese bilingual vocabulary from the Internet can partly mitigate the lack of sentence-level aligned corpora in Chinese-Burmese machine translation. This article therefore proposes a Chinese-Burmese bilingual vocabulary extraction method based on topic and context features. First, the topic distributions of the Chinese and Burmese documents are obtained with the LDA topic model, and the cross-language topic vectors are mapped into a shared semantic space through bilingual word vector representations. Words with high similarity under the same topic are then extracted as candidate Chinese-Burmese bilingual vocabulary. Next, BERT is used to obtain semantic representations of the contexts of the candidate words, from which context vectors are constructed. Finally, the candidates are weighted by the similarity of their context vectors to obtain higher-quality Chinese-Burmese translation vocabulary. Experimental results show that, compared with the method based on a bilingual dictionary and the method based on bilingual LDA+CBW, the accuracy of the proposed method is improved by 11.07% and 3.82%, respectively.
Keywords:Chinese-Myanmar vocabulary  thematic features  contextual features  BERT  bilingual word vector
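
The pipeline summarized in the abstract (candidate selection within a topic, then re-weighting by context similarity) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy vectors stand in for the shared-space word vectors and the BERT-derived context vectors, the placeholder names `my_a`, `my_b`, `my_c` stand in for Burmese words, and the linear interpolation with weight `alpha` is one plausible weighting, since the abstract does not state the exact formula.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source, candidates, alpha=0.5):
    """Rank target-language candidates for one source word.

    Each word is a dict with:
      'topic': topic id (assumed to come from a topic model such as LDA),
      'vec':   word vector in a shared bilingual space (assumed given),
      'ctx':   context vector (assumed to come from a model such as BERT).
    The score alpha * word_sim + (1 - alpha) * ctx_sim is a hypothetical
    weighting, not the formula from the paper.
    """
    scored = []
    for name, cand in candidates.items():
        # Only words assigned to the same topic are considered candidates.
        if cand["topic"] != source["topic"]:
            continue
        word_sim = cosine(source["vec"], cand["vec"])
        ctx_sim = cosine(source["ctx"], cand["ctx"])
        scored.append((name, alpha * word_sim + (1 - alpha) * ctx_sim))
    # Highest combined score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: hypothetical vectors, not taken from the paper.
zh_word = {"topic": 0, "vec": [1.0, 0.0], "ctx": [1.0, 0.0]}
my_words = {
    "my_a": {"topic": 0, "vec": [1.0, 0.0], "ctx": [0.0, 1.0]},
    "my_b": {"topic": 0, "vec": [0.8, 0.6], "ctx": [1.0, 0.0]},
    "my_c": {"topic": 1, "vec": [1.0, 0.0], "ctx": [1.0, 0.0]},
}

ranked = rank_candidates(zh_word, my_words)
for name, score in ranked:
    print(name, round(score, 3))
```

In this toy example `my_a` has the closest word vector, but `my_b` ranks first once context similarity is included, and `my_c` is excluded because it belongs to a different topic. This mirrors the role the abstract gives to context features: filtering and re-ordering candidates that topic and word-vector similarity alone would accept.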
This article is indexed in VIP (维普) and other databases.