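A Cross-Lingual Word Embedding Method Based on a Small Dictionary and Unequal Corpora
Cite this article:WANG Hongbin, FENG Yinhan, YU Zhengtao, WEN Yonghua. A Cross-Lingual Word Embedding Method Based on a Small Dictionary and Unequal Corpora[J]. Journal of Chinese Information Processing, 2019, 33(8): 46-52.
Authors:WANG Hongbin  FENG Yinhan  YU Zhengtao  WEN Yonghua
Affiliation:Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650504, China
Funding:National Natural Science Foundation of China (61462054, 61732005, 61672271); Yunnan Provincial Department of Science and Technology Project (2015FB135)
Abstract:Bilingual word embedding usually maps the source-language space into the target-language space, achieving cross-lingual word embedding through the linear transformation that minimizes the distance between the mapped source-language embeddings and the target-language space. However, large parallel corpora are hard to obtain, which makes it hard to improve embedding accuracy. To address cross-lingual word embedding when the corpora are unequal in size and bilingual data is scarce, this paper proposes a cross-lingual word embedding method based on a small dictionary and unequal corpora. First, the monolingual word vectors are normalized, and the optimal orthogonal linear transformation over the small dictionary's word pairs is computed to give the initial value for gradient descent. Next, the large source-language (English) corpus is clustered, and the small dictionary is used to find the source-language words that correspond to each cluster. The mean word vector of each cluster and the mean of the corresponding source- and target-language word vectors are taken to establish new bilingual word-vector correspondences, which are added to the small dictionary so that it is generalized and extended. Finally, the generalized, extended dictionary is used to run gradient descent on the cross-lingual embedding mapping model to obtain the optimal value. Experiments on English-Italian, English-German, and English-Finnish show that the method reduces the number of gradient descent iterations and the training time while achieving good accuracy in cross-lingual word embedding.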

Keywords:small dictionary  unequal corpora  word embedding  k-means clustering  gradient descent

Cross Language Word Embedding Based on Small Dictionary and Unbalance Mono-lingual Corpus
WANG Hongbin,FENG Yinhan,YU Zhengtao,WEN Yonghua.Cross Language Word Embedding Based on Small Dictionary and Unbalance Mono-lingual Corpus[J].Journal of Chinese Information Processing,2019,33(8):46-52.
Authors:WANG Hongbin  FENG Yinhan  YU Zhengtao  WEN Yonghua
Affiliation:Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650504, China
Abstract:This paper proposes a cross-language word embedding method based on a small dictionary and unbalanced monolingual corpora. The method first normalizes the monolingual word vectors and obtains the initial value for gradient descent from the optimal orthogonal linear transformation over the small dictionary's word pairs. The large-scale source-language (English) corpus is then clustered, and the source-language words corresponding to each cluster are identified via the dictionary. The mean word vector of each cluster and the mean of the corresponding source- and target-language word vectors are computed, establishing new bilingual word-vector correspondences that are added to the small dictionary. Finally, the generalized, extended dictionary is used to run gradient descent on the cross-language word embedding mapping model to obtain the optimal value. Experiments on English-Italian, English-German and English-Finnish show that the method reduces the number of gradient descent iterations and the training time while preserving good accuracy in cross-language word embedding.
Keywords:small dictionary  unequal corpus  word embedding  k-means clustering  gradient descent  
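
The pipeline summarized in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the plain Lloyd's k-means, the mean-squared-error objective, the learning rate, and the exact pairing rule (each cluster's mean source vector paired with the mean target vector of the seed-dictionary entries that fall in that cluster) are all assumptions made for illustration, since the abstract does not give these details.

```python
import numpy as np

def normalize_rows(X):
    """Scale each word vector to unit Euclidean length."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def procrustes_init(X, Z):
    """Orthogonal W minimizing ||X W - Z||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def extend_dictionary(src, tgt, pairs, labels):
    """Seed pairs plus one (cluster mean, mean target vector) pair per
    cluster that contains at least one seed source word (assumed rule)."""
    X = [src[i] for i, _ in pairs]
    Z = [tgt[j] for _, j in pairs]
    for c in np.unique(labels):
        hits = [(i, j) for i, j in pairs if labels[i] == c]
        if hits:
            X.append(src[labels == c].mean(axis=0))
            Z.append(np.mean([tgt[j] for _, j in hits], axis=0))
    return np.array(X), np.array(Z)

def mse(X, Z, W):
    """Mean squared mapping error over the dictionary rows."""
    return float(((X @ W - Z) ** 2).sum() / len(X))

def fit_mapping(X, Z, W, lr=0.1, steps=200):
    """Gradient descent on mse(X, Z, W), starting from the given W."""
    for _ in range(steps):
        W = W - lr * 2.0 * X.T @ (X @ W - Z) / len(X)
    return W

# Synthetic stand-ins for real embeddings: 200 "source" words, 5 dimensions,
# and a "target" space that is a rotation of the source space.
rng = np.random.default_rng(0)
src = normalize_rows(rng.normal(size=(200, 5)))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
tgt = src @ Q
pairs = [(i, i) for i in range(20)]           # small seed dictionary

X_seed = np.array([src[i] for i, _ in pairs])
Z_seed = np.array([tgt[j] for _, j in pairs])
W0 = procrustes_init(X_seed, Z_seed)          # initial value for descent
labels = kmeans(src, k=4)
X_ext, Z_ext = extend_dictionary(src, tgt, pairs, labels)
W = fit_mapping(X_ext, Z_ext, W0)
```

Starting gradient descent from the orthogonal solution rather than from a random matrix is the step the abstract credits with reducing the number of iterations; the sketch only shows where that initial value enters, not the reported speed-up.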
This article is indexed in VIP (维普) and other databases.