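Distributed representation of Chinese and Thai words based on cross-lingual corpus (基于跨语言语料的汉泰词分布表示)
Cite this article:张金鹏,周兰江,线岩团,余正涛,何思兰.基于跨语言语料的汉泰词分布表示[J].计算机工程与科学,2015,37(12):2358-2365.
Authors:ZHANG Jin-peng (张金鹏)  ZHOU Lan-jiang (周兰江)  XIAN Yan-tuan (线岩团)  YU Zheng-tao (余正涛)  HE Si-lan (何思兰)
Affiliations:1. School of Information Engineering and Automation, Kunming University of Science and Technology; 2. Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology; 3. School of Science, Kunming University of Science and Technology
Funding:National Natural Science Foundation of China (61363044)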
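Abstract:Word representation is a fundamental research topic in natural language processing. Monolingual distributed word representations have already been applied with good results to a number of natural language processing problems, but cross-lingual distributed word representations have received little study in China or abroad. To address this, we exploit the similarity between the distributions of nouns and verbs in the two languages and, through weakly supervised learning extension and related methods, embed Thai translation equivalents, same-category words, and hypernyms into a Chinese corpus, thereby learning the distribution of Thai words in a Chinese-Thai cross-lingual setting. In the experiments, the learned cross-lingual distributed word representations are applied to bilingual text similarity computation and to text classification on a mixed Chinese-Thai corpus, with good results on both tasks.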

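Keywords:weakly supervised learning extension  cross-lingual corpus  cross-lingual distributed word representation  neural probabilistic language model
Received:2015-08-20
Revised:2015-12-25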

Distributed representation of Chinese and Thai words based on cross-lingual corpus
ZHANG Jin-peng, ZHOU Lan-jiang, XIAN Yan-tuan, YU Zheng-tao, HE Si-lan. Distributed representation of Chinese and Thai words based on cross-lingual corpus[J]. Computer Engineering & Science, 2015, 37(12): 2358-2365.
Authors:ZHANG Jin-peng  ZHOU Lan-jiang  XIAN Yan-tuan  YU Zheng-tao  HE Si-lan
Affiliation:(1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; 2. The Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500; 3. School of Science, Kunming University of Science and Technology, Kunming 650500, China)
Abstract:Word representation is a fundamental research topic in natural language processing (NLP). Distributed representations of monolingual words already perform well on a number of NLP tasks, but distributed representations of cross-lingual words have received little attention in China or abroad. To address this gap, we exploit the distributional similarity of nouns and verbs in Chinese and Thai: using weakly supervised learning extension and related methods, we embed Thai translation equivalents, same-category words, and hypernyms into a Chinese corpus and thereby learn the distribution of Thai words in a Chinese-Thai cross-lingual setting. We apply the learned cross-lingual word representations to bilingual text similarity computation and to classification of a mixed Chinese-Thai text corpus. Experimental results show that the approach performs well on both tasks.
Keywords:weakly supervised learning extension  cross-lingual corpus  cross-lingual word distribution representations  neural probabilistic language model  
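
The abstract describes two steps that a small example may make more concrete: embedding Thai words into a Chinese corpus, and comparing bilingual texts through the learned word vectors. The following is only a minimal sketch under stated assumptions, not the authors' procedure. It covers only the substitution of translation equivalents (not same-category words, hypernyms, or the weakly supervised extension), the dictionary `ZH_TO_TH` and the function names are hypothetical, and the word vectors are assumed to have been trained separately on the mixed corpus. Averaging word vectors and taking the cosine is a common baseline chosen here for illustration; the abstract does not state which similarity measure the paper uses.

```python
import math
import random

# Hypothetical toy Chinese-to-Thai dictionary; the paper's lexicon is not given in the abstract.
ZH_TO_TH = {"猫": "แมว", "狗": "หมา", "吃": "กิน"}


def embed_thai(sentences, dictionary, rate=0.5, seed=0):
    """Return each tokenized sentence plus a copy in which dictionary words
    are replaced by their Thai translation with probability `rate`."""
    rng = random.Random(seed)
    mixed = []
    for tokens in sentences:
        mixed.append(list(tokens))
        swapped = [
            dictionary[t] if t in dictionary and rng.random() < rate else t
            for t in tokens
        ]
        # Only keep the copy if at least one word was actually replaced.
        if swapped != list(tokens):
            mixed.append(swapped)
    return mixed


def average_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; None if there are none."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def text_similarity(tokens_a, tokens_b, vectors):
    """Cosine similarity between the averaged word vectors of two texts."""
    u = average_vector(tokens_a, vectors)
    v = average_vector(tokens_b, vectors)
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

In this sketch the output of `embed_thai` would be fed to any word-embedding trainer, so that a Thai word appearing in the same contexts as its Chinese counterpart ends up with a nearby vector; `text_similarity` can then compare a Chinese text with a Thai one directly.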
This article is indexed in Wanfang Data (万方数据) and other databases.