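Uyghur-Chinese Sentence Alignment Based on Multi-Feature Fusion and Graph Matching
Cite this article:Ni Yaoqun, Xu Hongbo, Cheng Xueqi. Uyghur-Chinese Sentence Alignment Based on Multi-Feature Fusion and Graph Matching[J]. Journal of Chinese Information Processing, 2016, 30(4): 124-133.
Authors:Ni Yaoqun  Xu Hongbo  Cheng Xueqi
Affiliation:1. CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190;
2. University of Chinese Academy of Sciences, Beijing 100049;
3. Department of Language Engineering, Luoyang University of Foreign Languages, Luoyang, Henan 471003
Funding:National Natural Science Foundation of China (61232010, 61303156); National 973 Program (2012CB316303); National 863 Program (2012AA011003); National Key Technology R&D Program (2012BAH46B04)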
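Abstract:Uyghur news webpages and their Chinese translations are often not fully comparable in content: the two sentence sequences are out of step with each other, and some sentences are missing altogether, which makes Uyghur-Chinese sentence alignment difficult. In addition, many of the person and place names that are key elements of news are out-of-vocabulary words, which makes alignment harder still. To raise the probability of matching Uyghur and Chinese words, the authors automatically extract Chinese person and place names, translate them into Uyghur, build a bilingual name mapping table, and add it to the Uyghur-Chinese bilingual dictionary. Then, for each dictionary word in a Uyghur sentence, its Chinese translations are searched for in the Chinese sentence by string matching, which avoids Chinese word segmentation errors; accumulating all matched word pairs gives the lexical translation rate of the sentence pair. Finally, number, punctuation, and length features are fused with this rate to compute the similarity of the sentence pair. On the matrix formed by the similarities of all bilingual sentence pairs, a graph matching algorithm is used to find Uyghur-Chinese parallel sentence pairs, reaching a best alignment accuracy of 95.67% on 900 sentence pairs.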

Keywords:sentence alignment  translation of person and place names  multi-feature fusion  optimal bipartite graph matching
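
The similarity computation summarized in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract gives no formulas: the function names, the definition of the translation rate as the share of dictionary-covered Uyghur words with a translation found in the Chinese sentence, the Jaccard overlap for numbers, the character-length ratio, and the weights. The punctuation feature is omitted for brevity, and the toy dictionary uses Latin-transliterated Uyghur.

```python
import re

def translation_rate(uy_words, zh_sentence, dictionary):
    """Fraction of dictionary-covered Uyghur words whose translation occurs in zh_sentence."""
    covered = [w for w in uy_words if w in dictionary]
    if not covered:
        return 0.0
    # Substring search on the raw Chinese sentence: no word segmentation is needed.
    hits = sum(1 for w in covered if any(t in zh_sentence for t in dictionary[w]))
    return hits / len(covered)

def number_overlap(a, b):
    """Jaccard overlap of the digit strings found in the two sentences."""
    na, nb = set(re.findall(r"\d+", a)), set(re.findall(r"\d+", b))
    if not na and not nb:
        return 1.0  # neither sentence contains numbers: treat as compatible
    return len(na & nb) / len(na | nb)

def length_ratio(a, b):
    """Shorter length divided by longer length, in characters (a crude proxy)."""
    longest = max(len(a), len(b))
    return min(len(a), len(b)) / longest if longest else 0.0

def similarity(uy_sentence, zh_sentence, dictionary, weights=(0.7, 0.2, 0.1)):
    """Weighted sum of the three features; the weights are hypothetical."""
    w_trans, w_num, w_len = weights
    return (w_trans * translation_rate(uy_sentence.split(), zh_sentence, dictionary)
            + w_num * number_overlap(uy_sentence, zh_sentence)
            + w_len * length_ratio(uy_sentence, zh_sentence))

# Toy dictionary entries, for illustration only.
toy_dict = {"Beyjing": ["北京"], "xewer": ["新闻"]}
good = similarity("Beyjing xewer 2016", "北京新闻2016", toy_dict)
bad = similarity("Beyjing xewer 2016", "今天天气很好", toy_dict)
```

In this sketch the matching pair scores higher than the unrelated pair because both dictionary words and the number are found in the Chinese sentence. Scoring every Uyghur sentence against every Chinese sentence in this way would yield the similarity matrix that the matching step operates on.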

Uyghur Chinese Sentence Alignment Based on Multi Features and Optimal Matching
Ni Yaoqun,Xu Hongbo,Cheng Xueqi.Uyghur Chinese Sentence Alignment Based on Multi Features and Optimal Matching[J].Journal of Chinese Information Processing,2016,30(4):124-133.
Authors:Ni Yaoqun  Xu Hongbo  Cheng Xueqi
Affiliation:1. CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China;
3. Department of Language Engineering, University of Foreign Languages, Luoyang, Henan 471003, China
Abstract:Uyghur news webpages are usually only partially comparable with their Chinese counterparts. Sentences may appear in a different order in the two texts, and some have no counterpart at all, which makes it difficult to mine parallel sentences (i.e., sentence beads) from bilingual news. Many person and location names in news are also out-of-vocabulary words. First, to improve the word matching rate, person and location names in the Chinese text are extracted and translated into Uyghur, and the resulting name pairs are added to the bilingual dictionary. Then we search each Chinese sentence for the translations of the Uyghur words by string matching, which avoids Chinese word segmentation errors, and calculate the word translation rate from the matches. The final similarity of a sentence pair combines the word translation rate with number, punctuation, and sentence length features. The similarities of all bilingual sentence pairs form a weight matrix, on which we use a greedy algorithm and a maximum weight matching algorithm for bipartite graphs to find the most probable parallel sentence pairs. Our method achieves a best accuracy of 95.67% in sentence alignment on 900 sentence pairs.
Keywords:sentence alignment  translation of person and location names  multiple features blending  maximum weight matching in bipartite graph
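
The abstract names two ways of selecting sentence pairs from the weight matrix: a greedy algorithm and maximum weight matching in a bipartite graph. The sketch below contrasts the two on a toy matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names and the `threshold` parameter are hypothetical, and the optimal matcher is a standard Kuhn-Munkres (Hungarian) assignment that requires at least as many columns as rows.

```python
from math import inf

def greedy_matching(sim, threshold=0.0):
    """Repeatedly take the best remaining cell whose row and column are both free."""
    cells = sorted(((s, i, j) for i, row in enumerate(sim)
                    for j, s in enumerate(row)), reverse=True)
    used_rows, used_cols, pairs = set(), set(), []
    for s, i, j in cells:
        if s < threshold:
            break  # all remaining cells score lower still
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            pairs.append((i, j))
    return sorted(pairs)

def max_weight_matching(sim):
    """Kuhn-Munkres assignment maximizing total similarity (rows <= columns)."""
    n, m = len(sim), len(sim[0])
    if n > m:
        raise ValueError("this sketch expects rows <= columns")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)      # dual potentials
    p, way = [0] * (m + 1), [0] * (m + 1)        # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    # Cost is the negated similarity, so minimizing cost maximizes weight.
                    cur = -sim[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                                 # augment along the found path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j])

sim = [[0.90, 0.80],
       [0.85, 0.10]]
greedy_pairs = greedy_matching(sim)
optimal_pairs = max_weight_matching(sim)
```

On this toy matrix the greedy pass commits to the single best cell (0, 0) and is then left with the weak pair (1, 1), while the optimal matching pairs row 0 with column 1 and row 1 with column 0 for a higher total. A threshold such as the hypothetical one in `greedy_matching` is one plausible way to leave sentences unmatched when their counterpart is missing; the abstract does not say how the paper handles that case.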