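HITS-Based Optimization Method for Bilingual Sentence Pair Mining (基于HITS算法的双语句对挖掘优化方法)
Cite this article:LIU Hao, HONG Yu, YAO Liang, LIU Le, YAO Jianmin, ZHOU Guodong. 基于HITS算法的双语句对挖掘优化方法[J]. 中文信息学报, 2017, 31(2): 25-35
Authors:LIU Hao  HONG Yu  YAO Liang  LIU Le  YAO Jianmin  ZHOU Guodong
Affiliation:Jiangsu Provincial Key Laboratory of Computer Information Processing, Soochow University, Suzhou, Jiangsu 215006
Funding:National Natural Science Foundation of China (61373097, 61272259, 61272260, 90920004); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (2009321110006, 20103201110021); Natural Science Foundation of Jiangsu Province (BK2011282); Major Project of the Natural Science Foundation of the Jiangsu Higher Education Institutions (11KJA520003); Suzhou Natural Science Foundation (SH201212)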
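Abstract:Identifying and locating domain-specific bilingual websites is the key to automatically building a domain-specific bilingual corpus from the Web. However, sentence-pair quality often varies considerably across domain-specific bilingual websites. In contrast to existing methods, which identify and filter low-quality sentence pairs using textual features of the pairs themselves, this paper starts from the source of the sentence pairs, namely the domain-specific bilingual websites. Based on the hypothesis that websites with high domain authority tend to contain high-quality parallel sentence pairs, it proposes a HITS-based optimization method for bilingual sentence pair mining. The method builds a directed graph model from the link information between websites and uses the HITS algorithm to measure each website's authority. On this basis, bilingual sentence pairs are extracted only from highly authoritative websites and used to train a domain-specific machine translation system. Taking the education domain as the target, the paper tests the feasibility of the hypothesis that "websites with high domain authority contain high-quality sentence pairs". Experimental results show that a translation system trained on sentence pairs mined with the proposed method improves on the baseline system by an average of 0.44 BLEU points. In addition, to address the "topic drift" problem of the HITS algorithm, the paper proposes an improved algorithm based on GHITS. Results show that the machine translation system improved with the GHITS algorithm gains a further 0.40 BLEU points.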

Keywords:statistical machine translation   domain-specific machine translation   domain-specific bilingual websites   authority

HITS-Based Optimization Method for Bilingual Corpus Mining
LIU Hao, HONG Yu, YAO Liang, LIU Le, YAO Jianmin, ZHOU Guodong. HITS-Based Optimization Method for Bilingual Corpus Mining[J]. Journal of Chinese Information Processing, 2017, 31(2): 25-35
Authors:LIU Hao  HONG Yu  YAO Liang  LIU Le  YAO Jianmin  ZHOU Guodong
Affiliation:Provincial Key Laboratory of Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
Abstract:Identifying and locating domain-specific bilingual websites is a crucial step in Web-based construction of domain-specific bilingual corpora. However, the quality of sentence pairs varies considerably among bilingual websites. In contrast to existing methods, which rely only on features internal to the sentence pairs, we use the origin of the sentence pairs to identify and filter low-quality pairs. We hypothesize that a website that is authoritative in the target domain tends to contain more high-quality sentence pairs. We therefore propose a HITS-based optimization method for mining domain-specific bilingual sentence pairs. First, we construct a directed graph model from the link information among the websites. Second, we use the HITS algorithm to evaluate the authority of each website. Finally, we extract sentence pairs only from the authoritative websites and use them to train our domain-specific machine translation system. In experiments on the education domain, our system achieves an average improvement of 0.44 BLEU points over the baseline system. To address the "topic drift" problem of HITS, we further propose a GHITS-based method, which achieves an additional improvement of 0.40 BLEU points.
Keywords:statistical machine translation   specific-domain machine translation   specific-domain bilingual websites   authority   HITS  
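
The authority-scoring step described in the abstract can be illustrated with a minimal sketch of the standard HITS power iteration. This is not the authors' implementation: the function names `hits` and `select_authoritative`, the `top_k` cutoff, the iteration settings, and the example site names are all assumptions made for illustration, and the GHITS variant is not covered because the abstract does not specify it.

```python
import math
from collections import defaultdict


def hits(edges, iterations=50, tol=1e-9):
    """Standard HITS on a directed graph given as (source, target) link pairs.

    Returns two dicts: authority scores and hub scores, each L2-normalized.
    """
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}, {}

    in_links = defaultdict(list)   # target -> sites linking to it
    out_links = defaultdict(list)  # source -> sites it links to
    for src, dst in edges:
        in_links[dst].append(src)
        out_links[src].append(dst)

    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the sites pointing to n.
        new_auth = {n: sum(hub[s] for s in in_links[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        new_auth = {n: v / norm for n, v in new_auth.items()}

        # Hub update: sum of authority scores of the sites n points to.
        new_hub = {n: sum(new_auth[d] for d in out_links[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        new_hub = {n: v / norm for n, v in new_hub.items()}

        delta = max(abs(new_auth[n] - auth[n]) for n in nodes)
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return auth, hub


def select_authoritative(auth, top_k):
    """Hypothetical selection rule: keep the top_k sites by authority score."""
    return sorted(auth, key=auth.get, reverse=True)[:top_k]


# Hypothetical link graph between bilingual websites (names are made up).
links = [
    ("site_a", "edu_portal"),
    ("site_b", "edu_portal"),
    ("site_c", "edu_portal"),
    ("site_a", "small_blog"),
]
authority, hub_score = hits(links)
trusted_sites = select_authoritative(authority, top_k=1)
```

Under these assumptions, sentence pairs would then be extracted only from the sites in `trusted_sites`. How the cutoff is chosen in the paper is not stated in the abstract, so `top_k` is only a placeholder.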