
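A Classification-Based Method for Parallel Corpus Selection
Cite this article: WANG Xing, TU Zhaopeng, XIE Jun, LV Yajuan, YAO Jianmin. A Classification-Based Method for Parallel Corpus Selection[J]. Journal of Chinese Information Processing, 2013, 27(6): 144-151
Authors: WANG Xing, TU Zhaopeng, XIE Jun, LV Yajuan, YAO Jianmin
Affiliations: 1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006;
2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190;
3. Department of Computer Science, University of California, Davis, California 95616
Funding: 863 Program major project (No. 2011AA01A207); National Natural Science Foundation of China (Nos. 61003152, 61272259).
Abstract: Large-scale, high-quality bilingual parallel corpora are an important foundation for building high-quality statistical machine translation systems, but noise in a corpus degrades the performance of such systems, so the sentence pairs in a large corpus need to be filtered. In contrast to traditional ranking models for corpus selection, this paper proposes a classification-based method for selecting parallel corpora. A small number of sentence-pair features are first used to construct classifier training sentence pairs that differ widely from one another; a classifier is then trained on these pairs using more sentence-pair features, and the remaining unclassified sentence pairs are classified with it. Compared with the baseline system, our method not only reduces the size of the training corpus by 40% but also improves the BLEU score on the NIST test sets by 0.87 points.

Keywords: statistical machine translation; parallel corpus selection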

Selection of Parallel Corpus Based on Classification
WANG Xing,TU Zhaopeng,XIE Jun,LV Yajuan,YAO Jianmin. Selection of Parallel Corpus Based on Classification[J]. Journal of Chinese Information Processing, 2013, 27(6): 144-151
Authors: WANG Xing, TU Zhaopeng, XIE Jun, LV Yajuan, YAO Jianmin
Affiliation: 1. School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006; 2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 3. Department of Computer Science, University of California, Davis, California 95616
Abstract: Large-scale bilingual corpora are a fundamental resource for building high-quality statistical machine translation systems. However, such corpora usually contain a large amount of noise, which degrades the performance of the translation system. It is therefore essential to filter out noisy sentence pairs. In this paper, we propose a classification-based selection approach that distinguishes high-quality bilingual sentence pairs from noisy ones. We first use several metrics to find the best and worst sentence pairs in the corpus. We then train a classifier on these sentence pairs with more features and use it to classify the remaining sentence pairs. Experimental results show that our approach not only eliminates the 40% least promising sentence pairs but also significantly improves translation performance, by 0.87 BLEU points over using all sentence pairs.
Keywords: statistical machine translation; bilingual corpus selection
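
The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the metrics, the features, or the classifier, so everything concrete here is an assumption made for illustration: the length-ratio seed metric, the shared-token and length features, the use of logistic regression, and the function names `seed_score`, `features`, `train`, and `select` are all hypothetical and are not taken from the paper.

```python
import math

def seed_score(pair):
    """Cheap seed metric (assumed): how close the token-length ratio is to 1."""
    src, tgt = pair
    ls, lt = len(src.split()), len(tgt.split())
    return min(ls, lt) / max(ls, lt, 1)

def features(pair):
    """Richer feature vector (assumed): bias, length ratio, shared-token rate, length."""
    src, tgt = pair
    s, t = set(src.split()), set(tgt.split())
    shared = len(s & t) / max(len(s | t), 1)
    shorter = min(len(src.split()), len(tgt.split()))
    return [1.0, seed_score(pair), shared, shorter / 10.0]

def predict(weights, x):
    """Probability that a sentence pair is high quality under the logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, labels, epochs=200, lr=0.5):
    """Logistic regression fitted with plain stochastic gradient ascent."""
    weights = [0.0] * len(examples[0])
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            p = predict(weights, x)
            weights = [w + lr * (y - p) * xi for w, xi in zip(weights, x)]
    return weights

def select(corpus, seed_size):
    """Stage 1: take the best and worst pairs by the seed metric as training data.
    Stage 2: train on the richer features and classify the remaining pairs."""
    ranked = sorted(corpus, key=seed_score)
    worst = ranked[:seed_size]
    best = ranked[len(ranked) - seed_size:]
    rest = ranked[seed_size:len(ranked) - seed_size]
    x = [features(p) for p in worst + best]
    y = [0] * len(worst) + [1] * len(best)
    weights = train(x, y)
    kept = [p for p in rest if predict(weights, features(p)) >= 0.5]
    return best + kept
```

Calling `select(corpus, seed_size)` on a list of `(source, target)` string pairs returns the positive seed pairs plus every remaining pair the classifier accepts. A real system would replace the toy features with translation-model or language-model scores and choose the seed size to suit the corpus; the sketch only shows the control flow of seeding, training, and classifying.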