
Selection of Parallel Corpus Based on Classification

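Cite this article:	WANG Xing, TU Zhaopeng, XIE Jun, LV Yajuan, YAO Jianmin. Selection of Parallel Corpus Based on Classification[J]. Journal of Chinese Information Processing, 2013, 27(6): 144-151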

Authors:	WANG Xing, TU Zhaopeng, XIE Jun, LV Yajuan, YAO Jianmin

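Affiliations:	1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; 2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 3. Department of Computer Science, University of California, Davis, California 95616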

Funding:	National 863 Program major project (No. 2011AA01A207); National Natural Science Foundation of China (Nos. 61003152, 61272259).

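Abstract:	A large-scale, high-quality bilingual parallel corpus is an important foundation for building a high-quality statistical machine translation system, but noise in the corpus degrades the system's performance, so the sentence pairs in a large corpus need to be filtered. Unlike traditional ranking models for corpus selection, this paper proposes a classification-based method for selecting parallel corpora. A small number of sentence-pair features are first used to construct classifier training pairs that differ widely in quality; the classifier is then trained on these pairs with a larger set of sentence-pair features and used to classify the remaining unlabeled sentence pairs. Compared with the baseline system, our method not only reduces the size of the training corpus by 40% but also raises the BLEU score on the NIST test sets by 0.87 points.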
Keywords:	statistical machine translation; parallel corpus selection
Selection of Parallel Corpus Based on Classification

WANG Xing, TU Zhaopeng, XIE Jun, LV Yajuan, YAO Jianmin. Selection of Parallel Corpus Based on Classification[J]. Journal of Chinese Information Processing, 2013, 27(6): 144-151

Authors:	WANG Xing, TU Zhaopeng, XIE Jun, LV Yajuan, YAO Jianmin

Affiliation:	1. School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006; 2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 3. University of California, Davis, Department of Computer Science, California 95616

Abstract:	A large-scale bilingual corpus is a fundamental resource for building a high-quality statistical machine translation system. However, such corpora usually contain a large amount of noise, which degrades translation performance, so it is essential to filter out noisy sentence pairs. In this paper, we propose a classification-based selection approach that distinguishes high-quality bilingual sentence pairs from noisy ones. We first use a few metrics to find the best and worst sentence pairs in the corpus. We then train a classifier on these pairs with a larger feature set and use it to classify the remaining sentence pairs. Experimental results show that our approach not only eliminates the less promising 40% of the sentence pairs, but also improves translation performance by 0.87 BLEU points over using all of them.

Keywords:	statistical machine translation; bilingual corpus selection
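
The two-stage selection described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the features, the classifier, or the seed sizes, so the use of the first two feature columns as the "few" seed features, the 20% seed fraction, the plain logistic regression, and the function names `pick_seeds`, `train_logreg`, and `select_corpus` are all assumptions made for the example. Each sentence pair is assumed to be already represented as a list of numeric quality features where higher means better.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, bias, x):
    """Probability that feature vector x belongs to the high-quality class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def pick_seeds(features, n_seed_feats=2, frac=0.2):
    """Stage 1: rank pairs by a few features; take the top and bottom slices.

    Assumption: the seed score is the sum of the first n_seed_feats columns.
    """
    if len(features) < 2 or not 0 < frac <= 0.5:
        raise ValueError("need at least 2 pairs and 0 < frac <= 0.5")
    order = sorted(range(len(features)),
                   key=lambda i: sum(features[i][:n_seed_feats]))
    k = max(1, int(len(features) * frac))
    return order[-k:], order[:k]  # (best indices, worst indices)


def train_logreg(X, y, epochs=200, lr=0.5):
    """Stage 2: fit logistic regression on ALL features with plain SGD."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            g = score(weights, bias, x) - t  # gradient of log loss w.r.t. logit
            weights = [w - lr * g * xi for w, xi in zip(weights, x)]
            bias -= lr * g
    return weights, bias


def select_corpus(features, n_seed_feats=2, frac=0.2):
    """Return sorted indices of the sentence pairs kept for training."""
    best, worst = pick_seeds(features, n_seed_feats, frac)
    X = [features[i] for i in best + worst]
    y = [1] * len(best) + [0] * len(worst)
    weights, bias = train_logreg(X, y)
    seeds = set(best) | set(worst)
    kept = set(best)  # seed positives are kept, seed negatives are dropped
    for i, x in enumerate(features):
        if i not in seeds and score(weights, bias, x) >= 0.5:
            kept.add(i)  # stage 3: classify the remaining pairs
    return sorted(kept)
```

Calling `select_corpus(features)` on a list of per-pair feature vectors returns the indices of the pairs to keep. The point of the design is that the seed labels come from a cheap ranking on a few features, while the final decision for the bulk of the corpus uses the full feature set.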
