Automatic extraction of translations from web-based bilingual materials |
| |
Authors: | Qibo Zhu, Diana Inkpen, Ash Asudeh |
| |
Affiliation: | (1) Statistics Canada, Ottawa, Canada; (2) Institute of Cognitive Science, Carleton University, Ottawa, Canada; (3) School of Information Technology & Engineering, University of Ottawa, Ottawa, Canada; (4) School of Linguistics and Applied Language Studies, Carleton University, Ottawa, Canada |
| |
Abstract: | This paper describes the framework of the StatCan Daily Translation Extraction System (SDTES), a computer system that maps
and compares web-based translation texts of Statistics Canada (StatCan) news releases in the StatCan publication The Daily. The goal is to extract translations for translation memory systems, for translation terminology building, for cross-language
information retrieval and for corpus-based machine translation systems. Three years of officially published statistical news
release texts at were collected to compose the StatCan Daily data bank. The English and French texts in this collection were roughly aligned using the Gale-Church statistical algorithm.
After this, boundary markers of text segments and paragraphs were adjusted and the Gale-Church algorithm was run a second
time for a more fine-grained text segment alignment. To detect misaligned areas of texts and to prevent mismatched translation
pairs from being selected, key textual and structural properties of the mapped texts were automatically identified and used
as anchoring features for comparison and misalignment detection. The proposed method has been tested with web-based bilingual
materials from five other Canadian government websites. Results show that the SDTES model is very efficient in extracting
translations from published government texts, and very accurate in identifying mismatched translations. With parameters tuned,
the text-mapping part can be used to align corpus data collected from official government websites; and the text-comparing
component can be applied in prepublication translation quality control and in evaluating the results of statistical machine
translation systems. |
| |
Keywords: | Automatic translation extraction; Bitext mapping; Machine translation; Parallel alignment; Translation memory system |
This article is indexed in SpringerLink and other databases. |
|
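The abstract describes aligning English and French text segments with the Gale-Church algorithm, which pairs segments mainly by comparing their lengths. The following is a minimal sketch of that idea, not the SDTES implementation: the function names (`length_cost`, `align`), the pattern penalties in `PATTERNS`, and the simple length-difference cost are all illustrative assumptions. The published Gale-Church method uses a probabilistic cost based on a modelled length ratio between the two languages, which this sketch replaces with a plain normalised difference.

```python
import math

# Allowed alignment patterns as (source segments, target segments), each with
# a fixed penalty. These values are illustrative assumptions, not the priors
# used by Gale-Church or by SDTES.
PATTERNS = {
    (1, 1): 0.0,
    (1, 0): 4.5,
    (0, 1): 4.5,
    (2, 1): 2.3,
    (1, 2): 2.3,
    (2, 2): 2.8,
}


def length_cost(src_len, tgt_len):
    """Stand-in length cost: 0 for equal character counts, growing with the gap."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    mean = (src_len + tgt_len) / 2.0
    return abs(src_len - tgt_len) / math.sqrt(mean)


def align(src, tgt):
    """Return a list of (source_group, target_group) pairs with minimal total cost."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Dynamic programme over prefixes: cost[i][j] is the cheapest alignment
    # of the first i source segments with the first j target segments.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in PATTERNS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(s_len, t_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the chosen groups.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Under these assumptions, calling `align(english_segments, french_segments)` on two lists of strings returns groups that cover both lists in order. A group with two segments on one side and one on the other marks a place where a sentence was split or merged in translation, and a group with an empty side marks text with no counterpart. A second pass over adjusted segment boundaries, as the abstract describes, would amount to calling the same function again on the re-segmented lists.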