10 similar documents retrieved; search took 156 ms.
1.
We report on a project to annotate biblical texts in order to create an aligned multilingual Bible corpus for linguistic research, particularly computational linguistics, including automatically creating and evaluating translation lexicons and semantically tagged texts. The output of this project will enable researchers to take advantage of parallel translations across a larger number of languages than previously available, providing, with relatively little effort, a corpus that contains careful translations and reliable alignment at the near-sentence level. We discuss the nature of the text, our annotation process, preliminary and planned uses for the corpus, and relevant aspects of the Corpus Encoding Standard (CES) with respect to this corpus. We also present a quantitative comparison with dictionary and corpus resources for modern-day English, confirming the relevance of this corpus for research on present-day language.
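Because Bible translations share a canonical book/chapter/verse numbering, verse-level alignment can be sketched as a simple key intersection. The sketch below is illustrative only; the verse texts and the `align_verses` helper are invented for this example and are not part of the project's actual tooling.

```python
# Sketch: align two Bible translations at the verse level by using
# (book, chapter, verse) tuples as alignment keys. Sample data is
# invented for illustration.

def align_verses(corpus_a, corpus_b):
    """Return (verse_id, text_a, text_b) for every verse present in both."""
    shared = sorted(corpus_a.keys() & corpus_b.keys())
    return [(vid, corpus_a[vid], corpus_b[vid]) for vid in shared]

english = {
    ("Gen", 1, 1): "In the beginning God created the heaven and the earth.",
    ("Gen", 1, 2): "And the earth was without form, and void.",
}
french = {
    ("Gen", 1, 1): "Au commencement, Dieu crea les cieux et la terre.",
    ("Gen", 1, 3): "Dieu dit: Que la lumiere soit!",
}

pairs = align_verses(english, french)
# Only ("Gen", 1, 1) appears in both translations here.
```

Verses missing from one translation simply drop out of the intersection, which is one reason the abstract describes the alignment as reliable only at the "near-sentence" level.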
2.
Kikui G., Yamamoto S., Takezawa T., Sumita E. 《IEEE Transactions on Audio, Speech, and Language Processing》2006,14(5):1674-1682
This paper investigates issues in preparing corpora for developing speech-to-speech translation (S2ST). It is impractical to create a broad-coverage parallel corpus only from dialog speech. An alternative approach is to have bilingual experts write conversational-style texts in the target domain, with translations. There is, however, a risk of losing fidelity to the actual utterances. This paper focuses on balancing a tradeoff between these two kinds of corpora through the analysis of two newly developed corpora in the travel domain: a bilingual parallel corpus with 420 K utterances and a collection of in-domain dialogs using actual S2ST systems. We found that the first corpus is effective for covering utterances in the second corpus if complemented with a small number of utterances taken from monolingual dialogs. We also found that characteristics of in-domain utterances become closer to those of the first corpus when more restrictive conditions and instructions are given to speakers. These results suggest the possibility of a bootstrap-style development of corpora and S2ST systems, in which an initial S2ST system is developed with parallel texts and is then gradually improved with in-domain utterances collected by the system as restrictions are relaxed.
3.
This paper presents a historical Arabic corpus named HAC. At this early, embryonic stage of the project, we report on the design, the architecture, and some of the experiments we have conducted on HAC. The corpus, and accordingly the search results, will be represented in a primary XML exchange format. This will serve as an intermediate exchange tool within the project and will allow the user to process the results offline using external tools. HAC comprises Classical Arabic texts covering 1,600 years of language use, the Quranic text, Modern Standard Arabic texts, and a variety of monolingual Arabic dictionaries. The development of this historical corpus helps linguists and Arabic language learners to effectively explore, understand, and discover interesting knowledge hidden in millions of instances of language use. We used techniques from the field of natural language processing to process the data, and a graph-based representation for the corpus. We provide researchers with an export facility to make further linguistic analysis possible.
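A primary XML exchange format of the kind the abstract describes can be sketched with the standard library. The element and attribute names below (`searchResults`, `hit`, `source`, `year`) are assumptions for illustration, not the actual HAC schema.

```python
# Sketch: serialize corpus search results to a simple XML exchange
# format so they can be processed offline by external tools.
# Element/attribute names are illustrative assumptions.
import xml.etree.ElementTree as ET

def results_to_xml(query, hits):
    root = ET.Element("searchResults", query=query)
    for h in hits:
        hit = ET.SubElement(root, "hit",
                            source=h["source"], year=str(h["year"]))
        hit.text = h["text"]  # the matched instance of language use
    return ET.tostring(root, encoding="unicode")

doc = results_to_xml("kitab", [
    {"source": "Quran", "year": 632, "text": "..."},
])
```

Keeping the format flat and self-describing like this is what makes offline post-processing with external tools straightforward.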
4.
This paper describes the framework of the StatCan Daily Translation Extraction System (SDTES), a computer system that maps and compares web-based translation texts of Statistics Canada (StatCan) news releases in the StatCan publication The Daily. The goal is to extract translations for translation memory systems, for translation terminology building, for cross-language information retrieval, and for corpus-based machine translation systems. Three years of officially published statistical news release texts were collected to compose the StatCan Daily data bank. The English and French texts in this collection were roughly aligned using the Gale-Church statistical algorithm. After this, boundary markers of text segments and paragraphs were adjusted and the Gale-Church algorithm was run a second time for a more fine-grained text segment alignment. To detect misaligned areas of texts and to prevent mismatched translation pairs from being selected, key textual and structural properties of the mapped texts were automatically identified and used as anchoring features for comparison and misalignment detection. The proposed method has been tested with web-based bilingual materials from five other Canadian government websites. Results show that the SDTES model is very efficient in extracting translations from published government texts, and very accurate in identifying mismatched translations. With parameters tuned, the text-mapping part can be used to align corpus data collected from official government websites, and the text-comparing component can be applied in prepublication translation quality control and in evaluating the results of statistical machine translation systems.
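The Gale-Church alignment used above is length-based dynamic programming over "beads" (1-1, 1-0, 0-1, 2-1, 1-2 sentence groupings). The real algorithm scores beads with a Gaussian model of character-length ratios; the sketch below substitutes a simple absolute-difference cost for clarity, and the example lengths are invented.

```python
# Minimal length-based alignment sketch in the spirit of Gale-Church:
# dynamic programming over sentence lengths, allowing 1-1, 1-0, 0-1,
# 2-1 and 1-2 beads. Cost here is |sum of lengths difference| plus a
# penalty for deletion/insertion beads (a simplification of the
# probabilistic model in the original algorithm).

def align(lens_a, lens_b, skip_penalty=10):
    INF = float("inf")
    n, m = len(lens_a), len(lens_b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for da, db in beads:
                ni, nj = i + da, j + db
                if ni > n or nj > m:
                    continue
                la, lb = sum(lens_a[i:ni]), sum(lens_b[j:nj])
                c = abs(la - lb) + (skip_penalty if 0 in (da, db) else 0)
                if cost[i][j] + c < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + c
                    back[ni][nj] = (i, j)
    # Walk the back pointers to recover the bead sequence.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return list(reversed(path))

# Example: the second source sentence (length 40) is split across two
# target sentences (18 + 22), so the aligner should choose a 1-2 bead.
beads = align([20, 40], [21, 18, 22])
```

Running the algorithm twice, as the abstract describes, first aligns at a coarse granularity and then refines segment boundaries before realigning.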
5.
6.
Technical-term translation represents one of the most difficult tasks for human translators since (1) most translators are not familiar with terms and domain-specific terminology and (2) such terms are not adequately covered by printed dictionaries. This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups. Given any word which is part of a technical term in the source language, the algorithm produces a ranked candidate match for it in the target language. Potential translations for the term are compiled from the matched words and are also ranked. We show how this ranked list helps translators in technical-term translation. Most algorithms for lexical and term translation focus on Indo-European language pairs, and most use a sentence-aligned clean parallel corpus without insertion, deletion or OCR noise. Our algorithm is language- and character-set-independent, and is robust to noise in the corpus. We show how our algorithm requires minimal preprocessing and is able to obtain technical-word translations without sentence-boundary identification or sentence alignment, from the English–Japanese awk manual corpus with noise arising from text insertions or deletions, and on the English–Chinese HKUST bilingual corpus. We obtain a precision of 55.35% from the awk corpus for word translation including rare words, counting only the best candidate and direct translations. Translation precision of the best-candidate translation is 89.93% from the HKUST corpus. Potential term translations produced by the program help bilingual speakers achieve a 47% improvement in translating technical terms.
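Ranking translation candidates without sentence alignment can be illustrated by a positional-signal heuristic: split each side of the corpus into the same number of segments, build a per-segment frequency vector for every word, and rank target words by cosine similarity with the source word's vector. This is a deliberately simplified stand-in for the paper's actual method, and the toy bilingual data is invented.

```python
# Sketch: rank target-language translation candidates for a source
# word by comparing per-segment frequency vectors, with no sentence
# alignment required. A crude illustration of using positional
# signals in noisy parallel corpora.
import math

def segment_freqs(tokens, k=4):
    """Frequency vector over k equal-sized segments for each word."""
    step = max(1, len(tokens) // k)
    vecs = {}
    for s in range(k):
        for tok in tokens[s * step:(s + 1) * step]:
            vecs.setdefault(tok, [0] * k)[s] += 1
    return vecs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(src_word, src_tokens, tgt_tokens, k=4):
    sv = segment_freqs(src_tokens, k)
    tv = segment_freqs(tgt_tokens, k)
    scores = [(cosine(sv[src_word], vec), w) for w, vec in tv.items()]
    return sorted(scores, reverse=True)

# Invented toy corpus: "zhengze" tracks "regex" positionally,
# "wenjian" does not.
src = ["regex", "regex", "file", "file", "regex", "file", "file", "file"]
tgt = ["zhengze", "zhengze", "wenjian", "wenjian",
       "zhengze", "wenjian", "wenjian", "wenjian"]
ranking = rank_candidates("regex", src, tgt)
```

Because the vectors depend only on rough position in the text, insertions, deletions, or OCR noise degrade the signal gracefully rather than breaking a hard sentence-to-sentence correspondence.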
7.
8.
Christine M. Tardy 《Computers and Composition》2005,22(3):319-336
Recent research has illuminated some of the ways in which multilingual writers project multiple identities in their writing, conveying disciplinary allegiances as well as more personal expressions of individuality. Such work has focused on the writers' uses of various verbal expressions, but has so far overlooked the ways in which they manipulate the visual mode as a means of identity expression. The present study examines expressions of identity in a corpus of multimodal texts written by four multilingual graduate student writers. I consider how the writers' uses of various verbal and visual expressions in their Microsoft PowerPoint presentation slides project both disciplinarity and individuality, and how each individual's habitus has been influenced by both the discourses they have encountered and their personal reactions to those discourses.
9.
Iria da Cunha Eric San Juan Juan Manuel Torres-Moreno Marina Lloberes Irene Castellón 《Expert Systems with Applications》2012,39(2):1671-1678
Nowadays, discourse parsing is a very prominent research topic; however, no discourse parser exists for Spanish texts. The first stage in developing this tool is discourse segmentation. In this work, we present DiSeg, the first discourse segmenter for Spanish, which uses the framework of Rhetorical Structure Theory and is based on lexical and syntactic rules. We describe the system and evaluate its performance against a gold-standard corpus, divided into a medical and a terminological subcorpus. We obtain promising results, which shows that discourse segmentation is possible using shallow parsing.
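The lexical side of rule-based discourse segmentation can be sketched by splitting a sentence into elementary discourse units at a small set of Spanish discourse markers. The marker list and example sentence below are illustrative assumptions; the real DiSeg system also applies syntactic rules, which this sketch omits.

```python
# Sketch: lexical-rule discourse segmentation in the spirit of DiSeg.
# Each discourse marker opens a new segment; the marker list is a
# small illustrative subset.
import re

MARKERS = ["aunque", "porque", "sin embargo", "es decir", "mientras que"]
PATTERN = re.compile(r",?\s*\b(" + "|".join(MARKERS) + r")\b",
                     re.IGNORECASE)

def segment(sentence):
    """Return discourse segments; each marker starts a new segment."""
    parts = PATTERN.split(sentence)  # capturing group keeps the markers
    segments = [parts[0].strip()]
    for i in range(1, len(parts), 2):
        segments.append((parts[i] + " " + parts[i + 1].strip()).strip())
    return [s for s in segments if s]

segs = segment("El tratamiento fue eficaz, aunque algunos pacientes recayeron.")
# → ["El tratamiento fue eficaz", "aunque algunos pacientes recayeron."]
```

Purely lexical rules like these over-segment (markers inside quotations) and under-segment (clause boundaries with no marker), which is why the evaluated system combines them with syntactic rules from shallow parsing.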
10.
The Corpus of Electronic Texts (CELT) project at University College Cork is an on-line corpus of multilingual texts that are encoded in TEI-conformant SGML/XML. As of September 2006, the corpus has 9.3 million words online. Over the last five years, doctoral work carried out at the project has focused on the development of lexicographical resources spanning the years c. AD 700–1700, and on the development of tools to integrate the corpus with these resources. This research has been further complemented by the Linking Dictionaries and Text project, a North–South Ireland collaboration between the University of Ulster, Coleraine, and University College Cork, which will reach completion in October 2006. This article focuses on CELT's latest research project, the Digital Dinneen project, which aims to create an integrated edition of Patrick S. Dinneen's Foclóir Gaedhilge agus Béarla (Irish-English Dictionary). The article describes the newly developed research infrastructure, the culmination of the doctoral research carried out at CELT and the Linking Dictionaries and Text collaboration, and establishes how the Digital Dinneen will be integrated into that infrastructure. Finally, avenues of future research are pointed out.