20 similar documents found (search time: 109 ms)
1.
Liu Ying. Computer Applications and Software (计算机应用与软件), 2001, 18(11): 56–59
In machine translation, ambiguities frequently arise during part-of-speech tagging and syntactic-semantic analysis. Statistical lexical scoring and syntactic-semantic scoring are used to resolve the ambiguities produced in these two stages. A common problem encountered in statistical disambiguation is data sparseness. This paper applies an improved Turing formula to smooth the parameters of lexical scoring and syntactic-semantic scoring under data sparseness, presents the smoothing algorithm and its application to lexical scoring, and reports experimental results on corpus size, parameter counts, and accuracy.
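The Turing estimate underlying this kind of smoothing can be sketched as follows. This is a minimal, pure-Python illustration of the basic formula r* = (r+1)·N_{r+1}/N_r, not the paper's improved variant, which additionally smooths the N_r counts themselves:

```python
from collections import Counter

def turing_smoothed_counts(counts):
    """Apply the basic Turing estimate r* = (r+1) * N_{r+1} / N_r
    to a mapping of event -> observed count, returning smoothed counts.
    Counts whose N_{r+1} is zero are left unsmoothed here; an improved
    variant would instead fit a curve to the N_r values."""
    # N_r = number of distinct events observed exactly r times
    freq_of_freq = Counter(counts.values())
    smoothed = {}
    for event, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        smoothed[event] = (r + 1) * n_r1 / n_r if n_r1 else r
    return smoothed

# Singletons ("a", "b", "c") are discounted below 1, freeing probability
# mass for unseen events.
print(turing_smoothed_counts({"a": 1, "b": 1, "c": 1, "d": 2, "e": 3}))
```

Note how the discount on low counts is exactly what compensates for data sparseness: mass taken from rare observed events is what a full Good-Turing model reserves for unseen ones.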
2.
Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp. 632–641, 2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.
3.
To address the shortcomings of existing methods for identifying product attributes in Chinese customer reviews, this paper mines the linguistic knowledge embedded in real corpora using lexical analysis, syntactic parsing, a synonym thesaurus (Tongyici Cilin), and other techniques and resources, and proposes a template-based product attribute identification method. The method performs lexical analysis, syntactic parsing, and manual annotation on review corpora, analyzes and generalizes global linguistic rules for review sentences from the annotation results, and extracts the part-of-speech and dependency-relation sequences between attribute words and opinion words. Product attribute templates are then built with the help of the synonym thesaurus, and attributes are identified using these templates. Comparative experimental results demonstrate the effectiveness of the proposed method.
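A toy illustration of template-based attribute extraction. This sketch assumes a simplified part-of-speech scheme (n = noun, d = adverb, a = adjective) and two hand-written templates; the paper's templates are induced from annotated corpora and also exploit dependency relations and the synonym thesaurus:

```python
# Hypothetical POS-sequence templates: each pairs a noun (candidate
# product attribute) with the opinion expression expected to follow it.
TEMPLATES = [
    ("n", ["d", "a"]),   # e.g. "screen / very bright"
    ("n", ["a"]),        # e.g. "battery / weak"
]

def match_attributes(tagged_sentence):
    """tagged_sentence: list of (word, pos) pairs.
    Returns (attribute, opinion) pairs matched by the templates."""
    pairs = []
    tags = [t for _, t in tagged_sentence]
    for i, (word, tag) in enumerate(tagged_sentence):
        if tag != "n":
            continue
        for _, tail in TEMPLATES:
            # Compare the POS tags following the noun against the template.
            if tags[i + 1 : i + 1 + len(tail)] == tail:
                opinion = " ".join(w for w, _ in tagged_sentence[i + 1 : i + 1 + len(tail)])
                pairs.append((word, opinion))
                break
    return pairs

sent = [("screen", "n"), ("very", "d"), ("bright", "a"),
        (",", "w"), ("battery", "n"), ("weak", "a")]
print(match_attributes(sent))  # → [('screen', 'very bright'), ('battery', 'weak')]
```

Listing the longer template first means the most specific pattern wins, mirroring how rule-based extractors usually order their templates.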
4.
The growing availability of large on-line corpora encourages the study of word behaviour directly from accessible raw texts. However, the methods by which lexical knowledge should be extracted from plain texts are still a matter of debate and experimentation. In this paper we present an integrated tool for lexical acquisition from corpora, ARIOSTO, based on a hybrid methodology that combines typical NLP techniques, such as (shallow) syntax and semantic markers, with numerical processing. The lexical data extracted by this method, called clustered association data, are used for a variety of interesting purposes, such as the detection of selectional restrictions, the derivation of syntactic ambiguity criteria and the acquisition of taxonomic relations.
5.
Stephan Mehl. Machine Translation, 1996, 11(1–3): 185–216
The process of lexical choice usually consists of determining a single way of expressing a given content. In some cases such as gerund translation, however, there is no single solution; a choice must be made among several variants which differ in their syntactic behavior. Based on a bilingual corpus analysis, this paper explains first which factors influence the availability of variants. In a second step, some criteria for deciding on one or the other variant are discussed. It will be shown that the stylistic evaluation of the syntactic structures induced by alternative lexical items is of central importance in lexical choice. Finally, an implementation of the resulting model is described.
6.
7.
Technical-term translation represents one of the most difficult tasks for human translators since (1) most translators are not familiar with terms and domain-specific terminology and (2) such terms are not adequately covered by printed dictionaries. This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups. Given any word which is part of a technical term in the source language, the algorithm produces a ranked candidate match for it in the target language. Potential translations for the term are compiled from the matched words and are also ranked. We show how this ranked list helps translators in technical-term translation. Most algorithms for lexical and term translation focus on Indo-European language pairs, and most use a sentence-aligned clean parallel corpus without insertion, deletion or OCR noise. Our algorithm is language- and character-set-independent, and is robust to noise in the corpus. We show how our algorithm requires minimum preprocessing and is able to obtain technical-word translations without sentence-boundary identification or sentence alignment, from the English–Japanese awk manual corpus with noise arising from text insertions or deletions and on the English–Chinese HKUST bilingual corpus. We obtain a precision of 55.35% from the awk corpus for word translation including rare words, counting only the best candidate and direct translations. Translation precision of the best-candidate translation is 89.93% from the HKUST corpus. Potential term translations produced by the program help bilingual speakers to get a 47% improvement in translating technical terms.
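Ranking translation candidates without sentence alignment can be sketched with segment-level co-occurrence statistics. This toy version scores target words by the Dice coefficient of their co-occurrence with the source word across rough segment pairs; it is a stand-in illustration, not the published algorithm:

```python
from collections import Counter

def rank_translations(segment_pairs, source_word):
    """segment_pairs: list of (source_tokens, target_tokens) from roughly
    aligned segments (no sentence alignment required). Ranks target words
    by the Dice coefficient of segment-level co-occurrence with source_word."""
    src_df = 0                       # segments containing the source word
    tgt_df, co_df = Counter(), Counter()
    for src, tgt in segment_pairs:
        has_src = source_word in src
        src_df += has_src
        for w in set(tgt):
            tgt_df[w] += 1
            if has_src:
                co_df[w] += 1
    scores = {w: 2 * co_df[w] / (src_df + tgt_df[w]) for w in co_df}
    return sorted(scores.items(), key=lambda kv: -kv[1])

pairs = [
    (["print", "line"], ["imprimer", "ligne"]),
    (["print", "file"], ["imprimer", "fichier"]),
    (["delete", "file"], ["supprimer", "fichier"]),
]
print(rank_translations(pairs, "print")[0])  # → ('imprimer', 1.0)
```

Because the score only needs co-occurrence at the segment level, it tolerates the insertion/deletion noise that breaks sentence-aligned methods.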
8.
Ananthakrishnan, S.; Narayanan, S. S. IEEE Transactions on Audio, Speech, and Language Processing, 2008, 16(1): 216–228
With the advent of prosody annotation standards such as tones and break indices (ToBI), speech technologists and linguists alike have been interested in automatically detecting prosodic events in speech. This is because the prosodic tier provides an additional layer of information over the short-term segment-level features and lexical representation of an utterance. As the prosody of an utterance is closely tied to its syntactic and semantic content in addition to its lexical content, knowledge of the prosodic events within and across utterances can assist spoken language applications such as automatic speech recognition and translation. On the other hand, corpora annotated with prosodic events are useful for building natural-sounding speech synthesizers. In this paper, we build an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates. Following previous work in this area, we focus on accent (prominence, or "stress") and prosodic phrase boundary detection at the syllable level. Our experiments achieved a performance rate of 86.75% agreement on the accent detection task, and 91.61% agreement on the phrase boundary detection task on the Boston University Radio News Corpus.
9.
10.
11.
In this paper, we present a typology of ambiguity in Chinese, which includes morphological, lexical, syntactic, semantic, and contextual ambiguities. Examples are shown for each type of ambiguity and sometimes for subtypes. Ambiguity resolution strategies used in the ALICE machine translation system are presented in various levels of detail. A disambiguation model, called Four-Step, is proposed for resolving syntactic ambiguities involving serial verb construction and predication. As the name suggests, the model comprises four steps: well-formedness checking, preference for argument readings, precondition checking, and late closure. For resolving semantic ambiguity, we propose a new formalism, called Semantic Functional Grammar (SFG), to deal with the resolution problem. SFG integrates the concept of Semantic Grammar into Lexical-Functional Grammar (LFG) such that the functional structures (f-structures) include semantic functions in addition to grammatical functions. For dealing with lexical and contextual ambiguities, we briefly describe the mechanisms used in the ALICE system. As for morphological ambiguity, the resolution is a problem of word-boundary decision (segmentation) and is beyond the scope of this research. The mechanisms presented in the paper have been successfully applied to the translation of Chinese news headlines in the ALICE system. This research was supported partly by the Industrial Technology Research Institute, Taiwan, under a grant for doctoral study to this author.
12.
This paper describes a method for automatic detection of semantic relations between concept nodes of a networked ontological knowledge base by analyzing matrices of semantic-syntactic valences of words. These matrices are obtained by means of nonnegative factorization of tensors of syntactic compatibility of words. Such tensors are generated in the course of frequency analysis of syntactic structures of sentences taken from large text corpora of English Wikipedia and Simple English Wikipedia entries.
13.
Automatic alignment of bilingual corpora has become an important research topic in machine translation. Current sentence alignment methods are either length-based or lexicon-based. This paper first analyzes the length-based approach and then proposes a translation-based method: a relatively complete bilingual dictionary is used as a bridge to connect corresponding English and Chinese sentences. For each word in the English text, its translation is looked up in the dictionary and matched against the Chinese sentences, and aligned sentence pairs are found using an evaluation function and a dynamic programming algorithm. Experimental results show that this alignment method eliminates the error propagation seen in length-based approaches and greatly improves alignment precision, with satisfactory results.
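The dictionary-bridged alignment described above can be sketched with a small dynamic program. This toy version assumes 1-to-1 beads only (no 1-2 or 2-1 merges, which a real aligner would also score) and a word-to-word dictionary; the evaluation function is simply the fraction of English words whose dictionary translation appears in the Chinese sentence:

```python
def align_sentences(en_sents, zh_sents, dictionary):
    """Align English and Chinese sentence lists using a bilingual
    dictionary as the bridge. DP finds the monotone 1-to-1 alignment
    maximizing the total match score; unmatched sentences are skipped."""
    def score(en, zh):
        words = en.split()
        hits = sum(1 for w in words
                   if dictionary.get(w.lower()) and dictionary[w.lower()] in zh)
        return hits / max(len(words), 1)

    m, n = len(en_sents), len(zh_sents)
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cands = [
                (best[i - 1][j - 1] + score(en_sents[i - 1], zh_sents[j - 1]), "match"),
                (best[i - 1][j], "skip_en"),
                (best[i][j - 1], "skip_zh"),
            ]
            # max with a key keeps "match" on ties (it is listed first)
            best[i][j], back[i][j] = max(cands, key=lambda c: c[0])
    # trace back the aligned (en_index, zh_index) pairs
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_en":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

endict = {"i": "我", "love": "爱", "china": "中国", "he": "他", "eats": "吃"}
print(align_sentences(["I love China", "He eats"], ["我爱中国", "他吃"], endict))
# → [(0, 0), (1, 1)]
```

Because each pair is scored independently by dictionary evidence, a single bad decision does not shift every subsequent alignment, which is the error-propagation weakness of purely length-based methods.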
14.
15.
Mireia Farrús, Marta R. Costa-jussà, José B. Mariño, Marc Poch, Adolfo Hernández, Carlos Henríquez, José A. R. Fonollosa. Language Resources and Evaluation, 2011, 45(2): 181–208
This work aims to improve an N-gram-based statistical machine translation system between the Catalan and Spanish languages, trained with an aligned Spanish–Catalan parallel corpus consisting of 1.7 million sentences taken from the El Periódico newspaper. Starting from a linguistic error analysis of this baseline system, orthographic, morphological, lexical, semantic and syntactic problems are approached using a set of techniques. The proposed solutions include the development and application of additional statistical techniques, text pre- and post-processing tasks, and rules based on the use of grammatical categories, as well as lexical categorization. The performance of the improved system is clearly increased, as is shown in both human and automatic evaluations of the system, with a gain of about 1.1 BLEU points observed in the Spanish-to-Catalan direction of translation, and a gain of about 0.5 points in the reverse direction. The final system is freely available online as a linguistic resource.
16.
17.
This paper analyzes, from a lexical-semantic perspective, several divergence phenomena between Chinese and English in machine translation, and examines the validity and soundness of a set of semantic-to-syntactic mapping operations. It also discusses methods for handling these divergences through variant mapping sets, variant-type indicators, and a parameter unification mechanism. These methods improve translation accuracy, so that the generated sentences both correctly express the semantics of the source interlingua and conform to the expressive conventions of the target language.
18.
19.
20.
Because terms in historical classics are often polysemous and no word segmentation algorithm exists for classical Chinese, automatically extracting term translation pairs through alignment over bilingual parallel corpora is very difficult. To address these problems, this paper proposes a subword-based maximum entropy model for aligning terms in the classics. The method combines two kinds of statistical information to extract characters that frequently occur together as subwords, and segments the classics with these subwords, overcoming the lack of a classical-Chinese segmenter. To handle the polysemy of classic terms, a transliteration feature function is formulated from the transliteration patterns of such terms and combined with other features in a maximum entropy model to determine term translations. Experiments on a bilingual parallel corpus of the Records of the Grand Historian (《史记》) show that the subword-based method far outperforms the method without subwords, and that the maximum entropy model combining three features effectively improves term alignment accuracy.
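The subword extraction step can be illustrated by combining two simple statistics, character-bigram frequency and pointwise mutual information, as stand-ins for the two measures the paper combines; the thresholds and greedy segmenter here are illustrative assumptions, not the published method:

```python
import math
from collections import Counter

def extract_subwords(corpus, min_count=2, min_pmi=1.0):
    """Extract character bigrams that co-occur frequently enough
    (raw frequency) and strongly enough (PMI) to be treated as
    subword units. corpus: list of strings."""
    chars, bigrams = Counter(), Counter()
    for line in corpus:
        chars.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))
    total_c = sum(chars.values())
    total_b = sum(bigrams.values())
    subwords = set()
    for bg, n in bigrams.items():
        if n < min_count:
            continue
        p_xy = n / total_b
        p_x, p_y = chars[bg[0]] / total_c, chars[bg[1]] / total_c
        if math.log(p_xy / (p_x * p_y)) >= min_pmi:
            subwords.add(bg)
    return subwords

def segment(text, subwords):
    """Greedy left-to-right segmentation preferring known subwords."""
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in subwords:
            out.append(text[i:i + 2])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return out

corpus = ["項羽者下相人也", "項羽乃悉引兵渡河", "項羽召見諸侯將"]
sw = extract_subwords(corpus)
print(sw, segment("項羽引兵", sw))
```

On this tiny sample only "項羽" (the recurring name Xiang Yu) survives both thresholds, so the segmenter keeps it whole while splitting everything else into single characters, which is exactly the behavior needed when no classical-Chinese word segmenter exists.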