Similar documents
1.
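Current research on sentence summarization mainly targets the monolingual setting, in which the source sentence and the target summary phrase are in the same language; however, monolingual sentence summarization severely limits rapid access to textual information across languages. To address this problem, a cross-lingual sentence summarization system is proposed. Borrowing the idea of back-translation, the source side of a monolingual sentence-summarization parallel corpus is translated into another language by a neural machine translation system and paired with the target-side summary phrases of that corpus to form a cross-lingual pseudo-parallel corpus. On this basis, a contrastive attention mechanism is used to capture the information in the source sequence that is irrelevant to the target, which resolves the length mismatch between source and target sentences in the conventional attention mechanism. Experimental results show that, compared with a pipeline-based monolingual sentence summarization system, the cross-lingual system generates summary phrases that are more fluent and closer to human expression, and reaches a level close to that of monolingual sentence summarization.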

2.
One may indicate the potentials of an MT system by stating what text genres it can process, e.g., weather reports and technical manuals. This approach is practical, but misleading, unless domain knowledge is highly integrated in the system. Another way to indicate which fragments of language the system can process is to state its grammatical potentials, or more formally, which languages the grammars of the system can generate. This approach is more technical and less understandable to the layman (customer), but it is less misleading, since it stresses the point that the fragments which can be translated by the grammars of a system need not necessarily coincide exactly with any particular genre. Generally, the syntactic and lexical rules of an MT system allow it to translate many sentences other than those belonging to a certain genre. On the other hand, it probably cannot translate all the sentences of a particular genre. Swetra is a multilanguage MT system defined by the potentials of a formal grammar (standard referent grammar) and not by reference to a genre. Successful translation of sentences can be guaranteed if they are within a specified syntactic format based on a specified lexicon. The paper discusses the consequences of this approach (Grammatically Restricted Machine Translation, GRMT) and describes the limits set by a standard choice of grammatical rules for sentences and clauses, noun phrases, verb phrases, sentence adverbials, etc. Such rules have been set up for English, Swedish and Russian, mainly on the basis of familiarity (frequency) and computer efficiency, but restricting the grammar and making it suitable for several languages poses many problems for optimization. Sample texts (newspaper reports) illustrate the type of text that can be translated with reasonable success among Russian, English and Swedish.

3.
Learning Translation Templates from Bilingual Translation Examples   Total citations: 9, self-citations: 1, citations by others: 8
A mechanism for learning lexical correspondences between two languages from sets of translated sentence pairs is presented. These lexical level correspondences are learned using analogical reasoning between two translation examples. Given two translation examples, the similar parts of the sentences in the source language must correspond to the similar parts of the sentences in the target language. Similarly, the different parts must correspond to the respective parts in the translated sentences. The correspondences between similarities and between differences are learned in the form of translation templates. A translation template is a generalized translation exemplar pair where some components are generalized by replacing them with variables in both sentences and establishing bindings between these variables. The learned translation templates are obtained by replacing differences or similarities by variables. This approach has been implemented and tested on a set of sample training datasets and produced promising results for further investigation.
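The idea of replacing the differing parts of two translation examples with a shared variable can be sketched as follows. This is only a minimal illustration under strong assumptions, not the authors' algorithm: it assumes whitespace tokenization, exactly one differing span per sentence (found by stripping the common prefix and suffix), and a single variable named X. The function names `split_on_difference` and `learn_template` and the English-Spanish toy sentences are hypothetical.

```python
def split_on_difference(a, b):
    """Split two token lists into (common prefix, diff in a, diff in b, common suffix)."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    j = 0
    while j < len(a) - i and j < len(b) - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    return a[:i], a[i:len(a) - j], b[i:len(b) - j], a[len(a) - j:]


def learn_template(example1, example2):
    """Generalize two (source, target) examples into one template with variable X.

    Returns (template, correspondences), or None when either side has no
    non-empty differing span in both examples (assumption of this sketch).
    """
    (src1, tgt1), (src2, tgt2) = example1, example2
    s_pre, s_diff1, s_diff2, s_suf = split_on_difference(src1.split(), src2.split())
    t_pre, t_diff1, t_diff2, t_suf = split_on_difference(tgt1.split(), tgt2.split())
    if not (s_diff1 and s_diff2 and t_diff1 and t_diff2):
        return None
    # Similar parts stay literal; the differing parts become the bound variable X.
    template = (" ".join(s_pre + ["X"] + s_suf), " ".join(t_pre + ["X"] + t_suf))
    # The differing parts are assumed to translate each other.
    correspondences = [
        (" ".join(s_diff1), " ".join(t_diff1)),
        (" ".join(s_diff2), " ".join(t_diff2)),
    ]
    return template, correspondences


result = learn_template(("I see the house", "veo la casa"),
                        ("I see the dog", "veo el perro"))
print(result)
```

In this toy run the similar parts ("I see the" and "veo") are kept in the template, and the differing parts are paired up as lexical correspondences. Handling several differing spans per sentence would need a real alignment step, which this sketch leaves out.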

5.
There is a significant lexical difference (in words and in the usage of words) between spontaneous/colloquial language and the written language. This difference affects the performance of spoken language recognition systems that use statistical language models or context-free grammars because these models are based on the written language rather than the spoken form. There are many filler phrases and colloquial phrases that appear solely or more often in spontaneous and colloquial speech. Chinese languages perhaps exemplify such a difference as many colloquial forms of the language, such as Cantonese, exist strictly in spoken forms and are different from the written standard Chinese, which is based on Mandarin. A conventional way of dealing with this issue is to add colloquial terms manually to the lexicon. However, this is time-consuming and expensive. Meanwhile, supervised learning requires manual tagging of large corpora, which is also time-consuming. We propose an unsupervised learning method to find colloquial terms and classify filler and content phrases in spontaneous and colloquial Chinese, including Cantonese. We propose using frequency, strength, and spread measures of character pairs and groups to automatically extract frequent, out-of-vocabulary colloquial terms to add to a standard Chinese lexicon. An unsegmented and unannotated corpus is segmented with the augmented lexicon. We then propose a Markov classifier to classify Chinese characters into either content or filler phrases in an iterative training method. This method is task-independent and can extract even mixed language terms. We show the effectiveness of our method by both a natural language query processing task and an adaptive Cantonese language-modeling task. The precision for content phrase extraction and classification is around 80%, with a recall of 99%, and the precision for filler phrase extraction and classification is around 99.5% with a recall of approximately 89%. The web search precision using these extracted content words is comparable to that of the search results with content phrases selected by humans. We adapt a language model trained from written texts with the Hong Kong Newsgroup corpus. It outperforms both the standard Chinese language model and the Cantonese language model. It also performs better than the language model trained simply by concatenating two sets of standard and colloquial texts.

6.
An implementation and the functioning of a synchronized binary linear tree are described. A structure of the so-called time line is proposed; the structure is used at the discourse level to process semantic connections between elements of various sentences. The structure of a system that processes texts in natural languages is considered. It is assumed that semiotic feedback can lead to an internal monologue and self-consciousness. Based on the synchronized linear tree, a realization of semiotic feedback is described. It is shown that a model of the above-mentioned internal monologue can be used to synthesize phrases in natural languages.

7.
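Since the advent of statistical machine translation, reordering has been a key problem in translation systems for language pairs with markedly different word order, and reordering methods trained on large-scale corpora have been widely studied. Chinese-Mongolian bilingual corpus resources are currently very limited, so existing reordering methods that depend on large-scale corpora and linguistic knowledge struggle to perform well. This paper analyzes the related work and proposes a reordering method for Chinese-Mongolian statistical machine translation under limited-corpus conditions. Based on linguistic knowledge, the method identifies the phrase types that most strongly affect target word order, studies reordering schemes for these phrase types, and integrates them into an existing reordering model to optimize reordering. Experiments show that the method yields a significant improvement under limited-corpus conditions.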

8.
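Automatic translation of patent documents is an important application area of machine translation, and translating complex long sentences is a difficulty in Chinese-English machine translation. This study aims to find formal transformation rules for clause conversion in Chinese-English complex long sentences. A Chinese complex long sentence contains several clauses, each of which stands independently, but when translated into English there is generally only one core clause, and the other clauses are converted into other forms such as "doing" or "to do" constructions, subordinate clauses, or phrases. Using a corpus of 1,300 sentences from Chinese-English bilingual patent documents, the paper classifies how Chinese clauses are transformed when translated into English, describes activation features from the perspective of inter-clause relations and sharing relations, groups the transformations into five types, and proposes twelve transformation rules. Experiments on a small-scale corpus show that the rules are feasible and effective. Future work will expand the corpus, mine and analyze it in more depth, and verify the practicality of the rules on a larger corpus.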

9.
In microblogs, authors use hashtags to mark keywords or topics. These manually labeled tags can be used to benefit various live social media applications (e.g., microblog retrieval, classification). However, because only a small portion of microblogs contain hashtags, recommending hashtags for use in microblogs is a worthwhile exercise. In addition, human inference often relies on the intrinsic grouping of words into phrases. However, existing work uses only unigrams to model corpora. In this work, we propose a novel phrase-based topical translation model to address this problem. We use the bag-of-phrases model to better capture the underlying topics of posted microblogs. We regard the phrases and hashtags in a microblog as two different languages that are talking about the same thing. Thus, the hashtag recommendation task can be viewed as a translation process from phrases to hashtags. To handle the topical information of microblogs, the proposed model regards translation probability as being topic specific. We test the methods on data collected from real-world microblogging services. The results demonstrate that the proposed method outperforms state-of-the-art methods that use the unigram model.

10.

Abstractive Text Summarization (ATS) is the task of constructing summary sentences by merging facts from different source sentences and condensing them into a shorter representation while preserving information content and overall meaning. It is very difficult and time-consuming for human beings to manually summarize large documents of text. In this paper, we propose an LSTM-CNN based ATS framework (ATSDL) that can construct new sentences by exploring more fine-grained fragments than sentences, namely, semantic phrases. Unlike existing abstraction-based approaches, ATSDL is composed of two main stages, the first of which extracts phrases from source sentences and the second of which generates text summaries using deep learning. Experimental results on the CNN and DailyMail datasets show that our ATSDL framework outperforms the state-of-the-art models in terms of both semantics and syntactic structure, and achieves competitive results on manual linguistic quality evaluation.


11.
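Analysis of Temporal Phrases in Chinese Discourse and Tense Calculation   Total citations: 5, self-citations: 0, citations by others: 5
In Chinese-English machine translation, the temporal information of a Chinese discourse is the basis for generating the correct tense of English words. Tense is an important component of temporal information and has to be obtained through semantic analysis of temporal phrases within the discourse. The paper first classifies temporal phrases in Chinese discourse semantically, then designs a semantic representation structure for temporal phrases, TPSRS, represents Chinese discourse context knowledge with the concept-information-unit association network CIURN, gives an algorithm, TPPA, for analyzing temporal phrases in discourse context, and proposes deriving the tense of temporal phrases and of events in Chinese discourse through tense calculation. Finally, the approach is implemented in the Chinese-English machine translation system ICENT and achieves good experimental results on Chinese discourses with a known writing time.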

12.
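In recent years, text-generation evaluation methods based on pre-trained language models have attracted wide attention; they assess the quality of generated text by computing subword-level similarity between two sentences. However, for languages such as Vietnamese and Thai, which contain large numbers of bound morphemes, a single syllable or subword cannot stand alone as a word and express meaning, so methods that match only at subword granularity cannot fully represent the semantic similarity between two sentences. To address this, the paper proposes a multi-granularity (subword, syllable, phrase, etc.) feature-based...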

13.
Legal texts usually comprise many kinds of texts, such as contracts, patents and treaties. These texts usually include a huge quantity of unstructured information written in natural language. Thanks to automatic analysis and Information Retrieval (IR) techniques, it is possible to filter out information that is not relevant and, therefore, to reduce the number of documents that users need to browse to find the information they are looking for. In this paper we adapted the JIRS passage retrieval system to work with three kinds of legal texts: treaties, patents and contracts, studying the issues related to the processing of this kind of information. In particular, we studied how a passage retrieval system might be linked up to automated analysis based on logic and algebraic programming for the detection of conflicts in contracts. In our set-up, a contract is translated into formal clauses, which are analysed by means of a model checking tool; then, the passage retrieval system is used to extract conflicting sentences from the original contract text.

14.
This paper reports on some of the concrete outcomes of a larger research project on the study of syntactic change. In this part of the project, we are collecting and encoding historical texts and tagging them for syntactic analysis. We have so far produced a TEI-conformant version of an Old French text, La Vie de Saint Louis, written by Jehan de Joinville around 1305, and we are in the process of adding syntactic tags to this text. Those syntactic tags are derived from the Penn-Helsinki coding scheme, which had been devised for the syntactic encoding of Middle English texts, and have been translated into TEI. Thus this paper addresses two issues: the development of a TEI encoding for the text, and the adaptation of the Penn-Helsinki syntactic coding scheme. While the first part of this work raises issues of a textual nature independently of the language of the text, and proposes concrete immediate solutions, the second part points to a more general extension of the PH tagset to other types of texts and to other languages.

15.
A language is presented for the representation of graphs and the formulation of related problems. For a language for graphs to be complete, it should have all the computing facilities of symbolic languages, like ALGOL or FORTRAN, and, in addition, it should be able to easily handle graph-like data structures. As a consequence, our language is defined as a set of phrases to be added to another existing language, in particular to ALGOL. This set of phrases constitutes the graphic language (G.L.), while the complete language for graph handling is the graphic extended ALGOL (G.E.A.). The phrases of the graphic language have to be translated into pieces of ALGOL programs by a pre-compiler. The graphic language has been defined by an operator precedence grammar, written in Backus normal form, so that a syntax-directed compiler can be used.

16.
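This paper takes multilingual parallel text and speech corpora of Uyghur, Kazakh, Kyrgyz, and Mongolian, all in the Altaic language family, as its object of study. It compares the similarity of the quantized sequence vectors of the multilingual texts and of the acoustic and prosodic features of the speech, in order to study the commonalities among the languages' information. The experiments find that agglutinative languages of the same family and branch are highly similar: text similarity reaches 85% and audio-feature similarity reaches 95%. This confirms the feasibility of building a shared cloud model of linguistic information across multiple agglutinative languages of the same family, which will facilitate cross-lingual conversion of text and speech information and greatly reduce the cost of processing minority-language information.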

17.
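To address the inaccuracy of traditional translation systems in tense translation, a DBN-based tense translation method for parallel corpora is proposed, drawing on current machine learning algorithms. To realize the method, the tense annotation model and the basic theory of DBN are first introduced, and an approach to tense translation of Chinese-English sentences is proposed. During DBN feature extraction from the parallel corpus, an automatic tense annotation algorithm is used to annotate tense, and the resulting data are encoded as tense trees; then, with the encoded data...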

18.
The World Wide Web is becoming increasingly necessary for everybody regardless of age, gender, culture, health and individual disabilities. Unfortunately, there are evidently still problems for some deaf and hard of hearing people trying to use certain web pages. These people require the translation of existing written information into their first language, which can be one of many sign languages. In previous technological solutions, the video window dominates the screen, interfering with the presentation and thereby distracting the general public, who have no need of a bilingual web site. One solution to this problem is the development of transparent sign language videos which appear on the screen on request. Therefore, we have designed and developed a system to enable the embedding of selective interactive elements into the original text in appropriate locations, which act as triggers for the video translation into sign language. When the short video clip terminates, the video window is automatically closed and the original web page is shown. In this way, the system significantly simplifies the expansion and availability of additional accessibility functions to web developers, as it preserves the original web page with the addition of a web layer of sign language video. Quantitative and qualitative evaluation has demonstrated that information presented through a transparent sign language video increases the users’ interest in the content of the material by interpreting terms, phrases or sentences, and therefore facilitates the understanding of the material and increases its usefulness for deaf people.

19.
We compare different strategies to apply statistical machine translation techniques in order to retrieve documents that are a plausible translation of a given source document. Finding the translated version of a document is a relevant task; for example, when building a corpus of parallel texts that can help to create and evaluate new machine translation systems.

In contrast to the traditional settings in cross-language information retrieval tasks, in this case both the source and the target text are long and, thus, the procedure used to select which words or phrases will be included in the query has a key effect on the retrieval performance. In the statistical approach explored here, both the probability of the translation and the relevance of the terms are taken into account in order to build an effective query.
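One way to picture a query that combines translation probability with term relevance is the sketch below. It is an illustration under stated assumptions rather than the scoring used in the paper: relevance is approximated by an inverse-document-frequency weight, the score of a candidate target term is the sum over source words of count × translation probability × IDF, and the function name `build_query`, the translation table, and the IDF values are all hypothetical toy data.

```python
from collections import Counter


def build_query(source_tokens, translation_table, idf, k=3):
    """Pick the k target-language terms with the highest combined score.

    translation_table: source word -> {target term: translation probability}
    idf: target term -> relevance weight (assumed here to be an IDF value)
    """
    term_frequency = Counter(source_tokens)
    scores = Counter()
    for word, count in term_frequency.items():
        for target, probability in translation_table.get(word, {}).items():
            # Frequent source words, likely translations and rare (informative)
            # target terms all push a candidate term up the ranking.
            scores[target] += count * probability * idf.get(target, 0.0)
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


toy_table = {
    "casa": {"house": 0.9, "home": 0.1},
    "la": {"the": 1.0},
    "perro": {"dog": 1.0},
}
toy_idf = {"house": 2.0, "home": 1.5, "the": 0.1, "dog": 3.0}
print(build_query(["casa", "casa", "la", "perro"], toy_table, toy_idf, k=2))
```

With these toy numbers the function word "the" is translated with high probability but carries almost no relevance weight, so it drops out of the query, while "house" and "dog" are kept.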

20.
Interlingua and transfer-based approaches to machine translation have long been in use in competing and complementary ways. The former proves economical in situations where translation among multiple languages is involved, and can be used as a knowledge-representation scheme. But given a particular interlingua, its adoption depends on its ability (a) to capture the knowledge in texts precisely and accurately and (b) to handle cross-language divergences. This paper studies the language divergence between English and Hindi and its implications for machine translation between these languages using the Universal Networking Language (UNL). UNL has been introduced by the United Nations University, Tokyo, to facilitate the transfer and exchange of information over the internet. The representation works at the level of single sentences and defines a semantic net-like structure in which nodes are word concepts and arcs are semantic relations between these concepts. The language divergences between Hindi, an Indo-European language, and English can be considered as representing the divergences between the SOV and SVO classes of languages. The work presented here is the only one to our knowledge that describes language divergence phenomena in the framework of computational linguistics through a South Asian language.
