Similar literature
1.
This paper describes the design and implementation of a computational model for Arabic natural language semantics: a semantic parser that captures the deep semantic representation of Arabic text. The parser is a major part of an Interlingua-based machine translation system for translating Arabic text into Sign Language. It follows a frame-based analysis to capture the overall meaning of Arabic text in a formal representation suitable for NLP applications that need deep semantic representation, such as language generation and machine translation. We show the representational power of this theory for the semantic analysis of texts in Arabic, a language that differs substantially from English in several ways. We also show that integrating WordNet and FrameNet into a single unified knowledge resource can improve disambiguation accuracy. Furthermore, we propose a rule-based algorithm to generate an equivalent Arabic FrameNet, using a lexical resource alignment of FrameNet 1.3 LUs and WordNet 3.0 synsets for English. A pilot study of motion and location verbs was carried out to test our system. Our corpus consists of more than 2000 Arabic sentences in the domain of motion events, collected from Algerian first-level educational Arabic books and other relevant Arabic corpora.

2.
The interlingual approach to machine translation (MT) is used successfully in multilingual translation. It aims to achieve the translation task in two independent steps. First, the meanings of the source-language sentences are represented in an intermediate, language-independent (Interlingua) representation. Then, sentences of the target language are generated from those meaning representations. Arabic natural language processing in general is still underdeveloped, and Arabic natural language generation (NLG) is even less developed. In particular, Arabic NLG from Interlinguas has only been investigated using template-based approaches. Moreover, tools used for other languages are not easily adaptable to Arabic because of the language's complexity at both the morphological and syntactic levels. In this paper, we describe a rule-based generation approach for task-oriented Interlingua-based spoken dialogue that transforms a relatively shallow semantic interlingual representation, called interchange format (IF), into Arabic text that corresponds to the intentions underlying the speaker’s utterances. This approach addresses the problems of Arabic syntactic structure determination and of Arabic morphological and syntactic generation within the Interlingual MT approach. The generation approach is developed primarily within the framework of the NESPOLE! (NEgotiating through SPOken Language in E-commerce) multilingual speech-to-speech MT project. The IF-to-Arabic generator is implemented in SICStus Prolog. We conducted evaluation experiments using the input and output of the English analyzer that was developed by the NESPOLE! team at Carnegie Mellon University. The results of these experiments were promising and confirmed the ability of the rule-based approach to generate Arabic translations from the Interlingua in the travel and tourism domain.

3.
In this paper, we present a system that automatically translates Arabic text embedded in images into English. The system consists of three components: text detection from images, character recognition, and machine translation. We formulate text detection as a binary classification problem and apply gradient boosting tree (GBT), support vector machine (SVM), and location-based prior knowledge to improve the F1 score of text detection from 78.95% to 87.05%. The detected text images are processed by off-the-shelf optical character recognition (OCR) software. We employ an error correction model to post-process the noisy OCR output, and apply a bigram language model to reduce word segmentation errors. The translation module is tailored with a compact data structure for hand-held devices. The experimental results show substantial improvements in both word recognition accuracy and translation quality. For instance, in the experiment on the Arabic Transparent font, the BLEU score increases from 18.70 to 33.47 with the use of the error correction module.
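As a rough illustration of how a bigram language model can choose among competing word segmentations, the sketch below runs a small dynamic program over an unsegmented string. It is not the system described in the abstract: the English toy vocabulary, the counts, the smoothing constant `ALPHA`, and the function names are all assumptions made for the example.

```python
import math

# Toy counts, invented for illustration; a real model is estimated from a corpus.
START = "<s>"
UNIGRAMS = {START: 10, "the": 6, "them": 2, "me": 2, "meat": 3, "eat": 2, "at": 2}
BIGRAMS = {(START, "the"): 5, ("the", "meat"): 3, (START, "them"): 1, ("them", "eat"): 1}
ALPHA = 0.1  # add-alpha smoothing constant (assumed)


def bigram_logprob(prev, word):
    """Smoothed log P(word | prev)."""
    num = BIGRAMS.get((prev, word), 0) + ALPHA
    den = UNIGRAMS.get(prev, 0) + ALPHA * len(UNIGRAMS)
    return math.log(num / den)


def segment(text, max_len=6):
    """Return the in-vocabulary word sequence with the highest bigram score."""
    # best[(end, last_word)] = (log score, previous state)
    best = {(0, START): (0.0, None)}
    for i in range(len(text)):
        states = [(k, v) for k, v in best.items() if k[0] == i]
        for state, (score, _) in states:
            for j in range(i + 1, min(len(text), i + max_len) + 1):
                word = text[i:j]
                if word == START or word not in UNIGRAMS:
                    continue
                cand = score + bigram_logprob(state[1], word)
                key = (j, word)
                if key not in best or cand > best[key][0]:
                    best[key] = (cand, state)
    finals = [(v[0], k) for k, v in best.items() if k[0] == len(text)]
    if not finals:
        return [text]  # no full segmentation found; leave the string unsplit
    key = max(finals)[1]
    words = []
    while key != (0, START):  # follow back-pointers to the start state
        words.append(key[1])
        key = best[key][1]
    return words[::-1]


print(segment("themeat"))
```

With these toy counts the string `themeat` is split as `the` + `meat` rather than `them` + `eat`, because the first pair of bigrams is far more probable.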

4.
The Internet and the World Wide Web have become an integral part of everyday life, an important source of information, and a communication medium. One of the main problems confronting non-English speakers in using the Internet is that it is heavily dominated by the English language. Knowledge of English is a prerequisite for using the Web efficiently and benefiting from the multitude of services it offers. This paper introduces the approach underlying a prototype system that automatically translates Web pages from English to Arabic. The system uses a commercial machine translation system to translate the textual part of a Web page. It then displays a Web page containing the Arabic translation with all tags inserted in the right places, so that the layout and content of the original (English) page are preserved.

5.
A Survey of Frontiers in Neural Machine Translation   Total citations: 3 (self-citations: 0, citations by others: 3)
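Machine translation is the process of using a computer to translate a source-language sentence into a semantically equivalent target-language sentence, and it is an important research direction in natural language processing. Neural machine translation needs only a neural network to perform end-to-end translation from the source language to the target language, and it has become the mainstream direction of machine translation research. This paper selects several major recent research areas in neural machine translation, including simultaneous interpretation, multimodal machine translation, non-autoregressive models, document-level translation, domain adaptation, multilingual translation, and model training, and briefly introduces the frontier research progress in these areas.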

6.
Fast access to information in different languages is still a major problem for many organizations. We have built a multilingual analyst's workstation integrated in the Tipster document management toolkit. The analyst workstation offers an English-speaking analyst a variety of tools to browse sets of documents in Arabic, Japanese, Spanish and Russian, including a Unicode-based multilingual editor and a simple machine translation functionality. The Temple project has developed an open multilingual architecture and software support for rapid development of extensible machine translation functionalities. The targeted languages are those for which natural language processing and human resources are scarce or difficult to obtain. The goal is to support the development of machine translation functionalities in a very short time and with limited resources. Glossary-based machine translation (GBMT) is used to provide an English gloss of a foreign document. A GBMT system uses a bilingual phrasal dictionary (glossary) to produce a phrase-by-phrase translation. Translation, which is based on phrase pattern-matching, is fast and accurate with regard to the content of the document, and browsed documents can be translated almost in real time. A GBMT system for a language pair is also extremely simple, cheap and fast to develop. Moreover, all language resources used by the system are entirely under the control of the user.
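The phrase-by-phrase idea behind glossary-based translation can be sketched with a greedy longest-match lookup, as below. This is a minimal illustration under assumptions, not the Temple implementation: the toy Spanish-to-English glossary, the `max_phrase_len` limit, and the policy of passing unknown tokens through unchanged are all hypothetical.

```python
def gloss(tokens, glossary, max_phrase_len=3):
    """Produce a phrase-by-phrase gloss using greedy longest-match lookup."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in glossary:
                out.append(glossary[phrase])
                i += n
                break
        else:
            # No glossary entry starts here: keep the source token as-is.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Hypothetical glossary entries, invented for this example.
GLOSSARY = {"buenos días": "good morning", "buenos": "good", "señor": "sir"}

print(gloss("buenos días señor García".split(), GLOSSARY))
```

Because the glossary is an ordinary dictionary, a user can add or correct entries directly, which matches the abstract's point that the language resources stay under the user's control.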

7.
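Because Mongolian-Chinese machine translation in the Inner Mongolia region is underdeveloped and parallel bilingual corpora are small, traditional statistical machine translation methods suffer from data sparsity and overfitting during training, which leads to low translation quality. To address this, an LSTM-based Mongolian-Chinese neural machine translation method is proposed. It uses long short-term memory models to build an end-to-end neural network framework and to model the Mongolian-Chinese machine translation system. To capture Mongolian semantic information more effectively, Mongolian words are segmented into morphemes according to the characteristics of the language and fed into the model. A local attention mechanism is introduced into the model to compute the weights of the source-language morphemes associated with each target word, which yields alignment probabilities between Mongolian and Chinese vocabulary and thereby improves translation quality. Experimental results show that the method improves translation quality compared with a traditional Mongolian-Chinese translation system.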

8.
Although no machine learning technique fully meets human requirements, and each technique has its own advantages and disadvantages, finding a quick and efficient translation mechanism has become an urgent necessity because of the differences between the languages spoken in the world’s communities and the rapid development occurring worldwide. The purpose of this paper is therefore to shed light on some of the machine translation techniques available in the literature and to encourage researchers to study them. We discuss some of the linguistic characteristics of the Arabic language. Features of Arabic that are related to machine translation are discussed in detail, along with the difficulties that they might present. This paper summarizes the major techniques used in machine translation from Arabic into English and discusses their strengths and weaknesses.

9.
Many natural language processing areas use semantic roles to improve applications such as information extraction, question answering, and machine translation. For Arabic, work on building semantic role labeling systems or annotated corpora is extremely limited relative to the number of its speakers and to the work available for English. In this paper, we present a supervised method for the semantic role labeling of Arabic sentences. We use the feedback capacity of case-based reasoning to annotate new sentences from already annotated ones, and we use the Arabic PropBank as a reference for the semantic labels. We test our method on a large corpus that contains 2332 attributes and 5291 arguments. An Arabic semantic role labeling system is thus tested on that corpus for the first time. The results show that our method can annotate new sentences from labeled sentences and can support the construction of an annotated corpus.

10.
Language modeling for an inflected language such as Arabic poses new challenges for speech recognition and machine translation because of its rich morphology. Rich morphology results in large increases in the out-of-vocabulary (OOV) rate and in poor language model parameter estimation in the absence of large quantities of data. In this study, we present a joint morphological-lexical language model (JMLLM) that takes advantage of Arabic morphology. JMLLM combines morphological segments with the underlying lexical items, and with additional available information sources about morphological segments and lexical items, in a single joint model. Joint representation and modeling of morphological and lexical items reduces the OOV rate and provides smooth probability estimates while keeping the predictive power of whole words. Speech recognition and machine translation experiments in dialectal Arabic show improvements over word- and morpheme-based trigram language models. We also show that as the tightness of integration between different information sources increases, both speech recognition and machine translation performance improve.
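The effect of morphological segmentation on the OOV rate can be illustrated with a small calculation. The sketch below is not the JMLLM model: the transliterated toy words, the hand-written segmentation table `SEGMENTS`, and the function names are assumptions chosen only to show why segments recur more often than whole word forms.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Hypothetical segmentation table (prefix "w+" and suffix "+hm" split off);
# a real system would use a morphological analyzer instead.
SEGMENTS = {
    "wktAb": ["w+", "ktAb"],
    "ktAbhm": ["ktAb", "+hm"],
    "wbythm": ["w+", "byt", "+hm"],
    "byt": ["byt"],
    "ktAb": ["ktAb"],
}


def to_segments(words):
    """Flatten a word list into its morphological segments."""
    return [seg for w in words for seg in SEGMENTS[w]]


train_words = ["wktAb", "byt", "ktAbhm"]
test_words = ["ktAb", "wbythm", "byt"]

word_oov = oov_rate(train_words, test_words)
segment_oov = oov_rate(to_segments(train_words), to_segments(test_words))
print(f"word-level OOV rate:    {word_oov:.2f}")
print(f"segment-level OOV rate: {segment_oov:.2f}")
```

In this toy data two of the three test words are unseen as whole forms, while every one of their segments already appears in the training data.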

11.
Morphologically rich languages pose a challenge for statistical machine translation (SMT). This challenge is magnified when translating into a morphologically rich language. In this work we address this challenge in the framework of a broad-coverage English-to-Arabic phrase-based statistical machine translation (PBSMT) system. We explore the largest-to-date set of Arabic segmentation schemes, ranging from full word forms to fully segmented forms, and examine their effects on system performance. Our results show a difference of 2.31 BLEU points, averaged over all test sets, between the best and worst segmentation schemes, indicating that the choice of segmentation scheme has a significant effect on the performance of an English-to-Arabic PBSMT system in a large-data scenario. We show that a simple segmentation scheme can perform as well as the best and more complicated segmentation scheme. An in-depth analysis of the effect of segmentation choices on the components of a PBSMT system reveals that text fragmentation has a negative effect on the perplexity of the language models, and that aggressive segmentation can significantly increase the size of the phrase table and the uncertainty in choosing the candidate translation phrases during decoding. An investigation of the output of the different systems reveals the complementary nature of the outputs and the great potential in combining them.

12.
In adding syntax to statistical machine translation, there is a tradeoff between taking advantage of linguistic analysis and allowing the model to exploit parallel training data with no linguistic analysis: translation quality versus coverage. A number of previous efforts have tackled this tradeoff by starting with a commitment to linguistically motivated analyses and then finding appropriate ways to soften that commitment. We present an approach that explores the tradeoff from the other direction, starting with a translation model learned directly from aligned parallel text and then adding soft constituent-level constraints based on parses of the source language. We argue that for these constraints to improve translation, they must be fine-grained: the constraints should vary by constituent type and by the type of match or mismatch with the parse. We also use a different feature weight optimization technique that can handle a large number of features, thus eliminating the bottleneck of feature selection. We obtain substantial improvements in performance for translation from Arabic to English.

13.
The implementation of a hierarchical, process-oriented programming language for simulation (HSL) is described. It features a hybrid approach, involving the front end of a compiler and the back end of an interpreter. An HSL program is dichotomous in structure. Source statements from each part are translated into three-address code for an abstract machine, and the resulting code is then interpreted. The algorithms and the supporting data structures that effect the translation and interpretation of HSL are detailed. The host language for HSL is C++. HSL is machine independent and can be ported to any machine on which the host language is available. Its initial implementation was carried out on an NCR Tower. More recently, it was transferred to an NCR PC916.
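To make the "three-address code for an abstract machine, then interpreted" step concrete, here is a minimal sketch of such an interpreter. It does not reproduce HSL's actual instruction set, which the abstract does not give: the tuple layout `(op, result, arg1, arg2)` and the opcodes `copy`, `add`, and `jle` are hypothetical, and Python is used here only for brevity even though the abstract names C++ as the host language.

```python
def run(program):
    """Interpret a list of three-address instructions and return the variables."""
    env = {}  # variable name -> current value
    pc = 0    # index of the next instruction to execute

    def value(x):
        # Operands are either variable names (str) or literal numbers.
        return env[x] if isinstance(x, str) else x

    while pc < len(program):
        op, result, arg1, arg2 = program[pc]
        if op == "copy":                     # result := arg1
            env[result] = value(arg1)
        elif op == "add":                    # result := arg1 + arg2
            env[result] = value(arg1) + value(arg2)
        elif op == "jle":                    # if arg1 <= arg2 goto result
            if value(arg1) <= value(arg2):
                pc = result
                continue
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return env


# Three-address code for: total = 0; for i in 1..5: total += i
PROGRAM = [
    ("copy", "i", 1, None),
    ("copy", "total", 0, None),
    ("add", "total", "total", "i"),
    ("add", "i", "i", 1),
    ("jle", 2, "i", 5),
]

print(run(PROGRAM)["total"])
```

Each instruction names at most one result and two operands, which is what keeps the interpreter loop a flat dispatch on the opcode.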

14.
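The size of the training corpus plays an important role in machine-learning-based extraction of semantic relations between named entities, yet manually annotating a corpus takes a great deal of time and labor. This paper proposes using machine translation to convert relation instances in a source language into relation instances in a target language, and adding them to the target-language training set through an entity alignment strategy, so that a resource-rich source language helps a resource-poor target language with semantic relation extraction. Relation extraction experiments on the ACE2005 Chinese and English corpora show that translating Chinese into English, or English into Chinese, in each case helps relation extraction in the other language. The benefit is especially pronounced when the target-language training corpus is small.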

15.
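Chinese and Uyghur are two languages that differ considerably in syntactic structure and word order. For a complete Chinese-Uyghur machine translation system, it is necessary to analyze the source language and to express tense and voice accurately in the target language. To address the accuracy and word-order problems caused by the limited syntactic and semantic content in statistical machine translation models, relevant transformation and matching rules are established for use in a hybrid approach to machine translation, with the aim of improving the performance of the translation system.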

16.
With the growth of Arabic electronic data on the web, information extraction, which is one of the major challenges of question answering, is mainly used for building document corpora. Building a corpus is a research topic that currently appears among the major themes of natural language processing (NLP) conferences, alongside information retrieval (IR), question answering (QA), automatic summarization (AS), etc. Generally, a question-answering system provides various passages to answer user questions. To make these passages truly informative, the system needs access to an underlying knowledge base, which requires the construction of a corpus. The aim of our research is to build an Arabic question-answering system. Analyzing the question must be the first step. Next, it is essential to retrieve a passage from the web that can serve as an appropriate answer. In this paper, we propose a method to analyze the question and retrieve the answer passage in the Arabic language. For the question analysis, five factual question types are processed. Additionally, we experiment with generating a logic representation from the declarative form of each question. Several studies dealing with logic approaches in question answering address languages other than Arabic. This representation is very promising because it helps later in the selection of a justifiable answer. The accuracy for questions correctly analyzed and translated into logic form reached 64%. The automatically generated text passages achieved 87% accuracy and a c@1 score of 98%.
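The c@1 measure mentioned above is commonly defined as c@1 = (n_R + n_U * n_R / n) / n, where n is the number of questions, n_R the number answered correctly, and n_U the number left unanswered. It rewards leaving a question unanswered over answering it wrongly. The sketch below assumes that standard definition; the function name and the example counts are illustrative and are not taken from the paper's experiments.

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1 = (n_R + n_U * n_R / n) / n, with unanswered questions given partial credit."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_correct + n_unanswered > n_total:
        raise ValueError("counts exceed the number of questions")
    return (n_correct + n_unanswered * n_correct / n_total) / n_total


# Illustrative counts only: 10 questions, 8 correct, 1 unanswered, 1 wrong.
print(round(c_at_1(8, 1, 10), 2))  # plain accuracy on the same counts would be 0.80
```

When no question is left unanswered, c@1 reduces to plain accuracy.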

17.
18.
19.
We present MARS (Multilingual Automatic tRanslation System), a research prototype speech-to-speech translation system. MARS is aimed at two-way conversational spoken language translation between English and Mandarin Chinese for limited domains, such as air travel reservations. In MARS, machine translation is embedded within a complex speech processing task, and the translation performance is strongly affected by the performance of other components, such as the recognizer and the semantic parser. All components in the proposed system are statistically trained using an appropriate training corpus. The speech signal is first recognized by an automatic speech recognizer (ASR). Next, the ASR-transcribed text is analyzed by a semantic parser, which uses a statistical decision-tree model that does not require hand-crafted grammars or rules. Furthermore, the parser provides semantic information that helps further re-scoring of the speech recognition hypotheses. The semantic content extracted by the parser is formatted into a language-independent tree structure, which is used for interlingua-based translation. A maximum-entropy-based sentence-level natural language generation (NLG) approach is used to generate sentences in the target language from the semantic tree representations. Finally, the generated target sentence is synthesized into speech by a speech synthesizer. Many new features and innovations have been incorporated into MARS: the translation is based on understanding the meaning of the sentence; the semantic parser uses a statistical model and is trained from a semantically annotated corpus; the output of the semantic parser is used to select a more specific language model to refine the speech recognition performance; the NLG component uses a statistical model and is also trained from the same annotated corpus. These features give MARS the advantages of robustness to speech disfluencies and recognition errors, tighter integration of semantic information into speech recognition, and portability to new languages and domains. These advantages are verified by our experimental results.

20.
Advances in technology have turned the big world into one small village. Regardless of what country you live in or what language you speak or understand, you should be able to benefit from the accumulated knowledge available on the Internet. Unfortunately, this is not the case, because English is the de facto language of most programming languages, services, tools and web content. Many users are blocked from using these tools and services because they do not speak or understand English. Multilingual software evolved as a solution to this dilemma. In this paper, we describe the design and implementation of a user-friendly toolkit named Weka interface translator (WIT). It is dedicated to internationalizing Weka, a collection of machine learning algorithms for data mining tasks that is widely used by researchers around the world. WIT is a collaboration project between the Arabic natural language processing team from the University of Jordan and Weka’s development team from the University of Waikato. Its main goal is to facilitate the translation of Weka’s interfaces into multiple languages. WIT is downloadable through SourceForge.net and is officially listed on Weka’s wiki spaces among its related projects. To experiment with WIT, we present Arabic as a pilot test among the many languages that could benefit from this project.
