Similar literature
1.
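Unsupervised neural machine translation trains models using only large amounts of monolingual data, with no parallel data, but it struggles to establish a connection between two languages from distant language families. To address this problem, a new neural machine translation training method that uses no parallel sentence pairs is proposed. A bilingual dictionary is used to substitute words in the monolingual data, which links the two languages, and two techniques, word-embedding fusion initialization and dual-encoder fusion training, strengthen the alignment of the two languages in a shared semantic space and thereby improve the performance of the machine translation system. Experiments show that the proposed method improves BLEU over the baseline unsupervised translation system by 2.39 and 1.29 in the Chinese-English and English-Chinese experiments, respectively, and that translation quality also improves significantly in monolingual experiments such as English-Russian and English-Arabic.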
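
The dictionary-substitution step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `substitute_with_dictionary`, the `rate` parameter, and the toy dictionary are all hypothetical, and the abstract does not say which words are replaced or how often.

```python
import random


def substitute_with_dictionary(tokens, bilingual_dict, rate=0.5, seed=0):
    """Replace some source tokens with their dictionary translations.

    Hypothetical sketch: each token found in `bilingual_dict` is swapped
    for its translation with probability `rate`; other tokens are kept.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    mixed = []
    for tok in tokens:
        if tok in bilingual_dict and rng.random() < rate:
            mixed.append(bilingual_dict[tok])
        else:
            mixed.append(tok)
    return mixed


# Toy English-to-Chinese dictionary (illustrative entries only).
toy_dict = {"cat": "猫", "sleeps": "睡觉"}
sentence = ["the", "cat", "sleeps"]

# With rate=1.0 every token that has a dictionary entry is replaced.
print(substitute_with_dictionary(sentence, toy_dict, rate=1.0))
```

The resulting code-switched sentences contain words from both languages in the same context, which is one plausible way such a dictionary can tie two monolingual corpora together.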

2.
Although there is no machine learning technique that fully meets human requirements, finding a quick and efficient translation mechanism has become an urgent necessity, due to the differences between the languages spoken in the world’s communities and the vast development that has occurred worldwide, as each technique demonstrates its own advantages and disadvantages. Thus, the purpose of this paper is to shed light on some of the techniques that employ machine translation available in literature, to encourage researchers to study these techniques. We discuss some of the linguistic characteristics of the Arabic language. Features of Arabic that are related to machine translation are discussed in detail, along with possible difficulties that they might present. This paper summarizes the major techniques used in machine translation from Arabic into English, and discusses their strengths and weaknesses.

3.
The interlingual approach to machine translation (MT) is used successfully in multilingual translation. It aims to achieve the translation task in two independent steps. First, meanings of the source-language sentences are represented in an intermediate language-independent (Interlingua) representation. Then, sentences of the target language are generated from those meaning representations. Arabic natural language processing in general is still underdeveloped and Arabic natural language generation (NLG) is even less developed. In particular, Arabic NLG from Interlinguas was only investigated using template-based approaches. Moreover, tools used for other languages are not easily adaptable to Arabic due to the language complexity at both the morphological and syntactic levels. In this paper, we describe a rule-based generation approach for task-oriented Interlingua-based spoken dialogue that transforms a relatively shallow semantic interlingual representation, called interchange format (IF), into Arabic text that corresponds to the intentions underlying the speaker’s utterances. This approach addresses the handling of the problems of Arabic syntactic structure determination, and Arabic morphological and syntactic generation within the Interlingual MT approach. The generation approach is developed primarily within the framework of the NESPOLE! (NEgotiating through SPOken Language in E-commerce) multilingual speech-to-speech MT project. The IF-to-Arabic generator is implemented in SICStus Prolog. We conducted evaluation experiments using the input and output from the English analyzer that was developed by the NESPOLE! team at Carnegie Mellon University. The results of these experiments were promising and confirmed the ability of the rule-based approach in generating Arabic translation from the Interlingua taken from the travel and tourism domain.

4.
In this paper we present a speech-to-speech (S2S) translation system called the BBN TransTalk that enables two-way communication between speakers of English and speakers who do not understand or speak English. The BBN TransTalk has been configured for several languages including Iraqi Arabic, Pashto, Dari, Farsi, Malay, Indonesian, and Levantine Arabic. We describe the key components of our system: automatic speech recognition (ASR), machine translation (MT), text-to-speech (TTS), dialog manager, and the user interface (UI). In addition, we present novel techniques for overcoming specific challenges in developing high-performing S2S systems. For ASR, we present techniques for dealing with lack of pronunciation and linguistic resources and effective modeling of ambiguity in pronunciations of words in these languages. For MT, we describe techniques for dealing with data sparsity as well as modeling context. We also present and compare different user confirmation techniques for detecting errors that can cause the dialog to drift or stall.

5.
Closing educational gaps between sub‐populations in Israel, particularly between students in Hebrew‐speaking and Arabic‐speaking schools, remains one of the priorities of Israel's education system. In the field of information and communication technology (ICT), this goal refers to infrastructure as well as practice, i.e. teaching and learning. A secondary analysis of the Second Information Technology in Education Study 2006 findings portrays a multifaceted state of affairs on some issues, e.g. vision and goals, attitudes on ICT importance in general and as a lever for paradigmatic change in particular. This is contrary to expectations due to the inequality in allocation (mainly of infrastructure) between the two sectors. Arabic‐speaking mathematics teachers indicate greater ICT usage in their target class, while among science teachers, Hebrew‐speaking teachers report greater usage and influence on their pedagogy, indicating innovative usage. Conclusions suggest that further effort is needed to close the gaps between Hebrew‐ and Arabic‐speaking schools, as well as collaboration and exchange of ideas, information and educational experience between staff members from both sectors.

6.
Summary. We study the relationship between scattered and context-sensitive rewriting. We prove that an extended version of scattered grammars produces exactly the context-sensitive languages. Also unordered scattered context languages are a proper subset of scattered context languages, and unordered scattered rewriting with erasing does not generate all scattered context (and thus not all context-sensitive) languages. Part of this research was done while this author visited the Hebrew University and was supported by the Leibniz Center

7.
This paper presents two grammars for reading numbers of classical and modern Arabic language. The grammars make use of the structured Arabic counting system to present an accurate and compact grammar that can be easily implemented in different platforms. Automating the process of reading numbers from its numerical representation to its sentential form has many applications. Inquiring about your bank balance over the phone, automatically writing the amount of checks (from numerical form to letter form), and reading for the blind people are some of the fields that automated reading of numbers can be of service. The parsing problem of sentential representation of numbers in the Arabic language is also addressed. A grammar to convert from sentential representation to the numerical representation is also presented. Grammars presented can be used to translate from the sentential Arabic numbers to sentential English numbers, and vice versa, by using the common numerical representation as an intermediate code. Such methodology can be used to aid the automatic translation between the two natural languages. All grammars described in this paper have been implemented on a UNIX system. Examples of different number representations and the output of the implementation of the grammars are given as part of the paper.

8.
Much of the work on statistical machine translation (SMT) from morphologically rich languages has shown that morphological tokenization and orthographic normalization help improve SMT quality because of the sparsity reduction they contribute. In this article, we study the effect of these processes on SMT when translating into a morphologically rich language, namely Arabic. We explore a space of tokenization schemes and normalization options. We also examine a set of six detokenization techniques and evaluate on detokenized and orthographically correct (enriched) output. Our results show that the best performing tokenization scheme is that of the Penn Arabic Treebank. Additionally, training on orthographically normalized (reduced) text then jointly enriching and detokenizing the output outperforms training on enriched text.

9.

Steganography is used for multimedia data security. It is a process of hiding the data within multimedia communication between the parties embedding the secret data inside a carrier file to be protected during its transmission. The research focus is on hiding within Arabic text steganography as a current challenging research area. The work innovation is utilizing text pseudo-spaces characters for data hiding. We present two studies for this text Steganography utilizing pseudo-spaces alone as well as combined with Kashida (extension character) as the old Arabic text stego techniques. Experimental results have shown that the proposed algorithms achieved high capacity and security ratio as compared to state-of-the-art Steganography methods presented for Arabic language. The proposed pseudo-spaces stego technique is of great benefit that can be further used for languages similar to Arabic such as Urdu and Persian as well as opening direction of text-stego research for other languages of the world.


10.
Fast access to information in different languages is still a major problem for many organizations. We have built a multilingual analyst's workstation integrated in the Tipster document management toolkit. The analyst workstation offers to an English-speaking analyst a variety of tools to browse sets of documents in Arabic, Japanese, Spanish and Russian, including a Unicode-based multilingual editor, and a simple machine translation functionality. The Temple project has developed an open multilingual architecture and software support for rapid development of extensible machine translation functionalities. The targeted languages are those for which natural language processing and human resources are scarce or difficult to obtain. The goal is to support rapid development of machine translation functionalities in a very short time with limited resources. Glossary-based machine-translation (GBMT) is used to provide an English gloss of a foreign document. A GBMT system uses a bilingual phrasal dictionary (glossary) to produce a phrase-by-phrase translation. Translation (based on phrase pattern-matching) is fast and accurate regarding the content of the document and browsed documents can be translated almost in real-time. A GBMT system for a language pair is also extremely simple, cheap and fast to develop. Moreover, all language resources used by the system are entirely under the control of the user.
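
The phrase-by-phrase glossing idea in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract says translation is based on phrase pattern-matching against a bilingual glossary, but the greedy longest-match strategy, the function name `gloss`, the `max_len` parameter, and the toy Spanish glossary are hypothetical and not taken from the Temple project.

```python
def gloss(tokens, glossary, max_len=3):
    """Produce an English gloss by greedy longest-match phrase lookup.

    Hypothetical sketch: at each position, try the longest phrase
    (up to `max_len` tokens) found in `glossary`; tokens with no
    entry are copied through unchanged.
    """
    output = []
    i = 0
    while i < len(tokens):
        # Try candidate phrases from longest to shortest.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in glossary:
                output.append(glossary[phrase])
                i += n
                break
        else:
            # No glossary entry starts here: keep the source token.
            output.append(tokens[i])
            i += 1
    return output


# Toy glossary (illustrative entries only).
toy_glossary = {"buenos días": "good morning", "días": "days", "señor": "sir"}
print(gloss("buenos días señor López".split(), toy_glossary))
# → ['good morning', 'sir', 'López']
```

Because the only language resource is the glossary itself, a user can change the output simply by editing dictionary entries, which matches the abstract's point that the resources stay under the user's control.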

11.
Stemming is one of the basic steps in natural language processing applications such as information retrieval, part-of-speech tagging, syntactic parsing and machine translation. It is a morphological process that converts the inflected forms of a word into its root form. Urdu is a morphologically rich language that emerged from different languages; its inflected and multi-gram words include prefixes, suffixes, infixes, co-suffixes and circumfixes that need to be edited in order to convert the words into their stems. This editing (insertion, deletion and substitution) makes the stemming process difficult due to the morphological richness of the language and the inclusion of words from foreign languages like Persian and Arabic. In this paper, we present a comprehensive review of different algorithms and techniques for stemming Urdu text. Considering syntax, morphological similarity and other common features, we also analyze the stemming approaches used in Urdu-like languages, i.e. Arabic and Persian, and extract the main features, merits and shortcomings of these approaches. We also discuss stemming errors and the basic difference between stemming and lemmatization, and coin a metric for the classification of stemming algorithms. Finally, we present directions for future work.

12.
This paper describes the multilingual text editor MtScript developed in the framework of the MULTEXT project. MtScript enables the use of many different writing systems in the same document (Latin, Arabic, Cyrillic, Hebrew, Chinese, Japanese, etc.). Editing functions enable the insertion or deletion of text zones even if they have opposite writing directions. In addition, the languages in the text can be marked, customized keyboard input rules can be associated with each language and different character coding systems (one or two bytes) can be combined. MtScript is based on a portable environment (Tcl/Tk). MtScript version 1.1 has been developed under Unix/X-Windows (Solaris, Linux systems) and other versions are planned to be ported to the Windows and Macintosh environments. The current 1.1 version presents several limits that will be fixed in future versions, such as the justification of bi-directional texts, printing support, and text import/export support. Future versions will use SGML and TEI norms, which offer ways of encoding multilingual texts and are to a large extent meant for interchange.

13.
14.
Daniel M. Berry 《Software》1999,29(15):1417-1457
This paper describes an extension to ditroff/ffortid, a system for formatting bi‐directional text in Arabic, Hebrew and Persian. The previous version of the system is able to format mixed left‐to‐right and right‐to‐left text using fonts with separated letters or with connecting letters and only connection stretching, achieved by repeating fixed‐length baseline fillers. The latest extension adds the abilities to stretch letters themselves, as is common in Arabic, Hebrew and Persian calligraphic printing, and to slant the baselines of words, as is common in Persian calligraphic printing. The extension consists of modifications in ffortid that allow it to interface with (1) dynamic PostScript fonts to which one can pass to the outline procedure for any stretchable and/or connected letter, parameters specifying the amounts of stretch for the letter itself and/or for the connecting parts of the letter, and (2) PostScript fonts whose characters are slanted so that merely applying show to a word ends up printing that entire word on a single slanted baseline. As a self‐test, this paper was formatted using the described system, and it contains many examples of text written in Arabic, Hebrew and Persian. Copyright © 1999 John Wiley & Sons, Ltd.

15.
16.
This paper provides a thorough evaluation of a set of six important Arabic OCR systems available in the market, namely Abbyy FineReader, Leadtools, Readiris, Sakhr, Tesseract and NovoVerus. We test the OCR systems using randomly selected images from the well-known Arabic Printed Text Image (APTI) database (250 images) and a set of 8 images from an Arabic book. The APTI database contains 45,313,600 decomposable and non-decomposable word images. In the evaluation, we conduct two tests. The first test is based on the usual metrics used in the literature. In the second test, we provide a novel measure for the Arabic language, which can also be used for other non-Latin languages.

17.
In optical character recognition, Arabic poses additional challenges because of its cursive nature. We propose a system for recognizing a document containing Arabic text, using a pipeline of three neural networks. The first network model predicts the font size of an Arabic word, then the word is normalized to an 18pt font size that will be used to train the next two models. The second model is used to segment a word into characters. The problem of word segmentation in the Arabic language, as in many similar cursive languages, presents a challenge to OCR systems. This paper presents a multichannel neural network to solve the offline segmentation of machine-printed Arabic documents. The segmented characters are then fed as input to a convolutional neural network for Arabic character recognition. The font size prediction model produced a test accuracy of 99.1%. The accuracy of the segmentation model using one font is 98.9%, while the four-font model showed 95.5% accuracy. The whole pipeline showed an accuracy of 94.38% on the Arabic Transparent font of size 18pt from the APTI data set.

18.

Neural machine reading comprehension models have gained immense popularity over the last decade given the availability of large-scale English datasets. A key limiting factor for neural model development and investigations of the Arabic language is the limitation of the currently available datasets. Currently available datasets are either too small to train deep neural models or created by the automatic translation of the available English datasets, where the exact answer may not be found in the corresponding text. In this paper, we propose two high-quality and large-scale Arabic reading comprehension datasets: Arabic WikiReading and KaifLematha, with around 100K+ instances. We followed two different methodologies to construct our datasets. First, we employed crowdworkers to collect non-factoid questions from paragraphs on Wikipedia. Then, we constructed Arabic WikiReading following a distant supervision strategy, utilizing the Wikidata knowledge base as a ground truth. We carried out both quantitative and qualitative analyses to investigate the level of reasoning required to answer the questions in the proposed datasets. We evaluated a competitive pre-trained language model that attained F1 scores of 81.77 and 68.61 for the Arabic WikiReading and KaifLematha datasets, respectively, but struggled to extract a precise answer for the KaifLematha dataset. Human performance reported an F1 score of 82.54 for the KaifLematha development set, which leaves ample room for improvement.


19.
Stemming is the process of transforming a word into its root or stem; hence, it is considered a crucial pre-processing step before tackling any task in natural language processing or information retrieval. However, in the case of Arabic, finding an effective stemming algorithm is quite difficult, since the Arabic language has a specific morphology that differs from that of many other languages. Although several algorithms in the literature address the Arabic stemming issue, most of them are restricted to a limited number of words, confuse original letters with affixes, and usually employ a dictionary of words or patterns. For that reason, we propose the design and implementation of a novel Arabic light stemmer, which is based on new rules for stripping prefixes, suffixes and infixes in a smart way. To our knowledge, it is the first work dealing with Arabic infixes with regard to their irregular rules. The empirical evaluation was conducted on a new Arabic data-set (called ARASTEM), which was conceived and collected from several Arabic discussion forums containing dialectical Arabic and modern pseudo-Arabic languages. We present a comparative investigation between our new stemmer and other existing stemmers using Paice’s parameters, namely: Under Stemming Index (UI), Over Stemming Index (OI) and Stemming Weight (SW). Results show that the proposed Arabic light stemmer maintains consistently high performance and outperforms several existing light stemmers.

20.
Morphologically rich languages pose a challenge for statistical machine translation (SMT). This challenge is magnified when translating into a morphologically rich language. In this work we address this challenge in the framework of a broad-coverage English-to-Arabic phrase based statistical machine translation (PBSMT). We explore the largest-to-date set of Arabic segmentation schemes ranging from full word form to fully segmented forms and examine the effects on system performance. Our results show a difference of 2.31 BLEU points averaged over all test sets between the best and worst segmentation schemes indicating that the choice of the segmentation scheme has a significant effect on the performance of an English-to-Arabic PBSMT system in a large data scenario. We show that a simple segmentation scheme can perform as well as the best and more complicated segmentation scheme. An in-depth analysis on the effect of segmentation choices on the components of a PBSMT system reveals that text fragmentation has a negative effect on the perplexity of the language models and that aggressive segmentation can significantly increase the size of the phrase table and the uncertainty in choosing the candidate translation phrases during decoding. An investigation conducted on the output of the different systems, reveals the complementary nature of the output and the great potential in combining them.
