This paper describes the preparation, recording, analysis, and evaluation of a new speech corpus for Modern Standard Arabic (MSA). The speech corpus contains a total of 415 sentences recorded by 40 (20 male and 20 female) Arabic native speakers from 11 different Arab countries representing three major regions (Levant, Gulf, and Africa). Three hundred and sixty-seven sentences are considered phonetically rich and balanced and are used for training Arabic Automatic Speech Recognition (ASR) systems. "Rich" means that the set must contain all phonemes of the Arabic language, whereas "balanced" means that it must preserve the phonetic distribution of the Arabic language. The remaining 48 sentences were created for testing purposes; they are mostly foreign to the training sentences and share hardly any words with them. In order to evaluate the speech corpus, Arabic ASR systems were developed using the Carnegie Mellon University (CMU) Sphinx 3 tools at both training and testing/decoding levels. The speech engine uses 3-emitting state Hidden Markov Models (HMM) for tri-phone based acoustic models. Based on experimental analysis of about 8 h of training speech data, the acoustic model performs best with a continuous observation probability model of 16 Gaussian mixture distributions and with the state distributions tied to 500 senones. The language model contains uni-grams, bi-grams, and tri-grams. For same speakers with different sentences, Arabic ASR systems obtained an average Word Error Rate (WER) of 9.70%. For different speakers with same sentences, Arabic ASR systems obtained an average WER of 4.58%, whereas for different speakers with different sentences, Arabic ASR systems obtained an average WER of 12.39%.
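The abstract above reports results as Word Error Rate. As a minimal sketch of how WER is conventionally computed (word-level edit distance divided by the number of reference words), the function below may help; the name `word_error_rate` and the whitespace tokenization are assumptions for illustration, not the scoring tool used with Sphinx 3.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For instance, a four-word reference decoded with one substitution and one deletion gives a WER of 0.5. Because insertions count as errors, the value can exceed 1.0.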

This paper investigates issues in preparing corpora for developing speech-to-speech translation (S2ST). It is impractical to create a broad-coverage parallel corpus only from dialog speech. An alternative approach is to have bilingual experts write conversational-style texts in the target domain, with translations. There is, however, a risk of losing fidelity to the actual utterances. This paper focuses on balancing a tradeoff between these two kinds of corpora through the analysis of two newly developed corpora in the travel domain: a bilingual parallel corpus with 420K utterances and a collection of in-domain dialogs using actual S2ST systems. We found that the first corpus is effective for covering utterances in the second corpus if complemented with a small number of utterances taken from monolingual dialogs. We also found that characteristics of in-domain utterances become closer to those of the first corpus when more restrictive conditions and instructions to speakers are given. These results suggest the possibility of a bootstrap style of development of corpora and S2ST systems, where an initial S2ST system is developed with parallel texts, and is then gradually improved with in-domain utterances collected by the system as restrictions are relaxed.

A vast amount of valuable human knowledge is recorded in documents. The rapid growth in the number of machine-readable documents for public or private access necessitates the use of automatic text classification. While a lot of effort has been put into Western languages (mostly English), minimal experimentation has been done with Arabic. This paper presents, first, an up-to-date review of the work done in the field of Arabic text classification and, second, a large and diverse dataset that can be used for benchmarking Arabic text classification algorithms. The different techniques derived from the literature review are illustrated by their application to the proposed dataset. The results of various feature selections, weighting methods, and classification algorithms show, on average, the superiority of support vector machine, followed by the decision tree algorithm (C4.5) and Naïve Bayes. The best classification accuracy was 97% for the Islamic Topics dataset, and the least accurate was 61% for the Arabic Poems dataset.

In this paper, we describe tools and resources for the study of African languages developed at the Collaborative Research Centre 632 “Information Structure”. These include deeply annotated data collections of 25 sub-Saharan languages that are described together with their annotation scheme, as well as the corpus tool ANNIS, which provides unified access to a broad variety of annotations created with a range of different tools. With the application of ANNIS to several African data collections, we illustrate its suitability for the purpose of language documentation, distributed access, and the creation of data archives.

This paper presents a lexical analyser for inflected Arabic words. An augmented transition network (ATN) technique was used to represent the context-sensitive knowledge about the relation between a stem and inflectional additions. An exhaustive-search algorithm is developed to traverse the ATN, generating all possible interpretations of an inflected Arabic word. The arcs of the ATN are augmented with rules containing conditions and actions. More than one rule is associated with some arcs. The states of the ATN are represented by Pascal procedures.

This paper presents the synergetic use of different evaluation tools, parameterization schemes and search methods on the levels of a multilevel optimization platform to efficiently solve single- and multi-objective computationally demanding optimization problems. The platform is formed by a number of levels which concurrently search for optimal solutions by regularly exchanging promising individual solutions. Each level is associated with a problem-specific evaluation tool with its own accuracy and computational cost, a parameterization scheme which determines the design variables and their mapping to generate individual solutions, and a search algorithm which is either a metamodel-assisted evolutionary algorithm or a gradient-based method. The use of the multilevel platform with only one of the aforementioned features changing from level to level was presented in a previous paper by the authors. The present paper shows that the combined use of hierarchical evaluation, hierarchical parameterization and hierarchical search further decreases the computational cost by increasing the efficiency of the optimization method. This is demonstrated on function minimization and aerodynamic shape optimization problems; though only two levels are used herein, this is not a restriction and the optimization platform may accommodate any number of them.


Software tools are of vital importance in corpus-based research, but they can also lead to restrictions on the type of supported corpora and the range of analyses that can be performed. For example, corpus analysis tools, as general purpose software, do not include specific features to process corpora of theatre plays. This situation is even worse for parallel corpora of theatrical texts, in that there is currently a lack of software that allows for both the alignment and analysis of parallel corpora here. In this contribution, we will first outline the peculiarities of theatre texts and suggest three software features to address them: annotation of the structural units of plays, alignment at the utterance level, and concordances and statistics using the annotated units. Second, we will present the specific functionalities of TAligner and ACM to build and analyse parallel corpora of play texts, showing how new avenues of research are opening up with the development of these tools.


Multimedia Tools and Applications - Sound duration is responsible for rhythm and speech rate. Furthermore, in some languages phoneme length is an important phonetic and prosodic factor. For...

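When a metasearch engine merges the results of its member search engines, the ranking of the merged results largely determines its quality of service. To merge results effectively, current techniques mainly combine factors such as the query, document content, the initial ranking, and/or weights assigned to the member search engines. When engine weights are used, they are usually assigned subjectively from the experience of users and technical staff, and so fail to reflect users' real search preferences. To address this, a method is proposed that assigns weights to the member search engines dynamically by mining users' search and browsing behavior. From users' browsing, click, and download records, the degree of match between user browsing and the returned results is obtained, and the feasibility and effectiveness of the method are demonstrated.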
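The abstract above does not give the weighting formula, so the following is only a rough sketch of the general idea, under stated assumptions: each member engine's weight is taken as a smoothed click-through ratio (clicked results over shown results), and merged scores use a weighted reciprocal-rank sum. The function names `engine_weights` and `merge_rankings` and both formulas are hypothetical, not the paper's method.

```python
from collections import defaultdict


def engine_weights(shown: dict, clicked: dict) -> dict:
    """Derive normalized engine weights from click logs (assumed formula)."""
    # Add-one smoothing keeps engines with no clicks from getting weight 0.
    raw = {e: (clicked.get(e, 0) + 1) / (shown[e] + 2) for e in shown}
    total = sum(raw.values())
    return {e: value / total for e, value in raw.items()}


def merge_rankings(rankings: dict, weights: dict) -> list:
    """Merge per-engine ranked lists with a weighted reciprocal-rank score."""
    scores = defaultdict(float)
    for engine, docs in rankings.items():
        for rank, doc in enumerate(docs, start=1):
            scores[doc] += weights[engine] / rank
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))
```

With this scheme, an engine whose results users click more often contributes more to the merged ranking, and the weights change as new click logs arrive.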

Universal Access in the Information Society - As mobile devices, such as e-book readers and tablet computers, have emerged as alternatives to traditional printed media, they are also being...

This paper introduces Ontoolsearch, a new search system that can be employed by educators in order to find suitable tools for supporting collaborative learning settings. Current tool search facilities commonly allow simple keyword searches, limiting the accuracy of obtained results. In contrast, Ontoolsearch supports semantic querying of tool knowledge bases annotated with the Ontoolcole ontology, specifically designed to fit educators’ questions. Moreover, Ontoolsearch offers an innovative direct manipulation interface to educators, intended to facilitate query formulation as well as the analysis of obtained results. To evaluate this proposal, a group of educators was engaged in a formal comparison study of Ontoolsearch with a keyword search facility based on Lucene. Six search tasks were proposed, each responding to the learning tool needs of a real CSCL setting. Participants had to find tools for these search tasks using both systems alternately. Evaluation results showed that retrieval performance was significantly better with Ontoolsearch, despite educators’ previous experience with keyword searches. Further, educators rated the user interface of Ontoolsearch very positively and considered the system very useful for finding tools for their own learning situations.

Multimedia Tools and Applications - Interactive video retrieval tools developed over the past few years are emerging as powerful alternatives to automatic retrieval approaches by giving the user...

Automatic Speaker Recognition (ASR) refers to the task of identifying a person based on his or her voice with the help of machines. ASR has potential applications in telephone-based financial transactions, credit card purchases, forensic science, and social anthropology for the study of different cultures and languages. Results of ASR are highly dependent on the database, i.e., the results obtained in ASR are meaningless if recording conditions are not known. In this paper, a methodology and a typical experimental setup used for development of corpora for various tasks in text-independent speaker identification in different Indian languages, viz., Marathi, Hindi, Urdu and Oriya, are described. Finally, an ASR system is presented to evaluate the corpora.

SAMAR is a system for subjectivity and sentiment analysis (SSA) for Arabic social media genres. Arabic is a morphologically rich language, which presents significant complexities for standard approaches to building SSA systems designed for the English language. Apart from the difficulties presented by processing social media genres, the Arabic language inherently has a high number of variable word forms, leading to data sparsity. In this context, we address the following four pertinent issues: how to best represent lexical information; whether standard features used for English are useful for Arabic; how to handle Arabic dialects; and whether genre-specific features have a measurable impact on performance. Our results show that using either lemma or lexeme information is helpful, as well as using the two part-of-speech tagsets (RTS and ERTS). However, the results show that we need individualized solutions for each genre and task, but that lemmatization and the ERTS POS tagset are present in a majority of the settings.

龚道雄  刘翔 《计算机应用研究》2011,28(12):4433-4436
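Targeting search-and-rescue applications, this study abstracts the accident environment as a maze, uses simulation experiments to compare depth-first search with the A* algorithm under three different heuristic functions in perfect mazes, and also implements and compares depth-first search and A* in a real maze. In the experiments the maze is unknown to the robot, and because an unknown maze offers few collision-free paths, the robot's search is harder. The simulations compare the performance of A* with different heuristic functions against depth-first search and conclude that A* outperforms depth-first search in maze search; both algorithms were also implemented for search and rescue in a real maze.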
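As an illustration of the A* search named in the abstract above, here is a minimal sketch on a grid maze. It makes several assumptions the abstract does not confirm: the map is fully known in advance (the paper's robot explores an unknown maze), moves are 4-connected with unit cost, and the heuristic is Manhattan distance, which may or may not be one of the three heuristics the authors tested. The name `astar` and the grid encoding (0 = free, 1 = wall) are hypothetical.

```python
import heapq


def astar(grid, start, goal):
    """Return a shortest path as a list of (row, col) cells, or None."""
    def heuristic(cell):
        # Manhattan distance: admissible for 4-connected unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    open_heap = [(heuristic(start), 0, start)]  # entries are (f, g, cell)
    best_g = {start: 0}
    parent = {start: None}

    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if g > best_g[cell]:
            continue  # stale heap entry; a cheaper route was found later
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < best_g.get(nxt, float("inf")):
                    best_g[nxt] = new_g
                    parent[nxt] = cell
                    heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None  # goal unreachable
```

Swapping `heuristic` for a function that always returns 0 turns this into uniform-cost search, which is one simple way to see how much the heuristic reduces the number of expanded cells.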

In this study, we introduce Slovene web-crawled news corpora with sentiment annotation on three levels of granularity: sentence, paragraph and document levels. We describe the methodology and tools that were required for their construction. The corpora contain more than 250,000 documents with political, business, economic and financial content from five Slovene media resources on the web. More than 10,000 of them were manually annotated as negative, neutral or positive. All corpora are publicly available under a Creative Commons copyright license. We used the annotated documents to construct a Slovene sentiment lexicon, which is the first of its kind for Slovene, and to assess the sentiment classification approaches used. The constructed corpora were also utilised to monitor within-the-document sentiment dynamics, its changes over time and relations with news topics. We show that sentiment is, on average, more explicit at the beginning of documents, and it loses sharpness towards the end of documents.

For stroke-order free online multi-stroke character recognition, stroke-to-stroke correspondence search between an input pattern and a reference pattern plays an important role in dealing with stroke-order variation. Although various methods of stroke correspondence have been proposed, no comparative study clarifying the relative superiority of those methods has been done before. In this paper, we first review the approaches for solving the stroke-order variation problem. Then, five representative methods of stroke correspondence proposed by different groups, including cube search (CS), bipartite weighted matching (BWM), individual correspondence decision (ICD), stable marriage (SM), and deviation-expansion model (DE), are experimentally compared, mainly with regard to recognition accuracy and efficiency. The experimental results on an online Kanji character dataset showed that CS, BWM, ICD, SM, and DE attained recognition rates of 99.17%, 99.17%, 96.37%, 98.54%, and 96.59%, respectively. Extensive discussions are made on their relative superiorities and practicalities.
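To make the stroke-correspondence problem above concrete, the sketch below finds the one-to-one assignment of input strokes to reference strokes with minimum total distance, which is the objective that bipartite weighted matching optimizes. It is illustrative only: it enumerates all permutations instead of using an efficient assignment algorithm, and the stroke representation (start and end points) and the distance in `stroke_distance` are assumptions, not the features used in the compared methods.

```python
from itertools import permutations
from math import dist


def stroke_distance(a, b):
    """Assumed dissimilarity: distance between start points plus end points."""
    return dist(a[0], b[0]) + dist(a[1], b[1])


def best_correspondence(input_strokes, reference_strokes):
    """Return (mapping, cost); mapping[i] is the reference index for input i."""
    n = len(reference_strokes)
    if len(input_strokes) != n:
        raise ValueError("this sketch assumes equal stroke counts")

    best_cost, best_mapping = float("inf"), None
    # Brute force over all n! assignments; only practical for few strokes.
    for mapping in permutations(range(n)):
        cost = sum(
            stroke_distance(input_strokes[i], reference_strokes[j])
            for i, j in enumerate(mapping)
        )
        if cost < best_cost:
            best_cost, best_mapping = cost, mapping
    return best_mapping, best_cost
```

If a character is written with its two strokes in reversed order, the returned mapping is `(1, 0)`, so the recognizer can compare each input stroke with the correct reference stroke regardless of writing order.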
