Similar Documents
20 similar documents retrieved (search time: 0 ms)
1.
This paper describes the preparation, recording, analysis, and evaluation of a new speech corpus for Modern Standard Arabic (MSA). The corpus contains a total of 415 sentences recorded by 40 Arabic native speakers (20 male and 20 female) from 11 different Arab countries representing three major regions (Levant, Gulf, and Africa). Of these, 367 sentences are phonetically rich and balanced and are used for training Arabic Automatic Speech Recognition (ASR) systems: rich in the sense that they contain all phonemes of the Arabic language, and balanced in the sense that they preserve its phonetic distribution. The remaining 48 sentences were created for testing purposes; they are largely disjoint from the training sentences, with hardly any word overlap. To evaluate the corpus, Arabic ASR systems were developed using the Carnegie Mellon University (CMU) Sphinx 3 tools at both the training and testing/decoding levels. The speech engine uses 3-emitting-state Hidden Markov Models (HMMs) for tri-phone-based acoustic models. Experimental analysis on about 8 h of training speech showed that the acoustic model performs best with a continuous observation probability model of 16 Gaussian mixture distributions and with state distributions tied to 500 senones. The language model contains uni-grams, bi-grams, and tri-grams. For the same speakers with different sentences, the Arabic ASR systems obtained an average Word Error Rate (WER) of 9.70%; for different speakers with the same sentences, an average WER of 4.58%; and for different speakers with different sentences, an average WER of 12.39%.
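Since all three evaluation figures above are WERs, a minimal sketch of how WER is computed may be useful; the function below is illustrative and is not taken from the paper or from the CMU Sphinx 3 toolkit.

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```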

2.
3.
This paper describes various techniques for normalizing and adapting a knowledge database or reference templates to new speakers in automatic speech recognition (ASR). It focuses on a technique for learning spectral transformations, based on a statistical-analysis tool (canonical correlation analysis), to adapt a standard dictionary to arbitrary speakers. The proposed method should make it possible to improve speaker independence in large-vocabulary ASR. Applied to an isolated-word recognizer, it improved a 70% correct score to 87%. A dynamic aspect of the speaker adaptation procedure is also introduced and evaluated within a particular strategy.
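A hedged sketch of the statistical core of such an approach, assuming scikit-learn's CCA and toy paired spectra; the paper's exact features and mapping procedure are not specified here.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
new_spk = rng.normal(size=(200, 12))             # 200 frames x 12 spectral coeffs
ref_spk = new_spk @ rng.normal(size=(12, 12)) \
          + 0.1 * rng.normal(size=(200, 12))     # toy "reference speaker" spectra

cca = CCA(n_components=8)
cca.fit(new_spk, ref_spk)                        # fit on paired frames of both speakers

# Map unseen frames of the new speaker into the reference speaker's space,
# so the standard dictionary/templates can be matched directly.
adapted = cca.predict(new_spk[:5])
print(adapted.shape)                             # (5, 12)
```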

4.
A conventional approach to noise-robust speech recognition employs a speech enhancement pre-processor prior to recognition. However, such a pre-processor usually introduces artifacts that limit the achievable recognition improvement. In this paper we discuss a framework for improving the interconnection between speech enhancement pre-processors and a recognizer. The framework builds on recent proposals that increase robustness by replacing the point estimate of the enhanced features with a distribution having a dynamic (i.e. time-varying) feature variance. We recently proposed a model in which the dynamic feature variance is the product of a dynamic-feature-variance root obtained from the pre-processor and a weight representing the pre-processor uncertainty, with adaptation data used to optimize that weight. The formulation is general and can be used with any speech enhancement pre-processor. However, we observed that for noise reduction based on spectral subtraction or related approaches, adaptation can fail because the model represents the actual dynamic feature variance poorly. The dynamic feature variance changes with the level of the speech sound, which varies across HMM states. We therefore propose improving the model by introducing HMM-state dependency. We achieve this with a cluster-based representation: the Gaussians of the acoustic model are grouped into clusters, and a different pre-processor uncertainty weight is associated with each cluster. Experiments with various pre-processors and recognition tasks demonstrate the generality of the proposed integration scheme and show that the extension improves performance across speech enhancement pre-processors.
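A minimal numpy sketch of the cluster-dependent uncertainty idea, under the assumption (standard in uncertainty decoding) that the weighted dynamic variance is added to each Gaussian's variance; all names and values are illustrative.

```python
import numpy as np

def log_gauss(x, mean, var):
    # Log-likelihood of a diagonal-covariance Gaussian.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = np.array([0.2, -0.1])           # enhanced feature (point estimate)
dyn_var = np.array([0.05, 0.08])    # dynamic variance root from the pre-processor

mean, var = np.zeros(2), np.ones(2)     # one Gaussian of the acoustic model
cluster_weight = {0: 0.5, 1: 2.0}       # learned per-cluster uncertainty weights
gauss_cluster = 1                       # cluster this Gaussian was assigned to

# Evaluate the state likelihood with the inflated (uncertain) variance.
eff_var = var + cluster_weight[gauss_cluster] * dyn_var
print(log_gauss(x, mean, eff_var))
```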

5.
6.
A text-to-speech (TTS) system, also known as a speech synthesizer, has become an important technology in recent years owing to its expanding field of applications. Considerable work on speech synthesis has been done for English and French, whereas many other languages, including Arabic, have only recently been taken into consideration. Arabic speech synthesis has not yet made sufficient progress and is still at an early stage, with low speech quality. Speech synthesis systems face several problems (e.g. speech quality and articulatory effects). Different methods have been proposed to address these issues, such as the use of large and varied unit sizes; this method is mainly implemented within the concatenative approach to improve speech quality, and several works have proved its effectiveness. This paper presents an efficient Arabic TTS system based on a statistical parametric approach and non-uniform-unit speech synthesis. Our system includes a diacritization engine: modern Arabic text is written without the vowels, also called diacritic marks, yet these marks are essential for determining the correct pronunciation of the text, which explains their incorporation into our system. We propose a simple approach based on deep neural networks, trained to directly predict the diacritic marks and to predict the spectral and prosodic parameters. Furthermore, we propose a new, simple stacked-neural-network approach to improve the accuracy of the acoustic models. Experimental results show that our diacritization system generates fully diacritized text with high precision and that our synthesis system produces high-quality speech.
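A simplified sketch of a stacked-network setup of the kind the abstract names, using scikit-learn MLPs on toy data; the paper's actual architecture, features, and diacritic label set are assumptions here.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # per-character context features (toy)
y = rng.integers(0, 8, size=500)        # 8 diacritic-mark classes (assumed)

# First network predicts diacritics from the raw features.
net1 = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, y)

# Stack: the second network sees the raw features plus the first
# network's class posteriors, and refines the prediction.
X_stacked = np.hstack([X, net1.predict_proba(X)])
net2 = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X_stacked, y)

print(net2.predict(X_stacked[:3]))
```

(In practice the first network's outputs would be produced with held-out or cross-validated predictions to avoid overfitting the stack; the single-pass version above is only for brevity.)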

7.
A real-time prototype speech recognizer has been implemented on a 66-processor distributed-memory parallel computer. Simple phrases are recognized in approximately 4 to 10 seconds. Scalability, performance, and flexibility are the three main aims of this implementation, the ultimate goal being to construct a large-vocabulary speech recognizer that responds quickly. A set of three techniques is investigated in this implementation: asynchronous methodology to minimize synchronization overheads, distributed control to avoid a central communications bottleneck, and dynamic load balancing to provide a flexible response to an unpredictable computational load. The effect on memory, processor time allocation, and communications is observed in real time using hardware monitoring aids.
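A toy sketch of one common dynamic load-balancing scheme that avoids a central dispatcher, work stealing; the paper's actual scheme on its 66-processor machine may well differ, and this is only an illustration of the general idea.

```python
import collections
import random
import threading
import time

NUM_WORKERS = 4
# Each worker owns a local deque of tasks; load is initially uneven in real use.
queues = [collections.deque(range(i * 10, i * 10 + 10)) for i in range(NUM_WORKERS)]
done = []

def worker(wid):
    while True:
        try:
            task = queues[wid].popleft()             # take local work first
        except IndexError:
            victims = [q for i, q in enumerate(queues) if i != wid and q]
            if not victims:
                return                               # nothing left anywhere
            try:
                task = random.choice(victims).pop()  # steal from the other end
            except IndexError:
                continue                             # lost the race, retry
        time.sleep(0.001)                            # simulate uneven work
        done.append((wid, task))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(done), "tasks completed")
```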

8.
Text visualization has become a significant tool that facilitates knowledge discovery and the insightful presentation of large amounts of data. This paper presents ViStA, a visualization system for exploring Arabic text. We report on its design and implementation and on some of the experiments we conducted with the system. Such tools help Arabic language analysts effectively explore, understand, and discover interesting knowledge hidden in text data. We used statistical techniques from the field of Information Retrieval to identify relevant documents, coupled with sophisticated natural language processing (NLP) tools to process the text. For text visualization, the system uses a hybrid approach combining latent semantic indexing for feature selection and multidimensional scaling for dimensionality reduction. Initial results confirm the viability of this approach for Arabic text visualization and other Arabic NLP applications.
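A hedged sketch of the named pipeline (TF-IDF features, latent semantic indexing via truncated SVD, then multidimensional scaling down to 2-D for plotting); the corpus and parameter choices below are toy assumptions, not ViStA's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import MDS

docs = ["النص الأول", "النص الثاني", "نص ثالث مختلف", "وثيقة رابعة"]  # toy documents

tfidf = TfidfVectorizer().fit_transform(docs)                  # bag-of-words features
lsi = TruncatedSVD(n_components=3, random_state=0).fit_transform(tfidf)  # LSI
coords = MDS(n_components=2, random_state=0).fit_transform(lsi)          # 2-D layout
print(coords)    # one (x, y) point per document, ready to scatter-plot
```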

9.
A comprehensive Arabic handwritten text database is an essential resource for Arabic handwritten text recognition research, especially given the lack of such databases for Arabic handwritten text. In this paper, we report our comprehensive offline Arabic Handwritten Text database (KHATT), consisting of 1000 handwritten forms written by 1000 distinct writers from different countries. The forms were scanned at 200, 300, and 600 dpi resolutions. The database contains 2000 randomly selected paragraphs from 46 sources, 2000 minimal-text paragraphs covering all the shapes of Arabic characters, and optional paragraphs written on open subjects. The 2000 random text paragraphs comprise 9327 lines. The forms were randomly divided into training, testing, and verification sets of 70%, 15%, and 15%, respectively, enabling researchers to use the database and compare their results. A formal verification procedure was implemented to align the handwritten text with its ground truth at the form, paragraph, and line levels. The verified ground-truth database contains meta-data describing the written text at the page, paragraph, and line levels in text and XML formats. Tools to extract paragraphs from pages and to segment paragraphs into lines were developed. In addition, we present experimental results on the database using two classifiers, viz. Hidden Markov Models (HMM) and our novel syntactic classifier.
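A minimal sketch of the reported 70%/15%/15% random division of the 1000 forms; the form identifiers are placeholders, not KHATT's actual file names.

```python
import random

forms = [f"form_{i:04d}" for i in range(1000)]   # placeholder form identifiers
random.seed(0)                                   # fixed seed for a reproducible split
random.shuffle(forms)

train = forms[:700]      # 70% training
test = forms[700:850]    # 15% testing
verify = forms[850:]     # 15% verification
print(len(train), len(test), len(verify))        # 700 150 150
```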

10.
11.
Building a large-vocabulary continuous speech recognition (LVCSR) system requires many hours of segmented and labelled speech data. Arabic, like many other low-resourced languages, lacks such data, but automatic segmentation has proved to be a good alternative for making these resources available. In this paper, we suggest combining hidden Markov models (HMMs) and support vector machines (SVMs) to segment the speech waveform into phoneme units and label them: the HMMs generate the sequence of phonemes and their boundaries, and the SVM refines the boundaries and corrects the labels. The resulting segmented and labelled units may serve as a training set for speech recognition applications. The HMM/SVM segmentation algorithm is assessed using both the hit rate and the word error rate (WER); the resulting scores were compared to those obtained with manual segmentation and with the well-known embedded learning algorithm. The results show that the speech recognizer built on the HMM/SVM segmentation outperforms the one built on the embedded-learning segmentation by about 0.05% WER, even against a noisy background.
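A toy sketch of the refinement step, assuming the SVM scores the frames around each HMM-proposed boundary and the best-scoring frame becomes the refined boundary; the features and the scoring rule are assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 13))          # 13-dim frame features (e.g. MFCCs)
y_train = rng.integers(0, 2, size=300)        # 1 = frame is a phoneme boundary (toy)

svm = SVC(probability=True).fit(X_train, y_train)

hmm_boundary = 50                             # frame index proposed by the HMM
window = np.arange(hmm_boundary - 3, hmm_boundary + 4)   # candidates around it
frames = rng.normal(size=(len(window), 13))   # features of the candidate frames

# Keep the candidate the SVM considers most boundary-like.
refined = window[np.argmax(svm.predict_proba(frames)[:, 1])]
print("refined boundary frame:", refined)
```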

12.
Emotion recognition is important in human-computer interaction. To improve recognition accuracy, speech and text features are fused: the speech features are acoustic and prosodic features, and the text features are bag-of-words (BoW) features based on an emotion lexicon together with an N-gram model. Speech and text features are fused at both the feature level and the decision level, and their performance on four-class emotion recognition on IEMOCAP is compared. Experiments show that fusing speech and text features outperforms either feature type alone, and that decision-level fusion outperforms feature-level fusion. With a convolutional neural network (CNN) classifier, decision-level fusion of speech and text features achieves an unweighted average recall (UAR) of 68.98%, surpassing the previous best result on the IEMOCAP dataset.
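A minimal sketch of decision-level fusion as described: average the class posteriors of separate speech and text classifiers. The posteriors below are placeholders for real CNN outputs, and the class order is an assumption about the four-class IEMOCAP setup.

```python
import numpy as np

classes = ["angry", "happy", "neutral", "sad"]           # assumed 4-class setup
p_speech = np.array([0.10, 0.60, 0.20, 0.10])            # speech-model posteriors
p_text   = np.array([0.05, 0.30, 0.50, 0.15])            # text-model posteriors

p_fused = (p_speech + p_text) / 2                        # unweighted average
print(classes[int(np.argmax(p_fused))])                  # "happy"
```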

13.
This study explores the relationships between application usage, online communication patterns, problematic Internet use (PIU) of online applications, and online self-disclosure among children from culturally different groups. An online survey was administered in Hebrew and Arabic to 3867 Israeli children aged 7–17, including Jews, Arabs, and Bedouins. The level of PIU was relatively low: only 9.5% scored "very high" on the PIU index. All groups reported the highest level of communication for safe interactions with family and friends, a lower level for purely virtual communication with online acquaintances, and the lowest level for meeting online acquaintances face-to-face. However, the forms of online communication and the use of applications differed across the groups, suggesting cultural diversity in Internet usage among children in the same country. PIU and self-disclosure explained 47.3% of the variance in risky e-communication activities (e.g. sending one's photos to online acquaintances, providing them with a school or home address, and meeting them face-to-face), as well as 34.4% of the variance in exposure to unpleasant online experiences (e.g. receiving messages, pictures, or videos that make the child feel uncomfortable). However, both PIU and self-disclosure were unrelated to educational activities and to the use of educational applications.

14.
15.
We present the results of a study of Russian language models built with recurrent artificial neural networks for continuous speech recognition systems. We construct neural network models with different numbers of elements in the hidden layer and linearly interpolate them with the baseline trigram language model. The resulting models were used at the stage of rescoring the N-best list. In our experiments on the recognition of continuous Russian speech with an extra-large vocabulary (150 thousand word forms), rescoring the 50-best list with the neural network language models interpolated with the trigram model yielded a relative word-error-rate reduction of 14%.
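A hedged sketch combining the two steps the abstract describes, linear interpolation of RNN and trigram probabilities followed by N-best rescoring; the hypotheses, scores, interpolation weight, and language-model scale are all toy values.

```python
import math

lam = 0.5                                   # interpolation weight (assumed)

def interp_logprob(p_rnn, p_tri):
    # Linear interpolation of the two models' word probabilities.
    return math.log(lam * p_rnn + (1 - lam) * p_tri)

# (hypothesis, acoustic log-score, per-word (RNN prob, trigram prob) pairs)
nbest = [
    ("привет мир",   -120.0, [(0.20, 0.10), (0.05, 0.08)]),
    ("привет мираж", -119.0, [(0.20, 0.10), (0.001, 0.002)]),
]

def rescore(entry, lm_weight=10.0):
    hyp, acoustic, word_probs = entry
    lm = sum(interp_logprob(pr, pt) for pr, pt in word_probs)
    return acoustic + lm_weight * lm

best = max(nbest, key=rescore)
print(best[0])    # the hypothesis the interpolated model prefers
```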

16.
This is a review of methods for the automated analysis of texts. It describes the features of the algorithms and programs used at the morphological, lexical, syntactic, and discourse levels of the language system.

17.
Multimedia Tools and Applications - Sound duration is responsible for rhythm and speech rate. Furthermore, in some languages phoneme length is an important phonetic and prosodic factor. For...

18.
Formative assessment and summative assessment are two widely accepted approaches to assessment. While summative assessment is a typically formal assessment used at the end of a lesson or course, formative assessment is an ongoing process of monitoring learners' progress in knowledge construction. Although empirical evidence has established that formative assessment is indeed superior to summative assessment, current e-learning assessment systems seldom provide plausible solutions for conducting it. The major bottleneck in putting formative assessment into practice lies in its labor-intensive and time-consuming nature, which makes it hardly feasible for achievement evaluation, especially given the usually large number of learners in an e-learning environment. In this regard, this study developed EduMiner to relieve the burden imposed on instructors and learners by capitalizing on a series of text mining techniques. An empirical study was conducted to examine the effectiveness and explore the outcomes of the features that EduMiner supports; 56 participants enrolled in a "Human Resource Management" course were randomly divided into experimental and control groups. The results indicate that the algorithms introduced in this study offer a feasible approach to conducting formative assessment in an e-learning environment. In addition, learners in the experimental groups were highly motivated to phrase their contributions at a higher-order level of cognition. Timely feedback with visualized representations thus helps online learners express more in-depth ideas in their discourse.

19.
20.