Similar literature
 20 similar documents found (search time: 0 ms)
1.
2.
3.
Multimedia Tools and Applications - Research in emotion recognition seeks to develop insights into the variances of features of emotion in one common domain. However, automatic emotion recognition...

4.
In this paper we present a speech-to-speech (S2S) translation system called the BBN TransTalk that enables two-way communication between speakers of English and speakers who do not understand or speak English. The BBN TransTalk has been configured for several languages including Iraqi Arabic, Pashto, Dari, Farsi, Malay, Indonesian, and Levantine Arabic. We describe the key components of our system: automatic speech recognition (ASR), machine translation (MT), text-to-speech (TTS), dialog manager, and the user interface (UI). In addition, we present novel techniques for overcoming specific challenges in developing high-performing S2S systems. For ASR, we present techniques for dealing with a lack of pronunciation and linguistic resources and for effectively modeling ambiguity in the pronunciations of words in these languages. For MT, we describe techniques for dealing with data sparsity as well as modeling context. We also present and compare different user confirmation techniques for detecting errors that can cause the dialog to drift or stall.

5.
Most contemporary speech recognition systems exploit complex algorithms based on Hidden Markov Models (HMMs) to achieve high accuracy. However, in some cases rich computational resources are not available, and even isolated-word recognition becomes a challenging task. In this paper, we present two ways to simplify scoring in HMM-based speech recognition in order to reduce its computational complexity. We focus on the core HMM procedure, the forward algorithm, which uses dynamic programming to find the probability that a given HMM generates an observation sequence. All proposed approaches were tested on Russian word recognition and the results were compared with those of the conventional forward algorithm.
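
For reference, the conventional forward algorithm mentioned in this abstract can be sketched as follows. This is a minimal illustration for a discrete-observation HMM, not the simplified scoring the paper proposes; the function name `forward_log_likelihood`, the argument layout, and the per-step rescaling are assumptions made for the example.

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Return log P(obs | HMM) with the scaled forward algorithm.

    pi  : (N,)   initial state probabilities
    A   : (N, N) transitions, A[i, j] = P(next state j | current state i)
    B   : (N, M) emissions,   B[i, k] = P(symbol k | state i)
    obs : non-empty sequence of observation symbol indices
    """
    pi, A, B = (np.asarray(m, dtype=float) for m in (pi, A, B))
    # Initialisation: probability of starting in each state and emitting obs[0].
    alpha = pi * B[:, obs[0]]
    log_prob = 0.0
    for symbol in obs[1:]:
        # Rescale to avoid underflow, keeping the scale factor in log space.
        scale = alpha.sum()
        log_prob += np.log(scale)
        # Induction: propagate through the transitions, then emit the symbol.
        alpha = ((alpha / scale) @ A) * B[:, symbol]
    # Termination: sum over all final states.
    return log_prob + np.log(alpha.sum())
```

In isolated-word recognition, one such score would typically be computed per word model and the highest-scoring word chosen; the cost is O(N²T) for N states and T observations, which is the load the paper aims to reduce.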

6.
This paper describes our recent improvements to IBM TRANSTAC speech-to-speech translation systems that address various issues arising from dealing with resource-constrained tasks, which include both limited amounts of linguistic resources and training data, as well as limited computational power on mobile platforms such as smartphones. We show how the proposed algorithms and methodologies can improve the performance of automatic speech recognition, statistical machine translation, and text-to-speech synthesis, while achieving low-latency two-way speech-to-speech translation on mobiles.

7.
8.
A novel approach for joint speaker identification and speech recognition is presented in this article. Unsupervised speaker tracking and automatic adaptation of the human-computer interface are achieved by the interaction of speaker identification, speech recognition and speaker adaptation for a limited number of recurring users. Together with a technique for efficient information retrieval, a compact modeling of speech and speaker characteristics is presented. Applying speaker-specific profiles allows speech recognition to take individual speech characteristics into consideration to achieve higher recognition rates. Speaker profiles are initialized and continuously adapted by a balanced strategy of short-term and long-term speaker adaptation combined with robust speaker identification. Different users can be tracked by the resulting self-learning speech-controlled system. Only a very short enrollment of each speaker is required. Subsequent utterances are used for unsupervised adaptation, resulting in continuously improved speech recognition rates. Additionally, the detection of unknown speakers is examined with the aim of avoiding the need to train new speaker profiles explicitly. The speech-controlled system presented here is suitable for in-car applications, e.g. speech-controlled navigation, hands-free telephony or infotainment systems, on embedded devices. Results are presented for a subset of the SPEECON database. The results validate the benefit of the speaker adaptation scheme and the unified modeling in terms of speaker identification and speech recognition rates.

9.
Concept classification has been used as a translation method and has shown notable benefits within the suite of speech-to-speech translation applications. However, the main bottleneck in achieving acceptable performance with such classifiers is the cumbersome task of annotating large amounts of training data. Any attempt to develop a method to assist in, or to completely automate, data annotation needs a distance measure to compare sentences based on the concept they convey. Here, we introduce a new method of sentence comparison that is motivated from the translation point of view. In this method the imperfect translations produced by a phrase-based statistical machine translation system are used to compare the concepts of the source sentences. Three clustering methods are adapted to support the concept-based distance. These methods are applied to prepare groups of paraphrases and use them as training sets in concept classification tasks. The statistical machine translation system is also used to enhance the training data for the classifier, which is crucial when such data are sparse. Experiments show the effectiveness of the proposed methods.

10.
11.
This work, divided into Parts I and II, describes the development of GorUP, a Semantic Speech Recognition System in the Basque context. Part I analyses cross-lingual approaches oriented to under-resourced languages and Part II covers the development of the Language Identification system. During the development, data optimization methods and Soft Computing methodologies oriented to complex environments are used in order to overcome the lack of resources. Moreover, in this context three languages coexist: French, Spanish and Basque. Our main goal is the development of robust Automatic Speech Recognition (ASR) systems for Basque, but all language variability has to be analyzed. In this regard, Basque speakers mix not only sounds but also words of the three languages during speech, which results in a strong presence of cross-lingual elements. In addition, Basque is an agglutinative language with a special morpho-syntactic structure inside words that may lead to intractable vocabularies. Currently, our work is oriented to Information Retrieval and mainly to small internet mass media. In these cases the available resources for Basque in general, and for this task in particular, are very few and complex to process because of the noisy environment. Thus, the methods employed in this development (an ontology-based approach or cross-lingual methodologies that draw on better-resourced languages) could suit the requirements of many under-resourced languages.

12.
This work explores the use of speech enhancement to improve degraded speech for a text-dependent speaker verification system. The degradation may be due to noise or background speech. The text-dependent speaker verification is based on the dynamic time warping (DTW) method, which makes end point detection necessary. End point detection can be performed easily if the speech is clean. However, degradation tends to introduce errors in the estimated end points, and these errors propagate into the overall accuracy of the speaker verification system. Temporal and spectral enhancement is performed on the degraded speech so that, ideally, the enhanced speech resembles clean speech. Results show that the temporal and spectral processing methods contribute to the task by reducing the degradation, and improved accuracy is obtained for the text-dependent speaker verification system using DTW.
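
The DTW matching that this abstract relies on can be sketched as below. This is a textbook formulation under stated assumptions, not the paper's implementation: the Euclidean frame distance, the symmetric step pattern, the length normalisation, and the names `dtw_distance` and `accept` are all choices made for illustration.

```python
import numpy as np

def dtw_distance(x, y):
    """Accumulated DTW cost between two feature sequences.

    x, y : arrays of shape (frames,) or (frames, dims), e.g. frame-level
    features taken between the detected end points of each utterance.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # local frame distance
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def accept(test, template, threshold):
    """Accept the claimed speaker if the length-normalised cost is small."""
    score = dtw_distance(test, template) / (len(test) + len(template))
    return score <= threshold
```

Because the warping path is anchored at the first and last frames of both sequences, wrongly placed end points add mismatched frames to the accumulated cost, which is why the abstract stresses end point detection.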

13.
LVCSR systems are usually based on continuous density HMMs, which are typically implemented using Gaussian mixture distributions. Such statistical modeling systems tend to operate slower than real-time, largely because of the heavy computational overhead of the likelihood evaluation. The objective of our research is to investigate approximate methods that can substantially reduce the computational cost of likelihood evaluation without noticeably degrading the recognition accuracy. In this paper, the most common techniques to speed up the likelihood computation are classified into three categories, namely machine optimization, model optimization, and algorithm optimization. Each category is surveyed and summarized by describing and analyzing the basic ideas of the corresponding techniques. The distribution of the numerical values of Gaussian mixtures within a GMM is evaluated and analyzed to show that computations of some Gaussians are unnecessary and can thus be eliminated. Two commonly used techniques for likelihood approximation, namely VQ-based Gaussian selection and partial distance elimination, are analyzed in detail. Based on the analyses, a fast likelihood computation approach called dynamic Gaussian selection (DGS) is proposed. The DGS approach is a one-pass search technique which generates a dynamic shortlist of Gaussians for each state during the procedure of likelihood computation. In principle, DGS is an extension of both partial distance elimination and best mixture prediction, and it does not require additional memory for the storage of Gaussian shortlists. The DGS algorithm has been implemented by modifying the likelihood computation procedure in the HTK 3.4 system. Experimental results on the TIMIT and WSJ0 corpora indicate that this approach can speed up the likelihood computation significantly without introducing apparent additional recognition error.
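
As an illustration of partial distance elimination, one of the two techniques the abstract analyzes, the sketch below finds the best-scoring component of a diagonal-covariance GMM. It is not the proposed DGS algorithm. It assumes the common approximation of the mixture likelihood by its single best component, and the function name `best_component_pde` and its arguments are hypothetical.

```python
import numpy as np

def best_component_pde(x, log_weights, means, variances):
    """Return (index, log score) of the best-scoring Gaussian component.

    x           : (D,)   feature vector
    log_weights : (K,)   log mixture weights
    means       : (K, D) component means
    variances   : (K, D) diagonal variances
    """
    x = np.asarray(x, dtype=float)
    log_weights = np.asarray(log_weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Per-component constant: log weight plus the Gaussian normaliser.
    offsets = log_weights - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    best_index, best_score = -1, -np.inf
    for k in range(len(means)):
        score = offsets[k]
        for d in range(len(x)):
            # Each dimension subtracts a non-negative term from the score.
            score -= 0.5 * (x[d] - means[k, d]) ** 2 / variances[k, d]
            if score <= best_score:
                break  # this component can no longer win: skip the rest
        else:
            # Loop finished without a break, so this is the new best.
            best_index, best_score = k, score
    return best_index, best_score
```

The early exit is exact for the best-component search, because the remaining terms can only lower the score; the savings grow when a strong component is evaluated first.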

14.
15.
Several algorithms have been developed for tracking formant frequency trajectories of speech signals; however, most of these algorithms are either not robust in real-life noise environments or not suitable for real-time implementation. The algorithm presented in this paper obtains formant frequency estimates from voiced segments of continuous speech by using a time-varying adaptive filterbank to track individual formant frequencies. The formant tracker incorporates an adaptive voicing detector and a gender detector for formant extraction from continuous speech, for both male and female speakers. The algorithm has a low signal delay and provides smooth and accurate estimates for the first four formant frequencies at moderate and high signal-to-noise ratios. Thorough testing of the algorithm has shown that it is robust over a wide range of signal-to-noise ratios for various types of background noise.

16.
In this paper, an in-depth analysis is undertaken into effective strategies for integrating the audio-visual speech modalities with respect to two major questions. Firstly, at what level should integration occur? Secondly, given a level of integration, how should this integration be implemented? Our work is based around the well-known hidden Markov model (HMM) classifier framework for modeling speech. A novel framework for modeling the mismatch between train and test observation sets is proposed, so as to provide effective classifier combination performance between the acoustic and visual HMM classifiers. From this framework, it can be shown that strategies for combining independent classifiers, such as the weighted product or sum rules, naturally emerge depending on the influence of the mismatch. Based on the assumption that poor performance in most audio-visual speech processing applications can be attributed to train/test mismatches, we propose that the main impetus of practical audio-visual integration is to dampen the independent errors resulting from the mismatch, rather than trying to model any bimodal speech dependencies. To this end, a strategy is recommended, based on theory and empirical evidence, using a hybrid between the weighted product and weighted sum rules in the presence of varying acoustic noise for the task of text-dependent speaker recognition.
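
The two combination rules named in this abstract can be sketched as follows. The sketch assumes each classifier outputs a posterior distribution over the same classes and that a single weight `w` in [0, 1] expresses trust in the audio stream; the function names are hypothetical, and the paper's hybrid of the two rules is not reproduced because the abstract does not specify it.

```python
import numpy as np

def weighted_sum_rule(p_audio, p_visual, w):
    """Weighted sum rule: linear mixture of the two posteriors."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_visual = np.asarray(p_visual, dtype=float)
    return w * p_audio + (1.0 - w) * p_visual

def weighted_product_rule(p_audio, p_visual, w):
    """Weighted product rule: geometric mixture, renormalised to sum to 1."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_visual = np.asarray(p_visual, dtype=float)
    fused = p_audio ** w * p_visual ** (1.0 - w)
    return fused / fused.sum()
```

With `w = 1` both rules reduce to the audio classifier alone and with `w = 0` to the visual one, so under acoustic noise the weight would typically be lowered to favour the visual stream.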

17.
Several computational techniques for speech recognition have been proposed in recent years. These techniques represent an important improvement in real-time applications where a speaker interacts with speech recognition systems. Although researchers have proposed many methods, none of them solves the high false alarm problem that arises when far-field speakers interfere in a human-machine conversation. This paper presents a two-class (speech and non-speech) decision-tree based approach for combining new speech pulse features in a VAD (Voice Activity Detector) for rejecting far-field speech in speech recognition systems. The decision tree is applied to the speech pulses obtained by a baseline VAD composed of a frame feature extractor, an HMM-based (Hidden Markov Model) segmentation module and a pulse detector. The paper also presents a detailed analysis of a large number of features for discriminating between close and far-field speech. The detection error obtained with the proposed VAD is the lowest compared to other well-known VADs.

18.
This paper presents an effective approach for unsupervised language model adaptation (LMA) using multiple models in offline recognition of unconstrained handwritten Chinese texts. The domain of the document to be recognized is variable and usually unknown a priori, so we use a two-pass recognition strategy with a pre-defined multi-domain language model set. We propose three methods to dynamically generate an adaptive language model to match the text output by first-pass recognition: model selection, model combination and model reconstruction. In model selection, we use the language model with minimum perplexity on the first-pass recognized text. In model combination, we learn the combination weights by minimizing the sum of squared errors with both L2-norm and L1-norm regularization. For model reconstruction, we use a group of orthogonal bases to reconstruct a language model, with the coefficients learned to match the document to be recognized. Moreover, we reduce the storage size of multiple language models using two compression methods, split vector quantization (SVQ) and principal component analysis (PCA). Comprehensive experiments on two public Chinese handwriting databases, CASIA-HWDB and HIT-MW, show that the proposed unsupervised LMA approach substantially improves recognition performance, particularly for ancient domain documents, where the recognition accuracy improves by 7 percent. Meanwhile, the combination of the two compression methods largely reduces the storage size of language models with little loss of recognition accuracy.
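
The model selection step described in this abstract can be illustrated with a small sketch. It assumes unigram models stored as dictionaries and a fixed floor probability for unseen tokens, both simplifications chosen for brevity; the abstract does not state the model order or smoothing, and the names `perplexity` and `select_model` are hypothetical.

```python
import math

def perplexity(model, tokens, floor=1e-6):
    """Perplexity of a unigram model on a non-empty token sequence.

    model : dict mapping token -> probability
    floor : probability assigned to tokens missing from the model (assumption)
    """
    log_sum = sum(math.log(model.get(token, floor)) for token in tokens)
    return math.exp(-log_sum / len(tokens))

def select_model(models, first_pass_tokens):
    """Return the name of the domain model with minimum perplexity."""
    return min(models, key=lambda name: perplexity(models[name], first_pass_tokens))
```

In the two-pass setting the tokens would come from the first-pass recognition output, and the selected model would then be used for the second pass.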

19.
20.
This paper addresses the problem of recognising speech in the presence of a competing speaker. We review a speech fragment decoding technique that treats segregation and recognition as coupled problems. Data-driven techniques are used to segment a spectro-temporal representation into a set of fragments, such that each fragment is dominated by one or other of the speech sources. A speech fragment decoder is used which employs missing data techniques and clean speech models to simultaneously search for the set of fragments and the word sequence that best matches the target speaker model. The paper investigates the performance of the system on a recognition task employing artificially mixed target and masker speech utterances. The fragment decoder produces significantly lower error rates than a conventional recogniser, and mimics the pattern of human performance that is produced by the interplay between energetic and informational masking. However, at around 0 dB the performance is generally quite poor. An analysis of the errors shows that a large number of target/masker confusions are being made. The paper presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of SNRs. This component is combined with the recognition system to produce significant improvements. When the target and masker utterances have the same gender, the recognition system has a performance at 0 dB equal to that of humans; in other conditions the error rate is roughly twice the human error rate.

