20 similar documents found (search time: 0 ms)
1.
2.
3.
Huang Zhengwei, Xue Wentao, Mao Qirong, Zhan Yongzhao 《Multimedia Tools and Applications》2017,76(5):6785-6799
Research in emotion recognition seeks to develop insights into the variances of features of emotion in one common domain. However, automatic emotion recognition...
4.
Rohit Prasad, Prem Natarajan, David Stallard, Shirin Saleem, Shankar Ananthakrishnan, Stavros Tsakalidis, Chia-lin Kao, Fred Choi, Ralf Meermeier, Mark Rawls, Jacob Devlin, Kriste Krstovski, Aaron Challenner 《Computer Speech and Language》2013,27(2):475-491
In this paper we present a speech-to-speech (S2S) translation system called BBN TransTalk that enables two-way communication between speakers of English and speakers who do not understand or speak English. BBN TransTalk has been configured for several languages, including Iraqi Arabic, Pashto, Dari, Farsi, Malay, Indonesian, and Levantine Arabic. We describe the key components of our system: automatic speech recognition (ASR), machine translation (MT), text-to-speech (TTS), the dialog manager, and the user interface (UI). In addition, we present novel techniques for overcoming specific challenges in developing high-performing S2S systems. For ASR, we present techniques for dealing with the lack of pronunciation and linguistic resources and for effectively modeling ambiguity in the pronunciations of words in these languages. For MT, we describe techniques for dealing with data sparsity as well as with modeling context. We also present and compare different user confirmation techniques for detecting errors that can cause the dialog to drift or stall.
5.
Pavel Paramonov, Nadezhda Sutula 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2016,20(9):3455-3460
Most contemporary speech recognition systems exploit complex algorithms based on hidden Markov models (HMMs) to achieve high accuracy. However, in some cases rich computational resources are not available, and even isolated-word recognition becomes a challenging task. In this paper, we present two ways to simplify scoring in HMM-based speech recognition in order to reduce its computational complexity. We focus on the core HMM procedure, the forward algorithm, which applies dynamic programming to find the probability that a given HMM generates an observation sequence. All proposed approaches were tested on Russian word recognition, and the results were compared with those of the conventional forward algorithm.
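The forward algorithm that the authors simplify can be sketched in a few lines. Below is the standard log-domain recursion for a discrete-emission HMM; the function names and the toy model in the test are illustrative, not taken from the paper, and the authors' simplified variants would replace the full log-sum-exp with cheaper approximations.

```python
import math

def _log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_prob(log_pi, log_A, log_B, obs):
    """Forward algorithm in the log domain: log P(obs | HMM).

    log_pi[i]   : log initial probability of state i
    log_A[i][j] : log transition probability from state i to state j
    log_B[i][o] : log emission probability of symbol o in state i
    obs         : observation sequence as a list of symbol indices
    """
    n = len(log_pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [log_pi[i] + log_B[i][obs[0]] for i in range(n)]
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
    for o in obs[1:]:
        alpha = [
            _log_sum_exp([alpha[i] + log_A[i][j] for i in range(n)])
            + log_B[j][o]
            for j in range(n)
        ]
    # Termination: P(obs) = sum_i alpha_T(i)
    return _log_sum_exp(alpha)
```

The recursion costs O(T·N²) per scoring call, which is exactly the expense the paper's simplifications target for low-resource devices.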
6.
Bowen Zhou, Xiaodong Cui, Songfang Huang, Martin Cmejrek, Wei Zhang, Jian Xue, Jia Cui, Bing Xiang, Gregg Daggett, Upendra Chaudhari, Sameer Maskey, Etienne Marcheret 《Computer Speech and Language》2013,27(2):592-618
This paper describes our recent improvements to IBM TRANSTAC speech-to-speech translation systems that address issues arising in resource-constrained tasks, including both limited amounts of linguistic resources and training data and limited computational power on mobile platforms such as smartphones. We show how the proposed algorithms and methodologies improve the performance of automatic speech recognition, statistical machine translation, and text-to-speech synthesis while achieving low-latency two-way speech-to-speech translation on mobile devices.
7.
8.
A novel approach to joint speaker identification and speech recognition is presented in this article. Unsupervised speaker tracking and automatic adaptation of the human-computer interface are achieved through the interaction of speaker identification, speech recognition and speaker adaptation for a limited number of recurring users. Together with a technique for efficient information retrieval, a compact model of speech and speaker characteristics is presented. Applying speaker-specific profiles allows speech recognition to take individual speech characteristics into account and thus achieve higher recognition rates. Speaker profiles are initialized and continuously adapted by a balanced strategy of short-term and long-term speaker adaptation combined with robust speaker identification. Different users can be tracked by the resulting self-learning speech-controlled system, and only a very short enrollment of each speaker is required. Subsequent utterances are used for unsupervised adaptation, resulting in continuously improving speech recognition rates. Additionally, the detection of unknown speakers is examined with the objective of avoiding the need to explicitly train new speaker profiles. The speech-controlled system presented here is suitable for in-car applications on embedded devices, e.g. speech-controlled navigation, hands-free telephony or infotainment systems. Results are presented for a subset of the SPEECON database and validate the benefit of the speaker adaptation scheme and the unified modeling in terms of both speaker identification and speech recognition rates.
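The abstract does not specify how the short-term/long-term adaptation is balanced. As a generic illustration of how a speaker profile can be pulled toward newly observed utterances while retaining the speaker-independent prior, here is a standard MAP-style mean update; the function name and the relevance factor `tau` are assumptions for illustration, not the authors' scheme.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP-style adaptation of a Gaussian mean toward speaker data.

    prior_mean : speaker-independent mean vector (list of floats)
    frames     : feature vectors observed from this speaker
    tau        : relevance factor; larger values trust the prior longer
    """
    n = len(frames)
    if n == 0:
        return list(prior_mean)
    dim = len(prior_mean)
    # Sufficient statistics: per-dimension mean of the speaker's frames.
    data_mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Interpolation weight grows toward 1 as evidence accumulates,
    # which mirrors the short-term/long-term adaptation idea.
    w = n / (n + tau)
    return [(1 - w) * prior_mean[d] + w * data_mean[d] for d in range(dim)]
```

With few frames the profile stays close to the prior (robust after a short enrollment); with many unsupervised utterances it converges to the speaker's own statistics.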
9.
Emil Ettelaie, Panayiotis G. Georgiou, Shrikanth S. Narayanan 《Computer Speech and Language》2013,27(2):438-454
Concept classification has been used as a translation method and has shown notable benefits within the suite of speech-to-speech translation applications. However, the main bottleneck in achieving acceptable performance with such classifiers is the cumbersome task of annotating large amounts of training data. Any attempt to assist in, or completely automate, data annotation needs a distance measure that compares sentences by the concept they convey. Here, we introduce a new method of sentence comparison motivated from the translation point of view: the imperfect translations produced by a phrase-based statistical machine translation system are used to compare the concepts of the source sentences. Three clustering methods are adapted to support the concept-based distance. These methods are applied to prepare groups of paraphrases and to use them as training sets in concept classification tasks. Statistical machine translation is also used to enhance the training data for the classifier, which is crucial when such data are sparse. Experiments show the effectiveness of the proposed methods.
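The core idea of comparing source sentences through their machine translations can be sketched as follows. The Jaccard overlap of translated token sets used here is a stand-in for illustration; the paper's actual measure over SMT outputs may differ, and `translate` is any sentence-to-sentence callable.

```python
def concept_distance(src_a, src_b, translate):
    """Distance between two source sentences, computed not on their
    surface forms but on their (imperfect) machine translations.

    translate : callable mapping a source sentence to its translation
    Returns 0.0 for sentences whose translations fully overlap and
    1.0 for translations sharing no tokens.
    """
    ta = set(translate(src_a).lower().split())
    tb = set(translate(src_b).lower().split())
    if not ta and not tb:
        return 0.0
    # 1 - Jaccard similarity of the translated token sets.
    return 1.0 - len(ta & tb) / len(ta | tb)
```

The appeal for annotation is that two paraphrases in the source language, however different on the surface, tend to be translated into similar target-language wordings, so they land close under this distance and cluster together.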
10.
11.
Nora Barroso, Karmele López de Ipiña, Odei Barroso, Aitzol Ezeiza, Carmen Hernández, Manuel Graña 《International Journal of Speech Technology》2012,15(1):33-40
This work, divided into Parts I and II, describes the development of GorUP, a semantic speech recognition system for the Basque context. Part I analyses cross-lingual approaches oriented to under-resourced languages, and Part II the development of the language identification system. During development, data optimization methods and soft computing methodologies oriented to complex environments are used to overcome the lack of resources. Moreover, three languages coexist in this context: French, Spanish and Basque. Our main goal is the development of robust automatic speech recognition (ASR) systems for Basque, but all of this language variability has to be analyzed: during speech, Basque speakers mix not only sounds but also words of the three languages, which results in a strong presence of cross-lingual elements. Besides, Basque is an agglutinative language with a special morpho-syntactic structure inside words that may lead to intractable vocabularies. Our work is currently oriented to information retrieval, mainly for small internet mass media, where the available resources for Basque in general, and for this task in particular, are very scarce and complex to process because of the noisy environment. Thus, the methods employed in this development (an ontology-based approach and cross-lingual methodologies that profit from better-resourced languages) could suit the requirements of many under-resourced languages.
12.
Banriskhem K. Khonglah, Ramesh K. Bhukya, S. R. Mahadeva Prasanna 《International Journal of Speech Technology》2017,20(4):839-850
This work explores the use of speech enhancement for degraded speech in a text-dependent speaker verification system, where the degradation may be due to noise or background speech. The text-dependent speaker verification is based on dynamic time warping (DTW), which requires end point detection. End point detection is straightforward when the speech is clean; however, degradation tends to introduce errors in the estimated end points, and these errors propagate into the overall accuracy of the speaker verification system. Temporal and spectral enhancement is performed on the degraded speech so that, ideally, the enhanced speech resembles clean speech. Results show that the temporal and spectral processing methods remove much of the degradation, and improved accuracy is obtained for the DTW-based text-dependent speaker verification system.
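The DTW matching underlying this verification scheme is the classic dynamic-programming recurrence. A minimal sketch follows; the absolute-difference local cost is a placeholder for real frame-level distances (e.g. between MFCC vectors), and the acceptance threshold in the usage note is an assumption, not the paper's.

```python
import math

def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two sequences.

    Fills the cumulative-cost matrix D where D[i][j] is the cheapest
    alignment of a[:i] with b[:j], allowing insertion, deletion and
    match steps, and returns the cost of the full alignment.
    """
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Best of: skip a frame of a, skip a frame of b, or match.
            D[i][j] = cost + min(D[i - 1][j],    # insertion
                                 D[i][j - 1],    # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

A verification decision would then compare `dtw_distance(test_utterance, enrolled_template)` against a tuned threshold; accurate end points matter precisely because spurious leading or trailing frames inflate this distance.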
13.
LVCSR systems are usually based on continuous-density HMMs, typically implemented with Gaussian mixture distributions. Such statistical modeling systems tend to operate slower than real time, largely because of the heavy computational overhead of likelihood evaluation. The objective of our research is to investigate approximate methods that can substantially reduce the computational cost of likelihood evaluation without noticeably degrading recognition accuracy. In this paper, the most common techniques for speeding up the likelihood computation are classified into three categories: machine optimization, model optimization, and algorithm optimization. Each category is surveyed and summarized by describing and analyzing the basic ideas of the corresponding techniques. The distribution of the numerical values of the Gaussian mixtures within a GMM is evaluated and analyzed to show that the computation of some Gaussians is unnecessary and can be eliminated. Two commonly used techniques for likelihood approximation, VQ-based Gaussian selection and partial distance elimination, are analyzed in detail. Based on these analyses, a fast likelihood computation approach called dynamic Gaussian selection (DGS) is proposed. DGS is a one-pass search technique that generates a dynamic shortlist of Gaussians for each state during likelihood computation. In principle, DGS extends both partial distance elimination and best mixture prediction, and it does not require additional memory for the storage of Gaussian shortlists. The DGS algorithm has been implemented by modifying the likelihood computation procedure in the HTK 3.4 system. Experimental results on the TIMIT and WSJ0 corpora indicate that this approach speeds up the likelihood computation significantly without introducing appreciable additional recognition error.
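Partial distance elimination can be illustrated with the common max-approximation to GMM scoring: accumulate each Gaussian's weighted squared distance dimension by dimension, and abandon the Gaussian as soon as its partial distance already rules out beating the current best. This is a minimal sketch; the variable names and the derivation of the pruning bound are ours, not HTK's, and a real system would fold in mixture weights.

```python
import math

def best_gaussian_score(x, means, inv_vars, log_consts):
    """Max-approximated diagonal-GMM log-likelihood with partial
    distance elimination (PDE).

    Exact per-component score: log_const - 0.5 * sum_d iv[d]*(x-mu)^2.
    A component can only beat `best` while its accumulated distance
    stays below 2*(log_const - best), so we prune mid-sum.
    """
    best = -math.inf
    for m in range(len(means)):
        bound = 2.0 * (log_consts[m] - best)  # prune once dist exceeds this
        dist = 0.0
        alive = True
        for d in range(len(x)):
            diff = x[d] - means[m][d]
            dist += inv_vars[m][d] * diff * diff
            if dist > bound:
                alive = False  # partial distance already too large
                break
        if alive:
            score = log_consts[m] - 0.5 * dist
            if score > best:
                best = score
    return best
```

The saving comes from the early `break`: distant Gaussians are rejected after a few dimensions instead of all 39, which is exactly the kind of redundant computation the paper's value-distribution analysis exposes.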
14.
15.
Mustafa, K., Bruce, I.C. 《IEEE Transactions on Audio, Speech, and Language Processing》2006,14(2):435-444
Several algorithms have been developed for tracking the formant frequency trajectories of speech signals; however, most of them are either not robust in real-life noise environments or not suitable for real-time implementation. The algorithm presented in this paper obtains formant frequency estimates from voiced segments of continuous speech by using a time-varying adaptive filterbank to track individual formant frequencies. The formant tracker incorporates an adaptive voicing detector and a gender detector for formant extraction from continuous speech for both male and female speakers. The algorithm has a low signal delay and provides smooth and accurate estimates of the first four formant frequencies at moderate and high signal-to-noise ratios. Thorough testing has shown that the algorithm is robust over a wide range of signal-to-noise ratios for various types of background noise.
16.
In this paper, an in-depth analysis is undertaken into effective strategies for integrating the audio-visual speech modalities with respect to two major questions: firstly, at what level should integration occur, and secondly, given a level of integration, how should it be implemented? Our work is based on the well-known hidden Markov model (HMM) classifier framework for modeling speech. A novel framework for modeling the mismatch between train and test observation sets is proposed, so as to provide effective classifier combination between the acoustic and visual HMM classifiers. From this framework it can be shown that strategies for combining independent classifiers, such as the weighted product or sum rules, naturally emerge depending on the influence of the mismatch. Based on the assumption that poor performance in most audio-visual speech processing applications can be attributed to train/test mismatches, we propose that the main impetus of practical audio-visual integration is to dampen the independent errors resulting from the mismatch, rather than to model any bimodal speech dependencies. To this end a strategy is recommended, based on theory and empirical evidence, that uses a hybrid of the weighted product and weighted sum rules under varying acoustic noise for the task of text-dependent speaker recognition.
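The weighted product and weighted sum rules mentioned above can be written down directly: in the log domain the product rule is simply a weighted sum of the per-stream log-likelihoods, while the sum rule mixes the linear-domain likelihoods. The class labels, score layout and default stream weight below are illustrative, not from the paper.

```python
import math

def fuse_and_classify(audio_ll, video_ll, alpha=0.7, rule="product"):
    """Combine per-class audio and visual log-likelihoods and return
    the winning class.

    rule="product": weighted product rule, i.e. in the log domain
                    alpha * ll_audio + (1 - alpha) * ll_video
    rule="sum"    : weighted sum rule over linear likelihoods:
                    log(alpha * exp(ll_audio) + (1-alpha) * exp(ll_video))
    alpha         : audio stream weight; lowered in acoustic noise so
                    the (unaffected) visual stream dominates.
    """
    fused = {}
    for c in audio_ll:
        if rule == "product":
            fused[c] = alpha * audio_ll[c] + (1 - alpha) * video_ll[c]
        else:
            fused[c] = math.log(alpha * math.exp(audio_ll[c])
                                + (1 - alpha) * math.exp(video_ll[c]))
    return max(fused, key=fused.get)
```

Sweeping `alpha` with noise level is what makes a hybrid of the two rules practical: the weight damps whichever stream the train/test mismatch has made unreliable.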
17.
Combining pulse-based features for rejecting far-field speech in an HMM-based Voice Activity Detector
Óscar Varela, Rubén San-Segundo, Luís A. Hernández 《Computers & Electrical Engineering》2011,37(4):589-600
Several computational techniques for speech recognition have been proposed, bringing important improvements to real-time applications in which a speaker interacts with a speech recognition system. Although many methods have been proposed, none of them solves the high false-alarm problem that arises when far-field speakers interfere in a human-machine conversation. This paper presents a two-class (speech and non-speech) decision-tree approach that combines new speech-pulse features in a VAD (voice activity detector) for rejecting far-field speech in speech recognition systems. The decision tree is applied to the speech pulses obtained by a baseline VAD composed of a frame feature extractor, an HMM-based (hidden Markov model) segmentation module and a pulse detector. The paper also presents a detailed analysis of a large number of features for discriminating between close- and far-field speech. The detection error obtained with the proposed VAD is the lowest compared with other well-known VADs.
18.
This paper presents an effective approach to unsupervised language model adaptation (LMA) using multiple models in offline recognition of unconstrained handwritten Chinese texts. The domain of the document to recognize is variable and usually unknown a priori, so we use a two-pass recognition strategy with a pre-defined multi-domain language model set. We propose three methods to dynamically generate an adaptive language model that matches the text output of first-pass recognition: model selection, model combination and model reconstruction. In model selection, we use the language model with minimum perplexity on the first-pass recognized text. In model combination, we learn the combination weights by minimizing the sum of squared error with both L2-norm and L1-norm regularization. In model reconstruction, we use a group of orthogonal bases to reconstruct a language model, with the coefficients learned to match the document to recognize. Moreover, we reduce the storage size of the multiple language models using two compression methods, split vector quantization (SVQ) and principal component analysis (PCA). Comprehensive experiments on two public Chinese handwriting databases, CASIA-HWDB and HIT-MW, show that the proposed unsupervised LMA approach improves recognition performance markedly, particularly for ancient-domain documents, whose recognition accuracy improves by 7 percent. Meanwhile, combining the two compression methods greatly reduces the storage size of the language models with little loss of recognition accuracy.
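The model-selection variant, picking the domain LM with minimum perplexity on the first-pass output, can be sketched with a toy unigram model. Real systems use n-gram models with proper smoothing; the floor-probability fallback for unseen tokens and the dictionary LM representation here are simplifications for illustration.

```python
import math

def perplexity(lm, tokens):
    """Perplexity of a unigram LM (dict: token -> probability) on a
    token list; unseen tokens fall back to a small floor probability
    (a crude stand-in for real smoothing)."""
    floor = 1e-6
    log_sum = sum(math.log(lm.get(t, floor)) for t in tokens)
    return math.exp(-log_sum / len(tokens))

def select_model(lms, first_pass_text):
    """Model selection: return the name of the domain LM with minimum
    perplexity on the first-pass recognized text."""
    tokens = first_pass_text.split()
    return min(lms, key=lambda name: perplexity(lms[name], tokens))
```

The selected LM would then rescore the candidate lattice in the second pass, which is where the adaptation gain for domain-specific (e.g. ancient) documents comes from.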
19.
20.
Speech fragment decoding techniques for simultaneous speaker identification and speech recognition (total citations: 1; self-citations: 1; other citations: 1)
This paper addresses the problem of recognising speech in the presence of a competing speaker. We review a speech fragment decoding technique that treats segregation and recognition as coupled problems. Data-driven techniques are used to segment a spectro-temporal representation into a set of fragments, such that each fragment is dominated by one or the other of the speech sources. A speech fragment decoder is used which employs missing-data techniques and clean speech models to simultaneously search for the set of fragments and the word sequence that best matches the target speaker model. The paper investigates the performance of the system on a recognition task employing artificially mixed target and masker speech utterances. The fragment decoder produces significantly lower error rates than a conventional recogniser and mimics the pattern of human performance produced by the interplay between energetic and informational masking. However, at around 0 dB the performance is generally quite poor, and an analysis of the errors shows that a large number of target/masker confusions are made. The paper therefore presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of SNRs. This component is combined with the recognition system to produce significant improvements. When the target and masker utterances have the same gender, the recognition system performs at 0 dB as well as humans do; in other conditions the error rate is roughly twice the human error rate.
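The missing-data scoring at the heart of fragment decoding marginalises out masker-dominated spectro-temporal regions. For a single diagonal Gaussian this reduces to scoring only the dimensions flagged reliable; the sketch below shows that reduction (real decoders additionally use bounded marginalisation, exploiting that masked energy upper-bounds the target's energy, which is omitted here).

```python
import math

def missing_data_log_lik(x, mask, mean, var):
    """Marginal log-likelihood of an observation under one diagonal
    Gaussian, using only the reliable (target-dominated) dimensions.

    x    : observed spectral feature vector
    mask : booleans, True where the target dominates the mixture
    mean, var : Gaussian parameters per dimension
    """
    ll = 0.0
    for d, reliable in enumerate(mask):
        if reliable:
            # Standard 1-D Gaussian log-density for reliable bins;
            # unreliable bins are integrated out (contribute 0 here).
            ll += -0.5 * (math.log(2 * math.pi * var[d])
                          + (x[d] - mean[d]) ** 2 / var[d])
    return ll
```

With per-fragment masks, the decoder can score each hypothesised fragment-to-source assignment against clean speech models and keep whichever assignment and word sequence jointly score best.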