期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Arabic phonemes recognition using hybrid LVQ/HMM model for continuous speech recognition

Khalid M. O. Nahar Mohammed Abu Shquier Wasfi G. Al-Khatib Husni Al-Muhtaseb Moustafa Elshafei 《International Journal of Speech Technology》2016,19(3):495-508

相似文献

2.

Constructing accurate and robust HMM/GMM models for an Arabic speech recognition system

Mohamed O. M. Khelifa Yahya Mohamed Elhadj Yousfi Abdellah Mostafa Belkasmi 《International Journal of Speech Technology》2017,20(4):937-949

Conventional Hidden Markov Model (HMM) based Automatic Speech Recognition (ASR) systems generally utilize cepstral features as acoustic observation and phonemes as basic linguistic units. Some of the most powerful features currently used in ASR systems are Mel-Frequency Cepstral Coefficients (MFCCs). Speech recognition is inherently complicated due to the variability in the speech signal which includes within- and across-speaker variability. This leads to several kinds of mismatch between acoustic features and acoustic models and hence degrades the system performance. The sensitivity of MFCCs to speech signal variability motivates many researchers to investigate the use of a new set of speech feature parameters in order to make the acoustic models more robust to this variability and thus improve the system performance. The combination of diverse acoustic feature sets has great potential to enhance the performance of ASR systems. This paper is a part of ongoing research efforts aspiring to build an accurate Arabic ASR system for teaching and learning purposes. It addresses the integration of complementary features into standard HMMs for the purpose to make them more robust and thus improve their recognition accuracies. The complementary features which have been investigated in this work are voiced formants and Pitch in combination with conventional MFCC features. A series of experimentations under various combination strategies were performed to determine which of these integrated features can significantly improve systems performance. The Cambridge HTK tools were used as a development environment of the system and experimental results showed that the error rate was successfully decreased, the achieved results seem very promising, even without using language models. 相似文献

3.

HMM/SVM segmentation and labelling of Arabic speech for speech recognition applications

Hamza Frihia Halima Bahi 《International Journal of Speech Technology》2017,20(3):563-573

Building a large vocabulary continuous speech recognition (LVCSR) system requires a lot of hours of segmented and labelled speech data. Arabic language, as many other low-resourced languages, lacks such data, but the use of automatic segmentation proved to be a good alternative to make these resources available. In this paper, we suggest the combination of hidden Markov models (HMMs) and support vector machines (SVMs) to segment and to label the speech waveform into phoneme units. HMMs generate the sequence of phonemes and their frontiers; the SVM refines the frontiers and corrects the labels. The obtained segmented and labelled units may serve as a training set for speech recognition applications. The HMM/SVM segmentation algorithm is assessed using both the hit rate and the word error rate (WER); the resulting scores were compared to those provided by the manual segmentation and to those provided by the well-known embedded learning algorithm. The results show that the speech recognizer built upon the HMM/SVM segmentation outperforms in terms of WER the one built upon the embedded learning segmentation of about 0.05%, even in noisy background. 相似文献

4.

基于HMM和GMM的维吾尔语联机手写体识别研究

许辉热依曼.吐尔逊吾守尔.斯拉木《计算机工程与应用》2014,(11):202-205,222

给出了一个基于HMM和GMM双引擎识别模型的维吾尔语联机手写体整词识别系统。在GMM部分,系统提取了8-方向特征,生成8-方向特征样式图像、定位空间采样点以及提取模糊的方向特征。在对模型精细化迭代训练之后,得到GMM模型文件。HMM部分,系统采用了笔段特征的方法来获取笔段分段点特征序列,在对模型进行精细化迭代训练后,得到HMM模型文件。将GMM模型文件和HMM模型文件分别打包封装再进行联合封装成字典。在第一期的实验中,系统的识别率达到97%,第二期的实验中,系统的识别率高达99%。相似文献

5.

Audio-visual speech modeling for continuous speech recognition 总被引：3，自引：0，他引：3

Dupont S. Luettin J. 《Multimedia, IEEE Transactions on》2000,2(3):141-151

This paper describes a speech recognition system that uses both acoustic and visual speech information to improve recognition performance in noisy environments. The system consists of three components: a visual module; an acoustic module; and a sensor fusion module. The visual module locates and tracks the lip movements of a given speaker and extracts relevant speech features. This task is performed with an appearance-based lip model that is learned from example images. Visual speech features are represented by contour information of the lips and grey-level information of the mouth area. The acoustic module extracts noise-robust features from the audio signal. Finally the sensor fusion module is responsible for the joint temporal modeling of the acoustic and visual feature streams and is realized using multistream hidden Markov models (HMMs). The multistream method allows the definition of different temporal topologies and levels of stream integration and hence enables the modeling of temporal dependencies more accurately than traditional approaches. We present two different methods to learn the asynchrony between the two modalities and how to incorporate them in the multistream models. The superior performance for the proposed system is demonstrated on a large multispeaker database of continuously spoken digits. On a recognition task at 15 dB acoustic signal-to-noise ratio (SNR), acoustic perceptual linear prediction (PLP) features lead to 56% error rate, noise robust RASTA-PLP (relative spectra) acoustic features to 7.2% error rate and combined noise robust acoustic features and visual features to 2.5% error rate 相似文献

6.

Hybrid continuous speech recognition systems by HMM,MLP and SVM: a comparative study

Elyes Zarrouk Yassine Ben Ayed Faiez Gargouri 《International Journal of Speech Technology》2014,17(3):223-233

This paper presents a new hybrid method for continuous Arabic speech recognition based on triphones modelling. To do this, we apply Support Vectors Machine (SVM) as an estimator of posterior probabilities within the Hidden Markov Models (HMM) standards. In this work, we describe a new approach of categorising Arabic vowels to long and short vowels to be applied on the labeling phase of speech signals. Using this new labeling method, we deduce that SVM/HMM hybrid model is more efficient then HMMs standards and the hybrid system Multi-Layer Perceptron (MLP) with HMM. The obtained results for the Arabic speech recognition system based on triphones are 64.68 % with HMMs, 72.39 % with MLP/HMM and 74.01 % for SVM/HMM hybrid model. The WER obtained for the recognition of continuous speech by the three systems proves the performance of SVM/HMM by obtaining the lowest average for 4 tested speakers 11.42 %. 相似文献

7.

A fast and memory-efficient N-gram language model lookup method for large vocabulary continuous speech recognition

《Computer Speech and Language》2007,21(1):1-25

Recently, minimum perfect hashing (MPH)-based language model (LM) lookup methods have been proposed for fast access of N-gram LM scores in lexical-tree based LVCSR (large vocabulary continuous speech recognition) decoding. Methods of node-based LM cache and LM context pre-computing (LMCP) have also been proposed to combine with MPH for further reduction of LM lookup time. Although these methods are effective, LM lookup still takes a large share of overall decoding time when trigram LM lookahead (LMLA) is used for lower word error rate than unigram or bigram LMLAs. Besides computation time, memory cost is also an important performance aspect of decoding systems. Most speedup methods for LM lookup obtain higher speed at the cost of increased memory demand, which makes system performance unpredictable when running on computers with smaller memory capacities. In this paper, an order-preserving LM context pre-computing (OPCP) method is proposed to achieve both fast speed and small memory cost in LM lookup. By reducing hashing operations through order-preserving access of LM scores, OPCP cuts down LM lookup time effectively. In the meantime, OPCP significantly reduces memory cost because of reduced size of hashing keys and the need for only last word index of each N-gram in LM storage. Experimental results are reported on two LVCSR tasks (Wall Street Journal 20K and Switchboard 33K) with three sizes of trigram LMs (small, medium, large). In comparison with above-mentioned existing methods, OPCP reduced LM lookup time from about 30–80% of total decoding time to about 8–14%, without any increase of word error rate. Except for the small LM, the total memory cost of OPCP for LM lookup and storage was about the same or less than the original N-gram LM storage, much less than the compared methods. The time and memory savings in LM lookup by using OPCP became more pronounced with the increase of LM size. 相似文献

8.

A cache-based natural language model for speech recognition 总被引：4，自引：0，他引：4

Kuhn R. De Mori R. 《IEEE transactions on pattern analysis and machine intelligence》1990,12(6):570-583

Speech-recognition systems must often decide between competing ways of breaking up the acoustic input into strings of words. Since the possible strings may be acoustically similar, a language model is required; given a word string, the model returns its linguistic probability. Several Markov language models are discussed. A novel kind of language model which reflects short-term patterns of word use by means of a cache component (analogous to cache memory in hardware terminology) is presented. The model also contains a 3g-gram component of the traditional type. The combined model and a pure 3g-gram model were tested on samples drawn from the Lancaster-Oslo/Bergen (LOB) corpus of English text. The relative performance of the two models is examined, and suggestions for the future improvements are made 相似文献

9.

A maximum likelihood approach to continuous speech recognition 总被引：8，自引：0，他引：8

Bahl LR Jelinek F Mercer RL 《IEEE transactions on pattern analysis and machine intelligence》1983,(2):179-190

Speech recognition is formulated as a problem of maximum likelihood decoding. This formulation requires statistical models of the speech production process. In this paper, we describe a number of statistical models for use in speech recognition. We give special attention to determining the parameters for such models from sparse data. We also describe two decoding methods, one appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks. To illustrate the usefulness of the methods described, we review a number of decoding results that have been obtained with them. 相似文献

10.

Keyword detection in conversational speech utterances using hidden Markov model based continuous speech recognition

Rose R. C. 《Computer Speech and Language》1995,9(4)

This paper describes a set of modeling techniques for detecting a small vocabulary of keywords in running conversational speech. The techniques are applied in the context of a hidden Markov model (HMM) based continuous speech recognition (CSR) approach to keyword spotting. The word spotting task is derived from the Switchboard conversational speech corpus, and involves unconstrained conversational speech utterances spoken over the public switched telephone network. The utterances in this task contain many of the artifacts that are characteristic of unconstrained speech as it appears in many telecommunications based automatic speech recognition (ASR) applications. Results are presented for an experimental study that was performed on this task. Performance was measured by computing the percentage correct keyword detection over a range of false alarm rates evaluated over 2·2 h of speech for a 20 keyword vocabulary. The results of the study demonstrate the importance of several techniques. These techniques include the use of decision tree based allophone clustering for defining acoustic subword units, different representations for non-vocabulary words appearing in the input utterance, and the definition of simple language models for keyword detection. Decision tree based allophone clustering resulted in a significant increase in keyword detection performance over that obtained using tri-phone based subword units while at the same time reducing the size of the inventory of subword acoustic models by 40%. More complex representations of non-vocabulary speech were also found to significantly improve keyword detection performance; however, these representations also resulted in a significant increase in computational complexity. 相似文献

11.

Parallel algorithms for syllable recognition in continuous speech

De Mori R Laface P Mong Y 《IEEE transactions on pattern analysis and machine intelligence》1985,(1):56-69

A distributed rule-based system for automatic speech recognition is described. Acoustic property extraction and feature hypothesization are performed by the application of sequences of operators. These sequences, called plans, are executed by cooperative expert programs. Experimental results on the automatic segmentation and recognition of phrases, made of connected letters and digits, are described and discussed. 相似文献

12.

基于HMM和新型前馈型神经网络的语音识别研究

冯宏伟薛蕾《计算机工程与设计》2010,31(24)

为了进一步提高语音识别系统的准确率,使语音产品应用更加方便,提出了一种隐马尔可夫模型和代数神经网络相结合的语音识别方法.利用隐马尔可夫模型生成最佳语音状态序列,将最佳状态序列的输出概率作为前馈型神经网络的输入,通过代数神经网络进行分类识别.使用Matlab7.0实验平台进行仿真,实验结果表明,与传统神经网络相比,该方法在收敛速度、鲁棒性和识别率方面都有改善. 相似文献

13.

On noise masking for automatic missing data speech recognition: A survey and discussion

《Computer Speech and Language》2007,21(3):443-457

Automatic speech recognition (ASR) has reached very high levels of performance in controlled situations. However, the performance degrades significantly when environmental noise occurs during the recognition process. Nowadays, the major challenge is to reach a good robustness to adverse conditions, so that automatic speech recognizers can be used in real situations. Missing data theory is a very attractive and promising approach. Unlike other denoising methods, missing data recognition does not match the whole data with the acoustic models, but instead considers part of the signal as missing, i.e. corrupted by noise. While speech recognition with missing data can be handled efficiently by methods such as data imputation or marginalization, accurately identifying missing parts (also called masks) remains a very challenging task. This paper reviews the main approaches that have been proposed to address this problem. The objective of this study is to identify the mask estimation methods that have been proposed so far, and to open this domain up to other related research, which could be adapted to overcome this difficult challenge. In order to restrict the range of methods, only the techniques using a single microphone are considered. 相似文献

14.

A syntactic procedure for the recognition of glottal pulses in continuous speech

R. De Mori P. Laface V.A. Makhonine M. Mezzalama 《Pattern recognition》1977,9(4):181-189

Speech prosody contains an important structural information for performing speech analysis and for extracting syntactic nuclei from spoken sentences. This paper describes a procedure based on a multichannel system of epoch filters for recognizing the pulses of glottal chord vibrations by an analysis of the speech waveform. Recognition is performed by a stochastic finite state automaton automatically inferred after experiments. 相似文献

15.

A flexible framework for HMM based noise robust speech recognition using generalized parametric space polynomial regression

CHENG Ning LIU XunYing WANG Lan 《中国科学:信息科学(英文版)》2011,(12):2481-2491

相似文献

16.

Study of feature combination using HMM and SVM for multilingual Odiya speech emotion recognition

Monorama Swain Subhasmita Sahoo Aurobinda Routray P. Kabisatpathy Jogendra N. Kundu 《International Journal of Speech Technology》2015,18(3):387-393

相似文献

17.

Discriminative semi-parametric trajectory model for speech recognition

K.C. Sim M.J.F. Gales 《Computer Speech and Language》2007,21(4):669-687

Hidden Markov models (HMMs) are the most commonly used acoustic model for speech recognition. In HMMs, the probability of successive observations is assumed independent given the state sequence. This is known as the conditional independence assumption. Consequently, the temporal (inter-frame) correlations are poorly modelled. This limitation may be reduced by incorporating some form of trajectory modelling. In this paper, a general perspective on trajectory modelling is provided, where time-varying model parameters are used for the Gaussian components. A discriminative semi-parametric trajectory model is then described where the Gaussian mean vector and covariance matrix parameters vary with time. The time variation is modelled as a semi-parametric function of the observation sequence via a set of centroids in the acoustic space. The model parameters are estimated discriminatively using the minimum phone error (MPE) criterion. The performance of these models is investigated and benchmarked against a state-of-the-art CUHTK Mandarin evaluation systems. 相似文献

18.

Topic tracking language model for speech recognition

Shinji Watanabe Tomoharu Iwata Takaaki Hori Atsushi Sako Yasuo Ariki 《Computer Speech and Language》2011,25(2):440-461

In a real environment, acoustic and language features often vary depending on the speakers, speaking styles and topic changes. To accommodate these changes, speech recognition approaches that include the incremental tracking of changing environments have attracted attention. This paper proposes a topic tracking language model that can adaptively track changes in topics based on current text information and previously estimated topic models in an on-line manner. The proposed model is applied to language model adaptation in speech recognition. We use the MIT OpenCourseWare corpus and Corpus of Spontaneous Japanese in speech recognition experiments, and show the effectiveness of the proposed method. 相似文献

19.

Gesture sequence recognition with one shot learned CRF/HMM hybrid model

《Image and vision computing》2017

相似文献

20.

Mask estimation for missing data speech recognition based on statistics of binaural interaction

Harding S. Barker J. Brown G.J. 《IEEE transactions on audio, speech, and language processing》2006,14(1):58-67

This paper describes a perceptually motivated computational auditory scene analysis (CASA) system that combines sound separation according to spatial location with the "missing data" approach for robust speech recognition in noise. Missing data time-frequency masks are created using probability distributions based on estimates of interaural time and level differences (ITD and ILD) for mixed utterances in reverberated conditions; these masks indicate which regions of the spectrum constitute reliable evidence of the target speech signal. A number of experiments compare the relative efficacy of the binaural cues when used individually and in combination. We also investigate the ability of the system to generalize to acoustic conditions not encountered during training. Performance on a continuous digit recognition task using this method is found to be good, even in a particularly challenging environment with three concurrent male talkers. 相似文献