Similar Documents
20 similar documents found.
1.
Optimal representation of acoustic features is an ongoing challenge in automatic speech recognition research. As an initial step toward this purpose, some approaches propose optimizing the filterbanks used for cepstral coefficients with evolutionary optimization methods. However, the large number of optimization parameters required by a filterbank makes it difficult to guarantee that a single optimized filterbank provides the best representation for phoneme classification. Moreover, in many cases a number of potential solutions are obtained, each discriminating between specific groups of phonemes; in other words, each filterbank has its own particular advantage. Therefore, aggregating the discriminative information provided by the filterbanks is a challenging task. In this study, a number of complementary filterbanks are optimized to provide different representations of the speech signal for phoneme classification using hidden Markov models (HMMs). Fuzzy information fusion is used to aggregate the decisions provided by the HMMs, since fuzzy theory can effectively handle the uncertainties of classifiers trained with different representations of the speech data. The outputs of the HMM classifiers of each expert are fused using a fuzzy decision fusion scheme that employs global and local confidence measurements to formulate the reliability of each classifier, based on both global and local context, when making overall decisions. Experiments were conducted on clean and noisy phonetic samples. The proposed method outperformed conventional Mel-frequency cepstral coefficients under both conditions in terms of overall phoneme classification accuracy, and the fuzzy fusion scheme was shown to be capable of aggregating the complementary information provided by each filterbank.
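The sketch below illustrates the general idea of confidence-weighted fusion of several HMM experts; the membership function, the way global and local confidences are combined, and the `fuse_decisions` helper are illustrative assumptions, not the authors' exact fuzzy formulation.

```python
import numpy as np

def fuse_decisions(loglik, global_conf, local_conf):
    """Fuse per-phoneme log-likelihoods from several HMM experts.

    loglik      : (n_experts, n_phonemes) log-likelihoods, one row per filterbank expert
    global_conf : (n_experts,) overall reliability of each expert (e.g. dev-set accuracy)
    local_conf  : (n_experts,) reliability of each expert for this utterance
                  (e.g. normalized margin between its best and second-best score)
    Returns the index of the winning phoneme class.
    """
    loglik = np.asarray(loglik, dtype=float)
    # Turn each expert's log-likelihoods into fuzzy memberships in [0, 1].
    post = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

    # Combine global and local reliability into one weight per expert.
    w = np.asarray(global_conf, dtype=float) * np.asarray(local_conf, dtype=float)
    w /= w.sum()

    # Weighted aggregation of the memberships (a simple stand-in for a fuzzy integral).
    fused = (w[:, None] * post).sum(axis=0)
    return int(np.argmax(fused))

# Toy usage: three filterbank-specific experts scoring four phoneme classes.
scores = [[-12.0, -15.1, -14.2, -16.0],
          [-13.5, -12.9, -15.0, -14.1],
          [-11.8, -14.0, -13.2, -15.5]]
print(fuse_decisions(scores, global_conf=[0.8, 0.6, 0.7], local_conf=[0.9, 0.5, 0.8]))
```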

2.
Keyword spotting refers to detecting all occurrences of a given keyword in input speech utterances. In this paper, we define a keyword spotter as a binary classifier that separates sentences containing a target keyword from sentences that do not. Discriminating between these classes requires an efficient classification method and a suitable feature set. For the classification method, we propose an evolutionary algorithm to train the separating hyperplane between the two classes. As our discriminative feature set, we propose two confidence measure functions: the first computes the possibility that each phoneme is present in the speech frames, and the second determines the duration of each phoneme. We define these functions based on acoustic, spectral, and statistical features of speech. Results on TIMIT indicate that the proposed evolutionary discriminative keyword spotter has lower computational complexity and higher speed in both the test and training phases than an SVM-based discriminative keyword spotter. Additionally, the proposed system is robust in noisy conditions.
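A minimal sketch of evolving a separating hyperplane with an evolutionary loop follows; the population size, mutation scale, fitness function and the `evolve_hyperplane` helper are assumptions for illustration, not the paper's specific algorithm or confidence features.

```python
import numpy as np

def evolve_hyperplane(X, y, pop_size=30, generations=200, sigma=0.5, seed=0):
    """Evolve a hyperplane w.x + b = 0 separating keyword (+1) from non-keyword (-1) sentences.

    X : (n_samples, n_features) confidence-measure feature vectors per sentence
    y : (n_samples,) labels in {+1, -1}
    A simple truncation-selection evolutionary loop; fitness = classification accuracy.
    """
    rng = np.random.default_rng(seed)
    dim = X.shape[1] + 1                      # weights plus bias
    pop = rng.normal(size=(pop_size, dim))

    def fitness(ind):
        w, b = ind[:-1], ind[-1]
        return np.mean(np.sign(X @ w + b) == y)

    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]     # keep the best half
        children = parents + rng.normal(scale=sigma, size=parents.shape)
        pop = np.vstack([parents, children])
    best = pop[np.argmax([fitness(ind) for ind in pop])]
    return best[:-1], best[-1]

# Toy usage with synthetic 2-D data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = evolve_hyperplane(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```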

3.
This paper presents a statistical analysis of formant frequencies, duration, and intensity for each phoneme in a manually annotated Uyghur continuous speech corpus at different speaking rates, together with an acoustic analysis of stops and affricates in consonant-vowel structures. The acoustic model for Uyghur phoneme recognition is improved by fusing Mel-frequency cepstral coefficients with acoustic features such as formant frequencies and by adjusting the number of model states, and the influence of different acoustic features on phoneme recognition is verified. Compared with the baseline system, the improved acoustic model achieves a certain increase in recognition rate. In addition, phonetic knowledge is used to analyze the causes of easily confused Uyghur phonemes, providing a reference for further improvement of the acoustic model for phoneme recognition.
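As a hedged illustration of the kind of formant measurement such an analysis relies on, the snippet below estimates formants for one voiced frame via LPC root-finding; the LPC order, thresholds, and the `formants_lpc` helper are assumptions, and the paper's own measurement procedure is not described in this abstract.

```python
import numpy as np
import librosa

def formants_lpc(frame, sr, order=12):
    """Rough formant estimates (Hz) for one voiced frame via LPC root-finding."""
    a = librosa.lpc(np.asarray(frame, dtype=float) * np.hamming(len(frame)), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                    # keep one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)
    bandwidths = -np.log(np.abs(roots)) * sr / np.pi
    # Keep plausible formant candidates (inside the band, reasonably narrow) and sort.
    cand = sorted(f for f, b in zip(freqs, bandwidths) if 90 < f < sr / 2 - 50 and b < 400)
    return cand[:4]

# Toy usage: a synthetic frame with energy near 500 / 1500 / 2500 Hz.
sr = 16000
t = np.arange(0, 0.03, 1 / sr)
frame = sum(np.sin(2 * np.pi * f * t) for f in (500, 1500, 2500))
frame = frame + 0.01 * np.random.default_rng(0).normal(size=t.size)
print(np.round(formants_lpc(frame, sr), 1))
```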

4.
The shapes of speakers' vocal organs change under different emotional states, which shifts the emotional acoustic space of short-time features away from the neutral acoustic space and thereby degrades speaker recognition performance. Features that deviate greatly from the neutral acoustic space are considered mismatched features, and they negatively affect speaker recognition systems. Emotion variation produces different feature deformations for different phonemes, so it is reasonable to build a finer model that detects mismatched features for each phoneme. However, given the difficulty of phoneme recognition, three kinds of acoustic class recognition (phoneme classes, a Gaussian mixture model (GMM) tokenizer, and a probabilistic GMM tokenizer) are proposed to replace it. We propose feature pruning and feature regulation methods that process the mismatched features to improve speaker recognition performance. For feature regulation, a strategy of maximizing the between-class distance while minimizing the within-class distance is adopted to train a transformation matrix that regulates the mismatched features. Experiments conducted on the Mandarin affective speech corpus (MASC) show that our feature pruning and feature regulation methods increase the identification rate (IR) by 3.64% and 6.77%, respectively, compared with the baseline GMM-UBM (universal background model) algorithm. Corresponding IR increases of 2.09% and 3.32% are obtained when the methods are applied to the state-of-the-art i-vector algorithm.
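A transformation matrix trained with a "maximize between-class, minimize within-class distance" criterion can be sketched as a generalized eigenproblem on the scatter matrices (an LDA-style stand-in); the regularization term and the `train_regulation_matrix` name below are assumptions, not the paper's exact training recipe.

```python
import numpy as np
from scipy.linalg import eigh

def train_regulation_matrix(X, labels, n_dims):
    """Learn a linear transform that spreads acoustic classes apart.

    X      : (n_frames, n_features) mismatched feature vectors
    labels : (n_frames,) acoustic-class label of each frame (phoneme class or GMM token)
    Returns a (n_features, n_dims) matrix maximizing between-class scatter
    relative to within-class scatter.
    """
    X = np.asarray(X, dtype=float)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Generalized eigenproblem Sb v = lambda Sw v; keep the leading directions.
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:n_dims]]

# Toy usage: regulate 13-dim features into an 8-dim space.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13)) + np.repeat(np.arange(3)[:, None], 100, axis=0)
labels = np.repeat(np.arange(3), 100)
T = train_regulation_matrix(X, labels, n_dims=8)
print((X @ T).shape)
```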

5.
Although Hidden Markov Models (HMMs) are still the mainstream approach to speech recognition, their intrinsic limitations, such as the use of first-order Markov models and the assumption of independent and identically distributed frames, lead to extensive reliance on higher-level linguistic information to produce satisfactory results. Researchers have therefore begun investigating the incorporation of various discriminative techniques at the acoustic level to induce more discrimination between speech units. As is known, k-nearest neighbour (k-NN) density estimation is discriminant by nature and is widely used in the pattern recognition field, but its application to speech recognition has been limited to a few experiments. In this paper, we introduce a new segmental k-NN-based phoneme recognition technique. In this approach, a group-delay-based method generates phoneme boundary hypotheses, and an approximate version of k-NN density estimation is used for the classification and scoring of variable-length segments. During decoding, the construction of the phonetic graph starts from the best phoneme boundary setting and progresses by splitting and merging segments using the remaining boundary hypotheses and constraints such as phoneme duration and broad-class similarity information. To perform the k-NN search, we take advantage of a similarity search algorithm called Spatial Approximate Sample Hierarchy (SASH). One major advantage of the SASH algorithm is that its computational complexity is independent of the dimensionality of the data, which allows us to use high-dimensional feature vectors to represent phonemes. By using phonemes as units of speech, the search space is very limited and the decoding process is fast. Evaluation of the proposed algorithm using only the best hypothesis for every segment, and excluding phoneme transition probabilities, context-dependent information, and language model information, results in an accuracy of 58.5% with correctness of 67.8% on the TIMIT test dataset.
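The core idea of classifying variable-length segments with k-NN can be sketched by mapping each segment to a fixed-length vector and running an off-the-shelf k-NN classifier; the three-region averaging, the log-duration feature, and the use of sklearn instead of the SASH structure are all simplifying assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def segment_embedding(frames, n_parts=3):
    """Map a variable-length segment (n_frames, n_feats) to a fixed-length vector:
    mean over three equal sub-regions plus the log duration."""
    frames = np.asarray(frames, dtype=float)
    parts = np.array_split(frames, n_parts, axis=0)
    means = [p.mean(axis=0) for p in parts]
    return np.concatenate(means + [[np.log(len(frames))]])

# Train a k-NN phoneme classifier over fixed-length segment embeddings.
rng = np.random.default_rng(0)
train_segments = [rng.normal(loc=c, size=(rng.integers(5, 20), 13))
                  for c in range(3) for _ in range(30)]
train_labels = [c for c in range(3) for _ in range(30)]
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit([segment_embedding(s) for s in train_segments], train_labels)

test_segment = rng.normal(loc=1, size=(12, 13))
print(knn.predict([segment_embedding(test_segment)]))   # use predict_proba for segment scores
```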

6.
The authors present the Meta-Pi network, a multinetwork connectionist classifier that forms distributed low-level knowledge representations for robust pattern recognition, given random feature vectors generated by multiple statistically distinct sources. They illustrate how the Meta-Pi paradigm implements an adaptive Bayesian maximum a posteriori classifier. They also demonstrate its performance in the context of multispeaker phoneme recognition, in which the Meta-Pi superstructure combines speaker-dependent time-delay neural network (TDNN) modules to perform multispeaker /b,d,g/ phoneme recognition with speaker-dependent error rates of 2%. Finally, the authors apply the Meta-Pi architecture to a limited source-independent recognition task, illustrating its discrimination of a novel source. They demonstrate that it can adapt to the novel source (speaker), given five adaptation examples of each of the three phonemes.
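The combining behaviour of a gating superstructure over speaker-dependent experts can be sketched as a softmax-weighted mixture of expert posteriors; the `meta_pi_combine` helper and the toy numbers below are illustrative assumptions rather than the original network's trained weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def meta_pi_combine(expert_posteriors, gate_logits):
    """Combine speaker-dependent expert outputs with a gating layer.

    expert_posteriors : (n_experts, n_classes) phoneme posteriors from each TDNN module
    gate_logits       : (n_experts,) outputs of the combining superstructure for this input
    The gate is normalized with a softmax and mixes the expert posteriors,
    which is how a mixture-of-experts style combination behaves.
    """
    gate = softmax(np.asarray(gate_logits, dtype=float))
    mixed = gate @ np.asarray(expert_posteriors, dtype=float)
    return mixed, gate

posteriors = [[0.7, 0.2, 0.1],     # expert trained on speaker A
              [0.1, 0.8, 0.1],     # expert trained on speaker B
              [0.3, 0.3, 0.4]]     # expert trained on speaker C
mixed, gate = meta_pi_combine(posteriors, gate_logits=[2.0, 0.1, -1.0])
print("gate weights:", gate.round(3), "fused /b,d,g/ posteriors:", mixed.round(3))
```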

7.
A combination of an HMM-based spectral feature model, a GMM-based tone classifier, and a multilayer-perceptron-based phoneme classifier is proposed to improve the recognition rate in second-pass decoding for speech recognition. In the model combination, model scores are weighted with context-dependent model weights, and discriminative training is used to optimize these context-dependent weights to further improve the recognition results. The performance of various manually selected sets of context-dependent weights is evaluated. Continuous speech recognition experiments show that second-pass decoding with local classifiers clearly reduces the system error rate, and that the weight set combining the current syllable type with the left context reduces the error rate the most. The experiments also show that the recognition results of this method are better than those of a system based on the concatenation of spectral features, fundamental frequency features, and phoneme posterior probability features.
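A context-dependent weighted score combination for rescoring can be sketched as a log-linear sum whose weights are looked up by (syllable type, left context); the dictionary layout, the fall-back weights, and the `combined_score` helper are assumptions for illustration.

```python
import math

def combined_score(hyp, weights):
    """Log-linear combination of spectral-HMM, tone-GMM and phoneme-MLP scores
    for one hypothesis in second-pass (rescoring) decoding.

    hyp     : dict with per-model log scores plus the context key
    weights : dict mapping (syllable_type, left_context) -> per-model weights
    """
    w = weights.get((hyp["syllable_type"], hyp["left_context"]),
                    {"hmm": 1.0, "tone": 1.0, "mlp": 1.0})       # fall-back weights
    return (w["hmm"] * hyp["hmm_score"]
            + w["tone"] * hyp["tone_score"]
            + w["mlp"] * hyp["mlp_score"])

weights = {("CV", "nasal"): {"hmm": 1.0, "tone": 0.4, "mlp": 0.6}}
hyp = {"syllable_type": "CV", "left_context": "nasal",
       "hmm_score": math.log(1e-4), "tone_score": math.log(0.3), "mlp_score": math.log(0.6)}
print(combined_score(hyp, weights))
```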

8.
With the goal of building a basic platform for Uyghur continuous phoneme recognition, several key language-dependent techniques are studied for the first time on top of HTK (the hidden Markov model toolkit). Taking the linguistic characteristics of Uyghur into account, basic Uyghur texts were designed for language modeling and speech corpus construction; a relatively large speech corpus was recorded according to specific technical criteria; phonemes were chosen as the basic units and Uyghur acoustic models were trained; and, with a letter-based N-gram language model, recognition results from speech sentences to letter-sequence sentences were obtained. The recognition rates of the 32 Uyghur phonemes were analyzed, and easily confused phonemes and their causes are discussed, laying a foundation for further improving the recognition rate.
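As a rough illustration of the letter-based N-gram language-model component, the sketch below trains a letter bigram with add-one smoothing; the example strings and the `train_letter_bigram` helper are made up for illustration and are not taken from the paper's corpus.

```python
from collections import defaultdict

def train_letter_bigram(sentences):
    """Train a letter-level bigram LM with add-one smoothing from training
    sentences written as letter sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    vocab = {"</s>"}
    for sent in sentences:
        prev = "<s>"
        for ch in sent:
            counts[prev][ch] += 1
            vocab.add(ch)
            prev = ch
        counts[prev]["</s>"] += 1

    def prob(prev, ch):
        total = sum(counts[prev].values())
        return (counts[prev][ch] + 1) / (total + len(vocab))

    return prob

# Toy usage with made-up Latin-transliterated strings.
prob = train_letter_bigram(["salam", "salamet", "kitab"])
print(round(prob("s", "a"), 3), round(prob("a", "q"), 3))
```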

9.
Pronunciation variation is a major obstacle to improving the performance of Arabic automatic continuous speech recognition systems. This phenomenon alters the pronunciation of words beyond their listed forms in the pronunciation dictionary, leading to a number of out-of-vocabulary word forms. This paper presents a direct data-driven approach to modeling within-word pronunciation variations, in which the pronunciation variants are distilled from the training speech corpus. The proposed method consists of performing phoneme recognition, followed by a sequence alignment between the observation phonemes generated by the phoneme recognizer and the reference phonemes obtained from the pronunciation dictionary. The unique collected variants are then added to the dictionary as well as to the language model. We started with a baseline Arabic speech recognition system based on the Sphinx3 engine. The baseline system uses a 5.4-hour speech corpus of modern standard Arabic broadcast news, with a pronunciation dictionary of 14,234 canonical pronunciations, and achieves a word error rate of 13.39%. Our results show that while the expanded dictionary alone did not add appreciable improvements, the word error rate is significantly reduced, by 2.22%, when the variants are also represented within the language model.
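A minimal sketch of collecting pronunciation variants by aligning recognizer output against canonical dictionary entries is shown below; the similarity threshold, the use of difflib instead of a phonetically weighted alignment, and the toy lexicon entries are assumptions, not the paper's exact procedure.

```python
from difflib import SequenceMatcher

def collect_variants(dictionary, recognized):
    """Collect within-word pronunciation variants from phoneme recognizer output.

    dictionary : {word: canonical phoneme list}
    recognized : list of (word, observed phoneme list) pairs from free phoneme
                 recognition aligned to the word boundaries of the transcripts
    Returns {word: set of variant pronunciations} differing from the canonical form.
    """
    variants = {w: set() for w in dictionary}
    for word, observed in recognized:
        canon = dictionary[word]
        sm = SequenceMatcher(a=canon, b=observed)
        # Keep close-but-different pronunciations; very dissimilar strings are
        # treated as recognizer errors rather than genuine variants.
        if 0.6 <= sm.ratio() < 1.0:
            variants[word].add(tuple(observed))
    return variants

lexicon = {"kitaab": ["k", "i", "t", "aa", "b"]}          # hypothetical entry, for illustration
obs = [("kitaab", ["k", "i", "t", "a", "b"]),
       ("kitaab", ["k", "i", "t", "aa", "b"])]
print(collect_variants(lexicon, obs))
```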

10.
In speaker recognition tasks, one reason for reduced accuracy is the presence of closely resembling speakers in the acoustic space. To increase the discriminative power of the classifier, the system must be able to use only the features of a given speaker that are unique with respect to his/her acoustically resembling speakers. This paper proposes a technique to reduce confusion errors by finding speaker-specific phonemes and formulating a text from the subset of phonemes that are unique, for a speaker verification task using an i-vector based approach. Spectral features such as linear prediction cepstral coefficients (LPCC) and perceptual linear prediction coefficients (PLP), as well as the modified group delay phase feature, are investigated to analyse the importance of speaker-specific text in speaker verification. Experiments have been conducted on a speaker verification task using speech data from 50 speakers collected in a laboratory environment. The experiments show that the equal error rate (EER) decreases significantly with the i-vector approach using speaker-specific text, compared with the i-vector approach using random text, across the different spectral and phase-based features.
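One way to sketch the selection of speaker-specific phonemes is to rank phonemes by how far apart two confusable speakers' feature distributions lie; the normalized mean-distance score and the `speaker_specific_phonemes` helper are illustrative assumptions, not the paper's selection criterion.

```python
import numpy as np

def speaker_specific_phonemes(feats_a, feats_b, top_k=5):
    """Rank phonemes by how well they separate two acoustically similar speakers.

    feats_a, feats_b : {phoneme: (n_frames, n_feats) arrays} for speakers A and B
    Returns the top_k phonemes with the largest normalized mean distance, which
    could then be used to compose a speaker-specific verification text.
    """
    scores = {}
    for ph in set(feats_a) & set(feats_b):
        a, b = np.asarray(feats_a[ph]), np.asarray(feats_b[ph])
        pooled_std = 0.5 * (a.std(axis=0) + b.std(axis=0)) + 1e-8
        scores[ph] = np.linalg.norm((a.mean(axis=0) - b.mean(axis=0)) / pooled_std)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage with synthetic per-phoneme features for two speakers.
rng = np.random.default_rng(0)
A = {ph: rng.normal(loc=i * 0.3, size=(40, 13)) for i, ph in enumerate("aeiou")}
B = {ph: rng.normal(loc=0.0, size=(40, 13)) for ph in "aeiou"}
print(speaker_specific_phonemes(A, B, top_k=3))
```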

11.
Investigating new, effective feature extraction methods for the speech signal is an important way to improve the performance of automatic speech recognition (ASR) systems. Because the reconstructed phase space (RPS) is well suited to capturing the true dynamics of a signal, in this paper we propose a new method for feature extraction from the trajectory of the speech signal in the RPS. The method models the speech trajectory with a multivariate autoregressive (MVAR) model. We then apply linear discriminant analysis (LDA), which simultaneously decorrelates and reduces the dimension of the final feature set. Experimental results show that an MVAR model of order 6 is appropriate for modeling the trajectory of speech signals in the RPS. Recognition experiments are conducted with an HMM-based continuous speech recognition system and a naive Bayes isolated phoneme classifier on the Persian FARSDAT and American English TIMIT corpora to compare the proposed features with older RPS-based features and traditional spectral MFCC features.
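A compact sketch of the pipeline (time-delay embedding into the RPS, least-squares MVAR coefficients as raw features, LDA for decorrelation and dimension reduction) is given below; the embedding dimension, lag, and the `rps_embed` / `mvar_features` helpers are assumptions chosen for illustration, with only the MVAR order of 6 taken from the abstract.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def rps_embed(x, dim=3, lag=4):
    """Time-delay embedding of a 1-D frame into the reconstructed phase space."""
    n = len(x) - (dim - 1) * lag
    return np.column_stack([x[i * lag:i * lag + n] for i in range(dim)])

def mvar_features(x, dim=3, lag=4, order=6):
    """Fit an order-6 multivariate AR model to the RPS trajectory of one frame by
    least squares and return the flattened coefficient matrix as the raw feature."""
    Z = rps_embed(np.asarray(x, dtype=float), dim, lag)
    Y = Z[order:]
    # Regressor: the `order` previous trajectory points, stacked per time step.
    X = np.hstack([Z[order - k - 1:len(Z) - k - 1] for k in range(order)])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef.ravel()

# Toy usage: raw MVAR features per frame, then LDA for decorrelation + reduction.
rng = np.random.default_rng(0)
frames = [np.sin(0.05 * (c + 1) * np.arange(400)) + 0.1 * rng.normal(size=400)
          for c in range(3) for _ in range(20)]
labels = np.repeat(np.arange(3), 20)
F = np.array([mvar_features(f) for f in frames])
reduced = LinearDiscriminantAnalysis(n_components=2).fit_transform(F, labels)
print(F.shape, "->", reduced.shape)
```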

12.
This paper proposes a new clustering-based segmentation algorithm. Using the criterion of minimizing the average within-segment dispersion and maximizing the average between-segment dispersion, the algorithm iteratively selects the best segmentation boundaries and the number of segments by clustering, and can correctly segment Mandarin speech into phonemes, with a large performance improvement over previous segmentation methods. The paper also reports experimental statistics obtained by applying the algorithm to Mandarin monosyllables, which can serve as a reference for further research on phoneme- or phone-based Mandarin speech recognition.
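The dispersion criterion can be sketched with a greedy boundary-insertion loop that keeps within-segment dispersion small while growing between-segment dispersion; the greedy search, the score ratio, and the `segment` helper are stand-ins for the paper's iterative clustering procedure.

```python
import numpy as np

def within_between(frames, bounds):
    """Average within-segment and between-segment dispersion for one boundary set."""
    segs = [frames[b0:b1] for b0, b1 in zip(bounds[:-1], bounds[1:])]
    means = np.array([s.mean(axis=0) for s in segs])
    within = np.mean([((s - s.mean(axis=0)) ** 2).sum(axis=1).mean() for s in segs])
    between = ((means - means.mean(axis=0)) ** 2).sum(axis=1).mean()
    return within, between

def segment(frames, n_segments):
    """Greedily insert boundaries so that within-segment dispersion stays small
    and between-segment dispersion grows."""
    frames = np.asarray(frames, dtype=float)
    bounds = [0, len(frames)]
    while len(bounds) - 1 < n_segments:
        best, best_score = None, -np.inf
        for cand in range(2, len(frames) - 1):
            if cand in bounds:
                continue
            w, b = within_between(frames, sorted(bounds + [cand]))
            score = b / (w + 1e-8)
            if score > best_score:
                best, best_score = cand, score
        bounds = sorted(bounds + [best])
    return bounds

# Toy usage: three synthetic "phonemes" of 30 frames each.
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(c, 0.2, (30, 12)) for c in (0.0, 2.0, 4.0)])
print(segment(frames, n_segments=3))
```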

13.
For non-stationary and quasi-stationary signals, the wavelet transform has been found to be an effective tool for time-frequency analysis, and in recent years it has been used for feature extraction in speech recognition applications. Here a new filter structure based on admissible wavelet packet analysis is proposed for Hindi phoneme recognition. These filters have the benefit of frequency band spacing similar to the auditory equivalent rectangular bandwidth (ERB) scale, whose center frequencies are equally distributed along the frequency response of the human cochlea. The phoneme recognition performance of the proposed feature is compared with standard baseline features and 24-band admissible wavelet packet-based features using a hidden Markov model (HMM) based classifier. The proposed feature shows better performance than conventional features for Hindi consonant recognition. To evaluate the robustness of the proposed feature in noisy environments, the NOISEX-92 database has been used.
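The two ingredients (ERB-spaced center frequencies and wavelet packet subband energies) can be sketched as below; the wavelet choice, decomposition level, and which terminal nodes are kept or merged to match the ERB spacing are all assumptions and not the paper's specific admissible-tree design.

```python
import numpy as np
import pywt

def erb_center_frequencies(n_bands, fmin=100.0, fmax=8000.0):
    """Center frequencies equally spaced on the ERB-rate scale (Glasberg & Moore)."""
    erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)
    inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437
    return inv(np.linspace(erb(fmin), erb(fmax), n_bands))

def wavelet_packet_energies(frame, wavelet="db4", level=5):
    """Log subband energies from a full wavelet packet decomposition of one frame.
    Selecting/merging the leaves so their bandwidths follow the ERB spacing above
    is the paper-specific step and is not shown here."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, mode="symmetric", maxlevel=level)
    nodes = wp.get_level(level, order="freq")        # leaves ordered by frequency
    energies = np.array([np.sum(np.square(n.data)) for n in nodes])
    return np.log(energies + 1e-10)

print(erb_center_frequencies(8).round(1))
frame = np.random.default_rng(0).normal(size=512)    # one 32 ms frame at 16 kHz
print(wavelet_packet_energies(frame).shape)          # 2**5 = 32 subband energies
```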

14.
To address the poor recognition performance of conventional speech features in complex environments, a speech feature extraction method based on Gammatone filters and subband power normalization is proposed. Built on the power-normalized cepstral coefficient (PNCC) algorithm, the method introduces a smoothed amplitude envelope and a normalized Gammatone filterbank at the front end, suppresses real-world background noise through subband power normalization, and is further improved at the back end by feature warping and channel compensation. Experiments compare the algorithm with other feature parameters using a Gaussian mixture model-universal background model (GMM-UBM) classifier. The results show that, compared with other features in a variety of noisy environments, the proposed method exhibits good noise robustness and maintains good recognition performance even at low signal-to-noise ratios.
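A greatly simplified sketch of PNCC-style subband power normalization is shown below, assuming the Gammatone subband powers are already computed; the smoothing window, noise-floor percentile, and power-law exponent are illustrative choices, and the full asymmetric noise suppression, feature warping, and channel compensation steps are omitted.

```python
import numpy as np
from scipy.fft import dct

def pncc_like(subband_power, n_ceps=13, smooth=5, power=0.15):
    """Very simplified PNCC-style processing of Gammatone subband powers.

    subband_power : (n_frames, n_bands) power in each Gammatone channel
    Steps shown: medium-time smoothing, per-band normalization against a crude
    noise floor, power-law compression, and a DCT to cepstral coefficients.
    """
    P = np.asarray(subband_power, dtype=float)
    # Medium-time power: moving average over `smooth` frames per band.
    kernel = np.ones(smooth) / smooth
    Q = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, P)
    # Normalize each band by a crude noise-floor estimate (low percentile of its power).
    floor = np.percentile(Q, 10, axis=0) + 1e-10
    normalized = np.maximum(Q / floor, 1e-10)
    compressed = normalized ** power                 # power-law nonlinearity
    return dct(compressed, type=2, axis=1, norm="ortho")[:, :n_ceps]

# Toy usage with synthetic 40-channel Gammatone powers for 200 frames.
rng = np.random.default_rng(0)
fake_power = rng.gamma(shape=2.0, scale=1.0, size=(200, 40))
print(pncc_like(fake_power).shape)
```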

15.
The field of speech recognition is developing rapidly, and existing research results show that acoustic feature sets contain considerable complementary information. This paper proposes a trajectory-based spatio-temporal spectral feature method for speech emotion recognition. Its core idea is to obtain spatial and temporal descriptors from the speech spectrogram for categorical and dimensional emotion recognition. Experiments with exhaustive feature extraction show that, compared with feature extraction methods such as MFCCs and fundamental frequency, the proposed method is more robust under noisy conditions. It achieves comparable unweighted average recall and fairly accurate results in four-class emotion recognition experiments, and also shows a significant improvement in voice activity detection.
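The snippet below sketches generic spatial and temporal descriptors of a spectrogram trajectory (per-frame spectral centroid and bandwidth, plus statistics of the centroid track over time); these particular descriptors and the `spatiotemporal_descriptors` helper are stand-ins, not the paper's descriptor set.

```python
import numpy as np
from scipy.signal import spectrogram

def spatiotemporal_descriptors(signal, sr):
    """Simple spatial/temporal spectrogram descriptors for emotion features.

    Spatial: per-frame spectral centroid and bandwidth (where energy sits in frequency).
    Temporal: mean, std, and overall slope of the centroid trajectory across time.
    """
    f, t, S = spectrogram(signal, fs=sr, nperseg=400, noverlap=240)
    S = S + 1e-12
    centroid = (f[:, None] * S).sum(axis=0) / S.sum(axis=0)
    bandwidth = np.sqrt(((f[:, None] - centroid) ** 2 * S).sum(axis=0) / S.sum(axis=0))
    slope = np.polyfit(t, centroid, deg=1)[0]        # drift of the centroid trajectory
    return np.array([centroid.mean(), centroid.std(),
                     bandwidth.mean(), bandwidth.std(), slope])

# Toy usage with a synthetic rising-frequency signal.
sr = 16000
t = np.arange(0, 1.0, 1 / sr)
chirp_like = np.sin(2 * np.pi * (300 + 200 * t) * t)
print(spatiotemporal_descriptors(chirp_like, sr).round(2))
```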

16.
17.
18.
Spectro-temporal representation of speech has become one of the leading signal representation approaches in speech recognition systems in recent years. This representation suffers from the high dimensionality of the feature space, which makes it unsuitable for practical speech recognition systems. In this paper, a new clustering-based method is proposed for secondary feature selection/extraction in the spectro-temporal domain. In the proposed representation, Gaussian mixture model (GMM) and weighted K-means (WKM) clustering techniques are applied in the spectro-temporal domain to reduce the dimensionality of the feature space. The elements of the centroid vectors and covariance matrices of the clusters are taken as the attributes of each frame's secondary feature vector. To evaluate the efficiency of the proposed approach, the new feature vectors were tested on classification of phonemes within the main phoneme categories of the TIMIT database. Employing the proposed secondary feature vector yields a significant improvement in the classification rate of different sets of phonemes compared with MFCC features. The average improvement in the classification rate of voiced plosives over MFCC features is 5.9% using WKM clustering and 6.4% using GMM clustering; the greatest improvement, about 7.4%, is obtained with WKM clustering in the classification of front vowels.
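Building a secondary feature vector from cluster parameters can be sketched with off-the-shelf GMM and K-means fits over a frame's spectro-temporal points; the cluster count, the diagonal-covariance choice, and the `secondary_features` helper are assumptions, and weighting of K-means is only hinted at via `sample_weight`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

def secondary_features(spectro_temporal, n_clusters=4, use_gmm=True, weights=None):
    """Compress one frame's high-dimensional spectro-temporal representation.

    spectro_temporal : (n_points, n_dims) spectro-temporal responses for one frame
    The cluster centroids (and, for the GMM, the diagonal covariances) are
    concatenated into a low-dimensional secondary feature vector.
    """
    X = np.asarray(spectro_temporal, dtype=float)
    if use_gmm:
        gmm = GaussianMixture(n_components=n_clusters, covariance_type="diag",
                              random_state=0).fit(X)
        return np.concatenate([gmm.means_.ravel(), gmm.covariances_.ravel()])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(X, sample_weight=weights)          # weights (e.g. energies) give a weighted K-means
    return km.cluster_centers_.ravel()

# Toy usage: 500 spectro-temporal points of dimension 3 for one frame.
rng = np.random.default_rng(0)
frame_repr = rng.normal(size=(500, 3))
print(secondary_features(frame_repr).shape,
      secondary_features(frame_repr, use_gmm=False).shape)
```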

19.
This paper describes an approach for automatic scoring of pronunciation quality for non-native speech that is applicable regardless of the foreign-language student's mother tongue. Sentences and words are considered as scoring units. Additionally, mispronunciation and phoneme confusion statistics for the target-language phoneme set are derived from human annotations and word-level scoring results using a Markov chain model of mispronunciation detection. The proposed methods can be employed to build part of the scoring module of a system for computer-assisted pronunciation training (CAPT). Methods from pattern and speech recognition are applied to develop appropriate feature sets for sentence- and word-level scoring. Besides features well known from and validated in previous research, e.g. phoneme accuracy, posterior score, duration score and recognition accuracy, new features such as high-level phoneme confidence measures are identified. The proposed method is evaluated with native English speech, non-native English speech from German, French, Japanese, Indonesian and Chinese adults, and non-native speech from German school children. The speech data are annotated with tags for mispronounced words and sentence-level ratings by native English teachers. Experimental results show that the reliability of automatic sentence-level scoring by the system is almost as high as that of the average human evaluator, and good performance is achieved for detecting mispronounced words. In a validation experiment, it was also verified that the system gives the highest pronunciation quality scores to 90% of native speakers' utterances. Automatic error diagnosis based on an automatically derived phoneme mispronunciation statistic showed reasonable results for five non-native speaker groups; these statistics can be exploited to give non-native speakers feedback on mispronounced phonemes.
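The posterior-score family of features mentioned above can be sketched as a goodness-of-pronunciation (GOP) style confidence per aligned phone; the exact score definition and the `gop_scores` helper are assumptions illustrating the idea, not this paper's feature definitions.

```python
import numpy as np

def gop_scores(frame_logposts, alignment):
    """Goodness-of-pronunciation style phoneme confidences from frame posteriors.

    frame_logposts : (n_frames, n_phonemes) log posterior of every phoneme per frame
    alignment      : list of (phoneme_index, start_frame, end_frame) from forced alignment
    GOP(p) = mean over the segment of [log P(p|frame) - max_q log P(q|frame)];
    values near 0 suggest the canonical phoneme matches, strongly negative values
    suggest a mispronunciation. The sentence score here is simply the mean phone GOP.
    """
    L = np.asarray(frame_logposts, dtype=float)
    phone_scores = []
    for ph, start, end in alignment:
        seg = L[start:end]
        phone_scores.append(float((seg[:, ph] - seg.max(axis=1)).mean()))
    return phone_scores, float(np.mean(phone_scores))

# Toy usage with random posteriors over 40 phonemes and a 3-phone alignment.
rng = np.random.default_rng(0)
logits = rng.normal(size=(30, 40))
logposts = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
print(gop_scores(logposts, [(3, 0, 10), (17, 10, 22), (5, 22, 30)]))
```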

20.
The prime objective of this paper is to conduct phoneme categorization experiments for Indian languages. In this direction, a major effort has been made to categorize Hindi phonemes using a time-delay neural network (TDNN) and to compare the recognition scores with those for other languages. A total of six neural nets, aimed at the major coarse phonetic classes in Hindi, were trained. Evaluation of each net on 350 training tokens and 40 test tokens revealed a 99% recognition rate for vowels, 87% for unvoiced stops, 82% for voiced stops, 94.7% for semivowels, 98.1% for nasals and 96.4% for fricatives. A new feature vector normalisation technique is proposed to improve the recognition scores.
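A tiny TDNN-style classifier can be sketched with 1-D convolutions over time followed by pooling over the token; the layer sizes, activations, and the `TinyTDNN` class are illustrative assumptions and not the paper's architecture or its normalisation technique.

```python
import torch
import torch.nn as nn

class TinyTDNN(nn.Module):
    """A small time-delay neural network in the spirit of classic TDNNs:
    1-D convolutions over time implement the delayed, weight-shared connections,
    followed by pooling over the whole token and an output layer over a coarse
    phonetic class (e.g. the voiced stops)."""

    def __init__(self, n_feats=16, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_feats, 8, kernel_size=3), nn.Sigmoid(),   # first delay layer (context +/-1)
            nn.Conv1d(8, 3, kernel_size=5), nn.Sigmoid(),         # second delay layer (wider context)
        )
        self.out = nn.Linear(3, n_classes)

    def forward(self, x):                 # x: (batch, n_feats, n_frames)
        h = self.net(x)
        pooled = h.mean(dim=2)            # integrate evidence over the whole token
        return self.out(pooled)

# Toy forward pass on a batch of 4 tokens, 16 filterbank channels, 15 frames each.
model = TinyTDNN()
tokens = torch.randn(4, 16, 15)
print(model(tokens).shape)                # (4, 3) class scores
```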

