Similar Literature
 20 similar records found (search time: 15 ms)
1.
This paper compares the performance of different feature representations of the speech signal, and of different classification procedures, for Slovene phoneme recognition. Recognition results are obtained on a database of continuous Slovene speech consisting of short Slovene sentences spoken by female speakers. MEL-cepstrum and LPC-cepstrum features combined with the normalized frame loudness were found to be the most suitable feature representations for Slovene speech. Determining the MEL-cepstrum with linearly spaced bandpass filters gave significantly better results for speaker-dependent recognition. The comparison of classification procedures favours Bayes classification assuming normally distributed feature vectors (BNF) over classification based on quadratic discriminant functions (DF) for minimum mean-square error and over the subspace method (SM), which does not confirm the results obtained in some previous studies for German and Finnish speech. Additionally, classification procedures based on hidden Markov models (HMM) and the Kohonen Self-Organizing Map (KSOM) were tested on a smaller amount of speech data (one speaker only); their classification results are comparable with those of BNF classification.
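As a rough illustration of what "Bayes classification assuming normal distribution of the feature vectors" involves, here is a minimal sketch, assuming one full-covariance Gaussian per phoneme class and class priors estimated from class frequencies. The class name `GaussianBayes` and the small covariance ridge are illustrative choices, not details from the paper.

```python
import numpy as np

class GaussianBayes:
    """Bayes classifier with one full-covariance Gaussian per class (hypothetical sketch of 'BNF')."""

    def fit(self, X, y):
        X = np.asarray(X, float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.params_ = []
        for c in self.classes_:
            Xc = X[y == c]
            mean = Xc.mean(axis=0)
            # A small ridge keeps the covariance invertible for short training sets.
            cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
            log_prior = np.log(len(Xc) / len(X))
            self.params_.append((mean, np.linalg.inv(cov), np.linalg.slogdet(cov)[1], log_prior))
        return self

    def log_posteriors(self, X):
        """Unnormalized log posteriors; the shared -d/2*log(2*pi) term is omitted."""
        X = np.asarray(X, float)
        scores = []
        for mean, inv_cov, logdet, log_prior in self.params_:
            d = X - mean
            maha = np.einsum("ij,jk,ik->i", d, inv_cov, d)  # squared Mahalanobis distance
            scores.append(log_prior - 0.5 * (logdet + maha))
        return np.stack(scores, axis=1)

    def predict(self, X):
        return self.classes_[np.argmax(self.log_posteriors(X), axis=1)]
```

Each class contributes a quadratic decision score, so the resulting boundaries are quadratic even though no discriminant function is trained directly.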

2.
We derive learning rates such that all training patterns are equally important statistically and the learning outcome is independent of the order in which training patterns are presented, provided the competitive neurons win the same sets of training patterns regardless of the order of presentation. We show that under these schemes the learning rules of the two weight normalization approaches, the length constraint and the sum constraint, yield practically the same results if the competitive neurons win the same sets of training patterns under both constraints. These theoretical results are illustrated with computer simulations.
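One well-known schedule with the order-independence property described above is a winner learning rate of 1/n for the n-th pattern a neuron wins, which keeps the weight equal to the mean of the patterns it has won. The sketch below assumes that schedule (the paper's exact rates and its weight normalization constraints are not reproduced) and contrasts it with a constant rate, whose result depends on presentation order. Function names are hypothetical.

```python
import numpy as np

def running_mean_weight(patterns, w0):
    """Winner updates with learning rate 1/n for the n-th pattern won (assumed schedule).

    The first update (eta = 1) overwrites w0; afterwards w stays equal to the
    mean of the patterns won so far, so the order of presentation does not matter.
    """
    w = np.asarray(w0, dtype=float)
    for n, x in enumerate(patterns, start=1):
        eta = 1.0 / n
        w = w + eta * (np.asarray(x, dtype=float) - w)
    return w

def constant_rate_weight(patterns, w0, eta=0.1):
    """Same loop with a fixed learning rate: later patterns get more weight."""
    w = np.asarray(w0, dtype=float)
    for x in patterns:
        w = w + eta * (np.asarray(x, dtype=float) - w)
    return w
```

With `P = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])`, `running_mean_weight(P, [9, 9])` and `running_mean_weight(P[::-1], [0, 0])` both match `P.mean(axis=0)` up to rounding, while the two orders give different results under `constant_rate_weight`.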

3.
Single-speaker and multispeaker recognition results are presented for the voiced stop consonants /b,d,g/ using time-delay neural networks (TDNNs) with a number of enhancements, including a new objective function for training these networks. The new objective function, called the classification figure of merit (CFM), differs markedly from the traditional mean-squared-error (MSE) objective function and the related cross entropy (CE) objective function. Whereas the MSE and CE objective functions seek to minimize the difference between each output node and its ideal activation, the CFM function seeks to maximize the difference between the output activation of the node representing the correct classification and the output activations of the nodes representing incorrect classifications. A simple arbitration mechanism is used with all three objective functions to achieve a median 30% reduction in the number of misclassifications compared with TDNNs trained with the traditional MSE back-propagation objective function alone.
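To make the contrast between the objectives concrete, here is a minimal sketch, assuming the commonly cited form of CFM: a sigmoid of the margins between the correct node and each incorrect node, averaged over the incorrect nodes. The parameter values `alpha`, `beta` and `zeta` are illustrative, not taken from the paper, and the arbitration mechanism is not modelled.

```python
import numpy as np

def mse(outputs, target_index):
    """MSE against a one-hot target: every output node is pushed toward its ideal value."""
    outputs = np.asarray(outputs, float)
    target = np.zeros_like(outputs)
    target[target_index] = 1.0
    return float(np.mean((outputs - target) ** 2))

def cfm(outputs, target_index, alpha=1.0, beta=4.0, zeta=0.0):
    """Classification figure of merit (to be maximized), assumed sigmoid-of-margins form.

    delta_n = y_correct - y_n for every incorrect node n; the score is the mean
    of alpha / (1 + exp(-(beta * delta_n + zeta))).
    """
    outputs = np.asarray(outputs, float)
    y_c = outputs[target_index]
    delta = y_c - np.delete(outputs, target_index)
    return float(np.mean(alpha / (1.0 + np.exp(-(beta * delta + zeta)))))
```

Because CFM depends only on differences between the correct node and the others, a correctly ranked output vector can score well even when no node is near its ideal activation, which MSE would still penalize.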

4.
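Research on Speech Recognition and Phoneme Segmentation Based on Dynamic Bayesian Networks   Total citations: 1, self-citations: 1, citations by others: 0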
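This paper studies a speech recognition modeling method based on dynamic Bayesian networks (DBN). Using the GMTK (Graphical Models Toolkit) tool, phoneme-level audio-stream DBN models for speech training and recognition are built, the recognition results are compared with those of conventional hidden-Markov-model (HMM) based speech recognition, and word and phoneme segmentation results are given. Experiments show that under all tested signal-to-noise ratio conditions, DBN-based recognition results are comparable to HMM-based results and show a degree of noise robustness, and the phoneme segmentation results are also fairly accurate.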

5.
The Journal of Supercomputing - Because of the rise of deep learning and neural networks, algorithms based on deep learning have also been developed and subtly applied in daily life. This paper...

6.
This paper presents a novel pipelined architecture for competitive learning (CL). The architecture is implemented on a field programmable gate array (FPGA) and is used as a hardware accelerator in a system on programmable chip (SOPC) to reduce computation time. In the architecture, a novel codeword swapping scheme is adopted so that neuron competitions for different training vectors can be operated concurrently. The neuron updating process is based on a hardware divider with simple table lookup operations. The divider performs finite-precision calculations to reduce area cost at the expense of a slight degradation in training performance. The CPU time of the NIOS processor executing CL training with the proposed architecture as an accelerator is measured. Experimental results show that the NIOS processor with the proposed architecture as an accelerator can achieve a speedup of up to 254 over its software counterpart running on a general-purpose Pentium IV processor without hardware support.

7.
In this paper, we propose a new method called information enhancement to interpret the internal representations of competitive learning. We treat competitive learning as a process of mutual information maximisation on input patterns. We then examine to what extent this mutual information can be increased or decreased by focusing on, or enhancing, specific elements in a network. If enhancing an element increases information on input patterns, that element carries more information on input patterns, so only those elements need to be examined carefully. We applied the method to an artificial problem, the Iris problem and an air pollution problem. In all problems, we succeeded in extracting important features in patterns. In addition, the final maps were better than those obtained by the conventional self-organising map. We consider this a first step towards a full understanding of internal representations in competitive learning.
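The quantity being maximised can be written as I = Σ_s p(s) Σ_j p(j|s) log(p(j|s)/p(j)) between input patterns s and competitive units j. The sketch below is an assumption-laden illustration: it uses a softmax over negative squared distances for p(j|s), uniform p(s), and a hypothetical per-feature `gain` vector to mimic "enhancing" an input element; the paper's exact formulation may differ.

```python
import numpy as np

def competitive_mutual_information(X, W, sigma=1.0, gain=None):
    """Mutual information between input patterns and competitive units (hypothetical sketch).

    p(j|s): softmax of -sum_k gain_k * (x_sk - w_jk)^2 / sigma; p(s) is uniform.
    `gain` enhances individual input features to probe how much information they carry.
    """
    X = np.asarray(X, float)
    W = np.asarray(W, float)
    g = np.ones(X.shape[1]) if gain is None else np.asarray(gain, float)
    d2 = (((X[:, None, :] - W[None, :, :]) ** 2) * g).sum(axis=2)  # (patterns, units)
    logits = -d2 / sigma
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p_j_given_s = np.exp(logits)
    p_j_given_s /= p_j_given_s.sum(axis=1, keepdims=True)
    p_j = p_j_given_s.mean(axis=0)  # marginal over uniformly weighted patterns
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(p_j_given_s > 0, p_j_given_s / p_j, 1.0)
    return float(np.mean(np.sum(p_j_given_s * np.log(ratio), axis=1)))
```

In a toy case where feature 0 separates two clusters and feature 1 only varies within clusters, enhancing feature 0 raises the information, while enhancing feature 1 leaves it unchanged.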

8.
Nonlinear feature extraction of speech signals has been a main concern of much research in recent years. In this paper, feature extraction of phonemes using the NPC (neural predictive coding) model is generalized to a combination of the time and DCT domains. Two main ideas are proposed and evaluated. First, a frame-wise DCT-based NPC feature extractor is proposed to reduce the computational complexity of the system. The basis of this approach is the application of a DCT pre-feature extractor to remove unwanted additional data; the extracted features are the outputs of the hidden layer. It is shown that using a pre-processing stage can improve both computational efficiency and accuracy. In the second approach, we propose a complementary role for DCT-domain features in classic NPC modeling, using the residual of the predicted signal in the DCT domain. The experiments were conducted on the voiced plosive phonemes of the TIMIT database. Simulations showed that the combined method performs well on plosive phonemes: the proposed method achieved a 70.3% recognition rate on the /b/, /d/, /g/ phonemes, which is higher than the results of traditional NPC approaches.
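A frame-wise DCT pre-feature extractor of the kind described above can be sketched as: split the signal into overlapping frames, apply an orthonormal DCT-II to each frame, and keep only the low-order coefficients. The frame length, hop size and number of kept coefficients below are hypothetical defaults, and the NPC predictor itself is not shown.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis; row k is the k-th basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.cos(np.pi * (i + 0.5) * k / n) * np.sqrt(2.0 / n)
    M[0] /= np.sqrt(2.0)  # scale the DC row so the basis is orthonormal
    return M

def frame_dct_features(signal, frame_len=256, hop=128, n_coeffs=20):
    """Keep the first n_coeffs DCT-II coefficients of each frame (illustrative parameters).

    Requires len(signal) >= frame_len; trailing samples that do not fill a frame are dropped.
    """
    x = np.asarray(signal, float)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[j * hop: j * hop + frame_len] for j in range(n_frames)])
    return frames @ dct_matrix(frame_len)[:n_coeffs].T
```

Truncating the DCT discards high-order coefficients before any network sees the frame, which is one way a pre-processing stage can shrink the predictor's input size.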

9.
10.
An unsupervised competitive learning algorithm based on the classical k-means clustering algorithm is proposed. The proposed learning algorithm, called the centroid neural network (CNN), estimates the centroids of the related cluster groups in the training data. This paper also explains the algorithmic relationships between the CNN and some conventional unsupervised competitive learning algorithms, including Kohonen's self-organizing map and Kosko's differential competitive learning algorithm. The CNN algorithm requires neither a predetermined schedule for the learning coefficient nor a total number of iterations for clustering. Simulation results on clustering and image compression problems show that the CNN converges much faster than conventional algorithms with comparable clustering quality, whereas other algorithms may give unstable results depending on the initial values of the learning coefficient and the total number of iterations.
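The key property claimed above, no learning-coefficient schedule and no fixed iteration count, follows from updating centroids exactly: when a sample joins a cluster of n members the centroid moves by (x - w)/(n + 1), and when it leaves, the old cluster's centroid moves by -(x - w)/(n - 1). The sketch below is a simplified version under stated assumptions: k is fixed and centroids are seeded with the first k samples, whereas the published CNN grows its number of clusters.

```python
import numpy as np

def centroid_clustering(X, k, max_epochs=100):
    """Simplified CNN-style clustering with exact incremental centroid updates (hypothetical sketch).

    Stops as soon as an epoch passes with no sample changing cluster.
    """
    X = np.asarray(X, float)
    centroids = X[:k].copy()
    counts = np.zeros(k, dtype=int)
    labels = np.full(len(X), -1)
    for _ in range(max_epochs):
        changed = 0
        for i, x in enumerate(X):
            winner = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
            loser = labels[i]
            if winner == loser:
                continue
            # Winner absorbs x: its centroid becomes the mean of the enlarged cluster.
            counts[winner] += 1
            centroids[winner] += (x - centroids[winner]) / counts[winner]
            if loser >= 0:
                # Loser releases x: its centroid becomes the mean of the remaining members.
                counts[loser] -= 1
                if counts[loser] > 0:
                    centroids[loser] -= (x - centroids[loser]) / counts[loser]
            labels[i] = winner
            changed += 1
        if changed == 0:
            break
    return centroids, labels
```

Because each centroid is always the exact mean of its current members, there is no step size to tune; convergence is detected by the absence of reassignments.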

11.
12.
This paper introduces a neural network optimization procedure that generates multilayer perceptron (MLP) topologies with few connections, low complexity and high classification performance for phoneme recognition. An efficient constructive algorithm with incremental training, using a newly proposed Frame by Frame Neural Networks (FFNN) classification approach for automatic phoneme recognition, is thus proposed. It is based on a novel hidden-neuron recruiting procedure for a single hidden layer. After an initialization phase that starts with a small number of hidden neurons, the algorithm lets the neural network (NN) adjust its parameters automatically during the training phase. The modular FFNN classification method is then constructed and tested on the recognition of 5 broad phonetic classes extracted from the TIMIT database. To take into account the speech variability caused by coarticulation, a Context Window of Three Successive Frames (CWTSF) analysis is applied. Although a significant reduction in computational training time is observed, this technique penalized the overall Phone Recognition Rate (PRR) and increased the complexity of the recognition system. To alleviate these limitations, two feature dimensionality reduction techniques, based respectively on Principal Component Analysis (PCA) and Self Organizing Maps (SOM), are investigated. A significant improvement in the performance of the recognition system is observed when PCA is applied. An optimal neural phone recognition architecture is finally derived according to the following criteria: best PRR, minimum computational training time, and minimum complexity of the BPNN architecture.
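The context-window plus PCA pipeline above can be sketched in two steps: concatenate each frame with its predecessor and successor, then project the stacked vectors onto the top principal components. This is a generic sketch, assuming edge frames are padded by repetition and PCA is computed via SVD of mean-centred data; the paper's feature set and number of retained components are not reproduced.

```python
import numpy as np

def stack_context(frames):
    """Concatenate each frame with its predecessor and successor (edges padded by repetition)."""
    F = np.asarray(frames, float)
    padded = np.vstack([F[:1], F, F[-1:]])
    return np.hstack([padded[:-2], padded[1:-1], padded[2:]])

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components (SVD of mean-centred data).

    Returns the projections, the component vectors, the mean, and the fraction of
    variance explained by each kept component.
    """
    X = np.asarray(X, float)
    mean = X.mean(axis=0)
    Xc = X - mean
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return Xc @ components.T, components, mean, explained[:n_components]
```

Stacking three frames triples the input dimension; PCA then compresses it back, which is the trade-off the abstract describes between context and complexity.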

13.
Optimal representation of acoustic features is an ongoing challenge in automatic speech recognition research. As an initial step toward this goal, some approaches optimize the filterbanks used for cepstral coefficients with evolutionary optimization methods. However, the large number of optimization parameters required by a filterbank makes it difficult to guarantee that a single optimized filterbank provides the best representation for phoneme classification. Moreover, in many cases several potential solutions are obtained, each discriminating between specific groups of phonemes; in other words, each filterbank has its own particular advantage. Aggregating the discriminative information provided by these filterbanks is therefore a demanding task. In this study, a number of complementary filterbanks are optimized to provide different representations of speech signals for phoneme classification using hidden Markov models (HMMs). Fuzzy information fusion is used to aggregate the decisions of the HMMs, since fuzzy theory can effectively handle the uncertainties of classifiers trained on different representations of speech data. The outputs of each expert's HMM classifiers are fused using a fuzzy decision fusion scheme, which employs global and local confidence measures to formulate the reliability of each classifier from both global and local context when making overall decisions. Experiments were conducted on clean and noisy phonetic samples. The proposed method outperformed conventional Mel frequency cepstral coefficients under both conditions in terms of overall phoneme classification accuracy, and the fuzzy fusion scheme was shown to be capable of aggregating the complementary information provided by each filterbank.

14.
In this paper, we present a general guideline for finding a better distance measure for similarity estimation, based on statistical analysis of distribution models and distance functions. A new set of distance measures is derived from the harmonic distance, the geometric distance, and their generalized variants according to maximum likelihood theory. These measures can provide a more accurate feature model than the classical Euclidean and Manhattan distances. We also find that feature elements often come from heterogeneous sources that may influence similarity estimation differently; therefore, the assumption of a single isotropic distribution model is often inappropriate. To alleviate this problem, we use a boosted distance measure framework that finds multiple distance measures that best fit the distribution of selected feature elements for accurate similarity estimation. The new distance measures are tested on two applications: stereo matching and motion tracking in video sequences. The performance of the boosted distance measure is further evaluated on several benchmark data sets from the UCI repository and on two image retrieval applications. In all experiments, the proposed methods yield robust results.

15.

The average cross-correlation coefficient (ACCC) method is a traditional Doppler centroid estimation (DCE) method for complex radar data. Although the ACCC method is accurate enough for moving platforms with small acceleration, it cannot be applied to high-squint curved-trajectory synthetic aperture radar (SAR), because the larger acceleration involved causes serious Doppler spectrum expansion and results in Doppler aliasing. To solve this problem, this paper proposes a new DCE method for cases with large acceleration. The performance improvement is achieved by additionally compensating the phase and envelope terms induced by the acceleration. The presented approach is evaluated through computer simulations.
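The conventional ACCC estimator that the paper starts from takes the phase of the lag-one azimuth autocorrelation, f_dc = PRF · arg(Σ s[n+1] s*[n]) / (2π). The minimal sketch below implements only that baseline, not the proposed acceleration compensation, and it shows why aliasing is a problem: the estimate is unambiguous only within ±PRF/2.

```python
import numpy as np

def accc_doppler_centroid(signal, prf):
    """Baseline ACCC Doppler centroid estimate (no acceleration compensation).

    signal: complex azimuth samples, 1-D or 2-D with azimuth along axis 0
    (the correlation is then summed over range bins as well).
    The returned value lies in [-prf/2, prf/2]; larger centroids alias.
    """
    s = np.asarray(signal)
    acc = np.sum(s[1:] * np.conj(s[:-1]))  # lag-one azimuth autocorrelation
    return float(prf * np.angle(acc) / (2 * np.pi))
```

For a pure tone at 120 Hz sampled at PRF = 1000 Hz the estimate is 120 Hz, but a tone at 1120 Hz also yields 120 Hz, which is the kind of ambiguity that spectrum expansion under large acceleration can trigger.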

16.
Physical activity monitoring for youth is an area of increasing scientific and public health interest due to the high prevalence of obesity and the downward trend in physical activity. However, accurate assessment of such activity remains challenging because of the complex manner in which certain activities are performed. In this study, we formulated the issue as a machine learning problem, using a diverse set of 19 physical activities commonly performed by youth, via two approaches: activity recognition and intensity estimation. With the aid of training data, we implemented a distance metric learning method called DML-KNN that uses time-frequency features and can effectively classify both continuous and intermittent movement in youth subjects. Four different time-frequency feature extraction methods were then systematically evaluated. Our results show that the DML-KNN method performed competitively, especially when using features extracted by the Tamura method for intensity estimation and by the Square Coefficient method for activity recognition.

17.
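Because high-level information such as phoneme-level features is entirely unaffected by the channel, it is regarded as a useful complement to low-level systems based on acoustic parameters; however, high-level information suffers from data sparsity. A recognition method based on phoneme feature supervectors is built, and its recognition performance is analyzed using the BUT phoneme-level speech recognizer; data pruning and KPCA mapping are then tried to improve its performance. The results show that pruning does not effectively improve recognition performance, whereas the recognition algorithm combined with KPCA mapping improves significantly. After further fusion with a mainstream GMM-UBM system, the EER drops from 8.4% (GMM-UBM alone) to 6.7%.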
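The equal error rate (EER) quoted above is the operating point at which the false-rejection and false-acceptance rates coincide. A minimal sketch of computing it from target and non-target trial scores, assuming higher scores mean "accept" and sweeping every observed score as a threshold (finer interpolation, as used in standard evaluation tools, is omitted):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping each observed score as the acceptance threshold.

    A trial is accepted when score >= threshold. Returns the mean of FRR and FAR
    at the threshold where they are closest.
    """
    tgt = np.asarray(target_scores, float)
    non = np.asarray(nontarget_scores, float)
    thresholds = np.unique(np.concatenate([tgt, non]))
    frr = np.array([(tgt < t).mean() for t in thresholds])   # targets wrongly rejected
    far = np.array([(non >= t).mean() for t in thresholds])  # non-targets wrongly accepted
    i = int(np.argmin(np.abs(frr - far)))
    return float((frr[i] + far[i]) / 2)
```

Score-level fusion of two systems (for example a weighted sum of their scores) can then be compared by running both the single-system and fused scores through the same function.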

18.
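Extreme learning machines (ELM) have been widely applied in pattern recognition because of their speed, efficiency and good generalization ability. However, existing ELM algorithms and their improved variants do not fully consider the effect of data dimensionality on ELM classification performance and generalization: when the dimensionality is too high, the redundant attributes and noise points it contains inevitably reduce the generalization ability of the ELM. To address this problem, this paper proposes an extreme learning machine based on manifold learning, which incorporates dimensionality reduction to effectively eliminate the influence of redundant attributes and noise on ELM classification performance. To verify the effectiveness of the proposed method, experiments were carried out on widely used image data; the results show that the proposed algorithm significantly improves the generalization performance of the ELM.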
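For reference, a basic ELM trains only its output layer: input weights and biases are drawn at random, and the output weights are the least-squares (Moore-Penrose pseudo-inverse) solution for the hidden-layer outputs. The sketch below assumes a tanh hidden layer and one-hot targets; the manifold-learning dimensionality reduction proposed in the paper is not included and would be applied to the inputs before `fit`.

```python
import numpy as np

class ELM:
    """Basic extreme learning machine: random hidden layer, least-squares output weights (sketch)."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        X = np.asarray(X, float)
        self.classes_, idx = np.unique(np.asarray(y), return_inverse=True)
        T = np.eye(len(self.classes_))[idx]  # one-hot targets
        # Input weights and biases are drawn once and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ T  # Moore-Penrose least-squares solution
        return self

    def predict(self, X):
        scores = self._hidden(np.asarray(X, float)) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]
```

Since only `beta` is solved for, training costs one pseudo-inverse, which is the source of ELM's speed; redundant or noisy input dimensions enter every hidden unit, which is the weakness the abstract targets.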

19.
Recognizing emotions in conversations is a challenging task due to the presence of contextual dependencies governed by self- and inter-personal influences. Recent approaches have focused on modeling these dependencies primarily via supervised learning. However, purely supervised strategies demand large amounts of annotated data, which are lacking in most of the available corpora for this task. To tackle this challenge, we look at transfer learning approaches as a viable alternative. Given the large amount of available conversational data, we investigate whether generative conversational models can be leveraged to transfer affective knowledge for detecting emotions in context. We propose an approach, TL-ERC, where we pre-train a hierarchical dialogue model on multi-turn conversations (source) and then transfer its parameters to a conversational emotion classifier (target). In addition to the popular practice of using pre-trained sentence encoders, our approach also incorporates recurrent parameters that model inter-sentential context across the whole conversation. Based on this idea, we perform several experiments across multiple datasets and find improvements in performance and robustness against limited training data. TL-ERC also achieves better validation performance in significantly fewer epochs. Overall, we infer that knowledge acquired from dialogue generators can indeed help recognize emotions in conversations.

20.
A system is described which automatically classifies terrain types from photography. Input is conventional panchromatic single-frame aerial photography of the earth. A flying-spot scanner converts this input to a time-varying video signal suitable for processing by the pattern recognition system. Processing consists of a series of analog and digital operations to arrive at a terrain classification based on spatial texture in the region of the input point. A learning strategy enables the system to refine its processing operations in order to improve classification performance with time. Test results are summarized.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号