1.

Discriminant-function-based minimum recognition error rate pattern-recognition approach to speech recognition

Wu Chou 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2000,88(8):1201-1223

A discriminant-function-based minimum recognition error rate pattern recognition approach is described and studied for various applications in speech processing. This approach departs from the conventional paradigm, which links a classification/recognition task to the problem of distribution estimation. Instead, it takes a discriminant-function-based statistical pattern recognition approach. The suitability of this approach for classification error rate minimization is established through a special loss function. It is meaningful even when the model correctness assumption is known to be invalid. We study the theoretical basis of this approach and compare it with various criteria used in speech recognition. We differentiate the method of classifier design by way of distribution estimation from the discriminant function methods of minimizing classification error rate, based on the fact that in many realistic applications, such as speech recognition, the true distribution form of the source is rarely known precisely, and without the model correctness assumption, the classical optimality theory of the distribution estimation approach cannot be applied directly. We discuss issues in this new classifier design paradigm and present various extensions of this approach to classifier design applications in speech processing.
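
The abstract does not spell out its loss function. As a rough illustration only, the sketch below follows the formulation commonly used in minimum classification error training: a misclassification measure compares the true-class discriminant score with a soft maximum over the competitors, and a sigmoid turns it into a smooth stand-in for the 0/1 error count. The function names and the parameters `eta` and `gamma` are assumptions, not taken from the paper.

```python
import math

def misclassification_measure(scores, true_class, eta=2.0):
    """Negative true-class score plus a soft maximum of the competing scores."""
    competitors = [g for j, g in enumerate(scores) if j != true_class]
    m = max(competitors)  # subtracted inside exp() for numerical stability
    # (1/eta) * log(mean(exp(eta * g_j))): approaches max(g_j) as eta grows.
    soft_max = m + math.log(
        sum(math.exp(eta * (g - m)) for g in competitors) / len(competitors)
    ) / eta
    return -scores[true_class] + soft_max

def smoothed_error_loss(d, gamma=1.0):
    """Sigmoid of d: below 0.5 for a correct decision, above 0.5 for an error."""
    return 1.0 / (1.0 + math.exp(-gamma * d))
```

A negative measure means the true class wins, so the loss falls below 0.5. Because the loss is differentiable in the scores, it can in principle be minimized by gradient descent on the classifier parameters.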

2.

Emotional speech recognition based on LS-SVM

周慧魏霖静《电子设计工程》2012,20(16):188-190

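An emotional speech recognition method based on LS-SVM is proposed. Parameters such as fundamental frequency, energy and speaking rate are first extracted from the experimental speech signals as emotional features; LS-SVM is then used to build models of the corresponding emotional speech signals and perform recognition. Experimental results show that LS-SVM achieves a relatively high recognition rate on basic emotions.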

3.

Research on end-to-end multilingual speech recognition

胡文轩王秋林李松洪青阳李琳《信号处理》2021,37(10):1816-1824

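End-to-end speech recognition models do not require a pronunciation lexicon for training, which greatly reduces the burden of developing speech recognition systems for new languages. Exploiting this advantage, this paper builds a language-independent end-to-end multilingual speech recognition system. The model is trained with character-based modeling, and a multilingual output symbol set is constructed to include every character that appears in the target languages. Training produces a single model whose network parameters are shared across all languages. On the 10-language dataset provided by the OLR challenge, the proposed multilingual system outperforms monolingual speech recognition systems on every language.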

4.

Improved discriminative training for generative model

WU Ya-hui GUO Jun LIU Gang 《中国邮电高校学报(英文版)》2009,16(3):126-130

This article proposes a model combination method to enhance the discriminability of the generative model. Generative and discriminative models have different optimization objectives, and each has its own advantages and drawbacks. The proposed method aims to strike a balance between the two. It extracts the discriminative parameter from the generative model and generates a new model based on a multi-model combination. The combination weight is determined by the ratio of the inter-class variance to the intra-class variance: the higher the ratio, the greater the weight and the more discriminative the model. Experiments on speech recognition demonstrate that the new model outperforms the model trained with the traditional generative method.
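
The weighting rule can be made concrete with a small sketch. It assumes, purely for illustration, that each model is summarized by one-dimensional scores grouped by class and that the weights are the variance ratios normalized to sum to one; the abstract does not give the exact statistics or normalization, and the function names are hypothetical.

```python
from statistics import fmean, pvariance

def variance_ratio(samples_by_class):
    """Inter-class variance of the class means over the mean intra-class variance."""
    class_means = [fmean(samples) for samples in samples_by_class]
    inter = pvariance(class_means)
    intra = fmean([pvariance(samples) for samples in samples_by_class])
    return inter / intra

def combination_weights(ratios):
    """Normalize the ratios so that a more discriminative model gets more weight."""
    total = sum(ratios)
    return [r / total for r in ratios]
```

For example, a model whose class means lie far apart relative to the spread within each class receives a larger share of the combination than one whose classes overlap.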

5.

Recognition and understanding of Chinese phrases using a feedback-based speech recognition and understanding scheme

傅秋良袁保宗《电子与信息学报》1998,20(2):194-198

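One task of a Chinese speech understanding system is to convert the Chinese monosyllables produced by a speech recognition system into the correct Chinese characters and words, and further into Chinese phrases and sentences, so that together with the speech recognition system it forms a complete speech-to-text conversion system. Using a closed-loop feedback scheme for Chinese speech recognition and understanding, this paper builds on Chinese word recognition and understanding to further achieve recognition and understanding of Chinese structural phrases, and obtains the expected results. Finally, the experimental results and the feedback-based scheme are discussed.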

6.

Blind Model Selection for Automatic Speech Recognition in Reverberant Environments

Laurent Couvreur Christophe Couvreur 《Journal of Signal Processing Systems》2004,36(2-3):189-203

This communication presents a new method for automatic speech recognition in reverberant environments. Our approach consists of selecting the best acoustic model from a library of models trained on artificially reverberated speech databases corresponding to various reverberant conditions. Given a speech utterance recorded within a reverberant room, a Maximum Likelihood estimate of the fullband room reverberation time is computed using a statistical model for short-term log-energy sequences of anechoic speech. The estimated reverberation time is then used to select the best acoustic model, i.e., the model trained on a speech database most closely matching the estimated reverberation time, which serves to recognize the reverberated speech utterance. The proposed model selection approach is shown to significantly improve recognition accuracy for a connected digit task in both simulated and real reverberant environments, outperforming standard channel normalization techniques.
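
The selection step itself is a nearest-match lookup, sketched below. The reverberation-time estimate is taken as a given input (the Maximum Likelihood estimator is not reproduced here), and the library contents, model names and function name are hypothetical.

```python
def select_acoustic_model(estimated_rt, model_library):
    """Return the model whose training reverberation time is closest to the estimate.

    model_library maps a model name to the reverberation time (in seconds)
    of the artificially reverberated data it was trained on.
    """
    return min(model_library, key=lambda name: abs(model_library[name] - estimated_rt))

# Hypothetical library covering a few reverberant conditions.
library = {"anechoic": 0.0, "rt_0.4s": 0.4, "rt_0.8s": 0.8, "rt_1.2s": 1.2}
best = select_acoustic_model(0.7, library)
```

With an estimate of 0.7 s, the lookup picks the model trained at 0.8 s.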

7.

On the use of different speech representations for speaker modeling

Ke Chen 《IEEE transactions on systems, man and cybernetics. Part C, Applications and reviews》2005,35(3):301-314

Numerous speech representations have been reported to be useful in speaker recognition. However, there is much less agreement on which speech representation provides a perfect representation of speaker-specific information conveyed in a speech signal. Unlike previous work, we propose an alternative approach to speaker modeling by the simultaneous use of different speech representations in an optimal way. Inspired by our previous empirical studies, we present a soft competition scheme on different speech representations to exploit different speech representations in encoding speaker-specific information. On the basis of this soft competition scheme, we present a parametric statistical model, generalized Gaussian mixture model (GGMM), to characterize a speaker identity based on different speech representations. Moreover, we develop an expectation-maximization algorithm for parameter estimation in the GGMM. The proposed speaker modeling approach has been applied to text-independent speaker recognition and comparative results on the KING speech corpus demonstrate its effectiveness.

8.

Audio-visual continuous speech recognition based on multi-stream multi-state dynamic Bayesian networks

吕国云蒋冬梅张艳宁赵荣椿 H Sahli Ilse Ravyse W Verhelst 《电子与信息学报》2008,30(12):2906-2911

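The asynchrony between speech and lip movement is a key problem in multimodal fusion for speech recognition. This paper first introduces a multi-stream asynchronous dynamic Bayesian network (MS-ADBN) model, which describes the asynchrony between the audio and video streams at the word level; both streams use a word-phoneme hierarchy. The multi-stream multi-state asynchronous DBN (MM-ADBN) model extends the MS-ADBN model, with both streams using a word-phoneme-state hierarchy. In essence, MS-ADBN is a whole-word model, whereas MM-ADBN is a phoneme model suited to large-vocabulary continuous speech recognition. Experimental results on a continuous audio-visual database show that, in clean speech conditions, MM-ADBN improves the recognition rate by 35.91% and 9.97% over the MS-ADBN model and the multi-stream HMM, respectively.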

9.

Feature classification criterion for missing features mask estimation in robust speaker recognition

Dayana Ribas González José Ramón Calvo de Lara 《Signal, Image and Video Processing》2014,8(2):365-375

Currently, many speaker recognition applications must handle speech corrupted by additive environmental noise without a priori knowledge of the noise characteristics. Some previous works in speaker recognition have used the missing feature (MF) approach to compensate for noise. In most of those applications, the spectral reliability decision step is performed using the signal to noise ratio (SNR) criterion, which attempts to directly measure the relative signal to noise energy at each frequency. An alternative approach to spectral data reliability has been used with some success in the MF approach to speech recognition. Here, we compare the use of this new criterion with the SNR criterion for MF mask estimation in speaker recognition. The new reliability decision is based on the extraction and analysis of several spectro-temporal features from across the entire speech frame, but not across time, which highlight the differences between spectral regions dominated by speech and by noise. We call it the feature classification (FC) criterion. It uses several spectral features to establish spectrogram reliability, unlike the SNR criterion, which relies on only one feature: SNR. We evaluated our proposal through speaker verification experiments on the Ahumada speech database corrupted by different types of noise at various SNR levels. Experiments demonstrated that the FC criterion achieves considerably better recognition accuracy than the SNR criterion in the speaker verification tasks tested.
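
For reference, the baseline SNR criterion can be sketched as a per-frequency threshold test. The sketch assumes that a noise power estimate is already available for each frequency bin, that the speech power is approximated by spectral subtraction, and that a bin counts as reliable when its local SNR exceeds 0 dB; these choices and the names are illustrative and are not taken from the paper.

```python
import math

def snr_mask(noisy_power, noise_power, threshold_db=0.0):
    """Mark each frequency bin reliable (True) when its estimated SNR exceeds the threshold."""
    floor = 1e-12  # avoids log(0) and division by zero
    mask = []
    for y, n in zip(noisy_power, noise_power):
        speech = max(y - n, floor)  # spectral-subtraction estimate of speech power
        snr_db = 10.0 * math.log10(speech / max(n, floor))
        mask.append(snr_db > threshold_db)
    return mask
```

The FC criterion described in the abstract replaces this single-feature test with a decision based on several spectral features per frame.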

10.

A Subspace Projection Approach for Analysis of Speech Under Stressed Condition

Sumitra Shukla S. Dandapat S. R. Mahadeva Prasanna 《Circuits, Systems, and Signal Processing》2016,35(12):4486-4500

In this paper, a novel subspace projection approach is proposed for the analysis of speech signals under stressed conditions. The subspace projection method is based on the assumption of orthogonality between the speech subspace and the stress subspace, which contain speech and stress information, respectively. Projecting stressed speech vectors onto the speech subspace separates the speech-specific information. In this work, the speech subspace consists of neutral speech vectors. Speech and stress recognition techniques are used to verify the orthogonal relation between the speech and stress subspaces. The evaluation database consists of a 119-word vocabulary under neutral, angry, sad and Lombard conditions. Hidden Markov models for speech and stress recognition are used with mel-frequency cepstral coefficient features to evaluate the estimated speech and stress information.

11.

A robust speech recognition method based on fuzzy rules

张军章熙春曹燕韦岗《电路与系统学报》2006,11(5):96-100

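Building on missing-data techniques and acoustic backing-off, this paper proposes a robust speech recognition method based on fuzzy rules. Fuzzy rules relating the reliability of each feature component to its probability distribution are first established from prior knowledge or assumptions; during recognition, the output probability of an observation vector is obtained from a rule-based fuzzy logic system. A concrete implementation is given for a cepstrum-based recognition system. Experimental results show that the proposed method performs significantly better than both missing-data techniques and acoustic backing-off.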

12.

Speech emotion recognition based on glottal wave and vocal tract features

李永伟陶建华李凯《信号处理》2023,39(4):632-638

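Speech emotion recognition is an indispensable part of natural human-computer interaction and an important component of artificial intelligence. The regulation of the articulatory organs causes differences in the acoustic features of emotional speech, which are then perceived as different emotions. Traditional speech emotion recognition classifies emotions only from acoustic or auditory features of the speech signal, ignoring the important role that production features such as the glottal wave and the vocal tract play in emotion perception. Our earlier work analyzed theoretically the important influence of the glottal wave and vocal tract shape on perceived emotion, but did not apply glottal wave and vocal tract features to speech emotion recognition. This paper therefore re-examines, from the perspective of speech production, the feasibility of using glottal wave and vocal tract features for speech emotion recognition, and proposes a recognition method based on glottal wave and vocal tract features derived from the source-filter model. First, the Liljencrants-Fant and Auto-Regressive eXogenous (ARX-LF) model is used to separate the glottal wave and vocal tract features of emotional speech from the speech signal; the separated features are then fed into a bidirectional gated recurrent unit (BiGRU) for emotion classification. Validation on the public emotion dataset IEMOCAP shows that glottal wave and vocal tract features distinguish emotions effectively and outperform some traditional features. By studying speech emotion recognition through the production-related glottal wave and vocal tract, this paper offers a new approach to speech emotion recognition technology.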

13.

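A method for improving recognition accuracy in web voice browsing using background knowledge (total citations: 7; self-citations: 0; citations by others: 7)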

李红莲王春花袁保宗《电子学报》2002,30(12):1836-1839

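Insufficient recognition accuracy has long been the bottleneck preventing wide adoption of speech technology, and making full use of background knowledge in a specific application is an effective way to address it. In web voice browsing, the user's spoken input is one element of a finite set. Exploiting this property, this paper first defines a similarity measure between text strings and then uses it to post-process the output of the recognition engine, yielding more accurate recognition results. Experimental results show that with this method the recognition accuracy exceeds 95%, providing strong support for truly voice-driven web access.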
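
The post-processing idea can be sketched as snapping the recognizer output to the most similar entry in the finite candidate set. The abstract does not give the paper's own similarity definition, so the sketch substitutes the ratio from Python's standard `difflib.SequenceMatcher`; the candidate list and function names are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in string similarity in [0, 1]; the paper defines its own measure."""
    return SequenceMatcher(None, a, b).ratio()

def correct_recognition(recognized, candidates):
    """Replace the raw recognizer output with the most similar allowed input."""
    return max(candidates, key=lambda c: similarity(recognized, c))

# Hypothetical link texts on the page currently being browsed.
links = ["news", "sports", "weather", "finance"]
fixed = correct_recognition("whether", links)
```

Here the misrecognized string "whether" is mapped back to the link text "weather", the closest member of the allowed set.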

14.

Talking to Machines: Introducing Robot Perception to Resolve Speech Recognition Uncertainties (total citations: 1; self-citations: 0; citations by others: 1)

Stanislao Lauria 《Circuits, Systems, and Signal Processing》2007,26(4):513-526

The use of spontaneous speech as a form of communication between humans and robots is a potential solution for more efficient human-robot interactions. Accuracy is one of the main problems associated with the automatic speech recognition (ASR) component of human-robot interactive systems. The standard ASR approach is based on statistical methods applied to phoneme domains. However, some problems cannot be solved with the rule-based approaches used so far; therefore, alternative strategies could be the solution. The aim of this paper is to investigate some aspects related to the use of a robot's perceptive abilities to increase the robustness of ASR components. The robot evaluative abilities are used to incrementally build knowledge that will be used during the recognition phase. This paper covers aspects concerning the use of time-warping algorithms to improve the speech recognition performance. In particular, aspects related to the accuracy and efficiency of this approach when applied to whole-sentence speech signals are discussed.
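
The abstract does not say which time-warping algorithm is used. As a point of reference, the classic dynamic time warping (DTW) distance between two one-dimensional feature sequences can be sketched as follows; the absolute-difference local cost and the function name are assumptions, and real systems would compare frame-level feature vectors instead of scalars.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b stays on the same frame
                cost[i][j - 1],      # b advances while a stays on the same frame
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]
```

The table has one cell per pair of frames, so time and memory grow with the product of the two sequence lengths, which is why efficiency becomes a concern for whole-sentence signals.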

15.

Further results on the information theory of deterministic functions and its application to pattern recognition

Guy Jumarie 《电信纪事》1990,45(1-2):66-88

The author recently derived Shannon entropy and Renyi entropy for deterministic maps (different from the concepts used by physicists in the study of deterministic chaos); the purpose here is to extend the theory and to outline its prospects in pattern recognition. Entropies of random and of distributed functions are defined, and then entropic variance, divergence, mean square divergence and cross-entropic variance are obtained in a meaningful way. The results so obtained are used to derive identification criteria for pattern recognition, and an approach involving the local maximization of a multi-model landscape function is suggested. Basically, we are thus dealing with an information theory that does not explicitly refer to probability.

16.

Statistical-model-based speech enhancement systems (total citations: 3; self-citations: 0; citations by others: 3)

Ephraim Y. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》1992,80(10):1526-1555

Since the statistics of the speech signal as well as of the noise are not explicitly available, and the most perceptually meaningful distortion measure is not known, model-based approaches have recently been extensively studied and applied to the three basic problems of speech enhancement: signal estimation from a given sample function of noisy speech, signal coding when only noisy speech is available, and recognition of noisy speech signals in man-machine communication. Research on the model-based approach is integrated and put into perspective with other more traditional approaches for speech enhancement. A unified statistical approach for the three basic problems of speech enhancement is developed, using composite source models for the signal and noise and a fairly large set of distortion measures.

17.

Improved Phoneme-Based Myoelectric Speech Recognition

Quan Zhou Ning Jiang Englehart K. Hudgins B. 《IEEE transactions on bio-medical engineering》2009,56(8):2016-2023

This paper introduces an enhanced phoneme-based myoelectric signal (MES) speech recognition system. The system can recognize new words without retraining the phoneme classifier, which is considered to be the main advantage of phoneme-based speech recognition. It is shown that previous systems experience severe performance degradation when new words are added to a testing dataset. To maintain high accuracy with new words, several improvements are proposed. In the proposed MES speech recognition approach, the raw MES is processed by class-specific rotation matrices to spatially decorrelate the data prior to feature extraction in a preprocessing stage. Then, an uncorrelated linear discriminant analysis is used for dimensionality reduction. The resulting data are classified through a hidden Markov model classifier to obtain the phonemic log likelihoods of the phonemes, which are mapped to corresponding words using a word classifier. An average word classification accuracy of 98.533% is achieved over six subjects. The system offers dramatically improved accuracy when expanding a vocabulary, offering promise for robust large-vocabulary myoelectric speech recognition.

18.

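A new method for sub-component analysis and effectiveness evaluation of speech feature parameters (total citations: 2; self-citations: 0; citations by others: 2)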

俞一彪许允喜芮贤义《信号处理》2007,23(2):188-191

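Speech signals carry two major kinds of features, semantic content and speaker individuality, and extracting and enhancing them effectively is very important for speech recognition and speaker recognition. This paper proposes the 4S method for sub-component analysis and effectiveness evaluation of the semantic and speaker-individuality features in speech feature parameters; it analyzes the proportion of semantic and individuality components and uses quantitative indicators to judge how effective a feature parameter is for speech recognition and speaker recognition. Applying the 4S method to the commonly used feature parameters LPC, LPCC and MFCC shows that all of them contain more semantic feature information: the component ratio factor LIR between semantic features and speaker-individuality features is 1.30, 1.44 and 1.61, respectively, and the effectiveness of the three parameters for both speech recognition and speaker recognition increases in that order.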

19.

A Bayesian classification approach with application to speech recognition

Merhav N. Ephraim Y. 《Signal Processing, IEEE Transactions on》1991,39(10):2157-2166

A Bayesian approach to classification of parametric information sources whose statistics are not explicitly given is studied and applied to recognition of speech signals based upon Markov modeling. A classifier based on generalized likelihood ratios, which depends only on the available training and testing data, is developed and shown to be optimal in the sense of achieving the highest asymptotic exponential rate of decay of the error probability. The proposed approach is compared to the standard classification approach used in speech recognition, in which the parameters for the sources are first estimated from the given training data, and then the maximum a posteriori decision rule is applied using the estimated statistics.

20.

Neural networks for statistical recognition of continuous speech (total citations: 4; self-citations: 0; citations by others: 4)

Morgan N. Bourlard H.A. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》1995,83(5):742-772

In recent years there has been a significant body of work, both theoretical and experimental, that has established the viability of artificial neural networks (ANN's) as a useful technology for speech recognition. It has been shown that neural networks can be used to augment speech recognizers whose underlying structure is essentially that of hidden Markov models (HMM's). In particular, we have demonstrated that fairly simple layered structures, which we lately have termed big dumb neural networks (BDNN's), can be discriminatively trained to estimate emission probabilities for an HMM. Recently, simple speech recognition systems (using context-independent phone models) based on this approach have been shown in controlled tests to be both effective in terms of accuracy (i.e., comparable to or better than equivalent state-of-the-art systems) and efficient in terms of CPU and memory run-time requirements. Research is continuing on extending these results to somewhat more complex systems. In this paper, we first give a brief overview of automatic speech recognition (ASR) and statistical pattern recognition in general. We also include a very brief review of HMM's, and then describe the use of ANN's as statistical estimators. We then review the basic principles of our hybrid HMM/ANN approach and describe some experiments. We discuss some current research topics, including new theoretical developments in training ANN's to maximize the posterior probabilities of the correct models for speech utterances. We also discuss some issues of system resources required for training and recognition. Finally, we conclude with some perspectives about fundamental limitations in the current technology and some speculations about where we can go from here.