Similar Literature (20 matching records)
1.
Spectro-temporal representation of speech has become one of the leading signal representation approaches in speech recognition systems in recent years. This representation suffers from the high dimensionality of the feature space, which makes it unsuitable for practical speech recognition systems. In this paper, a new clustering-based method is proposed for secondary feature selection/extraction in the spectro-temporal domain. In the proposed representation, Gaussian mixture models (GMM) and weighted K-means (WKM) clustering techniques are applied to the spectro-temporal domain to reduce the dimensionality of the feature space. The elements of the centroid vectors and covariance matrices of the clusters are taken as the attributes of each frame's secondary feature vector. To evaluate the efficiency of the proposed approach, the new feature vectors were tested on classification of phonemes in the main phoneme categories of the TIMIT database. Employing the proposed secondary feature vector yielded a significant improvement in the classification rate of different sets of phonemes compared with MFCC features. The average improvement in the classification rate of voiced plosives over MFCC features is 5.9% using WKM clustering and 6.4% using GMM clustering. The greatest improvement, about 7.4%, is obtained with WKM clustering in the classification of front vowels.
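As an illustration of the secondary-feature idea described above, here is a minimal Python sketch of the weighted K-means variant: each frame's spectro-temporal energy map is clustered with the energies as sample weights, and the centroids plus covariance entries form the compact feature vector. The map size, cluster count and toy data are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def secondary_features(st_energy, n_clusters=3):
    """Reduce one frame's spectro-temporal energy map to a compact vector.
    st_energy : 2-D array (n_scales, n_rates) of non-negative energies."""
    scales, rates = np.indices(st_energy.shape)
    points = np.column_stack([scales.ravel(), rates.ravel()]).astype(float)
    weights = st_energy.ravel() + 1e-12            # energies as sample weights

    km = KMeans(n_clusters=n_clusters, n_init=5, random_state=0)
    labels = km.fit_predict(points, sample_weight=weights)

    feats = [km.cluster_centers_.ravel()]
    for k in range(n_clusters):                    # weighted covariance per cluster
        m = labels == k
        cov = (np.cov(points[m].T, aweights=weights[m])
               if m.sum() > 1 else np.zeros((2, 2)))
        feats.append(cov[np.triu_indices(2)])      # unique covariance entries
    return np.concatenate(feats)

frame = np.abs(np.random.randn(8, 10))             # toy spectro-temporal map
print(secondary_features(frame).shape)             # (6 + 3*3,) = (15,)
```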

2.
This paper investigates the enhancement of a speech recognition system that uses both audio and visual speech information in noisy environments, presenting contributions in the two main system stages: front-end and back-end. The double use of Gabor filters is proposed as a feature extractor in the front-end stage of both modules to capture robust spectro-temporal features. The performance obtained from the resulting Gabor Audio Features (GAFs) and Gabor Visual Features (GVFs) is compared to that of conventional features such as MFCC, PLP and RASTA-PLP audio features and DCT2 visual features. The experimental results show that a system utilizing GAFs and GVFs performs better, especially in low-SNR scenarios. To improve the back-end stage, a complete framework of synchronous Multi-Stream Hidden Markov Models (MSHMM) is used to solve the dynamic stream-weight estimation problem for Audio-Visual Speech Recognition (AVSR). To demonstrate the usefulness of dynamic weighting for the overall performance of the AVSR system, we empirically show the preference for Late Integration (LI) over Early Integration (EI), especially when one of the modalities is corrupted. The results confirm the superior recognition accuracy of the AVSR system with Late Integration at all SNR levels.
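A rough sketch of how a 2-D Gabor filter bank can be applied to a log-mel spectrogram to obtain spectro-temporal features of the GAF kind. The filter size and modulation frequencies are assumptions for illustration, not the parameters used in the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(omega_k, omega_n, size=(9, 9)):
    """2-D Gabor filter: complex carrier under a Hann envelope.
    omega_k: spectral modulation (rad/channel); omega_n: temporal (rad/frame)."""
    k = np.arange(size[0]) - size[0] // 2
    n = np.arange(size[1]) - size[1] // 2
    K, N = np.meshgrid(k, n, indexing="ij")
    env = np.outer(np.hanning(size[0]), np.hanning(size[1]))
    return env * np.exp(1j * (omega_k * K + omega_n * N))

def gabor_features(log_mel, omegas_k=(0.25, 0.5), omegas_n=(0.12, 0.25)):
    """Stack magnitudes of the filtered spectrogram for a small filter bank."""
    maps = [np.abs(convolve2d(log_mel, gabor_kernel(wk, wn), mode="same"))
            for wk in omegas_k for wn in omegas_n]
    return np.stack(maps)           # (n_filters, n_mels, n_frames)

spec = np.random.rand(40, 100)      # stand-in for a 40-band log-mel spectrogram
print(gabor_features(spec).shape)   # (4, 40, 100)
```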

3.
This paper presents the design and development of an unrestricted text-to-speech synthesis (TTS) system for the Bengali language. An unrestricted TTS system is capable of synthesizing good-quality speech across different domains. In this work, syllables are used as the basic units for synthesis, and the Festival framework is used for building the TTS system. Speech collected from a female artist serves as the speech corpus. Initially, speech is collected from five speakers and a prototype TTS is built for each of them. The best of the five speakers is selected through subjective and objective evaluation of natural and synthesized waveforms. The unrestricted TTS is then developed by addressing the issues involved at each stage to produce a good-quality synthesizer. Evaluation is carried out in four stages by conducting objective and subjective listening tests on the synthesized speech. At the first stage, the TTS system is built with the basic Festival framework; in the following stages, additional features are incorporated into the system and the quality of synthesis is evaluated. The subjective and objective measures indicate that the proposed features and methods improve the quality of the synthesized speech from stage 2 to stage 4.

4.
Several algorithms have been proposed to characterize the spectro-temporal tuning properties of auditory neurons during the presentation of natural stimuli. Algorithms designed to work at realistic signal-to-noise levels must make some prior assumptions about tuning in order to produce accurate fits, and these priors can bias the estimates of tuning. We compare a new, computationally efficient algorithm for estimating tuning properties, boosting, to a more commonly used algorithm, normalized reverse correlation. These algorithms employ the same functional model and cost function, differing only in their priors. We use both algorithms to estimate the spectro-temporal tuning properties of neurons in primary auditory cortex during the presentation of continuous human speech. Models estimated using either algorithm have similar predictive power, although fits by boosting are slightly more accurate. More strikingly, neurons characterized with boosting appear tuned to narrower spectral bandwidths and higher temporal modulation rates than when characterized with normalized reverse correlation. These differences have little impact on responses to speech, which is spectrally broadband and modulated at low rates. However, we find that models estimated by boosting also predict responses to non-speech stimuli more accurately. These findings highlight the crucial role of priors in characterizing neuronal response properties with natural stimuli.
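Normalized reverse correlation amounts to solving ridge-regularized normal equations on a time-lagged stimulus matrix. A compact sketch follows; the lag count and ridge weight are arbitrary choices for illustration.

```python
import numpy as np

def strf_reverse_correlation(stim, resp, n_lags=20, ridge=1.0):
    """stim: (n_freq, n_time) spectrogram; resp: (n_time,) firing rate.
    Builds a lagged design matrix and solves regularized normal equations."""
    n_freq, n_time = stim.shape
    X = np.zeros((n_time - n_lags, n_freq * n_lags))
    for t in range(n_lags, n_time):
        X[t - n_lags] = stim[:, t - n_lags:t].ravel()   # recent stimulus history
    y = resp[n_lags:]
    # reverse correlation = cross-correlation whitened by stimulus autocorrelation
    XtX = X.T @ X + ridge * np.eye(X.shape[1])
    w = np.linalg.solve(XtX, X.T @ y)
    return w.reshape(n_freq, n_lags)

stim, resp = np.random.randn(16, 500), np.random.randn(500)   # toy data
print(strf_reverse_correlation(stim, resp).shape)             # (16, 20)
```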

5.
A speech pre-processing algorithm is presented that improves the speech intelligibility in noise for the near-end listener. The algorithm improves intelligibility by optimally redistributing the speech energy over time and frequency according to a perceptual distortion measure, which is based on a spectro-temporal auditory model. Since this auditory model takes into account short-time information, transients will receive more amplification than stationary vowels, which has been shown to be beneficial for intelligibility of speech in noise. The proposed method is compared to unprocessed speech and two reference methods using an intelligibility listening test. Results show that the proposed method leads to significant intelligibility gains while still preserving quality. Although one of the methods used as a reference obtained higher intelligibility gains, this happened at the cost of decreased quality. Matlab code is provided.
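The paper optimizes a spectro-temporal perceptual distortion measure; as a much cruder stand-in, the sketch below only illustrates the general idea of redistributing speech energy towards poorly audible time-frequency cells under a fixed total-energy constraint. The gain rule here is an assumption made for illustration and is not the paper's algorithm.

```python
import numpy as np

def redistribute_energy(S, N, floor=1e-8):
    """Toy near-end intelligibility pre-processor (illustration only).
    S, N: magnitude spectrograms of speech and noise (n_bins, n_frames).
    Boosts cells with poor local SNR, then rescales so total energy is unchanged."""
    snr = (S**2 + floor) / (N**2 + floor)
    gain = snr ** -0.25                                # mild boost where SNR is low
    S_out = gain * S
    scale = np.sqrt((S**2).sum() / (S_out**2).sum())   # global energy constraint
    return scale * S_out

S, N = np.random.rand(257, 100), np.random.rand(257, 100)
print(np.allclose((redistribute_energy(S, N)**2).sum(), (S**2).sum()))  # True
```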

6.
To improve the robustness of speech recognition systems, an acoustic feature extraction method based on the spectro-temporal Gabor filter bank (GBFB) is proposed, and a block PCA algorithm is used to reduce the dimensionality of the high-dimensional GBFB features. The noise robustness of the GBFB features is then compared with that of commonly used features such as GFCC, MFCC and LPCC under identical noise conditions: the recognition rate of the GBFB features is 5.35% higher than with GFCC and 7.05% higher than with MFCC, and their recognition baseline is 9 dB lower than that of the LPCC features. The experimental results show that, in noisy environments, GBFB features are more robust than the traditional GFCC, MFCC and LPCC features.
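Block PCA as used here simply partitions the high-dimensional GBFB vector into sub-blocks and applies PCA per block, which is far cheaper than one PCA over the full vector. A minimal sketch; block count, component count and toy dimensions are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def block_pca(X, n_blocks=8, n_components=8):
    """X: (n_frames, n_dims) high-dimensional feature matrix.
    Splits the dimensions into contiguous blocks and runs PCA per block."""
    blocks = np.array_split(np.arange(X.shape[1]), n_blocks)
    outs = [PCA(n_components=min(n_components, len(b))).fit_transform(X[:, b])
            for b in blocks]
    return np.hstack(outs)        # concatenated per-block projections

X = np.random.randn(500, 1024)    # stand-in for GBFB features
print(block_pca(X).shape)         # (500, 64)
```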

7.
8.
Design, development, and maintenance of firewall ACLs are hard and error-prone tasks. Two of the reasons for these difficulties are, on the one hand, the big gap between the access control requirements and the complex, heterogeneous firewall platforms and languages and, on the other hand, the absence of ACL design, development and maintenance environments that integrate inconsistency and redundancy diagnosis. The use of modelling languages surely helps but, although several have been proposed, none has been widely adopted by industry, due to a combination of factors: high complexity, unsupported important firewall features, no integrated model validation stages, etc. In this paper, CONFIDDENT, a model-driven design, development and maintenance framework for layer-3 firewall ACLs, is proposed. The framework includes different modelling stages at different abstraction levels; in this way, inexperienced administrators can use more abstract models while experienced ones can refine them to include platform-specific features. CONFIDDENT includes model diagnosis stages where administrators can check their models for inconsistencies and redundancies before the automatic generation of the ACL for any of the many market-leading firewall platforms currently supported.
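CONFIDDENT's diagnosis stages are model-driven, but the underlying inconsistency and redundancy checks can be illustrated on plain first-match rule lists. A toy sketch of shadowing and redundancy detection; the rule tuple layout is a made-up simplification, not CONFIDDENT's model.

```python
from ipaddress import ip_network

# each rule: (action, src_net, dst_net, proto, (port_lo, port_hi))
def covers(a, b):
    """True if rule a matches every packet that rule b matches."""
    (_, sa, da, pa, ra), (_, sb, db, pb, rb) = a, b
    return (ip_network(sb).subnet_of(ip_network(sa))
            and ip_network(db).subnet_of(ip_network(da))
            and pa in (pb, "any")
            and ra[0] <= rb[0] and rb[1] <= ra[1])

def diagnose(acl):
    """Flag shadowed (earlier covering rule, opposite action) and redundant
    (earlier covering rule, same action) rules in a first-match ACL."""
    for i, r in enumerate(acl):
        for j in range(i):
            if covers(acl[j], r):
                kind = "redundant" if acl[j][0] == r[0] else "shadowed"
                print(f"rule {i} is {kind} by rule {j}")

acl = [("deny",  "10.0.0.0/8",  "0.0.0.0/0", "tcp", (0, 65535)),
       ("allow", "10.1.0.0/16", "0.0.0.0/0", "tcp", (80, 80))]
diagnose(acl)   # rule 1 is shadowed by rule 0
```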

9.
Vocal effort-robust speech recognition based on articulatory features
晁浩, 宋成, 彭维平. 《计算机应用》 2015, 35(1): 257-261
To address the robustness of speech recognition under varying vocal effort (VE), a speech recognition algorithm based on a multi-model framework is proposed. First, the acoustic characteristics of speech under different vocal-effort modes and the impact of vocal-effort variation on recognition accuracy are analyzed. A Gaussian mixture model (GMM) based vocal-effort mode detector is then proposed. Finally, according to the detection result, a dedicated acoustic model is trained for whispered-speech recognition, while articulatory features are used together with conventional spectral features for the remaining four vocal-effort modes. Isolated-word recognition experiments show a clear improvement in accuracy: compared with the baseline system, the average word error rate over the five vocal-effort modes is reduced by 26.69%; compared with acoustic models trained on pooled multi-mode data, it is reduced by 14.51%; and compared with maximum likelihood linear regression (MLLR) adaptation, by 15.30%. The results indicate that articulatory features are more robust to vocal-effort variation than conventional spectral features, and that the multi-model framework is an effective solution to vocal effort-related robustness in speech recognition.
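A minimal sketch of the GMM-based vocal-effort detection step: one GMM per mode is trained on that mode's frames, and a test utterance is assigned to the mode with the highest log-likelihood. The mode names, feature dimension and mixture count are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

MODES = ["whisper", "soft", "normal", "loud", "shout"]

def train_ve_detectors(features_by_mode, n_components=4):
    """Fit one GMM per vocal-effort mode on its training frames (e.g. MFCCs)."""
    return {m: GaussianMixture(n_components, covariance_type="diag",
                               random_state=0).fit(X)
            for m, X in features_by_mode.items()}

def detect_mode(gmms, utterance):
    """Pick the mode whose GMM gives the highest average log-likelihood."""
    return max(gmms, key=lambda m: gmms[m].score(utterance))

# toy training data: one Gaussian blob per mode
train = {m: np.random.randn(200, 13) + i for i, m in enumerate(MODES)}
gmms = train_ve_detectors(train)
print(detect_mode(gmms, np.random.randn(50, 13) + 3))   # likely "loud"
```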

10.
Models of auditory processing, particularly of speech, face many difficulties. These difficulties include variability among speakers, variability in speech rate and robustness to moderate distortions such as time compression. In contrast to the 'invariance of percept' (across different speakers, of different sexes, using different intonation, and so on) is the observation that we are sensitive to the identity, sex and intonation of the speaker. In previous work we have reported that a model based on ensembles of spectro-temporal feature detectors, derived from onset sensitive pre-processing of a limited class of stimuli, preserves significant information about the stimulus class. We have also shown that this is robust with respect to the exact choice of feature set, moderate time compression in the stimulus and speaker variation. Here we extend these results to show a) that by using a classifier based on a network of spiking neurons with spike-driven plasticity, the output of the ensemble constitutes an effective rate coding representation of complex sounds; and b) that the same set of spectro-temporal features concurrently preserve information about a range of qualitatively different classes into which the stimulus might fall. We show that it is possible for multiple views of the same pattern of responses to generate different percepts. This is consistent with suggestions that multiple parallel processes exist within the auditory 'what' pathway with attentional modulation enhancing the task-relevant classification type. We also show that the responses of the ensemble are sparse in the sense that a small number of features respond for each stimulus type. This has implications for the ensembles' ability to generalise, and to respond differentially to a wide variety of stimulus classes.

11.
A framework is proposed for synchronization in feature-based data embedding systems that is tolerant of errors in estimated features. The method combines feature-based embedding with codes capable of simultaneous synchronization and error correction, thereby allowing recovery from both desynchronization caused by feature estimation discrepancies between the embedder and receiver, and alterations in estimated symbols arising from other channel perturbations. A speech watermark is presented that constitutes a realization of the framework for 1-D signals. The speech watermark employs pitch modification for data embedding and Davey and Mackay's insertion, deletion, and substitution (IDS) codes for synchronization and error recovery. Experimental results demonstrate that the system indeed allows watermark data recovery, despite feature desynchronization. The performance of the speech watermark is optimized by estimating the channel parameters required for the IDS decoding at the receiver via the expectation-maximization algorithm. In addition, acceptable watermark power levels (i.e., the range of pitch modification that is perceptually tolerable) are determined from psychophysical tests. The proposed watermark demonstrates robustness to low-bit-rate speech coding channels (Global System for Mobile Communications at 13 kb/s and AMR at 5.1 kb/s), which have posed a serious challenge for prior speech watermarks. Thus, the watermark presented in this paper not only highlights the utility of the proposed framework but also represents a significant advance in speech watermarking. Issues in extending the proposed framework to 2-D and 3-D signals and different application scenarios are identified.
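The watermark embeds data by modifying pitch; as a loose illustration of that idea (not the paper's scheme, and omitting the IDS synchronization codes entirely), here is a quantization-index-modulation sketch on a pitch track, where even quantization cells carry 0 and odd cells carry 1.

```python
import numpy as np

def embed_bits(pitch, bits, step=4.0):
    """Quantization-index modulation on a pitch track (Hz); 0 marks unvoiced."""
    out = pitch.copy()
    voiced = np.flatnonzero(pitch > 0)[:len(bits)]
    for i, b in zip(voiced, bits):
        q = np.floor(pitch[i] / step)
        if int(q) % 2 != b:
            q += 1                       # move to a cell of the right parity
        out[i] = (q + 0.5) * step        # centre of the selected cell
    return out

def extract_bits(pitch, n, step=4.0):
    voiced = np.flatnonzero(pitch > 0)[:n]
    return [int(np.floor(pitch[i] / step)) % 2 for i in voiced]

track = np.array([0, 120.3, 118.9, 0, 121.7, 119.2])   # toy pitch contour
marked = embed_bits(track, [1, 0, 1, 1])
print(extract_bits(marked, 4))                         # [1, 0, 1, 1]
```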

12.

Emotion recognition from speech signals is an interesting research area with several applications, such as smart healthcare, autonomous voice-response systems, assessing situational seriousness through caller affective-state analysis in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square kernels and pooling operators at the various layers, which are suited to 2D image data. In spectrograms, however, the information is encoded in a slightly different manner: time is represented along the x-axis, the y-axis shows the frequency of the speech signal, and amplitude is indicated by the intensity value at a particular position. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling in rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and performs better than many state-of-the-art techniques when evaluated on the Emo-DB and a Korean speech dataset.

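A minimal PyTorch sketch of the rectangular-kernel idea: convolutions that are long in time or tall in frequency, with rectangular max pooling. Layer sizes and class count are placeholders, not the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class RectCNN(nn.Module):
    """Toy emotion classifier over (1, freq, time) spectrogram inputs."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(3, 9), padding=(1, 4)),  # long in time
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 4)),                      # rectangular pool
            nn.Conv2d(16, 32, kernel_size=(9, 3), padding=(4, 1)), # tall in frequency
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.fc(self.net(x).flatten(1))

logits = RectCNN()(torch.randn(8, 1, 128, 256))   # batch of 8 spectrograms
print(logits.shape)                               # torch.Size([8, 7])
```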

13.
We describe an architecture that gives a robot the capability to recognize speech by cancelling ego noise, even while the robot is moving. The system consists of three blocks: (1) a multi-channel noise reduction block, comprising consecutive stages of microphone-array-based sound localization, geometric source separation and post-filtering; (2) a single-channel noise reduction block utilizing template subtraction; and (3) an automatic speech recognition block. In this work, we specifically investigate a missing feature theory-based automatic speech recognition (MFT-ASR) approach in block (3). This approach makes use of spectro-temporal elements derived from (1) and (2) to measure the reliability of the acoustic features, and generates masks to filter out unreliable acoustic features. We evaluated this system on a robot using word correct rates, and present a detailed analysis of recognition accuracy to determine optimal parameters. The proposed MFT-ASR approach resulted in significantly higher recognition performance than single- or multi-channel noise reduction methods alone.
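A toy sketch of the mask-generation idea in MFT-ASR: spectro-temporal cells whose energy was largely removed by the noise-reduction front end are marked unreliable. The reliability measure and threshold here are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def reliability_mask(enhanced, noisy, theta=0.7):
    """Binary spectro-temporal mask for missing-feature decoding (illustrative).
    Cells where the front end removed most of the energy are judged unreliable."""
    ratio = enhanced / (noisy + 1e-12)       # fraction of energy kept per cell
    return (ratio > theta).astype(float)     # 1 = reliable, 0 = unreliable

noisy = np.random.rand(40, 100)              # power spectrogram before separation
enhanced = noisy * np.random.rand(40, 100)   # after separation + post-filtering
mask = reliability_mask(enhanced, noisy)
print(mask.mean())                           # fraction of cells kept for decoding
```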

14.
Recently, there has been a significant increase in research interest in biologically inspired systems which, in the context of speech communications, attempt to learn from human auditory perception and cognition capabilities so as to derive knowledge and benefits currently unavailable in practice. One particular pursuit is to understand why the human auditory system generally performs with much more robustness than an engineered system, say a state-of-the-art automatic speech recognizer. In this study, we adopt a computational model of the mammalian central auditory system and develop a methodology to analyze and interpret its behavior for an enhanced understanding of its end product, which is a data-redundant, dimension-expanded representation of neural firing rates in the primary auditory cortex (A1). Our first approach is to reinterpret the well-known Mel-frequency cepstral coefficients (MFCCs) in the context of the auditory model. We then present a framework for interpreting the cortical response as a place-coding of speech information, and identify some key advantages of the model's dimension expansion. The framework consists of a model of "source"-invariance that predicts how speech information is encoded in a class-dependent manner, and a model of "environment"-invariance that predicts the noise robustness of class-dependent signal-respondent neurons. The validity of these ideas is experimentally assessed within an existing recognition framework by selecting features that demonstrate these effects and applying them in a conventional phoneme classification task. The results are discussed quantitatively and qualitatively, and our insights inspire future research on category-dependent features and speech classification using the auditory model.
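Since the study reinterprets MFCCs within the auditory model, a compact textbook MFCC computation is a useful reference point. The sketch below follows the standard pipeline (window, power spectrum, mel filterbank, log, DCT) with assumed parameter values; it is not the paper's implementation.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frame, sr=16000, n_mels=26, n_ceps=13, n_fft=512):
    """Textbook MFCCs for a single frame (no deltas, no liftering)."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # triangular mel filterbank
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return dct(np.log(fb @ power + 1e-10), norm="ortho")[:n_ceps]

print(mfcc(np.random.randn(400)).shape)   # (13,)
```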

15.
Over the past several years, considerable attention has been focused on the coding and enhancement of speech signals. This interest has progressed towards the development of new techniques capable of producing good-quality speech at the output. Speech coding is the process of converting human speech into an efficient encoded representation that can be decoded to produce a close approximation of the original signal. This paper deals with the problem of speech coding. It proposes a novel approach called Best Tree Encoding (BTE) to encode the wavelet packet best tree structure into a vector of four elements, and applies BTE to the related problems of speech compression and synthesis. Tree-node data coefficients are encoded using LPC filters and trigonometric features. The encoded vector consists of the 4 elements from the BTE analysis as well as an LPC and trigonometric vector for each leaf node. The reproduced speech is evaluated for both intelligibility and quality; quality is measured in terms of signal-to-noise ratio, log-likelihood ratio, and spectral distortion.
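The best-tree selection that BTE encodes can be sketched with PyWavelets: decompose to a fixed depth and keep a parent node whenever its entropy cost is below the summed cost of its children. This shows only the best-basis step, not BTE's 4-element encoding or the per-leaf LPC/trigonometric features; the wavelet, depth and cost function are assumptions.

```python
import numpy as np
import pywt

def cost(c):
    """Coifman-Wickerhauser entropy cost (additive for orthogonal wavelets)."""
    c2 = np.asarray(c) ** 2
    return -np.sum(c2 * np.log(c2 + 1e-12))

def best_tree_leaves(signal, wavelet="db4", maxlevel=3):
    """Return the node paths kept as leaves of the best tree."""
    wp = pywt.WaveletPacket(signal, wavelet, maxlevel=maxlevel)

    def prune(node):
        if node.level == maxlevel:
            return [node.path], cost(node.data)
        kids = [prune(wp[node.path + c]) for c in "ad"]
        child_paths = kids[0][0] + kids[1][0]
        child_cost = kids[0][1] + kids[1][1]
        own = cost(node.data)
        return ([node.path], own) if own <= child_cost else (child_paths, child_cost)

    return prune(wp)[0]        # wp itself is the root node (path "")

print(best_tree_leaves(np.random.randn(1024)))
```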

16.
Speech translation is a technology that helps people communicate across different languages. The most commonly used speech translation model is composed of automatic speech recognition, machine translation and text-to-speech synthesis components, which share information only at the text level. However, spoken communication is different from written communication in that it uses rich acoustic cues such as prosody in order to transmit more information through non-verbal channels. This paper is concerned with speech-to-speech translation that is sensitive to this paralinguistic information. Our long-term goal is to make a system that allows users to speak a foreign language with the same expressiveness as if they were speaking in their own language. Our method works by reconstructing input acoustic features in the target language. From the many different possible paralinguistic features to handle, in this paper we choose duration and power as a first step, proposing a method that can translate these features from input speech to the output speech in continuous space. This is done in a simple and language-independent fashion by training an end-to-end model that maps source-language duration and power information into the target language. Two approaches are investigated: linear regression and neural network models. We evaluate the proposed methods and show that paralinguistic information in the input speech of the source language can be reflected in the output speech of the target language.
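Of the two approaches investigated, the linear-regression one is easy to sketch: fit a regressor from source-language (duration, power) pairs to the target language on parallel data. The toy data below stands in for the word-aligned prosody features the paper extracts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy parallel data: per-word (duration in s, power in dB) in source and target
rng = np.random.default_rng(0)
src = rng.uniform([0.1, 55.0], [0.6, 75.0], size=(200, 2))
tgt = src @ np.array([[1.15, 0.0],      # target words a bit longer...
                      [0.0, 0.9]]) \
      + rng.normal(0, 0.02, (200, 2))   # ...and slightly quieter, plus noise

model = LinearRegression().fit(src, tgt)   # maps source prosody -> target prosody
new_word = np.array([[0.30, 68.0]])        # duration/power of one input word
print(model.predict(new_word))             # prosody to impose on the TTS output
```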

17.
Speech is an important medium for transmitting and exchanging information, and people routinely use it as the vehicle of communication. The acoustic signal carries abundant speaker information, semantic information and rich emotional information, which has given rise to three distinct directions for speech-processing tasks: speaker recognition (SR), automatic speech recognition (ASR) and speech emotion recognition (SER), each extracting information and designing models with the techniques specific to its own field. This paper first reviews the early development of the three tasks in China and abroad, summarizing their evolution in four stages, and surveys the common acoustic features used for feature extraction across the three tasks, explaining the emphasis of each feature class. With the recent wide application of deep learning across many fields, speech tasks have also advanced considerably; the paper therefore analyzes the application of popular deep learning models to acoustic modeling, summarizes the supervised and unsupervised acoustic feature extraction approaches and technical routes for the three tasks, and reviews multi-channel models with attention mechanisms for speech feature extraction. To perform speech recognition, speaker recognition and emotion recognition simultaneously, a multi-task Tandem model is proposed for the personalized characteristics of the acoustic signal; in addition, a multi-channel collaborative network model is proposed, a design that can improve the accuracy of multi-task feature extraction.

18.
In this paper, we present an improved estimator for the speech presence probability at each time-frequency point in the short-time Fourier transform domain. In contrast to existing approaches, this estimator does not rely on an adaptively estimated and thus signal-dependent a priori signal-to-noise ratio estimate. It therefore decouples the estimation of the speech presence probability from the estimation of the clean speech spectral coefficients in a speech enhancement task. Using both a fixed a priori signal-to-noise ratio and a fixed prior probability of speech presence, the proposed a posteriori speech presence probability estimator achieves probabilities close to zero for speech absence and probabilities close to one for speech presence. While state-of-the-art speech presence probability estimators use adaptive prior probabilities and signal-to-noise ratio estimates, we argue that these quantities should reflect true a priori information that shall not depend on the observed signal. We present a detection theoretic framework for determining the fixed a priori signal-to-noise ratio. The proposed estimator is conceptually simple and yields a better tradeoff between speech distortion and noise leakage than state-of-the-art estimators.
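Under the usual complex-Gaussian model, the fixed-prior a posteriori speech presence probability has a closed form. A sketch with an assumed fixed a priori SNR of 15 dB and P(H1) = 0.5; the paper derives its fixed values from a detection-theoretic framework rather than these placeholders.

```python
import numpy as np

def speech_presence_prob(noisy_power, noise_power,
                         xi_fix=10 ** (15 / 10), p_h1=0.5):
    """A posteriori speech presence probability with FIXED a priori SNR xi_fix
    and fixed prior p_h1, under the complex-Gaussian model:
        P(H1|Y) = 1 / (1 + (1-p)/p * (1+xi) * exp(-gamma*xi/(1+xi)))
    where gamma = |Y|^2 / noise power is the a posteriori SNR."""
    gamma = noisy_power / (noise_power + 1e-12)
    odds = ((1 - p_h1) / p_h1) * (1 + xi_fix) * np.exp(-gamma * xi_fix / (1 + xi_fix))
    return 1.0 / (1.0 + odds)

print(speech_presence_prob(noisy_power=20.0, noise_power=1.0))  # close to 1
print(speech_presence_prob(noisy_power=1.0,  noise_power=1.0))  # small
```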

19.
Interaction with electronic speech products is becoming a fact of life through telephone answering systems and speech-driven booking systems, and is set to increase in the future. Older adults will be obliged to use more of these electronic products, and because of their special interactional needs due to age-related impairments it is important that such interactions are designed to suit the needs of such users, and in particular, that appropriate mechanisms are put in place to support learning of older users about interaction. Drawing upon the expertise of tutors at Age Concern Oxfordshire, and the results of preliminary investigations with older adults using dialogues in a speech system, this paper explores the conditions which best provide for the learning experience of older adults, and looks at special features which enable instructions and help for learning to be embedded within speech dialogue design.

20.
In this paper, a sinusoidal model has been proposed for characterization and classification of different stress classes (emotions) in a speech signal. Frequency, amplitude and phase features of the sinusoidal model are analyzed and used as input features to a stressed speech recognition system. The performances of sinusoidal model features are evaluated for recognition of different stress classes with a vector-quantization classifier and a hidden Markov model classifier. To find the effectiveness of these features for recognition of different emotions in different languages, speech signals are recorded and tested in two languages, Telugu (an Indian language) and English. Average stressed speech index values are proposed for comparing differences between stress classes in a speech signal. Results show that sinusoidal model features are successful in characterizing different stress classes in a speech signal. Sinusoidal features perform better compared to the linear prediction and cepstral features in recognizing the emotions in a speech signal.
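The raw features of a sinusoidal model (frequency, amplitude and phase of the dominant partials) can be obtained by peak picking in the short-time spectrum. A minimal single-frame sketch with assumed analysis settings, not the paper's analysis front end:

```python
import numpy as np

def sinusoidal_features(frame, sr=16000, n_peaks=5):
    """Frequency, amplitude and phase of the strongest spectral peaks
    in one analysis frame -- the raw features of a sinusoidal model."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    mag = np.abs(spec)
    # keep local maxima only, then the n_peaks largest
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]
    peaks = sorted(peaks, key=lambda k: mag[k], reverse=True)[:n_peaks]
    freqs = np.array(peaks) * sr / len(frame)
    return freqs, mag[peaks], np.angle(spec)[peaks]

t = np.arange(400) / 16000
frame = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
f, a, p = sinusoidal_features(frame)
print(np.round(f))   # strongest peaks near 440 and 880 Hz
```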
