Similar Literature
20 similar documents found.
1.
We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from nonspeech consisting of animal vocalizations, music, and environmental sounds. Although this is a relatively easy task for humans, it is still difficult to automate well, especially in noisy and reverberant environments. The auditory model captures basic processes occurring from the early cochlear stages to the central cortical areas. The model generates a multidimensional spectro-temporal representation of the sound, which is then analyzed by a multilinear dimensionality reduction technique and classified by a support vector machine (SVM). Generalization of the system to signals at high levels of additive noise and reverberation is evaluated and compared to two existing approaches (Scheirer and Slaney, 2002; Kingsbury et al., 2002). The results demonstrate the advantages of the auditory model over the other two systems, especially at low signal-to-noise ratios (SNRs) and under strong reverberation.
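As an illustration of the general pipeline only (not the authors' cortical model), the sketch below computes crude multiscale modulation features: a log magnitude spectrogram followed by a 2-D FFT, pooled into a small rate/scale grid. The window sizes and pooling grid are arbitrary assumptions.

```python
import numpy as np

def modulation_features(x, fs, n_fft=256, hop=128, n_rates=4, n_scales=4):
    """Crude spectro-temporal modulation features: log spectrogram,
    2-D FFT (joint rate/scale modulation energy), pooled into a grid.
    A toy stand-in for the cortical model described in the abstract."""
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        w = x[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(w)))
    S = np.log1p(np.array(frames).T)          # (freq, time)
    M = np.fft.fftshift(np.abs(np.fft.fft2(S)))
    # pool into an n_scales x n_rates grid -> low-dimensional feature vector
    scale_bins = np.array_split(np.arange(M.shape[0]), n_scales)
    rate_bins = np.array_split(np.arange(M.shape[1]), n_rates)
    feat = np.array([[M[np.ix_(sb, rb)].mean() for rb in rate_bins]
                     for sb in scale_bins])
    return feat.ravel()

fs = 8000
t = np.arange(fs) / fs
# a "speech-like" tone with slow amplitude modulation vs. plain noise
speechlike = np.sin(2 * np.pi * 200 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
noise = np.random.default_rng(0).standard_normal(fs)
f1 = modulation_features(speechlike, fs)
f2 = modulation_features(noise, fs)
print(f1.shape)   # (16,)
```

A classifier such as an SVM would then be trained on vectors like `f1` and `f2`; the multilinear dimensionality reduction step of the paper is not reproduced here.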

2.
A speech pre-processing algorithm is presented that improves the speech intelligibility in noise for the near-end listener. The algorithm improves intelligibility by optimally redistributing the speech energy over time and frequency according to a perceptual distortion measure, which is based on a spectro-temporal auditory model. Since this auditory model takes into account short-time information, transients will receive more amplification than stationary vowels, which has been shown to be beneficial for intelligibility of speech in noise. The proposed method is compared to unprocessed speech and two reference methods using an intelligibility listening test. Results show that the proposed method leads to significant intelligibility gains while still preserving quality. Although one of the methods used as a reference obtained higher intelligibility gains, this happened at the cost of decreased quality. Matlab code is provided.

3.
This paper presents an improved statistical test for voice activity detection in adverse noise environments. The method is based on a revised contextual likelihood ratio test (LRT) defined over a multiple-observation window. The motivation for revising the original multiple-observation LRT (MO-LRT) is its artificially added hangover mechanism, which behaves incorrectly under different signal-to-noise ratio (SNR) conditions. The new approach defines a maximum a posteriori (MAP) statistical test in which all global hypotheses on the multiple-observation window containing up to one speech-to-nonspeech or nonspeech-to-speech transition are considered. The implicit hangover mechanism artificially added by the original method is thus absent from the revised method, so its design can be further improved. With these and other innovations, the proposed method shows higher speech/nonspeech discrimination accuracy over a wide range of SNR conditions when compared to the original MO-LRT voice activity detector (VAD). Experiments conducted on the AURORA databases and tasks show that the revised method yields significant improvements in speech recognition performance over standardized VADs such as ITU-T G.729 and ETSI AMR for discontinuous voice transmission and the ETSI AFE for distributed speech recognition (DSR), as well as over recently reported methods.
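The per-frequency log-likelihood ratio underlying LRT-based VADs can be sketched as follows, assuming complex-Gaussian DFT coefficients and a fixed a priori SNR `xi`. The multiple-observation decision shown is a toy average over the window, not the paper's MAP test over transition hypotheses.

```python
import numpy as np

def frame_llr(noisy_psd, noise_psd, xi):
    """Per-frequency log-likelihood ratio for speech presence under
    complex-Gaussian models of the DFT coefficients (a standard LRT form)."""
    gamma = noisy_psd / noise_psd                  # a posteriori SNR
    return gamma * xi / (1.0 + xi) - np.log1p(xi)

def simple_mo_lrt(frames_psd, noise_psd, xi=1.0, threshold=0.0):
    """Toy multiple-observation decision: average the per-frame mean LLR
    over the window and compare with a threshold. Illustrative only."""
    llrs = [frame_llr(p, noise_psd, xi).mean() for p in frames_psd]
    return np.mean(llrs) > threshold

rng = np.random.default_rng(1)
noise_psd = np.ones(129)
# periodograms of pure complex-Gaussian noise (unit power per bin)
noise_frames = [np.abs(rng.standard_normal(129)
                       + 1j * rng.standard_normal(129))**2 / 2
                for _ in range(5)]
speech_frames = [p + 4.0 for p in noise_frames]    # raised power ~ speech present
print(bool(simple_mo_lrt(noise_frames, noise_psd)))    # False
print(bool(simple_mo_lrt(speech_frames, noise_psd)))   # True
```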

4.
Perceptually optimal processing of speech and audio signals demands a rigorous approach using a distortion measure that resembles human perception. This requires distortion measures based on sophisticated, complex auditory models. Under the assumption of small distortions, these models can be simplified by means of a sensitivity matrix. In this paper, we show the power of this approach. We present a method to derive the sensitivity matrix for distortion measures based on spectro-temporal auditory models. This method is applied to an example auditory model, and we discuss the region of validity of the approximation and the application of linear algebra to analyze the characteristics of the given model. Furthermore, we show how to build a coder that minimizes a sensitivity-matrix distortion measure despite the typically long support of a perceptual distortion measure.
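A minimal numerical illustration of the idea, assuming nothing about the paper's specific auditory model: for a differentiable distortion measure D(x, x+e), the sensitivity matrix is the Hessian with respect to the error e, which can be estimated by finite differences and checked against a known quadratic distortion.

```python
import numpy as np

def sensitivity_matrix(distortion, x, h=1e-4):
    """Numerically estimate the sensitivity matrix S of a distortion
    measure, D(x, x+e) ~ 0.5 * e^T S e for small errors e, by central
    second differences. The paper derives S analytically from a
    spectro-temporal auditory model instead; this is a generic sketch."""
    n = len(x)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            S[i, j] = (distortion(x, x + ei + ej) - distortion(x, x + ei)
                       - distortion(x, x + ej) + distortion(x, x)) / h**2
    return S

# toy weighted-squared-error "distortion" whose sensitivity matrix is known
W = np.diag([1.0, 2.0, 3.0])
D = lambda x, y: 0.5 * (y - x) @ W @ (y - x)
x0 = np.array([0.3, -0.1, 0.7])
S = sensitivity_matrix(D, x0)
print(np.round(S, 3))   # approximately diag(1, 2, 3)
```

For a quadratic distortion the finite-difference estimate is exact up to floating-point error, which makes this a convenient sanity check before applying the same recipe to a real auditory model.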

5.
The representation of sound signals at the cochlear and auditory cortical levels has been studied as an alternative to classical analysis methods. In this work, we put forward a recently proposed feature extraction method called approximate auditory cortical representation, based on an approximation to the statistics of discharge patterns at the primary auditory cortex. The proposed approach estimates a non-negative sparse coding with a combined dictionary of atoms. These atoms represent the spectro-temporal receptive fields of auditory cortical neurons and are calculated from the auditory spectrograms of the clean signal and the noise. Denoising is carried out by reconstructing the noisy signal while discarding the atoms corresponding to the noise. Experiments are presented using synthetic (chirps) and real data (speech) in the presence of additive noise. For the evaluation of the new method and its variants, we used two objective measures: the perceptual evaluation of speech quality and the segmental signal-to-noise ratio. Results show that the proposed method improves the quality of the signals, mainly under severe degradation.
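A toy sketch of the combined-dictionary idea, using standard NMF-style multiplicative updates in place of the paper's cortical-atom coding; the dictionary and spectra below are invented for illustration. The signal is reconstructed from the speech atoms only, discarding the noise-atom contribution.

```python
import numpy as np

def nn_code(D, v, n_iter=200):
    """Non-negative coding of spectrum v on dictionary D (columns = atoms)
    via standard NMF-style multiplicative updates. A simplified stand-in
    for the sparse cortical coding described in the abstract."""
    rng = np.random.default_rng(0)
    h = rng.random(D.shape[1]) + 0.1
    for _ in range(n_iter):
        h *= (D.T @ v) / (D.T @ (D @ h) + 1e-12)
    return h

# toy dictionary: two "speech" atoms and one "noise" atom
D_speech = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
D_noise = np.array([[0.0], [0.0], [1.0], [1.0]])
D = np.hstack([D_speech, D_noise])

v_clean = np.array([2.0, 1.0, 0.0, 0.0])
v_noisy = v_clean + np.array([0.0, 0.0, 0.5, 0.5])   # additive "noise"
h = nn_code(D, v_noisy)
denoised = D_speech @ h[:2]            # discard the noise-atom contribution
print(np.round(denoised, 2))           # close to the clean spectrum
```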

6.
Spectro-temporal features of speech are the basis of phonological comprehension and production in the brain. These features therefore provide a relevant framework for studying speech and language development in children. In this paper, we present a novel framework to study the statistics of spectro-temporal features of speech that are encoded at different timescales. These timescales correspond to different linguistic units such as prosodic or syllabic components. The framework is tested on a speech corpus consisting of 169 speech samples. The paper demonstrates the use of the proposed framework in finding milestones of speech development in children. The results indicate that adults exhibit a greater number of spectro-temporal features encoded at short timescales than children. However, no significant difference is observed between these groups in the spectro-temporal features encoded at long timescales. The proposed framework is also used to study the speech impairments of children and adults with mild to moderate intellectual disabilities. The results reveal the absence of some spectro-temporal features encoded at both timescales, and their absence is more prominent at shorter timescales. The suggested framework can be used for studying speech development and impairment in different disorders.

7.
Several algorithms have been proposed to characterize the spectro-temporal tuning properties of auditory neurons during the presentation of natural stimuli. Algorithms designed to work at realistic signal-to-noise levels must make some prior assumptions about tuning in order to produce accurate fits, and these priors can introduce bias into estimates of tuning. We compare a new, computationally efficient algorithm for estimating tuning properties, boosting, to a more commonly used algorithm, normalized reverse correlation. These algorithms employ the same functional model and cost function, differing only in their priors. We use both algorithms to estimate the spectro-temporal tuning properties of neurons in primary auditory cortex during the presentation of continuous human speech. Models estimated using either algorithm have similar predictive power, although fits by boosting are slightly more accurate. More strikingly, neurons characterized with boosting appear tuned to narrower spectral bandwidths and higher temporal modulation rates than when characterized with normalized reverse correlation. These differences have little impact on responses to speech, which is spectrally broadband and modulated at low rates. However, we find that models estimated by boosting also predict responses to non-speech stimuli more accurately. These findings highlight the crucial role of priors in characterizing neuronal response properties with natural stimuli.
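The reverse-correlation baseline can be sketched as ridge-regularized least squares, where the ridge weight plays exactly the role of the prior the abstract discusses. The stimulus, ground-truth STRF, and noise level below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
n_freq, n_lag, n_t = 8, 5, 2000
stim = rng.standard_normal((n_freq, n_t))    # random "spectrogram" stimulus

# ground-truth spectro-temporal receptive field (STRF)
strf_true = np.zeros((n_freq, n_lag))
strf_true[3, 1] = 1.0
strf_true[4, 2] = -0.5

# design matrix: each row is the flattened recent stimulus history
X = np.array([stim[:, t - n_lag + 1:t + 1].ravel()
              for t in range(n_lag - 1, n_t)])
# simulated linear response plus noise
r = X @ strf_true.ravel() + 0.1 * rng.standard_normal(len(X))

# normalized reverse correlation with a ridge (Gaussian) prior of weight lam;
# changing lam changes the bias of the estimate, as the abstract notes
lam = 1.0
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ r)
strf_hat = w_hat.reshape(n_freq, n_lag)
print(round(float(strf_hat[3, 1]), 2), round(float(strf_hat[4, 2]), 2))  # ~1.0, ~-0.5
```

Boosting would instead build the STRF up coordinate by coordinate, which imposes a sparseness prior rather than the smooth shrinkage shown here.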

8.
Speech quality is an important metric for evaluating communication systems. The existing perceptual evaluation of speech quality algorithm uses a Bark-spectrum-based perceptual model, which has high computational complexity and models the frequency selectivity of the human ear imperfectly. To address this problem, this paper proposes a new objective speech quality assessment method that extracts feature parameters with a Gammatone filterbank, which better matches the auditory characteristics of the human ear. The method computes the average distortion distance between the original and the distorted speech, and derives an objective mean opinion score from the mapping between subjective mean opinion scores and the normalized average distortion distance. Experiments show that, compared with the perceptual evaluation method, the proposed algorithm greatly reduces computational complexity while maintaining a high correlation between objective and subjective mean opinion scores.
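A rough sketch of a Gammatone-filterbank feature distance: per-band log energies of the clean and degraded signals are compared. The ERB bandwidth formula is the standard Glasberg-Moore approximation; the mapping from this distance to a mean opinion score is omitted, and the centre frequencies are arbitrary.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth (Glasberg-Moore approximation)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs, dur=0.05, order=4, b=1.019):
    """4th-order gammatone impulse response at centre frequency fc (Hz)."""
    t = np.arange(int(dur * fs)) / fs
    return (t**(order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t)
            * np.cos(2 * np.pi * fc * t))

def gammatone_energies(x, fs, centres):
    """Per-band log energies of x through a crude gammatone filterbank."""
    out = []
    for fc in centres:
        y = np.convolve(x, gammatone_ir(fc, fs), mode="same")
        out.append(np.log(np.mean(y**2) + 1e-12))
    return np.array(out)

fs = 8000
t = np.arange(fs // 2) / fs
clean = np.sin(2 * np.pi * 500 * t)
degraded = clean + 0.1 * np.random.default_rng(3).standard_normal(len(t))
centres = [200.0, 500.0, 1000.0, 2000.0]
d = np.mean(np.abs(gammatone_energies(clean, fs, centres)
                   - gammatone_energies(degraded, fs, centres)))
print(round(float(d), 3))   # average per-band log-energy distortion
```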

9.
Electronic voice disguise refers to altering a speaker's individual characteristics with voice-changing devices or speech processing software in order to deliberately conceal the speaker's identity. Restoring electronically disguised speech means converting the disguised speech back to the original voice by technical means, which is important for voice-based identity verification. This paper casts the restoration of speech disguised in the frequency and time domains as the problem of estimating the disguise factor, estimates this factor with an i-vector-based automatic speaker verification method, and introduces a symmetric transform to further improve the estimate. Exploiting the noise robustness of i-vectors, the method improves the accuracy of disguise-factor estimation in realistic noisy scenarios and thereby improves the restoration of electronically disguised speech under noise. The i-vectors were trained on the clean TIMIT corpus and the method was tested on the noisy VoxCeleb1 corpus; the results show that the error rate of disguise-factor estimation drops from 9.19% for the baseline system to 4.49%, and the restored speech also improves in terms of the equal error rate of automatic speaker verification and of auditory perception.

10.
Inspired by research in acoustics and by the way the human brain and ear process speech, a complete speech separation model simulating the auditory central system is established. First, a peripheral auditory model performs multi-band spectral analysis of the speech signal; then a coincidence-neuron model extracts features of the speech signal; finally, separation is completed in a neural cell model of the inferior colliculus. This model addresses the limitation that most existing speech recognition methods work only in single-source, low-noise environments. Experimental results show that the model can separate speech in multi-source environments with high robustness. As research deepens, speech separation models based on human auditory characteristics will have broad application prospects.

11.
Spectro-temporal representation of speech has become one of the leading signal representation approaches in speech recognition systems in recent years. This representation suffers from the high dimensionality of its feature space, which makes it unsuitable for practical speech recognition systems. In this paper, a new clustering-based method is proposed for secondary feature selection/extraction in the spectro-temporal domain. In the proposed representation, Gaussian mixture models (GMM) and weighted K-means (WKM) clustering techniques are applied in the spectro-temporal domain to reduce the dimensionality of the feature space. The elements of the centroid vectors and covariance matrices of the clusters are taken as attributes of the secondary feature vector of each frame. To evaluate the efficiency of the proposed approach, tests were conducted with the new feature vectors on classification of phonemes from the main phoneme categories of the TIMIT database. Employing the proposed secondary feature vector yielded a significant improvement in the classification rate of different sets of phonemes compared with MFCC features. The average improvement in the classification rate of voiced plosives compared with MFCC features is 5.9% using WKM clustering and 6.4% using GMM clustering. The greatest improvement, about 7.4%, is obtained by using WKM clustering in classification of front vowels compared with MFCC features.
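A simplified sketch of clustering-based secondary features, using plain k-means in place of the paper's weighted k-means and GMMs: each spectro-temporal patch is reduced to per-cluster centroids and variances concatenated into one fixed-length vector. The patch size and cluster count are arbitrary.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means (the paper uses a weighted variant plus GMMs)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    d = np.linalg.norm(points[:, None] - centroids[None], axis=2)
    return centroids, d.argmin(axis=1)

def secondary_features(patch, k=3):
    """Reduce a (freq x time) patch to cluster statistics: for each cluster
    of (freq, time, energy) points, keep the centroid and the per-dimension
    variance, concatenated into one fixed-length secondary feature vector."""
    F, T = patch.shape
    f, t = np.meshgrid(np.arange(F), np.arange(T), indexing="ij")
    pts = np.stack([f.ravel(), t.ravel(), patch.ravel()], axis=1).astype(float)
    centroids, labels = kmeans(pts, k)
    var = np.array([pts[labels == j].var(axis=0) if np.any(labels == j)
                    else np.zeros(3) for j in range(k)])
    return np.concatenate([centroids.ravel(), var.ravel()])

patch = np.random.default_rng(4).random((12, 20))
feat = secondary_features(patch)
print(feat.shape)   # (18,) : 3 clusters x (3 centroid + 3 variance) values
```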

12.
The singular value decomposition (SVD)-based method for single-channel speech enhancement has been shown to be very useful when the additive noise is white. For colored noise, one needs to whiten the noise spectrum prior to the SVD-based approach and perform inverse whitening afterwards. A truncated quotient SVD (QSVD)-based approach has been proposed to handle this problem and found to be very useful. In this paper, a generalized SVD (GSVD)-based subspace approach for speech enhancement is first extended from the concept of the truncated QSVD-based approach, in which the dimension of the signal subspace can be precisely and automatically determined for each frame of the noisy signal. With this new approach, however, some residual noise is still perceivable under low signal-to-noise ratio conditions. A perceptually constrained GSVD (PCGSVD)-based approach is therefore further proposed that incorporates the masking properties of the human auditory system to ensure that the undesired residual noise is nearly imperceptible. Closed-form solutions are obtained for both the GSVD- and PCGSVD-based enhancement approaches. Carefully performed objective evaluations and subjective listening tests show that the proposed PCGSVD-based approach can offer improved speech quality, intelligibility and recognition accuracy, whether the noise is stationary or nonstationary, and especially when the additive noise is nonwhite.
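The white-noise special case that the paper generalizes can be sketched with a plain truncated SVD of a Hankel embedding of the noisy frame; the frame length, embedding depth, and retained rank below are arbitrary choices, and the GSVD/PCGSVD machinery for coloured noise and perceptual constraints is not shown.

```python
import numpy as np

def svd_enhance(noisy, n_rows=32, rank=4):
    """Truncated-SVD subspace enhancement for white noise: embed the frame
    in a Hankel matrix, keep the top singular components, and average the
    anti-diagonals back into a signal."""
    n_cols = len(noisy) - n_rows + 1
    H = np.array([noisy[i:i + n_cols] for i in range(n_rows)])
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    H_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # average each anti-diagonal to undo the Hankel embedding
    out = np.zeros(len(noisy))
    counts = np.zeros(len(noisy))
    for i in range(n_rows):
        out[i:i + n_cols] += H_low[i]
        counts[i:i + n_cols] += 1
    return out / counts

rng = np.random.default_rng(5)
t = np.arange(256) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * rng.standard_normal(len(t))
enhanced = svd_enhance(noisy)
err_before = np.mean((noisy - clean)**2)
err_after = np.mean((enhanced - clean)**2)
print(bool(err_after < err_before))   # True: noise energy is reduced
```

A sinusoid occupies a low-rank subspace of the Hankel matrix, so truncating the SVD removes much of the wideband noise while keeping the tone.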

13.
Models of auditory processing, particularly of speech, face many difficulties. These difficulties include variability among speakers, variability in speech rate and robustness to moderate distortions such as time compression. In contrast to the 'invariance of percept' (across different speakers, of different sexes, using different intonation, and so on) is the observation that we are sensitive to the identity, sex and intonation of the speaker. In previous work we have reported that a model based on ensembles of spectro-temporal feature detectors, derived from onset sensitive pre-processing of a limited class of stimuli, preserves significant information about the stimulus class. We have also shown that this is robust with respect to the exact choice of feature set, moderate time compression in the stimulus and speaker variation. Here we extend these results to show a) that by using a classifier based on a network of spiking neurons with spike-driven plasticity, the output of the ensemble constitutes an effective rate coding representation of complex sounds; and b) that the same set of spectro-temporal features concurrently preserve information about a range of qualitatively different classes into which the stimulus might fall. We show that it is possible for multiple views of the same pattern of responses to generate different percepts. This is consistent with suggestions that multiple parallel processes exist within the auditory 'what' pathway with attentional modulation enhancing the task-relevant classification type. We also show that the responses of the ensemble are sparse in the sense that a small number of features respond for each stimulus type. This has implications for the ensembles' ability to generalise, and to respond differentially to a wide variety of stimulus classes.  相似文献   

14.
For individuals with severe speech impairment, accurate spoken communication can be difficult and require considerable effort. Some may choose to use a voice output communication aid (or VOCA) to support their spoken communication needs. A VOCA typically takes input from the user through a keyboard or switch-based interface and produces spoken output using either synthesised or recorded speech. The type and number of synthetic voices that can be accessed with a VOCA is often limited, and this has been implicated as a factor in rejection of the devices. There is therefore a need to provide voices that are more appropriate and acceptable for users. This paper reports on a study that utilises recent advances in speech synthesis to produce personalised synthetic voices for 3 speakers with mild to severe dysarthria, one of the most common speech disorders. Using a statistical parametric approach to synthesis, an average voice trained on data from several unimpaired speakers was adapted using recordings of the impaired speech of 3 dysarthric speakers. By careful selection of the speech data and the model parameters, several exemplar voices were produced for each speaker. A qualitative evaluation was conducted with the speakers and with listeners who were familiar with them. The evaluation showed that for one of the 3 speakers a voice could be created which conveyed many of his personal characteristics, such as regional identity, sex and age.

15.
System Combination for Machine Translation of Spoken and Written Language
This paper describes an approach for computing a consensus translation from the outputs of multiple machine translation (MT) systems. The consensus translation is computed by weighted majority voting on a confusion network, similarly to the well-established ROVER approach of Fiscus for combining speech recognition hypotheses. To create the confusion network, pairwise word alignments of the original MT hypotheses are learned using an enhanced statistical alignment algorithm that explicitly models word reordering. The context of a whole corpus of automatic translations rather than a single sentence is taken into account in order to achieve high alignment quality. The confusion network is rescored with a special language model, and the consensus translation is extracted as the best path. The proposed system combination approach was evaluated in the framework of the TC-STAR speech translation project. Up to six state-of-the-art statistical phrase-based translation systems from different project partners were combined in the experiments. Significant improvements in translation quality from Spanish to English and from English to Spanish in comparison with the best of the individual MT systems were achieved under official evaluation conditions.
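The word-level voting step (after alignment) can be sketched as a weighted majority vote over confusion-network columns. The alignment itself and the language-model rescoring are not shown, and the hypotheses below are invented; `''` stands for an epsilon (empty) arc.

```python
from collections import Counter

def consensus(aligned_hyps, weights=None):
    """Weighted majority vote over word-aligned MT hypotheses: one column
    per position of the (already aligned) confusion network."""
    weights = weights or [1.0] * len(aligned_hyps)
    out = []
    for col in zip(*aligned_hyps):
        votes = Counter()
        for w, word in zip(weights, col):
            votes[word] += w
        best = max(votes, key=votes.get)
        if best:                      # an epsilon win emits no word
            out.append(best)
    return " ".join(out)

hyps = [["the", "cat", "sat", ""],
        ["the", "cat", "sits", "down"],
        ["a",   "cat", "sat", ""]]
print(consensus(hyps))   # "the cat sat"
```

Weighting the systems (e.g. by their standalone BLEU) simply biases the vote toward the stronger hypotheses, which is the role the confusion-network weights play in the paper.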

16.
Here, the formation of continuous attractor dynamics in a nonlinear recurrent neural network is used to achieve a nonlinear speech denoising method, in order to implement robust phoneme recognition and information retrieval. Attractor dynamics are first formed in the recurrent neural network by training on the clean speech subspace as the continuous attractor. The network is then used to recognize noisy speech with both stationary and nonstationary noise. In this work, the efficiency of a nonlinear feedforward network is compared to that of the same network with a recurrent connection in its hidden layer. The structure and training of this recurrent connection are designed so that the network learns to denoise the signal step by step, using properties of the attractors it has formed, along with phone recognition. Using these connections, the recognition accuracy is improved by 21% for the stationary signal and 14% for the nonstationary one at 0 dB SNR, with respect to a reference model, which is a feedforward neural network.

17.
A speech enhancement method combining the auditory masking effect and Wiener filtering is proposed, in which a nonstationary noise estimation algorithm and the probability that the noise is masked, derived from the auditory masking effect, are applied to Wiener-filter speech enhancement. Objective tests of the enhanced speech under several noise backgrounds show that, compared with the conventional Wiener-filter speech enhancement algorithm, the proposed algorithm not only improves the speech signal-to-noise ratio but also noticeably reduces speech distortion.
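A sketch of the Wiener-filter core with the classical decision-directed a priori SNR estimate; the masking-threshold weighting and the nonstationary noise estimator described in the abstract are not modeled here, and the spectra are toy values.

```python
import numpy as np

def wiener_gains(noisy_psd_frames, noise_psd, alpha=0.98):
    """Frame-by-frame Wiener gains G = xi / (1 + xi), with the a priori
    SNR xi from the standard decision-directed recursion."""
    gains = []
    prev_clean_psd = np.zeros_like(noise_psd)
    for psd in noisy_psd_frames:
        gamma = psd / noise_psd                        # a posteriori SNR
        xi = (alpha * prev_clean_psd / noise_psd
              + (1 - alpha) * np.maximum(gamma - 1, 0))
        g = xi / (1 + xi)                              # Wiener gain
        prev_clean_psd = (g**2) * psd                  # clean-speech PSD estimate
        gains.append(g)
    return gains

noise_psd = np.ones(4)
frames = [np.array([9.0, 1.0, 1.0, 9.0]),
          np.array([9.0, 1.0, 1.0, 9.0])]
g = wiener_gains(frames, noise_psd)
print(np.round(g[-1], 2))   # high gain in high-SNR bins, near zero elsewhere
```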

18.
Voice activity detectors (VADs) based on statistical models have shown impressive performance, especially when fairly precise statistical models are employed. Moreover, the accuracy of a VAD utilizing statistical models can be significantly improved when machine-learning techniques are adopted to provide prior knowledge of speech characteristics. In the first part of this paper, we introduce a more accurate and flexible statistical model, the generalized gamma distribution (GΓD), as a new model in a VAD based on the likelihood ratio test. A practical parameter estimation algorithm based on the maximum likelihood principle is also presented. Experimental results show that the VAD algorithm based on GΓD outperforms those adopting the conventional Laplacian and Gamma distributions. In the second part of this paper, we introduce machine learning techniques such as minimum classification error (MCE) training and the support vector machine (SVM) to automatically exploit prior knowledge obtained from a speech database, which can enhance the performance of the VAD. First, we present a discriminative weight training method based on the MCE criterion, in which the VAD decision rule becomes the geometric mean of optimally weighted likelihood ratios. Second, an SVM-based approach is introduced to assist the VAD based on statistical models. In this algorithm, the SVM efficiently classifies the input signal into two classes, voice-active and voice-inactive regions, with a nonlinear boundary. Experimental results show that these training-based approaches can effectively enhance the performance of the VAD.

19.
This paper considers estimation of the noise spectral variance from speech signals contaminated by highly nonstationary noise sources. The method can accurately track fast changes in noise power level (up to about 10 dB/s). In each time frame, for each frequency bin, the noise variance estimate is updated recursively with the minimum mean-square error (mmse) estimate of the current noise power. A time- and frequency-dependent smoothing parameter is used, which is varied according to an estimate of speech presence probability. In this way, the amount of speech power leaking into the noise estimates is kept low. For the estimation of the noise power, a spectral gain function is used, which is found by an iterative data-driven training method. The proposed noise tracking method is tested on various stationary and nonstationary noise sources, for a wide range of signal-to-noise ratios, and compared with two state-of-the-art methods. When used in a speech enhancement system, improvements in segmental signal-to-noise ratio of more than 1 dB can be obtained for the most nonstationary noise sources at high noise levels.
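A toy recursive tracker in the same spirit: the smoothing factor per frequency bin is varied with a soft speech-presence probability, so that high-power (likely speech) bins update the noise estimate slowly and speech power leaks less into it. The sigmoid and all constants here are arbitrary assumptions, not the paper's mmse estimator or trained gain function.

```python
import numpy as np

def track_noise(noisy_psd_frames, alpha_speech=0.99, alpha_noise=0.85, thresh=3.0):
    """Toy recursive noise-variance tracker with a time- and frequency-
    dependent smoothing parameter driven by a soft speech-presence
    probability. Purely illustrative."""
    n_hat = noisy_psd_frames[0].copy()        # initialise from the first frame
    history = []
    for psd in noisy_psd_frames:
        gamma = psd / np.maximum(n_hat, 1e-12)           # a posteriori SNR
        p_speech = 1.0 / (1.0 + np.exp(-(gamma - thresh)))  # soft presence prob.
        alpha = alpha_speech * p_speech + alpha_noise * (1 - p_speech)
        n_hat = alpha * n_hat + (1 - alpha) * psd        # per-bin recursion
        history.append(n_hat.copy())
    return history

rng = np.random.default_rng(6)
# noise power steps up from 1 to 4 halfway through (nonstationary noise)
frames = ([rng.exponential(1.0, 8) for _ in range(50)]
          + [rng.exponential(4.0, 8) for _ in range(50)])
hist = track_noise(frames)
print(round(float(hist[49].mean()), 2), round(float(hist[-1].mean()), 2))
```

The estimate follows the step in noise power with some delay, which is the trade-off the speech-presence-dependent smoothing controls.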

20.
Speech and voice technologies are experiencing a profound review as new paradigms are sought to overcome specific problems that cannot be completely solved by classical approaches. Neuromorphic Speech Processing is an emerging area in which research is turning toward understanding the natural neural processing of speech by the Human Auditory System, in order to capture the basic mechanisms that solve difficult tasks in an efficient way. In the present paper a further step is taken in the approach of mimicking basic neural speech processing by simple neuromorphic units, building on previous work to show how formant dynamics, and hence consonantal features, can be detected by a general neuromorphic unit that mimics the functionality of certain neurons found in the upper auditory pathways. Using these simple building blocks, a General Speech Processing Architecture can be synthesized as a layered structure. Results from different simulation stages are provided, as well as a discussion of implementation details. Conclusions and future work describe the functionality to be covered in the next research steps.

