Similar articles
1.
In this paper, we propose a novel multicomponent amplitude and frequency modulated (AFM) signal model for parametric representation of speech phonemes. An efficient technique is developed for parameter estimation of the proposed model. The Fourier–Bessel series expansion is used to separate a multicomponent speech signal into a set of individual components. The discrete energy separation algorithm is used to extract the amplitude envelope (AE) and the instantaneous frequency (IF) of each component of the speech signal. Then, the parameter estimation of the proposed AFM signal model is carried out by analysing the AE and IF parts of the signal component. The developed model is found to be suitable for representation of an entire speech phoneme (voiced or unvoiced) irrespective of its time duration, and the model is shown to be applicable for low bit-rate speech coding. The symmetric Itakura–Saito and the root-mean-square log-spectral distance measures are used for comparison of the original and reconstructed speech signals.
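The energy-separation step above is concrete enough to sketch. Below is a minimal Python implementation of one common DESA variant (DESA-2), assuming a single AM-FM component has already been isolated, e.g. by the Fourier–Bessel separation the abstract describes; the abstract does not say which DESA form the authors actually use.

```python
import numpy as np

def teager(x):
    # Teager-Kaiser energy operator: psi[n] = x[n]**2 - x[n-1]*x[n+1]
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x, fs):
    # DESA-2: with y[n] = x[n+1] - x[n-1],
    # omega = 0.5*arccos(1 - psi(y)/(2*psi(x))) and |a| = 2*psi(x)/sqrt(psi(y))
    y = x[2:] - x[:-2]                       # symmetric difference
    psi_x = np.maximum(teager(x)[1:-1], 1e-12)   # align on samples 2..N-3
    psi_y = np.maximum(teager(y), 1e-12)
    omega = 0.5 * np.arccos(np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0))
    ae = 2.0 * psi_x / np.sqrt(psi_y)        # amplitude envelope
    if_hz = omega * fs / (2.0 * np.pi)       # instantaneous frequency in Hz
    return ae, if_hz

# Sanity check on a pure 200 Hz tone sampled at 8 kHz:
fs = 8000
t = np.arange(fs) / fs
ae, if_hz = desa2(np.cos(2 * np.pi * 200 * t), fs)
print(ae[:3], if_hz[:3])   # envelope near 1.0, IF near 200 Hz
```

On a pure tone the estimated envelope stays near the true amplitude and the IF near the tone frequency, which is a quick check that the operator outputs are aligned correctly.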

2.
Direct word discovery from audio speech signals is a very difficult and challenging problem for a developmental robot. Human infants are able to discover words directly from speech signals, and, to understand this developmental capability through a constructive approach, it is very important to build a machine learning system that can acquire knowledge about words and phonemes, i.e. a language model and an acoustic model, autonomously in an unsupervised manner. To achieve this, the nonparametric Bayesian double articulation analyzer (NPB-DAA) with the deep sparse autoencoder (DSAE) is proposed in this paper. The NPB-DAA was previously proposed to achieve totally unsupervised direct word discovery from speech signals; however, its performance was still unsatisfactory, although it outperformed pre-existing unsupervised learning methods. In this paper, we integrate the NPB-DAA with the DSAE, a neural network model that can be trained in an unsupervised manner, and demonstrate its performance through an experiment on direct word discovery from auditory speech signals. The experiment shows that the combined method outperforms pre-existing unsupervised learning methods and achieves state-of-the-art performance. It is also shown that the proposed method outperforms several standard speech recognizer-based methods with true word dictionaries.

3.
Considering a real signal as the sum of a number of sinusoidal signals in the presence of additive noise, the maximum windowed likelihood (MWL) criterion is introduced and applied to construct an adaptive algorithm for estimating the amplitude and frequency of these components. The amplitudes, phases and frequencies are assumed to be slowly time varying. Employing MWL, an adaptive algorithm is obtained in two steps. First, assuming some initial values for the frequency of each component, a closed form is derived to estimate the amplitudes. Then, the gradient of the MWL is used to adaptively track the frequencies, using the latest amplitude estimates. The proposed algorithm has a parallel structure in which each branch estimates the parameters of one of the components. The proposed multicomponent phase locked loop (MPLL) algorithm is implemented with low-complexity blocks and can be adjusted for use in different conditions. The mean squared error of the algorithm is studied to analyze the effects of the window length and type and of the step size. Simulations have been conducted to illustrate the efficiency and performance of the algorithm in different conditions, including the effect of initialization, the frequency resolution, chirp components, components during frequency crossover, and speech signals. The simulations illustrate that the method efficiently tracks slowly time-varying components of signals such as voiced speech segments.
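As a rough illustration of the two-step structure described above, here is a hedged toy sketch: a closed-form least-squares amplitude fit given the current frequencies, followed by one gradient step on the frequencies. It assumes a complex (analytic) input signal, treats the amplitudes as fixed during the gradient step, and makes no claim to match the paper's exact MWL formulation or its PLL implementation; `track`, `f_init`, `win`, and `mu` are illustrative names and values, and the step size would need tuning per signal.

```python
import numpy as np

def track(x, f_init, fs, win=256, mu=1e-2):
    """Toy two-step tracker: per window, (1) closed-form weighted-LS complex
    amplitudes given the current frequencies, (2) one gradient step on the
    windowed squared error with respect to each frequency."""
    w = np.hanning(win)
    n = np.arange(win)
    f = np.array(f_init, dtype=float)       # one frequency per component, Hz
    history = []
    for start in range(0, len(x) - win + 1, win):
        seg = x[start:start + win]
        Ew = np.exp(2j * np.pi * np.outer(n, f) / fs) * w[:, None]
        a, *_ = np.linalg.lstsq(Ew, w * seg, rcond=None)   # step 1
        resid = w * seg - Ew @ a
        dE = (2j * np.pi * n[:, None] / fs) * Ew           # d(Ew)/df, per column
        grad = -2.0 * np.real((dE * a).conj().T @ resid)   # dJ/df
        f -= mu * grad                                     # step 2
        history.append(f.copy())
    return np.array(history)
```

In the paper the branches run as parallel phase-locked loops and the gradient belongs to the MWL objective; the sketch keeps only the skeleton of the two steps.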

4.
Linear frequency modulation (LFM or chirp) signals are widely used in information systems such as radar, sonar, and communications. In these systems, detecting LFM signals and estimating their parameters is an important problem. For a long time, various methods based on the maximum likelihood estimator have been the predominant solutions to this task. Most of these methods can be ascribed to a multivariable optimization algorithm and are usually computationally demanding in implementation…
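To make concrete why ML-style LFM estimation amounts to a multivariable search, here is a toy dechirping grid search for a single chirp. It is an illustration of the computational burden the abstract mentions (one FFT per candidate chirp rate), not the method of the cited paper; the grid and signal values are mine.

```python
import numpy as np

def ml_chirp_search(x, fs, rates):
    # Dechirp with each candidate rate and pick the strongest FFT bin;
    # the best (rate, f0) pair approximates the single-component ML fit.
    t = np.arange(len(x)) / fs
    best = (-np.inf, 0.0, 0.0)
    for k in rates:                                  # chirp rates in Hz/s
        z = x * np.exp(-1j * np.pi * k * t ** 2)     # remove quadratic phase
        spec = np.abs(np.fft.fft(z))[: len(x) // 2]  # positive frequencies
        i = int(np.argmax(spec))
        if spec[i] > best[0]:
            best = (spec[i], k, i * fs / len(x))
    return best[1], best[2]                          # (chirp rate, start freq)

# Example: a 500 Hz chirp sweeping at 300 Hz/s, sampled at 8 kHz for 1 s.
fs = 8000
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * (500 * t + 0.5 * 300 * t ** 2))
print(ml_chirp_search(x, fs, rates=np.arange(0, 600, 20)))   # ~ (300, 500.0)
```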

5.
Recognition of emotion in speech has recently matured into one of the key disciplines in speech analysis, serving next-generation human-machine interaction and communication. However, unlike automatic speech recognition, emotion recognition from an isolated word or phrase is inappropriate for conversation, because a complete emotional expression may span several sentences and may end on any word in a dialogue. In this paper, we present a segment-based emotion recognition approach for continuous Mandarin Chinese speech. In the proposed approach, the unit of recognition is not a phrase or a sentence but an emotional expression in dialogue. To that end, the following procedures are presented: First, we evaluate the performance of several classifiers in short-sentence speech emotion recognition architectures. The experimental results show that the WD-KNN classifier achieves the best accuracy for 5-class emotion recognition among the five classification techniques. We then implement a continuous Mandarin Chinese speech emotion recognition system, based on WD-KNN, with an emotion radar chart that can represent the intensity of each emotion component in speech. The proposed approach shows how emotions can be recognized from speech signals, and in turn how emotional states can be visualized.

6.
This paper presents a method for reconstructing unreliable spectral components of speech signals using the statistical distributions of the clean components. Our goal is to model the temporal patterns in the speech signal and to exploit correlations between speech features in the time and frequency domains simultaneously. In this approach, a hidden Markov model (HMM) is first trained on clean speech data to model the temporal patterns that appear in the sequences of spectral components. Using this model, and according to the probabilities of the noisy spectral components occurring at each state, probability distributions for the noisy components are estimated. Then, by applying maximum a posteriori (MAP) estimation to these distributions, the final estimates of the unreliable spectral components are obtained. The proposed method is compared to a common missing-feature method based on probabilistic clustering of the feature vectors, and also to a state-of-the-art method based on sparse reconstruction. The experimental results exhibit a significant improvement in recognition accuracy on a noise-polluted Persian corpus.
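As a minimal sketch of the per-frame MAP step (my simplification: one Gaussian per state, a bounded estimate, and no temporal decoding, whereas the paper runs a full HMM over time), the unreliable log-spectral components can be imputed by the best-matching state's mean, clipped at the observed value since the clean speech energy cannot exceed the noisy observation:

```python
import numpy as np

def map_impute(y, reliable, means, variances):
    """Bounded-MAP missing-feature imputation sketch.
    y: (D,) noisy log-spectrum; reliable: (D,) bool mask;
    means, variances: (S, D) state-conditional Gaussian parameters."""
    r = reliable
    # Marginal log-likelihood of the reliable dimensions under each state.
    ll = -0.5 * np.sum((y[r] - means[:, r]) ** 2 / variances[:, r]
                       + np.log(2 * np.pi * variances[:, r]), axis=1)
    s = int(np.argmax(ll))            # best-matching state for this frame
    est = y.copy()
    # MAP under an upper bound: prior mean, clipped at the observation.
    est[~r] = np.minimum(means[s, ~r], y[~r])
    return est
```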

7.
In this paper, the problem of modeling the time-trajectory of the sinusoidal components of voiced speech signals is addressed. A new global approach is presented: a single so-called long-term (LT) model, based on discrete cosine functions, is used to model the overall trajectories of the amplitude and phase parameters for each entire voiced section of speech, differing from usual (short-term) models defined on a frame-by-frame basis. The complete analysis-modeling-synthesis process is presented, including an iterative algorithm for optimal fitting between the LT model and the measures. A major issue of this paper concerns the use of perceptual criteria in the LT model fitting process (for both amplitude and phase modeling). The adaptation of perceptual criteria, usually defined in the short-term and/or stationary cases, to long-term processing is proposed. Experiments dealing with the first ten harmonics of voiced signals show that the proposed approach provides an efficient variable-rate representation of voiced speech signals. Promising results are given in terms of modeling accuracy, synthesis quality, and data compression. The interest of the presented approach for speech coding and speech watermarking is discussed.
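The core of the LT model, fitting one whole trajectory with a handful of discrete cosine functions, reduces to a small least-squares problem. The sketch below shows only that step, with plain (non-perceptual) least squares standing in for the paper's iterative, perceptually weighted fit; `order` and the toy trajectory are illustrative.

```python
import numpy as np

def lt_dct_fit(traj, order):
    """Fit one long-term trajectory (e.g. the amplitude of one harmonic
    across a whole voiced section) with `order` discrete cosine atoms."""
    n = len(traj)
    t = (np.arange(n) + 0.5) / n
    basis = np.cos(np.pi * np.outer(t, np.arange(order)))   # DCT-II atoms
    coef, *_ = np.linalg.lstsq(basis, traj, rcond=None)
    return coef, basis @ coef    # model coefficients, smoothed trajectory

# A 300-frame trajectory is compressed to `order` coefficients.
traj = 1.0 + 0.2 * np.sin(np.linspace(0, 3, 300))
coef, smooth = lt_dct_fit(traj, order=8)
```

The variable-rate property follows directly: slowly varying sections need few coefficients per voiced segment rather than a full parameter set per frame.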

8.
Distant acquisition of acoustic signals in an enclosed space often produces reverberant components due to acoustic reflections in the room. Speech dereverberation is in general desirable when the signal is acquired through distant microphones in such applications as hands-free speech recognition, teleconferencing, and meeting recording. This paper proposes a new speech dereverberation approach based on a statistical speech model. A time-varying Gaussian source model (TVGSM) is introduced as a model that represents the dynamic short-time characteristics of non-reverberant speech segments, including the time and frequency structures of the speech spectrum. With this model, dereverberation of the speech signal is formulated as a maximum-likelihood (ML) problem based on multichannel linear prediction, in which the speech signal is recovered by transforming the observed signal into one that is probabilistically more like non-reverberant speech. We first present a general ML solution based on the TVGSM, and derive several dereverberation algorithms based on various source models. Specifically, we present a source model consisting of a finite number of states, each of which is manifested by a short-time speech spectrum defined by a corresponding autocorrelation (AC) vector. The dereverberation algorithm based on this model involves a finite collection of spectral patterns that form a codebook. We confirm experimentally that both the time and frequency characteristics represented in the source models are very important for speech dereverberation, and that the prior knowledge represented by the codebook allows us to further improve the dereverberated speech quality. We also confirm that the quality of reverberant speech signals can be greatly improved, in terms of spectral shape and energy time-pattern distortions, using only a short speech signal and a speaker-independent codebook.

9.
Localized faults of rolling bearings can be diagnosed from their impulsive vibration signals. However, it is always a challenge to extract the impulsive feature under background noise and non-stationary conditions. This paper investigates impulsive-signal detection for a rolling bearing with a single-point defect and presents a novel data-driven detection approach based on dictionary learning. To overcome the effects of harmonic and noise components, we propose an autoregressive minimum-entropy deconvolution model to separate the harmonics and deconvolve the effect of the transmission path. To address the shortcomings of conventional sparse representation under changeable operating environments, we propose an approach that combines K-clustering with singular value decomposition (K-SVD) and split-Bregman to extract impulsive components precisely. Experiments on synthetic signals and real run-to-failure signals verify the effectiveness and robustness of the proposed approach for detecting different impulsive signals. Meanwhile, a comparison with state-of-the-art methods shows that the proposed approach can provide more accurately detected impulsive signals.

10.
This paper presents a novel method for the enhancement of the independent components of a mixed speech signal segregated by the frequency-domain independent component analysis (FDICA) algorithm. The enhancement algorithm proposed here is based on maximum a posteriori (MAP) estimation of the speech spectral components, using a generalized Gaussian distribution (GGD) as the statistical model for the time-frequency series of speech (TFSS). The proposed MAP estimator has been used and evaluated as the post-processing stage for the separation of convolutive mixtures of speech signals by the fixed-point FDICA algorithm. It has been found that combining the separation algorithm with the proposed enhancement algorithm provides better separation performance under both reverberant and non-reverberant conditions.
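For intuition, the GGD-prior MAP estimator has a closed form in the Laplacian special case (GGD shape parameter 1) with an additive Gaussian residual, where it reduces to soft thresholding of each time-frequency coefficient. This is a hedged special-case sketch, not the paper's general-shape estimator:

```python
import numpy as np

def map_ggd_shrink(y, noise_var, b):
    """MAP shrinkage for the Laplacian (GGD shape = 1) prior p(x) ~ exp(-|x|/b)
    under Gaussian residual noise: soft thresholding at noise_var / b,
    applied per (real-valued) time-frequency coefficient."""
    thr = noise_var / b
    return np.sign(y) * np.maximum(np.abs(y) - thr, 0.0)
```

The threshold grows with the residual noise power and shrinks as the speech prior becomes heavier-tailed, which is the qualitative behaviour a GGD-based MAP post-filter exploits.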

11.
This paper proposes a method for enhancing speech signals contaminated by room reverberation and additive stationary noise. The following conditions are assumed. 1) Short-time spectral components of speech and noise are statistically independent Gaussian random variables. 2) A room's convolutive system is modeled as an autoregressive system in each frequency band. 3) A short-time power spectral density of speech is modeled as an all-pole spectrum, while that of noise is assumed to be time-invariant and known in advance. Under these conditions, the proposed method estimates the parameters of the convolutive system and those of the all-pole speech model based on the maximum likelihood estimation method. The estimated parameters are then used to calculate the minimum mean square error estimates of the speech spectral components. The proposed method has two significant features. 1) The parameter estimation part performs noise suppression and dereverberation alternately. 2) Noise-free reverberant speech spectrum estimates, which are transferred by the noise suppression process to the dereverberation process, are represented in the form of a probability distribution. This paper reports the experimental results of 1500 trials conducted using 500 different utterances. The reverberation time RT60 was 0.6 s, and the reverberant signal-to-noise ratio was 20, 15, or 10 dB. The experimental results show the superiority of the proposed method over the sequential application of the noise suppression and dereverberation processes.

12.
Emotion recognition in speech signals is currently a very active research topic and has attracted much attention within the engineering application area. This paper presents a new approach to robust emotion recognition in speech signals in noisy environments. Using a weighted sparse representation model based on maximum likelihood estimation, an enhanced sparse representation classifier is proposed for robust emotion recognition in noisy speech. The effectiveness and robustness of the proposed method are investigated on clean and noisy emotional speech. The proposed method is compared with six typical classifiers: a linear discriminant classifier, K-nearest neighbor, a C4.5 decision tree, radial basis function neural networks, support vector machines, and the sparse representation classifier. Experimental results on two publicly available emotional speech databases, the Berlin database and the Polish database, demonstrate the promising performance of the proposed method on the task of robust emotion recognition in noisy speech, outperforming the other methods.
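For reference, the plain sparse representation classifier that the paper enhances with ML-based weighting can be sketched in a few lines: code the test vector sparsely over the training dictionary, then pick the class whose atoms give the smallest reconstruction residual. The L1 solver and `alpha` value below are my illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_predict(D, labels, x):
    """Plain sparse-representation classification (SRC) baseline.
    D: (d, n_atoms) training dictionary (columns = training vectors);
    labels: (n_atoms,) class label per atom; x: (d,) test vector."""
    code = Lasso(alpha=0.01, fit_intercept=False, max_iter=5000).fit(D, x).coef_
    best, cls = np.inf, None
    for c in np.unique(labels):
        part = np.zeros_like(code)
        part[labels == c] = code[labels == c]    # keep only class-c atoms
        r = np.linalg.norm(x - D @ part)         # class-wise residual
        if r < best:
            best, cls = r, c
    return cls
```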

13.
Applied Soft Computing, 2007, 7(1): 145-155
A neural-network-based model is developed to quantify speech intelligibility by blindly estimating the speech transmission index, an objective rating index for the speech intelligibility of transmission channels, from transmitted speech signals without resort to knowledge of the original speech signals. It consists of a Hilbert transform processor for speech envelope detection, a Welch average periodogram algorithm for envelope spectrum estimation, a principal components analysis (PCA) network for speech feature extraction, and a multi-layer back-propagation network for non-linear mapping and case generalisation. The developed model circumvents the use of artificial test signals by exploiting naturally occurring speech signals as probe stimuli, reduces the measurement channels from two to one, and hence facilitates in situ assessment of speech intelligibility. From a cognitive science viewpoint, the proposed method might be viewed as a successful paradigm of mimicking human perception of speech intelligibility using a hybrid model built around artificial neural networks.
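The first two stages of this pipeline (Hilbert-envelope detection and a Welch average periodogram of the envelope) are standard and can be sketched directly with SciPy; the PCA and back-propagation stages that follow are omitted, and `nperseg` is an illustrative value.

```python
import numpy as np
from scipy.signal import hilbert, welch

def envelope_spectrum(x, fs, nperseg=4096):
    """Front-end sketch: Hilbert-transform envelope detection followed by
    a Welch average periodogram of the envelope. The low-frequency
    modulation spectrum returned here is what the PCA and
    back-propagation stages of the described model would consume."""
    env = np.abs(hilbert(x))       # amplitude envelope of the speech signal
    env = env - np.mean(env)       # keep only the modulations
    f, pxx = welch(env, fs=fs, nperseg=nperseg)
    return f, pxx
```

Reverberation and noise flatten this modulation spectrum, which is why envelope features carry the intelligibility information that the speech transmission index summarizes.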

14.
15.
For Laplacian statistical-model-based speech enhancement in the DCT domain, this paper analyses the estimation error of the model factor and its influence on the overall enhancement performance of the algorithm. Based on the concept and properties of the generalized Gaussian distribution model and its shape parameter, a new method for estimating the Laplacian model factor is proposed. The method is structurally simple: it obtains the model-factor estimate indirectly from the correspondence, under the Laplacian model, between the variance of the speech component and the model factor. The algorithm not only effectively eliminates the influence of the noise component on the estimation accuracy, but also rapidly tracks changes in the speech component. Simulation results show that the speech enhancement algorithm based on this model-factor estimation method achieves better enhancement performance under a variety of noise backgrounds.
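The variance-to-factor correspondence the abstract relies on is the Laplacian moment relation: for a zero-mean Laplacian with scale b, the variance is 2b², so the model factor 1/b follows directly from a speech-variance estimate. A one-line sketch (the variance tracker that would feed it, e.g. a decision-directed estimator, is omitted):

```python
import numpy as np

def laplace_model_factor(speech_var):
    # Zero-mean Laplacian with scale b: variance = 2*b**2,
    # so the model factor lambda = 1/b = sqrt(2 / variance).
    return np.sqrt(2.0 / np.maximum(speech_var, 1e-12))
```

Estimating the factor through the variance, rather than fitting it to the noisy coefficients directly, is what keeps the noise component from biasing the estimate.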

16.
An improved genetic programming (GP) method is proposed in this paper to construct nonlinear models of speech signals, upon which speech coding is then accomplished. After preprocessing of the speech signals, the improved GP is used to construct the corresponding model of each speech frame. Then, by analyzing these models, a normalized model with generalization ability is obtained. Finally, speech coding is accomplished by optimizing the parameters of the normalized model with an optimization algorithm. Experiments demonstrate the feasibility of the improved GP for modeling speech signals, and show the superiority of the proposed method in speech coding through comparisons with linear predictive coding.

17.
A new approach is proposed for representing a time-limited, essentially bandpass signal x(t) by a set of discrete frequency values. The set of discrete frequency values is the set of locations along the frequency axis at which the real and/or imaginary parts of the Fourier transform of the signal x(t) cross certain levels (especially the zero level). Analogously, invoking time-frequency duality, a set of time instants denoting the zero/level crossings of a waveform x(t) can be used to represent a bandlimited spectrum X(f). The proposed signal representation is based on a simple bandpass signal model that exploits our prior knowledge of the bandwidth/timewidth of the signal. We call it a Sum-of-Sincs (SOS) model, where Sinc stands for the familiar sin(x)/x function. Given the discrete frequency/time locations, we can accurately reconstruct the signal x(t) or the spectrum X(f) by solving a simple eigenvalue or least squares problem. Using this approach as the basis, we propose an analysis/synthesis algorithm to decompose and represent complex multicomponent signals, like speech, over the entire time-frequency region. The proposed signal representation is an alternative to standard analog-to-discrete conversion based on the sampling theorem and, in principle, possesses some of the desirable attributes of signal representation in natural sensory systems.
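A minimal sketch of the reconstruction step: model the spectrum as a sum of sincs on a uniform grid and force it to vanish at the measured zero-crossing frequencies, which turns the coefficients into a null-space (smallest-singular-vector) problem. The uniform grid, the restriction to zero crossings, and the parameter names are simplifications of mine.

```python
import numpy as np

def sos_from_zero_crossings(fz, T, K):
    """Sum-of-Sincs sketch: model X(f) = sum_k c_k * sinc(T * (f - f_k))
    on K uniform centers f_k and require X(fz) = 0 at the measured
    zero-crossing frequencies `fz` (a numpy array). The coefficients are
    the direction of the null space of the constraint matrix, i.e. the
    right singular vector of the smallest singular value."""
    fk = np.linspace(fz.min(), fz.max(), K)        # sinc centers
    # np.sinc is the normalized sinc, sin(pi*x)/(pi*x).
    A = np.sinc(T * (fz[:, None] - fk[None, :]))   # constraints A @ c = 0
    _, _, vt = np.linalg.svd(A)
    c = vt[-1]                                     # minimal singular vector
    return fk, c
```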

18.
Looking at the speaker's face can be useful for hearing a speech signal better in a noisy environment and for extracting it from competing sources before identification. This suggests that the visual signals of speech (movements of the visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm that plugs the audiovisual coherence of speech signals, estimated by statistical tools, into audio blind source separation (BSS) techniques. This algorithm is applied to the difficult and realistic case of convolutive mixtures. The algorithm works mainly in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture for each frequency channel. Frequency-by-frequency separation is performed by an audio BSS algorithm. The audio and visual information is modeled by a newly proposed statistical model, which is then used to solve the standard source permutation and scale-factor ambiguities encountered for each frequency after the audio blind separation stage. The proposed method is shown to be efficient in the case of 2 × 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures.

19.
Two-microphone separation of speech mixtures.
Separation of speech mixtures, often referred to as the cocktail party problem, has been studied for decades. In many source separation tasks, the separation method is limited by the assumption of at least as many sensors as sources. Further, many methods require that the number of signals within the recorded mixtures be known in advance. In many real-world applications, these limitations are too restrictive. We propose a novel method for underdetermined blind source separation using an instantaneous mixing model, which assumes closely spaced microphones. Two source separation techniques are combined: independent component analysis (ICA) and binary time-frequency (T-F) masking. By estimating binary masks from the outputs of an ICA algorithm, it is possible, in an iterative way, to extract basis speech signals from a convolutive mixture. The basis signals are afterwards improved by grouping similar signals. Using two microphones, we can in principle separate an arbitrary number of mixed speech signals. We show separation results for mixtures with as many as seven speech signals under instantaneous conditions. We also show that the proposed method is applicable to segregating speech signals under reverberant conditions, and we compare it to another state-of-the-art algorithm. The number of source signals is not assumed to be known in advance, and it is possible to maintain the extracted signals as stereo signals.
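One iteration of the ICA-plus-binary-masking idea can be sketched as follows, assuming an instantaneous 2-microphone mixture; FastICA from scikit-learn stands in for whatever ICA algorithm the paper uses, and the iterative re-masking and grouping stages that let it peel off more than two sources are omitted.

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import FastICA

def ica_binary_mask_separate(mix, fs, nperseg=1024):
    """One iteration: ICA yields two output signals; each T-F cell is then
    assigned to whichever ICA output is locally stronger (binary masking).
    mix: (2, n_samples) instantaneous two-microphone mixture."""
    y = FastICA(n_components=2, random_state=0).fit_transform(mix.T).T
    _, _, Y0 = stft(y[0], fs=fs, nperseg=nperseg)
    _, _, Y1 = stft(y[1], fs=fs, nperseg=nperseg)
    mask0 = np.abs(Y0) > np.abs(Y1)           # binary T-F masks
    _, s0 = istft(Y0 * mask0, fs=fs, nperseg=nperseg)
    _, s1 = istft(Y1 * ~mask0, fs=fs, nperseg=nperseg)
    return s0, s1
```

The binary masks exploit the approximate disjointness of speech in the T-F plane, which is what allows more sources than microphones once the step is iterated on the masked outputs.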

20.
Detection and parameter estimation of multicomponent LFM signals based on the fractional Fourier transform
This paper introduces the basic principles and properties of the fractional Fourier transform and proposes a method for detecting multicomponent LFM signals and estimating their parameters based on the fractional Fourier transform. To deal with the mutual interference between multiple LFM components, and in particular the problem of strong components masking weak ones, the paper also proposes a detection and parameter estimation algorithm for multicomponent LFM signals that combines the idea of successive elimination with the fractional Fourier transform; it can detect weak LFM components and estimate their parameters in multicomponent LFM signals whose component strengths differ considerably. Simulation results demonstrate the effectiveness of the algorithm.
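The successive-elimination loop is easy to sketch. In the sketch below a dechirping grid search stands in for the fractional Fourier transform (a discrete FrFT is beyond a few lines), a complex/analytic input is assumed, and the names and grids are illustrative:

```python
import numpy as np

def successive_lfm(x, fs, rates, n_comp):
    # CLEAN-style loop: find the strongest chirp, reconstruct and subtract
    # it from the residual, repeat, so strong components no longer mask
    # weak ones.
    t = np.arange(len(x)) / fs
    resid = x.astype(complex)
    found = []
    for _ in range(n_comp):
        best = (-np.inf, 0.0, 0.0, 0j)
        for k in rates:                                # chirp rates in Hz/s
            z = resid * np.exp(-1j * np.pi * k * t ** 2)
            spec = np.fft.fft(z)
            i = int(np.argmax(np.abs(spec)))
            if np.abs(spec[i]) > best[0]:
                f0 = np.fft.fftfreq(len(x), 1.0 / fs)[i]
                best = (np.abs(spec[i]), k, f0, spec[i])
        _, k, f0, amp = best
        found.append((k, f0))
        # Reconstruct the detected component and remove it.
        comp = (amp / len(x)) * np.exp(1j * (2 * np.pi * f0 * t
                                             + np.pi * k * t ** 2))
        resid = resid - comp
    return found
```

Subtracting each detected component before searching again is what exposes weak chirps that the strongest component's sidelobes would otherwise bury.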
