Similar Documents
19 similar documents found.
1.
A Pitch Detection Algorithm Based on the Linear Prediction Residual Cepstrum
Pitch detection has long been a research focus in audio processing, but the vocal tract characteristics of the speech signal strongly affect the pitch and its harmonic structure, which makes detection harder. Since the LP residual retains essentially only the glottal excitation, cepstral analysis of the residual avoids the influence of the vocal tract characteristics and of noise. To address the pitch halving/doubling errors and the low-frequency truncation problem that often arise in cepstral analysis, a harmonic product spectrum (HPS) scheme is introduced, which improves detection accuracy. Experiments show that the method largely avoids halving/doubling errors, gives satisfactory results even for telephone-channel speech whose low and high frequencies have been cut off, and, as a frame-by-frame technique, meets the needs of real-time applications.
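A minimal illustrative sketch of the approach outlined above, not the paper's exact algorithm: the frame is inverse-filtered with LPC coefficients to obtain the LP residual, a cepstral peak gives a pitch candidate, and a harmonic product spectrum (HPS) of the residual spectrum is used to resolve halving/doubling. The frame windowing, LPC order, FFT size and the 60-400 Hz search range are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coeffs(x, order):
    """Autocorrelation-method LPC coefficients [1, a1, ..., ap] via Levinson-Durbin."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def detect_f0(frame, fs, order=12, nfft=2048, fmin=60.0, fmax=400.0):
    """Pitch of one voiced frame from the cepstrum of the LP residual, HPS-checked."""
    # 1) LP residual: inverse filtering removes most of the vocal-tract envelope.
    a = lpc_coeffs(frame * np.hamming(len(frame)), order)
    resid = lfilter(a, [1.0], frame)

    # 2) Cepstral candidate: peak quefrency inside the plausible pitch-lag range.
    spec = np.abs(np.fft.rfft(resid * np.hamming(len(resid)), nfft)) + 1e-12
    ceps = np.fft.irfft(np.log(spec))
    qlo, qhi = int(fs / fmax), int(fs / fmin)
    f0_cep = fs / (qlo + np.argmax(ceps[qlo:qhi]))

    # 3) HPS of the residual spectrum as a cross-check against halving/doubling.
    nh = len(spec) // 4
    hps = spec[:nh].copy()
    for h in (2, 3, 4):
        hps *= spec[::h][:nh]
    klo, khi = int(fmin * nfft / fs), int(fmax * nfft / fs)
    f0_hps = (klo + np.argmax(hps[klo:khi])) * fs / nfft

    # Keep whichever of {f0, f0/2, 2*f0} from the cepstrum agrees best with the HPS.
    cands = np.array([f0_cep, 0.5 * f0_cep, 2.0 * f0_cep])
    return float(cands[np.argmin(np.abs(cands - f0_hps))])
```

In practice the per-frame estimates would also be smoothed across frames (e.g. median filtering), which the sketch omits.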

2.
Pitch period detection has long been a research focus in audio processing, and accurately detecting the pitch period is in practice rather difficult. A pitch period detection method combining the LPC residual with the SCMDSF is proposed. The algorithm places emphasis on filtering and pre-processing the speech to be analysed, extracts the LPC residual of the speech signal to remove the vocal tract response, applies the SCMDSF computation to the resulting residual signal, and obtains the pitch period of the speech. Experiments show that this approach can extract the pitch period fairly accurately in noisy environments.

3.
To process speech signals effectively and reduce their redundancy, endpoint detection is commonly used to extract the valid portion of the speech. Building on conventional endpoint detection methods, this paper proposes a method that uses the pitch period to judge where a speech segment ends. Since Mandarin syllables end in voiced sounds, and the pitch period is particularly sensitive to voiced segments, the method effectively skips the useless information contained in the drawn-out tail of Mandarin utterances, which both improves the accuracy of endpoint detection and reduces the training time of the subsequent speech recognition system. Experimental results show that the method gives good endpoint detection for the trailing segments of isolated Mandarin words.
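A rough sketch of the tail-refinement idea under stated assumptions (the paper's exact decision rule is not reproduced): after a conventional endpoint decision, the end boundary is pulled back to the last frame that is both above an energy floor and shows a credible voicing cue, so the weak drawn-out tail after the final voiced portion is discarded. The frame/hop sizes, the normalized-autocorrelation voicing test and both thresholds are assumptions.

```python
import numpy as np

def refine_endpoint(x, fs, frame_len=0.025, hop=0.010,
                    energy_floor_db=-35.0, voicing_thresh=0.4,
                    fmin=60.0, fmax=400.0):
    """Return the refined end sample: the end of the last voiced, non-silent frame."""
    n, h = int(frame_len * fs), int(hop * fs)
    lo, hi = int(fs / fmax), int(fs / fmin)
    peak = np.max(np.abs(x)) + 1e-12
    last_end = len(x)
    for start in range(0, len(x) - n, h):
        frame = x[start:start + n]
        # Frame level relative to the utterance peak, in dB.
        e_db = 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)) / peak + 1e-12)
        # Normalized autocorrelation peak in the pitch-lag range as a voicing cue.
        f = frame - np.mean(frame)
        ac = np.correlate(f, f, mode="full")[n - 1:]
        voicing = np.max(ac[lo:hi]) / (ac[0] + 1e-12)
        if e_db > energy_floor_db and voicing > voicing_thresh:
            last_end = start + n
    return last_end  # samples; everything after this is treated as the trailing tail
```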

4.
An Application of Multiwavelet Filters to Pitch Period Extraction
The pitch period is one of the important characteristic parameters of the voice signal and is widely used in the clinical detection of pathological voices. A set of biorthogonal multiwavelet filters is derived, and a pitch period detection method combining the multiwavelet filters with autocorrelation is proposed. A short sentence containing the six Mandarin vowels is constructed; the multiwavelet filters are used to extract the low-frequency component of the speech signal, and the autocorrelation method is then applied to detect the pitch period. The results show that the method offers high correctness, high accuracy and strong noise robustness, outperforming the autocorrelation method and the single-wavelet-plus-autocorrelation method in various noise environments. In particular, under heavy noise it shows clear noise immunity, is not affected by the non-stationarity of the speech signal, and can effectively extract the pitch period of pathological voices.

5.
A new pitch detection method based on recurrence analysis of the nonlinear dynamics of the speech signal is proposed. The pitch is computed from the recurrence count; a method of accurately distinguishing voiced from unvoiced sounds using the product of the recurrence rate and the computed pitch is studied, and a correction method for fluctuating pitch values is given. Experimental results show that the method outperforms the conventional autocorrelation and cepstrum methods, especially for segments whose voiced/unvoiced characteristics are not obvious, and that it also adapts well to noisy speech.

6.
A Threshold-Free U/V Decision and Pitch Detection Algorithm
Based on an experimental study of the pitch detection performance of the autocorrelation function (ACF) and the average magnitude difference function (AMDF), a threshold-free unvoiced/voiced decision and pitch detection algorithm is proposed. For each speech frame, the algorithm computes the AMDF and the autocorrelation of the LPC residual signal (LACF), and compares the pitch estimates obtained by the two methods to produce the unvoiced/voiced decision and, for voiced frames, the pitch period. Only a single logical comparison is needed and no threshold has to be set. Speech coding/decoding simulations with several vocoders show that the decision and detection algorithm is highly accurate and robust to noise.
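A loose sketch of the comparison idea only, with two simplifications that differ from the paper: the autocorrelation is computed on the frame itself rather than on the LPC residual, and "the two estimates agree" is implemented as a 10% lag tolerance, standing in for the paper's single threshold-free logical comparison. The search range and tolerance are illustrative.

```python
import numpy as np

def uv_and_pitch(frame, fs, fmin=60.0, fmax=400.0, tol=0.10):
    """Return (is_voiced, pitch_period_in_samples or None) from AMDF and ACF lags."""
    f = frame - np.mean(frame)
    n = len(f)
    lo, hi = int(fs / fmax), int(fs / fmin)   # frame must span at least fs/fmin samples

    # AMDF: a true pitch lag appears as a deep valley.
    amdf = np.array([np.mean(np.abs(f[k:] - f[:n - k])) for k in range(hi)])
    lag_amdf = lo + np.argmin(amdf[lo:hi])

    # ACF: a true pitch lag appears as a strong peak.
    ac = np.correlate(f, f, mode="full")[n - 1:]
    lag_acf = lo + np.argmax(ac[lo:hi])

    # Voiced if the two independent estimates agree; their mean is the pitch period.
    if abs(lag_amdf - lag_acf) <= tol * max(lag_amdf, lag_acf):
        return True, 0.5 * (lag_amdf + lag_acf)
    return False, None
```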

7.
Voiced/unvoiced detection of speech under low SNR and non-stationary noise is an important problem in speech signal processing. Based on the sinusoidal model of speech, this paper proposes an algorithm for voiced/unvoiced classification and voiced-harmonic extraction. After analysing the third-order cumulant spectrum of the speech, the method obtains the pitch with a subharmonic-to-harmonic approach and computes the harmonic parameters and the high/low-frequency energy ratio. A spectral envelope estimator yields the spectral envelope and peak signal, and an iterative algorithm under the minimum mean square estimation criterion computes the SNR of the speech harmonics. By combining these results, the degree of voicing of each frame is obtained, giving the voiced/unvoiced classification and the number of voiced harmonics. Simulation results show that, against complex noise backgrounds, the algorithm classifies speech effectively and estimates the degree of voicing accurately, while offering good real-time behaviour and high precision in speech parameter analysis.

8.
To address voiced-speech separation in monaural speech separation, a method for accurately estimating the pitch period is proposed. First, using the short-time stationarity of speech and the continuity of the pitch period as cues, a pitch period spectrogram is constructed from the cepstral peaks of the speech signal and the pitch track is extracted automatically. Then, using the property that harmonic frequencies are integer multiples of the fundamental frequency, the spectra of the individual harmonics are picked. Finally, the voiced speech is reconstructed by the inverse Fourier transform. Experimental results show that the method accurately extracts the pitch track and effectively separates the voiced signal.
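A simplified sketch of the harmonic-picking and reconstruction steps, assuming the cepstral pitch track has already been extracted and is supplied as f0_track (one F0 value per frame, 0 for unvoiced frames); the frame length, hop and the half-F0 bandwidth kept around each harmonic are illustrative assumptions.

```python
import numpy as np

def reconstruct_voiced(x, fs, f0_track, frame_len=1024, hop=256, bw_hz=None):
    """Keep only spectral bins near integer multiples of each frame's F0, then overlap-add."""
    win = np.hanning(frame_len)
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    for i, start in enumerate(range(0, len(x) - frame_len, hop)):
        if i >= len(f0_track):
            break
        f0 = f0_track[i]
        spec = np.fft.rfft(x[start:start + frame_len] * win)
        if f0 > 0:  # voiced frame: build a mask around each harmonic k*f0
            half_bw = (bw_hz if bw_hz is not None else 0.5 * f0) / 2.0
            mask = np.zeros(len(freqs), dtype=bool)
            for h in np.arange(f0, fs / 2.0, f0):
                mask |= np.abs(freqs - h) <= half_bw
            spec = spec * mask
        else:       # unvoiced frame: contributes nothing to the voiced stream
            spec = np.zeros_like(spec)
        rec = np.fft.irfft(spec, frame_len) * win
        out[start:start + frame_len] += rec
        norm[start:start + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)
```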

9.
李永宁 《福建电脑》2008,24(11):92-93
Extracting the pitch period from voiced speech with a pitch detection algorithm benefits speech processing tasks such as speech synthesis, speech coding and speech recognition. By framing the time-domain speech signal and computing the short-time autocorrelation function, the pitch period of voiced speech can be obtained, realising pitch period detection.
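The procedure described above (framing followed by picking the short-time autocorrelation peak) can be sketched in a few lines; the frame length, hop and 60-400 Hz lag range are illustrative, and no voiced/unvoiced decision or post-smoothing is shown.

```python
import numpy as np

def acf_pitch_track(x, fs, frame_len=0.032, hop=0.010, fmin=60.0, fmax=400.0):
    """Per-frame pitch period (in samples) from the short-time autocorrelation peak."""
    n, h = int(frame_len * fs), int(hop * fs)
    lo, hi = int(fs / fmax), int(fs / fmin)   # frame must span at least fs/fmin samples
    periods = []
    for start in range(0, len(x) - n, h):
        frame = x[start:start + n] - np.mean(x[start:start + n])
        ac = np.correlate(frame, frame, mode="full")[n - 1:]   # lags 0..n-1
        periods.append(lo + np.argmax(ac[lo:hi]))              # lag of the strongest peak
    return np.array(periods)                                   # pitch in Hz = fs / period
```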

10.
To address the poor robustness of conventional pitch detection methods at low SNR, an improved pitch detection method combined with speech enhancement is proposed. Multi-band spectral subtraction based on auditory masking reduces the background noise of the noisy speech, yielding relatively clean speech; the enhanced speech is then used as the input for pitch detection, and a multi-threshold decision based on the energy-zero product and energy-zero ratio performs the voiced/unvoiced classification; finally, the baseline method of the autocorrelation function (ACF) weighted by the average magnitude difference function (AMDF) is improved to achieve accurate pitch detection. Theory and simulation show that at an SNR of -10 dB the method still detects the pitch period accurately, with clearly improved robustness.
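A sketch of the last two stages only, under stated assumptions: the multi-band spectral subtraction front end is omitted, a crude energy/zero-crossing gate stands in for the paper's multi-threshold voiced/unvoiced decision, and the autocorrelation is weighted by the reciprocal of the AMDF before peak picking. All thresholds and ranges are illustrative.

```python
import numpy as np

def pitch_amdf_weighted_acf(frame, fs, fmin=60.0, fmax=400.0,
                            energy_thresh=1e-4, zcr_thresh=0.3):
    """Pitch period in samples for one frame, or None if the frame is judged unvoiced."""
    f = frame - np.mean(frame)
    n = len(f)
    lo, hi = int(fs / fmax), int(fs / fmin)   # frame must span at least fs/fmin samples

    # Stand-in voiced/unvoiced gate (not the paper's multi-threshold rule):
    # voiced frames tend to have high energy and a low zero-crossing rate.
    energy = np.mean(f ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2.0
    if energy < energy_thresh or zcr > zcr_thresh:
        return None

    # Autocorrelation weighted by the reciprocal of the AMDF: the AMDF valley and the
    # ACF peak coincide at the true pitch lag, so the weighting sharpens that peak.
    ac = np.correlate(f, f, mode="full")[n - 1:]
    amdf = np.array([np.mean(np.abs(f[k:] - f[:n - k])) for k in range(hi)])
    weighted = ac[:hi] / (amdf + 1e-8)
    return lo + np.argmax(weighted[lo:hi])
```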

11.
We propose a pitch synchronous approach to designing a voice conversion system that takes into account the correlation between the excitation signal and the vocal tract system characteristics of the speech production mechanism. The glottal closure instants (GCIs), also known as epochs, are used as anchor points for analysis and synthesis of the speech signal. The Gaussian mixture model (GMM) is considered to be the state-of-the-art method for vocal tract modification in a voice conversion framework. However, GMM based models generate overly smooth utterances and need to be tuned according to the amount of available training data. In this paper, we propose a support vector machine multi-regressor (M-SVR) based model that requires fewer tuning parameters to capture a mapping function between the vocal tract characteristics of the source and the target speaker. The prosodic features are modified using an epoch based method and compared with the baseline pitch synchronous overlap and add (PSOLA) based method for pitch and time scale modification. The linear prediction residual (LP residual) signal corresponding to each frame of the converted vocal tract transfer function is selected from the target residual codebook using a modified cost function. The cost function is calculated from the mapped vocal tract transfer function and its dynamics, along with the minimum residual phase, pitch period and energy differences with the codebook entries. The LP residual signal corresponding to the target speaker is generated by concatenating the selected frame and its previous frame so as to retain the maximum information around the GCIs. The proposed system is also tested using a GMM based model for vocal tract modification. The average mean opinion score (MOS) and ABX test results are 3.95 and 85 for the GMM based system and 3.98 and 86 for the M-SVR based system, respectively. The subjective and objective evaluation results suggest that the proposed M-SVR based model for vocal tract modification, combined with modified residual selection and an epoch based model for prosody modification, can provide good quality synthesized target output. The results also suggest that the proposed integrated system performs slightly better than the GMM based baseline system designed using either the epoch based or the PSOLA based model for prosody modification.

12.
In this work, we have developed a speech mode classification (SMC) model for improving the performance of a phone recognition system (PRS). We have explored vocal tract system, excitation source and prosodic features for developing the SMC model. These features are extracted from the voiced regions of a speech signal. In this study, conversation, extempore, and read speech are considered as three different modes of speech. The vocal tract component of speech is extracted using Mel-frequency cepstral coefficients (MFCCs). The excitation source features are captured through Mel power differences of spectrum in sub-bands (MPDSS) and residual Mel-frequency cepstral coefficients (RMFCCs) of the speech signal. The prosody information is extracted from pitch and intensity. Speech mode classification models are developed using the above features independently and in fusion. Experiments were carried out on a Bengali speech corpus to analyze the accuracy of the SMC model using an artificial neural network (ANN), naive Bayes, support vector machines (SVMs) and k-nearest neighbors (KNN). We propose four classification models which are combined using a maximum voting approach for optimal performance. From the results, it is observed that the SMC model developed using the fusion of vocal tract system, excitation source and prosodic features of speech yields the best performance of 98%. Finally, the proposed speech mode classifier is integrated into the PRS, and the accuracy of the phone recognition system is observed to improve by 11.08%.
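The maximum-voting combination of the four classifiers can be sketched with scikit-learn, assuming the fused feature vectors are already available as a matrix X with mode labels y; the estimators' hyperparameters are illustrative, not those of the paper.

```python
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def build_smc_model():
    """Hard (majority) voting over ANN, naive Bayes, SVM and KNN mode classifiers."""
    estimators = [
        ("ann", make_pipeline(StandardScaler(),
                              MLPClassifier(hidden_layer_sizes=(64,), max_iter=500))),
        ("nb",  make_pipeline(StandardScaler(), GaussianNB())),
        ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ]
    return VotingClassifier(estimators=estimators, voting="hard")

# Usage sketch: X is the (n_samples, n_features) fused feature matrix and y holds the
# mode labels {"read", "extempore", "conversation"}.
# model = build_smc_model()
# print(cross_val_score(model, X, y, cv=5).mean())
```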

13.
The intelligibility of speech transmitted through low-rate coders is severely degraded when high levels of acoustic noise are present in the acoustic environment. Recent advances in nonacoustic sensors, including microwave radar, skin vibration, and bone conduction sensors, provide the exciting possibility of both glottal excitation and, more generally, vocal tract measurements that are relatively immune to acoustic disturbances and can supplement the acoustic speech waveform. We are currently investigating methods of combining the output of these sensors for use in low-rate encoding according to their capability in representing specific speech characteristics in different frequency bands. Nonacoustic sensors have the ability to reveal certain speech attributes lost in the noisy acoustic signal; for example, low-energy consonant voice bars, nasality, and glottalized excitation. By fusing nonacoustic low-frequency and pitch content with acoustic-microphone content, we have achieved significant intelligibility performance gains using the DRT across a variety of environments over the government standard 2400-bps MELPe coder. By fusing quantized high-band 4-to-8-kHz speech, requiring only an additional 116 bps, we obtain further DRT performance gains by exploiting the ear's insensitivity to fine spectral detail in this frequency region.

14.
To enrich the reading voice of the reading robot (JoyT0n), a voice transformation system based on a single speech database is designed. The initial voice synthesised by the robot's TTS (text-to-speech) engine is decomposed into an excitation signal and a vocal tract filter signal, which are converted to the frequency domain for modification. Speaking rate, pitch and timbre are transformed by reconstructing the excitation signal from the short-time Fourier magnitude spectrum and by modifying the vocal tract filter parameters. The modified excitation and vocal tract filter signals are then re-synthesised into a new voice signal. The system allows the reading robot to read aloud with rich emotions and intonations without increasing the size of the speech database.

15.
Prosody modification involves changing the pitch and duration of speech without affecting the message and naturalness. This paper proposes a method for prosody (pitch and duration) modification using the instants of significant excitation of the vocal tract system during the production of speech. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations like onset of burst in the case of nonvoiced speech. Instants of significant excitation are computed from the linear prediction (LP) residual of speech signals by using the property of average group-delay of minimum phase signals. The modification of pitch and duration is achieved by manipulating the LP residual with the help of the knowledge of the instants of significant excitation. The modified residual is used to excite the time-varying filter, whose parameters are derived from the original speech signal. Perceptual quality of the synthesized speech is good and is without any significant distortion. The proposed method is evaluated using waveforms, spectrograms, and listening tests. The performance of the method is compared with linear prediction pitch synchronous overlap and add (LP-PSOLA) method, which is another method for prosody manipulation based on the modification of the LP residual. The original and the synthesized speech signals obtained by the proposed method and by the LP-PSOLA method are available for listening at http://speech.cs.iitm.ernet.in/Main/result/prosody.html.

16.
The objective of a voice conversion system is to formulate the mapping function which can transform the source speaker characteristics to those of the target speaker. In this paper, we propose a General Regression Neural Network (GRNN) based model for voice conversion. It is a single pass learning network that makes the training procedure fast and comparatively less time consuming. The proposed system uses the shape of the vocal tract, the shape of the glottal pulse (excitation signal) and long term prosodic features to carry out the voice conversion task. In this paper, the shape of the vocal tract and the shape of the source excitation of a particular speaker are represented using Line Spectral Frequencies (LSFs) and the Linear Prediction (LP) residual respectively. GRNN is used to obtain the mapping function between the source and target speakers. The direct transformation of the time domain residual using an Artificial Neural Network (ANN) causes phase changes and generates artifacts in consecutive frames. To alleviate this, wavelet packet decomposed coefficients are used to characterize the excitation of the speech signal. The long term prosodic parameters, namely the pitch contour (intonation) and the energy profile of the test signal, are also modified in relation to those of the target (desired) speaker using the baseline method. The relative performance of the proposed model is compared to voice conversion systems based on the state-of-the-art RBF and GMM models using objective and subjective evaluation measures. The evaluation measures show that the proposed GRNN based voice conversion system performs slightly better than the state-of-the-art models.
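A GRNN is essentially Nadaraya-Watson kernel regression, so the core source-to-target mapping can be sketched in a few lines of numpy. This is a sketch only: the spread sigma, the use of time-aligned LSF vectors as features, and the omission of the residual and prosody processing are assumptions made for illustration.

```python
import numpy as np

class GRNN:
    """Minimal general regression neural network: Gaussian-kernel weighted average of the
    training targets (the single-pass 'training' is just storing the training pairs)."""

    def __init__(self, sigma=0.5):
        self.sigma = sigma

    def fit(self, X_src, Y_tgt):
        self.X = np.asarray(X_src, dtype=float)   # aligned source frames (e.g. LSFs)
        self.Y = np.asarray(Y_tgt, dtype=float)   # corresponding target frames
        return self

    def predict(self, X_new):
        X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
        out = np.empty((len(X_new), self.Y.shape[1]))
        for i, x in enumerate(X_new):
            d2 = np.sum((self.X - x) ** 2, axis=1)
            w = np.exp(-d2 / (2.0 * self.sigma ** 2))
            w /= np.sum(w) + 1e-12
            out[i] = w @ self.Y                    # weighted average of target vectors
        return out

# Usage sketch: src_lsf and tgt_lsf are time-aligned (e.g. by DTW) LSF matrices of the
# source and target speakers:
# converted = GRNN(sigma=0.3).fit(src_lsf, tgt_lsf).predict(test_lsf)
```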

17.
The glottal excitation signal is the source signal of speech and can be used for effective extraction of speech feature parameters. Two methods for obtaining the glottal excitation from observed speech are studied: the linear prediction method and the cepstrum method. Computer simulations on actually recorded speech compare the performance and characteristics of the two methods. The results show that the cepstrum method obtains the glottal excitation, and extracts excitation parameters such as the pitch period from it, with high accuracy, but at a relatively large computational cost; the linear prediction method, thanks to efficient algorithms, not only obtains the glottal excitation quickly but also yields important parameters such as the vocal tract model parameters and the speech power spectrum at the same time, and is therefore the common method for obtaining the glottal excitation.
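The linear-prediction route can be sketched as follows, assuming librosa is available for the LPC fit and that x is a float waveform (e.g. from librosa.load); the frame length, overlap and the LPC order rule of thumb are illustrative, and the cepstral alternative discussed above is not shown.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def lp_residual(x, fs, order=None, frame_len=0.03, hop=0.015):
    """Approximate glottal excitation: frame-wise LPC inverse filtering of the speech."""
    order = order or 2 + fs // 1000          # common rule of thumb for the LPC order
    n, h = int(frame_len * fs), int(hop * fs)
    win = np.hanning(n)
    resid = np.zeros(len(x))
    norm = np.zeros(len(x))
    for start in range(0, len(x) - n, h):
        frame = x[start:start + n]
        if np.max(np.abs(frame)) < 1e-8:
            continue                          # skip silent frames (LPC is ill-conditioned)
        a = librosa.lpc(frame * win, order=order)   # coefficients [1, a1, ..., ap]
        e = lfilter(a, [1.0], frame) * win          # inverse filter A(z) gives the residual
        resid[start:start + n] += e
        norm[start:start + n] += win
    return resid / np.maximum(norm, 1e-8)
```

The residual keeps the pitch pulses of the glottal excitation, so feeding it to an autocorrelation or cepstrum pitch tracker yields the pitch period largely free of vocal-tract influence.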

18.
In our previous works, we have explored articulatory and excitation source features to improve the performance of phone recognition systems (PRSs) using read speech corpora. In this work, we have extended the use of articulatory and excitation source features for developing PRSs for extempore and conversation modes of speech, in addition to read speech. It is well known that the overall performance of a speech recognition system heavily depends on the accuracy of phone recognition. Therefore, the objective of this paper is to enhance the accuracy of phone recognition systems using articulatory and excitation source features in addition to conventional spectral features. The articulatory features (AFs) are derived from the spectral features using feedforward neural networks (FFNNs). We have considered five AF groups, namely: manner, place, roundness, frontness and height. Five different AF-based tandem PRSs are developed using the combination of Mel frequency cepstral coefficients (MFCCs) and AFs derived from FFNNs. Hybrid PRSs are developed by combining the evidences from the AF-based tandem PRSs using a weighted combination approach. The excitation source information is derived by processing the linear prediction residual of the speech signal. The vocal tract information is captured using MFCCs. The combination of vocal tract and excitation source features is used for developing PRSs. The PRSs are developed using hidden Markov models. A Bengali speech database is used for developing PRSs for read, extempore and conversation modes of speech. The results are analyzed and the performance is compared across the different modes of speech. From the results, it is observed that using either articulatory or excitation source features in addition to MFCCs improves the performance of PRSs in all three modes of speech. The improvement obtained using AFs is much higher than the improvement obtained using excitation source features.

19.
In this work we develop a speaker recognition system based on excitation source information and demonstrate its significance by comparing it with a vocal tract information based system. The speaker-specific excitation information is extracted by subsegmental, segmental and suprasegmental processing of the LP residual. The speaker-specific information from each level is modeled independently using the Gaussian mixture model-universal background model (GMM-UBM) framework and then combined at the score level. The significance of the proposed speaker recognition system is demonstrated by conducting speaker verification experiments on the NIST-03 database. Two different tests, namely a Clean test and a Noisy test, are conducted. In the Clean test, the test speech signal is used as it is for verification. In the Noisy test, the test speech is corrupted by factory noise (9 dB) and then used for verification. Although in the Clean test the proposed source based speaker recognition system performs worse than the vocal tract based system, its relative performance is better in the Noisy test. Finally, for both clean and noisy cases, by providing different and robust speaker-specific evidence, the proposed system helps the vocal tract system to further improve the overall performance.
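A toy sketch of the scoring and score-level fusion idea using scikit-learn GMMs; the MAP adaptation of speaker models from the UBM is omitted (each model is simply fit by EM), the feature extraction is assumed to have been done elsewhere, and the fusion weight alpha is an illustrative assumption.

```python
from sklearn.mixture import GaussianMixture

def train_gmm(feats, n_components=64):
    """Fit a diagonal-covariance GMM on a set of feature frames (rows = frames)."""
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           max_iter=100).fit(feats)

def llr_score(test_feats, speaker_gmm, ubm_gmm):
    """Average per-frame log-likelihood ratio: speaker model versus the UBM."""
    return speaker_gmm.score(test_feats) - ubm_gmm.score(test_feats)

def fused_score(test_mfcc, test_residual_feats, models, alpha=0.6):
    """Score-level fusion of the vocal tract (MFCC) and excitation source subsystems."""
    s_vt  = llr_score(test_mfcc, models["spk_mfcc"], models["ubm_mfcc"])
    s_src = llr_score(test_residual_feats, models["spk_res"], models["ubm_res"])
    return alpha * s_vt + (1.0 - alpha) * s_src   # accept if above a decision threshold
```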
