Similar documents
 20 similar documents found (search time: 203 ms)
1.
We propose a speaker conversion method based on a speaker-independent model. Since phonetic information is shared by the speech of all speakers, we assume there exists a speaker-independent space that can be described by a Gaussian mixture model, and that the mapping from this space to each speaker-dependent space can be described by piecewise linear transformations. The model is trained on a multi-speaker database with a speaker-adaptive training algorithm; at conversion time, the transformations from the source and target speaker spaces to the speaker-independent space are used to construct the feature mapping between source and target, so that a speaker conversion system can be built quickly and flexibly. Subjective listening tests confirm the advantages of this algorithm over the traditional approach based on speaker-dependent models.

2.
Vocal tract normalization (VTN) is one of the speaker adaptation methods used in speech recognition; we studied it in noisy environments through a series of experiments. In our implementation, a method that selects the warping factor with a single-Gaussian mixture model was applied under noise for the first time, with good results. Experiments used the AURORA speech database and its car-noise test set. The results show that recognition with VTN improves on the baseline to varying degrees under every noise condition, that iterative training improves on a single pass of VTN, and that the best results appear in the third training iteration. In noise, the warping factor selected with a one-component Gaussian mixture model raised average sentence accuracy by about 1.68% over factors selected with larger mixture models. After VTN, a gender-independent model approaches the recognition accuracy of a gender-dependent model without VTN, and with sufficient training data the normalized gender-independent model can do better.
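The single-Gaussian warping-factor selection described above amounts to a maximum-likelihood grid search. The sketch below is a minimal numpy illustration, not the paper's exact front end: the piecewise warping function, feature dimensions, and grid of candidate factors are all assumptions.

```python
import numpy as np

def gaussian_loglik(X, mean, var):
    # Average diagonal-Gaussian log-likelihood of feature rows X.
    return float(np.mean(-0.5 * (np.log(2 * np.pi * var)
                                 + (X - mean) ** 2 / var).sum(axis=1)))

def warp_features(spec, alpha):
    # Linear frequency warp of each magnitude-spectrum row (a hypothetical
    # stand-in for warping the filterbank in a real recognizer front end).
    n = spec.shape[-1]
    src = np.clip(np.arange(n) * alpha, 0, n - 1)
    return np.stack([np.interp(src, np.arange(n), row) for row in spec])

def select_warp(spec, mean, var, grid=np.arange(0.88, 1.13, 0.02)):
    # Grid-search the warping factor that maximizes the likelihood of the
    # warped features under a single Gaussian trained on the reference data.
    scores = {a: gaussian_loglik(warp_features(spec, a), mean, var) for a in grid}
    return max(scores, key=scores.get)
```

A speaker whose spectrum matches the training data should receive a factor near 1.0, while a speaker whose spectrum is stretched toward lower bins should receive a factor above 1.0 to compensate.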

3.
Window functions reduce spectral energy leakage in speech signals. Because traditional speech windows have slow side-lobe attenuation and leak substantial spectral energy, which hampers the extraction of speaker recognition features, we preprocess the speech with a Hamming self-convolution window in place of the Hamming window. To further raise the recognition rate of the speaker system, the paper proposes an improved dynamic combined feature based on first- and second-order difference Mel-frequency cepstral coefficients (MFCC) computed with the Hamming self-convolution window. Simulation experiments with a Gaussian mixture model show that applying features extracted in this way to a speaker recognition system greatly improves the recognition rate compared with a traditional MFCC system.
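The first- and second-order difference MFCCs in the combined feature can be computed with the standard regression formula for delta coefficients. A minimal numpy sketch (the window half-width `N=2` and the input shapes are assumptions):

```python
import numpy as np

def delta(feat, N=2):
    # Regression-based delta coefficients over a window of +/-N frames,
    # the standard formula for dynamic (difference) features.
    T = feat.shape[0]
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.stack([
        sum(n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)) / denom
        for t in range(T)
    ])

def dynamic_features(mfcc):
    # Static + first-order + second-order difference coefficients,
    # stacked into one combined feature vector per frame.
    d1 = delta(mfcc)
    d2 = delta(d1)
    return np.hstack([mfcc, d1, d2])
```

For a frame sequence whose coefficients rise linearly by 1 per frame, the interior delta values come out as exactly 1.0, which is a quick sanity check on the formula.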

4.
王波  徐毅琼  李弼程 《计算机工程与设计》2007,28(10):2401-2402,2416
We propose a speaker segmentation algorithm for conversational speech that segments the test material using segment-level speech features. Based on Chebyshev's sum inequality, we derive a distance measure for segment-level features under a covariance model. A suitable speech-segment length for the segment-level features was chosen experimentally. The results show that the segment-level-feature method effectively segments multi-speaker conversational speech, improving both the accuracy and the speed of the speaker recognition system.

5.
何亮  刘加 《计算机应用》2011,31(8):2083-2086
To improve the performance of text-independent speaker recognition, we propose a system based on a linear log-likelihood kernel. The kernel compresses a spectral feature sequence with a Gaussian mixture model, turning the similarity between feature sequences into a distance between GMM parameters; from the distance expression, the polarization identity yields a mapping of feature sequences into a high-dimensional vector space, where a support vector machine (SVM) models each target speaker. Experiments on speaker recognition databases released by NIST show that the proposed kernel has excellent recognition performance.
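The polarization-identity step, recovering an inner product from squared norms, can be checked numerically. The vectors below are toy stand-ins for the GMM-parameter representations of two utterances (in the paper these come from compressing feature sequences with a GMM):

```python
import numpy as np

def polarization_kernel(u, v):
    # Polarization identity in a real inner-product space:
    #   <u, v> = (||u + v||^2 - ||u - v||^2) / 4
    return (np.sum((u + v) ** 2) - np.sum((u - v) ** 2)) / 4.0

rng = np.random.default_rng(1)
u, v = rng.standard_normal(8), rng.standard_normal(8)
assert np.isclose(polarization_kernel(u, v), np.dot(u, v))
```

This is why a distance expression between model parameters is enough to define a valid SVM kernel: the identity reconstructs the inner product the SVM needs.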

6.
Through research on voice conversion, we propose a method for converting the characteristics of a source speaker into those of a target speaker. The conversion features fall into two classes: (1) spectral parameters and (2) pitch and tone patterns. The signal model and conversion method are described for each. Spectral features are modeled with phoneme-based two-dimensional HMMs, and F0 trajectories represent pitch and tone. Pitch period, tone, and speaking rate are modified by pitch-synchronous overlap-add.

8.
For articulation disorders, this paper proposes vocal tract simulation as an aid to therapy. Because the vocal tract is a curved, three-dimensional, slowly time-varying acoustic tube in which sound propagates as plane waves, it can be modeled as a concatenation of cylindrical or elliptical tube sections with varying cross-sections. Formants are obtained in pole form on the basis of Newton interpolation. The tract is divided into 60 sections, and the area at each position is given by empirical formulas. Nine parameters describing the tract are defined and then optimized with the Corana algorithm, and a radiation model describes the sound after it leaves the lips. Finally, speech is synthesized for use in feedback therapy. Experiments show that this vocal tract simulation model can serve as a reference for designing suitable treatments.

9.
A variational problem is constructed to decompose an image into structure and texture. The structure component is modeled by a piecewise-smooth function (the Mumford-Shah model) and the texture component by an oscillatory function (the G space). Because the Mumford-Shah model describes structure explicitly as a piecewise-smooth function, constraining the gradient with the L2 norm away from structural edges, it largely avoids the staircase effect of total-variation (TV, the L1 norm of the gradient) decompositions; the oscillatory character of textures in the G space ensures that the recovered structure component contains less texture. Experiments show that the method decomposes better than both the classical TV-L2 and TV-G approaches.

10.
A speaker feature extraction method based on Duffing stochastic resonance   Cited by: 2 (self-citations: 0, other citations: 2)
潘平  何朝霞 《计算机工程与应用》2012,48(35):123-125,142
The extraction of speaker feature parameters directly affects the construction of the recognition model; the MFCC and LPC extraction methods take local low-frequency information and the global AR signal, respectively, as their main features. We propose a method for extracting speaker spectral features based on Duffing stochastic resonance. Simulation results show that the method can discriminate small spectral differences between speakers and effectively extract the basic features of a speaker's spectrum, thereby providing a finer model for speaker recognition.
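As a rough illustration of the detector dynamics only (not the paper's system), a bistable Duffing oscillator driven by a weak signal plus noise can be integrated with the Euler-Maruyama method; the equation form, damping, step size, and noise scaling below are all illustrative assumptions:

```python
import numpy as np

def duffing_response(signal, noise_std, delta=0.5, dt=1e-3, seed=0):
    # Euler-Maruyama integration of a bistable Duffing oscillator
    #   x'' + delta*x' - x + x^3 = signal(t) + white noise,
    # a common form used in stochastic-resonance signal detection.
    rng = np.random.default_rng(seed)
    x, v = 0.0, 0.0
    out = np.empty_like(signal)
    for i, s in enumerate(signal):
        a = -delta * v + x - x ** 3 + s \
            + noise_std * rng.standard_normal() / np.sqrt(dt)
        v += a * dt
        x += v * dt
        out[i] = x
    return out
```

In stochastic resonance, an appropriate noise level helps a weak periodic drive push the state between the two wells at x = -1 and x = +1, amplifying the signal's spectral signature.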

11.
Automatic recognition of children’s speech using acoustic models trained by adults results in poor performance due to differences in speech acoustics. These acoustical differences are a consequence of children having shorter vocal tracts and smaller vocal cords than adults. Hence, speaker adaptation needs to be performed. However, in real-world applications, the amount of adaptation data available may be less than what is needed by common speaker adaptation techniques to yield reasonable performance. In this paper, we first study, in the discrete frequency domain, the relationship between frequency warping in the front-end and corresponding transformations in the back-end. Three common feature extraction schemes are investigated and their transformation linearity in the back-end is discussed. In particular, we show that under certain approximations, frequency warping of MFCC features with Mel-warped triangular filter banks equals a linear transformation in the cepstral space. Based on that linear transformation, a formant-like peak alignment algorithm is proposed to adapt adult acoustic models to children’s speech. The peaks are estimated by Gaussian mixtures using the Expectation-Maximization (EM) algorithm [Zolfaghari, P., Robinson, T., 1996. Formant analysis using mixtures of Gaussians, Proceedings of International Conference on Spoken Language Processing, 1229–1232]. For limited adaptation data, the algorithm outperforms traditional vocal tract length normalization (VTLN) and maximum likelihood linear regression (MLLR) techniques.
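The formant-like peaks above are estimated by fitting Gaussian mixtures with EM. A minimal one-dimensional EM sketch follows; the quantile initialization and the synthetic two-peak data in the test are assumptions, not the cited procedure:

```python
import numpy as np

def em_gmm_1d(x, k, iters=100):
    # Plain EM for a one-dimensional Gaussian mixture.  In a peak-alignment
    # setting, the component means play the role of formant-like peak
    # positions along the frequency axis.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # deterministic init
    var = np.full(k, np.var(x))
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: component responsibilities for every sample.
        d = x[:, None] - mu[None, :]
        p = w * np.exp(-0.5 * d ** 2 / var) / np.sqrt(2.0 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk + 1e-8
    order = np.argsort(mu)
    return mu[order], var[order], w[order]
```

Run on samples drawn around two hypothetical formant frequencies, the sorted component means recover the two peak locations.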

12.
In this paper, speaker adaptive acoustic modeling is investigated by using a novel method for speaker normalization and a well known vocal tract length normalization method. With the novel normalization method, acoustic observations of training and testing speakers are mapped into a normalized acoustic space through speaker-specific transformations with the aim of reducing inter-speaker acoustic variability. For each speaker, an affine transformation is estimated with the goal of reducing the mismatch between the acoustic data of the speaker and a set of target hidden Markov models. This transformation is estimated through constrained maximum likelihood linear regression and then applied to map the acoustic observations of the speaker into the normalized acoustic space. Recognition experiments made use of two corpora, the first one consisting of adults’ speech, the second one consisting of children’s speech. Performing training and recognition with normalized data resulted in a consistent reduction of the word error rate with respect to the baseline systems trained on unnormalized data. In addition, the novel method always performed better than the reference vocal tract length normalization method adopted in this work. When unsupervised static speaker adaptation was applied in combination with each of the two speaker normalization methods, a different behavior was observed on the two corpora: in one case performance became very similar while in the other case the difference remained significant.
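The paper estimates each speaker's affine transform by constrained MLLR against target HMMs. As a much cruder stand-in that only illustrates the idea of mapping every speaker into a shared normalized space through a per-speaker affine transform y = Ax + b, a diagonal transform can be estimated by moment matching:

```python
import numpy as np

def affine_normalizer(feats, target_mean=0.0, target_std=1.0):
    # Per-speaker affine transform y = A x + b with diagonal A, estimated by
    # matching first and second moments.  (The paper estimates A, b by
    # constrained MLLR; moment matching is a simplified stand-in used here
    # only to illustrate normalization into a common acoustic space.)
    mu = feats.mean(axis=0)
    sd = feats.std(axis=0) + 1e-8
    A = np.diag(target_std / sd)
    b = target_mean - (target_std / sd) * mu
    return A, b

def apply_affine(feats, A, b):
    # Map this speaker's observations into the normalized space.
    return feats @ A.T + b
```

After the transform, every speaker's features share the same mean and scale, which is the inter-speaker variability reduction the normalized space is meant to provide.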

13.
Vocal tract length normalization (VTLN) for standard filterbank-based Mel frequency cepstral coefficient (MFCC) features is usually implemented by warping the center frequencies of the Mel filterbank, and the warping factor is estimated using the maximum likelihood score (MLS) criterion. A linear transform (LT) equivalent for frequency warping (FW) would enable more efficient MLS estimation. We recently proposed a novel LT to perform FW for VTLN and model adaptation with standard MFCC features. In this paper, we present the mathematical derivation of the LT and give a compact formula to calculate it for any FW function. We also show that our LT is closely related to different LTs previously proposed for FW with cepstral features, and these LTs for FW are all shown to be numerically almost identical for the sine-log all-pass transform (SLAPT) warping functions. Our formula for the transformation matrix is, however, computationally simpler and, unlike other previous LT approaches to VTLN with MFCC features, no modification of the standard MFCC feature extraction scheme is required. In VTLN and speaker adaptive modeling (SAM) experiments with the DARPA resource management (RM1) database, the performance of the new LT was comparable to that of regular VTLN implemented by warping the Mel filterbank, when the MLS criterion was used for FW estimation. This demonstrates that the approximations involved do not lead to any performance degradation. Performance comparable to front end VTLN was also obtained with LT adaptation of HMM means in the back end, combined with mean bias and variance adaptation according to the maximum likelihood linear regression (MLLR) framework. The FW methods performed significantly better than standard MLLR for very limited adaptation data (1 utterance), and were equally effective with unsupervised parameter estimation. We also performed speaker adaptive training (SAT) with feature space LT denoted CLTFW. 
Global CLTFW SAT gave results comparable to SAM and VTLN. By estimating multiple CLTFW transforms using a regression tree, and including an additive bias, we obtained significantly improved results compared to VTLN, with increasing adaptation data.

14.
To reduce the performance degradation that differences in vocal tract shape between speakers cause in speaker-independent speech recognition, two methods were studied: a frequency-warping speaker adaptation method based on maximum likelihood estimation, and a new speech feature based on the Mellin transform. Preliminary experiments on a speaker-independent isolated-word recognition system show that both methods improve the system's robustness to different speakers. Of the two, the Mellin-transform feature performs better: it not only improves recognition across speakers but also greatly reduces the spread of error rates among speakers.

15.
We propose a pitch synchronous approach to design the voice conversion system taking into account the correlation between the excitation signal and vocal tract system characteristics of speech production mechanism. The glottal closure instants (GCIs) also known as epochs are used as anchor points for analysis and synthesis of the speech signal. The Gaussian mixture model (GMM) is considered to be the state-of-art method for vocal tract modification in a voice conversion framework. However, the GMM based models generate overly-smooth utterances and need to be tuned according to the amount of available training data. In this paper, we propose the support vector machine multi-regressor (M-SVR) based model that requires less tuning parameters to capture a mapping function between the vocal tract characteristics of the source and the target speaker. The prosodic features are modified using epoch based method and compared with the baseline pitch synchronous overlap and add (PSOLA) based method for pitch and time scale modification. The linear prediction residual (LP residual) signal corresponding to each frame of the converted vocal tract transfer function is selected from the target residual codebook using a modified cost function. The cost function is calculated based on mapped vocal tract transfer function and its dynamics along with minimum residual phase, pitch period and energy differences with the codebook entries. The LP residual signal corresponding to the target speaker is generated by concatenating the selected frame and its previous frame so as to retain the maximum information around the GCIs. The proposed system is also tested using GMM based model for vocal tract modification. The average mean opinion score (MOS) and ABX test results are 3.95 and 85 for GMM based system and 3.98 and 86 for the M-SVR based system respectively. 
The subjective and objective evaluation results suggest that the proposed M-SVR based model for vocal tract modification combined with modified residual selection and epoch based model for prosody modification can provide a good quality synthesized target output. The results also suggest that the proposed integrated system performs slightly better than the GMM based baseline system designed using either epoch based or PSOLA based model for prosody modification.

16.
In this paper, we present a comparative analysis of artificial neural networks (ANNs) and Gaussian mixture models (GMMs) for design of voice conversion system using line spectral frequencies (LSFs) as feature vectors. Both the ANN and GMM based models are explored to capture nonlinear mapping functions for modifying the vocal tract characteristics of a source speaker according to a desired target speaker. The LSFs are used to represent the vocal tract transfer function of a particular speaker. Mapping of the intonation patterns (pitch contour) is carried out using a codebook based model at segmental level. The energy profile of the signal is modified using a fixed scaling factor defined between the source and target speakers at the segmental level. Two different methods for residual modification such as residual copying and residual selection methods are used to generate the target residual signal. The performance of ANN and GMM based voice conversion (VC) system are conducted using subjective and objective measures. The results indicate that the proposed ANN-based model using LSFs feature set may be used as an alternative to state-of-the-art GMM-based models used to design a voice conversion system.

17.
This paper presents a method for the estimation and mapping of parametric models of speech resonance at formants for voice conversion. The spectral features at formants that contribute to voice characteristics are the trajectories of the frequencies, the bandwidths and intensities of the resonance at formants. The formant features are extracted from the poles of a linear prediction (LP) model of speech. The statistical distributions of formants are modelled by a two-dimensional hidden Markov model (HMM) spanning the time and frequency dimensions. Experimental results are presented which show a close match between HMM-based formant models and the histograms of formants. For voice conversion two alternative methods are explored for mapping the formants of a source speaker to those of a target speaker. The first method is based on an adaptive formant-tracking warping of the frequency response of the LP model and the second method is based on the rotation of the poles of the LP model of speech. Both methods transform all spectral parameters of the resonance at formants of the source speaker towards those of the target speaker. In addition, the issues affecting the selection of the warping ratios for the mapping functions are investigated. Experimental results of formant estimation and perceptual evaluation of voice morphing based on parametric formant models are presented.

18.
This paper describes reliable methods for estimating relative vocal tract lengths from speech signals. Two proposed methods are based on the simple principle that resonant frequencies in an acoustic tube are inversely proportional to the tube length in cases where the configuration is constant. We estimated the ratio between two speakers' vocal tract lengths using first and second formant trajectories of the same words uttered by them. In the first method, which is referred to as "strict estimation method", we sought instances at which the gross structures of two vocal tracts are analogous by applying dynamic time-warping to formant-trajectories of common words that were uttered at different speeds. In those instances, which were found from among more than 100 common words by two speakers, an average formant ratio proved to be an excellent estimate (about ±0.1% in errors) for a reciprocal of the vocal tract length ratio. Next, we examined a simplified method for estimating those ratios using all corresponding points of two formant-trajectories: it is the "direct estimation method". Estimation errors in the direct estimation were evaluated to be about ±0.3% at equal utterance-speeds and ±2% at most, within 2.0 of the ratios of "fast" to "slow". Finally, we estimated relative vocal tract lengths for four Japanese speaker groups, whose members differed in terms of age and gender. Experimental results showed that the average vocal tract length of adult females and that of 7-10-year-old boys and girls are 21%, 27%, and 30%, respectively, shorter than adult males'.
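The direct-estimation idea, averaging formant ratios at corresponding time-aligned points to estimate the reciprocal of the vocal tract length ratio, reduces to a few lines. The formant values in the test are made up for illustration:

```python
import numpy as np

def vtl_length_ratio(formants_a, formants_b):
    # Resonances of a tube with fixed configuration scale inversely with its
    # length, so the ratio of speaker B's to speaker A's vocal tract length
    # is estimated as the mean ratio of A's to B's formant frequencies at
    # corresponding (time-aligned) points.
    ratios = np.asarray(formants_a, dtype=float) / np.asarray(formants_b, dtype=float)
    return float(np.mean(ratios))
```

If speaker B's formants are uniformly 25% higher than speaker A's, B's vocal tract is estimated to be 0.8 times as long as A's.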

19.
The new model reduces the impact of local spectral and temporal variability by estimating a finite set of spectral and temporal warping factors which are applied to speech at the frame level. Optimum warping factors are obtained while decoding in a locally constrained search. The model involves augmenting the states of a standard hidden Markov model (HMM), providing an additional degree of freedom. It is argued in this paper that this represents an efficient and effective method for compensating local variability in speech which may have potential application to a broader array of speech transformations. The technique is presented in the context of existing methods for frequency warping-based speaker normalization for ASR. The new model is evaluated in clean and noisy task domains using subsets of the Aurora 2, the Spanish Speech-Dat-Car, and the TIDIGITS corpora. In addition, some experiments are performed on a Spanish language corpus collected from a population of speakers with a range of speech disorders. It has been found that, under clean or not severely degraded conditions, the new model provides improvements over the standard HMM baseline. It is argued that the framework of local warping is an effective general approach to providing more flexible models of speaker variability.

20.
This work explores the use of speech enhancement for enhancing degraded speech, which may be useful for a text-dependent speaker verification system. The degradation may be due to noise or background speech. The text-dependent speaker verification is based on the dynamic time warping (DTW) method, so end point detection is necessary. End point detection can be performed easily if the speech is clean; however, degradation tends to introduce errors into the estimated end points, and this error propagates into the overall accuracy of the speaker verification system. Temporal and spectral enhancement is performed on the degraded speech so that, ideally, the nature of the enhanced speech is similar to clean speech. Results show that the temporal and spectral processing methods contribute to the task by eliminating the degradation, and improved accuracy is obtained for the text-dependent speaker verification system using DTW.
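The DTW template matching underlying such a verification system can be sketched with the classic dynamic-programming recursion; treating feature frames as rows and using a local Euclidean cost are assumptions for the sketch:

```python
import numpy as np

def dtw_distance(a, b):
    # Classic dynamic time warping between two feature sequences
    # (rows are frames), as used for template matching in
    # text-dependent speaker verification.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because the warping path may repeat frames at no extra cost, a sequence matches a time-stretched copy of itself with distance zero, which is exactly the tolerance to speaking-rate variation that makes DTW attractive here; accurate end points matter because extra leading or trailing frames inflate the accumulated cost.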


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号