Similar Documents
1.
This paper proposes a voice conversion method that combines the STRAIGHT model with deep belief networks (DBNs). First, the STRAIGHT model extracts spectral parameters from the speech of the source and target speakers, and these parameters are used to train two DBNs that capture speaker-specific information in a high-order feature space. An artificial neural network (ANN) then connects the two high-order spaces and performs the feature conversion. Finally, a DBN trained on the target speaker's data inversely maps the converted features back to spectral parameters, and the STRAIGHT model synthesizes speech carrying the target speaker's individual characteristics. Experimental results show that this method achieves better conversion than the conventional GMM-based approach: the converted speech is closer to the target speech in both quality and similarity.

2.
Voice conversion (VC), which morphs the voice of a source speaker so that it is perceived as that of a specified target speaker, can be used deliberately to deceive speaker identification (SID) and speaker verification (SV) systems that rely on speech biometrics. Spoofing attacks that use voice conversion to imitate a particular speaker therefore pose a potential threat to such systems. In this paper, we first present an experimental study evaluating the robustness of these systems against voice conversion disguise, using Gaussian mixture model (GMM) based SID systems, GMM with universal background model (GMM-UBM) based SV systems, and GMM supervector with support vector machine (GMM-SVM) based SV systems. Voice conversion is performed with three techniques: a GMM-based VC technique, weighted frequency warping (WFW), and a variant of WFW in which energy correction is disabled. Evaluation uses intra-gender and cross-gender conversions between fifty male and fifty female speakers taken from the TIMIT database. Vulnerability is measured by the degradation in the percentage of correct identification (POC) for the SID systems and by the degradation in equal error rate (EER) for the SV systems. Experimental results show that the GMM-SVM SV systems are more resilient to voice conversion spoofing attacks than the GMM-UBM SV systems, and that all SID and SV systems are more vulnerable to GMM-based conversion than to either WFW variant. The results also suggest that, in general, all SID and SV systems are slightly more robust to voices converted through cross-gender conversion than through intra-gender conversion. The study is extended to the CMU ARCTIC database, a parallel corpus, to examine the relationship between VC objective scores and SV system performance; the results suggest an approach to quantifying the objective score of a voice conversion that can be related to its ability to spoof an SV system.
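The SV degradation above is reported as equal error rate, the operating point where false acceptances and false rejections balance. A minimal sketch of computing EER from genuine and impostor score lists (the function name and the threshold sweep are illustrative, not the evaluation tool used in the paper):

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Equal error rate: the threshold where the false acceptance rate
    (impostors accepted) equals the false rejection rate (genuine rejected)."""
    genuine = np.asarray(genuine_scores, float)
    impostor = np.asarray(impostor_scores, float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(impostor >= t)   # false acceptance rate
        frr = np.mean(genuine < t)     # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```

A successful spoofing attack shifts impostor scores upward, which raises the EER computed this way.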

3.
Automatic speaker verification (ASV) automatically accepts or rejects a claimed identity based on a speech sample. Individual studies have recently confirmed the vulnerability of state-of-the-art text-independent ASV systems to replay, speech synthesis, and voice conversion attacks on various databases. However, the behaviour of text-dependent ASV systems under various spoofing attacks has not been systematically assessed. In this work, we first conduct a systematic analysis of text-dependent ASV systems under replay and voice conversion attacks using the same protocol and database, in particular the RSR2015 database, which represents mobile-device-quality speech. We then analyse the interplay of voice conversion and speaker verification by linking voice conversion objective evaluation measures with speaker verification error rates, to examine the vulnerabilities from the perspective of voice conversion.

4.
The objective of a voice conversion system is to formulate a mapping function that transforms the characteristics of a source speaker into those of a target speaker. In this paper, we propose a General Regression Neural Network (GRNN) based model for voice conversion. GRNN is a single-pass learning network, which makes the training procedure fast and comparatively less time-consuming. The proposed system uses the shape of the vocal tract, the shape of the glottal pulse (excitation signal), and long-term prosodic features to carry out the voice conversion task. The shape of the vocal tract and the shape of the source excitation of a particular speaker are represented using Line Spectral Frequencies (LSFs) and the Linear Prediction (LP) residual, respectively. GRNN is used to obtain the mapping function between the source and target speakers. Direct transformation of the time-domain residual using an Artificial Neural Network (ANN) causes phase changes and generates artifacts in consecutive frames; to alleviate this, wavelet packet decomposition coefficients are used to characterize the excitation of the speech signal. The long-term prosodic parameters, namely the pitch contour (intonation) and the energy profile of the test signal, are also modified in relation to those of the target (desired) speaker using the baseline method. The performance of the proposed model is compared to voice conversion systems based on the state-of-the-art RBF and GMM models using objective and subjective evaluation measures. The evaluation measures show that the proposed GRNN-based voice conversion system performs slightly better than the state-of-the-art models.
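The single-pass property of the GRNN comes from the fact that it is essentially kernel regression over the stored training pairs: "training" is just memorizing the data. A minimal sketch (class and parameter names are our own, not the paper's implementation):

```python
import numpy as np

class GRNN:
    """General Regression Neural Network: Nadaraya-Watson kernel
    regression over stored training pairs, learned in a single pass."""
    def __init__(self, sigma=0.5):
        self.sigma = sigma          # kernel bandwidth (the only tuning knob)

    def fit(self, X, Y):
        # "Training" is a single pass: store the source/target feature pairs.
        self.X = np.asarray(X, float)
        self.Y = np.asarray(Y, float)
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        # Squared distance from each query frame to each stored source frame
        d2 = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=-1)
        w = np.exp(-d2 / (2.0 * self.sigma ** 2))
        w /= w.sum(axis=1, keepdims=True)     # normalized kernel weights
        return w @ self.Y                     # weighted average of targets
```

Prediction is a weighted average of target frames, with weights given by the kernel similarity of the query to the stored source frames.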

5.
In this paper, we present a comparative analysis of artificial neural networks (ANNs) and Gaussian mixture models (GMMs) for the design of a voice conversion system using line spectral frequencies (LSFs) as feature vectors. Both the ANN and GMM based models are explored to capture nonlinear mapping functions for modifying the vocal tract characteristics of a source speaker according to a desired target speaker. The LSFs are used to represent the vocal tract transfer function of a particular speaker. Mapping of the intonation patterns (pitch contour) is carried out using a codebook based model at the segmental level. The energy profile of the signal is modified using a fixed scaling factor defined between the source and target speakers at the segmental level. Two different methods for residual modification, residual copying and residual selection, are used to generate the target residual signal. The performance of the ANN and GMM based voice conversion (VC) systems is evaluated using subjective and objective measures. The results indicate that the proposed ANN-based model using the LSF feature set may be used as an alternative to the state-of-the-art GMM-based models used to design voice conversion systems.

6.
This article puts forward a new voice conversion algorithm that both removes the need for a parallel corpus in the training phase and resolves the problem of an insufficient target-speaker corpus. The proposed approach builds on a recent voice conversion model combining the classical LPC analysis-synthesis model with a GMM. Through this algorithm, conversion functions among vowels and demi-syllables are derived. We assume that these functions are largely the same across speakers whose gender, accent, and language are alike. We are therefore able to produce the demi-syllables with access to only a few sentences from the target speaker, forming the GMM for one of his/her vowels. Evaluation of the proposed method shows that it can efficiently capture the speech features of the target speaker, and that it provides results comparable to those obtained with parallel-corpus-based approaches.

7.
In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, it degrades speech quality for two reasons: 1) appropriate spectral movements are not always produced by the frame-based conversion process, and 2) the converted spectra are excessively smoothed by statistical modeling. To address these problems, we propose a conversion method based on maximum-likelihood estimation of a spectral parameter trajectory. Not only static but also dynamic feature statistics are used to realize an appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the proposed method dramatically improves VC performance in terms of both speech quality and conversion accuracy for speaker individuality.
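The conventional frame-by-frame baseline that this paper improves upon is the joint-density GMM with minimum mean square error mapping: fit a GMM on stacked source/target frames, then convert each source frame to the responsibility-weighted conditional mean E[y|x]. A minimal illustrative sketch on toy 1-D data (function names and data are our own; this is the frame-wise baseline, not the authors' trajectory method):

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=2, seed=0):
    """Fit a GMM on joint frames z = [x; y] (time-aligned data assumed)."""
    Z = np.hstack([X, Y])
    return GaussianMixture(n_components=n_components,
                           covariance_type="full", random_state=seed).fit(Z)

def convert_frames(gmm, X, dx):
    """Frame-by-frame MMSE mapping E[y | x] under the joint GMM."""
    mu_x, mu_y = gmm.means_[:, :dx], gmm.means_[:, dx:]
    S = gmm.covariances_
    Sxx, Syx = S[:, :dx, :dx], S[:, dx:, :dx]
    # Responsibilities p(k | x) from the marginal density over x
    dens = np.stack([gmm.weights_[k] *
                     multivariate_normal.pdf(X, mu_x[k], Sxx[k])
                     for k in range(gmm.n_components)], axis=-1)
    resp = dens / dens.sum(axis=1, keepdims=True)
    out = np.zeros((X.shape[0], mu_y.shape[1]))
    for k in range(gmm.n_components):
        # Per-component conditional mean: mu_y + Syx Sxx^-1 (x - mu_x)
        A = Syx[k] @ np.linalg.inv(Sxx[k])
        out += resp[:, k:k + 1] * (mu_y[k] + (X - mu_x[k]) @ A.T)
    return out
```

Because each frame is converted independently, the output trajectory can move inappropriately between frames, which is exactly the motivation for the paper's trajectory-level maximum-likelihood formulation with dynamic features and global variance.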

8.
In this paper, we introduce the use of continuous prosodic features for speaker recognition, and we show how they can be modeled using joint factor analysis. Similar features have been successfully used in language identification. These prosodic features are pitch and energy contours spanning a syllable-like unit. They are extracted using a basis consisting of Legendre polynomials. Since the feature vectors are continuous (rather than discrete), they can be modeled using a standard Gaussian mixture model (GMM). Furthermore, speaker and session variability effects can be modeled in the same way as in conventional joint factor analysis. We find that the best results are obtained when we use the information about the pitch, energy, and the duration of the unit all together. Testing on the core condition of NIST 2006 speaker recognition evaluation data gives an equal error rate of 16.6% and 14.6%, with prosodic features alone, for all trials and English-only trials, respectively. When the prosodic system is fused with a state-of-the-art cepstral joint factor analysis system, we obtain a relative improvement of 8% (all trials) and 12% (English only) compared to the cepstral system alone.
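The Legendre-basis extraction described above amounts to least-squares fitting each syllable-level contour with low-order Legendre polynomials and keeping the coefficients as the feature vector. A minimal sketch using numpy's Legendre utilities (function names are illustrative):

```python
import numpy as np

def contour_to_legendre(contour, order=5):
    """Project a syllable-level pitch or energy contour onto the first
    order+1 Legendre polynomials; the coefficients are the features."""
    t = np.linspace(-1.0, 1.0, len(contour))   # normalize unit duration to [-1, 1]
    return np.polynomial.legendre.legfit(t, contour, deg=order)

def legendre_to_contour(coeffs, n_points):
    """Reconstruct an approximate contour from its Legendre coefficients."""
    t = np.linspace(-1.0, 1.0, n_points)
    return np.polynomial.legendre.legval(t, coeffs)
```

The first coefficient captures the contour's mean level and the second its slope, so a handful of coefficients summarizes the contour shape continuously, which is what makes standard GMM modeling applicable.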

9.
Statistical Approach for Voice Personality Transformation
A voice transformation method which changes the source speaker's utterances so as to sound similar to those of a target speaker is described. Speaker individuality transformation is achieved by altering the LPC cepstrum, average pitch period, and average speaking rate. The main objective of the work is to build a nonlinear relationship between the parameters for the acoustical features of two speakers, based on a probabilistic model. The conversion rules involve probabilistic classification and a cross-correlation probability between the acoustic features of the two speakers. The parameters of the conversion rules are estimated by maximum likelihood estimation on the training data. To obtain transformed speech signals which are perceptually closer to the target speaker's voice, prosody modification is also involved; it is achieved by excitation spectrum scaling and time-scale modification with appropriate modification factors. An evaluation by objective tests and informal listening tests clearly indicated the effectiveness of the proposed transformation method. We also confirmed that the proposed method leads to smoothly evolving spectral contours over time, which, from a perceptual standpoint, produced results superior to conventional vector quantization (VQ)-based methods.

10.
Electronic voice disguise refers to altering a speaker's individual characteristics with voice-changing devices or speech processing software in order to deliberately hide the speaker's identity. Restoring electronically disguised speech means converting the disguised voice back to the original by technical means, which is important for speech-based identity verification. This paper casts the restoration of frequency-domain and time-domain disguised speech as the estimation of a disguise factor, estimates that factor with an i-vector based automatic speaker verification method, and introduces a symmetric transformation to further improve the estimate. Exploiting the noise robustness of i-vectors, the method improves the accuracy of disguise factor estimation in realistic noisy conditions and thereby improves the restoration of electronically disguised speech under noise. The i-vector extractor is trained on the clean TIMIT corpus and the method is tested on the noisy VoxCeleb1 corpus. Results show that the disguise factor estimation error rate drops from 9.19% for the baseline system to 4.49%, and the restored speech also improves in automatic speaker verification equal error rate and in perceptual quality.
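The core idea that a frequency-domain disguise can be undone by applying the inverse factor can be illustrated with a toy linear frequency warp of a magnitude spectrum (this sketch is our own illustration of the disguise/restore principle, not the paper's i-vector estimation method; the function name is hypothetical):

```python
import numpy as np

def warp_spectrum(mag, alpha):
    """Linear frequency warping of a magnitude spectrum by factor alpha;
    a disguise applied with factor alpha is approximately undone by
    warping again with 1/alpha."""
    n = len(mag)
    # Each output bin reads from the (interpolated) original bin k/alpha
    src = np.clip(np.arange(n) / alpha, 0, n - 1)
    return np.interp(src, np.arange(n), mag)
```

The hard part, which the paper addresses, is estimating the unknown factor alpha from the disguised speech alone; here the i-vector based speaker verification scores serve as the criterion for picking the factor whose inverse warp best restores the original speaker.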

11.
Based on a study of voice conversion, this paper proposes a method for converting the characteristics of a source speaker into those of a target speaker. The conversion features fall into two categories: (1) spectral parameters, and (2) pitch and tone patterns. The signal model and the conversion method are described separately. Spectral features are modeled with phoneme-based two-dimensional HMMs, and the F0 trajectory is used to represent pitch and tone. Pitch period, tone, and speaking rate are transformed using the pitch-synchronous overlap-add method.

13.
We propose a pitch synchronous approach to designing a voice conversion system that takes into account the correlation between the excitation signal and the vocal tract system characteristics of the speech production mechanism. The glottal closure instants (GCIs), also known as epochs, are used as anchor points for analysis and synthesis of the speech signal. The Gaussian mixture model (GMM) is considered the state-of-the-art method for vocal tract modification in a voice conversion framework. However, GMM based models generate overly smooth utterances and need to be tuned according to the amount of available training data. In this paper, we propose a support vector machine multi-regressor (M-SVR) based model that requires fewer tuning parameters to capture a mapping function between the vocal tract characteristics of the source and the target speaker. The prosodic features are modified using an epoch based method and compared with the baseline pitch synchronous overlap and add (PSOLA) based method for pitch and time scale modification. The linear prediction (LP) residual signal corresponding to each frame of the converted vocal tract transfer function is selected from the target residual codebook using a modified cost function. The cost function is calculated from the mapped vocal tract transfer function and its dynamics, along with the minimum residual phase, pitch period, and energy differences with the codebook entries. The LP residual signal for the target speaker is generated by concatenating the selected frame and its previous frame so as to retain the maximum information around the GCIs. The proposed system is also tested using a GMM based model for vocal tract modification. The average mean opinion score (MOS) and ABX test results are 3.95 and 85 for the GMM based system and 3.98 and 86 for the M-SVR based system, respectively. The subjective and objective evaluation results suggest that the proposed M-SVR based model for vocal tract modification, combined with modified residual selection and an epoch based model for prosody modification, can provide good quality synthesized target output. The results also suggest that the proposed integrated system performs slightly better than the GMM based baseline system designed using either the epoch based or the PSOLA based model for prosody modification.

14.
Improved MFCC Parameter Extraction for Speaker Recognition
Mel-frequency cepstral coefficients (MFCCs) are the most widely used speech features in speaker recognition. This paper proposes an improved MFCC extraction method: in the FFT step of the conventional pipeline, the spectrum is reconstructed with noise compensation so that it approximates the spectrum of clean speech, giving the features good noise robustness. Experiments show that MFCC parameters extracted with this improvement noticeably raise the recognition rate of a speaker recognition system, especially in low signal-to-noise-ratio environments.
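For reference, the conventional single-frame MFCC pipeline that this improvement modifies can be sketched as follows (all function names are illustrative); the paper's noise-compensated spectrum reconstruction would slot in where the plain FFT magnitude is computed:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """MFCC of one windowed frame: |FFT| -> mel filterbank -> log -> DCT.
    The paper's method would replace this plain |FFT| with a
    noise-compensated reconstructed spectrum."""
    spec = np.abs(np.fft.rfft(frame))
    fb = mel_filterbank(n_filters, len(frame), sr)
    energies = np.maximum(fb @ spec, 1e-10)   # floor to avoid log(0)
    return dct(np.log(energies), type=2, norm="ortho")[:n_ceps]
```

Because the noisy magnitude spectrum feeds directly into the filterbank and log, any compensation that pushes it toward the clean-speech spectrum propagates straight through to more robust cepstral features.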

15.
Speech separation is an essential part of voice-driven systems such as speaker recognition, speech recognition, and hearing aids. Applying speech separation at the front end of such a system increases its performance. In this paper we propose a system for single-channel speech separation that combines empirical mode decomposition (EMD) with multi-pitch information. The proposed method is completely unsupervised and requires no knowledge of the underlying speakers. We apply EMD to short frames of the mixed speech for better estimation of speech-specific information, which is derived through multi-pitch tracking. To track multiple pitches in the mixed signal, we apply simple inverse filtering tracking and histogram-based pitch estimation to the excitation source information, while also estimating the number of speakers present in the mixture.
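The pitch information that drives the separation can be illustrated with the most basic frame-level estimator, autocorrelation peak-picking within a plausible pitch range (this sketch is a stand-in for the simple-inverse-filtering and histogram-based tracking used in the paper; the function name is our own):

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Single-frame pitch estimate: pick the autocorrelation peak
    within the lag range corresponding to [fmin, fmax] Hz."""
    x = np.asarray(frame, float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```

In the multi-speaker case, the mixture contains several periodicities at once; the paper handles this by building a histogram of candidate pitch values over the excitation signal, from which both the pitch tracks and the number of speakers are inferred.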

16.
In this paper, a frame linear predictive coding spectrum (FLPCS) technique for speaker identification is presented. Linear predictive coding (LPC) has traditionally been applied in many speech recognition applications; in this study, a modification of LPC termed FLPCS is proposed for speaker identification. The analysis procedure consists of feature extraction and voice classification. In the feature extraction stage, representative characteristics are extracted using the FLPCS technique; through this approach, the size of a speaker's feature vector can be reduced while keeping an acceptable recognition rate. In the classification stage, the general regression neural network (GRNN) and the Gaussian mixture model (GMM) are applied because of their rapid response and simplicity of implementation. In the experimental investigation, FLPCS coefficients of different orders derived from the LPC spectrum are compared with one another, and the capabilities of GRNN and GMM are analysed. The experimental results show that GMM achieves a better recognition rate with FLPCS feature extraction, and that it completes training and identification in a very short time.
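The LPC analysis underlying the FLPCS features models each frame as an all-pole filter excited by prediction error. A minimal sketch of the standard autocorrelation method and the resulting LPC spectrum (function names are illustrative; the paper's FLPCS modification operates on this spectrum):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order):
    """LPC by the autocorrelation method: solve the Toeplitz normal
    equations R a = r so that s[n] is predicted from its `order`
    previous samples."""
    x = np.asarray(frame, float)
    r = np.array([x[:len(x) - k] @ x[k:] for k in range(order + 1)])
    return solve_toeplitz(r[:-1], r[1:])   # predictor coefficients a

def lpc_spectrum(a, n_fft=512):
    """Magnitude response of the all-pole model 1 / A(z),
    A(z) = 1 - sum_k a[k] z^{-(k+1)}."""
    A = np.fft.rfft(np.concatenate(([1.0], -a)), n_fft)
    return 1.0 / np.maximum(np.abs(A), 1e-12)
```

The LPC spectrum summarizes the vocal tract envelope with just `order` coefficients, which is what lets FLPCS shrink the speaker feature vector while retaining discriminative information.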

17.
In this paper, we propose two kinds of modifications in speaker recognition. First, the correlations between frequency channels are of prime importance for speaker recognition, and some of these correlations are lost when the frequency domain is divided into sub-bands. Consequently, we propose a particularly redundant parallel architecture in which most of the correlations are kept. Second, in the classical spectrum calculation, the log transformation used to modify the power spectrum is generally applied after the filter bank; we show that performing this transformation before the filter bank is more advantageous in our case. For recognition, the Gaussian mixture model (GMM) algorithm is adopted. Experiments on speech corrupted by noise show that this approach adapts better to noisy environments than a conventional system, especially when pruning of some recognizers is performed.

19.
To recognize different speech signals more accurately in noisy environments, this paper proposes a self-optimizing voice activity detection (VAD) algorithm for pervasive speech environments. The algorithm operates on the speech signals of a personalized spoken-command recognition system and can effectively separate an individual speaker's voice from a mixture of several speakers, detecting the speaker's signal by tracking the high-amplitude portion of the speech power spectrum and adaptively suppressing noise. An automatic speech recognition (ASR) scheme that handles multiple speakers is designed and implemented; it requires no prior estimate of clean speech variation, instead deriving the speech/non-speech decision threshold directly from the noise itself to complete the self-optimization. Evaluation on the NOIZEUS speech database shows that the proposed blind source separation and noise suppression require no additional computation, effectively reducing the computational burden.
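The key idea of deriving the speech/non-speech threshold from the noise itself, with no clean-speech prior, can be sketched in its simplest energy-based form (a minimal illustration of the self-thresholding principle, not the paper's power-spectrum tracking algorithm; names and constants are our own):

```python
import numpy as np

def self_threshold_vad(signal, frame_len=256, alpha=3.0):
    """Energy VAD whose decision threshold is derived from the signal's
    own quietest frames (taken as the noise floor)."""
    n = len(signal) // frame_len
    frames = np.reshape(signal[:n * frame_len], (n, frame_len))
    log_e = np.log(np.maximum((frames ** 2).mean(axis=1), 1e-12))
    noise_floor = np.percentile(log_e, 10)      # quietest 10% of frames ≈ noise
    return log_e > noise_floor + np.log(alpha)  # speech if well above the floor
```

Because the threshold is re-estimated from the observed frames, the detector adapts to the ambient noise level without any prior calibration, which is the self-optimizing behaviour the abstract describes.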

20.
To address the strong noise in pilot communications in airborne environments, this paper proposes a digital voice processing system based on an FPGA + DSP architecture. The system consists of an analog part and a digital part: the analog part performs matching, filtering, amplification, and AD/DA conversion of the voice signal, while the digital part implements an audio processing algorithm that performs voice activity detection, noise suppression, and speech enhancement. Test results show that the system effectively suppresses communication noise and enhances the voice signal, improving the intelligibility and comfort of pilot communications.
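The noise suppression stage in systems of this kind is often a spectral subtraction variant; a minimal sketch of the classical form, shown here in Python for clarity even though the described system runs on FPGA/DSP hardware (function name and constants are our own, and the abstract does not specify the exact algorithm used):

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, noise_frames=5, beta=0.01):
    """Magnitude spectral subtraction: estimate the noise spectrum from the
    first few (assumed speech-free) frames and subtract it frame by frame,
    keeping the noisy phase and imposing a small spectral floor."""
    n = len(noisy) // frame_len
    frames = np.reshape(noisy[:n * frame_len], (n, frame_len))
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    noise_mag = mag[:noise_frames].mean(axis=0)          # noise estimate
    clean_mag = np.maximum(mag - noise_mag, beta * noise_mag)  # floor residual
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), frame_len, axis=1)
    return clean.reshape(-1)
```

The VAD stage described in the abstract typically gates this processing, updating the noise estimate only during non-speech frames so that the subtraction tracks the changing cockpit noise.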
