期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

黄文娜彭亚雄贺松《计算机应用》2017,37(3):906-910

为了改善发声力度对说话人识别系统性能的影响,在训练语音存在少量耳语、高喊语音数据的前提下,提出了使用最大后验概率（MAP）和约束最大似然线性回归（CMLLR）相结合的方法来更新说话人模型、投影转换说话人特征。其中,MAP自适应方法用于对正常语音训练的说话人模型进行更新,而CMLLR特征空间投影方法则用来投影转换耳语、高喊测试语音的特征,从而改善训练语音与测试语音的失配问题。实验结果显示,采用MAP+CMLLR方法时,说话人识别系统等错误率（EER）明显降低,与基线系统、最大后验概率（MAP）自适应方法、最大似然线性回归（MLLR）模型投影方法和约束最大似然线性回归（CMLLR）特征空间投影方法相比,MAP+CMLLR方法的平均等错率分别降低了75.3%、3.5%、72%和70.9%。实验结果表明,所提出方法削弱了发声力度对说话人区分性的影响,使说话人识别系统对于发声力度变化更加鲁棒。相似文献

2.

Voice conversion using General Regression Neural Network

《Applied Soft Computing》2014

The objective of voice conversion system is to formulate the mapping function which can transform the source speaker characteristics to that of the target speaker. In this paper, we propose the General Regression Neural Network (GRNN) based model for voice conversion. It is a single pass learning network that makes the training procedure fast and comparatively less time consuming. The proposed system uses the shape of the vocal tract, the shape of the glottal pulse (excitation signal) and long term prosodic features to carry out the voice conversion task. In this paper, the shape of the vocal tract and the shape of source excitation of a particular speaker are represented using Line Spectral Frequencies (LSFs) and Linear Prediction (LP) residual respectively. GRNN is used to obtain the mapping function between the source and target speakers. The direct transformation of the time domain residual using Artificial Neural Network (ANN) causes phase change and generates artifacts in consecutive frames. In order to alleviate it, wavelet packet decomposed coefficients are used to characterize the excitation of the speech signal. The long term prosodic parameters namely, pitch contour (intonation) and the energy profile of the test signal are also modified in relation to that of the target (desired) speaker using the baseline method. The relative performances of the proposed model are compared to voice conversion system based on the state of the art RBF and GMM models using objective and subjective evaluation measures. The evaluation measures show that the proposed GRNN based voice conversion system performs slightly better than the state of the art models. 相似文献

3.

基于发声机理与人耳感知特性的说话人识别

杜晓青于凤芹《计算机工程》2013,(11):197-199,204

Mel频率倒谱系数（MFCC）与线性预测倒谱系数（LPCC）融合算法只能反映语音静态特征,且LPCC对语音低频局部特征描述不足。为此,提出将希尔伯特黄变换（HHT）倒谱系数与相对光谱一感知线性预测倒谱系数（RASTA—PLPCC）融合,得到一种既反映发声机理又体现人耳感知特性的说话人识别算法。HHT倒谱系数体现发声机理,能反映语音动态特性,并更好地描述信号低频局部特征,可改进LPCC的不足。PLPCC体现人耳感知特性,识别性能强于MFCC,用3种融合算法对两者进行融合,将融合特征用于高斯混合模型进行说话人识别。仿真实验结果表明,该融合算法较已有的MFCC与LPCC融合算法识别率提高了8．0％。相似文献

4.

Vocal emotion recognition in five native languages of Assam using new wavelet features

Aditya Bihar Kandali Aurobinda Routray Tapan Kumar Basu 《International Journal of Speech Technology》2009,12(1):1-13

The present work investigates the following specific research questions concerning voice emotion recognition: whether vocal emotion expressions of discrete emotion (i) can be distinguished from no-emotion (i.e. neutral), (ii) can be distinguished from another, (iii) of surprise, which is actually a cognitive component that could be present with any emotion, can also be recognized as distinct emotion, (iv) can be recognized cross-lingually. This study will enable us to get more information regarding nature and function of emotion. Furthermore, this work will help in developing a generalized voice emotion recognition system, which will increase the efficiency of human-machine interaction systems. In this work an emotional utterance database is created with 140 acted utterances per speaker consisting of short sentences of six full-blown basic emotions and neutral of five native languages of Assam. This database is validated by a Listening Test. Four feature sets are extracted based on WPCC2 (Wavelet-Packet-Cepstral-Coefficients computed by method 2), MFCC (Mel-Frequency-Cepstral-Coefficients), tfWPCC2 (Teager-energy-operated-in-Transform-domain WPCC2) and tfMFCC. The Gaussian Mixture Model (GMM) is used as classifier. The performances of all these feature sets are compared in respect of accuracy of classification in two experiments: (i) text-and-speaker independent vocal emotion recognition in individual languages, and (ii) cross-lingual vocal emotion recognition. tfWPCC2 is a new wavelet feature set proposed by the same authors in one of their recent papers in a National Seminar in India as cited in References. 相似文献

5.

基于连续小波和支持向量机的病态嗓音检测

颜景斌《电脑与信息技术》2008,16(3):21-23

声学分析是一种非常有前景的嗓音病理诊断方法,它采用连续小波分析方法提取嗓音特征参数.文章提出了一种基于SVM的病态嗓音分类算法,通过选择径向基函数RBF,可使分类的正确率达到97%. 相似文献

6.

Speaker identification using vowels features through a combined method of formants,wavelets, and neural network classifiers

《Applied Soft Computing》2015

This paper proposes a new method for speaker feature extraction based on Formants, Wavelet Entropy and Neural Networks denoted as FWENN. In the first stage, five formants and seven Shannon entropy wavelet packet are extracted from the speakers’ signals as the speaker feature vector. In the second stage, these 12 feature extraction coefficients are used as inputs to feed-forward neural networks. Probabilistic neural network is also proposed for comparison. In contrast to conventional speaker recognition methods that extract features from sentences (or words), the proposed method extracts the features from vowels. Advantages of using vowels include the ability to recognize speakers when only partially-recorded words are available. This may be useful for deaf-mute persons or when the recordings are damaged. Experimental results show that the proposed method succeeds in the speaker verification and identification tasks with high classification rate. This is accomplished with minimum amount of information, using only 12 coefficient features (i.e. vector length) and only one vowel signal, which is the major contribution of this work. The results are further compared to well-known classical algorithms for speaker recognition and are found to be superior. 相似文献

7.

说话人聚类的初始类生成方法

赖松轩李艳雄《计算机工程与应用》2017,53(3):149-153

目前说话人聚类时将说话人分割后的语音段作为初始类,直接对这些数量庞大语音段进行聚类的计算量非常大。为了降低说话人聚类时的计算量,提出一种面向说话人聚类的初始类生成方法。提取说话人分割后语音段的特征参数及特征参数的质心,结合层次聚类法和贝叶斯信息准则,对语音段进行具有宽松停止准则的“预聚类”,生成初始类。与直接对说话人分割后的语音段进行聚类的方法相比,该方法能在保持原有聚类性能的情况下,减少40.04%的计算时间;在允许聚类性能略有下降的情形下,减少60.03%以上的计算时间。相似文献

8.

病理嗓音发声系统的非对称建模研究

陶智曾晓亮顾玲玲张晓俊吴迪薛隆基《数据采集与处理》2016,31(2):260-267

为了在病理嗓音识别中为特征参数选择提供依据,提出声带非对称力学建模仿真病变声带并进行分析研究。依据声带的分层结构和组织特性,建立声带力学模型,耦合声门气流,求取模型输出的声门源激励波形。采用遗传粒子群拟牛顿结合优化算法(Genetic particle swarm optimization based on quasi-Newton method, GPSO-QN)将模型输出的声门源和实际目标声门波相匹配,提取优化模型参数。仿真实验结果表明,该声带模型能产生与实际声门源相一致的声门波形,同时也证明了左右声带生理组织间的非对称性是产生病理嗓音的重要原因。相似文献

9.

Wavelet basis selection for enhanced speech parametrization in speaker verification

Todor Ganchev Mihalis Siafarikas Iosif Mporas Tsenka Stoyanova 《International Journal of Speech Technology》2014,17(1):27-36

相似文献

10.

利用EHMM和CLR的说话人分割聚类算法 总被引：1，自引：0，他引：1

凌锦雯陆伟刘青松张琨磊《小型微型计算机系统》2012,33(6):1389-1392

针对传统的说话人分割聚类系统中,由于聚类时话者信息不足而影响切分准确度的问题,本文提出了一种基于进化隐马尔科夫模型和交叉对数似然比距离测度的多层次说话人分割聚类算法,在传统的话者分割聚类算法的基础上引入了重分割和重聚类的机制,以及基于距离测度和贝叶斯信息准则的分层聚类算法,有效的解决了传统方法中切分准确度受到话者信息制约的问题.在美国国家标准技术署(NIST)2003 Spring RT数据库上的实验结果表明,本文提出的算法比传统算法系统性能相对提高了41%. 相似文献

11.

Vocal fatigue induced by prolonged oral reading: Analysis and detection

《Computer Speech and Language》2014,28(2):453-466

This article uses prolonged oral reading corpora for various experiments to analyze and detect vocal fatigue. Vocal fatigue particularly concerns voice professionals, including teachers, telemarketing operators, users of automatic speech recognition technology and actors. In analyzing and detecting vocal fatigue, we focused our investigations on three main experiments: a prosodic analysis that can be compared to the results found in related work, a two-class Support Vector Machines (SVM) classifier into Fatigue and Non-Fatigue states using a large set of audio features and a comparison function that estimates the difference in fatigue level between two speech segments using a combination of multiple phoneme-based comparison functions. The experiments on prosodic analysis showed that vocal fatigue was not associated with an increase in fundamental frequency and voice intensity. A two-class SVM classifier using the Paralinguistic Challenge 2010 audio feature set gave an unweighted accuracy of 94.1% for the training set (10-fold cross-validation) and 68.2% for the test set. These results show that the phenomenon of vocal fatigue can be modeled and detected. The comparison function was assessed by detecting increased fatigue levels between two speech segments. The fatigue level detection performance in Equal Error Rate (EER) was 31% using all phonetic segments and yielded EER of 21% after filtering phonetic segments and 19% after filtering phonetic segments and cepstral features. These results show that some phonemes are more sensitive than others to vocal fatigue. These experiments show that the fatigued voice has specific characteristics for prolonged oral reading and suggest the feasibility of vocal fatigue detection. 相似文献

12.

Comparing ANN and GMM in a voice conversion framework

R.H. Laskar D. Chakrabarty F.A. Talukdar K. Sreenivasa Rao K. Banerjee 《Applied Soft Computing》2012,12(11):3332-3342

In this paper, we present a comparative analysis of artificial neural networks (ANNs) and Gaussian mixture models (GMMs) for design of voice conversion system using line spectral frequencies (LSFs) as feature vectors. Both the ANN and GMM based models are explored to capture nonlinear mapping functions for modifying the vocal tract characteristics of a source speaker according to a desired target speaker. The LSFs are used to represent the vocal tract transfer function of a particular speaker. Mapping of the intonation patterns (pitch contour) is carried out using a codebook based model at segmental level. The energy profile of the signal is modified using a fixed scaling factor defined between the source and target speakers at the segmental level. Two different methods for residual modification such as residual copying and residual selection methods are used to generate the target residual signal. The performance of ANN and GMM based voice conversion (VC) system are conducted using subjective and objective measures. The results indicate that the proposed ANN-based model using LSFs feature set may be used as an alternative to state-of-the-art GMM-based models used to design a voice conversion system. 相似文献

13.

Transforming Perceived Vocal Effort and Breathiness Using Adaptive Pre-Emphasis Linear Prediction

Nordstrom K.I. Tzanetakis G. Driessen P.F. 《IEEE transactions on audio, speech, and language processing》2008,16(6):1087-1096

This paper presents a technique to transform high-effort voices into breathy voices using adaptive pre-emphasis linear prediction (APLP). The primary benefit of this technique is that it estimates a spectral emphasis filter that can be used to manipulate the perceived vocal effort. The other benefit of APLP is that it estimates a formant filter that is more consistent across varying voice qualities. This paper describes how constant pre-emphasis linear prediction (LP) estimates a voice source with a constant spectral envelope even though the spectral envelope of the true voice source varies over time. A listening experiment demonstrates how differences in vocal effort and breathiness are audible in the formant filter estimated by constant pre-emphasis LP. APLP is presented as a technique to estimate a spectral emphasis filter that captures the combined influence of the glottal source and the vocal tract upon the spectral envelope of the voice. A final listening experiment demonstrates how APLP can be used to effectively transform high-effort voices into breathy voices. The techniques presented here are relevant to researchers in voice conversion, voice quality, singing, and emotion. 相似文献

14.

基于Mellin变换的语音新特征与频率归正说话人自适应技术

陈景东徐波黄泰翼《自动化学报》2000,26(4):478-484

为了减小由于说话人之间声道形状的差异而引起的非特定人语音识别系统性能的下降,研究了两种方法,一种是基于最大似然估计的频率归正说话人自适应方法,另一种是基于Mellin变换的语音新特征.在非特定人孤立词语音识别系统上的初步实验表明,这两种方法都可以提高系统对不同说话人的鲁棒性,相比之下,基于Mellin变换的语音新特征具有更好的性能,它不仅提高了系统对不同话者的识别性能,而且也使系统对不同话者的误识率的离散程度大大减小. 相似文献

15.

Unsupervised speaker segmentation and tracking in real-time audio content analysis 总被引：1，自引：0，他引：1

Lie Lu Hong-Jiang Zhang 《Multimedia Systems》2005,10(4):332-343

This paper addresses the problem of real-time speaker segmentation and speaker tracking in audio content analysis in which no prior knowledge of the number of speakers and the identities of speakers is available. Speaker segmentation is to detect the speaker change boundaries in a speech stream. It is performed by a two-step algorithm, which includes potential change detection and refinement. Speaker tracking is then performed based on the results of speaker segmentation by identifying the speaker of each segment. In our approach, incremental speaker model updating and segmental clustering is proposed, which makes the unsupervised speaker segmentation and tracking feasible in real-time processing. A Bayesian fusion method is also proposed to fuse multiple audio features to obtain a more reliable result, and different noise levels are utilized to compensate for background mismatch. Experiments show that the proposed algorithm can recall 89% of speaker change boundaries with 15% false alarms, and 76% of speakers can be unsupervised identified with 20% false alarms. Compared with previous works, the algorithm also has low computation complexity and can perform in 15% of real time with a very limited delay in analysis. Published online: 12 January 2005 Part of the work presented in this paper was published in the 10th ACM International Conference on Multimedia, 1-6 December 2002 相似文献

16.

Unsupervised speaker segmentation with residual phase and MFCC features

S. Jothilakshmi V. Ramalingam S. Palanivel 《Expert systems with applications》2009,36(6):9799-9804

This paper proposes an unsupervised method for improving the automatic speaker segmentation performance by combining the evidence from residual phase (RP) and mel frequency cepstral coefficients (MFCC). This method demonstrates the complementary nature of speaker specific information present in the residual phase in comparison with the information present in the conventional MFCC. Moreover this method presents an unsupervised speaker segmentation algorithm based on support vector machine (SVM). The experiments show that the combination of residual phase and MFCC helps to identify more robustly the transitions among speakers. 相似文献

17.

基于频域卷积信号盲源分离的乐曲数据库构建* 总被引：1，自引：1，他引：0

李鹏周明全黎南杉王学松a 《计算机应用研究》2010,27(4):1376-1379

将通过频域卷积信号盲源分离算法从MP3歌曲音频信号中分离出人声主唱信号,再从人声主唱信号中提取出能够表征歌曲的旋律特征构建哼唱检索系统的歌曲数据库。盲源分离要求观测信号数目不小于源信号数目,因此先用小波多分辨率分析构造一路观测信号,再用频域独立成分分析(FDICA)实现MP3歌曲音频信号的盲源分离(BSS)。实验证明,采用FDICA-based BSS从歌曲MP3中分离出的人声主唱信号的旋律特征与待检索的人声哼唱信号的旋律特征有较高的相似度,可以用歌曲MP3构建哼唱检索系统的歌曲数据库。相似文献

18.

Wavelet packet approximation of critical bands for speaker verification

Mihalis Siafarikas Todor Ganchev Nikos Fakotakis George Kokkinakis 《International Journal of Speech Technology》2007,10(4):197-218

Exploiting the capabilities offered by the plethora of existing wavelets, together with the powerful set of orthonormal bases provided by wavelet packets, we construct a novel wavelet packet-based set of speech features that is optimized for the task of speaker verification. Our approach differs from previous wavelet-based work, primarily in the wavelet-packet tree design that follows the concept of critical bands, as well as in the particular wavelet basis function that has been used. In comparative experiments, we investigate several alternative speech parameterizations with respect to their usefulness for differentiating among human voices. The experimental results confirm that the proposed speech features outperform Mel-Frequency Cepstral Coefficients (MFCC) and previously used wavelet features on the task of speaker verification. A relative reduction of the equal error rate by 15%, 15% and 8% was observed for the proposed speech features, when compared to the wavelet packet features introduced by Farooq and Datta, the MFCC of Slaney, and the subband based cepstral coefficients of Sarikaya et al., respectively. 相似文献

19.

An ANN based approach to recognize initial phonemes of spoken words of Assamese language

Mousmita Sarma Kandarpa Kumar Sarma 《Applied Soft Computing》2013,13(5):2281-2291

Initial phoneme is used in spoken word recognition models. These are used to activate words starting with that phoneme in spoken word recognition models. Such investigations are critical for classification of initial phoneme into a phonetic group. A work is described in this paper using an artificial neural network (ANN) based approach to recognize initial consonant phonemes of Assamese words. A self organizing map (SOM) based algorithm is developed to segment the initial phonemes from its word counterpart. Using a combination of three types of ANN structures, namely recurrent neural network (RNN), SOM and probabilistic neural network (PNN), the proposed algorithm proves its superiority over the conventional discrete wavelet transform (DWT) based phoneme segmentation. The algorithm is exclusively designed on the basis of Assamese phonemical structure which consists of certain unique features and are grouped into six distinct phoneme families. Before applying the segmentation approach using SOM, an RNN is used to take some localized decision to classify the words into six phoneme families. Next the SOM segmented phonemes are classified into individual phonemes. A two-class PNN classification is performed with clean Assamese phonemes, to recognize the segmented phonemes. The validation of recognized phonemes is checked by matching the first formant frequency of the phoneme. Formant frequency of Assamese phonemes, estimated using the pole or formant location determination from the linear prediction model of vocal tract, is used effectively as a priori knowledge in the proposed algorithm. 相似文献

20.

Speaker verification using excitation source information

Debadatta Pati S. R. Mahadeva Prasanna 《International Journal of Speech Technology》2012,15(2):241-257

In this work we develop a speaker recognition system based on the excitation source information and demonstrate its significance by comparing with the vocal tract information based system. The speaker-specific excitation information is extracted by the subsegmental, segmental and suprasegmental processing of the LP residual. The speaker-specific information from each level is modeled independently using Gaussian mixture modeling—universal background model (GMM-UBM) modeling and then combined at the score level. The significance of the proposed speaker recognition system is demonstrated by conducting speaker verification experiments on the NIST-03 database. Two different tests, namely, Clean test and Noisy test are conducted. In case of Clean test, the test speech signal is used as it is for verification. In case of Noisy test, the test speech is corrupted by factory noise (9 dB) and then used for verification. Even though for Clean test case, the proposed source based speaker recognition system still provides relatively poor performance than the vocal tract information, its performance is better for Noisy test case. Finally, for both clean and noisy cases, by providing different and robust speaker-specific evidences, the proposed system helps the vocal tract system to further improve the overall performance. 相似文献