Similar Documents
20 similar documents retrieved (search time: 140 ms)
1.
A speech-signal feature combining the Mellin transform with Mel-frequency analysis, the MMCC feature, is proposed. Exploiting the scale invariance of the Mellin transform, the feature suppresses the effect of vocal-tract differences between speakers on the feature parameters, while the Mel-frequency modeling of human auditory perception improves its robustness, making it well suited to speaker-independent recognition. Simulation results show that a speaker-independent speech recognition system using the MMCC feature outperforms systems using LPCC, MFCC, or MMTLS features.
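A minimal sketch of the scale-invariance idea behind such a feature: on a logarithmic frequency axis a vocal-tract-length dilation becomes a shift, and the Fourier magnitude discards shifts. This is not the paper's exact MMCC recipe; the function name, grid size, and the log-resampling shortcut are illustrative assumptions.

```python
import numpy as np

def mellin_magnitude(spectrum, freqs, n_log_bins=128):
    """Scale-invariant magnitude via log-axis resampling + FFT
    (Fourier-Mellin idea): dilating the frequency axis only shifts
    the log-resampled signal, and |FFT| is shift-invariant."""
    # Resample the magnitude spectrum onto a logarithmic frequency grid
    # (skip the DC bin so the grid starts above zero).
    log_grid = np.geomspace(freqs[1], freqs[-1], n_log_bins)
    resampled = np.interp(log_grid, freqs, spectrum)
    return np.abs(np.fft.rfft(resampled))
```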

2.
A new template-training algorithm combining LBG and DTW is proposed, comprising three parts: template training, initial-template setting, and empty-subset handling. It provides a complete and effective solution to template training in speech recognition. The algorithm clusters speech feature matrices and generates their centroids, making an isolated-word recognition system better suited to speaker-independent use and raising the correct recognition rate for speakers outside the training set. A recognition system was designed and implemented; fast convergence during template training and a high system recognition rate confirm the algorithm's performance.
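For reference, a plain DTW distance between two feature matrices, the alignment primitive that LBG+DTW training builds on; the clustering and centroid-generation steps are not shown, and the Euclidean local cost is an assumption.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between feature matrices a, b (frames x dims)
    with Euclidean local cost and the standard three-way recursion."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```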

3.
This paper proposes a speaker recognition system based on fusing classified Gaussian mixture models with neural networks. Each speaker's frames are split into two classes by an energy threshold, and two class-specific speaker models (GMMs) are built per speaker in the classified subspaces, together with a per-speaker neural network that fuses the outputs of the two models; the identification decision is made from the outputs of the speakers' networks. In text-independent speaker recognition experiments with 100 male speakers, the classified-model strategy outperformed a conventional GMM system in both recognition performance and noise robustness, and neural-network back-end fusion outperformed direct fusion, so good recognition performance and noise robustness can be obtained with lower model mixture orders and shorter test utterances.
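A sketch of the front half of this scheme, assuming scikit-learn: frames are split by an energy threshold and one GMM is fitted per class. The per-speaker fusion network and the decision stage are omitted; the threshold and mixture count are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_classified_gmms(features, energies, threshold, n_mix=8):
    """Fit one GMM per energy class of a speaker's frames."""
    high = features[energies >= threshold]   # high-energy frames
    low = features[energies < threshold]     # low-energy frames
    gmm_high = GaussianMixture(n_components=n_mix).fit(high)
    gmm_low = GaussianMixture(n_components=n_mix).fit(low)
    return gmm_high, gmm_low
```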

4.
This work studies the extraction of dynamic speech features: in speaker recognition, dynamic features can effectively raise the recognition rate, but traditional extraction algorithms retain a large amount of interfering, redundant information, reducing both accuracy and speed. To remove these side effects, a dynamic time-frequency cepstral coefficient is proposed for speaker recognition systems; it eliminates redundant information and reduces feature dimensionality without losing the distributional characteristics of individual speakers. The extracted features are fed into a Gaussian mixture model-universal background model (GMM-UBM) for speaker classification. Matlab simulations show that dynamic time-frequency cepstral coefficients effectively improve the recognition accuracy of a speaker recognition system.

5.
To address vocal-tract length normalization in speaker-independent speech recognition, this paper first studies an autocorrelation-based formant spectrum recovery method that removes the pitch excitation, showing that the spectra of the same vowel uttered by different speakers are mutually scaled versions of one another, and how they differ from the spectra of different vowels uttered by the same speaker. Combining this with the scale-invariant Mellin transform, a feature extraction method suitable for speaker-independent recognition is proposed. In experiments on 20 Mandarin vowels collected from multiple speakers, FFT cepstra, Mel cepstra, FFT-Mellin cepstra, and the proposed Formant-Mellin cepstra were extracted and evaluated with an intuitive F-ratio discriminability criterion. The results show that the features obtained by combining formant recovery with the Mellin transform achieve higher discriminability on both clean samples and samples with additive white noise.

6.
高会贤  马全福  郑晓势 《计算机应用》2010,30(10):2712-2714
To give a speaker recognition system a high recognition rate even with short utterances and in noisy conditions, the extracted feature parameters are studied on top of a vector-quantization recognition algorithm. Wavelet transforms are combined with Mel-frequency cepstral coefficient (MFCC) extraction, and the improved feature is further combined with the spectral centroid, yielding a new combined feature: Mel-frequency wavelet transform coefficients plus spectral centroid (MFWTC+SC). Experiments show that the combined feature effectively improves the performance of a speaker recognition system.
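The spectral-centroid (SC) half of the combined feature admits a compact definition: the magnitude-weighted mean frequency of a frame. A sketch under that reading follows; the wavelet-based MFCC variant is not reproduced, and the names are illustrative.

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of one windowed frame."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * mag) / (np.sum(mag) + 1e-12)  # guard div-by-0
```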

7.
A voice conversion method based on a speaker-independent model is proposed. Since phonetic information is shared across all speakers' speech, a speaker-independent space is assumed that can be described by a Gaussian mixture model, with piecewise-linear transformations describing the mapping from this space to each speaker-dependent space. The model is trained on a multi-speaker database with speaker-adaptive training; at conversion time, the transformations between the source/target speaker spaces and the speaker-independent space are composed to construct the source-to-target feature mapping, so a conversion system can be built quickly and flexibly. Subjective listening tests verify the advantages of this method over the traditional approach based on speaker-dependent models.

8.
姜莹  俞一彪 《计算机工程与设计》2012,33(4):1482-1485,1490
A new speech recognition method based on a structural model of speech is proposed and applied to speaker-independent digit recognition. After computing cepstral features for each digit utterance, a structural representation invariant to speaker differences, the acoustical universal structure (AUS), is extracted and used to build a structural model; at recognition time, the AUS of the test utterance is extracted and matched against each digit's structural model. Recognition performance with little training data was tested and compared with a conventional HMM (hidden Markov model) approach; the results show the method outperforms HMM and that the structural model effectively removes inter-speaker differences.

9.
Speaker recognition systems trained in the traditional way, on speech from a single session per speaker, are often unstable. To accommodate the time-varying nature of a speaker's voice, this paper trains speaker models on speech from different sessions, giving each speaker multiple codebooks obtained by an optimization procedure that progressively reduces the misidentification rate. A channel compensation method is given to offset the influence of different channels on recognition performance. The paper also proposes replacing the features of a voiced phoneme with those of a single high-energy voiced frame, enabling online voiced-feature extraction, and uses two-stage vector quantization with a codebook indexing strategy to cut the recognition computation by 44%. These methods greatly increase recognition speed and robustness. Speaker identification results using PLP analysis and LPC cepstral analysis are compared.

10.
This work shows that the components of the feature vectors carrying speaker information usually follow different distributions and differ in their effectiveness for correctly identifying a speaker. Reflecting these effectiveness differences as a weight vector in the distortion-measure formula, a new distortion measure, the variance-normalized distortion measure, is proposed, which effectively improves the recognition performance of a speaker identification system. Further experiments show that this distortion measure also improves the system's robustness over time. A parameter normalization method suited to speaker identification, intra-frame amplitude normalization, is also given.
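One standard way to write such a variance-normalized measure, assuming the weights are inverse per-component variances (the paper's exact weighting may differ):

```latex
d(\mathbf{x},\mathbf{y}) \;=\; \sum_{i} w_i\,(x_i - y_i)^2,
\qquad w_i = \frac{1}{\sigma_i^2},
```

where $\sigma_i^2$ is the variance of the $i$-th feature component estimated on training data.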

11.
This paper describes reliable methods for estimating relative vocal tract lengths from speech signals. The two proposed methods rest on the simple principle that the resonant frequencies of an acoustic tube are inversely proportional to its length when the configuration is held constant. We estimated the ratio between two speakers' vocal tract lengths using the first and second formant trajectories of the same words uttered by both. In the first method, referred to as the "strict estimation method", we sought instances at which the gross structures of the two vocal tracts are analogous by applying dynamic time-warping to the formant trajectories of common words uttered at different speeds. At those instances, found among more than 100 common words from two speakers, the average formant ratio proved to be an excellent estimate (errors of about ±0.1%) of the reciprocal of the vocal tract length ratio. Next, we examined a simplified "direct estimation method" that uses all corresponding points of two formant trajectories. Its estimation errors were about ±0.3% at equal utterance speeds and ±2% at most for "fast"-to-"slow" speed ratios within 2.0. Finally, we estimated relative vocal tract lengths for four Japanese speaker groups differing in age and gender. Experimental results showed that the average vocal tract lengths of adult females, 7-10-year-old boys, and 7-10-year-old girls are 21%, 27%, and 30% shorter, respectively, than adult males'.
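The inverse proportionality the method rests on is explicit in the textbook quarter-wavelength resonator (a uniform lossless tube, closed at the glottis and open at the lips); real tracts are not uniform, but the ratio relation is what the formant-ratio estimate exploits:

```latex
F_n \;=\; \frac{(2n-1)\,c}{4L}, \quad n = 1, 2, \dots
\qquad\Longrightarrow\qquad
\frac{F_n^{(A)}}{F_n^{(B)}} \;=\; \frac{L_B}{L_A},
```

so averaging formant ratios over matched instants estimates the reciprocal of the vocal-tract length ratio.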

12.
In this paper, speaker-adaptive acoustic modeling is investigated using a novel method for speaker normalization and a well-known vocal tract length normalization method. With the novel method, acoustic observations of training and testing speakers are mapped into a normalized acoustic space through speaker-specific transformations, with the aim of reducing inter-speaker acoustic variability. For each speaker, an affine transformation is estimated with the goal of reducing the mismatch between that speaker's acoustic data and a set of target hidden Markov models; the transformation is estimated through constrained maximum likelihood linear regression and then applied to map the speaker's acoustic observations into the normalized space. Recognition experiments used two corpora, the first consisting of adults' speech and the second of children's speech. Performing training and recognition on normalized data consistently reduced the word error rate with respect to the baseline systems trained on unnormalized data, and the novel method always performed better than the reference vocal tract length normalization method adopted in this work. When unsupervised static speaker adaptation was applied in combination with each of the two normalization methods, the two corpora behaved differently: in one case performance became very similar, while in the other the difference remained significant.
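Applying the normalization is just an affine map of each observation; estimating A and b by constrained MLLR against the target HMMs is the substantive part and is not shown here. A sketch, with illustrative names:

```python
import numpy as np

def normalize_features(X, A, b):
    """Map observations X (frames x dims) into the normalized space
    with a CMLLR-style affine transform x' = A x + b."""
    return X @ A.T + b
```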

13.
Accent reflects an individual speaker's regional affiliation and is shaped by the speaker's community background. This study investigated the acoustic characteristics of two British regional accents, the Birmingham and Liverpool accents, and their correlates using a different approach. In contrast to previous accent-related research, where databases are formed from large groups of single-accent speakers, this study uses data from one individual who can speak in both accents, removing the effects of inter-speaker variability and facilitating efficient identification and analysis of the accents' acoustic features. Acoustic features such as formant frequencies, pitch slope, intensity, and phone duration were used to investigate the prominent features of each accent. The acoustic analysis was based on nine monophthongal and three diphthongal vowels. In addition, an analysis of variance of formant frequencies along the time dimension was performed to study the perceived effects of vocal-tract shape changes as the speaker switches between the two accents. The results indicate that formant frequencies, pitch slope, intensity, and phone duration all vary between the two accents. Classification tests using linear discriminant analysis showed that intensity had the strongest effect in differentiating the two accents, followed by F3, vowel duration, F2, and pitch slope.

14.
Speaker recognition has been one of the interesting issues in signal and speech processing over the last few decades, and feature selection is one of the main parts of a speaker recognition system that can improve its performance. In this paper, we propose two methods to find the MFCC feature vectors with the highest similarity, applied to a text-independent speaker identification system. These feature vectors capture the individual properties of each person's vocal tract that are most often repeated; they are used to build the speaker's model and to specify the decision boundary. The MFCCs of each window over the signal are used as feature vectors, and clustering is applied to obtain the most similar ones. Speaker identification experiments are performed on the ELSDSR database, which consists of 22 speakers (12 male and 10 female), with a neural network as the classifier. The effects of three main parameters are considered in the two proposed methods. Experimental results indicate that the performance of the speaker identification system is improved in terms of both accuracy and time consumption.
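One plausible reading of "feature vectors with the highest similarity" is to cluster the per-window MFCC vectors and keep each cluster's most central frame. A sketch under that assumption, using scikit-learn; the cluster count is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_mfcc_vectors(mfcc_frames, n_clusters=16):
    """Cluster MFCC frames and keep the frame nearest each centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(mfcc_frames)
    reps = []
    for k in range(n_clusters):
        members = mfcc_frames[km.labels_ == k]
        dists = np.linalg.norm(members - km.cluster_centers_[k], axis=1)
        reps.append(members[np.argmin(dists)])
    return np.array(reps)
```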

15.
To mitigate the effect of vocal effort on speaker recognition performance, and assuming the training data contain a small amount of whispered and shouted speech, a method combining maximum a posteriori (MAP) adaptation with constrained maximum likelihood linear regression (CMLLR) is proposed to update the speaker models and project the speaker features. MAP adaptation updates the speaker models trained on normal speech, while CMLLR feature-space projection transforms the features of whispered and shouted test utterances, reducing the mismatch between training and test speech. Experimental results show that the MAP+CMLLR method clearly lowers the speaker recognition equal error rate (EER): compared with the baseline system, MAP adaptation alone, maximum likelihood linear regression (MLLR) model projection, and CMLLR feature-space projection alone, its average EER drops by 75.3%, 3.5%, 72%, and 70.9%, respectively. The proposed method thus weakens the influence of vocal effort on speaker discriminability and makes the speaker recognition system more robust to vocal-effort variation.
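For context, the relevance-MAP mean update commonly used to adapt GMM speaker models (the paper's exact variant may differ): each mixture mean is interpolated between its prior value and the posterior-weighted data mean.

```latex
\hat{\mu}_k \;=\; \frac{n_k\,\bar{x}_k + \tau\,\mu_k}{n_k + \tau},
\qquad
n_k = \sum_t \gamma_k(x_t), \quad
\bar{x}_k = \frac{1}{n_k}\sum_t \gamma_k(x_t)\,x_t,
```

where $\gamma_k(x_t)$ is the posterior of component $k$ for frame $x_t$ and $\tau$ is the relevance factor.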

16.
《Advanced Robotics》2013,27(1-2):47-67
The meaning of an utterance, or the intention of the speaker, differs depending on the emotion of the speech. Therefore, speech emotion recognition, as well as automatic speech recognition, is necessary for precise communication between humans and robots in human-robot interaction. In this paper, a novel feature extraction method for speech emotion recognition is proposed that uses separation of phoneme classes. In feature extraction, the signal variation caused by different sentences usually overrides the emotion variation, lowering emotion recognition performance. The proposed method extracts features from the parts of speech that correspond to limited ranges of the spectral center of gravity (CoG) and formant frequencies, reducing the effect of phoneme variation on the features. Based on the CoG range, obstruent sounds are discriminated from sonorant sounds; the sonorant sounds are further categorized into four classes by the resonance characteristics revealed by formant frequency. Results on corpora from 30 different speakers show that the proposed method improves emotion recognition accuracy over other methods at the 99% significance level. Furthermore, the method was applied to extract several features, including prosodic and phonetic features, and was implemented on 'Mung' robots as an emotion recognizer for users.

17.
A novel speaker-adaptive learning algorithm is developed and evaluated for a hidden trajectory model of speech coarticulation and reduction. Central to this model is the bi-directional (forward and backward) filtering of the vocal tract resonance (VTR) target sequence. The VTR targets are key parameters of the model, controlling the hidden VTR's dynamic behavior and the resulting acoustic properties (those of the cepstral vector sequence). We describe two techniques for training these target parameters: (1) speaker-independent training, which averages out target variability over all speakers in the training set; and (2) speaker-adaptive training, which accounts for the variability in target values among individual speakers. Adaptive learning is also applied to adjust each unknown test speaker's target values towards their true values. All the learning algorithms make use of the accurate VTR tracking developed in our earlier work. We present details of the learning algorithms and analysis results comparing speaker-independent and speaker-adaptive learning, and describe TIMIT phone recognition experiments and results demonstrating the consistent superiority of speaker-adaptive learning over speaker-independent learning as measured by phonetic recognition performance.

18.
Speaker normalization based on piecewise-linear spectral warping functions
In conventional vocal-tract length normalization, under the assumption of a lossless concatenated-tube model of the vocal tract, a single vocal-tract factor determines the spectral warping function, which cannot capture the detailed spectral differences between speakers. To remedy this, a finer piecewise-linear spectral warping function is proposed to describe speaker differences; with a suitable spectral segmentation it accomplishes spectral alignment well. Moreover, because the warping function is model-independent, the method proves to be a fast speaker-robustness technique, especially suitable for unsupervised operation.
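A piecewise-linear warp is fully specified by its breakpoints, so applying it is a one-liner. A sketch with hypothetical breakpoints on a 0-8 kHz axis; the paper's segmentation and estimation of the breakpoints are not shown.

```python
import numpy as np

def piecewise_linear_warp(freqs, knots_in, knots_out):
    """Warp frequencies through the piecewise-linear function defined
    by matched breakpoints (vs. the single factor of classic VTLN)."""
    return np.interp(freqs, knots_in, knots_out)

# Hypothetical three-segment warp over a 0-8 kHz axis:
warped = piecewise_linear_warp(np.linspace(0.0, 8000.0, 257),
                               [0.0, 2000.0, 5000.0, 8000.0],
                               [0.0, 2200.0, 5100.0, 8000.0])
```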

19.
The shapes of speakers' vocal organs change under different emotional states, which shifts the emotional acoustic space of short-time features away from the neutral acoustic space and thereby degrades speaker recognition performance. Features deviating greatly from the neutral acoustic space are considered mismatched features, and they negatively affect speaker recognition systems. Emotion variation produces different feature deformations for different phonemes, so it is reasonable to build a finer model that detects mismatched features per phoneme. Given the difficulty of phoneme recognition, however, three kinds of acoustic-class recognition are proposed as replacements: phoneme classes, a Gaussian mixture model (GMM) tokenizer, and a probabilistic GMM tokenizer. We propose feature pruning and feature regulation methods that process the mismatched features to improve speaker recognition performance. For feature regulation, a transformation matrix is trained with the strategy of maximizing the between-class distance while minimizing the within-class distance, and is used to regulate the mismatched features. Experiments on the Mandarin affective speech corpus (MASC) show that our feature pruning and feature regulation methods increase the identification rate (IR) by 3.64% and 6.77% over the baseline GMM-UBM (universal background model) algorithm, and by 2.09% and 3.32% when applied to the state-of-the-art i-vector algorithm.
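The stated criterion (maximize between-class, minimize within-class distance) is the Fisher/LDA objective, so one way to train the regulation matrix is a generalized eigendecomposition of the scatter matrices. A sketch under that reading; regularization of a possibly singular within-class scatter is omitted.

```python
import numpy as np
from scipy.linalg import eigh

def regulation_matrix(X, y, n_dims):
    """Projection maximizing between-class over within-class scatter."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)          # within-class scatter
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)        # between-class scatter
    vals, vecs = eigh(Sb, Sw)                  # generalized eigenproblem
    order = np.argsort(vals)[::-1]             # largest ratio first
    return vecs[:, order[:n_dims]]
```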

20.
The issue of input variability resulting from speaker changes is one of the most crucial factors influencing the effectiveness of speech recognition systems. A solution is to adapt or normalize the input before recognition, so that the parameters of the input representation are adapted to those of a single speaker and the input pattern is normalized against speaker changes. This paper proposes three such methods that compensate for some effects of speaker changes on the speech recognition process. In all three, a feed-forward neural network is first trained to map the input into codes representing the phonetic classes and speakers. Then, among the 71 training speakers, the one with the highest phone recognition accuracy is selected as the reference speaker, and the representation parameters of the other speakers are converted toward the corresponding speech of that speaker. In the first method, the error back-propagation algorithm is used to find, in the input space, the optimal point of the decision region of each phone of each speaker; the distances between these points and the corresponding points of the reference speaker are used to offset speaker-change effects and adapt the input signal to the reference speaker. In the second method, using error back-propagation with the reference speaker's data as the desired output, all speech frames (both the training and test datasets) are corrected to coincide with the corresponding speech of the reference speaker. In the third method, another feed-forward neural network is applied inversely, mapping the phonetic classes and speaker information back to the input representation: the phonetic output retrieved from the direct network, along with the reference speaker's data, is given to the inverse network, which yields an estimate of the input representation adapted to the reference speaker. In all three methods, the final speech recognition model is trained on the adapted training data and tested on the adapted test data. Implementing these methods and combining the final network's results with those of the unadapted network based on the highest confidence level yields increases of 2.1%, 2.6%, and 3% in phone recognition accuracy on clean speech for the three methods, respectively.
