期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

何亮史永哲刘加《自动化学报》2011,37(7):849-856

为了使联合因子分析适用于多种信道条件下的文本无关说话人识别,提出了一种本征信道空间的正交拼接法.在多信道条件下,可以通过混合数据法或简单拼接法估计本征信道空间,但前者存在空间掩盖,后者虽解决了空间掩盖但引入了空间重叠.本文首先证明说话人建模和测试的核心运算是斜投影,基于上述证明,通过将待拼接空间正交的方法移除了空间重叠.在NIST SRE 2008核心评测数据库上的实验表明,本文所提算法优于混合数据法和简单拼接法. 相似文献

2.

说话人识别中的因子分析以及空间拼接 总被引：1，自引：0，他引：1

郭武李轶杰戴礼荣王仁华《自动化学报》2009,35(9):1193-1198

联合因子分析可以有效拟合混合高斯模型中的说话人和信道差异, 在说话人识别中得到广泛应用. 一般情况下, 该算法在对说话人和信道两个载荷矩阵进行联合估计时, 说话人残差矩阵无法发挥作用, 信道载荷矩阵的因子数不能提高. 本文提出说话人载荷矩阵、说话人残差载荷矩阵采用串行的训练模式, 在信道载荷矩阵训练中采用矩阵拼接的方法, 能够有效提高识别率; 在NIST SRE 2008年核心测试数据库的五个部分分别达到等错误率3.3%, 5.1%, 5.0%, 5.3%和5.0%. 相似文献

3.

利用空间相关性的改进HMM模型 总被引：1，自引：0，他引：1

苏腾荣吴及王作英吕萍《计算机工程与设计》2010,31(5)

语音识别领域中所采用的经典HMM模型,忽略了语音信号间的相关信息.针对这一问题,利用语音信号的空间相关性对经典HMM模型进行补偿,得到一种改进模型.该方法通过空间相关变换,描述了当前语音特征与历史数据之间的空间相关性,从而对联合状态输出分布进行建模.改进模型的解码算法利用空间相关性变换的参数更新算法在经典ⅧⅥM的解码算法基础上得到.实验结果表明,上述方法在说话人无关连续语音识别系统上获得了明显的性能改进. 相似文献

4.

说话人识别中的串行因子分析

郭武戴礼荣王仁华《模式识别与人工智能》2009,22(4)

在基于因子分析的说话人识别中,提出串行训练载荷矩阵的方法.在载荷矩阵训练中,采用串行的方式训练得到说话人因子矩阵、对角阵(残差矩阵)和信道空间矩阵.在说话人注册中,将以上3个载荷矩阵拼接,采用联合估计的方法得到每个说话人的因子.采用这种策略可有效解决因子分析中的饱和问题.在NIST SRE 2006年核心测试数据库上等错误率能达到3.65%. 相似文献

5.

话者识别的信道补偿

李轶杰郭武戴礼荣《小型微型计算机系统》2008,29(12)

在文本无关的说话人识别中,训练与测试语音中信道环境的差异是影响其性能最重要的因素.近年来,利用因子分析对信道建模成为说话人识别领域的重要方法,大大降低了说话人确认的错误率,但运算复杂度限制了实时的应用.本文介绍了一种简化的因子分析方法:首先在混合高斯模型的模型域训练信道空间,然后在特征域进行信道补偿,得到的新特征可用于各种系统.在NIST2006的数据库上,利用本文的方法相对基线系统在等错误率上有31%的降低. 相似文献

6.

基于超向量子空间分析的自动语种识别方法 总被引：2，自引：0，他引：2

宋彦戴礼荣王仁华《模式识别与人工智能》2010,23(2):165-170

在针对电话语音的自动语种识别系统中,由不同信道、说话内容及说话人等所引起的干扰是影响系统识别性能的一个重要因素。针对此,文中提出一种基于超向量子空间分析的自动语种识别方法。首先构造表征各训练语句的超向量空间并利用SVM模型进行区分性训练,然后利用子空间分析方法估计出噪声子空间,并在距离度量中去除这部分影响。在NIST 07 语种识别测试30s和10s任务中,该方法与基线系统相比,性能有明显提高,等错误率相对降低约20％。相似文献

7.

基于通用背景-联合估计(UB-JE)的说话人识别方法 总被引：2，自引：1，他引：1

汪海彬郭剑毅毛存礼余正涛《自动化学报》2018,44(10):1888-1895

在说话人识别中,有效的识别方法是核心.近年来,基于总变化因子分析（i-vector）方法成为了说话人识别领域的主流,其中总变化因子空间的估计是整个算法的关键.本文结合常规的因子分析方法提出一种新的总变化因子空间估计算法,即通用背景—联合估计（Universal background-joint estimation algorithm,UB-JE）算法.首先,根据高斯混合—通用背景模型（Gaussian mixture model-universal background model,GMM-UBM）思想提出总变化矩阵通用背景（UB）算法;其次,根据因子分析理论结合相关文献提出了一种总变化矩阵联合估计（JE）算法;最后,将两种算法相结合得到通用背景—联合估计（UB-JE）算法.采用TIMIT和MDSVC语音数据库,结合i-vector方法将所提的算法与传统算法进行对比实验.结果显示,等错误率（Equal error rate,EER）和最小检测代价函数（Minimum detection cost function,MinDCF）分别提升了8.3%与6.9%,所提方法能够提升i-vector方法的性能. 相似文献

8.

基于本征音因子分析的短时说话人识别

潘镭郭武李轶杰戴礼荣《数据采集与处理》2009,24(4)

提出了一种基于本征音因子分析的文本无关的说话人识别方法.它解决了训练语音与测试语音均很短的情况下,传统的基于最大后验概率准则的混合高斯模型无法建立稳定的说话人模型问题.首先利用期望最大化算法在开发集上训练出说话人的本征音载荷矩阵,在说话人模型建模时通过将短时语音数据向本征音空间的降维映射来得到模型参数.实验结果表明,在NIST SRE 2006数据库中的10 s训练语音-10 s测试语音任务中,在传统的混合高斯模型的基线系统上,通过采用本征音因子分析的方法可以使系统等错误率降低18%. 相似文献

9.

一种层次化空间分析方法在语种识别系统中的应用

常振超刘斌石远超张兴明杨镇西张丽《计算机应用研究》2012,29(10):3651-3654

在针对电话语音的自动语种识别系统中,训练和测试语料之间存在不同说话人、信道等因素差异带来的不匹配,是影响识别性能提高的关键因素。为了消除此类影响,提出一种层次化空间分析方法,首先对前端部分MFCC+SDC特征进行HLDA(异方差线性判别分析),增大了语种各个类的类间差异;然后对经自适应得到含有冗余信息的GSV进行PCA特征选择,有效地去除了信道等冗余信息的干扰。实验结果表明,此方法能有效消除信道等噪声影响,从而提升了原有系统的识别性能。相似文献

10.

基于i-向量和PCA字典学习稀疏表示的说话人确认

下载免费PDF全文

舒毅邢玉娟《计算机工程与应用》2016,52(18):144-147

稀疏表示以其出色的分类性能成为说话人确认研究的热点,其中过完备字典的构建是关键,直接影响其性能。为了提高说话人确认系统的鲁棒性,同时解决稀疏表示过完备字典中存在噪声及信道干扰信息的问题,提出一种基于i-向量的主成分稀疏表示字典学习算法。该算法在高斯通用背景模型的基础上提取说话人的i-向量,并使用类内协方差归一化技术对i-向量进行信道补偿;根据信道补偿后的说话人i-向量的均值向量估计其信道偏移空间,在该空间采用主成分分析方法提取低维信道偏移主分量,用于重新计算说话人i-向量,从而达到进一步抑制i-向量中信道干扰的目的;将新的i-向量作为字典原子构建高鲁棒性稀疏表示过完备字典。在测试阶段,测试语音的i-向量在该字典上寻找其稀疏表示系数向量,根据系数向量对测试i-向量的重构误差确定目标说话人。仿真实验表明,该算法具有良好的识别性能。相似文献

11.

Modeling nuisance variabilities with factor analysis for GMM-based audio pattern classification

Driss Matrouf Florian Verdet Mickaël Rouvier Jean-François Bonastre Georges Linarès 《Computer Speech and Language》2011,25(3):481-498

Audio pattern classification represents a particular statistical classification task and includes, for example, speaker recognition, language recognition, emotion recognition, speech recognition and, recently, video genre classification. The feature being used in all these tasks is generally based on a short-term cepstral representation. The cepstral vectors contain at the same time useful information and nuisance variability, which are difficult to separate in this domain. Recently, in the context of GMM-based recognizers, a novel approach using a Factor Analysis (FA) paradigm has been proposed for decomposing the target model into a useful information component and a session variability component. This approach is called Joint Factor Analysis (JFA), since it models jointly the nuisance variability and the useful information, using the FA statistical method. The JFA approach has even been combined with Support Vector Machines, known for their discriminative power. In this article, we successfully apply this paradigm to three automatic audio processing applications: speaker verification, language recognition and video genre classification. This is done by applying the same process and using the same free software toolkit. We will show that this approach allows for a relative error reduction of over 50% in all the aforementioned audio processing tasks. 相似文献

12.

Mismatched feature detection with finer granularity for emotional speaker recognition

Li Chen Ying-chun Yang Zhao-hui Wu 《浙江大学学报:C卷英文版》2014,15(10):903-916

The shapes of speakers＇ vocal organs change under their different emotional states, which leads to the deviation of the emotional acoustic space of short-time features from the neutral acoustic space and thereby the degradation of the speaker recognition performance. Features deviating greatly from the neutral acoustic space are considered as mismatched features, and they negatively affect speaker recognition systems. Emotion variation produces different feature deformations for different phonemes, so it is reasonable to build a finer model to detect mismatched features under each phoneme. However, given the difficulty of phoneme recognition, three sorts of acoustic class recognition--phoneme classes, Gaussian mixture model （GMM） tokenizer, and probabilistic GMM tokenizer--are proposed to replace phoneme recognition. We propose feature pruning and feature regulation methods to process the mismatched features to improve speaker recognition performance. As for the feature regulation method, a strategy of maximizing the between-class distance and minimizing the within-class distance is adopted to train the transformation matrix to regulate the mismatched features. Experiments conducted on the Mandarin affective speech corpus （MASC） show that our feature pruning and feature regulation methods increase the identification rate （IR） by 3.64% and 6.77%, compared with the baseline GMM-UBM （universal background model） algorithm. Also, corresponding IR increases of 2.09% and 3.32% can be obtained with our methods when applied to the state-of-the-art algorithm i-vector. 相似文献

13.

一种基于语音组成单位的说话人识别算法

黄长存汪增福《模式识别与人工智能》2008,21(6):856-866

以线性预测系数为特征通过高斯混合模型的迭代算法对训练样本的初始k均值聚类结果进行优化,得到语音组成单位的表示.以语音组成单位的模式匹配为基础,提出一种文本无关说话人确认的方法——均值法,以及一种文本无关说话人辨认方法.实验结果表明,即使在短时语音下本文方法都能取得较好效果. 相似文献

14.

深度学习框架下说话人识别研究综述

下载免费PDF全文

曾春艳马超峰王志锋朱栋梁赵楠王娟刘聪《计算机工程与应用》2020,56(7):8-16

说话人识别由于其独特的方便性、经济性和准确性等优势,已成为人们日常生活与工作中重要的身份认证方式。然而在实际应用场景下,对说话人识别系统的准确性、鲁棒性、迁移性、实时性等提出了巨大的挑战。近年来深度学习在特征表达和模式分类方面表现优异,为说话人识别技术的进一步发展提供了新方向。相较于传统说话人识别技术（如GMM-UBM、GMM-SVM、JFA、i-vector等）,聚焦于深度学习框架下的说话人识别方法,按照深度学习在说话人识别中的作用方式,将目前的研究分为基于深度学习的特征表达、基于深度学习的后端建模、端到端联合优化三种类别,并分析和总结了其典型算法的特点及网络结构,对其具体性能进行了对比分析。最后总结了深度学习在说话人识别中的应用特点及优势,进一步分析了目前说话人识别研究面临的问题及挑战,并展望了深度学习框架下说话人识别研究的前景,以期推动说话人识别技术的进一步发展。相似文献

15.

Partially supervised speaker clustering

Tang H Chu SM Hasegawa-Johnson M Huang TS 《IEEE transactions on pattern analysis and machine intelligence》2012,34(5):959-971

Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment to the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm—linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the “bag of acoustic features” representation and statistical model-based distance metrics, 2) our advocated use of the cosine distance metric yields consistent increases in the speaker clustering performance as compared to the commonly used euclidean distance metric, 3) our partially supervised speaker clustering concept and strategies significantly improve the speaker clustering performance over the baselines, and 4) our proposed LSDA algorithm further leads to state-of-the-art speaker clustering performance. 相似文献

16.

基于话者无关模型的说话人转换方法

陈凌辉凌震华戴礼荣《模式识别与人工智能》2013,26(3):254-259

提出一种基于话者无关模型的说话人转换方法.考虑到音素信息共同存在于所有说话人的语音中,假设存在一个可以用高斯混合模型来描述的话者无关空间,且可用分段线性变换来描述该空间到各说话人相关空间之间的映射关系.在一个多说话人的数据库上,用话者自适应训练算法来训练模型,并在转换阶段使用源目标说话人空间到话者无关空间的变换关系来构造源与目标之间的特征变换关系,快速、灵活的构造说话人转换系统.通过主观测听实验来验证该算法相对于传统的基于话者相关模型方法的优点. 相似文献

17.

Adaptation of children’s speech with limited data based on formant-like peak alignment

Xiaodong Cui Abeer Alwan 《Computer Speech and Language》2006,20(4):400-419

Automatic recognition of children’s speech using acoustic models trained by adults results in poor performance due to differences in speech acoustics. These acoustical differences are a consequence of children having shorter vocal tracts and smaller vocal cords than adults. Hence, speaker adaptation needs to be performed. However, in real-world applications, the amount of adaptation data available may be less than what is needed by common speaker adaptation techniques to yield reasonable performance. In this paper, we first study, in the discrete frequency domain, the relationship between frequency warping in the front-end and corresponding transformations in the back-end. Three common feature extraction schemes are investigated and their transformation linearity in the back-end are discussed. In particular, we show that under certain approximations, frequency warping of MFCC features with Mel-warped triangular filter banks equals a linear transformation in the cepstral space. Based on that linear transformation, a formant-like peak alignment algorithm is proposed to adapt adult acoustic models to children’s speech. The peaks are estimated by Gaussian mixtures using the Expectation-Maximization (EM) algorithm [Zolfaghari, P., Robinson, T., 1996. Formant analysis using mixtures of Gaussians, Proceedings of International Conference on Spoken Language Processing, 1229–1232]. For limited adaptation data, the algorithm outperforms traditional vocal tract length normalization (VTLN) and maximum likelihood linear regression (MLLR) techniques. 相似文献

18.

基于SIFT的说话人唇动识别

马新军吴晨晨仲乾元李园园《计算机应用》2017,37(9):2694-2699

针对唇部特征提取维度过高以及对尺度空间敏感的问题,提出了一种基于尺度不变特征变换（SIFT）算法作特征提取来进行说话人身份认证的技术。首先,提出了一种简单的视频帧图片规整算法,将不同长度的唇动视频规整到同一的长度,提取出具有代表性的唇动图片;然后,提出一种在SIFT关键点的基础上,进行纹理和运动特征的提取算法,并经过主成分分析（PCA）算法的整合,最终得到具有代表性的唇动特征进行认证;最后,根据所得到的特征,提出了一种简单的分类算法。实验结果显示,和常见的局部二元模式（LBP）特征和方向梯度直方图（HOG）特征相比较,该特征提取算法的错误接受率（FAR）和错误拒绝率（FRR）表现更佳。说明整个说话人唇动特征识别算法是有效的,能够得到较为理想的结果。相似文献

19.

Maximum likelihood stochastic transformation adaptation for medium and small data sets

《Computer Speech and Language》2001,15(3):257-285

Speaker adaptation is recognized as an essential part of today’s large-vocabulary automatic speech recognition systems. A family of techniques that has been extensively applied for limited adaptation data is transformation-based adaptation. In transformation-based adaptation we partition our parameter space in a set of classes, estimate a transform (usually linear) for each class and apply the same transform to all the components of the class. It is known, however, that additional gains can be made if we do not constrain the components of each class to use the same transform. In this paper two speaker adaptation algorithms are described. First, instead of estimating one linear transform for each class (as maximum likelihood linear regression (MLLR) does, for example) we estimate multiple linear transforms per class of models and a transform weights vector which is specific to each component (Gaussians in our case). This in effect means that each component receives its own transform without having to estimate each one of them independently. This scheme, termed maximum likelihood stochastic transformation (MLST) achieves a good trade-off between robustness and acoustic resolution. MLST is evaluated on the Wall Street Journal(WSJ) corpus for non-native speakers and it is shown that in the case of 40 adaptation sentences the algorithm outperforms MLLR by more than 13%. In the second half of this paper, we introduce a variant of the MLST designed to operate under sparsity of data. Since the majority of the adaptation parameters are the transformations, we estimate them on the training speakers and adapt to a new speaker by estimating the transform weights only. First we cluster the speakers in a number of sets and estimate the transformations on each cluster. The new speaker will use transformations from all clusters to perform adaptation. This method, termed basis transformation, can be seen as a speaker similarity scheme. Experimental results on the WSJ show that when basis transformation is cascaded with MLLR marginal gains can be obtained from MLLR only, for adaptation of native speakers. 相似文献

20.

Speaker identification using multi-step clustering algorithm with transformation-based GMM 总被引：1，自引：0，他引：1

Limin Xu Zhenmin Tang 《Automatic Control and Computer Sciences》2007,41(4):224-231

To improve the performance of speaker recognition, the embedded linear transformation is used to integrate both transformation and diagonal-covariance Caussian mixture into a unified framework. In the case, the mixture number of GMM must be fixed in model training. The cluster expectation-maximization (EM) algorithm is a well-known technique in which the mixture number is regarded as an estimated parameter. This paper presents a new model structure that integrates a multi-step cluster algorithm into the estimating process of GMM with the embedded transformation. In the approach, the transformation matrix, the mixture number and model parameters are simultaneously estimated according to a maximum likelihood criterion. The proposed method is demonstrated on a database of three data sessions for text independent speaker identification. The experiments show that this method outperforms the traditional GMM with cluster EM algorithm. This text was submitted by the authors in English. 相似文献