1.
A primary challenge in automatic speech recognition is to understand and build acoustic models that represent individual differences in spoken language. A speaker's age, gender, and speaking style, shaped by dialect, are a few of the sources of these differences. This work investigates dialectal differences through analysis of variance of acoustic features such as formant frequencies, pitch, pitch slope, duration, and intensity for vowel sounds. It discusses methods to capture dialect-specific knowledge through vocal tract and prosody information extracted from speech, which can be used for automatic identification of dialects. A kernel-based support vector machine is used to measure the dialect-discriminating ability of the acoustic features. For the spectral features, shifted delta cepstral coefficients combined with Mel frequency cepstral coefficients give a recognition performance of 66.97 %. The combination of prosodic features performs better, with a classification score of 74 %. The model is further evaluated on the combined spectral and prosodic feature set and achieves a classification accuracy of 88.77 %. The proposed model is also compared with human perception of dialects. The work is based on four dialects of Hindi, one of the world's major languages.
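The classification setup the abstract describes — a kernel SVM over concatenated spectral and prosodic features — can be sketched as follows. The data here is synthetic and the feature dimensions, class separations, and RBF kernel choice are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's data: 4 dialect classes, each
# utterance described by a spectral vector (e.g. MFCC + shifted delta
# cepstra) concatenated with prosodic measures (pitch, pitch slope,
# duration, intensity). Dimensions and offsets are illustrative only.
n_per_class, n_spec, n_pros = 50, 39, 4
X, y = [], []
for dialect in range(4):
    spec = rng.normal(loc=dialect * 0.5, scale=1.0, size=(n_per_class, n_spec))
    pros = rng.normal(loc=dialect * 0.3, scale=0.5, size=(n_per_class, n_pros))
    X.append(np.hstack([spec, pros]))   # spectral + prosodic fusion
    y.append(np.full(n_per_class, dialect))
X, y = np.vstack(X), np.concatenate(y)

# Kernel SVM, as in the paper; the RBF kernel is an assumption here.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```

In practice each utterance would be summarized into one such fused vector before classification; the fusion step is simply feature concatenation.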

2.
This paper studies two methods for effectively applying deep neural networks to acoustic modeling for Uyghur large-vocabulary continuous speech recognition: a hybrid architecture combining a deep neural network with a hidden Markov model (Deep neural network hidden Markov model, DNN-HMM), in which the DNN replaces the Gaussian mixture model for computing state output probabilities; and using a DNN as a front-end acoustic feature extractor that produces bottleneck features (Bottleneck features, BN), supplying more effective acoustic features to the conventional GMM-HMM (Gaussian mixture model-HMM) acoustic modeling architecture (BN-GMM-HMM). Experimental results show that the DNN-HMM and BN-GMM-HMM models reduce the word error rate by 8.84% and 5.86%, respectively, relative to the GMM-HMM baseline; both methods achieve substantial performance gains.
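The bottleneck-feature idea above amounts to running acoustic frames through a trained network and reading activations from a deliberately narrow hidden layer. A minimal NumPy sketch, with toy layer sizes and random (untrained) weights standing in for a real trained DNN:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for a trained DNN: input acoustic frames pass through
# wide hidden layers and one narrow "bottleneck" layer; the bottleneck
# activations are then used as features for a GMM-HMM system.
# Layer sizes and weights are illustrative, not the paper's.
sizes = [39, 512, 40, 512, 120]      # 40 = bottleneck layer width
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
BOTTLENECK_LAYER = 2                 # output of the 2nd hidden layer

def bottleneck_features(frames):
    """Run frames through the net; return the bottleneck activations."""
    h = frames
    for i, (W, b) in enumerate(zip(weights, biases), start=1):
        h = np.tanh(h @ W + b)
        if i == BOTTLENECK_LAYER:
            return h                 # stop here: BN features for GMM-HMM
    return h

frames = rng.normal(size=(100, 39))  # 100 frames of 39-dim features
bn = bottleneck_features(frames)
print(bn.shape)                      # (100, 40)
```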

3.
Prosodic feature conversion is an important component of voice conversion systems and plays a crucial role in the naturalness of the converted speech. This paper introduces the structure of prosodic features, reviews the main progress in prosodic feature modeling and conversion control, and discusses the mechanisms, advantages, and disadvantages of the commonly used conversion methods. On this basis, research on prosody conversion is summarized and analyzed, and directions for future work are outlined.

4.
In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, the deterioration of speech quality is caused by some problems: 1) appropriate spectral movements are not always caused by the frame-based conversion process, and 2) the converted spectra are excessively smoothed by statistical modeling. In order to address those problems, we propose a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory. Not only static but also dynamic feature statistics are used for realizing the appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in view of both speech quality and conversion accuracy for speaker individuality.
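The trajectory estimation step described above (often called maximum-likelihood parameter generation) can be sketched as a weighted least-squares problem: given per-frame means and precisions for static and dynamic features, solve for the static sequence whose own deltas match the dynamic targets. The sketch below assumes a simple first-difference delta, diagonal precisions, and synthetic targets, and omits the global-variance term:

```python
import numpy as np

# Sketch of ML trajectory estimation: given per-frame means of static
# and delta features (e.g. predicted by the GMM), solve for the static
# sequence y that maximizes the likelihood under the constraint that
# the deltas are computed from y itself. All numbers are synthetic.
T = 50
rng = np.random.default_rng(2)
mean_static = np.cumsum(rng.normal(size=T))      # target static means
mean_delta = np.diff(mean_static, prepend=0.0)   # target delta means
prec = np.ones(2 * T)                            # diagonal precisions

# W maps a static sequence y to [statics; deltas] (first difference).
W = np.zeros((2 * T, T))
W[:T] = np.eye(T)
W[T, 0] = 1.0
for t in range(1, T):
    W[T + t, t] = 1.0
    W[T + t, t - 1] = -1.0

m = np.concatenate([mean_static, mean_delta])
A = W.T @ (prec[:, None] * W)        # normal equations: W' P W y = W' P m
b = W.T @ (prec * m)
y = np.linalg.solve(A, b)            # ML static trajectory

# With self-consistent targets, the solution recovers the static means.
print(np.allclose(y, mean_static))
```

With inconsistent static/delta targets (the realistic case), the solution is the precision-weighted compromise between them, which is what smooths the frame-by-frame conversion.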

5.
In this paper we demonstrate the use of prosody models for developing speech systems in Indian languages. Duration and intonation models developed using feedforward neural networks are considered as prosody models. Labelled broadcast news data in Hindi, Telugu, Tamil and Kannada is used for developing the neural network models that predict duration and intonation. Features representing positional, contextual and phonological constraints are used for developing the prosody models. The use of prosody models is illustrated with speech recognition, speech synthesis, speaker recognition and language identification applications. Autoassociative neural networks and support vector machines are used as classification models for developing the speech systems. The performance of the speech systems is shown to improve when the prosodic features are combined with a popular spectral feature set consisting of weighted linear prediction cepstral coefficients (WLPCCs).
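A duration model of the kind described — positional, contextual and phonological factors mapped by a small feedforward network to a syllable duration — can be sketched as below. The feature encoding and the synthetic target function are placeholders, not the paper's broadcast-news data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)

# Illustrative duration model: each syllable is encoded as a vector of
# positional/contextual/phonological factors, and a small feedforward
# network predicts a (z-scored) duration. The target below is an
# arbitrary synthetic function chosen only to exercise the model.
n, d = 400, 12
X = rng.normal(size=(n, d))
dur = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.05, size=n)

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(X, dur)
print("R^2 on training data:", round(model.score(X, dur), 3))
```

In a real system the predicted z-scores would be de-normalized per phone class to obtain durations in milliseconds.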

6.
An improved tone modeling method based on articulatory features is proposed and applied to one-pass decoding with a stochastic segment model. According to the articulatory characteristics of Mandarin, seven articulatory features that distinguish Chinese vowel and consonant information are defined. Taking these as targets, a hierarchical multilayer perceptron computes the posterior probabilities of 35 articulatory-feature classes for the speech signal, and these probabilities are used together with conventional prosodic features for tone modeling. Following the decoding characteristics of the stochastic segment model, the tone-model probability score is computed for the paths retained after two-level pruning, weighted, and added to the total path score. Experiments on the "863-test" set show that the tone model using the new articulatory feature set improves tone recognition accuracy by 3.11%, and that incorporating the tone information reduces the character error rate of the stochastic segment model from 13.67% to 12.74%, demonstrating the feasibility of applying tone information to stochastic segment models.

7.
To overcome the oversmoothing that occurs in voice conversion with Gaussian mixture models (GMMs), and noting that the means of the GMM parameters characterize the spectral envelope shape of the converted features, this paper proposes voice conversion based on a hybrid GMM-ANN model in which an ANN transforms the means of the GMM parameters. To obtain a continuous converted spectrum, static and dynamic spectral features are combined to approximate the converted spectral sequence. Given the importance of the fundamental frequency to voice conversion, F0 is also analyzed and converted on top of the spectral conversion. Finally, subjective and objective experiments evaluate the performance of the proposed hybrid-model conversion method; the results show that, compared with the conventional GMM-based method, the proposed method produces better converted speech.

8.
This paper describes a work aimed towards understanding the art of mimicking by professional mimicry artists while imitating the speech characteristics of known persons, and also explores the possibility of detecting a given speech as genuine or impostor. This includes a systematic approach of collecting three categories of speech data, namely original speech of the mimicry artists, speech while mimicking chosen celebrities and original speech of the chosen celebrities, to analyze the variations in prosodic features. A method is described for the automatic extraction of relevant prosodic features in order to model speaker characteristics. Speech is automatically segmented into intonation phrases using speech/nonspeech classification. Further segmentation is done using valleys in the energy contour. Intonation, duration and energy features are extracted for each of these segments. The intonation curve is approximated using Legendre polynomials. Other useful prosodic features include average jitter, average shimmer, total duration, voiced duration and change in energy. These prosodic features extracted from original speech of celebrities and mimicry artists are used for creating speaker models. A Support Vector Machine (SVM) is used for creating the speaker models, and detection of a given speech as genuine or impostor is attempted using a speaker verification framework of SVM models.
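The Legendre-polynomial approximation of an intonation curve mentioned above can be sketched directly with NumPy: fit a low-order Legendre expansion to an F0 contour over one phrase and use the coefficients as prosodic features. The contour below is synthetic and the expansion order is an illustrative choice:

```python
import numpy as np
from numpy.polynomial import legendre

# Fit a low-order Legendre expansion to a (synthetic) F0 contour over
# one intonation phrase; the coefficients serve as prosodic features.
t = np.linspace(-1, 1, 80)           # normalized phrase-internal time
f0 = (180 + 25 * t - 40 * t**2
      + np.random.default_rng(4).normal(scale=2, size=t.size))  # Hz

coeffs = legendre.legfit(t, f0, deg=3)   # 4 coefficients as features
f0_hat = legendre.legval(t, coeffs)
rmse = np.sqrt(np.mean((f0 - f0_hat) ** 2))
print("Legendre coefficients:", np.round(coeffs, 2))
print("fit RMSE (Hz):", round(rmse, 2))
```

Because the basis is orthogonal on [-1, 1], the low-order coefficients separate mean level, slope, and curvature of the contour, which makes them convenient inputs for a speaker model.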

9.
In this paper spectral and prosodic features extracted from different levels are explored for analyzing the language specific information present in speech. In this work, spectral features extracted from frames of 20 ms (block processing), individual pitch cycles (pitch synchronous analysis) and glottal closure regions are used for discriminating the languages. Prosodic features extracted from syllable, tri-syllable and multi-word (phrase) levels are proposed in addition to spectral features for capturing the language specific information. In this study, language specific prosody is represented by intonation, rhythm and stress features at syllable and tri-syllable (word) levels, whereas temporal variations in fundamental frequency (F0 contour), durations of syllables and temporal variations in intensities (energy contour) are used to represent the prosody at multi-word (phrase) level. For analyzing the language specific information in the proposed features, the Indian language speech database (IITKGP-MLILSC) is used. Gaussian mixture models are used to capture the language specific information from the proposed features. The evaluation results indicate that language identification performance is improved with the combination of features. Performance of the proposed features is also analyzed on the standard Oregon Graduate Institute Multi-Language Telephone-based Speech (OGI-MLTS) database.
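The GMM-based language identification scheme used here follows a standard pattern: fit one GMM per language on its feature frames, then score a test utterance by total log-likelihood under each model and pick the best. A minimal sketch on synthetic frames (feature dimension, mixture count, and class separations are all assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)

# One GMM per language, trained on that language's (synthetic) frames.
languages = ["lang_a", "lang_b", "lang_c"]
train = {lang: rng.normal(loc=i * 2.0, size=(300, 6))
         for i, lang in enumerate(languages)}
models = {lang: GaussianMixture(n_components=4, random_state=0).fit(frames)
          for lang, frames in train.items()}

# Score a test utterance: average log-likelihood per frame, argmax.
test_utt = rng.normal(loc=2.0, size=(80, 6))     # drawn near lang_b
scores = {lang: gmm.score(test_utt) for lang, gmm in models.items()}
best = max(scores, key=scores.get)
print("identified language:", best)
```

Combining spectral and prosodic streams, as in the paper, would typically mean training separate GMMs per stream and fusing their log-likelihood scores.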

10.
Voice conversion is a relatively new research direction in speech processing and has become a research hotspot in recent years. Voice conversion refers to modifying the speech characteristics of a source speaker so that they take on those of a target speaker. This paper defines voice conversion, introduces the individual characteristics of speech, surveys the main conversion algorithms for the spectral envelope as well as the main algorithms for prosody conversion, and concludes with directions for future research.

11.
Voice conversion has wide applications in education, entertainment, healthcare, and other fields. To obtain high-quality converted speech, a voice conversion algorithm based on a multi-spectral-feature generative adversarial network is proposed. A generative adversarial network converts voiceprint images generated from spectral feature parameters, and feature-level multimodal fusion lets the network learn multiple kinds of information from different feature domains, improving its perception of the speech signal and yielding high-quality converted speech with good clarity and intelligibility. Experimental results show that the proposed algorithm clearly outperforms traditional algorithms on both subjective and objective evaluation metrics.

12.
Automatic mood detection and tracking of music audio signals
Music mood describes the inherent emotional expression of a music clip. It is helpful in music understanding, music retrieval, and some other music-related applications. In this paper, a hierarchical framework is presented to automate the task of mood detection from acoustic music data, by following some music psychological theories in western cultures. The hierarchical framework has the advantage of emphasizing the most suitable features in different detection tasks. Three feature sets, including intensity, timbre, and rhythm are extracted to represent the characteristics of a music clip. The intensity feature set is represented by the energy in each subband, the timbre feature set is composed of the spectral shape features and spectral contrast features, and the rhythm feature set indicates three aspects that are closely related with an individual's mood response, including rhythm strength, rhythm regularity, and tempo. Furthermore, since mood is usually changeable in an entire piece of classical music, the approach to mood detection is extended to mood tracking for a music piece, by dividing the music into several independent segments, each of which contains a homogeneous emotional expression. Preliminary evaluations indicate that the proposed algorithms produce satisfactory results. On our testing database composed of 800 representative music clips, the average accuracy of mood detection achieves up to 86.3%. We can also on average recall 84.1% of the mood boundaries from nine testing music pieces.
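The intensity feature set described above, energy per frequency subband, can be sketched with a short-time Fourier analysis. Frame length, hop size, and subband edges below are illustrative choices, not the paper's settings:

```python
import numpy as np

# Per-frame subband energies from the magnitude spectrum of short
# windowed frames. The audio is random noise standing in for a clip.
sr = 22050
rng = np.random.default_rng(6)
clip = rng.normal(size=sr)                       # 1 s of stand-in audio
frame_len, hop = 1024, 512
edges_hz = [0, 200, 800, 3200, sr // 2]          # 4 subbands (assumed)

freqs = np.fft.rfftfreq(frame_len, d=1 / sr)
bands = [(freqs >= lo) & (freqs < hi)
         for lo, hi in zip(edges_hz[:-1], edges_hz[1:])]

features = []
for start in range(0, len(clip) - frame_len, hop):
    frame = clip[start:start + frame_len] * np.hanning(frame_len)
    spec = np.abs(np.fft.rfft(frame))
    features.append([np.sum(spec[b] ** 2) for b in bands])
features = np.array(features)                    # (n_frames, n_subbands)
print(features.shape)
```

Timbre and rhythm features would be computed from the same spectral frames (spectral shape/contrast) and from the temporal envelope (onset periodicity, tempo), respectively.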

13.
A voice conversion method based on a generalized regression neural network (GRNN) optimized by particle swarm optimization (PSO) is proposed. First, the individualized characteristics of the vocal tract and the excitation source in the training speech are used to train two separate GRNNs, yielding the structural parameters of each GRNN. Then, PSO optimizes the structural parameters of the GRNNs, reducing the influence of manual choices on the conversion result. Finally, the prosodic features, pitch contour, and energy are each converted linearly, so that the converted speech contains more of the individualized feature information of the source speech. Subjective and objective experiments show that, compared with a radial basis function (RBF) network and a plain GRNN, the converted speech obtained with the proposed model has much better naturalness and likelihood, with a clearly lower spectral distortion rate, and is closer to the target speech.
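The core optimization can be illustrated on a toy scale: a GRNN is essentially kernel regression with one smoothing parameter sigma, and PSO searches for the sigma minimizing held-out error. Everything below (data, swarm size, inertia and acceleration constants, sigma bounds) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy regression task standing in for the spectral mapping.
X_tr = rng.uniform(-3, 3, size=(60, 1))
y_tr = np.sin(X_tr[:, 0]) + rng.normal(scale=0.05, size=60)
X_va = rng.uniform(-3, 3, size=(30, 1))
y_va = np.sin(X_va[:, 0])

def grnn_predict(sigma, X):
    """GRNN = Gaussian-kernel weighted average of training targets."""
    d2 = ((X[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma**2))
    return (w @ y_tr) / w.sum(axis=1)

def fitness(sigma):
    return np.mean((grnn_predict(sigma, X_va) - y_va) ** 2)

# Plain global-best PSO over sigma in [0.05, 2.0].
pos = rng.uniform(0.05, 2.0, size=12)
vel = np.zeros(12)
pbest, pbest_f = pos.copy(), np.array([fitness(s) for s in pos])
gbest = pbest[pbest_f.argmin()]
for _ in range(40):
    r1, r2 = rng.random(12), rng.random(12)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.05, 2.0)
    f = np.array([fitness(s) for s in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmin()]
print("best sigma:", round(float(gbest), 3),
      "val MSE:", round(float(fitness(gbest)), 5))
```

In the paper's setting the fitness would be a spectral distortion measure between converted and target speech rather than this toy validation MSE.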

14.
In this paper, we present a comparative analysis of artificial neural networks (ANNs) and Gaussian mixture models (GMMs) for the design of a voice conversion system using line spectral frequencies (LSFs) as feature vectors. Both the ANN and GMM based models are explored to capture nonlinear mapping functions for modifying the vocal tract characteristics of a source speaker according to a desired target speaker. The LSFs are used to represent the vocal tract transfer function of a particular speaker. Mapping of the intonation patterns (pitch contour) is carried out using a codebook based model at the segmental level. The energy profile of the signal is modified using a fixed scaling factor defined between the source and target speakers at the segmental level. Two different methods for residual modification, residual copying and residual selection, are used to generate the target residual signal. The performance of the ANN and GMM based voice conversion (VC) systems is evaluated using subjective and objective measures. The results indicate that the proposed ANN-based model using the LSF feature set may be used as an alternative to the state-of-the-art GMM-based models used to design voice conversion systems.

15.
Despite recent advances in vision-based gesture recognition, its applications remain largely limited to artificially defined and well-articulated gesture signs used for human-computer interaction. A key reason for this is the low recognition rates for "natural" gesticulation. Previous attempts of using speech cues to reduce error-proneness of visual classification have been mostly limited to keyword-gesture coanalysis. Such scheme inherits complexity and delays associated with natural language processing. This paper offers a novel "signal-level" perspective, where prosodic manifestations in speech and hand kinematics are considered as a basis for coanalyzing loosely coupled modalities. We present a computational framework for improving continuous gesture recognition based on two phenomena that capture voluntary (coarticulation) and involuntary (physiological) contributions of prosodic synchronization. Physiological constraints, manifested as signal interruptions during multimodal production, are exploited in an audiovisual feature integration framework using hidden Markov models. Coarticulation is analyzed using a Bayesian network of naive classifiers to explore alignment of intonationally prominent speech segments and hand kinematics. The efficacy of the proposed approach was demonstrated on a multimodal corpus created from the Weather Channel broadcast. Both schemas were found to contribute uniquely by reducing different error types, which subsequently improves the performance of continuous gesture recognition.

16.
17.
Duration is an important component of prosody and plays a non-negligible role in both the naturalness and the intelligibility of synthesized speech. Duration prediction establishes the correspondence between the prosodic context that affects duration and the segment durations observed in natural speech. This paper introduces the statistical concept of eta-squared to study how prosodic context factors affect duration in Mandarin, designs a residual algorithm to quantitatively analyze the interactions between attributes, and on this basis builds a polynomial regression model for Mandarin duration prediction. Experimental results show that five to six prosodic attributes are essentially sufficient to establish a well-correlated mapping, with a clear advantage over a Wagon regression tree using the same prosodic attributes.
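The eta-squared effect size used to rank context factors is the share of total duration variance explained by grouping the data on one factor (between-group sum of squares over total sum of squares). A minimal sketch on synthetic duration groups:

```python
import numpy as np

rng = np.random.default_rng(8)

# Three synthetic duration groups (e.g. one prosodic factor with three
# levels); means and spreads are placeholders, in arbitrary ms units.
groups = [rng.normal(loc=mu, scale=10, size=100) for mu in (90, 110, 140)]

def eta_squared(groups):
    """eta^2 = SS_between / SS_total for a one-way grouping."""
    all_x = np.concatenate(groups)
    grand = all_x.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_total = ((all_x - grand) ** 2).sum()
    return ss_between / ss_total

print("eta^2:", round(eta_squared(groups), 3))
```

Factors with larger eta-squared explain more duration variance and are the natural candidates for the regression model's five to six attributes.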

18.

In this paper, we propose a hybrid speech enhancement system that exploits deep neural network (DNN) for speech reconstruction and Kalman filtering for further denoising, with the aim to improve performance under unseen noise conditions. Firstly, two separate DNNs are trained to learn the mapping from noisy acoustic features to the clean speech magnitudes and line spectrum frequencies (LSFs), respectively. Then the estimated clean magnitudes are combined with the phase of the noisy speech to reconstruct the estimated clean speech, while the LSFs are converted to linear prediction coefficients (LPCs) to implement Kalman filtering. Finally, the reconstructed speech is Kalman-filtered for further removing the residual noises. The proposed hybrid system takes advantage of both the DNN based reconstruction and traditional Kalman filtering, and can work reliably in either matched or unmatched acoustic environments. Computer based experiments are conducted to evaluate the proposed hybrid system with comparison to traditional iterative Kalman filtering and several state-of-the-art DNN based methods under both seen and unseen noises. It is shown that compared to the DNN based methods, the hybrid system achieves similar performance under seen noise, but notably better performance under unseen noise, in terms of both speech quality and intelligibility.
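The Kalman-filtering stage can be illustrated in isolation: model clean speech as an AR(p) process and filter the noisy observation sample by sample in companion (state-space) form. The AR coefficients below are assumed for the sketch, where the paper derives them from DNN-estimated LSFs converted to LPCs:

```python
import numpy as np

rng = np.random.default_rng(9)

# State = last p speech samples; transition matrix from (assumed) LPCs.
p = 2
a = np.array([1.6, -0.8])            # assumed, stable AR coefficients
F = np.vstack([a, np.eye(p - 1, p)]) # companion-form transition matrix
H = np.zeros(p); H[0] = 1.0          # we observe the newest sample
q, r = 1.0, 4.0                      # process / measurement noise vars

# Generate a clean AR signal and add white measurement noise.
clean = np.zeros(500)
for t in range(p, 500):
    clean[t] = a @ clean[t - p:t][::-1] + rng.normal(scale=np.sqrt(q))
noisy = clean + rng.normal(scale=np.sqrt(r), size=500)

x, P = np.zeros(p), np.eye(p)
est = np.zeros(500)
for t in range(500):
    x = F @ x                                    # predict
    P = F @ P @ F.T + np.diag([q] + [0.0] * (p - 1))
    K = P @ H / (H @ P @ H + r)                  # Kalman gain
    x = x + K * (noisy[t] - H @ x)               # update
    P = P - np.outer(K, H) @ P
    est[t] = x[0]

print("noisy MSE:", round(float(np.mean((noisy - clean) ** 2)), 2))
print("filtered MSE:", round(float(np.mean((est - clean) ** 2)), 2))
```

In the hybrid system this filter runs on the DNN-reconstructed speech rather than the raw noisy signal, with the AR model refreshed per frame from the estimated LSFs.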


19.
Hidden Markov models are used in an experiment to investigate how state occupancy corresponds to prosodic parameters and spectral balance. In order to define separate sub-classes in the data using a maximum likelihood approach, modelling was performed with a single model in which individual states correspond to different categories, without assuming the structure of the data, rather than by manually segmenting the data and modelling each predefined category separately. The results indicate a significant content of segmental information in the prosodic parameters, but the results based on the time alignment of the model states with the feature vectors are in a form that is not directly usable in a recognition environment. The classification of various phonetic categories is particularly consistent for vowels and nasals and is generally better for voiced than unvoiced speech. The classification is also robust to influences of segmental effects on the data, with consistent alignments with segments regardless of the type of neighbouring phonemes.

20.
A Spectral Conversion Approach to Single-Channel Speech Enhancement
In this paper, a novel method for single-channel speech enhancement is proposed, which is based on a spectral conversion feature denoising approach. Spectral conversion has been applied previously in the context of voice conversion, and has been shown to successfully transform spectral features with particular statistical properties into spectral features that best fit (with the constraint of a piecewise linear transformation) different target statistics. This spectral transformation is applied as an initialization step to two well-known single channel enhancement methods, namely the iterative Wiener filter (IWF) and a particular iterative implementation of the Kalman filter. In both cases, spectral conversion is shown here to provide a significant improvement over initializations using the spectral features taken directly from the noisy speech. In essence, the proposed approach allows for applying these two algorithms in a user-centric manner, when "clean" speech training data are available from a particular speaker. The extra step of spectral conversion is shown to offer significant advantages regarding output signal-to-noise ratio (SNR) improvement over the conventional initializations, which can reach 2 dB for the IWF and 6 dB for the Kalman filtering algorithm, for low input SNRs and for white and colored noise, respectively.
