Similar Literature
20 similar documents retrieved.
1.
We are interested in recovering aspects of the vocal tract's geometry and dynamics from speech, a problem referred to as speech inversion. Traditional audio-only speech inversion techniques are inherently ill-posed, since the same speech acoustics can be produced by multiple articulatory configurations. To alleviate the ill-posedness of the audio-only inversion process, we propose an inversion scheme that also exploits visual information from the speaker's face. The complex audiovisual-to-articulatory mapping is approximated by an adaptive piecewise linear model. Model switching is governed by a Markovian discrete process which captures articulatory dynamic information. Each constituent linear mapping is effectively estimated via canonical correlation analysis. In the described multimodal context, we investigate alternative fusion schemes which allow interaction between the audio and visual modalities at various synchronization levels. For facial analysis, we employ active appearance models (AAMs) and demonstrate fully automatic face tracking and visual feature extraction. Using the AAM features in conjunction with audio features such as Mel frequency cepstral coefficients (MFCCs) or line spectral frequencies (LSFs) leads to effective estimation of the trajectories followed by certain points of interest in the speech production system. We report experiments on the QSMT and MOCHA databases, which contain audio, video, and electromagnetic articulography data recorded in parallel. The results show that exploiting both audio and visual modalities in a multistream hidden Markov model based scheme clearly improves performance relative to audio-only or visual-only estimation.
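As a rough illustration of the per-regime mapping described above, the sketch below estimates a single linear audiovisual-to-articulatory mapping with canonical correlation analysis on synthetic, synchronized feature matrices; the piecewise model, the Markovian regime switching and the fusion schemes of the paper are not reproduced, and all array shapes are hypothetical placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical synchronized features; rows are frames.
rng = np.random.default_rng(0)
audio = rng.standard_normal((500, 13))    # e.g. MFCC vectors
visual = rng.standard_normal((500, 20))   # e.g. AAM parameters
artic = rng.standard_normal((500, 6))     # e.g. articulatory point trajectories

X = np.hstack([audio, visual])            # fused audiovisual observation

# CCA finds maximally correlated subspaces of the audiovisual and articulatory data.
cca = CCA(n_components=6)
cca.fit(X, artic)
Xc = cca.transform(X)                     # canonical audiovisual scores

# Least-squares regression from the canonical scores to the articulatory targets.
W, *_ = np.linalg.lstsq(Xc, artic, rcond=None)
pred = Xc @ W                             # estimated articulatory trajectories
print("RMSE on the toy data:", float(np.sqrt(np.mean((pred - artic) ** 2))))
```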

2.
A speech-driven facial animation synthesis method based on a state-asynchronous dynamic Bayesian network model (SA-DBN) is proposed. Perceptual linear prediction features of the audio and active appearance model (AAM) features of the face images in an audio-visual speech database are extracted to train the model parameters. For a given input speech signal, the corresponding optimal AAM feature sequence is learned under the maximum likelihood estimation criterion, and the facial image sequence and facial animation are then synthesized from it. Subjective evaluation of the synthesized facial animations shows that, compared with an audio-visual state-synchronous DBN model, by limiting the maximum degree of asynchrony between the auditory and visual speech states, the SA-DBN produces clear, natural facial animation whose mouth movements are highly consistent with the input speech.

3.
This paper presents a photo-realistic facial animation synthesis approach based on an audio-visual articulatory dynamic Bayesian network model (AF_AVDBN), in which the maximum asynchronies between articulatory features such as the lips, tongue and glottis/velum can be controlled. Perceptual Linear Prediction (PLP) features from the audio speech, as well as active appearance model (AAM) features from the face images of an audio-visual continuous speech database, are adopted to train the AF_AVDBN model parameters. Based on the trained model, given an input audio speech signal, the optimal AAM visual features are estimated via a maximum likelihood estimation (MLE) criterion and are then used to construct the face images for the animation. In our experiments, facial animations are synthesized for 20 continuous audio speech sentences using the proposed AF_AVDBN model as well as two state-of-the-art methods: the audio-visual state-synchronous DBN model (SS_DBN), which implements a multi-stream hidden Markov model, and the state-asynchronous DBN model (SA_DBN). Objective evaluations on the learned AAM features show that much more accurate visual features can be learned with the AF_AVDBN model. Subjective evaluations show that the facial animations synthesized with AF_AVDBN are better than those obtained with the state-based SA_DBN and SS_DBN models in overall naturalness and in how accurately the mouth movements match the speech content.

4.
This article uses prolonged oral reading corpora in a series of experiments to analyze and detect vocal fatigue. Vocal fatigue particularly concerns voice professionals, including teachers, telemarketing operators, users of automatic speech recognition technology and actors. In analyzing and detecting vocal fatigue, we focus on three main experiments: a prosodic analysis that can be compared with results from related work; a two-class Support Vector Machine (SVM) classifier that separates Fatigue from Non-Fatigue states using a large set of audio features; and a comparison function that estimates the difference in fatigue level between two speech segments using a combination of multiple phoneme-based comparison functions. The prosodic analysis showed that vocal fatigue was not associated with an increase in fundamental frequency or voice intensity. A two-class SVM classifier using the Paralinguistic Challenge 2010 audio feature set gave an unweighted accuracy of 94.1% on the training set (10-fold cross-validation) and 68.2% on the test set. These results show that the phenomenon of vocal fatigue can be modeled and detected. The comparison function was assessed by detecting increased fatigue levels between two speech segments. The fatigue level detection performance in Equal Error Rate (EER) was 31% using all phonetic segments, 21% after filtering phonetic segments, and 19% after filtering both phonetic segments and cepstral features. These results show that some phonemes are more sensitive than others to vocal fatigue. Overall, these experiments show that the fatigued voice has specific characteristics in prolonged oral reading and suggest the feasibility of vocal fatigue detection.
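The classification experiment lends itself to a compact sketch: a two-class SVM evaluated with 10-fold cross-validation, here on random placeholder vectors rather than the Paralinguistic Challenge 2010 feature set used in the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: one feature vector per speech segment, label 1 = Fatigue.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 384))   # placeholder for a large acoustic feature set
y = rng.integers(0, 2, size=200)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy")
print(f"10-fold unweighted (balanced) accuracy: {scores.mean():.3f}")
```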

5.
This paper presents an articulatory modelling approach to convert acoustic speech into realistic mouth animation. We directly model the movements of articulators, such as the lips, tongue, and teeth, using a dynamic Bayesian network (DBN)-based audio-visual articulatory model (AVAM). A multiple-stream structure with a shared articulator layer is adopted in the model to synchronously associate the two building blocks of speech, i.e., audio and video. This model not only describes the synchronization between visual articulatory movements and audio speech, but also reflects the linguistic fact that different articulators evolve asynchronously. We also present a Baum-Welch DBN inversion (DBNI) algorithm to generate optimal facial parameters from audio, given the trained AVAM, under the maximum likelihood (ML) criterion. Extensive objective and subjective evaluations on the JEWEL audio-visual dataset demonstrate that, compared with phonemic HMM approaches, the facial parameters estimated by our approach follow the true parameters more accurately, and the synthesized facial animation sequences are lively enough that 38% of them are indistinguishable from the real ones.

6.
In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with a shape and appearance similar to the original talker's and used an error minimization procedure to drive the animated version of the talker so that it matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio, we then compared the animated talker against the real talker and the audio alone in terms of audio-visual word recognition rate across the three production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition, we created two incongruent conditions in which normal speech audio was paired with animated Lombard speech or whispering. Compared to the congruent normal speech condition, the Lombard animation yields a significant increase in intelligibility despite the AV incongruence. In a separate evaluation, we gathered subjective opinions on the different animations and found that some degree of incongruence was generally accepted.

7.
Animating expressive faces across languages
This paper describes a morphing-based audio-driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. We present a novel scheme for implementing a language-independent audio-driven facial animation system given a speech recognition system for just one language, in our case English. The method can also be used for text-to-audio-visual speech synthesis. Visemes in new expressions are synthesized so that animations with different facial expressions can be generated. Given an incoming audio stream and still pictures of a face representing different visemes, an animation sequence is constructed using optical flow between visemes. The presented techniques give improved lip synchronization and naturalness to the animated video.
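For intuition, the snippet below sketches one way to interpolate between two viseme images with dense optical flow and a cross-fade, assuming OpenCV; it illustrates the general flow-based morphing idea rather than the authors' exact pipeline, and the Farneback parameters are arbitrary.

```python
import cv2
import numpy as np

def morph_visemes(img_a, img_b, alpha):
    """Warp viseme image A a fraction alpha of the way toward viseme image B
    using dense optical flow, then cross-fade with B."""
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    # Flow from B to A, so backward-warping A along alpha*flow moves it toward B.
    flow = cv2.calcOpticalFlowFarneback(gray_b, gray_a, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + alpha * flow[..., 0]).astype(np.float32)
    map_y = (grid_y + alpha * flow[..., 1]).astype(np.float32)
    warped = cv2.remap(img_a, map_x, map_y, cv2.INTER_LINEAR)
    return cv2.addWeighted(warped, 1.0 - alpha, img_b, alpha, 0)

# Usage (hypothetical frames): mid = morph_visemes(viseme_open, viseme_closed, 0.5)
```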

8.
We propose a coupled hidden Markov model (CHMM) approach to video-realistic speech animation, which realizes realistic facial animations driven by speaker-independent continuous speech. Different from hidden Markov model (HMM)-based animation approaches that use a single-state chain, we use CHMMs to explicitly model the subtle characteristics of audio-visual speech, e.g., the asynchrony, temporal dependency (synchrony), and different speech classes between the two modalities. We derive an expectation maximization (EM)-based A/V conversion algorithm for the CHMMs, which converts acoustic speech into decent facial animation parameters. We also present a video-realistic speech animation system. The system transforms the facial animation parameters to a mouth animation sequence, refines the animation with a performance refinement process, and finally stitches the animated mouth with a background facial sequence seamlessly. We have compared the animation performance of the CHMM with the HMMs, the multi-stream HMMs and the factorial HMMs both objectively and subjectively. Results show that the CHMMs achieve superior animation performance. The ph-vi-CHMM system, which adopts different state variables (phoneme states and viseme states) in the audio and visual modalities, performs the best. The proposed approach indicates that explicitly modelling audio-visual speech is promising for speech animation.

9.
Looking at the speaker's face can help a listener hear a speech signal in a noisy environment and extract it from competing sources before identification. This suggests that the visual signals of speech (movements of the visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm that plugs the audiovisual coherence of speech signals, estimated with statistical tools, into audio blind source separation (BSS) techniques. The algorithm is applied to the difficult and realistic case of convolutive mixtures. It works mainly in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture in each frequency channel. Frequency-by-frequency separation is performed by an audio BSS algorithm. The audio and visual information is modeled by a newly proposed statistical model, which is then used to solve the standard source permutation and scale factor ambiguities encountered in each frequency bin after the audio blind separation stage. The proposed method is shown to be efficient in the case of 2 × 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures.
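The step that the audiovisual model is responsible for, resolving the per-bin permutation left by frequency-wise separation, can be caricatured as follows. A simple correlation with a video-derived activity envelope stands in for the paper's statistical audiovisual model, and the per-bin BSS stage and the scale correction are assumed to have been run already; all names and shapes are illustrative.

```python
import numpy as np

def resolve_permutations(bin_sources, video_envelope):
    """bin_sources: (n_bins, 2, n_frames) complex STFT components produced by a
    per-bin separation stage, with arbitrary ordering in each bin.
    video_envelope: (n_frames,) activity cue for the target speaker derived from
    lip movements. Returns the components reordered so that the one best matching
    the video cue sits consistently in slot 0."""
    aligned = np.empty_like(bin_sources)
    v = (video_envelope - video_envelope.mean()) / (video_envelope.std() + 1e-9)
    for k, comps in enumerate(bin_sources):
        mags = np.abs(comps)                                   # (2, n_frames)
        mags = (mags - mags.mean(1, keepdims=True)) / (mags.std(1, keepdims=True) + 1e-9)
        corr = mags @ v / len(v)                               # correlation with the cue
        order = [0, 1] if corr[0] >= corr[1] else [1, 0]
        aligned[k] = comps[order]
    return aligned
```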

10.
Content-based audio classification and segmentation is a basis for further audio/video analysis. In this paper, we present our work on audio segmentation and classification employing support vector machines (SVMs). Five audio classes are considered: silence, music, background sound, pure speech, and non-pure speech, which includes speech over music and speech over noise. A sound stream is segmented by classifying each sub-segment into one of these five classes. We have evaluated the performance of the SVM on the classification of different audio type-pairs with testing units of different lengths, and compared the performance of the SVM with K-Nearest Neighbor (KNN) and Gaussian Mixture Model (GMM) classifiers. We also evaluated the effectiveness of some newly proposed features. Experiments on a database of about 4 hours of audio data show that the proposed classifier is very effective for audio classification and segmentation, and that the accuracy of the SVM-based method is much better than that of the KNN- and GMM-based methods.
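The segmentation-by-classification step can be sketched as below: classify fixed-length sub-segments with a trained classifier and merge runs of identical labels into segments. The window length, class list and the `feature_fn`/`clf` objects are illustrative assumptions, not the paper's exact setup.

```python
CLASSES = ["silence", "music", "background", "pure_speech", "non_pure_speech"]

def segment_stream(audio, sr, feature_fn, clf, win_s=1.0):
    """Classify consecutive sub-segments and merge equal-label runs.
    `feature_fn` maps a waveform chunk to a NumPy feature vector and
    `clf` is a trained scikit-learn classifier (e.g. an SVM)."""
    hop = int(win_s * sr)
    labels = []
    for start in range(0, len(audio) - hop + 1, hop):
        feats = feature_fn(audio[start:start + hop])
        labels.append(int(clf.predict(feats.reshape(1, -1))[0]))
    # Merge consecutive identical labels into (start_s, end_s, class) segments.
    segments, seg_start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[seg_start]:
            segments.append((seg_start * win_s, i * win_s, CLASSES[labels[seg_start]]))
            seg_start = i
    return segments
```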

11.
Research on an Agent-Based Human-Machine Affective Interaction System
Building on software Agent technology, this paper proposes a Multi-Agent System architecture that can both recognize the emotions of a real person and express the individualized emotions of a virtual human. In the system's perception Agent, a hidden Markov model (HMM) is used to recognize six basic emotional states from facial expressions (vision) and speech (audio). Experimental results show that the multimodal recognition algorithm based on vision and hearing improves the computer's emotion recognition rate. Finally, the paper argues that Agent-oriented technology is an effective way to develop human-machine affective interaction systems.

12.
Using speech to drive facial animation is an important intelligent technology in fields such as virtual reality (VR), and the rapid development of VR in recent years has further highlighted the urgent need for natural human-computer communication in immersive environments. Speech-driven facial animation can create natural, vivid and emotionally expressive animation, and compared with traditionally pre-scripted facial animation it can better support human-computer interaction and improve the user experience. To advance the intelligence and application of this technology, this survey focuses on the key problem of speech-driven facial animation, audio-visual mapping, and reviews frame-by-frame, multi-frame and phoneme-level mapping methods. It also summarizes the ideas behind various face models, methods for animation synthesis, emotion fusion and facial animation evaluation, and possible directions for future research.

13.
In this paper, we demonstrate that use of a recently proposed feature set, termed Maximum Auto-Correlation Values, which utilizes information from the source part of the speech signal, significantly improves the robustness of a text-independent identity verification system. We also propose an adaptive fusion technique for integration of audio and visual information in a multi-modal verification system. The proposed technique explicitly measures the quality of the speech signal, adjusting the amount of contribution of the speech modality to the final verification decision. Results on the VidTIMIT database indicate that the proposed approach outperforms existing adaptive and non-adaptive fusion techniques. For a wide range of audio SNRs, the performance of the multi-modal system utilizing the proposed technique is always found to be better than the performance of the face modality.
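A minimal sketch of quality-adaptive score fusion is given below, assuming verification scores in a common range and an SNR estimate for the utterance; the sigmoid quality-to-weight mapping and its parameters are illustrative placeholders rather than the quality measure proposed in the paper.

```python
import numpy as np

def adaptive_fusion(audio_score, face_score, est_snr_db, snr_mid=10.0, slope=0.5):
    """Weighted-sum fusion in which the audio weight shrinks as the estimated
    SNR drops, so a degraded speech signal contributes less to the decision."""
    w_audio = 0.5 / (1.0 + np.exp(-slope * (est_snr_db - snr_mid)))  # in (0, 0.5)
    return (1.0 - w_audio) * face_score + w_audio * audio_score

print(adaptive_fusion(0.9, 0.6, est_snr_db=25))   # clean audio: speech trusted
print(adaptive_fusion(0.9, 0.6, est_snr_db=-5))   # noisy audio: mostly face
```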

14.
With the acceleration of the pace of work and life, people face increasing pressure, which raises the probability of suffering from depression. However, many patients may fail to get a timely diagnosis because of the serious imbalance in the doctor–patient ratio worldwide. A promising development is that physiological and psychological studies have found differences in speech and facial expression between patients with depression and healthy individuals. Consequently, to improve current medical care, Deep Learning (DL) has been used to extract representations of depression cues from audio and video for automatic depression detection. To classify and summarize such research, we introduce the databases and describe objective markers for automatic depression estimation. We then review the DL methods used to extract depression representations from audio and video for automatic detection. Lastly, we discuss challenges and promising directions related to the automatic diagnosis of depression using DL.

15.
Audio watermarking and signatures are widely used for authentication. However, these techniques are powerless in many practical situations because they require additional side information; audio forensic techniques are therefore necessary for digital audio. In this paper, we propose an audio forensics scheme to detect and locate forgery operations on speech audio in the time domain (including deletion, insertion, substitution and splicing) by performing discrete wavelet packet decomposition and analyzing the singularity points of the audio signal. We first analyze the forgery operations and find that they often generate new singular points, because the correlation between the samples close to the tampering position is weakened or broken. We then use wavelet-packet-based singularity analysis and design five parameters (which differ with the sample rate of the digital audio file) to build an approach that can detect and locate audio forgeries in the time domain. Finally, extensive experimental results demonstrate that the proposed method can identify whether a given speech file has previously been tampered with (e.g., part of the content deleted or replaced) and can further locate the forged positions in the time domain.
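The core detection idea can be sketched with a wavelet packet decomposition and a simple outlier test on a high-frequency detail node; the threshold, band choice and coefficient-to-time mapping below are illustrative stand-ins for the five tuned parameters described in the abstract.

```python
import numpy as np
import pywt

def candidate_splice_points(signal, sr, wavelet="db4", level=3, thresh_k=6.0):
    """Return times (in seconds) where high-frequency wavelet-packet coefficients
    are unusually large, a rough proxy for the singular points that deletion,
    insertion or splicing tends to introduce."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    coeffs = np.asarray(wp["d" * level].data)        # a high-frequency detail node
    robust_scale = np.median(np.abs(coeffs)) / 0.6745 + 1e-12
    hits = np.nonzero(np.abs(coeffs) > thresh_k * robust_scale)[0]
    # Each level-L coefficient spans roughly 2**L input samples.
    return hits * (2 ** level) / sr
```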

16.
In this paper, we present human emotion recognition systems based on audio and spatio-temporal visual features. The proposed system has been tested on an audio-visual emotion data set with different subjects of both genders. Mel-frequency cepstral coefficient (MFCC) and prosodic features are first identified and then extracted from the emotional speech. For facial expressions, spatio-temporal features are extracted from the visual streams. Principal component analysis (PCA) is applied for dimensionality reduction of the visual features, capturing 97% of the variance. A codebook is constructed for both the audio and the visual features in Euclidean space, and the histograms of codeword occurrences are then fed to state-of-the-art SVM classifiers to obtain the judgment of each classifier. The judgments from the individual classifiers are combined using the Bayes sum rule (BSR) as a final decision step. The proposed system is tested on a public data set to recognize human emotions. Experimental results and simulations show that using visual features alone yields an average accuracy of 74.15%, while using audio features alone gives an average recognition accuracy of 67.39%, whereas combining both audio and visual features improves the overall system accuracy significantly, up to 80.27%.
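The codebook-and-histogram stage reads naturally as code; the sketch below uses k-means clustering in Euclidean space and normalized occurrence histograms, with an arbitrarily chosen codebook size, and leaves out the per-modality SVMs and the Bayes sum rule combination.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptor_list, n_words=64, seed=0):
    """Cluster all local descriptors (audio or visual) into a Euclidean codebook;
    descriptor_list is a list of (n_i, d) arrays, one per clip."""
    kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=seed)
    kmeans.fit(np.vstack(descriptor_list))
    return kmeans

def occurrence_histogram(kmeans, descriptors):
    """Quantize a clip's descriptors against the codebook and return the
    normalized histogram of codeword occurrences fed to the classifier."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)
```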

17.
The audio stream in video data carries rich semantic information, and the analysis of audio information is an integral part of content-based video retrieval. This paper focuses on content-based audio scene segmentation. It analyzes various audio features and their extraction methods and, on this basis, proposes a new method for segmenting audio streams: the audio stream in video data is segmented into audio scenes according to the features of six audio types (speech, music, silence, environmental sound, pure speech, speech over a music background, and speech over an environmental-sound background). Experiments show that the method is effective: while maintaining adequate segmentation precision, it achieves considerable improvements in both precision and recall.

18.
Multi-modal emotion recognition lacks an explicit mapping between emotional states and audio/image features, so extracting effective emotion information from audio-visual data remains a challenging issue. In addition, noise and data redundancy are not modeled well, so emotion recognition models often suffer from low efficiency. Deep neural networks (DNNs) perform excellently at feature extraction and highly non-linear feature fusion, and cross-modal noise modeling has great potential for addressing data pollution and data redundancy. Inspired by this, our paper proposes a deep weighted fusion method for audio-visual emotion recognition. First, we perform cross-modal noise modeling for the audio and video data, which removes most of the data pollution in the audio channel and the data redundancy in the visual channel. The noise modeling is implemented by voice activity detection (VAD), and the data redundancy in the visual data is reduced by aligning the speech region in the audio and visual data. We then extract audio emotion features and visual expression features with two feature extractors. The audio emotion feature extractor, audio-net, is a 2D CNN that accepts image-based Mel-spectrograms as input. The facial expression feature extractor, visual-net, is a 3D CNN fed with facial expression image sequences. To train the two convolutional neural networks efficiently on a small data set, we adopt a transfer learning strategy. Next, we employ a deep belief network (DBN) for highly non-linear fusion of the multi-modal emotion features. The feature extractors and the fusion network are trained synchronously, and the final emotion classification is produced by a support vector machine operating on the output of the fusion network. By taking cross-modal feature fusion, denoising and redundancy removal into account, our fusion method shows excellent performance on the selected data set.
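The image-like input to the audio CNN can be produced as follows, assuming librosa; the sample rate, mel-band count and hop length are arbitrary placeholders rather than the configuration used by the authors.

```python
import numpy as np
import librosa

def mel_spectrogram_image(path, sr=16000, n_mels=64, hop_length=256):
    """Convert an utterance into a log-scaled mel-spectrogram normalized to
    [0, 1], so it can be treated as a single-channel image by a 2D CNN."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-12)
```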

19.
A bimodal speech recognition method based on product HMMs
For speech recognition in noisy environments, a product hidden Markov model (HMM) for bimodal speech recognition is proposed. On the basis of independently trained audio and video HMMs, a two-dimensional model is built to characterize the asynchrony between the audio stream and the video stream. A weighting coefficient is introduced to adaptively adjust the weights of the audio and video streams according to the noise environment. Experimental results show that, compared with other bimodal speech recognition methods, the proposed method achieves higher recognition performance.
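The adaptive stream weighting can be sketched as a weighted combination of per-stream log-likelihoods, with the weight driven by an SNR estimate; the linear SNR-to-weight mapping and the numbers below are illustrative placeholders for the paper's adaptation scheme.

```python
import numpy as np

def weighted_bimodal_score(log_lik_audio, log_lik_video, snr_db,
                           snr_lo=-5.0, snr_hi=20.0):
    """Combine audio and video log-likelihoods with a stream weight that grows
    with the estimated SNR, so the audio stream dominates in clean conditions
    and the video stream dominates in heavy noise."""
    lam = np.clip((snr_db - snr_lo) / (snr_hi - snr_lo), 0.0, 1.0)
    return lam * log_lik_audio + (1.0 - lam) * log_lik_video

print(weighted_bimodal_score(-120.0, -150.0, snr_db=25))   # audio favored
print(weighted_bimodal_score(-400.0, -150.0, snr_db=-10))  # video favored
```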

20.
Visual speech information plays an important role in automatic speech recognition (ASR), especially when the audio is corrupted or even inaccessible. Despite the success of audio-based ASR, the problem of visual speech decoding remains widely open. This paper provides a detailed review of recent advances in this research area. In comparison with the previous survey [97], which covers the whole ASR system that uses visual speech information, we focus on the important questions asked by researchers and summarize the recent studies that attempt to answer them. In particular, three questions relate to the extraction of visual features, concerning speaker dependency, pose variation and temporal information, respectively. Another question concerns audio-visual speech fusion, considering the dynamic changes of modality reliabilities encountered in practice. In addition, the state of the art in facial landmark localization is briefly introduced; these advanced techniques can be used to improve region-of-interest detection but have been largely ignored when building visual-based ASR systems. We also provide details of audio-visual speech databases. Finally, we discuss the remaining challenges and offer our insights into future research on visual speech decoding.
