期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

陈墨郭雷《计算机工程与应用》2018,54(9):1-4

情感标签标注是情感计算中的一个重要领域。该领域中针对音频、图像和多媒体内容的情感标签标注已有多个相关工作发表。为分析某个基于脑电图的大脑编码的多媒体情感标签标注中音频信号的重要性,情感计算公开数据库DEAP被用作测试基准。基于DEAP数据库的多媒体刺激,共提取了音频特征和三类视频特征。首先仅使用视频特征基于该框架进行多媒体标签标注任务,之后联合使用音频和视频特征进行同样的工作。实验结果表明,与仅使用视频特征的结果相比,联合使用音视频特征可以提高标注准确率,并且没有因为增加特征维数造成性能损失。相似文献

2.

Video-based emotion recognition in the wild using deep transfer learning and score fusion

《Image and vision computing》2017

相似文献

3.

A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions 总被引：3，自引：0，他引：3

Zeng Zhihong Pantic Maja Roisman Glenn I. Huang Thomas S. 《IEEE transactions on pattern analysis and machine intelligence》2009,31(1):39-58

Automated analysis of human affective behavior has attracted increasing attention from researchers in psychology, computer science, linguistics, neuroscience, and related disciplines. However, the existing methods typically handle only deliberately displayed and exaggerated expressions of prototypical emotions despite the fact that deliberate behaviour differs in visual appearance, audio profile, and timing from spontaneously occurring behaviour. To address this problem, efforts to develop algorithms that can process naturally occurring human affective behaviour have recently emerged. Moreover, an increasing number of efforts are reported toward multimodal fusion for human affect analysis including audiovisual fusion, linguistic and paralinguistic fusion, and multi-cue visual fusion based on facial expressions, head movements, and body gestures. This paper introduces and surveys these recent advances. We first discuss human emotion perception from a psychological perspective. Next we examine available approaches to solving the problem of machine understanding of human affective behavior, and discuss important issues like the collection and availability of training and test data. We finally outline some of the scientific and engineering challenges to advancing human affect sensing technology. 相似文献

4.

Realistic emotion visualization by combining facial animation and hairstyle synthesis

Jun Yu Lingyan Li Jie Zou 《Multimedia Tools and Applications》2017,76(13):14905-14919

Facial expressions are one of most intuitive way for expressing emotions, and can facilitate human-computer interaction by enabling users to communicate with computers using more natural ways. Besides, the hair can be designed to enhance the expression of emotions particularly. To visualize the emotions in multiple aspects for completeness, we propose a realistic visual emotional synthesis system based on the combination of facial expression and hairstyle in this paper. Firstly, facial expression is synthesized by the anatomical model and parameterized model. Secondly, cartoonish hairstyle is synthesized to describe emotion implicitly by the mass-spring model and cantilever beam model. Finally, the synthesis results of facial expression and hairstyle are combined to produce a complete visual emotion synthesis result. Experiment results demonstrate the proposed system can synthesize realistic animation, and the emotion expressiveness by combining of face and hair outperform that by face or hair alone. 相似文献

5.

Audio-visual emotion fusion (AVEF): A deep efficient weighted approach

《Information Fusion》2019

The multi-modal emotion recognition lacks the explicit mapping relation between emotion state and audio and image features, so extracting the effective emotion information from the audio/visual data is always a challenging issue. In addition, the modeling of noise and data redundancy is not solved well, so that the emotion recognition model is often confronted with the problem of low efficiency. The deep neural network (DNN) performs excellently in the aspects of feature extraction and highly non-linear feature fusion, and the cross-modal noise modeling has great potential in solving the data pollution and data redundancy. Inspired by these, our paper proposes a deep weighted fusion method for audio-visual emotion recognition. Firstly, we conduct the cross-modal noise modeling for the audio and video data, which eliminates most of the data pollution in the audio channel and the data redundancy in visual channel. The noise modeling is implemented by the voice activity detection(VAD), and the data redundancy in the visual data is solved through aligning the speech area both in audio and visual data. Then, we extract the audio emotion features and visual expression features via two feature extractors. The audio emotion feature extractor, audio-net, is a 2D CNN, which accepting the image-based Mel-spectrograms as input data. On the other hand, the facial expression feature extractor, visual-net, is a 3D CNN to which facial expression image sequence is feeded. To train the two convolutional neural networks on the small data set efficiently, we adopt the strategy of transfer learning. Next, we employ the deep belief network(DBN) for highly non-linear fusion of multi-modal emotion features. We train the feature extractors and the fusion network synchronously. And finally the emotion classification is obtained by the support vector machine using the output of the fusion network. With consideration of cross-modal feature fusion, denoising and redundancy removing, our fusion method show excellent performance on the selected data set. 相似文献

6.

Expressions Recognition of North-East Indian (NEI) Faces

Priya Saha Mrinal Kanti Bhowmik Debotosh Bhattacharjee Barin Kumar De Mita Nasipuri 《Multimedia Tools and Applications》2016,75(24):16781-16807

Facial expression is one of the major distracting factors for face recognition performance. Pose and illumination variations on face images also influence the performance of face recognition systems. The combination of three variations (facial expression, pose and illumination) seriously degrades the recognition accuracy. In this paper, three experimental protocols are designed in such a way that the successive performance degradation due to the increasing variations (expressions, expressions with illumination effect and expressions with illumination and pose effect) on face images can be examined. The whole experiment is carried out using North-East Indian (NEI) face images with the help of four well-known classification algorithms namely Linear Discriminant Analysis (LDA), K-Nearest Neighbor algorithm (KNN), combination of Principal Component Analysis and Linear Discriminant Analysis (PCA + LDA), combination of Principal Component Analysis and K-Nearest Neighbor algorithm (PCA + KNN). The experimental observations are analyzed through confusion matrices and graphs. This paper also describes the creation of NEI facial expression database, which contains visual static face images of different ethnic groups of the North-East states. The database is useful for future researchers in the area of forensic science, medical applications, affective computing, intelligent environments, lie detection, psychiatry, anthropology, etc. 相似文献

7.

基于人脸表情特征的情感交互系统* 总被引：1，自引：1，他引：0

徐红彭力《计算机应用研究》2012,29(3):1111-1115

设计了一套基于人脸表情特征的情感交互系统(情感虚拟人),关键技术分别为情感识别、情感计算、情感合成与输出三个方面。情感识别部分首先采用特征块的方法对面部静态表情图形进行预处理,然后利用二维主元分析(2DPCA)提取特征,最后利用多级量子神经网络分类器实现七类表情识别分类;在情感计算部分建立了隐马尔可夫情感模型(HMM),并且用改进的遗传算法估计模型中的参数;在情感合成与输出阶段,首先采用NURBS曲面和面片相结合的算法,建立人脸三维网格模型,然后采用关键帧技术,实现了符合人类行为规律的连续表情动画。最后完成了基于人脸表情特征的情感交互系统的设计。相似文献

8.

Recognizing Human Emotional State From Audiovisual Signals 总被引：1，自引：0，他引：1

Yongjin Wang Ling Guan 《Multimedia, IEEE Transactions on》2008,10(4):659-668

Machine recognition of human emotional state is an important component for efficient human-computer interaction. The majority of existing works address this problem by utilizing audio signals alone, or visual information only. In this paper, we explore a systematic approach for recognition of human emotional state from audiovisual signals. The audio characteristics of emotional speech are represented by the extracted prosodic, Mel-frequency Cepstral Coefficient (MFCC), and formant frequency features. A face detection scheme based on HSV color model is used to detect the face from the background. The visual information is represented by Gabor wavelet features. We perform feature selection by using a stepwise method based on Mahalanobis distance. The selected audiovisual features are used to classify the data into their corresponding emotions. Based on a comparative study of different classification algorithms and specific characteristics of individual emotion, a novel multiclassifier scheme is proposed to boost the recognition performance. The feasibility of the proposed system is tested over a database that incorporates human subjects from different languages and cultural backgrounds. Experimental results demonstrate the effectiveness of the proposed system. The multiclassifier scheme achieves the best overall recognition rate of 82.14%. 相似文献

9.

Human emotion recognition from videos using spatio-temporal and audio features

Munaf Rashid S. A. R. Abu-Bakar Musa Mokji 《The Visual computer》2013,29(12):1269-1275

In this paper, we present human emotion recognition systems based on audio and spatio-temporal visual features. The proposed system has been tested on audio visual emotion data set with different subjects for both genders. The mel-frequency cepstral coefficient (MFCC) and prosodic features are first identified and then extracted from emotional speech. For facial expressions spatio-temporal features are extracted from visual streams. Principal component analysis (PCA) is applied for dimensionality reduction of the visual features and capturing 97 % of variances. Codebook is constructed for both audio and visual features using Euclidean space. Then occurrences of the histograms are employed as input to the state-of-the-art SVM classifier to realize the judgment of each classifier. Moreover, the judgments from each classifier are combined using Bayes sum rule (BSR) as a final decision step. The proposed system is tested on public data set to recognize the human emotions. Experimental results and simulations proved that using visual features only yields on average 74.15 % accuracy, while using audio features only gives recognition average accuracy of 67.39 %. Whereas by combining both audio and visual features, the overall system accuracy has been significantly improved up to 80.27 %. 相似文献

10.

融合背景上下文特征的视觉情感识别与预测方法

冯月华魏若岩朱晓庆《计算机应用研究》2024,41(5)

为解决基于视觉的情感识别无法捕捉人物所处环境和与周围人物互动对情感识别的影响、单一情感种类无法更丰富地描述人物情感、无法对未来情感进行合理预测的问题,提出了融合背景上下文特征的视觉情感识别与预测方法。该方法由融合背景上下文特征的情感识别模型（Context-ER）和基于GRU与Valence-Arousal连续情感维度的情感预测模型（GRU-mapVA）组成。Context-ER同时综合了面部表情、身体姿态和背景上下文（所处环境、与周围人物互动行为）特征,进行26种离散情感类别的多标签分类和3个连续情感维度的回归。GRU-mapVA根据所提映射规则将Valence-Arousal的预测值投影到改进的Valence-Arousal模型上,使得情感预测类间差异更为明显。Context-ER在Emotic数据集上进行了测试,结果表明,识别情感的平均精确率比现有最优方法提高4%以上;GRU-mapVA在三段视频样本上进行了测试,结果表明情感预测效果相较于现有方法有很大提升。相似文献

11.

A systematic review on affective computing: emotion models,databases, and recent advances

《Information Fusion》2022

Affective computing conjoins the research topics of emotion recognition and sentiment analysis, and can be realized with unimodal or multimodal data, consisting primarily of physical information (e.g., text, audio, and visual) and physiological signals (e.g., EEG and ECG). Physical-based affect recognition caters to more researchers due to the availability of multiple public databases, but it is challenging to reveal one's inner emotion hidden purposefully from facial expressions, audio tones, body gestures, etc. Physiological signals can generate more precise and reliable emotional results; yet, the difficulty in acquiring these signals hinders their practical application. Besides, by fusing physical information and physiological signals, useful features of emotional states can be obtained to enhance the performance of affective computing models. While existing reviews focus on one specific aspect of affective computing, we provide a systematical survey of important components: emotion models, databases, and recent advances. Firstly, we introduce two typical emotion models followed by five kinds of commonly used databases for affective computing. Next, we survey and taxonomize state-of-the-art unimodal affect recognition and multimodal affective analysis in terms of their detailed architectures and performances. Finally, we discuss some critical aspects of affective computing and its applications and conclude this review by pointing out some of the most promising future directions, such as the establishment of benchmark database and fusion strategies. The overarching goal of this systematic review is to help academic and industrial researchers understand the recent advances as well as new developments in this fast-paced, high-impact domain. 相似文献

12.

Towards modeling embodied conversational agent character profiles using appraisal theory predictions in expression synthesis 总被引：2，自引：2，他引：0

L. Malatesta A. Raouzaiou K. Karpouzis S. Kollias 《Applied Intelligence》2009,30(1):58-64

Appraisal theories in psychology study facial expressions in order to deduct information regarding the underlying emotion elicitation processes. Scherer’s component process model provides predictions regarding particular face muscle deformations that are attributed as reactions to the cognitive appraisal stimuli in the study of emotion episodes. In the current work, MPEG-4 facial animation parameters are used in order to evaluate these theoretical predictions for intermediate and final expressions of a given emotion episode. We manipulate parameters such as intensity and temporal evolution of synthesized facial expressions. In emotion episodes originating from identical stimuli, by varying the cognitive appraisals of the stimuli and mapping them to different expression intensities and timings, various behavioral patterns can be generated and thus different agent character profiles can be defined. The results of the synthesis process are consequently applied to Embodied Conversational Agents (ECAs), aiming to render their interaction with humans, or other ECAs, more affective. 相似文献

13.

Bi-modal emotion recognition from expressive face and body gestures

Hatice Gunes Massimo Piccardi 《Journal of Network and Computer Applications》2007,30(4):1334-1345

Psychological research findings suggest that humans rely on the combined visual channels of face and body more than any other channel when they make judgments about human communicative behavior. However, most of the existing systems attempting to analyze the human nonverbal behavior are mono-modal and focus only on the face. Research that aims to integrate gestures as an expression mean has only recently emerged. Accordingly, this paper presents an approach to automatic visual recognition of expressive face and upper-body gestures from video sequences suitable for use in a vision-based affective multi-modal framework. Face and body movements are captured simultaneously using two separate cameras. For each video sequence single expressive frames both from face and body are selected manually for analysis and recognition of emotions. Firstly, individual classifiers are trained from individual modalities. Secondly, we fuse facial expression and affective body gesture information at the feature and at the decision level. In the experiments performed, the emotion classification using the two modalities achieved a better recognition accuracy outperforming classification using the individual facial or bodily modality alone. 相似文献

14.

基于人工心理理论的情感模型构建方法研究 总被引：2，自引：1，他引：2

谷学静王志良魏哲华王超《微计算机信息》2006,22(5):264-266

人类的智能不仅表现为正常的理性思维和逻辑推理能力,也应表现为正常的情感能力。本文在阐述情感交互系统结构和情感特性的基础上,提出了一种基于HMM的情感模型,并采用情绪熵与情感熵的概念约束模型的初始值,从而使该模型适应不同的性格特征。相似文献

15.

基于情感识别的智能教学系统研究 总被引：1，自引：0，他引：1

吴彦文刘伟张昆明《计算机工程与设计》2008,29(9):2350-2352

针对传统的智能教学系统(ITS)在情感方面的缺失,提出了基于情感识别技术的ITS模型.该系统模型在传统的教学系统上新增情感识别模块,利用人脸表情识别以及文本识别等技术所构建,可以获取和识别学生的学习情感,并根据学习情感进行相应的情感激励策略,实现情感化的教学. 相似文献

16.

基于三流DBN模型的听视觉情感识别

下载免费PDF全文

吕兰兰蒋冬梅王风娜 Hichem Sahli Werner Verhelst 《计算机工程》2012,38(5):161-162,166

为更好地对听视觉情感信息之间的关联关系进行建模,提出一种三流混合动态贝叶斯网络情感识别模型(T_AsyDBN)。采用MFCC特征及基于基频和短时能量的局域韵律特征作为听觉输入流,在状态层同步。将面部几何特征和面部动作参数特征作为视觉输入流,与听觉输入流在状态层异步。实验结果表明,该模型优于有状态异步约束的听视觉双流DBN模型,6种情感的平均识别率从 52.14%提高到63.71%。相似文献

17.

Speech driven photo realistic facial animation based on an articulatory DBN model and AAM features

Dongmei Jiang Yong Zhao Hichem Sahli Yanning Zhang 《Multimedia Tools and Applications》2014,73(1):397-415

This paper presents a photo realistic facial animation synthesis approach based on an audio visual articulatory dynamic Bayesian network model (AF_AVDBN), in which the maximum asynchronies between the articulatory features, such as lips, tongue and glottis/velum, can be controlled. Perceptual Linear Prediction (PLP) features from audio speech, as well as active appearance model (AAM) features from face images of an audio visual continuous speech database, are adopted to train the AF_AVDBN model parameters. Based on the trained model, given an input audio speech, the optimal AAM visual features are estimated via a maximum likelihood estimation (MLE) criterion, which are then used to construct face images for the animation. In our experiments, facial animations are synthesized for 20 continuous audio speech sentences, using the proposed AF_AVDBN model, as well as the state-of-art methods, being the audio visual state synchronous DBN model (SS_DBN) implementing a multi-stream Hidden Markov Model, and the state asynchronous DBN model (SA_DBN). Objective evaluations on the learned AAM features show that much more accurate visual features can be learned from the AF_AVDBN model. Subjective evaluations show that the synthesized facial animations using AF_AVDBN are better than those using the state based SA_DBN and SS_DBN models, in the overall naturalness and matching accuracy of the mouth movements to the speech content. 相似文献

18.

A deep bidirectional LSTM approach for video-realistic talking head

Bo Fan Lei Xie Shan Yang Lijuan Wang Frank K. Soong 《Multimedia Tools and Applications》2016,75(9):5287-5309

This paper proposes a deep bidirectional long short-term memory approach in modeling the long contextual, nonlinear mapping between audio and visual streams for video-realistic talking head. In training stage, an audio-visual stereo database is firstly recorded as a subject talking to a camera. The audio streams are converted into acoustic feature, i.e. Mel-Frequency Cepstrum Coefficients (MFCCs), and their textual labels are also extracted. The visual streams, in particular, the lower face region, are compactly represented by active appearance model (AAM) parameters by which the shape and texture variations can be jointly modeled. Given pairs of the audio and visual parameter sequence, a DBLSTM model is trained to learn the sequence mapping from audio to visual space. For any unseen speech audio, whether it is original recorded or synthesized by text-to-speech (TTS), the trained DBLSTM model can predict a convincing AAM parameter trajectory for the lower face animation. To further improve the realism of the proposed talking head, the trajectory tiling method is adopted to use the DBLSTM predicted AAM trajectory as a guide to select a smooth real sample image sequence from the recorded database. We then stitch the selected lower face image sequence back to a background face video of the same subject, resulting in a video-realistic talking head. Experimental results show that the proposed DBLSTM approach outperforms the existing HMM-based approach in both objective and subjective evaluations. 相似文献

19.

基于一致性图卷积模型的多模态对话情绪识别

谭晓聪郭军军线岩团相艳《计算机应用研究》2023,40(10):3100-3106

多模态对话情绪识别是一项根据对话中话语的文本、语音、图像模态预测其情绪类别的任务。针对现有研究主要关注话语上下文的多模态特征提取和融合,而没有充分考虑每个说话人情绪特征利用的问题,提出一种基于一致性图卷积网络的多模态对话情绪识别模型。该模型首先构建了多模态特征学习和融合的图卷积网络,获得每条话语的上下文特征;在此基础上,以说话人在完整对话中的平均特征为一致性约束,使模型学习到更合理的话语特征,从而提高预测情绪类别的性能。在两个基准数据集IEMOCAP和MELD上与其他基线模型进行了比较,结果表明所提模型优于其他模型。此外,还通过消融实验验证了一致性约束和模型其他组成部分的有效性。相似文献

20.

Multimodal information fusion application to human emotion recognition from face and speech 总被引：1，自引：1，他引：0

Muharram Mansoorizadeh Nasrollah Moghaddam Charkari 《Multimedia Tools and Applications》2010,49(2):277-297

A multimedia content is composed of several streams that carry information in audio, video or textual channels. Classification and clustering multimedia contents require extraction and combination of information from these streams. The streams constituting a multimedia content are naturally different in terms of scale, dynamics and temporal patterns. These differences make combining the information sources using classic combination techniques difficult. We propose an asynchronous feature level fusion approach that creates a unified hybrid feature space out of the individual signal measurements. The target space can be used for clustering or classification of the multimedia content. As a representative application, we used the proposed approach to recognize basic affective states from speech prosody and facial expressions. Experimental results over two audiovisual emotion databases with 42 and 12 subjects revealed that the performance of the proposed system is significantly higher than the unimodal face based and speech based systems, as well as synchronous feature level and decision level fusion approaches. 相似文献