1.

LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework

Martin Wöllmer Moritz Kaiser Florian Eyben Björn Schuller Gerhard Rigoll 《Image and Vision Computing》2013

Automatically recognizing human emotions from spontaneous and non-prototypical real-life data is currently one of the most challenging tasks in the field of affective computing. This article presents our recent advances in assessing dimensional representations of emotion, such as arousal, expectation, power, and valence, in an audiovisual human–computer interaction scenario. Building on previous studies which demonstrate that long-range context modeling tends to increase accuracies of emotion recognition, we propose a fully automatic audiovisual recognition approach based on Long Short-Term Memory (LSTM) modeling of word-level audio and video features. LSTM networks are able to incorporate knowledge about how emotions typically evolve over time so that the inferred emotion estimates are produced under consideration of an optimal amount of context. Extensive evaluations on the Audiovisual Sub-Challenge of the 2011 Audio/Visual Emotion Challenge show how acoustic, linguistic, and visual features contribute to the recognition of different affective dimensions as annotated in the SEMAINE database. We apply the same acoustic features as used in the challenge baseline system whereas visual features are computed via a novel facial movement feature extractor. Comparing our results with the recognition scores of all Audiovisual Sub-Challenge participants, we find that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far.

2.

Audio-based description and structuring of videos

Hadi Harb Liming Chen 《International Journal on Digital Libraries》2006,6(1):70-81

3.

An automatic video classification algorithm based on one-against-one support vector machines

覃丹 蒋兴浩 孙锬锋 陈斌 《计算机应用与软件》2010,27(1):3-5

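By analyzing the visual characteristics of five typical categories of video, seven features that best reveal the differences among these categories are extracted, and an automatic video content classification algorithm based on one-against-one support vector machines (1-1 SVM) is designed. The algorithm addresses the preliminary screening of video content in the management, on-demand delivery, and retrieval of online video media. Simulation experiments on a large number of real video clips demonstrate the algorithm's advantages in discriminative power and accuracy.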
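The one-against-one scheme named in the abstract above trains one binary classifier per pair of classes (k(k-1)/2 of them, so 10 for five categories) and lets them vote. The following is a minimal sketch of the voting step only, not the authors' implementation: the category names, the 1-D "feature", and the toy pairwise deciders standing in for trained binary SVMs are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(sample, classes, pairwise_classifiers):
    """Return the class that wins the most pairwise contests.

    pairwise_classifiers maps each class pair (a, b) to a callable that
    returns either a or b for the given sample.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = pairwise_classifiers[(a, b)](sample)
        votes[winner] += 1
    # Ties are broken by the order of `classes` (max returns the first maximum).
    return max(classes, key=lambda c: votes[c])


# Hypothetical stand-ins for trained binary SVMs: each pairwise decider
# picks the class whose 1-D prototype is closer to the sample.
prototypes = {"news": 0.0, "sports": 5.0, "cartoon": 10.0}


def make_pairwise(a, b):
    return lambda x: a if abs(x - prototypes[a]) <= abs(x - prototypes[b]) else b


classes = list(prototypes)
pairwise = {(a, b): make_pairwise(a, b) for a, b in combinations(classes, 2)}

prediction = one_vs_one_predict(4.2, classes, pairwise)
```

With three toy classes there are three pairwise deciders; a real system would replace each decider with a binary SVM trained on the seven extracted features.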

4.

Attention-based multimodal sentiment analysis of short videos

黄欢 孙力娟 曹莹 郭剑 任恒毅 《图学学报》2021,42(1):8-14

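Existing sentiment analysis methods do not fully consider the information contained in short videos, which leads to inappropriate sentiment analysis results. The audio-visual multimodal sentiment analysis (AV-MSA) model was developed in response: it uses the visual features of video frames together with audio information to analyze the sentiment of short videos. The model has 2 branches, visual and audio; the audio branch uses a convolutional neural network (CNN) architecture to extract emotional features from audio spectrograms for sentiment analysis;...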

5.

A three-level framework for affective content analysis and its case studies

Min Xu Jinqiao Wang Xiangjian He Jesse S. Jin Suhuai Luo Hanqing Lu 《Multimedia Tools and Applications》2014,70(2):757-779

Emotional factors directly reflect audiences’ attention, evaluation and memory. Recently, video affective content analysis has attracted more and more research effort. Most of the existing methods map low-level affective features directly to emotions by applying machine learning. Compared with the human perception process, there is actually a gap between low-level features and high-level human perception of emotion. In order to bridge the gap, we propose a three-level affective content analysis framework by introducing a mid-level representation to indicate dialog, audio emotional events (e.g., horror sounds and laughter) and textual concepts (e.g., informative keywords). The mid-level representation is obtained from machine learning on low-level features and used to infer high-level affective content. We further apply the proposed framework and focus on a number of case studies. Audio emotional events, dialog and subtitles are studied to assist affective content detection in different video domains/genres. Multiple modalities are considered for affective analysis, since each modality has its own merit in evoking emotions. Experimental results show that the proposed framework is effective and efficient for affective content analysis. Audio emotional events, dialog and subtitles are promising mid-level representations.

6.

Design and implementation of a JSP-based network video and audio playback system

张庆涛 《电脑编程技巧与维护》2014,(4):64-66

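Based on the principles of streaming media transmission, a Web-based video and audio playback system is simulated on a local area network. It implements user information management, audio listening, video viewing, and file addition, deletion, modification, upload, and search. The result is a video and audio playback system that meets current user needs, provides convenient and fast video and audio on-demand programs for people in the network era, and offers more user-friendly settings.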

7.

Video data mining: semantic indexing and event detection from the association perspective (Total citations: 6; self-citations: 0; citations by others: 6)

Xingquan Zhu Xindong Wu Elmagarmid A.K. Zhe Feng Lide Wu 《Knowledge and Data Engineering, IEEE Transactions on》2005,17(5):665-677

Advances in the media and entertainment industries, including streaming audio and digital TV, present new challenges for managing and accessing large audio-visual collections. Current content management systems support retrieval using low-level features, such as motion, color, and texture. However, low-level features often have little meaning for naive users, who much prefer to identify content using high-level semantics or concepts. This creates a gap between systems and their users that must be bridged for these systems to be used effectively. To this end, in this paper, we first present a knowledge-based video indexing and content management framework for domain-specific videos (using basketball video as an example). We will provide a solution to explore video knowledge by mining associations from video data. The explicit definitions and evaluation measures (e.g., temporal support and confidence) for video associations are proposed by integrating the distinct feature of video data. Our approach uses video processing techniques to find visual and audio cues (e.g., court field, camera motion activities, and applause), introduces multilevel sequential association mining to explore associations among the audio and visual cues, classifies the associations by assigning each of them with a class label, and uses their appearances in the video to construct video indices. Our experimental results demonstrate the performance of the proposed approach.

8.

A video summarization algorithm fusing multimodal features and temporal-region detection

白晨 范涛 王文静 王国中 《计算机应用研究》2023,40(11):3276-3281+3288

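Traditional video summarization algorithms do not make full use of the multimodal information in videos and struggle to ensure the temporal consistency of summary segments. To address these problems, a video summarization algorithm fusing multimodal features and temporal-region detection (MTNet) is proposed. First, feature representations of the video images and audio are extracted with pretrained GoogLeNet and VGGish models, and a dimension smoothing operation is designed to align the features of the two modalities, giving the model a comprehensive representation capability. Second, because a generated video summary should be globally representative, single- and double-layer self-attention mechanisms combined with residual structures are used to extract long-range temporal features from the video image and audio features respectively, yielding a single vector representation over the temporal range. Finally, a separated temporal-region detection and weight-sharing method predicts the summary boundaries and importance of each temporal segment of the video, and non-maximum suppression selects key video segments to generate the summary. Experimental results on two standard datasets, SumMe and TvSum, show that MTNet has stronger representation capability and robustness; its F₁ score improves on both datasets over DSNet-AF, an anchor-free video summarization algorithm, and VASNet, a video summarization algorithm based on shot importance prediction.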
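The final step described above, selecting key segments by non-maximum suppression, can be illustrated in one dimension. This is a generic greedy temporal NMS sketch, not MTNet's code: the segment boundaries, importance scores, and the 0.5 overlap threshold are assumed values chosen for illustration.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: visit segments by descending score and keep a segment
    only if it does not overlap an already kept one by more than the
    threshold. Returns the indices of the kept segments."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical candidate segments (in seconds) with predicted importance.
segments = [(0, 10), (2, 11), (20, 30), (21, 29)]
scores = [0.9, 0.8, 0.6, 0.7]

kept = temporal_nms(segments, scores)
```

Here the two heavily overlapping pairs each collapse to their higher-scoring member, so only segments 0 and 3 survive.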

9.

A live video streaming system for intuitive human-system interaction

Kaoru Sugita Nobuhiro Nakamura Tetsushi Oka Masao Yokota 《Artificial Life and Robotics》2008,12(1-2):194-198

In this paper, we propose an intuitive live video streaming system based on virtual reality technologies for people who are far apart. This system is a kind of server-client system, and can provide remote users with virtual 3D audiovisual fields in real time via a very-high-speed network. The server captures audio and video data from its clients, compiles them into a 3D audiovisual scene at a virtual conference, and broadcasts it to the clients. At the present stage, our system captures 2 videos and creates one 3D video at a time. Our system can play 3D audiovisual contents on Windows XP systems as well as on CAVE systems. Currently, our system can play the 3D video contents at about 2.36 fps under a LAN environment. This work was presented in part at the 12th International Symposium on Artificial Life and Robotics, Oita, Japan, January 25–27, 2007.

10.

A Survey of MPEG-1 Audio, Video and Semantic Analysis Techniques

Uma Srinivasan Silvia Pfeiffer Surya Nepal Michael Lee Lifang Gu Stephen Barrass 《Multimedia Tools and Applications》2005,27(1):105-141

Digital audio & video data have become an integral part of multimedia information systems. To reduce storage and bandwidth requirements, they are commonly stored in a compressed format, such as MPEG-1. Increasing amounts of MPEG encoded audio and video documents are available online and in proprietary collections. In order to effectively utilise them, we need tools and techniques to automatically analyse, segment, and classify MPEG video content. Several techniques have been developed both in the audio and visual domain to analyse videos. This paper presents a survey of audio and visual analysis techniques on MPEG-1 encoded media that are useful in supporting a variety of video applications. Although audio and visual feature analyses have been carried out extensively, they become useful to applications only when they convey a semantic meaning of the video content. Therefore, we also present a survey of works that provide semantic analysis on MPEG-1 encoded videos.

11.

Evaluating multimedia features and fusion for example-based event detection

Gregory K. Myers Ramesh Nallapati Julien van Hout Stephanie Pancoast Ramakant Nevatia Chen Sun Amirhossein Habibian Dennis C. Koelma Koen E. A. van de Sande Arnold W. M. Smeulders Cees G. M. Snoek 《Machine Vision and Applications》2014,25(1):17-32

Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME’s performance in the 2012 TRECVID MED evaluation was one of the best reported.
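The arithmetic-mean fusion mentioned in the abstract above amounts to averaging, per video, the detection scores of the individual event classifiers. The sketch below is illustrative only and is not SESAME's code: the classifier names and score values are made up, and it assumes the scores have already been normalized to a comparable range.

```python
def mean_fusion(scores_by_classifier):
    """Average per-video detection scores across classifiers.

    scores_by_classifier maps a classifier name to a list of scores,
    one per video, with all lists in the same video order.
    """
    names = list(scores_by_classifier)
    n_videos = len(scores_by_classifier[names[0]])
    return [
        sum(scores_by_classifier[name][i] for name in names) / len(names)
        for i in range(n_videos)
    ]


# Hypothetical, already-normalized scores for three videos.
scores = {
    "visual": [0.9, 0.2, 0.5],
    "motion": [0.7, 0.4, 0.5],
    "audio": [0.5, 0.3, 0.8],
}

fused = mean_fusion(scores)
# Rank videos from most to least likely to contain the event.
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

Because the mean has no trainable parameters, it needs no held-out data for tuning, which is one plausible reason simple fusion can hold up against more complex schemes.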

12.

Interworking of heterogeneous access networks and QoS provisioning via IP multimedia core networks

《Computer Networks》2008,52(1):215-227

Moving towards packet networks, where IP will have a prominent role, is nowadays a widely accepted vision of future communications, the first instance of which has begun to materialise with the IP multimedia core network subsystem (IMS). By specification, IMS is the first implementation towards reaching converged communications which allows users to communicate with video, audio and multimedia content, via any fixed, mobile and wireless access network type, with controllable QoS. To enable IMS communications across heterogeneous networks, incorporating UMTS, WLAN and fixed IP access points, 3GPP and ETSI’s TISPAN are currently working on schemes for controlling bandwidth allocation at the service level by employing logical interfaces that carry SIP messages. This article analyzes how interconnection between such heterogeneous networks may be performed on real platforms. In this effort, special attention is paid to the way the various interconnection possibilities can affect end-to-end QoS provisioning.

13.

LGA: latent genre aware micro-video recommendation on social media

Jingwei Ma Guang Li Mingyang Zhong Xin Zhao Lei Zhu Xue Li 《Multimedia Tools and Applications》2018,77(3):2991-3008

Social media has evolved into one of the most important channels to share micro-videos nowadays. The sheer volume of micro-videos available in social networks often undermines users’ capability to choose the micro-videos that best fit their interests. Recommendation appears as a natural solution to this problem. However, existing video recommendation methods only consider the users’ historical preferences on videos, without exploring any video contents. In this paper, we develop a novel latent genre aware micro-video recommendation model to solve the problem. First, we extract user-item interaction features, and auxiliary features describing both contextual and visual contents of micro-videos. Second, these features are fed into the neural recommendation model that simultaneously learns the latent genres of micro-videos and the optimal recommendation scores. Experiments on a real-world dataset demonstrate the effectiveness and the efficiency of our proposed method compared with several state-of-the-art approaches.

14.

Recognizing Human Emotional State From Audiovisual Signals (Total citations: 1; self-citations: 0; citations by others: 1)

Yongjin Wang Ling Guan 《Multimedia, IEEE Transactions on》2008,10(4):659-668

Machine recognition of human emotional state is an important component for efficient human-computer interaction. The majority of existing works address this problem by utilizing audio signals alone, or visual information only. In this paper, we explore a systematic approach for recognition of human emotional state from audiovisual signals. The audio characteristics of emotional speech are represented by the extracted prosodic, Mel-frequency Cepstral Coefficient (MFCC), and formant frequency features. A face detection scheme based on HSV color model is used to detect the face from the background. The visual information is represented by Gabor wavelet features. We perform feature selection by using a stepwise method based on Mahalanobis distance. The selected audiovisual features are used to classify the data into their corresponding emotions. Based on a comparative study of different classification algorithms and specific characteristics of individual emotion, a novel multiclassifier scheme is proposed to boost the recognition performance. The feasibility of the proposed system is tested over a database that incorporates human subjects from different languages and cultural backgrounds. Experimental results demonstrate the effectiveness of the proposed system. The multiclassifier scheme achieves the best overall recognition rate of 82.14%.

15.

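A personalized video recommendation strategy based on user playback behavior sequences (Total citations: 4; self-citations: 0; citations by others: 4)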

王娜 何晓明 刘志强 王文君 李霞 《计算机学报》2020,43(1):123-135

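This paper addresses personalized recommendation for online video service websites and proposes a personalized recommendation strategy based on user playback behavior sequences. The strategy analyzes users' video playback behavior data with a deep neural network word-vector model, maps videos to feature vectors of equal dimension, and extracts the semantic features of the videos. The feature vectors of the videos in a user's playback history are clustered to model a user interest distribution matrix. A recommendation list is then generated by combining the user's interest preferences with the user's viewing history sequence. Offline experiments were conducted in a large-scale video service system. Compared with the random algorithm and the traditional item-based and user-based collaborative filtering strategies, the method achieves average relative improvements of 22.3%, 30.7%, and 934%, respectively, in Top-N recommendation precision for the videos users watch, and relative improvements of 52.8%, 41%, and 1065%, respectively, in recall. In further comparisons with the matrix factorization algorithm SVD++, the Bi-LSTM+Attention algorithm based on a bidirectional LSTM model and an attention mechanism, and the deep interest network DIN based on user behavior sequences, Top-N precision and recall are also clearly improved. The strategy not only achieves high precision and recall but also attempts to address the strict data requirements, data sparsity, and data noise that traditional recommendation faces on large-scale industrial datasets.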
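The Top-N precision, recall, and relative-improvement figures reported above can be made concrete with a small sketch. The abstract does not give the paper's exact metric definitions, so the commonly used ones are assumed here; the video IDs and the baseline value are hypothetical.

```python
def precision_recall_at_n(recommended, watched, n):
    """Top-N precision and recall for one user.

    recommended: ranked list of video IDs; watched: set of IDs the user
    actually watched. Precision divides hits by n, recall by len(watched).
    """
    top_n = recommended[:n]
    hits = sum(1 for video in top_n if video in watched)
    precision = hits / n if n else 0.0
    recall = hits / len(watched) if watched else 0.0
    return precision, recall


def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical recommendation list and viewing record for one user.
recommended = ["v1", "v2", "v3", "v4", "v5"]
watched = {"v2", "v5", "v9"}

precision, recall = precision_recall_at_n(recommended, watched, n=5)
gain = relative_improvement(precision, baseline=0.25)
```

A relative improvement above 100% simply means the new score is more than double the baseline, which is how figures such as those quoted in the abstract can exceed 100%.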

16.

Audiovisual integration with Segment Models for tennis video parsing (Total citations: 2; self-citations: 0; citations by others: 2)

Manolis Delakis Guillaume Gravier Patrick Gros 《Computer Vision and Image Understanding》2008,111(2):142-154

Automatic video content analysis is an emerging research subject with numerous applications to large video databases and personal video recording systems. The aim of this study is to fuse multimodal information in order to automatically parse the underlying structure of tennis broadcasts. The frame-based observation distributions of Hidden Markov Models are too strict in modeling heterogeneous audiovisual data. We propose instead the use of segmental features, of the framework of Segment Models, to overcome this limitation and extend the synchronization points to the segment boundaries. Considering each segment as a video scene, auditory and visual features collected inside the scene boundaries can thus be sampled and modeled with their native sampling rates and models. Experimental results on a corpus of 15 h of tennis video demonstrated the superior performance of Segment Models with synchronous audiovisual fusion over Hidden Markov Models. Results with asynchronous fusion, however, are less optimistic.

17.

A Web video retrieval method using hierarchical structure of Web video groups

Ryosuke Harakawa Takahiro Ogawa Miki Haseyama 《Multimedia Tools and Applications》2016,75(24):17059-17079

In this paper, we propose a Web video retrieval method that uses hierarchical structure of Web video groups. Existing retrieval systems require users to input suitable queries that identify the desired contents in order to accurately retrieve Web videos; however, the proposed method enables retrieval of the desired Web videos even if users cannot input the suitable queries. Specifically, we first select representative Web videos from a target video dataset by using link relationships between Web videos obtained via metadata “related videos” and heterogeneous video features. Furthermore, by using the representative Web videos, we construct a network whose nodes and edges respectively correspond to Web videos and links between these Web videos. Then Web video groups, i.e., Web video sets with similar topics are hierarchically extracted based on strongly connected components, edge betweenness and modularity. By exhibiting the obtained hierarchical structure of Web video groups, users can easily grasp the overview of many Web videos. Consequently, even if users cannot write suitable queries that identify the desired contents, it becomes feasible to accurately retrieve the desired Web videos by selecting Web video groups according to the hierarchical structure. Experimental results on actual Web videos verify the effectiveness of our method.

18.

A multimodal approach for extracting content descriptive metadata from lecture videos

Vidhya Balasubramanian Sooryanarayan Gobu Doraisamy Navaneeth Kumar Kanakarajan 《Journal of Intelligent Information Systems》2016,46(1):121-145

19.

Hierarchical video content description and summarization using unified semantic and visual similarity

Xingquan Zhu Jianping Fan Ahmed K. Elmagarmid Xindong Wu 《Multimedia Systems》2003,9(1):31-53

20.

Human Behavior Analysis for Highlight Ranking in Broadcast Racket Sports Video (Total citations: 3; self-citations: 0; citations by others: 3)

Guangyu Zhu Qingming Huang Changsheng Xu Liyuan Xing Wen Gao Hongxun Yao 《Multimedia, IEEE Transactions on》2007,9(6):1167-1182

The majority of existing work on sports video analysis concentrates on highlight extraction. Little work focuses on the important issue of how the extracted highlights should be organized. In this paper, we present a multimodal approach to organize the highlights extracted from racket sports video grounded on human behavior analysis using a nonlinear affective ranking model. Two research challenges of highlight ranking are addressed, namely affective feature extraction and ranking model construction. The basic principle of affective feature extraction in our work is to extract sensitive features which can stimulate users' emotions. Since users pay most attention to player behavior and audience response in racket sport highlights, we extract affective features from player behavior, including action and trajectory, and from game-specific audio keywords. We propose a novel motion analysis method to recognize the player actions. We employ support vector regression to construct the nonlinear highlight ranking model from affective features. A new subjective evaluation criterion is proposed to guide the model construction. To evaluate the performance of the proposed approaches, we have tested them on more than ten hours of broadcast tennis and badminton videos. The experimental results demonstrate that our action recognition approach significantly outperforms the existing appearance-based method. Moreover, our user study shows that the affective highlight ranking approach is effective.