Found 20 similar documents; search took 31 ms.
1.
Fusion of audio and motion information on HMM-based highlight extraction for baseball games. Total citations: 2 (self-citations: 0, citations by others: 2)
IEEE Transactions on Multimedia, 2006, 8(3): 585-599
This paper aims to extract baseball game highlights based on audio-motion integrated cues. To better describe the different audio and motion characteristics of baseball game highlights, we propose a novel representation method based on likelihood models. The proposed likelihood models measure the "likeliness" of low-level audio and motion features with respect to a set of predefined audio types and motion categories, respectively. Our experiments show that the proposed likelihood representation is more robust for highlight extraction than low-level audio/motion features. With the proposed likelihood models, we then construct an integrated feature representation by symmetrically fusing the audio and motion likelihood models. Finally, we employ a hidden Markov model (HMM) to model and detect the transitions of the integrated representation for highlight segments. A series of experiments conducted on a 12-hour video database demonstrates the effectiveness of the proposed method and shows that the framework achieves promising results over a variety of baseball game sequences.
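The core idea of entry 1 can be illustrated with a toy example. The sketch below is not the paper's actual model: it uses a 2-state discrete HMM with hand-picked, purely illustrative transition and emission values, and a filtered-posterior (forward-algorithm) decision in place of the authors' trained fused-likelihood features.

```python
# Minimal sketch (not the paper's model): a 2-state discrete HMM scoring
# whether each time step of a game belongs to a "highlight" segment.
# All probability values below are illustrative assumptions.

def forward(obs, start, trans, emit):
    """Forward algorithm; returns per-step normalized state beliefs."""
    n_states = len(start)
    beliefs = []
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    norm = sum(alpha)
    alpha = [a / norm for a in alpha]
    beliefs.append(alpha)
    for o in obs[1:]:
        alpha = [
            emit[s][o] * sum(alpha[p] * trans[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
        norm = sum(alpha)
        alpha = [a / norm for a in alpha]
        beliefs.append(alpha)
    return beliefs

# States: 0 = non-highlight, 1 = highlight.
# Observations: 0 = commentary, 1 = crowd cheering, 2 = silence.
START = [0.9, 0.1]
TRANS = [[0.95, 0.05], [0.20, 0.80]]         # highlight segments are "sticky"
EMIT  = [[0.6, 0.1, 0.3], [0.2, 0.7, 0.1]]   # cheering suggests a highlight

obs = [0, 0, 1, 1, 1, 0, 2]
beliefs = forward(obs, START, TRANS, EMIT)
labels = [int(b[1] > 0.5) for b in beliefs]  # 1 where highlight is more likely
```

A run of cheering observations pulls the belief toward the highlight state after a step or two, which is the smoothing behaviour the paper relies on the HMM to provide.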
2.
3.
4.
Computer Vision and Image Understanding, 2009, 113(3): 415-424
While most existing sports video research focuses on detecting events in sports such as soccer and baseball, little work has addressed flexible content summarization for racquet sports video, e.g., tennis and table tennis. By taking advantage of the periodicity of video shot content and of audio keywords in racquet sports video, we propose a novel flexible video content summarization framework. Our approach combines a structure event detection method with a highlight ranking algorithm. First, unsupervised shot clustering and supervised audio classification are performed to obtain visual and audio mid-level patterns, respectively. Then, a temporal voting scheme for structure event detection is proposed that utilizes the correspondence between audio and video content. Finally, using affective features extracted from the detected events, a linear highlight model ranks the detected events by their degree of excitement. Experimental results show that the proposed approach is effective.
5.
In heterogeneous networks, different modalities coexist. For example, video sources of a given length usually carry abundant time-varying audiovisual data. From the users' perspective, different video segments trigger different kinds of emotions. To interact better with users in heterogeneous networks and improve their experience, affective video content analysis that predicts users' emotions is essential. Academically, users' emotions can be evaluated by arousal and valence values and by fear degree, which provides a way to quantify how accurately the reactions of audiences and users towards videos are predicted. In this paper, we propose a multimodal data fusion method that integrates visual and audio data to perform affective video content analysis. Specifically, to align the visual and audio data, temporal attention filters are proposed to obtain time-span features of entire video segments. Then, using a two-branch network structure, matched visual and audio features are integrated in a common space. Finally, the fused audiovisual feature is used for regression and classification subtasks that measure the emotional responses of users. Simulation results show that the proposed method can accurately predict the subjective feelings of users towards video content, which provides a way to predict users' preferences and recommend videos according to their individual demands.
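The "temporal attention filter" in entry 5 can be pictured as a learned weighting over frames that pools a variable-length clip into one fixed-size feature. The sketch below assumes a normalized Gaussian form for the filter (a common choice, not necessarily the paper's exact parameterization); the centre and width values are illustrative.

```python
import math

# Hedged sketch of a temporal attention filter: a normalized Gaussian over
# relative frame position pools per-frame features into one vector.
# centre/width would normally be learned; here they are fixed assumptions.

def gaussian_attention_pool(frames, centre, width):
    """frames: list of equal-length feature vectors; centre, width in [0, 1]."""
    n = len(frames)
    weights = [math.exp(-0.5 * ((t / (n - 1) - centre) / width) ** 2)
               for t in range(n)]
    total = sum(weights)
    weights = [w / total for w in weights]          # weights sum to 1
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]

# Three toy 2-D frame features; a mid-clip filter emphasizes the middle frame.
pooled = gaussian_attention_pool(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], centre=0.5, width=0.2)
```

Because the weights are normalized, a clip of any length maps to a vector of the frame-feature dimension, which is what lets the two-branch network compare clips of different durations in a common space.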
6.
Video Semantic Event/Concept Detection Using a Subspace-Based Multimedia Data Mining Framework. Total citations: 3 (self-citations: 0, citations by others: 3)
Mei-Ling Shyu, Zongxing Xie, Min Chen, Shu-Ching Chen. IEEE Transactions on Multimedia, 2008, 10(2): 252-259
In this paper, a subspace-based multimedia data mining framework is proposed for video semantic analysis, specifically video event/concept detection, by addressing two basic issues, i.e., semantic gap and rare event/concept detection. The proposed framework achieves full automation via multimodal content analysis and intelligent integration of distance-based and rule-based data mining techniques. The content analysis process facilitates the comprehensive video analysis by extracting low-level and middle-level features from audio/visual channels. The integrated data mining techniques effectively address these two basic issues by alleviating the class imbalance issue along the process and by reconstructing and refining the feature dimension automatically. The promising experimental performance on goal/corner event detection and sports/commercials/building concepts extraction from soccer videos and TRECVID news collections demonstrates the effectiveness of the proposed framework. Furthermore, its unique domain-free characteristic indicates the great potential of extending the proposed multimedia data mining framework to a wide range of different application domains.
7.
8.
Jinqiao Wang, Lingyu Duan, Qingshan Liu, Hanqing Lu, J. S. Jin. IEEE Transactions on Multimedia, 2008, 10(3): 393-408
With the advance of digital video recording and playback systems, the request for efficiently managing recorded TV video programs is evident so that users can readily locate and browse their favorite programs. In this paper, we propose a multimodal scheme to segment and represent TV video streams. The scheme aims to recover the temporal and structural characteristics of TV programs with visual, auditory, and textual information. In terms of visual cues, we develop a novel concept named program-oriented informative images (POIM) to identify the candidate points correlated with the boundaries of individual programs. For audio cues, a multiscale Kullback-Leibler (K-L) distance is proposed to locate audio scene changes (ASC), and accordingly ASC is aligned with video scene changes to represent candidate boundaries of programs. In addition, latent semantic analysis (LSA) is adopted to calculate the textual content similarity (TCS) between shots to model the inter-program similarity and intra-program dissimilarity in terms of speech content. Finally, we fuse the multimodal features of POIM, ASC, and TCS to detect the boundaries of programs including individual commercials (spots). Towards effective program guide and attracting content browsing, we propose a multimodal representation of individual programs by using POIM images, key frames, and textual keywords in a summarization manner. Extensive experiments are carried out over an open benchmarking dataset TRECVID 2005 corpus and promising results have been achieved. Compared with the electronic program guide (EPG), our solution provides a more generic approach to determine the exact boundaries of diverse TV programs even including dramatic spots.
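The K-L-based audio scene change idea in entry 8 can be sketched concretely: model the audio feature on either side of a candidate boundary as a 1-D Gaussian and score the boundary by their symmetric K-L distance. This is a simplification of the paper's method; the multiscale aspect is approximated here by averaging over two window sizes, and the toy "energy" track and window sizes are illustrative assumptions.

```python
import math

# Hedged sketch: symmetric K-L distance between Gaussians fitted to the
# audio feature left and right of a candidate boundary, averaged over scales.

def gaussian_kl(m1, v1, m2, v2):
    """KL(N(m1, v1) || N(m2, v2)) for univariate Gaussians (v = variance)."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def stats(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs) + 1e-9  # avoid zero variance
    return m, v

def change_score(signal, t, scales=(4, 8)):
    """Average symmetric K-L distance across window scales around index t."""
    total = 0.0
    for w in scales:
        m1, v1 = stats(signal[t - w:t])
        m2, v2 = stats(signal[t:t + w])
        total += gaussian_kl(m1, v1, m2, v2) + gaussian_kl(m2, v2, m1, v1)
    return total / len(scales)

# Toy energy track: quiet studio audio, then loud audio starting at t = 16.
energy = [0.1, 0.12, 0.09, 0.11] * 4 + [0.9, 0.95, 0.88, 0.92] * 4
```

`change_score(energy, 16)` is large at the true change point and near zero inside a homogeneous region, which is the contrast the detector thresholds on.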
9.
Gregory K. Myers, Ramesh Nallapati, Julien van Hout, Stephanie Pancoast, Ramakant Nevatia, Chen Sun, Amirhossein Habibian, Dennis C. Koelma, Koen E. A. van de Sande, Arnold W. M. Smeulders, Cees G. M. Snoek. Machine Vision and Applications, 2014, 25(1): 17-32
Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME’s performance in the 2012 TRECVID MED evaluation was one of the best reported.
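The fusion finding in entry 9, that an arithmetic mean of per-classifier scores competes with more complex schemes, is simple to state in code. The sketch assumes the scores have already been calibrated to a common [0, 1] range; the classifier names and score values are illustrative.

```python
# Hedged sketch of late fusion by arithmetic mean over classifier scores.
# Scores are assumed pre-calibrated to [0, 1]; values are illustrative.

def fuse_mean(score_lists):
    """score_lists: one list of per-video scores per classifier."""
    n_videos = len(score_lists[0])
    return [sum(scores[i] for scores in score_lists) / len(score_lists)
            for i in range(n_videos)]

visual = [0.9, 0.2, 0.6]   # low-level visual classifier
motion = [0.7, 0.1, 0.5]   # motion-feature classifier
speech = [0.8, 0.3, 0.1]   # ASR-based classifier
fused = fuse_mean([visual, motion, speech])   # [0.8, 0.2, 0.4]
```

A video ranked highly by several modalities stays highly ranked after fusion, while a score spike from a single unreliable modality is damped, which is one intuition for why the simple mean is hard to beat.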
10.
Detection of violent scenes in video based on audio and visual features. Total citations: 2 (self-citations: 0, citations by others: 2)
This paper presents a method for detecting violent scenes in video. The method analyzes video by combining audio and visual features, which improves detection accuracy. It can be used to build indexes over high-level semantic features of video, thereby supporting content-based video retrieval.
11.
Integrated Mining of Visual Features, Speech Features, and Frequent Patterns for Semantic Video Annotation. Total citations: 2 (self-citations: 0, citations by others: 2)
Tseng V.S., Ja-Hwung Su, Jhih-Hong Huang, Chih-Jen Chen. IEEE Transactions on Multimedia, 2008, 10(2): 260-267
To support effective multimedia information retrieval, video annotation has become an important topic in video content analysis. Existing video annotation methods focus either on the analysis of low-level features or on simple semantic concepts, and they cannot reduce the gap between low-level features and high-level concepts. In this paper, we propose an innovative method for semantic video annotation through integrated mining of visual features, speech features, and frequent semantic patterns existing in the video. The proposed method consists of two main phases: 1) construction of four kinds of predictive annotation models, namely speech-association, visual-association, visual-sequential, and statistical models, from annotated videos; 2) fusion of these models for annotating un-annotated videos automatically. The main advantage of the proposed method lies in considering all visual features, speech features, and semantic patterns simultaneously. Moreover, the utilization of high-level rules can effectively complement the insufficiency of statistics-based methods in dealing with complex and broad keyword identification in video annotation. Through empirical evaluation on NIST TRECVID video datasets, the proposed approach is shown to enhance annotation performance substantially in terms of precision, recall, and F-measure.
12.
Graph-based multilevel temporal video segmentation. Total citations: 1 (self-citations: 0, citations by others: 1)
This paper presents a graph-based multilevel temporal video segmentation method. At each level of the segmentation, a weighted undirected graph structure is implemented. The graph is partitioned into clusters which represent the segments of a video. Three low-level features are used to calculate the similarity between temporal segments: visual content, motion content, and shot duration. Our strength factor approach further improves the efficiency of the proposed method. Experiments show that the proposed video scene detection method gives promising results for organizing videos without human intervention.
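The graph idea in entry 12 can be sketched in miniature: shots are nodes, edge weights combine visual, motion, and duration similarity, and segments emerge when weak edges are cut. The sketch below simplifies the partitioning to cutting weak edges between consecutive shots only; the feature tuples and the similarity threshold are illustrative assumptions, not the paper's parameters.

```python
# Hedged sketch: temporal segmentation by cutting weak similarity edges
# between consecutive shots. All feature values/thresholds are illustrative.

def similarity(a, b):
    """a, b: (visual, motion, duration) feature tuples with values in [0, 1]."""
    return 1.0 - (abs(a[0] - b[0]) + abs(a[1] - b[1]) + abs(a[2] - b[2])) / 3.0

def segment(shots, threshold=0.85):
    """Start a new segment wherever consecutive-shot similarity drops."""
    segments = [[0]]
    for i in range(1, len(shots)):
        if similarity(shots[i - 1], shots[i]) >= threshold:
            segments[-1].append(i)
        else:
            segments.append([i])
    return segments

shots = [(0.90, 0.80, 0.5), (0.88, 0.82, 0.5),   # similar pair -> one segment
         (0.20, 0.10, 0.9), (0.22, 0.12, 0.9)]   # abrupt change -> new segment
```

Running `segment(shots)` splits the four shots into two two-shot segments at the abrupt feature change, which is the behaviour a full graph partitioning generalizes to non-adjacent shots.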
13.
M. Billinghurst, J. Bowskill, M. Jessop, J. Morphett. Personal and Ubiquitous Computing, 1999, 3(1-2): 72-80
Wearable computers provide constant access to computing and communications resources; however, there are many unanswered questions as to how this computing power can be used to enhance communication. We describe a wearable augmented reality communication space that uses spatialised 3D graphics and audio cues to aid communication. The user is surrounded by virtual avatars of the remote collaborators that they can interact with using natural head and body motions. The use of spatial cues means that the conferencing space can potentially support dozens of simultaneous users. We report on two experiments that show users can understand speakers better with spatial rather than non-spatial audio, and that minimal visual cues may be sufficient to distinguish between speakers. Additional informal user studies with real conference participants suggest that wearable communication spaces may offer significant advantages over traditional communication devices.
14.
15.
Frank Hopfgartner, Thierry Urruty, Pablo Bermejo Lopez, Robert Villa, Joemon M. Jose. Multimedia Tools and Applications, 2010, 47(3): 631-662
In this paper we explore the limitations of facet-based browsing, which uses sub-needs of an information need for querying and organising the search process in video retrieval. The underlying assumption of this approach is that search effectiveness will be enhanced if it is employed for interactive video retrieval using textual and visual features. We explore the performance bounds of a faceted system by carrying out a simulated user evaluation on TRECVid data sets, and also on the logs of a prior user experiment with the system. We first present a methodology to reduce the dimensionality of features by selecting the most important ones. Then, we discuss the simulated evaluation strategies employed in our evaluation and the effect of using both textual and visual features. Facets created by users are simulated by clustering video shots using textual and visual features. The experimental results of our study demonstrate that the faceted browser can potentially improve search effectiveness.
16.
17.
18.
19.
P. Garrido, L. Valgaerts, H. Sarmadi, I. Steiner, K. Varanasi, P. Pérez, C. Theobalt. Computer Graphics Forum, 2015, 34(2): 193-204
In many countries, foreign movies and TV productions are dubbed, i.e., the original voice of an actor is replaced with a translation that is spoken by a dubbing actor in the country's own language. Dubbing is a complex process that requires specific translations and accurately timed recitations such that the new audio at least coarsely adheres to the mouth motion in the video. However, since the sequence of phonemes and visemes in the original and the dubbing language are different, the video-to-audio match is never perfect, which is a major source of visual discomfort. In this paper, we propose a system to alter the mouth motion of an actor in a video, so that it matches the new audio track. Our paper builds on high-quality monocular capture of 3D facial performance, lighting and albedo of the dubbing and target actors, and uses audio analysis in combination with a space-time retrieval method to synthesize a new photo-realistically rendered and highly detailed 3D shape model of the mouth region to replace the target performance. We demonstrate plausible visual quality of our results compared to footage that has been professionally dubbed in the traditional way, both qualitatively and through a user study.
20.
In the past few years, modeling and querying video databases have been a subject of extensive research aimed at developing tools for effective video search. In this paper, we present a hierarchical approach to model videos at three levels: object level (OL), frame level (FL), and shot level (SL). The model captures the visual features of individual objects at OL, visual-spatio-temporal (VST) relationships between objects at FL, and time-varying visual features and time-varying VST relationships at SL. We call the combination of the time-varying visual features and the time-varying VST relationships a content trajectory, which is used to represent and index a shot. A novel query interface is presented that allows users to describe the time-varying content of complex video shots, such as those of skiers or soccer players, by sketch and feature specification. Our experimental results prove the effectiveness of modeling and querying shots using the content trajectory approach.