Similar Literature
A total of 10 similar documents were found.
1.
This paper aims to extract baseball game highlights based on audio-motion integrated cues. To better describe the different audio and motion characteristics of baseball game highlights, we propose a novel representation method based on likelihood models. The proposed likelihood models measure the "likeliness" of low-level audio features and motion features with respect to a set of predefined audio types and motion categories, respectively. Our experiments show that the proposed likelihood representation is more robust for highlight extraction than low-level audio/motion features. With the proposed likelihood models, we then construct an integrated feature representation by symmetrically fusing the audio and motion likelihood models. Finally, we employ a hidden Markov model (HMM) to model and detect the transitions of the integrated representation for highlight segments. A series of experiments conducted on a 12-h video database demonstrates the effectiveness of the proposed method and shows that the framework achieves promising results over a variety of baseball game sequences.
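Below is a minimal Python sketch of the likelihood-representation-plus-HMM idea, not the authors' implementation: low-level audio and motion features are mapped to log-likelihoods under illustrative per-class Gaussians (stand-ins for the paper's predefined audio types and motion categories), the two likelihood vectors are fused by concatenation, and a hand-set two-state HMM (normal vs. highlight) is decoded with Viterbi. All model parameters and data here are synthetic.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Illustrative per-class Gaussians standing in for the paper's predefined
# audio types (e.g. cheering, commentary, silence) and motion categories.
audio_models = [multivariate_normal(mean=rng.normal(size=4), cov=np.eye(4))
                for _ in range(3)]
motion_models = [multivariate_normal(mean=rng.normal(size=4), cov=np.eye(4))
                 for _ in range(2)]

def likelihood_features(audio_feats, motion_feats):
    """Map low-level features to per-class log-likelihoods and fuse them."""
    la = np.stack([m.logpdf(audio_feats) for m in audio_models], axis=1)
    lm = np.stack([m.logpdf(motion_feats) for m in motion_models], axis=1)
    return np.concatenate([la, lm], axis=1)   # (T, 5) integrated representation

def viterbi(obs_loglik, log_trans, log_prior):
    """Standard Viterbi decoding over per-frame state log-likelihoods."""
    T, S = obs_loglik.shape
    delta = log_prior + obs_loglik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: from i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + obs_loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Toy usage: synthetic frames, hand-set emission/transition parameters.
T = 200
X = likelihood_features(rng.normal(size=(T, 4)), rng.normal(size=(T, 4)))
state_models = [multivariate_normal(mean=np.full(5, c), cov=4 * np.eye(5))
                for c in (-6.0, -4.0)]        # normal vs. highlight emissions
obs = np.stack([m.logpdf(X) for m in state_models], axis=1)
states = viterbi(obs, np.log([[0.95, 0.05], [0.10, 0.90]]),
                 np.log([0.9, 0.1]))          # 1-runs mark highlight segments
```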

2.
3.
4.
In heterogeneous networks, different modalities coexist. For example, video sources of a certain length usually contain abundant time-varying audiovisual data. From the users' perspective, different video segments trigger different emotions. To better interact with users in heterogeneous networks and improve their experience, affective video content analysis that predicts users' emotions is essential. Academically, users' emotions can be evaluated by arousal and valence values and by fear degree, which provides a way to quantify how accurately the reactions of audiences and users to videos are predicted. In this paper, we propose a multimodal data fusion method that integrates visual and audio data to perform affective video content analysis. Specifically, to align the visual and audio data, temporal attention filters are proposed to obtain time-span features of entire video segments. Then, using a two-branch network structure, the matched visual and audio features are integrated in a common space. Finally, the fused audiovisual feature is employed for the regression and classification subtasks that measure the emotional responses of users. Simulation results show that the proposed method can accurately predict the subjective feelings of users towards video content, which provides a way to predict users' preferences and recommend videos according to their demands.
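A minimal PyTorch sketch of the described pipeline follows; it is illustrative, not the paper's exact architecture. Temporal attention pools frame-level visual and audio features over the whole segment, two branches project them into a common space, and the fused vector feeds a regression head (arousal/valence) and a classification head (fear). All dimensions and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Learn per-frame weights and pool a (B, T, D) sequence to (B, D)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (B, T, D)
        w = torch.softmax(self.score(x), dim=1)  # (B, T, 1) attention weights
        return (w * x).sum(dim=1)                # time-span feature

class AffectiveFusionNet(nn.Module):
    """Two-branch fusion: visual and audio projected into a common space."""
    def __init__(self, vis_dim=2048, aud_dim=128, common=256):
        super().__init__()
        self.att_v = TemporalAttention(vis_dim)
        self.att_a = TemporalAttention(aud_dim)
        self.proj_v = nn.Linear(vis_dim, common)  # visual branch
        self.proj_a = nn.Linear(aud_dim, common)  # audio branch
        self.reg = nn.Linear(2 * common, 2)       # arousal, valence
        self.cls = nn.Linear(2 * common, 2)       # fear vs. no fear

    def forward(self, vis, aud):                  # (B, Tv, 2048), (B, Ta, 128)
        v = torch.relu(self.proj_v(self.att_v(vis)))
        a = torch.relu(self.proj_a(self.att_a(aud)))
        z = torch.cat([v, a], dim=1)              # fused audiovisual feature
        return self.reg(z), self.cls(z)

# Toy usage with random segment features of different temporal lengths.
net = AffectiveFusionNet()
va, fear_logits = net(torch.randn(4, 30, 2048), torch.randn(4, 100, 128))
```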

5.
While most existing sports video research focuses on detecting events in soccer, baseball, and similar sports, little work has addressed flexible content summarization of racquet sports video, e.g., tennis and table tennis. By taking advantage of the periodicity of video shot content and of audio keywords in racquet sports video, we propose a novel flexible video content summarization framework. Our approach combines a structure event detection method with a highlight ranking algorithm. First, unsupervised shot clustering and supervised audio classification are performed to obtain visual and audio mid-level patterns, respectively. Then, a temporal voting scheme for structure event detection is proposed that exploits the correspondence between audio and video content. Finally, using the affective features extracted from the detected events, a linear highlight model ranks the detected events by their exciting degrees. Experimental results show that the proposed approach is effective.
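The two stages lend themselves to a compact sketch, shown below under assumed features and weights: shots are clustered without supervision (here, k-means as a stand-in), and detected events are ranked by a linear highlight model over affective features; the feature names and weights are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Unsupervised shot clustering exposes the periodic mid-level patterns
# (e.g. serve / rally / close-up views) of a racquet sports video.
shot_feats = rng.random((120, 16))                 # per-shot visual descriptors
patterns = KMeans(n_clusters=4, n_init=10).fit_predict(shot_feats)

def rank_highlights(event_feats, w=np.array([0.5, 0.3, 0.2])):
    """Linear highlight model: exciting degree = w . affective features."""
    degrees = event_feats @ w
    return np.argsort(-degrees), degrees

# Assumed affective features per event: [audience sound, rally length, motion].
event_feats = rng.random((10, 3))
order, degrees = rank_highlights(event_feats)      # most exciting events first
```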

6.
In this paper, a subspace-based multimedia data mining framework is proposed for video semantic analysis, specifically video event/concept detection, by addressing two basic issues: the semantic gap and rare event/concept detection. The proposed framework achieves full automation via multimodal content analysis and intelligent integration of distance-based and rule-based data mining techniques. The content analysis process facilitates comprehensive video analysis by extracting low-level and mid-level features from the audio/visual channels. The integrated data mining techniques address the two basic issues by alleviating class imbalance throughout the process and by reconstructing and refining the feature dimensions automatically. Promising experimental performance on goal/corner event detection and sports/commercial/building concept extraction from soccer videos and TRECVID news collections demonstrates the effectiveness of the proposed framework. Furthermore, its domain-free characteristic indicates great potential for extending the proposed multimedia data mining framework to a wide range of application domains.
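An illustrative sketch of the two ideas the abstract highlights, under stated assumptions: PCA stands in for the paper's subspace reconstruction/refinement step, and the rare class is rebalanced by naive random oversampling before a distance-based (k-NN) classifier is trained. The data are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))            # low/mid-level audio-visual features
y = (rng.random(1000) < 0.03).astype(int)  # rare event: ~3% positive shots

# Feature reconstruction/refinement (PCA as an assumed subspace stand-in).
Z = PCA(n_components=16).fit_transform(X)

# Naive rebalancing: oversample the rare class to match the majority.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
boost = rng.choice(pos, size=len(neg), replace=True)
Zb = np.vstack([Z[neg], Z[boost]])
yb = np.r_[y[neg], y[boost]]

clf = KNeighborsClassifier(n_neighbors=5).fit(Zb, yb)  # distance-based miner
pred = clf.predict(Z)                                  # event/concept labels
```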

7.
8.
With the advance of digital video recording and playback systems, the demand for efficiently managing recorded TV programs is evident, so that users can readily locate and browse their favorite programs. In this paper, we propose a multimodal scheme to segment and represent TV video streams. The scheme aims to recover the temporal and structural characteristics of TV programs from visual, auditory, and textual information. In terms of visual cues, we develop a novel concept named program-oriented informative images (POIM) to identify candidate points correlated with the boundaries of individual programs. For audio cues, a multiscale Kullback-Leibler (K-L) distance is proposed to locate audio scene changes (ASC), which are then aligned with video scene changes to mark candidate program boundaries. In addition, latent semantic analysis (LSA) is adopted to calculate the textual content similarity (TCS) between shots, modeling intra-program similarity and inter-program dissimilarity in terms of speech content. Finally, we fuse the multimodal features of POIM, ASC, and TCS to detect the boundaries of programs, including individual commercials (spots). Toward effective program guides and attractive content browsing, we propose a multimodal summary representation of individual programs using POIM images, key frames, and textual keywords. Extensive experiments carried out on the open TRECVID 2005 benchmark corpus achieve promising results. Compared with the electronic program guide (EPG), our solution provides a more generic approach to determining the exact boundaries of diverse TV programs, even including dramatic spots.
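Two of the cues can be sketched compactly; the formulas below are standard stand-ins, not necessarily the authors' exact definitions. The ASC score uses the symmetric K-L distance between Gaussians fitted to adjacent audio windows at several scales, and the TCS uses LSA (truncated SVD over tf-idf) with cosine similarity; window sizes, ranks, and the toy transcripts are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def kl_gauss(x, y, eps=1e-6):
    """Symmetric K-L distance between 1-D Gaussians fitted to x and y."""
    m1, v1 = x.mean(), x.var() + eps
    m2, v2 = y.mean(), y.var() + eps
    kl = lambda ma, va, mb, vb: 0.5 * (va / vb + (mb - ma) ** 2 / vb
                                       + np.log(vb / va) - 1)
    return kl(m1, v1, m2, v2) + kl(m2, v2, m1, v1)

def asc_score(audio, t, scales=(50, 100, 200)):
    """Multiscale audio scene change score at sample t (mean K-L over scales)."""
    return float(np.mean([kl_gauss(audio[t - s:t], audio[t:t + s])
                          for s in scales]))

rng = np.random.default_rng(0)
change = asc_score(rng.normal(size=2000), 1000)   # peaks mark candidate ASCs

# Textual content similarity (TCS) between shots via LSA over toy transcripts.
transcripts = ["weather forecast sunny", "sports goal replay",
               "sports match highlights", "news headline politics"]
tfidf = TfidfVectorizer().fit_transform(transcripts)
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)
tcs = cosine_similarity(lsa)   # high within a program, low across programs
```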

9.
Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME’s performance in the 2012 TRECVID MED evaluation was one of the best reported.
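The fusion finding reduces to very little code. In the sketch below, with synthetic scores assumed to be calibrated to a comparable range, rows are clips and columns are the single-type classifiers (visual, motion, audio, concepts, ASR), and the arithmetic mean is the simple late fusion the abstract reports as competitive.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((100, 5))     # 100 clips x 5 single-type classifier scores
fused = scores.mean(axis=1)       # arithmetic-mean late fusion
ranking = np.argsort(-fused)      # clips ranked by fused event score
```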

10.
Detection of violent scenes in video based on audio and visual features   (Total citations: 2; self-citations: 0; citations by others: 2)
This paper presents a method for detecting violent scenes in video. The method analyzes video by combining audio and image features, improving detection accuracy. It can be used to index high-level semantic features of video, thereby supporting content-based video retrieval.
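A minimal sketch of the paper's idea in modern terms, with everything synthetic: per-shot audio and image descriptors are concatenated, and a logistic-regression classifier (an assumed stand-in; the abstract does not name a specific model) scores shots for violence so they can be indexed for content-based retrieval.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
audio = rng.normal(size=(400, 8))       # per-shot audio descriptors
image = rng.normal(size=(400, 12))      # per-shot image descriptors
X = np.hstack([audio, image])           # audio-visual fusion by concatenation
y = rng.integers(0, 2, size=400)        # 1 = violent shot (synthetic labels)

clf = LogisticRegression(max_iter=1000).fit(X, y)
violence_score = clf.predict_proba(X)[:, 1]   # index values for retrieval
```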
