Similar Literature

20 similar articles retrieved.
1.
With the exponential growth of social media, there exist huge numbers of near-duplicate web videos, ranging from simple format changes to complex mixtures of different editing effects. In addition to the abundant video content, the social Web provides rich context information associated with web videos, such as thumbnail images and time durations. At the same time, the popularity of Web 2.0 demands timely responses to user queries. To balance speed and accuracy, in this paper we combine the contextual information from time duration, number of views, and thumbnail images with content analysis derived from color and local points to achieve real-time near-duplicate elimination. The results of 24 popular queries retrieved from YouTube show that the proposed approach, integrating content and context, can achieve real-time novelty re-ranking of web videos with extremely high efficiency, where the majority of duplicates are rapidly detected and removed from the top rankings. The proposed approach runs up to 164 times faster than the effective hierarchical method it is compared against, with just a slight loss of performance.
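As a rough illustration of the context-first idea described above, the sketch below prunes candidate pairs by duration and thumbnail colour before any expensive local-point matching would run. It is only a hedged approximation, not the paper's pipeline; the field names (`duration`, `thumbnail`) and thresholds are assumptions.

```python
# Hypothetical sketch: contextual pruning (duration, thumbnail colour) before
# any expensive content matching. Field names and thresholds are assumptions.
import numpy as np

def color_histogram(image, bins=8):
    """Quantised RGB histogram of a thumbnail given as an HxWx3 uint8 array."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-9)

def is_candidate_duplicate(video_a, video_b, duration_tol=0.05, hist_thresh=0.2):
    """Cheap context checks first; only surviving pairs would reach local-point matching."""
    da, db = video_a["duration"], video_b["duration"]
    if abs(da - db) > duration_tol * max(da, db):
        return False                               # durations too different: prune early
    ha = color_histogram(video_a["thumbnail"])
    hb = color_histogram(video_b["thumbnail"])
    return np.abs(ha - hb).sum() <= hist_thresh    # L1 distance between thumbnail histograms
```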

2.
Nowadays, numerous social videos pervade the web. Social web videos are characterized by the accompanying rich contextual information that describes the content of the videos and thus greatly facilitates video search and browsing. Generally, such contextual data, for example tags, are provided at the whole-video level, without temporal indication of when they actually appear in the video, let alone spatial annotation of object-related tags in the video frames. However, many tags only describe parts of the video content. Therefore, tag localization, the process of assigning tags to the relevant video segments, frames, or even regions within frames, is gaining increasing research interest, and a benchmark dataset for the fair evaluation of tag localization algorithms is highly desirable. In this paper, we describe and release a dataset called DUT-WEBV, which contains about 4,000 videos collected from the YouTube portal by issuing 50 concepts as queries. These concepts cover a wide range of semantic aspects, including scenes like "mountain", events like "flood", objects like "cows", sites like "gas station", and activities like "handshaking", offering great challenges to the tag (i.e., concept) localization task. For each video associated with a tag, we carefully annotate the time durations during which the tag appears in the video and also label the spatial locations of objects with masks in frames for object-related tags. Besides the video itself, contextual information such as thumbnail images, titles, and YouTube categories is also provided. Together with this benchmark dataset, we present a baseline for tag localization using a multiple instance learning approach. Finally, we discuss some open research issues for tag localization in web videos.
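To make the baseline concrete, here is a hedged single-instance sketch of the multiple-instance-learning idea: every shot inherits its video's tag label for training, and per-shot scores then localise the tag. The data layout (`shots`, `tags`) and the threshold are assumptions for illustration, not the DUT-WEBV tooling.

```python
# Naive single-instance baseline for tag localisation: propagate each video's
# tag label to all of its shots, train a shot-level classifier, then keep the
# shots whose predicted probability clears a threshold. A hedged approximation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_tag_localizer(videos, tag):
    """videos: list of dicts {'shots': (n_shots, d) array, 'tags': set of strings}."""
    X, y = [], []
    for v in videos:
        label = 1 if tag in v["tags"] else 0
        for shot_feat in v["shots"]:
            X.append(shot_feat)
            y.append(label)            # naive: every shot inherits the video-level label
    return LogisticRegression(max_iter=1000).fit(np.vstack(X), y)

def localize(clf, video, threshold=0.5):
    """Return indices of shots whose predicted tag probability exceeds the threshold."""
    probs = clf.predict_proba(video["shots"])[:, 1]
    return np.where(probs >= threshold)[0]
```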

3.
4.
Video affective semantic analysis: a type-intensity decomposition approach
郭戈  平西建 《计算机应用》2010,30(6):1704-1707
This paper proposes a method for affective semantic analysis of video. Human emotion is modeled in a two-dimensional space constructed from two independent components, intensity and type. Using multimodal auditory and visual features, a continuous affective-intensity temporal curve and a discrete affective-type temporal curve are constructed, establishing a mapping from the low-level feature space to the affective space and enabling affective annotation and analysis of video content. The type-intensity decomposition curves objectively and faithfully describe how the affective content of a video changes over time. Experimental results verify the effectiveness of the method.

5.
In this paper, we present a real time system for detecting repeated video clips from a live video source such as news broadcasts. Our system utilizes customized temporal video segmentation techniques to automatically partition the digital video signal into semantically sensible shots and scenes. As each frame of the video source is processed, we extract auxiliary information to facilitate repeated sequence detection. When the video transition marking the end of the shot/scene is detected, we are able to rapidly locate all previous occurrences of the video clip. Our objective is to use repeated sequence information in our multimedia content analysis application to deduce semantic relationships among shots/scenes in the input video. Our real time video processing techniques are independent of source and domain and can be applied to other applications such as commercial detection and improved video compression.
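A minimal sketch of the repeated-clip lookup described above: each completed shot is reduced to a coarse signature and stored in a hash table, so earlier occurrences can be located as soon as the closing transition is detected. The signature used here (a quantised mean-intensity profile) is an assumption for illustration only.

```python
# Hedged sketch of repeated-clip lookup over a live stream: finished shots are
# hashed by a coarse signature so previous occurrences are found immediately.
import numpy as np
from collections import defaultdict

class RepeatDetector:
    def __init__(self):
        self.index = defaultdict(list)   # signature -> list of earlier shot ids

    def shot_signature(self, frames):
        """frames: (n, h, w) grayscale array -> quantised 8-point intensity profile."""
        profile = frames.mean(axis=(1, 2))                    # one value per frame
        resampled = np.interp(np.linspace(0, len(profile) - 1, 8),
                              np.arange(len(profile)), profile)
        return tuple(np.round(resampled / 16).astype(int))    # coarse quantisation

    def add_shot(self, shot_id, frames):
        """Register a finished shot and return ids of earlier shots with the same signature."""
        sig = self.shot_signature(frames)
        previous = list(self.index[sig])
        self.index[sig].append(shot_id)
        return previous
```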

6.
Online video stream data are surging to an unprecedented level. Massive video publishing and sharing impose heavy demands on continuous video near-duplicate detection for many novel video applications. This paper presents an accurate and accelerated system for video near-duplicate detection over continuous video streams. We propose to transform a high-dimensional video stream into a one-dimensional Video Trend Stream (VTS) to monitor the continuous luminance changes of consecutive frames, based on which video similarity is derived. In order to do fast comparison and effective early pruning, a compact auxiliary signature named CutSig is proposed to approximate the video structure. CutSig explores cut distribution feature of the video structure and contributes to filter candidates quickly. To scan along a video stream in a rapid way, shot cuts with local maximum AI (average information value) in a query video are used as reference cuts, and a skipping approach based on reference cut alignment is embedded for efficient acceleration. Extensive experimental results on detecting diverse near-duplicates in real video streams show the effectiveness and efficiency of our method.
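The following toy sketch mimics the one-dimensional trend idea: the mean luminance of each frame is tracked and only the sign of its change is kept, giving a compact stream on which two videos can be compared cheaply. It is a simplification of the paper's VTS, and the agreement-based similarity measure is an assumption.

```python
# Hedged sketch of a one-dimensional luminance "trend" signal and a cheap
# similarity between two such streams. Simplified stand-in for the paper's VTS.
import numpy as np

def video_trend_stream(frames):
    """frames: (n, h, w) grayscale array -> array of +1/0/-1 luminance trends."""
    luminance = frames.mean(axis=(1, 2))
    return np.sign(np.diff(luminance)).astype(int)

def trend_similarity(vts_a, vts_b):
    """Fraction of aligned positions where the two trend streams agree."""
    n = min(len(vts_a), len(vts_b))
    return float(np.mean(vts_a[:n] == vts_b[:n]))
```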

7.
To address the quantization error in current bag-of-words (BoW) approaches to video semantic concept detection, and to extract low-level video features automatically and more effectively, this paper proposes a video semantic concept detection algorithm based on topographic independent component analysis (TICA) and Gaussian mixture models (GMM). First, features of video clips are extracted with the TICA algorithm, which can learn complex invariant features of the clips. Second, the visual features are modeled with a GMM to describe their distribution. Finally, a GMM supervector is constructed for each video clip and a support vector machine (SVM) is used for semantic concept detection. The GMM is a probabilistic extension of the BoW framework; it reduces quantization error and offers good robustness. Comparative experiments against the traditional BoW and SIFT-GMM methods on the TRECVID 2012 and OV video collections show that the TICA- and GMM-based method improves the accuracy of video semantic concept detection.
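A hedged sketch of the GMM-supervector step using scikit-learn: the local features of a clip are modeled by a Gaussian mixture whose stacked component means form a fixed-length vector for an SVM. The TICA feature extraction itself is not reproduced here, and the component count and kernel choice are placeholders.

```python
# Hedged sketch: GMM supervector per clip (stacked component means), then an SVM
# concept detector over supervectors. Parameter values are placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def gmm_supervector(local_features, n_components=8, seed=0):
    """local_features: (n, d) array of per-patch descriptors for one clip."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed).fit(local_features)
    return gmm.means_.flatten()          # concatenated component means

# Training a concept detector on supervectors (y holds 0/1 labels per clip):
#   X = np.vstack([gmm_supervector(feats) for feats in clip_features])
#   detector = SVC(kernel="rbf", probability=True).fit(X, y)
```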

8.
Bag-of-visual-words (BoW) has recently become a popular representation for describing video and image content. Most existing approaches, nevertheless, neglect inter-word relatedness and measure similarity by bin-to-bin comparison of visual words in histograms. In this paper, we explore the linguistic and ontological aspects of visual words for video analysis. Two approaches, soft-weighting and constraint-based earth mover's distance (CEMD), are proposed to model different aspects of visual word linguistics and proximity. In soft-weighting, visual words are weighted such that the linguistic meaning of words is taken into account for bin-to-bin histogram comparison. In CEMD, a cross-bin matching algorithm is formulated such that the ground distance measure considers the linguistic similarity of words. In particular, a BoW ontology which hierarchically specifies the hyponym relationships of words is constructed to assist the reasoning. We demonstrate soft-weighting and CEMD on two tasks: video semantic indexing and near-duplicate keyframe retrieval. Experimental results indicate that soft-weighting is superior to other popular weighting schemes, such as term frequency (TF) weighting, in a large-scale video database. In addition, CEMD shows excellent performance compared to cosine similarity in near-duplicate retrieval.
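A minimal sketch of the soft-weighting scheme: each descriptor contributes to its k nearest visual words with a weight that halves at every rank, rather than voting for a single word. The rank-based weights and the value of k are assumptions for illustration.

```python
# Hedged sketch of soft-weighted bag-of-visual-words: each descriptor spreads a
# decaying weight over its k nearest codebook words instead of hard assignment.
import numpy as np

def soft_weighted_bow(descriptors, codebook, k=4):
    """descriptors: (n, d); codebook: (m, d) visual words -> soft histogram of length m."""
    hist = np.zeros(len(codebook))
    for desc in descriptors:
        dists = np.linalg.norm(codebook - desc, axis=1)
        nearest = np.argsort(dists)[:k]
        for rank, word in enumerate(nearest):
            hist[word] += 1.0 / (2 ** rank)   # closest word counts most, then halves
    return hist / (hist.sum() + 1e-9)
```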

9.
Research on video retrieval based on high-level semantics
Semantic video retrieval is currently one of the hot research topics. Most existing video retrieval systems work at a non-semantic level, based on low-level features, which is far removed from the high-level semantic concepts that human thinking can understand; this severely limits the practical effectiveness of video retrieval. How to bridge the gap between low-level features and high-level semantics and perform video retrieval with high-level semantic concepts is the current research focus. Through a brief overview of semantic understanding, semantic analysis, and semantic extraction of video content, this paper attempts to construct a video semantic retrieval model.

10.
The massive volume of web videos creates an imperative demand for efficiently grasping major events. However, the distinct characteristics of web videos, such as the limited number of features, the noisy text information, and the unavoidable errors in near-duplicate keyframe (NDK) detection, make web video event mining a challenging task. In this paper, we propose a novel four-stage framework to improve the performance of web video event mining. Data preprocessing is the first stage. Multiple Correspondence Analysis (MCA) is then applied to explore the correlation between terms and classes, aiming to bridge the gap between NDKs and high-level semantic concepts. Next, co-occurrence information is used to detect the similarity between NDKs and classes using the NDK-within-video information. Finally, both of them are integrated for web video event mining through negative NDK pruning and positive NDK enhancement. Moreover, both NDKs and terms with relatively low frequencies are treated as useful information in our experiments. Experimental results on large-scale web videos from YouTube demonstrate that the proposed framework outperforms several existing mining methods and obtains good results for web video event mining.

11.

As one of the key technologies in content-based near-duplicate detection and video retrieval, video sequence matching can be used to judge whether two videos contain duplicate or near-duplicate segments. Despite the considerable research effort devoted in recent years, precisely and efficiently matching sequences among videos (which may be subject to complex audio-visual transformations) from a large-scale database remains a challenging task. To address this problem, this paper proposes a multiscale video sequence matching (MS-VSM) method, which gradually detects and locates the similar segments between videos from coarse to fine scales. At the coarse scale, it uses the Maximum Weight Matching (MWM) algorithm to rapidly select several candidate reference videos from the database for a given query. Then, for each candidate video, its most similar segment with respect to the query is obtained at the middle scale by the Constrained Longest Ascending Matching Subsequence (CLAMS) algorithm and used to judge whether that candidate contains a near-duplicate. If so, the precise locations of the near-duplicate segments in both the query and the reference video are determined at the fine scale by bi-directional scanning that checks the matching similarity at the segments' boundaries. As such, the MS-VSM method achieves excellent near-duplicate detection accuracy and localization precision with very high processing efficiency. Extensive experiments show that it outperforms several state-of-the-art methods remarkably on several benchmarks.
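To illustrate the middle-scale alignment step, the sketch below finds the longest chain of cut matches that ascends in both the query and the reference, a rough stand-in for the CLAMS idea; the actual constraints used in the paper are not reproduced.

```python
# Hedged stand-in for constrained ascending-subsequence matching: among candidate
# (query_cut, reference_cut) match pairs, keep the longest chain that is strictly
# increasing in both indices, approximating the best-aligned segment.
def longest_ascending_chain(matches):
    """matches: list of (query_idx, ref_idx) pairs -> longest chain ascending in both."""
    matches = sorted(matches)                       # process in ascending query order
    best = []                                       # best chain ending at each match
    for qi, ri in matches:
        candidates = [c for c in best if c[-1][0] < qi and c[-1][1] < ri]
        prefix = max(candidates, key=len, default=[])
        best.append(prefix + [(qi, ri)])
    return max(best, key=len, default=[])
```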


12.
This paper presents a probabilistic Bayesian belief network (BBN) method for automatic indexing of excitement clips in sports video sequences. The excitement clips are extracted from the sports video using audio features. Each excitement clip comprises multiple subclips corresponding to events such as replays, field views, close-ups of players, close-ups of referees/umpires, spectators, and players' gatherings. The events are detected and classified using a hierarchical classification scheme. The BBN, based on the observed events, is used to assign semantic concept labels to the excitement clips, such as goal, save, and card in soccer video, and wicket and hit in cricket video sequences. The BBN-based indexing results are compared with our previously proposed event-association based approach, and the BBN is found to perform better. The proposed scheme provides a generalizable method for linking low-level video features with high-level semantic concepts. Its generic nature in the sports domain is validated by demonstrating successful indexing of soccer and cricket video excitement clips. The proposed scheme offers a general approach to the automatic tagging of large-scale multimedia content with rich semantics. The collection of labeled excitement clips provides a video summary for highlight browsing, video skimming, indexing, and retrieval.

13.
14.
Semantic filtering and retrieval of multimedia content is crucial for efficient use of multimedia data repositories. Video query by semantic keywords is one of the most difficult problems in multimedia data retrieval. The difficulty lies in the mapping between low-level video representation and high-level semantics. We therefore formulate the multimedia content access problem as a multimedia pattern recognition problem. We propose a probabilistic framework for semantic video indexing, which can support filtering and retrieval and facilitate efficient content-based access. To map low-level features to high-level semantics we propose probabilistic multimedia objects (multijects). Examples of multijects in movies include explosion, mountain, beach, outdoor, and music. Semantic concepts in videos interact, and to model this interaction explicitly we propose a network of multijects (multinet). Using probabilistic models for six site multijects (rocks, sky, snow, water-body, forestry/greenery, and outdoor) and a Bayesian belief network as the multinet, we demonstrate the application of this framework to semantic indexing. We show how detection performance can be significantly improved by using the multinet to take interconceptual relationships into account. We also show how the multinet can fuse heterogeneous features to support detection based on inference and reasoning.

15.
Query by video clip
Typical digital video search is based on queries involving a single shot. We generalize this problem by allowing queries that involve a video clip (say, a 10-s video segment). We propose two schemes: (i) retrieval based on key frames follows the traditional approach of identifying shots, computing key frames from a video, and then extracting image features around the key frames. For each key frame in the query, a similarity value (using color, texture, and motion) is obtained with respect to the key frames in the database video. Consecutive key frames in the database video that are highly similar to the query key frames are then used to generate the set of retrieved video clips. (ii) In retrieval using sub-sampled frames, we uniformly sub-sample the query clip as well as the database video. Retrieval is based on matching color and texture features of the sub-sampled frames. Initial experiments on two video databases (basketball video with approximately 16,000 frames and a CNN news video with approximately 20,000 frames) show promising results. Additional experiments using segments from one basketball video as query and a different basketball video as the database show the effectiveness of feature representation and matching schemes.
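The following sketch illustrates scheme (i) in a simplified form: every query key frame is compared against the database key frames, and runs of consecutive, highly similar database frames are returned as candidate clips. Features are assumed to be L1-normalised histograms, and the similarity threshold is a placeholder.

```python
# Hedged sketch of key-frame-based clip retrieval: mark database key frames that
# are similar to any query key frame, then group consecutive hits into clips.
import numpy as np

def match_clip(query_keyframes, db_keyframes, sim_thresh=0.8):
    """Both arguments: (n, d) arrays of L1-normalised histogram features."""
    dists = np.abs(query_keyframes[:, None, :] - db_keyframes[None, :, :]).sum(axis=2)
    sims = 1.0 - 0.5 * dists                        # histogram L1 distance mapped to [0, 1]
    hit = sims.max(axis=0) >= sim_thresh            # db frames similar to any query frame
    clips, start = [], None
    for i, h in enumerate(hit):                     # group consecutive hits into (start, end)
        if h and start is None:
            start = i
        elif not h and start is not None:
            clips.append((start, i - 1))
            start = None
    if start is not None:
        clips.append((start, len(hit) - 1))
    return clips
```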

16.
Large-scale semantic concept detection in large video databases suffers from large variations among different semantic concepts as well as among their corresponding effective low-level features. In this paper, we propose a novel framework to deal with this obstacle. The proposed framework consists of four major components: feature pool construction, pre-filtering, modeling, and classification. First, a large low-level feature pool is constructed, from which a specific set of features is selected for the later steps, automatically or semi-automatically. Then, to deal with the class imbalance in the training set, a pre-filtering classifier is generated, with the aim of achieving a high recall rate and a precision rate of nearly 50% for a given concept. Thereafter, an SVM classifier is built on the pre-filtered training samples using the selected features from the feature pool. Finally, the SVM classifier is applied to semantic concept classification. This framework is flexible and extensible in terms of adding new features to the feature pool, introducing human interaction in selecting features, building models for new concepts, and adopting active learning.
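A hedged two-stage sketch of the pre-filtering idea using scikit-learn: a cheap, recall-oriented classifier passes likely positives on to an SVM trained on the selected feature subset. The classifier choices, class weighting, and threshold below are assumptions rather than the paper's settings.

```python
# Hedged sketch of pre-filtering followed by SVM classification on a selected
# feature subset. Thresholds and model choices are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def train_two_stage(X, y, feature_idx, prefilter_thresh=0.2):
    """X: (n, d) pooled features; y: 0/1 concept labels; feature_idx: selected columns."""
    Xs = X[:, feature_idx]
    prefilter = LogisticRegression(class_weight="balanced", max_iter=1000).fit(Xs, y)
    keep = prefilter.predict_proba(Xs)[:, 1] >= prefilter_thresh   # deliberately low bar
    svm = SVC(kernel="rbf", probability=True).fit(Xs[keep], y[keep])
    return prefilter, svm

def score_two_stage(prefilter, svm, X, feature_idx, prefilter_thresh=0.2):
    """Samples rejected by the pre-filter get a zero score; survivors get the SVM score."""
    Xs = X[:, feature_idx]
    passed = prefilter.predict_proba(Xs)[:, 1] >= prefilter_thresh
    scores = np.zeros(len(Xs))
    if passed.any():
        scores[passed] = svm.predict_proba(Xs[passed])[:, 1]
    return scores
```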

17.
18.
Smart video surveillance (SVS) applications enhance situational awareness by allowing domain analysts to focus on the events of higher priority. SVS approaches operate by trying to extract and interpret higher, "semantic"-level events that occur in video. One of the key challenges of SVS is person identification, where the task is to identify, for each subject that appears in a video shot, the person it corresponds to. The problem of person identification is especially challenging in resource-constrained environments where transmission delay, bandwidth restriction, and packet loss may prevent the capture of high-quality data. Conventional person identification approaches, which are primarily based on analyzing facial features, are often not sufficient to deal with poor-quality data. To address this challenge, we propose a framework that leverages heterogeneous contextual information together with facial features to handle person identification for low-quality data. We first investigate appropriate methods to utilize heterogeneous context features including clothing, activity, human attributes, gait, people co-occurrence, and so on. We then propose a unified approach for person identification that builds on top of our generic entity resolution framework called RelDC, which can integrate all these context features to improve the quality of person identification. This work thus links the well-known problem of person identification from the computer vision area (dealing with video/images) with the well-recognized challenge of entity resolution from the database and AI/ML areas (dealing with textual data). We apply the proposed solution to a real-world dataset consisting of several weeks of surveillance videos. The results demonstrate the effectiveness and efficiency of our approach even on low-quality video data.

19.
This paper proposes a semantic structure for soccer video: a soccer video consists of multiple semantic events, and each semantic event consists of several semantic shots. To analyze this structure, multiple hidden Markov models (HMMs) are built for two kinds of semantic events, "highlight events" and "ordinary events", and four features, field ratio, face ratio, edge, and motion intensity, are proposed as the observation input of the HMMs. The HMMs are trained with the three classical HMM algorithms, highlight events are identified, and each shot is annotated with semantics.
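As a rough illustration, the sketch below trains one Gaussian HMM on "highlight" observation sequences and one on "ordinary" sequences (using the hmmlearn package) and labels a new shot sequence with whichever model scores it higher. The four per-shot features are assumed to be extracted already; the state count and iteration count are placeholders.

```python
# Hedged sketch of two-model HMM event classification: train one HMM per event
# class on per-shot feature sequences, then pick the higher-likelihood model.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_event_model(sequences, n_states=3, seed=0):
    """sequences: list of (length_i, 4) arrays of per-shot features for one event class."""
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=50, random_state=seed)
    return model.fit(X, lengths)

def classify_event(seq, highlight_hmm, ordinary_hmm):
    """Assign the sequence to whichever event model explains it better."""
    return "highlight" if highlight_hmm.score(seq) > ordinary_hmm.score(seq) else "ordinary"
```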

20.
Video question answering (Video QA) has received much attention in recent years. It answers questions according to the visual content of a video clip. The Video QA task can be solved from the video data alone; however, if the video clip has relevant text information, it can also be solved using fused video and text data. Two problems then arise: how to select the useful region features from the video frames and the useful text features from the text information, and how to fuse the video and text features. Therefore, we propose a forget memory network to address these problems. The forget memory network with the video framework solves the Video QA task from the video data alone: it selects the region features useful for the question and forgets the irrelevant region features in the video frames. The forget memory network with the video-and-text framework extracts the useful text features and forgets the text features irrelevant to the question, and it fuses the video and text data to solve the Video QA task. The fused video and text features help improve the experimental performance.
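A toy PyTorch sketch of the "forget" gating idea: region features are scaled by their relevance to the question embedding, so irrelevant regions contribute little before pooling. The dimensions and the pooling choice are assumptions; this is not the paper's network.

```python
# Hedged, toy sketch of question-conditioned gating over region features:
# low-relevance regions are down-weighted ("forgotten") before pooling.
import torch
import torch.nn as nn

class ForgetGate(nn.Module):
    def __init__(self, region_dim=512, question_dim=512):
        super().__init__()
        self.score = nn.Linear(region_dim + question_dim, 1)

    def forward(self, regions, question):
        """regions: (n_regions, region_dim); question: (question_dim,) embedding."""
        q = question.unsqueeze(0).expand(regions.size(0), -1)
        gate = torch.sigmoid(self.score(torch.cat([regions, q], dim=1)))  # (n, 1)
        kept = regions * gate                  # "forget" low-relevance regions
        return kept.sum(dim=0)                 # pooled representation for answering
```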
