Similar Articles
20 similar articles found (search time: 78 ms)
1.
Numerous social videos now pervade the web. Social web videos are characterized by accompanying rich contextual information that describes their content and thus greatly facilitates video search and browsing. Generally, contextual data such as tags are provided at the whole-video level, without temporal indication of when they actually appear in the video, let alone spatial annotation of object-related tags in the video frames. However, many tags describe only parts of the video content. Therefore, tag localization, the process of assigning tags to the relevant video segments, frames, or even regions within frames, is attracting increasing research interest, and a benchmark dataset for the fair evaluation of tag localization algorithms is highly desirable. In this paper, we describe and release a dataset called DUT-WEBV, which contains about 4,000 videos collected from the YouTube portal by issuing 50 concepts as queries. These concepts cover a wide range of semantic aspects, including scenes like “mountain”, events like “flood”, objects like “cows”, sites like “gas station”, and activities like “handshaking”, offering great challenges to the tag (i.e., concept) localization task. For each video of a tag, we carefully annotate the time durations when the tag appears in the video and also label the spatial locations of objects with masks in frames for object-related tags. Besides the video itself, contextual information such as thumbnail images, titles, and YouTube categories is also provided. Together with this benchmark dataset, we present a baseline for tag localization using a multiple-instance learning approach. Finally, we discuss some open research issues for tag localization in web videos.
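The multiple-instance-learning baseline is described only at a high level in the abstract. A minimal sketch of the MIL idea for tag localization (all function names and thresholds below are illustrative assumptions, not the paper's implementation): a video is a "bag" of frame-level scores, the video-level tag label is the maximum over frames, and the tag is localized to runs of frames whose score exceeds a threshold.

```python
# Hypothetical MIL-style tag-localization sketch. Frame scores would come
# from a per-tag concept classifier; here they are hard-coded toy values.

def bag_score(frame_scores):
    """Video-level (bag) confidence = max over frame (instance) scores."""
    return max(frame_scores)

def localize_tag(frame_scores, threshold=0.5):
    """Return (start, end) index ranges of contiguous frames above threshold."""
    segments, start = [], None
    for i, s in enumerate(frame_scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(frame_scores) - 1))
    return segments

scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.6, 0.1]
print(bag_score(scores))     # 0.9
print(localize_tag(scores))  # [(2, 3), (5, 6)]
```

The max-pooling bag score is what lets a whole-video tag supervise frame-level predictions when only video-level labels exist.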

2.
This paper presents a unified approach to analyzing and structuring the content of videotaped lectures for distance-learning applications. By structuring lecture videos, we can support topic indexing and semantic querying of multimedia documents captured in traditional classrooms. Our goal in this paper is to automatically construct cross references between lecture videos and textual documents so as to facilitate synchronized browsing and presentation of multimedia information. The major issues involved in our approach are topical event detection, video text analysis, and the matching of slide shots with external documents. In topical event detection, a novel transition detector is proposed to rapidly locate slide shot boundaries by computing the changes of text and background regions in videos. For each detected topical event, multiple keyframes are extracted for video text detection, super-resolution reconstruction, binarization, and recognition. A new approach to reconstructing high-resolution textboxes based on linear interpolation and multi-frame integration is also proposed for effective binarization and recognition. The recognized characters are used to match video slide shots with external documents based on our proposed title and content similarity measures.

4.
Videos have diverse content that can assist students in learning. However, because videos are linear media, video users may take longer than readers of text to evaluate the context. Therefore, the process of video search may vary from one user to another depending on the user's individual characteristics, and the effectiveness of video learning may also vary across individuals. This study evaluated 100 Taiwanese fifth graders searching for videos related to “understanding animals” on YouTube and examined the effects of the students' metacognitive strategies (planning, monitoring, and evaluating) and verbal-imagery cognitive style on their video searches. The observable indicators were quantitative (search behaviors, search performance, and learning performance) and qualitative (search process observations and interviews). The study concludes that metacognitive strategy is the primary factor influencing video search. Students with better metacognitive skills used fewer keywords, browsed fewer videos, and spent less time evaluating videos, but they achieved higher learning performance. They reviewed the video metadata information on the user interface and did not attempt to watch videos on the video recommendation lists, particularly videos that were irrelevant to the task requirements. During the course of the searches, keyword usage had a significant influence on the students' search performance and learning performance: the fewer keywords the students used, the better search and learning performance they were able to achieve. Our results differ from those of previous studies on text, image, and map searches. Accordingly, users must adopt different search strategies when using various types of search engines.

5.
In this paper, we propose a Web video retrieval method that uses the hierarchical structure of Web video groups. Existing retrieval systems require users to input suitable queries that identify the desired contents in order to accurately retrieve Web videos; the proposed method enables retrieval of the desired Web videos even if users cannot formulate such queries. Specifically, we first select representative Web videos from a target video dataset by using link relationships between Web videos, obtained via the “related videos” metadata, together with heterogeneous video features. Using the representative Web videos, we then construct a network whose nodes and edges correspond, respectively, to Web videos and the links between them. Web video groups, i.e., sets of Web videos with similar topics, are then hierarchically extracted based on strongly connected components, edge betweenness, and modularity. By presenting the obtained hierarchical structure of Web video groups, users can easily grasp an overview of many Web videos. Consequently, even if users cannot write suitable queries that identify the desired contents, they can still accurately retrieve the desired Web videos by selecting Web video groups according to the hierarchical structure. Experimental results on actual Web videos verify the effectiveness of our method.
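The first grouping step above relies on strongly connected components of the "related videos" link graph. A self-contained sketch of that step (graph contents and variable names are invented for illustration; the paper additionally applies edge betweenness and modularity, which are omitted here):

```python
# Kosaraju's algorithm for strongly connected components (SCCs) on a
# directed edge list, standing in for the "related videos" link graph.
from collections import defaultdict

def strongly_connected_components(edges):
    graph, rev = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        rev[v].append(u)
        nodes.update((u, v))

    visited, order = set(), []
    def dfs_finish(u):
        """Iterative DFS recording nodes in order of finishing time."""
        stack = [(u, iter(graph[u]))]
        visited.add(u)
        while stack:
            node, it = stack[-1]
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(graph[v])))
                    break
            else:
                order.append(node)
                stack.pop()

    for u in nodes:
        if u not in visited:
            dfs_finish(u)

    # Second pass: flood-fill the transpose graph in reverse finish order.
    seen, comps = set(), []
    for u in reversed(order):
        if u in seen:
            continue
        comp, stack = set(), [u]
        seen.add(u)
        while stack:
            node = stack.pop()
            comp.add(node)
            for v in rev[node]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

# Two mutually linked video clusters joined by a one-way "related" link.
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "c")]
print(sorted(map(sorted, strongly_connected_components(edges))))
# [['a', 'b'], ['c', 'd']]
```

Each SCC is a candidate Web video group; the hierarchy would come from further splitting these components by betweenness and modularity.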

7.
This paper proposes unified video coding algorithms based on a one-dimensional (1D) representation of isometric feature mapping (Isomap). First, distance-preserving 1D Isomap representations are generated, which can achieve a very high compression ratio. Next, embedding and reconstruction algorithms for the 1D Isomap representation are presented that can transform samples from a high-dimensional space to a low-dimensional space and vice versa. Then, dictionary learning algorithms for training samples are proposed to compress the input samples. Finally, a unified coding framework for diverse videos based on the 1D Isomap representation is built. The proposed methods make full use of correlations between internal and external videos, which are not considered by classical methods. Simulation experiments show that the proposed methods obtain higher peak signal-to-noise ratios than standard High Efficiency Video Coding (HEVC) at similar bits-per-pixel levels in low-bit-rate situations.
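To make the "distance-preserving 1D representation" concrete, here is a toy sketch (emphatically not the paper's algorithm): for samples lying along a 1D manifold, a 1D Isomap-style coordinate can be approximated by cumulative geodesic distance measured neighbor-to-neighbor along the chain, so nearby samples get nearby scalar codes.

```python
# Toy 1D manifold coordinate: cumulative neighbor-to-neighbor distance.
# Assumes the points are already ordered along the manifold, which a real
# Isomap implementation would discover via a nearest-neighbor graph.
import math

def chain_coordinates(points):
    """Assign each 2D point its cumulative Euclidean distance along the chain."""
    coords = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        coords.append(coords[-1] + math.hypot(x1 - x0, y1 - y0))
    return coords

# Points on a unit quarter-circle: the scalar coordinate tracks arc length.
pts = [(math.cos(t), math.sin(t)) for t in (0.0, 0.5, 1.0, 1.5)]
coords = chain_coordinates(pts)
print(coords)
```

Each high-dimensional sample is thus summarized by one scalar, which is what makes the very high compression ratio mentioned above plausible; reconstruction requires the inverse mapping the paper's embedding/reconstruction algorithms provide.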

9.
Wireless video streaming on smartphones drains a significantly large fraction of battery energy, primarily consumed by wireless network interfaces downloading unused data and repeatedly switching the radio interface. In this paper, we propose an energy-efficient download scheduling algorithm for video streaming based on an aggregate model that uses the user's video viewing history to predict user behavior when watching a new video, thereby minimizing wasted energy when streaming over wireless network interfaces. The aggregate model combines a personal retention model built from the user's personal viewing history with the audience retention from crowd-sourced viewing history, and can accurately predict video-watching behavior by balancing “user interest” and “video attractiveness”. We evaluate different users streaming multiple videos in various wireless environments, and the results show that the aggregate model can help reduce energy waste by 20% on average. In addition, we discuss implementation details and extensions, such as dynamically updating personal retention, balancing audience and personal retention, and categorizing videos for a more accurate model.
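The balancing of personal and audience retention can be illustrated with a minimal sketch (the blend weight, threshold, and function names are assumptions for illustration, not the paper's formulation): blend the two retention curves, then prefetch only the leading video segments the user is likely to watch.

```python
# Blend a user's personal retention curve with the crowd-sourced audience
# retention curve, then decide how many segments are worth downloading.

def aggregate_retention(personal, audience, alpha=0.6):
    """Weighted blend of two retention curves (watch probability per segment)."""
    return [alpha * p + (1 - alpha) * a for p, a in zip(personal, audience)]

def segments_to_prefetch(retention, threshold=0.5):
    """Download only the leading segments likely to be watched."""
    n = 0
    for r in retention:
        if r < threshold:
            break
        n += 1
    return n

personal = [0.9, 0.8, 0.4, 0.2]   # this user tends to quit early
audience = [0.95, 0.9, 0.8, 0.7]  # the crowd tends to watch longer
blend = aggregate_retention(personal, audience)
print(segments_to_prefetch(blend))  # 3
```

Segments beyond the cutoff are fetched lazily only if the user keeps watching, which is where the energy saving comes from.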

10.
The sharing and re-sharing of videos on social sites, blogs, e-mail, and other channels has given rise to the phenomenon of viral videos: videos that become popular through internet sharing. In this paper we seek to better understand viral videos on YouTube by analyzing sharing and its relationship to video popularity using millions of YouTube videos. The socialness of a video is quantified by classifying the referrer sources for video views as social (e.g., an emailed link or a Facebook referral) or non-social (e.g., a link from related videos). We find that the viewership patterns of highly social videos are very different from those of less social videos. For example, highly social videos rise to, and fall from, their peak popularity more quickly than less social videos. We also find that not all highly social videos become popular, and not all popular videos are highly social. Using these insights on viral videos, we develop a method for ranking blogs and websites on their ability to spread viral videos.
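The socialness measure can be sketched as follows (the referrer categories and data shape here are assumptions, not the authors' exact taxonomy): classify each referrer as social or non-social and take the social share of total views.

```python
# Quantify a video's "socialness" as the fraction of views arriving
# through social referrers. Referrer names are illustrative.

SOCIAL_REFERRERS = {"email", "facebook", "twitter", "blog_embed"}

def socialness(view_log):
    """view_log: list of (referrer, view_count) pairs -> share of social views."""
    social = sum(c for r, c in view_log if r in SOCIAL_REFERRERS)
    total = sum(c for _, c in view_log)
    return social / total if total else 0.0

log = [("related_videos", 700), ("facebook", 200), ("email", 100)]
print(socialness(log))  # 0.3
```

Ranking a blog's ability to spread viral videos would then aggregate this score over the videos it refers traffic to.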

11.
The volume of surveillance videos is increasing rapidly, and humans are the major objects of interest in them. Rapid human retrieval in surveillance videos is therefore desirable and applicable to a broad spectrum of applications. Existing big data processing tools, which mainly target textual data, cannot be applied directly to the timely processing of large video data due to three main challenges: videos are more data-intensive than textual data; visual operations have higher computational complexity than textual operations; and traditional segmentation may damage video data's continuous semantics. In this paper, we design SurvSurf, a human retrieval system for large surveillance video data that exploits the characteristics of these data and of big data processing tools. We propose using the motion information contained in videos for video data segmentation. The basic data unit after segmentation is called an M-clip. M-clips help remove redundant video contents and reduce data volumes. We use the MapReduce framework to process M-clips in parallel for human detection and appearance/motion feature extraction. We further accelerate vision algorithms by processing only sub-areas with significant motion vectors rather than entire frames. In addition, we design a distributed data store called V-BigTable to structuralize M-clips' semantic information. V-BigTable enables efficient retrieval over a huge number of M-clips. We implement the system on Hadoop and HBase. Experimental results show that our system outperforms basic solutions by one order of magnitude in computational time with satisfactory human retrieval accuracy.
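The parallel processing step can be sketched in MapReduce style (the detector, clip representation, and all names below are stand-ins for illustration, not SurvSurf's actual code): map each M-clip independently to the people detected in it, then reduce by person to build a person-to-clips index of the kind V-BigTable stores.

```python
# Toy MapReduce-style pipeline over M-clips: emit (person, clip) pairs in
# the map phase, group clips by person in the reduce phase.
from collections import defaultdict

def map_phase(m_clips, detect):
    """Emit (person_id, clip_id) pairs from each clip independently."""
    pairs = []
    for clip_id, frames in m_clips.items():
        for person in detect(frames):
            pairs.append((person, clip_id))
    return pairs

def reduce_phase(pairs):
    """Group clip ids by person id."""
    index = defaultdict(set)
    for person, clip_id in pairs:
        index[person].add(clip_id)
    return dict(index)

# Stand-in detector: frames here are pre-labeled with the people they contain.
clips = {"c1": ["alice", "bob"], "c2": ["alice"], "c3": ["carol"]}
index = reduce_phase(map_phase(clips, detect=lambda frames: set(frames)))
print(sorted(index["alice"]))  # ['c1', 'c2']
```

Because each map call touches only one M-clip, the expensive vision work parallelizes cleanly across a Hadoop cluster.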

13.
Learning to predict future visual dynamics given input video sequences is a challenging but essential task. Although many stochastic video prediction models have been proposed, they still suffer from “multi-modal entanglement”, which refers to the ambiguity of learned representations in multi-modal dynamics modeling. While most existing video prediction models are label-free, we propose a self-supervised labeling strategy to improve spatiotemporal prediction networks without extra supervision. Starting from a set of clustered pseudo-labels, our framework alternates between model optimization and label updating. The key insight of our method is that we exploit the reconstruction error of the optimized model itself as an indicator to progressively refine the label assignment on the training set. The two steps are interdependent: the predictive model guides the direction of label updates, and in turn, effective pseudo-labels help the model learn better-disentangled multi-modal representations. Experiments on two different video prediction datasets demonstrate the effectiveness of the proposed method.
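The alternation between label updating and model optimization can be illustrated with a drastically reduced 1D analogue (each "mode" is just a scalar mean, and reconstruction error is squared distance; everything here is a sketch, not the paper's networks): assign each sample the pseudo-label of the mode that reconstructs it best, then refit each mode on its assigned samples.

```python
# 1D analogue of alternating pseudo-label refinement: label update by
# smallest reconstruction error, then model (mean) refitting per mode.

def refine_pseudo_labels(samples, modes, iterations=10):
    labels = [0] * len(samples)
    for _ in range(iterations):
        # Label update: pick the mode that reconstructs each sample best.
        labels = [min(range(len(modes)), key=lambda k: (s - modes[k]) ** 2)
                  for s in samples]
        # Model optimization: refit each mode on its assigned samples.
        for k in range(len(modes)):
            assigned = [s for s, l in zip(samples, labels) if l == k]
            if assigned:
                modes[k] = sum(assigned) / len(assigned)
    return labels, modes

samples = [0.1, 0.2, 0.15, 3.0, 3.2, 2.9]
labels, modes = refine_pseudo_labels(samples, modes=[0.0, 1.0])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In the paper, the scalar means are replaced by mode-conditioned prediction networks and the squared distance by the networks' reconstruction error, but the interdependence of the two steps is the same.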

14.
This paper presents a state-of-the-art review of feature extraction for soccer video summarization research. All existing approaches to event detection, video summarization based on the video stream, and the application of text sources in event detection are surveyed. Regarding the current challenges of automatic, real-time provision of summary videos, different computer vision approaches are discussed and compared. Audio and video feature extraction methods, and their combination with textual methods, are investigated. Available commercial products are presented to better clarify the boundaries of this domain, and future directions for improving existing systems are suggested.

16.
In this paper, we propose an approach that infers the labels of unlabeled consumer videos and at the same time recognizes the key segments of the videos by learning from Web image sets for video annotation. The key segments are recognized automatically by transferring the knowledge learned from related Web image sets to the videos. We introduce an adaptive latent structural SVM method that adapts classifiers pre-learned on Web image sets into an optimal target classifier, where the locations of the key segments are modeled as latent variables because ground-truth key segments are not available. We use a limited number of labeled videos and abundant labeled Web images to train the annotation models, which significantly alleviates the time-consuming and labor-intensive collection of large numbers of labeled training videos. Experiments on two challenging datasets, Columbia's Consumer Video (CCV) and TRECVID 2014 Multimedia Event Detection (MED2014), show that our method performs better than state-of-the-art methods.
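The latent-variable idea can be shown in miniature (the classifier is reduced to a dot product and all names are assumptions for illustration): since key-segment locations are unlabeled, score a video as the best classifier response over all candidate segments, and report the argmax segment as the inferred latent key segment.

```python
# Latent-variable scoring sketch: the video score is the max over candidate
# segments, and the maximizing segment is the inferred key segment.

def score_video(segment_features, w):
    """Return (best_score, latent_segment_index) for linear classifier w."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in segment_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best

w = [1.0, -0.5]  # stand-in for a classifier pre-learned from Web images
segments = [[0.1, 0.9], [0.8, 0.1], [0.4, 0.4]]
score, key_segment = score_video(segments, w)
print(key_segment)  # 1
```

During training, a latent structural SVM alternates between inferring these segment assignments and updating `w`, which is how the method learns without ground-truth key segments.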

17.
Temporal alignment of videos is an important requirement of tasks such as video comparison, analysis, and classification. Most approaches proposed to date for video alignment leverage dynamic programming algorithms whose parameters are manually tuned. In contrast, this paper proposes a model that learns its parameters automatically by minimizing a meaningful loss function over a given training set of videos and alignments. For learning, we exploit the effective framework of structural SVM and extend it with an original scoring function that suitably scores the alignment of two given videos, and a loss function that quantifies the accuracy of a predicted alignment. Experimental results on four video action datasets show that the proposed model outperforms both a baseline and a state-of-the-art algorithm by a large margin in terms of alignment accuracy.
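The manually tuned dynamic-programming baselines mentioned above are typically variants of dynamic time warping (DTW). A minimal DTW sketch (1D frame features and an absolute-difference cost, purely for illustration):

```python
# Classic DTW: minimal cumulative cost to align two sequences, allowing
# frames of either sequence to be repeated or skipped.

def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Return the minimal cumulative alignment cost between sequences a, b."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # advance in a only
                                    cost[i][j - 1],      # advance in b only
                                    cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

fast = [0, 1, 2, 3]
slow = [0, 0, 1, 1, 2, 2, 3, 3]  # same action performed at half speed
print(dtw(fast, slow))  # 0.0
```

The paper's contribution is to replace the hand-tuned `dist` and step costs with parameters learned by a structural SVM from example alignments.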

18.
Multimedia event-based video indexing using time intervals (cited by: 6; self-citations: 0; cited by others: 6)
We propose the time interval multimedia event (TIME) framework as a robust approach for classifying semantic events in multimodal video documents. The representation used in TIME extends the Allen temporal interval relations and allows for proper inclusion of context and synchronization of the heterogeneous information sources involved in multimodal video analysis. To demonstrate the viability of our approach, it was evaluated on the domains of soccer and news broadcasts. For automatic classification of semantic events, we compare three different machine learning techniques, i.e., the C4.5 decision tree, maximum entropy, and the support vector machine. The results show that semantic video indexing significantly benefits from using the TIME framework.
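The Allen temporal interval relations that TIME extends can be sketched with a small classifier over interval pairs (only a handful of Allen's thirteen relations are shown; the interval examples are invented for illustration):

```python
# Classify a pair of time intervals into a subset of Allen's relations.
# Intervals are (start, end) tuples with start < end.

def allen_relation(a, b):
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e1 == s2:
        return "meets"
    if s1 == s2 and e1 == e2:
        return "equal"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 < e1 < e2:
        return "overlaps"
    return "other"

# E.g., a goal event during a replay segment, then adjacent commentary:
print(allen_relation((2, 4), (1, 6)))  # during
print(allen_relation((0, 2), (2, 5)))  # meets
print(allen_relation((1, 3), (2, 6)))  # overlaps
```

Representing multimodal cues as intervals and reasoning over such relations is what lets TIME synchronize heterogeneous information sources.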

19.
Smart video surveillance (SVS) applications enhance situational awareness by allowing domain analysts to focus on the events of highest priority. SVS approaches operate by extracting and interpreting higher, “semantic”-level events that occur in video. One of the key challenges of SVS is person identification, where the task is to identify, for each subject that occurs in a video shot, the person it corresponds to. The problem of person identification is especially challenging in resource-constrained environments where transmission delay, bandwidth restriction, and packet loss may prevent the capture of high-quality data. Conventional person-identification approaches, which are primarily based on analyzing facial features, are often not sufficient to deal with poor-quality data. To address this challenge, we propose a framework that leverages heterogeneous contextual information together with facial features to handle person identification for low-quality data. We first investigate appropriate methods for utilizing heterogeneous context features, including clothing, activity, human attributes, gait, people co-occurrence, and so on. We then propose a unified approach for person identification that builds on top of our generic entity resolution framework, RelDC, which can integrate all of these context features to improve the quality of person identification. This work thus links a well-known problem from computer vision research, person identification in video and images, with a well-recognized challenge from the database and AI/ML areas, entity resolution over textual data. We apply the proposed solution to a real-world dataset consisting of several weeks of surveillance videos. The results demonstrate the effectiveness and efficiency of our approach even on low-quality video data.
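The fusion of facial and contextual evidence can be sketched as a weighted score combination (the feature names, weights, and scores below are invented for illustration and are not RelDC's actual model):

```python
# Fuse a weak face-match score with heterogeneous context scores
# (clothing, gait, co-occurrence) into one identification decision.

def fuse_scores(candidate_scores, weights):
    """candidate_scores: {person: {feature: score}} -> best-scoring person."""
    def total(person):
        return sum(weights[f] * s for f, s in candidate_scores[person].items())
    return max(candidate_scores, key=total)

weights = {"face": 0.4, "clothing": 0.3, "gait": 0.2, "cooccurrence": 0.1}
scores = {
    "alice": {"face": 0.3, "clothing": 0.9, "gait": 0.8, "cooccurrence": 0.9},
    "bob":   {"face": 0.5, "clothing": 0.2, "gait": 0.3, "cooccurrence": 0.1},
}
print(fuse_scores(scores, weights))  # alice
```

On low-quality footage the face score alone would pick the wrong person here; the contextual features tip the decision, which is the intuition behind the framework.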


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号