Similar documents
 Found 20 similar documents (search time: 31 ms)
1.
Automatic content analysis of sports videos is a valuable and challenging task. Motivated by analogies between a class of sports videos and languages, the authors propose a novel approach for sports video analysis based on compiler principles. It integrates both semantic analysis and syntactic analysis to automatically create an index and a table of contents for a sports video. Each shot of the video sequence is first annotated and indexed with semantic labels through detection of events using domain knowledge. A grammar-based parser is then constructed to identify the tree structure of the video content based on the labels. Meanwhile, the grammar can be used to detect and recover errors during the analysis. As a case study, a sports video parsing system is presented in the particular domain of diving. Experimental results indicate the proposed approach is effective.
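
The grammar-based parsing step described in this abstract can be illustrated with a small sketch. It is not the authors' parser: the shot labels (`action`, `replay`, `score`), the toy grammar, and the skip-until-next-action recovery rule are all assumptions chosen for illustration, since the abstract does not give the actual label set or grammar.

```python
from typing import List, Tuple

# Hypothetical shot labels; the real label set is not given in the abstract.
ACTION, REPLAY, SCORE = "action", "replay", "score"


def parse_video(labels: List[str]) -> Tuple[List[dict], List[int]]:
    """Parse a sequence of shot labels with the assumed toy grammar

        video -> dive+
        dive  -> ACTION REPLAY* SCORE?

    Returns (dives, skipped). Each dive records the shot indices of its
    parts; skipped lists the indices of shots that did not fit the grammar.
    """
    dives: List[dict] = []
    skipped: List[int] = []
    i = 0
    while i < len(labels):
        if labels[i] != ACTION:
            # Error recovery (assumed rule): drop shots until the next ACTION.
            skipped.append(i)
            i += 1
            continue
        dive = {"action": i, "replays": [], "score": None}
        i += 1
        while i < len(labels) and labels[i] == REPLAY:
            dive["replays"].append(i)
            i += 1
        if i < len(labels) and labels[i] == SCORE:
            dive["score"] = i
            i += 1
        dives.append(dive)
    return dives, skipped
```

The list of dives plays the role of a two-level table of contents (video, then dive, then shots), and the skipped indices show where the label sequence violated the grammar.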

2.
With the recent popularization of mobile video cameras, including camera phones, a new technology has been emerging: mobile video surveillance, which uses mobile video cameras for video surveillance. Such videos, however, may infringe upon the privacy of others by disclosing privacy sensitive information (PSI), i.e., their appearances. To prevent videos from infringing on the right to privacy, new techniques are required that automatically obscure PSI regions. The problem is how to determine the PSI regions to be obscured while maintaining enough video content to present the camera persons’ capture-intentions, i.e., what they want to record in their videos to achieve their surveillance tasks. To this end, we introduce a new concept called intended human objects, defined as human objects essential for capture-intentions, and develop a new method called intended human object detection that automatically detects the intended human objects in videos taken by different camera persons. Through the process of intended human object detection, we develop a system for automatically obscuring PSI regions. We experimentally show the performance of intended human object detection and the contributions of the features used. Our user study shows the potential applicability of our proposed system.

3.
4.
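Small object detection is an object detection technique that targets objects occupying few pixels in an image, using computer vision to locate such objects and determine their categories. Unlike the relatively mature detection of large- and medium-scale objects, small objects suffer from inherent shortcomings such as limited semantic information and small coverage area, so detection performance on them remains unsatisfactory; improving small object detection therefore remains a major challenge in computer vision. This paper reviews recent research on small object detection in China and abroad. Centering on small object detection techniques, it analyzes the definition of small objects and the difficulties of detecting them; it classifies and summarizes the methods that effectively improve small object detection accuracy and describes the applications, advantages, and disadvantages of each; finally, it forecasts future trends in the field.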

5.
6.
Describing visual contents in videos by semantic concepts is an effective and realistic approach that can be used in video applications such as annotation, indexing, retrieval and ranking. In these applications, video data needs to be labelled with some known set of labels or concepts. Assigning semantic concepts manually is not feasible due to the large volume of ever-growing video data. Hence, automatic semantic concept detection in videos is an active research area. Recently, deep convolutional neural networks (CNNs) have shown remarkable performance in computer vision tasks. In this paper, we present a novel approach for automatic semantic video concept detection using a deep CNN and a foreground driven concept co-occurrence matrix (FDCCM), which keeps foreground-to-background concept co-occurrence values and is built by exploiting concept co-occurrence relationships in the pre-labelled TRECVID video dataset and in a collection of random images extracted from Google Images. To deal with the dataset imbalance problem, we have extended this approach by fusing two asymmetrically trained deep CNNs and used the FDCCM to further improve concept detection. The performance of the proposed approach is compared with state-of-the-art approaches for video concept detection on the widely used TRECVID data set and is found to be superior to existing approaches.
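
The idea of using concept co-occurrence to adjust classifier scores can be sketched as follows. This is a simplified illustration, not the FDCCM from the paper: the conditional-frequency estimate, the use of the top-scoring concept as a stand-in for the foreground concept, and the blending weight `alpha` are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

Matrix = Dict[str, Dict[str, float]]


def cooccurrence(annotations: Iterable[Set[str]]) -> Matrix:
    """Estimate P(b present | a present) from per-item label sets."""
    count: Dict[str, int] = defaultdict(int)
    pair: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for labels in annotations:
        for a in labels:
            count[a] += 1
            for b in labels:
                if a != b:
                    pair[a][b] += 1
    return {a: {b: n / count[a] for b, n in row.items()} for a, row in pair.items()}


def refine(scores: Dict[str, float], cooc: Matrix, alpha: float = 0.5) -> Dict[str, float]:
    """Blend each score with co-occurrence support from the top-scoring concept.

    Assumed rule: the top-scoring concept stands in for the foreground
    concept and keeps its score; every other concept c becomes
    (1 - alpha) * score(c) + alpha * P(c | top).
    """
    top = max(scores, key=scores.get)
    support = cooc.get(top, {})
    return {
        c: s if c == top else (1 - alpha) * s + alpha * support.get(c, 0.0)
        for c, s in scores.items()
    }
```

With labelled sets in which `boat` frequently appears alongside `water`, a weak `water` score is raised when `boat` is the top concept, while a concept that never co-occurs with `boat` is pushed down.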

7.
Semantic filtering and retrieval of multimedia content is crucial for efficient use of multimedia data repositories. Video query by semantic keywords is one of the most difficult problems in multimedia data retrieval. The difficulty lies in the mapping between low-level video representation and high-level semantics. We therefore formulate the multimedia content access problem as a multimedia pattern recognition problem. We propose a probabilistic framework for semantic video indexing, which can support filtering and retrieval and facilitate efficient content-based access. To map low-level features to high-level semantics, we propose probabilistic multimedia objects (multijects). Examples of multijects in movies include explosion, mountain, beach, outdoor, music, etc. Semantic concepts in videos interact, and to model this interaction explicitly we propose a network of multijects (multinet). Using probabilistic models for six site multijects (rocks, sky, snow, water-body, forestry/greenery, and outdoor) and a Bayesian belief network as the multinet, we demonstrate the application of this framework to semantic indexing. We demonstrate how detection performance can be significantly improved by using the multinet to take interconceptual relationships into account. We also show how the multinet can fuse heterogeneous features to support detection based on inference and reasoning.

8.
Effective parsing of video through the spatial and temporal domains is vital to many computer vision problems because it allows objects in video to be labeled automatically rather than manually, which is tedious. Some works propose to parse the semantic information of individual 2D images or individual video frames; however, these approaches only make use of spatial information, ignore temporal continuity, and fail to consider the relevance of frames. On the other hand, some approaches which only consider the spatial information attempt to propagate labels in the temporal domain to parse the semantic information of the whole video, yet the non-injective and non-surjective natures of this propagation can cause the black hole effect. In this paper, inspired by some annotated image datasets (e.g., Stanford Background Dataset, LabelMe, and SIFT-FLOW), we propose to transfer or propagate such labels from images to videos. The proposed approach consists of three main stages: I) the posterior category probability density function (PDF) is learned by an algorithm which combines frame relevance and label propagation from images; II) the prior contextual constraint PDF on the map of pixel categories through the whole video is learned by Markov Random Fields (MRF); III) finally, based on both learned PDFs, the final parsing results are obtained by a maximum a posteriori (MAP) process, which is computed via a very efficient graph-cut based integer optimization algorithm. The experiments show that the black hole effect can be effectively handled by the proposed approach.

9.
Temporal segmentation of videos into meaningful image sequences containing some particular activities is an interesting problem in computer vision. We present a novel algorithm to achieve this semantic video segmentation. The segmentation task is accomplished through event detection in a frame-by-frame processing setup. We propose using one-class classification (OCC) techniques to detect events that indicate a new segment, since they have proven successful in object classification and they allow for unsupervised event detection in a natural way. Various OCC schemes have been tested and compared; additionally, an approach based on temporal self-similarity maps (TSSMs) is presented. The testing was done on a challenging publicly available thermal video dataset. The results are promising and show the suitability of our approaches for the task of temporal video segmentation.

10.
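Surveillance video is an important component of security systems. In every industry today, wherever safety is involved, surveillance video is indispensable. However, analysis of surveillance video content relies mainly on extensive manual effort, at great cost in labor and time. As the volume of surveillance video data keeps growing, improving the efficiency of video content analysis and reducing users' cognitive load are important to increasing video utilization. To address the problems that surveillance video contains much redundant information and that manually extracting key content is inefficient, a visual analysis system for surveillance video content is developed using spiral video summarization and corresponding interaction techniques. Combined with moving object detection results, the system exploits the display advantages of the spiral summary to visualize statistics of video objects from multiple perspectives, supplemented by navigation and positioning operations on the spiral summary and by sketch-based interaction, enabling fast and effective access to surveillance video content.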

11.
李冠彬  张锐斐  刘梦梦  刘劲  林倞 《软件学报》2023,34(12):5905-5920
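Video captioning aims to automatically generate content-rich textual descriptions for videos and has attracted wide research interest in recent years. An accurate and fine-grained video captioning method requires not only a global understanding of the video but also the local spatial and temporal features of specific salient objects. How to model a better video feature representation has long been a research focus and difficulty in video captioning. On the other hand, most existing work treats a sentence as a chain structure and treats video captioning as a process of generating a word sequence, ignoring the semantic structure of the sentence; this makes it hard for algorithms to handle and optimize complex sentence descriptions and the logical errors that tend to arise in long sentences. To address these problems, a novel language-structure-guided interpretable video semantic description generation method is proposed, which designs an attention-based structured tubelet localization mechanism to fully consider local object information and sentence semantic structure. Combined with the syntactic parse tree of the sentence, the proposed method can adaptively incorporate the spatio-temporal features corresponding to the textual content, further improving captioning performance. Experimental results on the mainstream video captioning benchmark datasets MSVD and MSR-VTT show that the proposed method achieves state-of-the-art performance on most evaluation metrics.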

12.
Video remains the method of choice for capturing temporal events. However, without access to the underlying 3D scene models, it remains difficult to make object level edits in a single video or across multiple videos. While it may be possible to explicitly reconstruct the 3D geometries to facilitate these edits, such a workflow is cumbersome, expensive, and tedious. In this work, we present a much simpler workflow to create plausible editing and mixing of raw video footage using only sparse structure points (SSP) directly recovered from the raw sequences. First, we utilize user-scribbles to structure the point representations obtained using structure-from-motion on the input videos. The resultant structure points, even when noisy and sparse, are then used to enable various video edits in 3D, including view perturbation, keyframe animation, object duplication and transfer across videos, etc. Specifically, we describe how to synthesize object images from new views adopting a novel image-based rendering technique using the SSPs as proxy for the missing 3D scene information. We propose a structure-preserving image warping on multiple input frames adaptively selected from object video, followed by a spatio-temporally coherent image stitching to compose the final object image. Simple planar shadows and depth maps are synthesized for objects to generate plausible video sequence mimicking real-world interactions. We demonstrate our system on a variety of input videos to produce complex edits, which are otherwise difficult to achieve.

13.
In this paper, we tackle the problem of segmenting out a sequence of actions from videos. The videos contain background and actions which are usually composed of ordered sub-actions. We refer to the sub-actions and the background as semantic units. Considering the possible overlap between two adjacent semantic units, we propose a bidirectional sliding window method to generate the label distributions for various segments in the video. The label distribution covers a certain number of semantic unit labels, representing the degree to which each label describes the video segment. The mapping from a video segment to its label distribution is then learned by a Label Distribution Learning (LDL) algorithm. Based on the LDL model, a soft video parsing method with segmental regular grammars is proposed to construct a tree structure for the video. Each leaf of the tree stands for a video clip of background or sub-action. The proposed method shows promising results on the THUMOS’14, MSR-II and UCF101 datasets, and its computational complexity is much lower than that of the compared state-of-the-art video parsing method.

14.
15.
We present a novel approach for multi-object tracking which considers object detection and spacetime trajectory estimation as a coupled optimization problem. Our approach is formulated in an MDL hypothesis selection framework, which allows it to recover from mismatches and temporarily lost tracks. Building upon a multi-view/multi-category object detector, it localizes cars and pedestrians in the input images. The 2D object detections are converted to 3D observations, which are accumulated in a world coordinate frame. Trajectory analysis in a spacetime window yields physically plausible trajectory candidates. Tracking is achieved by performing model selection after every frame. At each time instant, our approach searches for the globally optimal set of spacetime trajectories which provides the best explanation for the current image and all evidence collected so far, while satisfying the constraints that no two objects may occupy the same physical space, nor explain the same image pixels at any time. Successful trajectory hypotheses are then fed back to guide object detection in future frames. The resulting approach can initialize automatically and track a large and varying number of objects from both static and moving cameras. We evaluate our approach on several challenging video sequences with both a surveillance-type scenario and a scenario where the input videos are taken from a moving vehicle.

16.
Nowadays, numerous social videos pervade the web. Social web videos are characterized by rich accompanying contextual information which describes the content of the videos and thus greatly facilitates video search and browsing. Generally, contextual data such as tags are provided at the whole-video level, without temporal indication of when they actually appear in the video, let alone spatial annotation of object-related tags in the video frames. However, many tags only describe parts of the video content. Therefore, tag localization, the process of assigning tags to the underlying relevant video segments, frames, or even regions in frames, is gaining increasing research interest, and a benchmark dataset for the fair evaluation of tag localization algorithms is highly desirable. In this paper, we describe and release a dataset called DUT-WEBV, which contains about 4,000 videos collected from the YouTube portal by issuing 50 concepts as queries. These concepts cover a wide range of semantic aspects including scenes like “mountain”, events like “flood”, objects like “cows”, sites like “gas station”, and activities like “handshaking”, offering great challenges to the tag (i.e., concept) localization task. For each video of a tag, we carefully annotate the time durations when the tag appears in the video and, for object-related tags, also label the spatial location of the object with a mask in the frames. Besides the video itself, contextual information such as thumbnail images, titles, and YouTube categories is also provided. Together with this benchmark dataset, we present a baseline for tag localization using a multiple instance learning approach. Finally, we discuss some open research issues for tag localization in web videos.

17.
This paper presents a unified approach to analyzing and structuring the content of videotaped lectures for distance learning applications. By structuring lecture videos, we can support topic indexing and semantic querying of multimedia documents captured in traditional classrooms. Our goal in this paper is to automatically construct the cross references of lecture videos and textual documents so as to facilitate the synchronized browsing and presentation of multimedia information. The major issues involved in our approach are topical event detection, video text analysis and the matching of slide shots and external documents. In topical event detection, a novel transition detector is proposed to rapidly locate the slide shot boundaries by computing the changes of text and background regions in videos. For each detected topical event, multiple keyframes are extracted for video text detection, super-resolution reconstruction, binarization and recognition. A new approach for the reconstruction of high-resolution textboxes based on linear interpolation and multi-frame integration is also proposed for effective binarization and recognition. The recognized characters are utilized to match the video slide shots and external documents based on our proposed title and content similarity measures.

18.
Ye Lu  Ze-Nian Li 《Pattern recognition》2008,41(3):1159-1172
A new method of video object extraction is proposed to automatically extract the object of interest from actively acquired videos. Traditional video object extraction techniques often operate under the assumption of homogeneous object motion and extract various parts of the video that are motion consistent as objects. In contrast, the proposed active video object extraction (AVOE) approach assumes that the object of interest is being actively tracked by a non-calibrated camera under general motion and classifies the possible movements of the camera that result in the 2D motion patterns as recovered from the image sequence. Consequently, the AVOE method is able to extract the single object of interest from the active video. We formalize the AVOE process using notions from Gestalt psychology. We define a new Gestalt factor called “shift and hold” and present 2D object extraction algorithms. Moreover, since an active video sequence naturally contains multiple views of the object of interest, we demonstrate that these views can be combined to form a single 3D object regardless of whether the object is static or moving in the video.

19.
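The basic object detection task is to recognize objects in an image and annotate their categories and locations. However, object detection tasks in many applications carry semantic constraints, typically including quantity constraints on objects of a single category and spatial position constraints among multiple objects. For example, in a video-based production safety monitoring system, object detection must not only recognize and localize safety protective equipment but also check whether that equipment is worn properly. This work proposes a semantic constraint checking algorithm for object detection...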

20.
Hierarchical database for a multi-camera surveillance system   Total citations: 1, self-citations: 0, citations by others: 1
This paper presents a framework for event detection and video content analysis for visual surveillance applications. The system is able to coordinate the tracking of objects between multiple camera views, which may be overlapping or non-overlapping. The key novelty of our approach is that we can automatically learn a semantic scene model for a surveillance region, and have defined data models to support the storage of tracking data with different layers of abstraction into a surveillance database. The surveillance database provides a mechanism to generate video content summaries of objects detected by the system across the entire surveillance region in terms of the semantic scene model. In addition, the surveillance database supports spatio-temporal queries, which can be applied for event detection and notification applications.
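
A spatio-temporal query of the kind mentioned in this abstract might look like the sketch below. The schema is hypothetical: the `visits` table, its columns, and the zone names are illustrative assumptions, and an in-memory SQLite database stands in for whatever storage the described system uses.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical row layout: (object_id, camera_id, zone, t_enter, t_exit)
Visit = Tuple[int, int, str, float, float]


def build_db(rows: Iterable[Visit]) -> sqlite3.Connection:
    """Create an in-memory table of zone visits (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits ("
        "object_id INTEGER, camera_id INTEGER, zone TEXT, "
        "t_enter REAL, t_exit REAL)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def objects_in_zone(conn: sqlite3.Connection, zone: str,
                    t_start: float, t_end: float) -> List[int]:
    """Return ids of objects whose visit to `zone` overlaps [t_start, t_end]."""
    cur = conn.execute(
        "SELECT DISTINCT object_id FROM visits "
        "WHERE zone = ? AND t_enter <= ? AND t_exit >= ? "
        "ORDER BY object_id",
        (zone, t_end, t_start),
    )
    return [row[0] for row in cur]
```

Two intervals overlap exactly when each starts before the other ends, which is what the `t_enter <= t_end AND t_exit >= t_start` condition expresses. A notification rule could run such a query periodically for a restricted zone.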
