Similar Literature
20 similar documents found (search time: 256 ms)
1.
Panorama Stitching from Camera Image Sequences (Total citations: 17; self-citations: 2; citations by others: 17)
Panoramas are an important scene representation in virtual reality and computer vision. Producing a high-quality panorama usually requires expensive dedicated equipment and precise camera calibration at capture time. Stitching images from an ordinary camera is a low-cost and relatively flexible alternative. This paper proposes a new image stitching algorithm that builds a cylindrical panorama from a sequence of photographs taken as the camera rotates 360° about a vertical axis. The algorithm requires no camera calibration, places no strict limit on the camera rotation angle between adjacent frames, and is unaffected by drastic changes of illumination between frames. Experimental results show that the algorithm achieves the desired stitching quality.
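For readers who want to try the uncalibrated-stitching idea quickly, here is a minimal sketch using OpenCV's off-the-shelf high-level stitcher. It is not the paper's algorithm, just a baseline for the same task; the frame filenames are hypothetical.

```python
# Minimal panorama-stitching sketch with OpenCV's high-level Stitcher.
# NOT the paper's method -- an off-the-shelf baseline for stitching an
# uncalibrated rotating-camera sequence.
import cv2

# Hypothetical filenames for a sequence shot while rotating the camera.
frames = [cv2.imread(f"frame_{i:03d}.jpg") for i in range(24)]

stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
status, pano = stitcher.stitch(frames)
if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.jpg", pano)
else:
    print("stitching failed, status =", status)
```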

2.
Images/videos captured by portable devices (e.g., cellphones, DV cameras) often have limited fields of view. Image stitching, also referred to as mosaicing or panorama construction, can produce a wide-angle image by compositing several photographs together. Although various methods have been developed for image stitching in recent years, few works address the video stitching problem. In this paper, we present the first system to stitch videos captured by hand-held cameras. We first recover the 3D camera paths and a sparse set of 3D scene points using the CoSLAM system, and densely reconstruct the 3D scene in the overlapping regions. Then, we generate a smooth virtual camera path that stays in the middle of the original paths. Finally, the stitched video is synthesized along the virtual path as if it had been taken from this new trajectory. The warping required for the stitching is obtained by optimizing over both temporal stability and alignment quality, leveraging the 3D information at our disposal. The experiments show that our method produces high-quality stitching results for various challenging scenarios.
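The "middle" virtual path can be pictured with a toy computation: average the two recovered trajectories and smooth the result over time. This sketch ignores the paper's joint optimization of stability and alignment; path_a and path_b are hypothetical N x 3 arrays of camera positions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def virtual_path(path_a, path_b, sigma=5.0):
    """Toy 'middle' camera path: the midpoint of two recovered camera
    trajectories (N x 3 position arrays), smoothed along the time axis.
    A stand-in for the paper's optimized virtual path, not its method."""
    mid = 0.5 * (np.asarray(path_a) + np.asarray(path_b))
    return gaussian_filter1d(mid, sigma=sigma, axis=0)
```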

3.
This paper proposes a general scheme for recognizing the contents of a video using a set of panoramas recorded in a database. In essence, a panorama inherently records the appearance of an omni-directional scene from its central point in arbitrary viewing directions and thus can serve as a compact representation of an environment. In particular, this paper emphasizes the use of a sequence of successive video frames, instead of a single frame, for visual recognition. The associated recognition task is formulated as a shortest-path search problem, and a dynamic-programming technique is used to solve it. Experimental results show that our method can effectively recognize a video.
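A generic version of the shortest-path formulation can be sketched as dynamic programming over a frame-by-state cost matrix; the one-step transition constraint below is an illustrative assumption, not the paper's exact transition model.

```python
import numpy as np

def best_alignment(cost):
    """Minimum-cost state sequence through a (frames x states) cost matrix,
    where consecutive states may move by at most one step -- a generic
    stand-in for the paper's shortest-path recognition formulation."""
    T, S = cost.shape
    acc = np.asarray(cost, dtype=np.float64).copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        for s in range(S):
            lo, hi = max(0, s - 1), min(S, s + 2)
            prev = int(np.argmin(acc[t - 1, lo:hi])) + lo
            back[t, s] = prev
            acc[t, s] = cost[t, s] + acc[t - 1, prev]
    path = [int(np.argmin(acc[-1]))]          # best final state
    for t in range(T - 1, 0, -1):             # walk the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```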

4.
A key characteristic of video data is its associated spatial and temporal semantics, so it is important that a video model capture the characteristics of objects and their relationships in time and space. J.F. Allen's (1983) 13 temporal relationships are often used in formulating queries that involve the temporal relationships among video frames. For the spatial relationships, most approaches are based on projecting objects onto a two- or three-dimensional coordinate system. However, very few attempts have been made to formally represent the spatio-temporal relationships of objects contained in video data and to formulate queries with spatio-temporal constraints. The purpose of this work is to design a model representation for specifying the spatio-temporal relationships among objects in video sequences. The model describes the spatial relationships among objects for each frame in a given video scene and the temporal relationships (for this frame) of the temporal intervals measuring the duration of these spatial relationships. It also models the temporal composition of an object, which reflects the evolution of the object's spatial relationships over the subsequent frames in the video scene and in the entire video sequence. Our model representation also provides an effective and expressive way for the complete and precise specification of distances among objects in digital video. This model is a basis for the annotation of raw video.
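Since the model builds on Allen's interval algebra, it helps to see the 13 relations made concrete. The classifier below is a minimal sketch assuming intervals are (start, end) pairs with start < end.

```python
def allen_relation(a, b):
    """Classify J.F. Allen's 13 interval relations between two proper
    intervals a = (a_start, a_end) and b = (b_start, b_end)."""
    (as_, ae), (bs, be) = a, b
    if ae < bs:   return "before"
    if be < as_:  return "after"
    if ae == bs:  return "meets"
    if be == as_: return "met-by"
    if as_ == bs and ae == be: return "equals"
    if as_ == bs: return "starts" if ae < be else "started-by"
    if ae == be:  return "finishes" if as_ > bs else "finished-by"
    if bs < as_ and ae < be: return "during"
    if as_ < bs and be < ae: return "contains"
    return "overlaps" if as_ < bs else "overlapped-by"
```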

5.
Motion, as a feature of video that changes across temporal sequences, is crucial to visual understanding. Powerful video representation and extraction models are typically able to focus attention on motion features in challenging dynamic environments to complete more complex video understanding tasks. However, previous approaches discriminate mainly based on similar features in the spatial or temporal domain, ignoring the interdependence of consecutive video frames. In this paper, we propose the motion-sensitive self-supervised collaborative network, a video representation learning framework that exploits a pretext task to assist feature comparison and strengthen the spatiotemporal discrimination power of the model. Specifically, we first propose the motion-aware module, which extracts consecutive motion features from spatial regions by frame difference. The global-local contrastive module is then introduced, with context and enhanced video snippets defined as appropriate positive samples for a broader feature similarity comparison. Finally, we introduce the snippet operation prediction module, which further assists contrastive learning to obtain more reliable global semantics by sensing changes in continuous frame features. Experimental results demonstrate that our approach effectively extracts robust motion features and achieves competitive performance compared with other state-of-the-art self-supervised methods on downstream action recognition and video retrieval tasks.
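The motion-aware module's core cue, the frame difference, is simple to reproduce. The sketch below is a plain OpenCV version, not the module's learned variant; the threshold is a made-up constant.

```python
import cv2
import numpy as np

def motion_map(prev_frame, frame, thresh=25):
    """Frame-difference motion cue: absolute grayscale difference between
    consecutive frames, thresholded into a binary motion mask."""
    g0 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(g0, g1)
    return (diff > thresh).astype(np.uint8)  # 1 where motion, 0 elsewhere
```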

6.
For sports videos captured in the same scene, this paper proposes a synthesis method based on global motion compensation and information about the moving foreground regions. First, for the video to be composited, global motion estimation and compensation spatially align the neighboring frames to the current frame, and the frame difference yields the moving foreground region of the current frame. Then, based on the similarity between the backgrounds of the two videos to be composited, the global motion parameters are computed and refined to determine the positional relationship between corresponding frames. Finally, the composite frames are generated from the foreground region information obtained earlier. Experimental results show that the method can automatically composite sports videos captured in the same scene with similar dynamic backgrounds, preserves the sharpness of both foreground and background, and clearly displays the differences between athletes' movements.
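A rough equivalent of the align-then-difference step can be written with OpenCV's ECC alignment. This is an illustrative substitute for the paper's global motion estimation, and the full findTransformECC signature below assumes OpenCV 4.x.

```python
import cv2
import numpy as np

def foreground_residual(neighbor, current):
    """Align a neighboring frame to the current frame with ECC (affine
    global-motion model), then take the frame difference: what survives
    the alignment is mostly the moving foreground."""
    g_cur = cv2.cvtColor(current, cv2.COLOR_BGR2GRAY)
    g_nb = cv2.cvtColor(neighbor, cv2.COLOR_BGR2GRAY)
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-6)
    _, warp = cv2.findTransformECC(g_cur, g_nb, warp, cv2.MOTION_AFFINE,
                                   criteria, None, 5)  # OpenCV 4.x signature
    h, w = g_cur.shape
    aligned = cv2.warpAffine(g_nb, warp, (w, h),
                             flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    return cv2.absdiff(aligned, g_cur)
```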

7.
A JAVA Implementation Algorithm for a Panorama Viewer (Total citations: 1; self-citations: 0; citations by others: 1)
The panorama (panorama image) is an interactive representation of virtual scenes that has recently appeared on the Internet. It reproduces a three-dimensional scene with images and, with a suitable viewer, allows walkthroughs of the virtual scene with a strong sense of realism and immersion. This paper describes in detail how such a panorama viewer works and gives the key JAVA code.
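To make the viewer's rendering principle concrete: a pinhole view of a cylindrical panorama is just a per-pixel resampling of the panorama image. The sketch below (NumPy/OpenCV rather than the paper's JAVA) assumes the panorama spans a full 360° horizontally.

```python
import numpy as np
import cv2

def render_view(pano, pan_deg, fov_deg=60, out_w=640, out_h=480):
    """Sample a pinhole view from a full-360° cylindrical panorama.
    Each output pixel casts a ray; its pan angle selects the panorama
    column and its height on the cylinder selects the row."""
    H, W = pano.shape[:2]
    f = out_w / (2 * np.tan(np.radians(fov_deg) / 2))  # virtual focal length
    xx, yy = np.meshgrid(np.arange(out_w) - out_w / 2,
                         np.arange(out_h) - out_h / 2)
    theta = np.arctan2(xx, f) + np.radians(pan_deg)    # per-pixel pan angle
    u = np.mod(theta, 2 * np.pi) / (2 * np.pi) * W     # panorama column
    v = yy / np.hypot(xx, f) * (W / (2 * np.pi)) + H / 2  # panorama row
    return cv2.remap(pano, u.astype(np.float32), v.astype(np.float32),
                     cv2.INTER_LINEAR)
```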

8.
Browsing video scenes is the process of unfolding the story scenarios of a long video archive, and it can help users locate their desired video segments quickly and efficiently. Automatic scene detection in a long video stream is hence the first and crucial step toward a concise and comprehensive content-based representation for indexing, browsing, and retrieval purposes. In this paper, we present a novel scene detection scheme for various video types. We first detect video shots using a coarse-to-fine algorithm. Key frames without useful information are detected and removed using template matching. Spatio-temporally coherent shots are then grouped into the same scene based on the temporal constraints of video content and the visual similarity of shot activity. The proposed algorithm has been evaluated on various types of videos, including movies and TV programs. Promising experimental results show that the proposed method supports efficient retrieval of video content of interest.
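A toy version of the coarse stage, declaring a cut when consecutive frames' color histograms change sharply, looks like the sketch below; the real detector refines such candidates, and the threshold here is a made-up constant.

```python
import cv2

def shot_boundaries(video_path, thresh=0.5):
    """Coarse shot-cut detection: flag frames whose HSV histogram differs
    sharply from the previous frame's (Bhattacharyya distance)."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
        hist = cv2.normalize(hist, None).flatten()
        if prev_hist is not None and \
           cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > thresh:
            cuts.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts
```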

9.
A new method is proposed for stitching a panorama from a long video sequence. The sequence is divided into segments, each of which is stitched with a scene-manifold algorithm; a dynamic-programming search then finds the optimal seam between adjacent stitched images, and all stitched images are sewn together in turn to obtain the panorama of the entire sequence. Experimental results show that panoramas produced this way preserve the detail of the source images as much as possible, exhibit no obvious distortion or deformation, and look visually pleasing. In practice, the method handles panorama stitching of long video sequences containing dynamic scenes well.
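The optimal-seam step can be illustrated with the classic dynamic-programming recurrence over the overlap's error surface; this is the standard seam recurrence, not necessarily the paper's exact cost.

```python
import numpy as np

def best_seam(err):
    """Find the vertical seam of minimum total error through an H x W
    error surface (e.g., squared difference of two aligned images in
    their overlap), using the standard DP recurrence."""
    acc = np.asarray(err, dtype=np.float64).copy()
    H, W = acc.shape
    for y in range(1, H):
        left = np.r_[np.inf, acc[y - 1, :-1]]
        right = np.r_[acc[y - 1, 1:], np.inf]
        acc[y] += np.minimum(np.minimum(left, acc[y - 1]), right)
    seam = [int(np.argmin(acc[-1]))]
    for y in range(H - 2, -1, -1):          # trace back, row by row
        x = seam[-1]
        lo, hi = max(0, x - 1), min(W, x + 2)
        seam.append(int(np.argmin(acc[y, lo:hi])) + lo)
    return seam[::-1]  # one column index per row, top to bottom
```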

10.
A variety of recognition architectures based on deep convolutional neural networks have been devised for labeling videos containing human motion with action labels. However, so far, most works cannot properly deal with the temporal dynamics encoded in multiple contiguous frames, which distinguishes action recognition from other recognition tasks. This paper develops a temporal extension of convolutional neural networks to exploit motion-dependent features for recognizing human actions in video. Our approach differs from other recent attempts in that it uses multiplicative interactions between convolutional outputs to describe motion information across contiguous frames. Interestingly, a representation of image content arises as a by-product of extracting the motion pattern, which lets our model effectively incorporate both when analyzing video. Additional theoretical analysis proves that motion- and content-dependent features arise simultaneously from the developed architecture, whereas previous works mostly deal with the two separately. Our architecture is trained and evaluated on the standard video action benchmarks KTH and UCF101, where it matches the state of the art and has distinct advantages over previous attempts to use deep convolutional architectures for action recognition.
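The multiplicative-interaction idea can be sketched in a few lines of PyTorch; the layer sizes are invented for illustration and this is not the paper's full architecture.

```python
import torch
import torch.nn as nn

class MultiplicativeMotion(nn.Module):
    """Toy multiplicative interaction between the convolutional outputs of
    two contiguous frames: element-wise products of the filtered frames
    encode motion, while the filtered frames themselves keep content."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, kernel_size=7, padding=3)

    def forward(self, frame_t, frame_t1):
        f0, f1 = self.conv(frame_t), self.conv(frame_t1)
        motion = f0 * f1    # multiplicative cross-frame interaction
        content = f0 + f1   # content pathway
        return torch.cat([motion, content], dim=1)
```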

11.
This study proposes a robust video hashing method for video copy detection. The proposed method, which is based on representative-dispersive frames (R-D frames), can reveal the global and local information of a video. In this method, a video is represented as a graph with frames as vertices, and a similarity measure is proposed to calculate the edge weights. To select R-D frames, the adjacency matrix of the generated graph is constructed, the adjacency count of each vertex is calculated, and the vertices that represent the R-D frames of the video are then selected. To reveal the temporal and spatial information of the video, all R-D frames are scanned to constitute an image called the video tomography image, whose fourth-order cumulant is calculated to generate a hash sequence that inherently describes the corresponding video. Experimental results show that the proposed video hashing is resistant to geometric attacks on frames and to channel impairments during transmission.
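A simplified reading of the R-D frame selection step, with an assumed frame-similarity matrix as input: threshold it into an adjacency matrix, then greedily keep high-degree frames while suppressing the frames they already represent.

```python
import numpy as np

def rd_frames(similarity, sim_thresh=0.8, k=10):
    """Greedy R-D frame selection sketch: build the adjacency matrix from
    a frame-similarity matrix, count each vertex's neighbors, and keep
    high-degree frames while dropping the frames they cover."""
    adj = similarity > sim_thresh
    np.fill_diagonal(adj, False)
    degree = adj.sum(axis=1).astype(float)
    chosen = []
    for _ in range(k):
        v = int(np.argmax(degree))
        if degree[v] < 0:          # everything is already covered
            break
        chosen.append(v)
        degree[adj[v]] = -1        # frames represented by v
        degree[v] = -1
    return chosen
```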

12.
13.
Motion layer extraction in the presence of occlusion using graph cuts (Total citations: 1; self-citations: 0; citations by others: 1)
Extracting layers from video is very important for video representation, analysis, compression, and synthesis. Assuming that a scene can be approximately described by multiple planar regions, this paper describes a robust and novel approach to automatically extract a set of affine or projective transformations induced by these regions, detect the occluded pixels over multiple consecutive frames, and segment the scene into several motion layers. First, after determining a number of seed regions using correspondences in two frames, we expand the seed regions and reject outliers employing a graph-cuts method integrated with a level-set representation. Next, these initial regions are merged into several initial layers according to motion similarity. Third, an occlusion-order constraint over multiple frames is explored; it enforces that the occluded area increases with temporal order over a short period, and it effectively maintains segmentation consistency over multiple consecutive frames. The correct layer segmentation is then obtained with a graph-cuts algorithm, and the occlusions between overlapping layers are explicitly determined. Experimental results demonstrate that our approach is effective and robust.

14.
Methods based on deep convolutional neural networks (DCNNs) have recently kept setting new records on the task of predicting depth maps from monocular images. When dealing with video-based applications such as 2D (2-dimensional) to 3D (3-dimensional) video conversion, however, these approaches tend to produce temporally inconsistent depth maps, since their CNN models are optimized over single frames. In this paper, we address this problem by introducing a novel spatial-temporal conditional random field (CRF) model into the DCNN architecture, which is able to enforce temporal consistency between depth map estimates over consecutive video frames. In our approach, temporally consistent superpixels (TSP) are first computed on an image sequence to establish the correspondence of targets in consecutive frames. A DCNN is then used to regress the depth value of each temporal superpixel, followed by a spatial-temporal CRF layer that models the relationships of the estimated depths in both the spatial and temporal domains. The parameters of the DCNN and CRF models are jointly optimized with back-propagation. Experimental results show that our approach not only significantly enhances the temporal consistency of the estimated depth maps over existing single-frame-based approaches, but also improves depth estimation accuracy in terms of various evaluation metrics.
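The temporal term can be pictured as a penalty on depth changes at corresponding pixels of consecutive frames. The sketch below is a crude differentiable stand-in for the CRF temporal potential, with hypothetical match index arrays.

```python
import torch

def temporal_consistency_loss(depth_t, depth_t1, match_t, match_t1):
    """Penalize depth changes on corresponding (e.g., TSP-matched) pixels
    of consecutive frames. depth_* are (H, W) tensors; match_* are (N, 2)
    integer pixel indices (row, col) of matched locations."""
    d0 = depth_t[match_t[:, 0], match_t[:, 1]]
    d1 = depth_t1[match_t1[:, 0], match_t1[:, 1]]
    return torch.mean((d0 - d1) ** 2)
```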

15.
许源, 薛向阳. 《计算机科学》 2006, 33(11): 134-138
Accurately extracting high-level semantic features from video helps content-based video retrieval. Local high-level semantic features describe the objects in an image frame. Considering the characteristics of both the object itself and the particular scene it appears in, we propose an algorithm that combines the local and global information of an image frame to extract local high-level semantic video features. Experimental results on the TRECVID2005 data set show that this method performs better than methods based on local or global information alone.

16.
This paper presents a probabilistic framework for discovering objects in video. The video can switch between different shots, the unknown objects can leave or enter the scene at multiple times, and the background can be cluttered. The framework consists of an appearance model and a motion model. The appearance model exploits the consistency of object parts in appearance across frames. We use maximally stable extremal regions as observations in the model and hence provide robustness to object variations in scale, lighting and viewpoint. The appearance model provides location and scale estimates of the unknown objects through a compact probabilistic representation. The compact representation contains knowledge of the scene at the object level, thus allowing us to augment it with motion information using a motion model. This framework can be applied to a wide range of different videos and object types, and provides a basis for higher level video content analysis tasks. We present applications of video object discovery to video content analysis problems such as video segmentation and threading, and demonstrate superior performance to methods that exploit global image statistics and frequent itemset data mining techniques.
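The appearance observations, maximally stable extremal regions, are available directly in OpenCV, which is enough to reproduce the detection step (though not the probabilistic model itself):

```python
import cv2

def stable_regions(frame):
    """Detect maximally stable extremal regions (MSERs). OpenCV returns
    each region as an array of pixel coordinates plus a bounding box."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    regions, boxes = mser.detectRegions(gray)
    return regions, boxes
```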

17.
Panorama Stitching for Video Sequences (Total citations: 10; self-citations: 0; citations by others: 10)
This paper proposes a method for stitching a panorama from a video sequence. It focuses mainly on sequences containing large non-rigid moving objects, though the method applies equally to pure-background sequences without moving objects. To compute the projective relations between frames, camera motion is described with an affine model whose parameters are estimated by feature-point matching. Because correlation-based matching yields a low rate of correct matches, RANSAC (Random Sample Consensus) is used to filter the matches so that the camera motion parameters can be estimated accurately. Using these motion parameters, the frames are projected into alignment; subtracting multiple frames and intersecting the results estimates the region occupied by the moving object in each frame, from which the final panorama is computed. Comparing the results of this method with earlier ones shows that it produces panoramas of higher quality.
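The match-then-RANSAC step maps naturally onto modern OpenCV calls. The sketch below substitutes ORB features for the paper's correlation matching and estimates the affine motion model with RANSAC.

```python
import cv2
import numpy as np

def affine_between(frame_a, frame_b):
    """Feature matching plus RANSAC estimation of the affine camera-motion
    model between two frames -- a modern stand-in for the paper's
    correlation matching followed by RANSAC filtering."""
    orb = cv2.ORB_create(1000)
    ka, da = orb.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    kb, db = orb.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(da, db)
    pa = np.float32([ka[m.queryIdx].pt for m in matches])
    pb = np.float32([kb[m.trainIdx].pt for m in matches])
    M, inliers = cv2.estimateAffine2D(pa, pb, method=cv2.RANSAC,
                                      ransacReprojThreshold=3.0)
    return M, inliers
```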

18.
Video remains the method of choice for capturing temporal events. However, without access to the underlying 3D scene models, it remains difficult to make object-level edits in a single video or across multiple videos. While it may be possible to explicitly reconstruct the 3D geometries to facilitate these edits, such a workflow is cumbersome, expensive, and tedious. In this work, we present a much simpler workflow to create plausible editing and mixing of raw video footage using only sparse structure points (SSP) directly recovered from the raw sequences. First, we utilize user scribbles to structure the point representations obtained using structure-from-motion on the input videos. The resultant structure points, even when noisy and sparse, are then used to enable various video edits in 3D, including view perturbation, keyframe animation, object duplication and transfer across videos, etc. Specifically, we describe how to synthesize object images from new views adopting a novel image-based rendering technique using the SSPs as a proxy for the missing 3D scene information. We propose a structure-preserving image warping on multiple input frames adaptively selected from the object video, followed by a spatio-temporally coherent image stitching to compose the final object image. Simple planar shadows and depth maps are synthesized for objects to generate plausible video sequences mimicking real-world interactions. We demonstrate our system on a variety of input videos to produce complex edits, which are otherwise difficult to achieve.

19.
A thousand words in a scene (Total citations: 2; self-citations: 0; citations by others: 2)
This paper presents a novel approach for visual scene modeling and classification, investigating the combined use of text modeling methods and local invariant features. Our work attempts to elucidate (1) whether a text-like bag-of-visterms (BOV) representation (a histogram of quantized local visual features) is suitable for scene (rather than object) classification, (2) whether some analogies between discrete scene representations and text documents exist, and (3) whether unsupervised latent-space models can be used both as feature extractors for the classification task and to discover patterns of visual co-occurrence. Using several data sets, we validate our approach, presenting and discussing experiments on each of these issues. We first show, with extensive experiments on binary and multiclass scene classification tasks using a 9,500-image data set, that the BOV representation consistently outperforms classical scene classification approaches. On other data sets, we show that our approach competes with or outperforms other recent, more complex methods. We also show that probabilistic latent semantic analysis (PLSA) generates a compact scene representation, is discriminative for accurate classification, and is more robust than the BOV representation when less labeled training data is available. Finally, through aspect-based image ranking experiments, we show the ability of PLSA to automatically extract visually meaningful scene patterns, making such a representation useful for browsing image collections.
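The BOV representation itself is a short computation: quantize local descriptors against a k-means vocabulary and histogram the assignments. descriptor_sets below is a hypothetical list of per-image descriptor arrays (e.g., from a local feature detector).

```python
import numpy as np
from sklearn.cluster import KMeans

def bov_histograms(descriptor_sets, vocab_size=300):
    """Bag-of-visterms sketch: learn a visual vocabulary with k-means,
    then describe each image by its normalized visual-word histogram."""
    all_desc = np.vstack(descriptor_sets)
    vocab = KMeans(n_clusters=vocab_size, n_init=4).fit(all_desc)
    hists = []
    for desc in descriptor_sets:
        words = vocab.predict(desc)
        h, _ = np.histogram(words, bins=np.arange(vocab_size + 1))
        hists.append(h / max(h.sum(), 1))
    return np.array(hists)
```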

20.
In recent years, deep learning has shown excellent performance in artificial intelligence. Deep-learning-based face generation and manipulation techniques can now synthesize realistic forged face videos, also known as deepfakes, that the human eye can hardly distinguish from real ones. Such forged face videos may pose enormous potential threats to society, for example when used to fabricate political fake news that incites political violence or interferes with normal elections. There is therefore an urgent need for detection methods that can actively expose forged face videos. Existing forgery methods tend to leave subtle spatial and temporal traces when producing fake face videos, such as distortions of texture and color or flickering of the face. Mainstream detection methods likewise adopt deep learning and fall into two categories: frame-based methods, which use a convolutional neural network (CNN) to find spatial forgery traces within individual frames, and clip-based methods, which additionally use a recurrent neural network (RNN) to capture temporal forgery traces across frames. These methods make decisions from global image information, yet forgery traces usually reside in local regions around the facial features. This paper therefore proposes a unified forged-face-video detection framework that exploits both global temporal features and local spatial features. The framework consists of an image feature extraction module, a global temporal feature classification module, and a local spatial feature classification module. Experimental results on the FaceForensics++ data set show that the proposed method detects forgeries better than previous methods.
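The two mainstream designs described above (a per-frame CNN, optionally followed by an RNN over frames) can be combined in one schematic PyTorch model; all layer sizes here are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FrameSequenceDetector(nn.Module):
    """Schematic forgery detector in the spirit of the surveyed methods:
    a per-frame CNN captures spatial artifacts, an LSTM aggregates
    temporal ones. Layer sizes are illustrative placeholders."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # real vs. fake

    def forward(self, clip):               # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(B, T, -1)
        _, (h, _) = self.rnn(feats)        # last hidden state summarizes clip
        return self.head(h[-1])
```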
