Similar literature
Found 20 similar documents (search time: 31 ms)
1.
While there are various commercial-strength editing tools available today for still images, object-based manipulation of real-world video footage is still a challenging problem. In this system paper, we present a framework for interactive video editing. Our focus is on footage from a single, conventional video camera. By relying on spatio-temporal editing techniques operating on the video cube, we do not need to recover 3D scene geometry. Our framework is capable of removing and inserting objects, object motion editing, non-rigid object deformations, keyframe interpolation, as well as emulating camera motion. We demonstrate how movie shots with moderate complexity can be persuasively modified during post-processing.

2.
《Graphical Models》2007,69(1):57-70
This paper proposes a new framework for video editing in the gradient domain. The spatio-temporal gradient fields of target videos are modified and/or mixed to generate a new gradient field which is usually not integrable. We compare two methods to solve this “mixed gradient problem”, i.e., the variational method and loopy belief propagation. We propose a 3D video integration algorithm, which uses the variational method to find the potential function whose gradient field is closest to the mixed gradient field in the sense of least squares. The video is reconstructed by solving a 3D Poisson equation. The main contributions of our framework lie in three aspects: first, we derive a straightforward extension of current 2D gradient techniques to 3D space, thus resulting in a novel video editing framework, which is very different from all current video editing software; secondly, we propose using a fast and accurate 3D discrete Poisson solver which uses diagonal multigrids to solve the 3D Poisson equation, and which is up to twice as fast as a simple conventional multigrid algorithm; finally, we introduce a set of new applications, such as face replacement and painting, high dynamic range video compression and graph-cut-based video compositing. A set of gradient operators is also provided to the user for editing purposes. We evaluate our algorithm using a variety of examples for image/video or video/video pairs. The resulting video can be seamlessly reconstructed.
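As a rough illustration of the least-squares integration step described in this abstract, the sketch below recovers a small 3D volume from per-axis gradient fields. It is not the authors' diagonal multigrid solver: it assumes forward differences, a dense matrix that is only practical for tiny volumes, and a single pinned voxel to fix the free constant. The function names `difference_operator` and `integrate_gradients` are hypothetical.

```python
import numpy as np


def difference_operator(shape, axis):
    """Dense forward-difference matrix along one axis of a small volume.

    Rows follow the C-order ravel of np.diff(volume, axis=axis).
    """
    n = int(np.prod(shape))
    idx = np.arange(n).reshape(shape)
    src = np.take(idx, np.arange(shape[axis] - 1), axis=axis).ravel()
    dst = np.take(idx, np.arange(1, shape[axis]), axis=axis).ravel()
    D = np.zeros((src.size, n))
    rows = np.arange(src.size)
    D[rows, src] = -1.0
    D[rows, dst] = 1.0
    return D


def integrate_gradients(grads, shape, anchor_value=0.0):
    """Find the volume whose gradients are closest (least squares) to `grads`.

    `grads` holds one array per axis, shaped like np.diff(volume, axis=a).
    The target field may be non-integrable, e.g. mixed from two videos.
    """
    A = np.vstack([difference_operator(shape, a) for a in range(len(shape))])
    b = np.concatenate([np.asarray(g, dtype=float).ravel() for g in grads])
    # Assumption: pin voxel 0, since gradients define the volume only up to a constant.
    pin = np.zeros((1, A.shape[1]))
    pin[0, 0] = 1.0
    A = np.vstack([A, pin])
    b = np.append(b, anchor_value)
    solution, *_ = np.linalg.lstsq(A, b, rcond=None)
    return solution.reshape(shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((3, 4, 4))  # (time, height, width)
    grads = [np.diff(video, axis=a) for a in range(3)]
    restored = integrate_gradients(grads, video.shape, anchor_value=video[0, 0, 0])
    print(np.allclose(restored, video))
```

The normal equations of this least-squares system form a discrete Poisson equation, which is why a dedicated Poisson solver can replace the dense solve at realistic video sizes.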

3.
On Space-Time Interest Points (total citations: 16; self-citations: 0; citations by others: 16)

4.
We propose ViComp, an automatic audio-visual camera selection framework for composing uninterrupted recordings from multiple user-generated videos (UGVs) of the same event. We design an automatic audio-based cut-point selection method to segment the UGV. ViComp combines segments of UGVs using a rank-based camera selection strategy by considering audio-visual quality and camera selection history. We analyze the audio to maintain audio continuity. To filter video segments which contain visual degradations, we perform spatial and spatio-temporal quality assessment. We validate the proposed framework with subjective tests and compare it with state-of-the-art methods.

5.
This paper studies evolutionary programming and adopts reinforcement learning theory to learn individual mutation operators. A novel algorithm named RLEP (Evolutionary Programming based on Reinforcement Learning) is proposed. In this algorithm, each individual learns its optimal mutation operator based on the immediate and delayed performance of mutation operators. Mutation operator selection is mapped into a reinforcement learning problem. Reinforcement learning methods are used to learn optimal policies by maximizing the accumulated rewards. According to the calculated Q function value of each candidate mutation operator, an optimal mutation operator can be selected to maximize the learned Q function value. Four different mutation operators have been employed as the basic candidate operators in RLEP and one is selected for each individual in different generations. Our simulation shows the performance of RLEP is the same as or better than the best of the four basic mutation operators.
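The idea of choosing a mutation operator by its learned Q value can be sketched as follows. This is only a minimal illustration, not RLEP itself: it assumes a single-state Q-learning update, epsilon-greedy selection, one individual, a toy objective f(x) = x^2, and two hypothetical Gaussian mutation operators instead of the four operators used in the paper (which the abstract does not name).

```python
import random


def select_operator(q_values, epsilon, rng):
    """Epsilon-greedy choice: explore with probability epsilon, else take the best Q value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def update_q(q_values, action, reward, alpha=0.1, gamma=0.9):
    """Single-state Q-learning update; gamma weights the delayed reward."""
    target = reward + gamma * max(q_values)
    q_values[action] += alpha * (target - q_values[action])


def evolve(generations=200, seed=0):
    """Minimize f(x) = x^2 with one individual that learns which operator to use."""
    rng = random.Random(seed)
    # Hypothetical candidate operators: a small and a large Gaussian step.
    operators = [
        lambda x: x + rng.gauss(0.0, 0.1),
        lambda x: x + rng.gauss(0.0, 1.0),
    ]
    x = 5.0
    q_values = [0.0] * len(operators)
    for _ in range(generations):
        action = select_operator(q_values, epsilon=0.1, rng=rng)
        child = operators[action](x)
        reward = x * x - child * child  # immediate fitness improvement
        update_q(q_values, action, reward)
        if child * child < x * x:  # keep the child only if it is better
            x = child
    return x, q_values


if __name__ == "__main__":
    best, q_values = evolve()
    print(best, q_values)
```

Because the offspring replaces the parent only when it improves the objective, the final value is never worse than the starting point, whichever operator the Q values end up favouring.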

6.
In this paper, we propose a unified framework for anomaly detection and localization in crowded scenes. For each video frame, we extract the spatio-temporal sparse features of 3D blocks and generate the saliency map using a block-based center-surround difference operator. Two sparse coding strategies, off-line long-term sparse representation and on-line short-term sparse representation, are integrated within our framework. The abnormality of each candidate is measured using bottom-up saliency and top-down fixation inference and is further used by a binary classifier to classify the frames into normal and anomalous ones. Local abnormal events are localized and segmented based on the saliency map. In the experiments, we compare our method against several state-of-the-art approaches on the UCSD data set, which is a widely used anomaly detection and localization benchmark. Our method produces competitive results at near real-time processing speed compared to the state of the art.

7.
This paper presents an approach for detecting suspicious events in videos by using only the video itself as the training samples for valid behaviors. These salient events are obtained in real-time by detecting anomalous spatio-temporal regions in a densely sampled video. The method codes a video as a compact set of spatio-temporal volumes, while considering the uncertainty in the codebook construction. The spatio-temporal compositions of video volumes are modeled using a probabilistic framework, which calculates their likelihood of being normal in the video. This approach can be considered as an extension of the Bag of Video words (BOV) approaches, which represent a video as an order-less distribution of video volumes. The proposed method imposes spatial and temporal constraints on the video volumes so that an inference mechanism can estimate the probability density functions of their arrangements. Anomalous events are assumed to be video arrangements with very low frequency of occurrence. The algorithm is very fast and does not employ background subtraction, motion estimation or tracking. It is also robust to spatial and temporal scale changes, as well as some deformations. Experiments were performed on four video datasets of abnormal activities in both crowded and non-crowded scenes and under difficult illumination conditions. The proposed method outperformed all other approaches based on BOV that do not account for contextual information.

8.
A relational ranking query uses a scoring function to limit the results of a conventional query to a small number of the most relevant answers. The increasing popularity of this query paradigm has led to the introduction of specialized rank join operators that integrate the selection of top tuples with join processing. These operators access just “enough” of the input in order to generate just “enough” output and can offer significant speed-ups for query evaluation. The number of input tuples that an operator accesses is called the input depth of the operator, and this is the driving cost factor in rank join processing. This introduces the important problem of depth estimation, which is crucial for the costing of rank join operators during query compilation and thus for their integration in optimized physical plans. We introduce an estimation methodology, termed deep, for approximating the input depths of rank join operators in a physical execution plan. At the core of deep lies a general, principled framework that formalizes depth computation in terms of the joint distribution of scores in the base tables. This framework results in a systematic estimation methodology that takes the characteristics of the data directly into account and thus enables more accurate estimates. We develop novel estimation algorithms that provide an efficient realization of the formal deep framework, and describe their integration on top of the statistics module of an existing query optimizer. We validate the performance of deep with an extensive experimental study on data sets of varying characteristics. The results verify the effectiveness of deep as an estimation method and demonstrate its advantages over previously proposed techniques.

9.
We fit k-spheres optimally to n-D point data, in a geometrically total least squares sense. A specific practical instance is the optimal fitting of 2D-circles to a 3D point set. Among the optimal fitting methods for 2D-circles based on 2D (!) point data compared in Al-Sharadqah and Chernov (Electron. J. Stat. 3:886–911, 2009), there is one with an algebraic form that permits its extension to optimally fitting k-spheres in n-D. We embed this ‘Pratt 2D circle fit’ into the framework of conformal geometric algebra (CGA), and doing so naturally enables the generalization. The procedure involves a representation of the points in n-D as vectors in an (n+2)-D space with attractive metric properties. The hypersphere fit then becomes an eigenproblem of a specific symmetric linear operator determined by the data. The eigenvectors of this operator form an orthonormal basis representing perpendicular hyperspheres. The intersections of these hyperspheres are the optimal k-spheres; in CGA the intersection is a straightforward outer product of vectors. The resulting optimal fitting procedure can easily be implemented using a standard linear algebra package; we show this for the 3D case of fitting spheres, circles and point pairs. The fits are optimal (in the sense of achieving the KCR lower bound on the variance). We use the framework to show how the hyperaccurate fit hypersphere of Al-Sharadqah and Chernov (Electron. J. Stat. 3:886–911, 2009) is a minor rescaling of the Pratt fit hypersphere.
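The "(n+2)-D space with attractive metric properties" can be made concrete with the standard conformal embedding, in which the inner product of two embedded points equals minus half their squared Euclidean distance. The sketch below shows only this embedding and the matching hypersphere vector, not the Pratt fit or its eigenproblem. The coordinate layout (Euclidean part, then the origin coefficient, then the infinity coefficient) and the function names are assumptions made for this illustration.

```python
import numpy as np


def conformal_embed(x):
    """Map a point x in n-D to the (n+2)-D vector (x, 1, |x|^2 / 2)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, [1.0, 0.5 * x.dot(x)]])


def conformal_inner(p, q):
    """Inner product of the conformal model: Euclidean part minus the two null cross terms."""
    n = len(p) - 2
    return p[:n].dot(q[:n]) - p[n] * q[n + 1] - p[n + 1] * q[n]


def sphere_vector(center, radius):
    """Vector of a hypersphere: the embedded center with radius^2 / 2 removed from the last slot."""
    s = conformal_embed(center)
    s[-1] -= 0.5 * radius ** 2
    return s


if __name__ == "__main__":
    a = conformal_embed([1.0, 2.0, 3.0])
    b = conformal_embed([4.0, 6.0, 3.0])
    print(conformal_inner(a, b))  # minus half of the squared distance 25
    sphere = sphere_vector([0.0, 0.0, 0.0], 2.0)
    on_sphere = conformal_embed([2.0, 0.0, 0.0])
    print(conformal_inner(sphere, on_sphere))  # zero for a point on the sphere
```

In this representation the inner product of a point with a sphere vector is -(d^2 - r^2)/2, where d is the distance to the center, so it measures how far the point lies from the sphere in an algebraic sense. That linearity in the sphere vector is what allows a fit to be phrased as a linear-algebra problem.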

10.
We present a novel representation and rendering method for free‐viewpoint video of human characters based on multiple input video streams. The basic idea is to approximate the articulated 3D shape of the human body using a subdivision into textured billboards along the skeleton structure. Billboards are clustered to fans such that each skeleton bone contains one billboard per source camera. We call this representation articulated billboards. In the paper we describe a semi‐automatic, data‐driven algorithm to construct and render this representation, which robustly handles even challenging acquisition scenarios characterized by sparse camera positioning, inaccurate camera calibration, low video resolution, or occlusions in the scene. First, for each input view, a 2D pose estimation based on image silhouettes, motion capture data, and temporal video coherence is used to create a segmentation mask for each body part. Then, from the 2D poses and the segmentation, the actual articulated billboard model is constructed by a 3D joint optimization and compensation for camera calibration errors. The rendering method includes a novel way of blending the textural contributions of each billboard and features an adaptive seam correction to eliminate visible discontinuities between adjacent billboard textures. Our articulated billboards not only minimize ghosting artifacts known from conventional billboard rendering, but also alleviate restrictions to the setup and sensitivities to errors of more complex 3D representations and multiview reconstruction techniques. Our results demonstrate the flexibility and the robustness of our approach with high quality free‐viewpoint video generated from broadcast footage of challenging, uncontrolled environments.

11.
In this paper, we propose a framework for human action analysis from video footage. A video action sequence in our perspective is a dynamic structure of sparse local spatial–temporal patches termed action elements, so the problems of action analysis in video are addressed here based on the set of local characteristics as well as the global shape of a prescribed action. We first detect a set of action elements that are the most compact entities of an action, then we extend the idea of the Implicit Shape Model to space time, in order to properly integrate the spatial and temporal properties of these action elements. In particular, we consider two different recipes to construct action elements. The first uses a Sparse Bayesian Feature Classifier to choose action elements from all detected Spatial Temporal Interest Points; these are termed discriminative action elements. The second detects affine invariant local features from the holistic Motion History Images and picks action elements according to their compactness scores; these are called generative action elements. Action elements detected in either way are then used to construct a voting space based on their local feature representations as well as their global configuration constraints. Our approach is evaluated in the two main contexts of current human action analysis challenges, action retrieval and action classification. Comprehensive experimental results show that our proposed framework marginally outperforms all existing state-of-the-art techniques on a range of different datasets.

12.

The most successful approaches to video understanding and video matching use local spatio-temporal features as a sparse representation for video content. In the last decade, great interest in the evaluation of local visual features in the domain of images has been observed. The aim is to provide researchers with guidance when selecting the best approaches for new applications and data-sets. We present FeEval, a framework for the evaluation of spatio-temporal features. For the first time, this framework allows for a systematic measurement of the stability and the invariance of local features in videos. FeEval consists of 30 original videos from a great variety of different sources, including HDTV shows, 1080p HD movies and surveillance cameras. The videos are iteratively varied by well-defined challenges, leading to a total of 1710 video clips. We measure coverage, repeatability and matching performance under these challenges. Similar to prior work on 2D images, this leads to a new robustness and matching measurement. Supporting the choices of recent state-of-the-art benchmarks, this allows for an in-depth analysis of spatio-temporal features in comparison to recent benchmark results.

13.
A spatial query language enables the spatial analysis of building information models and the extraction of partial models that fulfill certain spatial constraints. Among other features, the developed spatial query language includes directional operators, i.e., operators that reflect the directional relationships between 3D spatial objects, such as northOf, southOf, eastOf, westOf, above and below. The paper presents in-depth definitions of the semantics of two new directional models for extended 3D objects, the projection-based and the halfspace-based model, by using point-set theory notation. It further describes the possible implementation of directional operators using a newly developed space-partitioning data structure called slot-tree, which is derived from the objects’ octree representation. The slot-tree allows for the application of recursive algorithms that successively increase the discrete resolution of the spatial objects employed and thereby enables the user to trade off computational effort against the required accuracy. The article also presents detailed investigations of the runtime performance of the developed algorithms.

14.
Hu Zheng-ping, Zhang Rui-xue, Qiu Yue, Zhao Meng-yao, Sun Zhe. 《Multimedia Tools and Applications》2021,80(24):33179-33192
C3D has been widely used for video representation and understanding. However, it operates on spatio-temporal contexts in a global view, which often weakens its capacity for learning local representations. To alleviate this problem, a concise and novel multi-layer feature fusion network with the cooperation of local and global views is introduced. In this network, the global view branch is used to learn the core video semantics, while the local view branch is used to capture the contextual local semantics. Unlike the traditional C3D model, the global view branch can only provide the big view branch with the most activated video features from a broader 3D receptive field. Via adding such shallow-view contexts, the local view branch can learn more robust and discriminative spatio-temporal representations for video classification. Thus we propose 3D convolutional networks with multi-layer-pooling selection fusion for video classification. The integrated deep global feature is combined with the information originating from the shallow layers of the local feature extraction networks. Through three different pooling units (space-time pyramid pooling, adaptive pooling and attention pooling), different time–space feature information is obtained, which is finally cascaded and used for classification. Experiments on the UCF-101 and HMDB-51 datasets achieve correct classification rates of 95.0% and 72.2%, respectively. The results show that the proposed 3D convolutional networks with multi-layer-pooling selection fusion have better classification performance.

15.
Scale-invariant interest points have found several highly successful applications in computer vision, in particular for image-based matching and recognition. This paper presents a theoretical analysis of the scale selection properties of a generalized framework for detecting interest points from scale-space features presented in Lindeberg (Int. J. Comput. Vis. 2010, under revision) and comprising:
  • an enriched set of differential interest operators at a fixed scale including the Laplacian operator, the determinant of the Hessian, the new Hessian feature strength measures I and II and the rescaled level curve curvature operator, as well as
  • an enriched set of scale selection mechanisms including scale selection based on local extrema over scale, complementary post-smoothing after the computation of non-linear differential invariants and scale selection based on weighted averaging of scale values along feature trajectories over scale.
It is shown how the selected scales of different linear and non-linear interest point detectors can be analyzed for Gaussian blob models. Specifically it is shown that for a rotationally symmetric Gaussian blob model, the scale estimates obtained by weighted scale selection will be similar to the scale estimates obtained from local extrema over scale of scale normalized derivatives for each one of the pure second-order operators. In this respect, no scale compensation is needed between the two types of scale selection approaches. When using post-smoothing, the scale estimates may, however, be different between different types of interest point operators, and it is shown how relative calibration factors can be derived to enable comparable scale estimates for each purely second-order operator and for different amounts of self-similar post-smoothing. A theoretical analysis of the sensitivity to affine image deformations is presented, and it is shown that the scale estimates obtained from the determinant of the Hessian operator are affine covariant for an anisotropic Gaussian blob model. Among the other purely second-order operators, the Hessian feature strength measure I has the lowest sensitivity to non-uniform scaling transformations, followed by the Laplacian operator and the Hessian feature strength measure II. The predictions from this theoretical analysis agree with experimental results of the repeatability properties of the different interest point detectors under affine and perspective transformations of real image data. A number of less complete results are derived for the level curve curvature operator.
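Scale selection from local extrema over scale can be illustrated for the simplest case mentioned above, the Laplacian operator applied to a rotationally symmetric 2D Gaussian blob. The sketch assumes that the scale parameter t denotes variance and that the blob is a unit-mass Gaussian of variance t0. Smoothing it to scale t then gives a Gaussian of variance t0 + t, and the scale-normalized Laplacian at the blob centre has the closed form -t / (pi (t0 + t)^2). The function names are hypothetical, and the other operators and the weighted-averaging mechanism are not covered.

```python
import numpy as np


def normalized_laplacian_at_center(t, t0):
    """Scale-normalized Laplacian t * (L_xx + L_yy) at the centre of a Gaussian blob.

    The blob has variance t0 and is observed at scale t, so its total variance is t0 + t.
    """
    s = t0 + t
    return -t / (np.pi * s ** 2)


def select_scale(t0, scales):
    """Pick the scale at which the magnitude of the normalized response is largest."""
    responses = np.abs(normalized_laplacian_at_center(scales, t0))
    return scales[np.argmax(responses)]


if __name__ == "__main__":
    scales = np.linspace(0.1, 50.0, 5000)
    for blob_variance in (1.0, 4.0, 9.0):
        print(blob_variance, select_scale(blob_variance, scales))
```

Setting the derivative of t / (t0 + t)^2 with respect to t to zero gives t = t0, so under these assumptions the selected scale matches the variance of the blob, up to the resolution of the sampled scale grid.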

16.

17.
In this paper we present a scalable 3D video framework for capturing and rendering dynamic scenes. The acquisition system is based on multiple sparsely placed 3D video bricks, each comprising a projector, two grayscale cameras, and a color camera. Relying on structured light with complementary patterns, texture images and pattern-augmented views of the scene are acquired simultaneously by time-multiplexed projections and synchronized camera exposures. Using space–time stereo on the acquired pattern images, high-quality depth maps are extracted, whose corresponding surface samples are merged into a view-independent, point-based 3D data structure. This representation allows for effective photo-consistency enforcement and outlier removal, leading to a significant decrease of visual artifacts and a high resulting rendering quality using EWA volume splatting. Our framework and its view-independent representation allow for simple and straightforward editing of 3D video. In order to demonstrate its flexibility, we show compositing techniques and spatiotemporal effects.

18.
In this paper, we introduce a new concept of (A, η)-accretive mappings, which generalizes the existing monotone or accretive operators. We study some properties of (A, η)-accretive mappings and define resolvent operators associated with (A, η)-accretive mappings. By using the new resolvent operator technique, we also construct a new perturbed iterative algorithm with mixed errors for a class of nonlinear relaxed cocoercive variational inclusions involving (A, η)-accretive mappings and study applications of (A, η)-accretive mappings to the approximation-solvability of this class of nonlinear relaxed cocoercive variational inclusions in q-uniformly smooth Banach spaces. Our results improve and generalize the corresponding results of recent works.

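19.
A new family of left-continuous triangular norms, the Tq,p-LGN family, and its adjoint family of implication operators Rq,p-LGN are presented; the latter includes the Lukasiewicz implication operator, the Gödel implication operator and the R0 implication operator. The idea of fuzzy reasoning based on a family of implication operators is proposed, and a triple I sustaining algorithm for the FMP model based on the implication operator family Rq,p-LGN is given.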

20.
Interactive selection of desired textures and textured objects from a video is a challenging problem in video editing. In this paper, we present a scalable framework that accurately selects textured objects with only moderate user interaction. Our method applies the active learning methodology, and the user only needs to label minimal initial training data and subsequent query data. An active learning algorithm uses these labeled data to obtain an initial classifier and iteratively improves it until its performance becomes satisfactory. A revised graph-cut algorithm based on the trained classifier has also been developed to improve the spatial coherence of selected texture regions. We show that our system is responsive even with videos of a large number of frames, and it frees the user from extensive labeling work. A variety of operations, such as color editing, compositing, and texture cloning, can then be applied to the selected textures to achieve interesting editing effects.
