Similar Documents
20 similar documents found (search time: 46 ms)
1.
We present a deep learning based technique that enables novel‐view videos of human performances to be synthesized from sparse multi‐view captures. While performance capture from a sparse set of videos has received significant attention, relatively little progress has been made on non‐rigid objects (e.g., human bodies). The rich articulation modes of the human body make it rather challenging to synthesize and interpolate the model well. To address this problem, we propose a novel deep learning based framework that directly predicts novel‐view videos of human performances without explicit 3D reconstruction. Our method is a composition of two steps: novel‐view prediction and detail enhancement. We first learn a novel deep generative query network for view prediction, synthesizing novel‐view performances from a sparse set of just five or fewer camera videos. Then, we use a new generative adversarial network to enhance fine‐scale details of the first‐step results. This opens up the possibility of high‐quality, low‐cost video‐based performance synthesis, which is gaining popularity for VR and AR applications. We demonstrate a variety of promising results, where our method is able to synthesize more robust and accurate performances than existing state‐of‐the‐art approaches when only sparse views are available.

2.
A practical way to generate a high dynamic range (HDR) video using off‐the‐shelf cameras is to capture a sequence with alternating exposures and reconstruct the missing content at each frame. Unfortunately, existing approaches are typically slow and are not able to handle challenging cases. In this paper, we propose a learning‐based approach to address this difficult problem. To do this, we use two sequential convolutional neural networks (CNNs) to model the entire HDR video reconstruction process. In the first step, we align the neighboring frames to the current frame by estimating the flows between them using a network, which is specifically designed for this application. We then combine the aligned and current images using another CNN to produce the final HDR frame. We perform an end‐to‐end training by minimizing the error between the reconstructed and ground truth HDR images on a set of training scenes. We produce our training data synthetically from existing HDR video datasets and simulate the imperfections of standard digital cameras using a simple approach. Experimental results demonstrate that our approach produces high‐quality HDR videos and is an order of magnitude faster than the state‐of‐the‐art techniques for sequences with two and three alternating exposures.
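As a rough illustration of the merge step above (the paper uses a CNN; the classical weighted radiance fusion below is an illustrative stand-in, and the hat-shaped weighting and `merge_hdr` name are assumptions, not the paper's method):

```python
import numpy as np

def merge_hdr(frames, exposures, eps=1e-6):
    """Merge aligned LDR frames (floats in [0, 1]) into one HDR radiance map."""
    acc = np.zeros_like(frames[0], dtype=np.float64)
    wsum = np.zeros_like(acc)
    for img, t in zip(frames, exposures):
        # hat weighting: trust mid-tones, downweight under/over-exposed pixels
        w = 1.0 - np.abs(2.0 * img - 1.0)
        acc += w * img / t  # linear radiance estimate = pixel value / exposure time
        wsum += w
    return acc / np.maximum(wsum, eps)
```

For a static pixel, frames captured at exposures 1 and 2 of radiance 0.2 read 0.2 and 0.4, and the merge recovers 0.2.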

3.
To address the lack of paired data samples and the inter-frame inconsistency in virtual-to-real driving scene translation, we propose a video translation model based on generative adversarial networks. To address the lack of paired samples, the model adopts a "dual-network" architecture that uses semantic segmentation scenes as an intermediate representation for building separate front-end and back-end networks. The front-end network uses a convolution/deconvolution framework together with an optical flow network that extracts dynamic information between adjacent frames, achieving temporally continuous video translation from virtual scenes to semantic segmentation scenes. The back-end network uses a conditional generative adversarial network framework with a generator, an image discriminator, and a video discriminator, again combined with the optical flow network, to achieve continuous video translation from semantic segmentation scenes to real scenes. Experiments trained and tested on data collected from an autonomous driving simulator and on public datasets achieve virtual-to-real translation across a variety of driving scenarios, with translation quality clearly better than the compared algorithms. The results show that the proposed model effectively resolves inter-frame discontinuity and dynamic-object blur, produces smoother translated videos, and adapts to a variety of complex driving scenarios.
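The inter-frame consistency idea can be sketched as a temporal loss that compares the current output frame against the previous one warped by optical flow (a minimal sketch; the nearest-neighbor warp and both function names are assumptions, not this paper's implementation):

```python
import numpy as np

def warp_nearest(img, flow):
    """Warp img by a per-pixel flow (dy, dx), nearest-neighbor sampling."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, w - 1)
    return img[src_y, src_x]

def temporal_loss(cur, prev, flow):
    """Mean absolute difference between cur and flow-warped prev."""
    return float(np.mean(np.abs(cur - warp_nearest(prev, flow))))
```

A perfectly consistent pair (identical frames, zero flow) yields zero loss; any residual drift between consecutive outputs is penalized.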

4.
Figure skating is one of the most popular ice sports at the Winter Olympic Games. The skaters perform several skating skills to express the beauty of the art on ice. Skating involves moving on ice while wearing skate shoes with thin blades; thus, it requires much practice to skate without losing balance. Moreover, figure skating presents dynamic moves, such as jumping, artistically. Therefore, demonstrating figure skating skills is even more difficult than basic skating, and professional skaters often fall during Winter Olympic performances. We propose a system to demonstrate figure skating motions with a physically simulated human‐like character. We simulate skating motions with non‐holonomic constraints, which make the skate blade glide on the ice surface. It is difficult to obtain reference motions from figure skaters because figure skating motions are very fast and dynamic. Instead of using motion capture data, we use key poses extracted from videos on YouTube and complete reference motions using trajectory optimization. We demonstrate figure skating skills such as the crossover, the three‐turn, and even jumps. Finally, we use deep reinforcement learning to generate a robust controller for figure skating skills.
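The non-holonomic blade constraint amounts to allowing velocity only along the blade's heading, with no sideways slip. A minimal 2D sketch (the function name and the planar simplification are assumptions):

```python
import numpy as np

def blade_constrained_velocity(v, heading):
    """Project a 2D velocity onto the skate-blade direction (no sideways slip).
    v: (2,) velocity; heading: blade angle in radians."""
    d = np.array([np.cos(heading), np.sin(heading)])
    return float(np.dot(v, d)) * d
```

For example, a diagonal velocity applied to a blade pointing along the x-axis keeps only its forward component; in a full simulator this projection would be enforced as a constraint force at each blade contact.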

5.
This paper presents a novel generative model to synthesize fluid simulations from a set of reduced parameters. A convolutional neural network is trained on a collection of discrete, parameterizable fluid simulation velocity fields. Due to the capability of deep learning architectures to learn representative features of the data, our generative model is able to accurately approximate the training data set, while providing plausible interpolated in‐betweens. The proposed generative model is optimized for fluids by a novel loss function that guarantees divergence‐free velocity fields at all times. In addition, we demonstrate that we can handle complex parameterizations in reduced spaces, and advance simulations in time by integrating in the latent space with a second network. Our method models a wide variety of fluid behaviors, thus enabling applications such as fast construction of simulations, interpolation of fluids with different parameters, time re‐sampling, latent space simulations, and compression of fluid simulation data. Reconstructed velocity fields are generated up to 700× faster than re‐simulating the data with the underlying CPU solver, while achieving compression rates of up to 1300×.
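A divergence-free 2D velocity field can also be guaranteed by construction, e.g. by deriving it from a stream function; the sketch below illustrates the property that the paper's loss enforces (a finite-difference toy, not the paper's loss function):

```python
import numpy as np

def velocity_from_stream(psi):
    """Divergence-free 2D velocity from a stream function psi[y, x]:
    u = d(psi)/dy, v = -d(psi)/dx, so div(u, v) = 0 by construction."""
    u = np.gradient(psi, axis=0)
    v = -np.gradient(psi, axis=1)
    return u, v

def divergence(u, v):
    """Finite-difference divergence du/dx + dv/dy."""
    return np.gradient(u, axis=1) + np.gradient(v, axis=0)
```

Because the mixed finite differences commute, the measured divergence of any field built this way vanishes up to floating-point error, regardless of the stream function.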

6.
Shadow removal for videos is an important and challenging vision task. In this paper, we present a novel shadow removal approach for videos captured by freely moving cameras using illumination transfer optimization. We first detect the shadows of the input video using interactive fast video matting. Then, based on the shadow detection results, we decompose the input video into overlapping 2D patches, and find coherent correspondences between the shadow and non‐shadow patches via a discrete optimization technique built on a patch similarity metric. We finally remove the shadows of the input video sequences using an optimized illumination transfer method, which reasonably recovers the illumination information of the shadow regions and produces spatio‐temporal shadow‐free videos. We also process the shadow boundaries to make the transition between shadow and non‐shadow regions smooth. Compared with previous works, our method can handle videos captured by freely moving cameras and achieve better shadow removal results. We validate the effectiveness of the proposed algorithm via a variety of experiments.
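The illumination-transfer idea can be illustrated by matching the first- and second-order statistics of a shadow patch to those of a corresponding lit patch (a much-simplified sketch of the optimization described above; the function name and mean/std matching are assumptions):

```python
import numpy as np

def transfer_illumination(img, shadow_mask, lit_mask, eps=1e-6):
    """Relight shadow pixels by matching the mean/std of a corresponding lit region."""
    out = img.astype(np.float64).copy()
    s, l = out[shadow_mask], out[lit_mask]
    # standardize shadow intensities, then rescale to the lit region's statistics
    out[shadow_mask] = (s - s.mean()) / (s.std() + eps) * l.std() + l.mean()
    return out
```

If the shadow region is the lit region's texture darkened by a constant factor, the transfer restores the lit intensities; the paper's patch correspondences decide which lit region each shadow patch should borrow statistics from.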

7.
Despite considerable advances in natural image matting over the last decades, video matting still remains a difficult problem. The main challenges faced by existing methods are the large amount of user input required, and temporal inconsistencies in mattes between pairs of adjacent frames. We present a temporally‐coherent matte‐propagation method for videos based on PatchMatch and edge‐aware filtering. Given an input video and trimaps for a few frames, including the first and last, our approach generates alpha mattes for all frames of the video sequence. We also present a user scribble‐based interface for video matting that takes advantage of the efficiency of our method to interactively refine the matte results. We demonstrate the effectiveness of our approach by using it to generate temporally‐coherent mattes for several natural video sequences. We perform quantitative comparisons against the state‐of‐the‐art sparse‐input video matting techniques and show that our method produces significantly better results according to three different metrics. We also perform qualitative comparisons against the state‐of‐the‐art dense‐input video matting techniques and show that our approach produces similar quality results while requiring only about 7% of the amount of user input required by such techniques. These results show that our method is both effective and user‐friendly, outperforming state‐of‐the‐art solutions.

8.

Video anomaly detection automatically recognizes abnormal events in surveillance videos. Existing works have made advances in recognizing whether a video contains abnormal events; however, they cannot temporally localize the abnormal events within videos. This paper presents a novel anomaly attention-based framework for accurately localizing abnormal events in time. Benefiting from the proposed framework, we can achieve frame-level VAD using only video-level labels, which significantly reduces the burden of data annotation. Our method is an end-to-end deep neural network-based approach that contains three modules: an anomaly attention module (AAM), a discriminative anomaly attention module (DAAM) and a generative anomaly attention module (GAAM). Specifically, AAM is trained to generate the anomaly attention, which is used to measure the abnormal degree of each frame, whereas DAAM and GAAM alternately augment AAM from two different aspects. On the one hand, DAAM enhances AAM by optimizing video-level classification. On the other hand, GAAM adopts a conditional variational autoencoder to model the likelihood of each frame given the attention, refining AAM. As a result, AAM generates higher anomaly scores for abnormal frames and lower anomaly scores for normal frames. Experimental results show that our proposed approach outperforms state-of-the-art methods, which validates the superiority of our AAVAD.

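The pooling from per-frame attention to a video-level score, which lets video-level labels supervise frame-level detection, can be sketched as follows (an illustrative simplification; in the paper the attention and scores come from the AAM/DAAM/GAAM neural modules):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def video_score(frame_scores, attention_logits):
    """Attention-weighted pooling of per-frame anomaly scores into one video-level score."""
    a = softmax(attention_logits)
    return float(np.dot(a, frame_scores))
```

With uniform attention this reduces to the mean frame score; as attention concentrates on abnormal frames, the video score approaches the abnormal frames' score, which is what makes video-level supervision informative at the frame level.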

9.
Video remains the method of choice for capturing temporal events. However, without access to the underlying 3D scene models, it remains difficult to make object-level edits in a single video or across multiple videos. While it may be possible to explicitly reconstruct the 3D geometries to facilitate these edits, such a workflow is cumbersome, expensive, and tedious. In this work, we present a much simpler workflow to create plausible editing and mixing of raw video footage using only sparse structure points (SSP) directly recovered from the raw sequences. First, we utilize user scribbles to structure the point representations obtained using structure‐from‐motion on the input videos. The resultant structure points, even when noisy and sparse, are then used to enable various video edits in 3D, including view perturbation, keyframe animation, object duplication and transfer across videos, etc. Specifically, we describe how to synthesize object images from new views adopting a novel image‐based rendering technique using the SSPs as a proxy for the missing 3D scene information. We propose a structure‐preserving image warping on multiple input frames adaptively selected from the object video, followed by a spatio‐temporally coherent image stitching to compose the final object image. Simple planar shadows and depth maps are synthesized for objects to generate plausible video sequences mimicking real‐world interactions. We demonstrate our system on a variety of input videos to produce complex edits, which are otherwise difficult to achieve.

10.
We present a real‐time multi‐view facial capture system facilitated by synthetic training imagery. Our method is able to achieve high‐quality markerless facial performance capture in real‐time from multi‐view helmet camera data, employing an actor-specific regressor. The regressor training is tailored to the specified actor's appearance, and we further condition it for the expected illumination conditions and the physical capture rig by generating the training data synthetically. In order to leverage the information present in live imagery, which is typically provided by multiple cameras, we propose a novel multi‐view regression algorithm that uses multi‐dimensional random ferns. We show that higher quality can be achieved by regressing on multiple video streams than previous approaches that were designed to operate on only a single view. Furthermore, we evaluate possible camera placements and propose a novel camera configuration that allows cameras to be mounted outside the field of view of the actor, which is very beneficial as the cameras are then less of a distraction for the actor and allow for an unobstructed line of sight to the director and other actors. Our new real‐time facial capture approach has immediate application in on‐set virtual production, in particular with the ever‐growing demand for motion‐captured facial animation in visual effects and video games.
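A random fern reduces regression to a table lookup indexed by a few binary feature tests, which is what makes fern-based regressors fast enough for real-time capture. A minimal single-fern sketch (the paper's multi-view, multi-dimensional variant is more involved; the class and its interface are illustrative assumptions):

```python
import numpy as np

class RandomFern:
    """One random fern: D binary pairwise-difference tests index 2**D bins,
    each bin storing the mean regression target of the training samples it receives."""
    def __init__(self, pairs, thresholds, n_outputs):
        self.pairs = pairs            # (D, 2) feature-index pairs, one per test
        self.thresholds = thresholds  # (D,) comparison thresholds
        self.bins = np.zeros((2 ** len(thresholds), n_outputs))
        self.counts = np.zeros(2 ** len(thresholds))

    def _index(self, x):
        bits = (x[self.pairs[:, 0]] - x[self.pairs[:, 1]]) > self.thresholds
        return int(np.dot(bits, 2 ** np.arange(len(bits))))

    def fit(self, X, Y):
        for x, y in zip(X, Y):
            i = self._index(x)
            self.bins[i] += y
            self.counts[i] += 1
        nz = self.counts > 0
        self.bins[nz] /= self.counts[nz, None]  # bin mean of the targets

    def predict(self, x):
        return self.bins[self._index(x)]
```

In practice many such ferns are averaged (and, in this paper, tests draw features from multiple camera views), but prediction remains a handful of comparisons plus table lookups.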

11.
We present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions, and (ii) adhere to the principles of cinematography. Our approach is (a) content agnostic, as the same methodology is employed to re‐edit a wide‐angle video recording or a close‐up movie sequence captured with a static or moving camera, and (b) independent of video length, and can in principle re‐edit an entire movie in one shot. Our algorithm consists of two steps. The first step employs gaze transition cues to detect time stamps where new cuts are to be introduced in the original video via dynamic programming. A subsequent step optimizes the cropping window path (to create pan and zoom effects), while accounting for the original and new cuts. The cropping window path is designed to include maximum gaze information, and is composed of piecewise constant, linear and parabolic segments. It is obtained via L1‐regularized convex optimization, which ensures a smooth viewing experience. We test our approach on a wide variety of videos and demonstrate significant improvement over the state‐of‐the‐art, both in terms of computational complexity and qualitative aspects. A study performed with 16 users confirms that our approach results in a superior viewing experience as compared to gaze driven re‐editing [ JSSH15 ] and letterboxing methods, especially for wide‐angle static camera recordings.
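The cropping-window path optimization can be illustrated with a quadratic surrogate: fit a smooth 1D path to per-frame gaze positions by penalizing first differences (the paper uses an L1-regularized objective yielding piecewise constant/linear/parabolic segments; this least-squares sketch is a simplification and its names are assumptions):

```python
import numpy as np

def smooth_path(gaze_x, lam=50.0):
    """Smooth a 1D crop-window path toward per-frame gaze positions.
    Minimizes ||x - g||^2 + lam * ||Dx||^2 with D the first-difference operator."""
    g = np.asarray(gaze_x, dtype=float)
    n = len(g)
    D = np.diff(np.eye(n), axis=0)       # (n-1, n) first-difference operator
    A = np.eye(n) + lam * D.T @ D        # normal equations of the objective
    return np.linalg.solve(A, g)
```

A constant gaze track passes through unchanged, while jittery gaze is pulled toward a smooth pan; swapping the quadratic penalty for an L1 penalty (as in the paper) would instead favor exactly piecewise-constant segments.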

12.
This paper describes a novel real‐time end‐to‐end system for facial expression transformation, without the need of any driving source. Its core idea is to directly generate desired and photo‐realistic facial expressions on top of input monocular RGB video. Specifically, an unpaired learning framework is developed to learn the mapping between any two facial expressions in the facial blendshape space. Then, it automatically transforms the source expression in an input video clip to a specified target expression through the combination of automated 3D face construction, the learned bi‐directional expression mapping and automated lip correction. It can be applied to new users without additional training. Its effectiveness is demonstrated through many experiments on faces from live and online video, with different identities, ages, speech and expressions.
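Working in blendshape space means an expression is a weighted sum of offsets from a neutral mesh, so mapping between expressions reduces to mapping low-dimensional weight vectors. A minimal sketch of the synthesis step (array shapes and the function name are assumptions):

```python
import numpy as np

def blend_expression(base, deltas, weights):
    """Face mesh = neutral base + weighted sum of blendshape offsets.
    base: (V, 3) neutral vertices; deltas: (K, V, 3) offsets; weights: (K,)."""
    return base + np.tensordot(weights, deltas, axes=1)
```

The learned expression mapping then only has to transform one weight vector into another; the mesh is re-synthesized from the transformed weights.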

13.
Traditional pencil drawing rendering algorithms, when applied to video, may suffer from temporal inconsistency and the shower‐door effect due to the stochastic noise models employed. This paper attempts to resolve these problems with deep learning. Recently, many research endeavors have demonstrated that feed‐forward Convolutional Neural Networks (CNNs) are capable of using a reference image to stylize a whole video sequence while removing the shower‐door effect in video style transfer applications. Compared with video style transfer, pencil drawing video is more sensitive to texture inconsistency and requires a stronger expression of pencil hatching. Thus, in this paper we develop an approach that combines a recent Line Integral Convolution (LIC) based method, specialized in realistically simulating pencil drawing images, with a new feed‐forward CNN that can eliminate the shower‐door effect successfully. Taking advantage of optical flow, we adopt a feature‐map‐level temporal loss function and propose a new framework to avoid temporal inconsistency between consecutive frames, enhancing the visual impression of pencil strokes and tone. Experimental comparisons with existing feed‐forward CNNs have demonstrated that our method can generate temporally more stable and visually more pleasant pencil drawing video results, and do so faster.

14.
We propose a learning-based approach for full-body pose reconstruction from extremely sparse upper body tracking data, obtained from a virtual reality (VR) device. We leverage a conditional variational autoencoder with gated recurrent units to synthesize plausible and temporally coherent motions from 4-point tracking (head, hands, and waist positions and orientations). To avoid synthesizing implausible poses, we propose a novel sample selection and interpolation strategy along with an anomaly detection algorithm. Specifically, we monitor the quality of our generated poses using the anomaly detection algorithm and smoothly transition to better samples when the quality falls below a statistically defined threshold. Moreover, we demonstrate that our sample selection and interpolation method can be used for other applications, such as target hitting and collision avoidance, where the generated motions should adhere to the constraints of the virtual environment. Our system is lightweight, operates in real-time, and is able to produce temporally coherent and realistic motions.
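The anomaly-gated sample selection can be sketched as: keep the generated pose unless its anomaly score exceeds a statistical threshold, then blend toward a known-good sample (a one-step simplification of the paper's smooth transition; the function name, threshold form, and blend factor are assumptions):

```python
import numpy as np

def gated_pose(current, fallback, anomaly_score, mean, std, k=2.0, alpha=0.3):
    """Return current pose, or blend toward fallback when anomaly_score > mean + k*std.
    mean/std describe the anomaly-score distribution of plausible poses."""
    if anomaly_score > mean + k * std:
        return (1.0 - alpha) * current + alpha * fallback
    return current
```

Applying this per frame with a small alpha produces the gradual transition to better samples rather than a visible pop.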

15.
Motion planning is one of the most significant technologies for autonomous driving. To make motion planning models able to learn from the environment and deal with emergency situations, a new motion planning framework called "parallel planning" is proposed in this paper. In order to generate sufficient and varied training samples, artificial traffic scenes are first constructed based on knowledge from reality. A deep planning model that combines a convolutional neural network (CNN) with a long short-term memory (LSTM) module is developed to make planning decisions in an end-to-end mode. This model can learn from both real and artificial traffic scenes and imitate the driving style of human drivers. Moreover, a parallel deep reinforcement learning approach is also presented to improve the robustness of the planning model and reduce the error rate. To handle emergency situations, a hybrid generative model including a variational auto-encoder (VAE) and a generative adversarial network (GAN) is utilized to learn from virtual emergencies generated in artificial traffic scenes. While an autonomous vehicle is moving, the hybrid generative model generates multiple video clips in parallel, which correspond to different potential emergency scenarios. Simultaneously, the deep planning model makes planning decisions for both virtual and current real scenes. The final planning decision is determined by analysis of real observations. Leveraging the parallel planning approach, the planner is able to make rational decisions without a heavy calculation burden when an emergency occurs.

16.
We present a user‐assisted video stabilization algorithm that is able to stabilize challenging videos when state‐of‐the‐art automatic algorithms fail to generate a satisfactory result. Current methods do not give the user any control over the look of the final result. Users either have to accept the stabilized result as is, or discard it should the stabilization fail to generate a smooth output. Our system introduces two new modes of interaction that allow the user to improve an unsatisfactory stabilized video. First, we cluster tracks and visualize them on the warped video. The user ensures that appropriate tracks are selected by clicking on track clusters to include or exclude them. Second, the user can directly specify how regions in the output video should look by drawing quadrilaterals to select and deform parts of the frame. These user‐provided deformations reduce undesirable distortions in the video. Our algorithm then computes a stabilized video using the user‐selected tracks, while respecting the user‐modified regions. The process of interactively removing user‐identified artifacts can sometimes introduce new ones, though in most cases there is a net improvement. We demonstrate the effectiveness of our system with a variety of challenging hand‐held videos.

17.
This paper proposes a deep learning‐based image tone enhancement approach that can maximally enhance the tone of an image while preserving the naturalness. Our approach does not require carefully generated ground‐truth images by human experts for training. Instead, we train a deep neural network to mimic the behavior of a previous classical filtering method that produces drastic but possibly unnatural‐looking tone enhancement results. To preserve the naturalness, we adopt the generative adversarial network (GAN) framework as a regularizer for the naturalness. To suppress artifacts caused by the generative nature of the GAN framework, we also propose an imbalanced cycle‐consistency loss. Experimental results show that our approach can effectively enhance the tone and contrast of an image while preserving the naturalness compared to previous state‐of‐the‐art approaches.
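An "imbalanced" cycle-consistency loss presumably weights the two cycle directions asymmetrically, unlike the standard symmetric formulation; a hedged sketch of that assumed form (the paper's exact formulation may differ, and all names here are illustrative):

```python
import numpy as np

def imbalanced_cycle_loss(x, x_cycle, y, y_cycle, w_forward=1.0, w_backward=0.1):
    """Cycle-consistency with asymmetric direction weights.
    x_cycle / y_cycle are the reconstructions after a full translate-and-back cycle."""
    return (w_forward * float(np.mean(np.abs(x - x_cycle)))
            + w_backward * float(np.mean(np.abs(y - y_cycle))))
```

Down-weighting one direction relaxes the constraint on that cycle, which is one plausible way to suppress GAN artifacts without forcing exact invertibility in both directions.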

18.
3D garment capture is an important component for various applications such as free‐viewpoint video, virtual avatars, online shopping, and virtual cloth fitting. Due to the complexity of the deformations, capturing 3D garment shapes requires controlled and specialized setups. A viable alternative is image‐based garment capture. Capturing 3D garment shapes from a single image, however, is a challenging problem, and the current solutions come with assumptions on the lighting, camera calibration, complexity of human or mannequin poses considered, and more importantly a stable physical state for the garment and the underlying human body. In addition, most of the works require manual interaction and exhibit high run‐times. We propose a new technique that overcomes these limitations, making garment shape estimation from an image a practical approach for dynamic garment capture. Starting from synthetic garment shape data generated through physically based simulations from various human bodies in complex poses obtained through Mocap sequences, and rendered under varying camera positions and lighting conditions, our novel method learns a mapping from rendered garment images to the underlying 3D garment model. This is achieved by training Convolutional Neural Networks (CNNs) to estimate 3D vertex displacements from a template mesh with a specialized loss function. We illustrate that this technique is able to recover the global shape of dynamic 3D garments from a single image under varying factors such as challenging human poses, self-occlusions, various camera poses and lighting conditions, at interactive rates. Improvement is shown if more than one view is integrated. Additionally, we show applications of our method to videos.

19.
20.
The transportation of prerecorded, compressed video data without loss of picture quality requires the network and video servers to support large fluctuations in bandwidth requirements. Fully utilizing a client-side buffer for smoothing bandwidth requirements can limit the fluctuations in bandwidth required from the underlying network and the video-on-demand servers. This paper shows that, for a fixed-size buffer constraint, the critical bandwidth allocation technique results in plans for continuous playback of stored video that have (1) the minimum number of bandwidth increases, (2) the smallest peak bandwidth requirements, and (3) the largest minimum bandwidth requirements. In addition, this paper introduces an optimal bandwidth allocation algorithm which, in addition to the three critical bandwidth allocation properties, minimizes the total number of bandwidth changes necessary for continuous playback. A comparison between the optimal bandwidth allocation algorithm and other critical bandwidth-based algorithms using 17 full-length movie videos and 3 seminar videos is also presented.
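For a constant-rate plan, the playback constraints reduce to keeping the cumulative amount sent between the consumption curve (no underflow) and the consumption curve plus the buffer size (no overflow). A small sketch with discrete unit-duration frames (a simplification of the critical-bandwidth framework; both function names and the timing convention are assumptions):

```python
import numpy as np

def min_peak_rate(frame_sizes):
    """Smallest constant bandwidth avoiding buffer underflow for stored video:
    the maximum average consumption rate over any prefix of the video."""
    cum = np.cumsum(frame_sizes)
    t = np.arange(1, len(frame_sizes) + 1)
    return float(np.max(cum / t))

def feasible(rate, frame_sizes, buffer_size):
    """Check a constant-rate plan: no underflow and no client-buffer overflow."""
    cum = np.cumsum(frame_sizes)
    sent = np.minimum(rate * np.arange(1, len(frame_sizes) + 1), cum[-1])
    return bool(np.all(sent >= cum) and np.all(sent - cum <= buffer_size))
```

The critical bandwidth allocation generalizes this to a piecewise-constant plan whose segments touch these two bounding curves, which is how it achieves the minimum number of rate increases and the smallest peak rate.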
