Similar Documents
Found 20 similar documents (search time: 742 ms).
1.
Shot Partitioning Based Recognition of TV Commercials (total citations: 1; self-citations: 0; citations by others: 1)
Digital video applications exploit the intrinsic structure of video sequences. In order to obtain and represent this structure for video annotation and indexing tasks, the main initial step is automatic shot partitioning. This paper analyzes the problem of automatic TV commercial recognition and introduces a new algorithm for scene break detection. The structure of each commercial is represented by the set of its key-frames, which are automatically extracted from the video stream. The particular characteristics of commercials cause commonly used shot boundary detection techniques to perform worse than in other video content domains. These techniques are based on individual image features or visual cues, which show significant performance shortfalls when applied to complex video content domains like commercials. We present a new scene break detection algorithm based on the combined analysis of edge and color features. Local motion estimation is applied to each edge in a frame, and the continuity of the color around it is then checked in the following frame. By separately considering both sides of each edge, we rely on the continuous presence of the objects and/or the background of the scene during each shot. Experimental results show that this approach outperforms single-feature algorithms in terms of precision and recall.
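As a rough illustration of the edge/color idea described above (an editorial sketch, not the authors' implementation), the following Python code scores the dissimilarity between consecutive frames by checking whether colors on both sides of strong edges persist into the next frame. The gradient-based edge detector, window size, and thresholds are all assumptions.

```python
import numpy as np

def edge_color_break_score(frame_a, frame_b, grad_thresh=40.0, win=2):
    """Score dissimilarity between consecutive frames by checking color
    continuity on both sides of strong edges (illustrative sketch only).
    Frames are HxWx3 float arrays in [0, 255]."""
    gray = frame_a.mean(axis=2)
    # Simple finite-difference gradients as a stand-in for a real edge detector.
    gy, gx = np.gradient(gray)
    edges = np.hypot(gx, gy) > grad_thresh
    ys, xs = np.nonzero(edges)
    if len(ys) == 0:
        return 0.0
    h, w = gray.shape
    diffs = []
    for y, x in zip(ys, xs):
        # Sample pixels on both sides of the edge, along the gradient direction.
        ny, nx = gy[y, x], gx[y, x]
        norm = np.hypot(ny, nx) + 1e-9
        dy, dx = int(round(win * ny / norm)), int(round(win * nx / norm))
        for sy, sx in ((y + dy, x + dx), (y - dy, x - dx)):
            if 0 <= sy < h and 0 <= sx < w:
                diffs.append(np.abs(frame_a[sy, sx] - frame_b[sy, sx]).mean())
    return float(np.mean(diffs))  # large score -> likely shot boundary
```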

2.
Guiding Attention by Cooperative Cues (total citations: 2; self-citations: 1; citations by others: 1)
A common assumption in visual attention research is the rationale of "limited capacity of information processing". From this viewpoint, little consideration is given to how different information channels or modules cooperate, because cells in the processing stages are forced to compete for the limited resource. To examine the mechanism behind the cooperative behavior of information channels, a computational model of selective attention is implemented based on two hypotheses. Unlike the traditional view of visual attention, cooperative behavior is assumed to be a dynamic integration process between bottom-up and top-down information. Furthermore, top-down information is assumed to provide a contextual cue during the selection process and to guide attentional allocation among many bottom-up candidates. The results of a series of simulations with still and video images show some interesting properties that could not be explained by the competitive aspect of selective attention alone.

3.
A key characteristic of video data is its associated spatial and temporal semantics, so it is important that a video model capture the characteristics of objects and their relationships in time and space. J.F. Allen's (1983) 13 temporal relationships are often used in formulating queries that involve the temporal relationships among video frames. For spatial relationships, most approaches are based on projecting objects onto a two- or three-dimensional coordinate system. However, very few attempts have been made to formally represent the spatio-temporal relationships of objects contained in video data and to formulate queries with spatio-temporal constraints. The purpose of this work is to design a model representation for the specification of the spatio-temporal relationships among objects in video sequences. The model describes the spatial relationships among objects for each frame in a given video scene and the temporal relationships (for this frame) of the temporal intervals measuring the duration of these spatial relationships. It also models the temporal composition of an object, which reflects the evolution of the object's spatial relationships over subsequent frames in the video scene and in the entire video sequence. Our model representation also provides an effective and expressive way for the complete and precise specification of distances among objects in digital video. This model is a basis for the annotation of raw video.
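For reference, the 13 interval relations of Allen (1983) mentioned above can be enumerated directly. This small sketch classifies the relation between two intervals; the function name and tuple representation are our own.

```python
def allen_relation(a, b):
    """Return Allen's (1983) temporal relation between intervals
    a = (a_start, a_end) and b = (b_start, b_end), with start < end."""
    (as_, ae), (bs, be) = a, b
    if ae < bs: return "before"
    if be < as_: return "after"
    if ae == bs: return "meets"
    if be == as_: return "met-by"
    if as_ == bs and ae == be: return "equal"
    if as_ == bs: return "starts" if ae < be else "started-by"
    if ae == be: return "finishes" if as_ > bs else "finished-by"
    if bs < as_ and ae < be: return "during"
    if as_ < bs and be < ae: return "contains"
    return "overlaps" if as_ < bs else "overlapped-by"

# e.g. comparing the durations of two spatial relationships between objects:
print(allen_relation((0, 5), (3, 9)))  # -> "overlaps"
```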

4.
5.
6.
Even though visual attention models using bottom-up saliency can speed up object recognition by predicting object locations, in the presence of multiple salient objects saliency alone cannot discern target objects from the clutter in a scene. Using a metric named familiarity, we propose a top-down method for guiding attention towards target objects, in addition to bottom-up saliency. To demonstrate the effectiveness of familiarity, the unified visual attention model (UVAM), which combines top-down familiarity and bottom-up saliency, is applied to SIFT-based object recognition. The UVAM is tested on 3600 artificially generated images containing COIL-100 objects with varying amounts of clutter, and on 126 images of real scenes. Recognition times are reduced by 2.7× and 2×, respectively, with no loss in recognition accuracy, demonstrating the effectiveness and robustness of the familiarity-based UVAM.
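A minimal sketch of the combination step, under our assumption (not necessarily the paper's exact formula) that top-down familiarity multiplicatively modulates the bottom-up saliency map before fixation points are selected:

```python
import numpy as np

def attend(saliency, familiarity, n_fixations=3):
    """Pick fixation points from a bottom-up saliency map modulated by a
    top-down familiarity map (both HxW, non-negative). Illustrative only."""
    combined = (saliency / (saliency.max() + 1e-9)
                * familiarity / (familiarity.max() + 1e-9))
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(combined), combined.shape)
        fixations.append((y, x))
        # Inhibition of return: suppress a neighborhood around the winner.
        combined[max(0, y - 8):y + 9, max(0, x - 8):x + 9] = 0.0
    return fixations
```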

7.
Spatio-temporal alignment of sequences (total citations: 2; self-citations: 0; citations by others: 2)
This paper studies the problem of sequence-to-sequence alignment, namely, establishing correspondences in time and in space between two different video sequences of the same dynamic scene. The sequences are recorded by uncalibrated video cameras which are either stationary or jointly moving, with fixed (but unknown) internal parameters and relative inter-camera external parameters. Temporal variations between image frames (such as moving objects or changes in scene illumination) are powerful cues for alignment that cannot be exploited by standard image-to-image alignment techniques. We show that, by folding spatial and temporal cues into a single alignment framework, situations which are inherently ambiguous for traditional image-to-image alignment methods are often uniquely resolved by sequence-to-sequence alignment. Furthermore, the ability to align and integrate information across multiple video sequences both in time and in space gives rise to new video applications that are not possible when only image-to-image alignment is used.
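The temporal half of the problem can be illustrated with a deliberately simplified sketch: reduce each sequence to a 1-D "temporal activity" signal (per-frame change magnitude) and find the time shift that best correlates the two. This is our simplification; the paper solves the joint space-time alignment, not this 1-D version.

```python
import numpy as np

def temporal_offset(seq_a, seq_b, max_shift=30):
    """Estimate the integer frame offset between two videos (lists of HxW
    grayscale float frames) by correlating per-frame change magnitudes."""
    def activity(seq):
        a = np.array([np.abs(f1 - f0).mean() for f0, f1 in zip(seq, seq[1:])])
        return (a - a.mean()) / (a.std() + 1e-9)
    sa, sb = activity(seq_a), activity(seq_b)
    best_shift, best_score = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        lo, hi = max(0, s), min(len(sa), len(sb) + s)
        if hi - lo < 10:  # require a minimum overlap
            continue
        score = float(np.dot(sa[lo:hi], sb[lo - s:hi - s])) / (hi - lo)
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift  # frames by which seq_a lags seq_b
```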

8.
Detecting objects in video is one of the basic problems in computer vision. Moving-object detection is particularly important, since moving objects are precisely what one must attend to when walking, running, or driving a car. This paper proposes a method for detecting moving objects in a video as foreground objects by inferring the background frame by frame. The proposed method can cope with various changes of a scene, including large dynamic changes, in video taken by a stationary or moving camera. Experimental results show satisfactory performance of the proposed method.
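A minimal running-average background model conveys the frame-by-frame background inference idea; the learning rate and threshold below are assumptions, and the paper's method handles far larger scene changes than this sketch does.

```python
import numpy as np

class RunningBackground:
    """Per-pixel running-average background model (illustrative sketch)."""
    def __init__(self, alpha=0.02, thresh=25.0):
        self.alpha, self.thresh = alpha, thresh
        self.bg = None

    def apply(self, frame):
        """frame: HxW grayscale float array. Returns a boolean foreground mask."""
        if self.bg is None:
            self.bg = frame.astype(float).copy()
            return np.zeros(frame.shape, dtype=bool)
        mask = np.abs(frame - self.bg) > self.thresh
        # Update the background only where the scene looks static, so that
        # foreground objects do not bleed into the model.
        self.bg[~mask] = ((1 - self.alpha) * self.bg[~mask]
                          + self.alpha * frame[~mask])
        return mask
```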

9.
In this paper we present a Bayesian framework for parsing images into their constituent visual patterns. The parsing algorithm optimizes the posterior probability and outputs a scene representation as a parsing graph, in a spirit similar to parsing sentences in speech and natural language. The algorithm constructs the parsing graph and re-configures it dynamically using a set of moves, which are mostly reversible Markov chain jumps. This computational framework integrates two popular inference approaches: generative (top-down) methods and discriminative (bottom-up) methods. The former formulates the posterior probability in terms of generative models for images defined by likelihood functions and priors. The latter computes discriminative probabilities based on a sequence (cascade) of bottom-up tests/filters. In our Markov chain algorithm design, the posterior probability, defined by the generative models, is the invariant (target) probability for the Markov chain, and the discriminative probabilities are used to construct proposal probabilities to drive the Markov chain. Intuitively, the bottom-up discriminative probabilities activate top-down generative models. In this paper, we focus on two types of visual patterns: generic visual patterns, such as texture and shading, and object patterns, including human faces and text. These types of patterns compete and cooperate to explain the image, and so image parsing unifies image segmentation, object detection, and recognition (if we use generic visual patterns only, image parsing reduces to image segmentation; Tu and Zhu, 2002, IEEE Trans. PAMI, 24(5):657–673). We illustrate our algorithm on natural images of complex city scenes and show examples where image segmentation can be improved by allowing object-specific knowledge to disambiguate low-level segmentation cues, and conversely where object detection can be improved by using generic visual patterns to explain away shadows and occlusions.
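The Markov-chain machinery can be shown at toy scale: a Metropolis-Hastings sampler whose proposals come from a bottom-up "discriminative" distribution, while acceptance is governed by the generative posterior. The discrete state space and both distributions here are our assumptions for illustration, not the paper's parsing-graph moves.

```python
import numpy as np

def data_driven_mh(log_posterior, proposal_probs, n_states, n_steps=10000, seed=0):
    """Toy data-driven MCMC over a discrete state space: bottom-up
    `proposal_probs` (length n_states, sums to 1) drives the jumps, while
    the generative `log_posterior(state)` is the invariant target."""
    rng = np.random.default_rng(seed)
    state = int(rng.integers(n_states))
    samples = []
    for _ in range(n_steps):
        cand = int(rng.choice(n_states, p=proposal_probs))
        # MH acceptance for an independence proposal: includes q(state)/q(cand).
        log_a = (log_posterior(cand) - log_posterior(state)
                 + np.log(proposal_probs[state]) - np.log(proposal_probs[cand]))
        if np.log(rng.random()) < log_a:
            state = cand
        samples.append(state)
    return samples
```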

10.
11.
We address the problem of localizing and obtaining high-resolution footage of the people present in a scene. We propose a biologically inspired solution combining pre-attentive, low-resolution sensing for detection with shiftable, high-resolution, attentive sensing for confirmation and further analysis. The detection problem is made difficult by the unconstrained nature of realistic environments and human behaviour, and by the low resolution of pre-attentive sensing. Analysis of human peripheral vision suggests a solution based on the integration of relatively simple but complementary cues. We develop a Bayesian approach involving layered probabilistic modeling and spatial integration using a flexible norm that maximizes the statistical power of both dense and sparse cues. We compare the statistical power of several cues and demonstrate the advantage of cue integration. We evaluate the Bayesian cue integration method for human detection on a labelled surveillance database and find that it outperforms several competing methods based on conjunctive combinations of classifiers (e.g., AdaBoost). We have developed a real-time version of our pre-attentive human activity sensor that generates saccadic targets for an attentive foveated vision system. Output from high-resolution attentive detection algorithms and gaze state parameters are fed back as statistical priors and combined with pre-attentive cues to determine saccadic behaviour. The result is a closed-loop system that fixates faces over a 130° field of view, allowing high-resolution capture of facial video over a large dynamic scene.
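The core of Bayesian cue integration can be sketched as summing per-cue log-likelihood ratios under a conditional-independence assumption. The Gaussian cue models below are our stand-ins for the paper's learned distributions and flexible-norm spatial integration.

```python
import numpy as np

def integrate_cues(cue_maps, means_fg, stds_fg, means_bg, stds_bg):
    """Combine cue response maps (list of HxW arrays) into a per-pixel
    log-likelihood ratio of 'person present' vs background, assuming
    conditionally independent Gaussian cue models (a sketch)."""
    def log_gauss(x, mu, sd):
        return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd)
    llr = np.zeros(cue_maps[0].shape, dtype=float)
    for c, mf, sf, mb, sb in zip(cue_maps, means_fg, stds_fg, means_bg, stds_bg):
        llr += log_gauss(c, mf, sf) - log_gauss(c, mb, sb)
    return llr  # threshold this map to obtain detections
```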

12.
This paper proposes a video object segmentation method based on nonlinear scale-space theory. The method first applies a nonlinear scale-space decomposition to each frame of the video sequence, then uses motion information in the sequence of coarse-scale images to identify the moving objects, and finally determines precise object boundaries from each frame's multi-scale images. Experimental results show that the method can accurately segment video objects in complex natural scenes and is robust to interference.
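Nonlinear scale spaces of this kind are commonly built with Perona-Malik anisotropic diffusion; the sketch below generates such a multi-scale decomposition. The diffusivity function, step size, and iteration counts are our choices, since the abstract does not specify them.

```python
import numpy as np

def perona_malik(img, n_iter=20, kappa=15.0, step=0.2):
    """One level of nonlinear (anisotropic) diffusion: smooths homogeneous
    regions while preserving strong edges. img: HxW float array."""
    u = img.astype(float).copy()
    for _ in range(n_iter):
        # Finite differences toward the four neighbors.
        dn = np.roll(u, 1, axis=0) - u
        ds = np.roll(u, -1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        # Edge-stopping diffusivity g(d) = exp(-(d/kappa)^2).
        g = lambda d: np.exp(-(d / kappa) ** 2)
        u += step * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u

def scale_space(img, n_levels=4):
    """Fine-to-coarse stack of increasingly diffused images."""
    levels = [img.astype(float)]
    for _ in range(n_levels - 1):
        levels.append(perona_malik(levels[-1]))
    return levels
```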

13.
In augmented-reality applications such as virtual-studio TV production, multisite video conferencing using a virtual meeting room, and synthetic/natural hybrid coding according to the new ISO/MPEG-4 standard, a synthetic scene is mixed into a natural scene to generate a synthetic/natural hybrid image sequence. For realism, the illumination in both scenes should be identical. In this paper, the illumination of the natural scene is estimated automatically and applied to the synthetic scene. The natural scenes are restricted to scenes with non-occluding, simple, moving, mainly rigid objects. For illumination estimation, these natural objects are automatically segmented in the natural image sequence and three-dimensionally (3-D) modeled using ellipsoid-like models. The 3-D shape, 3-D motion, and the displaced frame difference between two succeeding images are evaluated to estimate three illumination parameters, which describe a distant point light source and ambient light. Using the estimated illumination parameters, the synthetic scene is rendered and mixed into the natural image sequence. Experimental results with a moving virtual object mixed into real video-telephone sequences show that the virtual object appears natural, with the same shading and shadows as the real objects. Furthermore, shading and shadows allow the viewer to understand the motion trajectory of the objects much better.
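Under a Lambertian assumption, a distant light plus ambient term can be recovered by linear least squares from surface normals and observed intensities. This is our simplification of the estimation described above, not the paper's exact procedure.

```python
import numpy as np

def estimate_illumination(normals, intensities):
    """Least-squares fit of a Lambertian model I = ambient + n . s,
    where s encodes the distant light's direction and strength.
    normals: Nx3 unit vectors; intensities: length-N observations.
    (Sketch: assumes all samples are lit, so max(0, .) is inactive.)"""
    A = np.hstack([normals, np.ones((len(normals), 1))])
    x, *_ = np.linalg.lstsq(A, intensities, rcond=None)
    s, ambient = x[:3], x[3]
    strength = np.linalg.norm(s)
    direction = s / (strength + 1e-12)
    return direction, strength, ambient
```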

14.
15.
Visual tracking can be treated as a parameter estimation problem that infers target states from image observations in video sequences. A richer target representation improves the chances of successful tracking in cluttered and dynamic environments, and thus enhances robustness. Richer representations can be constructed either by specifying a detailed model of a single cue or by combining a set of rough models of multiple cues. Both approaches increase the dimensionality of the state space, which results in a dramatic increase in computation. To investigate the integration of rough models from multiple cues and to explore computationally efficient algorithms, this paper formulates the problem of multiple-cue integration and tracking in a probabilistic framework based on a factorized graphical model. Structured variational analysis of such a graphical model factorizes the different modalities and suggests a co-inference process among them. Based on the importance sampling technique, a sequential Monte Carlo algorithm is proposed to provide an efficient simulation and approximation of the co-inference of multiple cues. This algorithm runs in real time at around 30 Hz. Our extensive experiments show that the proposed algorithm performs robustly in a large variety of tracking scenarios. The approach presented in this paper also has the potential to solve other problems, including sensor fusion.
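The sequential Monte Carlo backbone can be sketched as a standard importance-sampling particle filter. The 1-D state, the Gaussian motion model, and the product combination of per-cue likelihoods below are illustrative assumptions, not the paper's factorized co-inference procedure.

```python
import numpy as np

def particle_filter_step(particles, weights, cue_likelihoods,
                         motion_std=2.0, seed=0):
    """One predict-update-resample step. particles: (N,) state samples;
    cue_likelihoods: list of vectorized functions states -> likelihoods,
    one per cue, combined here as a product (independence assumption)."""
    rng = np.random.default_rng(seed)
    n = len(particles)
    # Resample according to current weights, then diffuse (motion model).
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx] + rng.normal(0.0, motion_std, size=n)
    # Re-weight by the product of cue likelihoods (log-domain for stability).
    logw = np.zeros(n)
    for lik in cue_likelihoods:
        logw += np.log(lik(particles) + 1e-300)
    logw -= logw.max()
    weights = np.exp(logw)
    weights /= weights.sum()
    return particles, weights
```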

16.
17.
A camera mounted on an aerial vehicle provides an excellent means for monitoring large areas of a scene. Utilizing several such cameras on different aerial vehicles allows further flexibility in terms of increased visual scope and the pursuit of multiple targets. In this paper, we address the problem of associating objects across multiple airborne cameras. Since the cameras are moving and often widely separated, direct appearance-based or proximity-based constraints cannot be used. Instead, we exploit geometric constraints on the relationship between the motion of each object across cameras to test multiple association hypotheses, without assuming any prior calibration information. Given our scene model, we propose a geometrically motivated likelihood function for evaluating a hypothesized association between observations in multiple cameras. Since multiple cameras exist, coherency in association is an essential requirement; for example, transitive closure must be maintained between more than two cameras. To ensure such coherency, we pose the problem of maximizing the likelihood function as a k-dimensional matching and use an approximation to find the optimal assignment of associations. Using the proposed error function, canonical trajectories of each object and optimal estimates of inter-camera transformations (in a maximum likelihood sense) are computed. Finally, we show that, as a result of associating objects across the cameras, a concurrent visualization of multiple aerial video streams is possible and that, under special conditions, trajectories interrupted due to occlusion or missed detections can be repaired. Results are shown on a number of real and controlled scenarios with multiple objects observed by multiple cameras, validating our qualitative models, and quantitative performance is also reported through simulation.
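When only two cameras are involved, the hypothesis-scoring-plus-matching step reduces to a bipartite assignment. The sketch below uses normalized velocity dissimilarity as a crude stand-in for the paper's geometric likelihood; everything beyond SciPy's Hungarian solver is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_tracks(tracks_a, tracks_b):
    """Associate object trajectories across two cameras. Each track is a
    (T, 2) array of image positions over the same T frames. The cost is
    frame-to-frame velocity-direction dissimilarity (a sketch, not the
    paper's maximum-likelihood formulation)."""
    def velocity(tr):
        v = np.diff(tr, axis=0)
        return v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-9)
    cost = np.zeros((len(tracks_a), len(tracks_b)))
    for i, ta in enumerate(tracks_a):
        for j, tb in enumerate(tracks_b):
            cost[i, j] = np.linalg.norm(velocity(ta) - velocity(tb), axis=1).mean()
    rows, cols = linear_sum_assignment(cost)  # min-cost bipartite matching
    return list(zip(rows, cols))
```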

18.
Video painting with space-time-varying style parameters (total citations: 3; self-citations: 0; citations by others: 3)
Artists use different means of stylization to control the focus on different objects in the scene. This allows them to portray complex meaning and achieve certain artistic effects. Most prior work on painterly rendering of videos, however, uses only a single painting style, with fixed global parameters, irrespective of objects and their layout in the images. This often leads to inadequate artistic control. Moreover, brush stroke orientation is typically assumed to follow an everywhere continuous directional field. In this paper, we propose a video painting system that accounts for the spatial support of objects in the images or videos, and uses this information to specify style parameters and stroke orientation for painterly rendering. Since objects occupy distinct image locations and move relatively smoothly from one video frame to another, our object-based painterly rendering approach is characterized by style parameters that coherently vary in space and time. Space-time-varying style parameters enable more artistic freedom, such as emphasis/de-emphasis, increase or decrease of contrast, exaggeration or abstraction of different objects in the scene in a temporally coherent fashion.

19.
20.
The Segmentation According to Natural Examples (SANE) algorithm learns to segment objects in static images from video training data. SANE uses background subtraction to find the segmentation of moving objects in videos. This provides object segmentation information for each video frame. The collection of frames and segmentations forms a training set that SANE uses to learn the image and shape properties of the observed motion boundaries. When presented with new static images, the trained model infers segmentations similar to the observed motion segmentations. SANE is a general method for learning environment-specific segmentation models. Because it can automatically generate training data from video, it can adapt to a new environment and new objects with relative ease, an advantage over untrained segmentation methods or those that require human-labeled training data. By using the local shape information in the training data, it outperforms a trained local boundary detector. Its performance is competitive with a trained top-down segmentation algorithm that uses global shape. The shape information it learns from one class of objects can assist the segmentation of other classes.
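The data-generation half of such a pipeline is straightforward to sketch: run background subtraction over a video and emit (frame, motion-boundary label) training pairs. The median background model and boundary extraction below are our assumptions standing in for SANE's actual pipeline.

```python
import numpy as np

def make_training_pairs(frames, thresh=25.0):
    """Turn a video (list of HxW grayscale float frames) into
    (image, motion-boundary mask) training pairs via background
    subtraction (illustrative stand-in for SANE's data generation)."""
    bg = np.median(np.stack(frames), axis=0)
    pairs = []
    for f in frames:
        fg = np.abs(f - bg) > thresh
        # Boundary = foreground pixels with at least one background neighbor.
        shifted = [np.roll(fg, s, ax) for ax in (0, 1) for s in (1, -1)]
        boundary = fg & ~np.logical_and.reduce(shifted)
        pairs.append((f, boundary))
    return pairs
```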
