Similar Documents
1.
Recognizing human actions from video has been a challenging problem in computer vision. Although human actions can be inferred from a wide range of data, it has been demonstrated that simple human actions can be inferred by tracking the movement of the head in 2D. This is a promising idea, as detecting and tracking the head is expected to be simpler and faster because the head has lower shape variability and higher visibility than other body parts (e.g., hands and/or feet). Although tracking the movement of the head alone does not provide sufficient information for distinguishing among complex human actions, it could serve as a complementary component of a more sophisticated action recognition system. In this article, we extend this idea by developing a more general, viewpoint-invariant action recognition system that detects and tracks the 3D position of the head using multiple cameras. The proposed approach employs Principal Component Analysis (PCA) to register the 3D trajectories in a common coordinate system and Dynamic Time Warping (DTW) to align them in time for matching. We present experimental results to demonstrate the potential of using 3D head trajectory information to distinguish among simple but common human actions independently of viewpoint.
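To make the two core steps concrete, here is a minimal sketch (not the authors' implementation) of PCA-based registration of a 3D head trajectory into its own principal axes, followed by DTW matching of two trajectories; all variable names and the toy data are illustrative.

```python
import numpy as np

def pca_register(traj):
    """Center a (T, 3) trajectory and rotate it onto its principal axes."""
    centered = traj - traj.mean(axis=0)
    # Rows of vt are the principal directions of the trajectory.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt.T

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping on 3D point sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Toy usage: compare a query trajectory against a stored action template.
rng = np.random.default_rng(0)
query = pca_register(np.cumsum(rng.normal(size=(50, 3)), axis=0))
template = pca_register(np.cumsum(rng.normal(size=(60, 3)), axis=0))
print("DTW distance:", dtw_distance(query, template))
```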

2.
Dynamic Template Tracking and Recognition
In this paper we address the problem of tracking non-rigid objects whose local appearance and motion change as a function of time. This class of objects includes dynamic textures such as steam, fire, smoke, water, etc., as well as articulated objects such as humans performing various actions. We model the temporal evolution of the object's appearance/motion using a linear dynamical system. We learn such models from sample videos and use them as dynamic templates for tracking objects in novel videos. We pose the problem of tracking a dynamic non-rigid object in the current frame as a maximum a posteriori estimate of the location of the object and the latent state of the dynamical system, given the current image features and the best estimate of the state in the previous frame. The advantage of our approach is that we can specify a priori the type of texture to be tracked in the scene by using previously trained models for the dynamics of these textures. Our framework naturally generalizes common tracking methods such as SSD and kernel-based tracking from static templates to dynamic templates. We test our algorithm on synthetic as well as real examples of dynamic textures and show that our simple dynamics-based trackers perform on par with, if not better than, the state of the art. Since our approach is general and applicable to any image feature, we also apply it to the problem of human action tracking and build action-specific optical flow trackers that perform better than the state of the art when tracking a human performing a particular action. Finally, since our approach is generative, we can use a priori trained trackers for different texture or action classes to simultaneously track and recognize the texture or action in the video.
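Below is a hedged sketch of fitting a linear dynamical system x_{t+1} = A x_t, y_t = C x_t to per-frame features, in the spirit of dynamic-texture system identification; it is an assumed simplification, and the paper's MAP tracking machinery and noise terms are not reproduced.

```python
import numpy as np

def fit_lds(Y, n_states=5):
    """Y: (feature_dim, T) matrix of per-frame features."""
    # C: observation matrix from the top singular vectors of the data.
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n_states]
    X = np.diag(S[:n_states]) @ Vt[:n_states]      # latent states, (n, T)
    # A: least-squares fit of the state transition x_{t+1} = A x_t.
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return A, C, X[:, 0]

def synthesize(A, C, x0, T):
    """Roll the learned model forward to produce a dynamic template."""
    frames, x = [], x0
    for _ in range(T):
        frames.append(C @ x)
        x = A @ x
    return np.stack(frames, axis=1)

rng = np.random.default_rng(1)
Y = rng.normal(size=(100, 40))                     # toy feature sequence
A, C, x0 = fit_lds(Y)
print(synthesize(A, C, x0, 10).shape)              # (100, 10)
```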

3.
We propose a general architecture for action (mimicking) and program (gesture) level visual imitation. Action-level imitation involves two modules. The Viewpoint Transformation (VPT) performs a "rotation" to align the demonstrator's body to that of the learner. The Visuo-Motor Map (VMM) maps this visual information to motor data. For program-level (gesture) imitation, there is an additional module that allows the system to recognize and generate its own interpretation of observed gestures to produce similar gestures/goals at a later stage. Besides the holistic approach to the problem, our approach differs from traditional work in i) the use of motor information for gesture recognition; ii) the use of context (e.g., object affordances) to focus the attention of the recognition system and reduce ambiguities; and iii) the use of iconic image representations for the hand, as opposed to fitting kinematic models to the video sequence. This approach is motivated by findings of visuomotor neurons in the F5 area of the macaque brain, which suggest that gesture recognition/imitation is performed in motor terms (mirror neurons) and relies on the use of object affordances (canonical neurons) to handle ambiguous actions. Our results show that this approach can outperform more conventional (e.g., purely visual) methods.

4.
Smoking detection has become an important measure for enforcing smoking bans in public places, and video-based smoking action recognition is widely used for it. Deep-learning approaches to image processing require large training datasets. Existing smoking action recognition methods fall short in accuracy and real-time performance, and most recognize the action of only a single person. To address these problems, we propose a method that recognizes the smoking actions of multiple people by detecting periodic motion. Extensive experiments show that smoking behavior is rhythmic and periodic; we analyze this periodicity in detail and formulate a specification of smoking behavior. Using human joint keypoint information, the method tracks the motion trajectories of the joints and checks whether those trajectories follow the periodic pattern, thereby recognizing the smoking action; by tracking the joint keypoints of several people simultaneously, it recognizes the smoking behavior of multiple people in real time. Experimental results show that the method achieves 91% accuracy and maintains high accuracy and robustness under a variety of conditions.
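An illustrative sketch (assumed, not from the paper) of the periodicity check on a joint trajectory: the autocorrelation of a wrist-keypoint signal reveals the dominant period of a repeated hand-to-mouth motion. The threshold and lag bounds are arbitrary placeholders.

```python
import numpy as np

def dominant_period(signal, min_lag=5):
    """Return the lag of the strongest autocorrelation peak, or None."""
    x = signal - signal.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac /= ac[0] + 1e-12                      # normalize so ac[0] == 1
    lag = min_lag + int(np.argmax(ac[min_lag:]))
    return lag if ac[lag] > 0.5 else None    # require a clear peak

# Toy wrist-height signal: a periodic raise-lower motion plus noise.
t = np.arange(200)
wrist_y = np.sin(2 * np.pi * t / 40) + 0.1 * np.random.default_rng(2).normal(size=200)
print("detected period (frames):", dominant_period(wrist_y))  # expected near 40
```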

5.
Automatically detecting objects in images or video sequences is one of the most relevant and frequently tackled tasks in computer vision and pattern recognition. The starting point for this work is a very general model-based approach to object detection. The problem is turned into a global continuous optimization one: given a parametric model of the object to be detected within an image, a function is maximized, which represents the similarity between the model and a region of the image under investigation. In particular, in this work, the optimization problem is tackled using Particle Swarm Optimization (PSO) and Differential Evolution (DE). We compare the performances of these optimization techniques on two real-world paradigmatic problems, onto which many other real-world object detection problems can be mapped: hippocampus localization in histological images and human body pose estimation in video sequences. In the former, a 2D deformable model of a section of the hippocampus is fit to the corresponding region of a histological image, to accurately localize such a structure and analyze gene expression in specific sub-regions. In the latter, an articulated 3D model of a human body is matched against a set of images of a human performing some action, taken from different perspectives, to estimate the subject's posture in space. Given the significant computational burden imposed by this approach, we implemented PSO and DE as parallel algorithms within the NVIDIA CUDA computing architecture.
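A generic PSO sketch follows (assumed; the paper's fitness function, parameter ranges, and CUDA parallelization are not reproduced). The `fitness` callable stands in for the model-to-image similarity being maximized.

```python
import numpy as np

def pso(fitness, dim, n_particles=30, iters=100, lo=-5.0, hi=5.0,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_particles, dim))       # positions
    v = np.zeros_like(x)                              # velocities
    pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
    gbest = pbest[np.argmax(pbest_f)]
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([fitness(p) for p in x])
        improved = f > pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[np.argmax(pbest_f)]
    return gbest, pbest_f.max()

# Toy similarity: peaked at the "true" model parameters (1, 2, 3).
best, score = pso(lambda p: -np.sum((p - np.array([1., 2., 3.])) ** 2), dim=3)
print(best.round(2), round(score, 4))
```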

6.
Neuro-psychological findings have shown that human perception of objects is based on part decomposition. Most objects are made of multiple parts, which are likely to be the entities actually involved in grasp affordances. Therefore, automatic object recognition and robot grasping should take advantage of 3D shape segmentation. This paper presents an approach toward planning robot grasps across similar objects by part correspondence. The novelty of the method lies in the topological decomposition of objects, which enables high-level semantic grasp planning. In particular, given a 3D model of an object, the representation is initially segmented by computing its Reeb graph. Then, automatic object recognition and part annotation are performed by applying a shape retrieval algorithm. After the recognition phase, queries are accepted for planning grasps on individual parts of the object. Finally, a robot grasp planner is invoked for finding stable grasps on the selected part of the object. Grasps are evaluated according to a widely used quality measure. Experiments performed in a simulated environment on a reasonably large dataset show the potential of topological segmentation to highlight candidate parts suitable for grasping.

7.
Simultaneous tracking and action recognition for single actor human actions
This paper presents an approach to simultaneously tracking the pose and recognizing human actions in a video. This is achieved by combining a Dynamic Bayesian Action Network (DBAN) with 2D body part models. The existing DBAN implementation relies on fairly weak observation features, which affects recognition accuracy. In this work, we use a 2D body part model for accurate pose alignment, which in turn improves both pose estimation and action recognition accuracy. To compensate for the additional time required for alignment, we use an action entropy-based scheme to determine the minimum number of states to be maintained in each frame while avoiding sample impoverishment. In addition, we present an approach to automating the keypose selection task for learning 3D action models from a few annotations. We demonstrate our approach on a hand gesture dataset with 500 action sequences, and we show that our algorithm achieves a 6% improvement in accuracy over DBAN.
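A hedged sketch of an entropy-driven budget for the number of hypothesis states kept per frame: high uncertainty over actions keeps more states, a peaked posterior keeps fewer. The linear mapping from entropy to sample count is an assumption, not the paper's scheme.

```python
import numpy as np

def action_entropy(weights, actions, n_actions):
    """Entropy of the action posterior implied by weighted state samples."""
    post = np.bincount(actions, weights=weights, minlength=n_actions)
    post = post / post.sum()
    nz = post[post > 0]
    return -np.sum(nz * np.log(nz))

def states_to_keep(weights, actions, n_actions, n_min=20, n_max=500):
    h = action_entropy(weights, actions, n_actions)
    h_max = np.log(n_actions)                   # entropy of a uniform posterior
    return int(n_min + (n_max - n_min) * h / h_max)

# Toy: 500 weighted samples, each committed to one of 8 candidate actions.
rng = np.random.default_rng(3)
actions = rng.integers(0, 8, size=500)
weights = rng.random(500)
print("keep", states_to_keep(weights, actions, 8), "states this frame")
```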

8.
Human action recognition is an important branch of the study of both human perception and computer vision systems. Along with the development of artificial intelligence, deep learning techniques have gained a remarkable reputation for image categorization tasks (e.g., object detection and classification). However, since human actions normally present as sequences of image frames, analyzing human action data requires significantly more computational power than still images when deep learning techniques are employed. This challenge has been the bottleneck for migrating learning-based image representation techniques to action sequences, so old-fashioned handcrafted human action representations are still widely used for human action recognition tasks. On the other hand, since handcrafted representations are usually ad hoc and overfit to specific data, they cannot be generalized to deal with varied realistic scenarios. Consequently, resorting to deep learning action representations for human action recognition tasks is ultimately a natural option. In this work, we provide a detailed overview of recent advancements in human action representations. As the first survey that covers both handcrafted and learning-based action representations, we explicitly discuss the strengths and limitations of existing techniques of both kinds. The ultimate goal of this survey is to provide comprehensive analysis and comparisons between learning-based and handcrafted action representations, so as to inspire action recognition researchers to study both kinds of representation techniques.

9.
Current state-of-the-art action classification methods aggregate space–time features globally, from the entire video clip under consideration. However, the features extracted may in part be due to irrelevant scene context, or to movements shared amongst multiple action classes. This motivates learning with local discriminative parts, which can help localise which parts of the video are significant. Exploiting spatio-temporal structure in the video should also improve results, just as deformable part models have proven highly successful in object recognition. However, whereas objects have clear boundaries, which means we can easily define a ground truth for initialisation, 3D space–time actions are inherently ambiguous and expensive to annotate in large datasets. Thus, it is desirable to adapt pictorial star models to action datasets without location annotation, and to features invariant to changes in pose, such as bag-of-features and Fisher vectors, rather than low-level HoG. We therefore propose local deformable spatial bag-of-features, in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test time. In our experimental evaluation we demonstrate that by using local space–time action parts in a weakly supervised setting, we are able to achieve state-of-the-art classification performance, whilst also being able to localise actions even in the most challenging video datasets.
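An illustrative scoring rule (assumed, and spatial-only for brevity; the paper's parts also deform in time) for grid parts that may deform: each part picks the displacement maximizing its appearance score minus a quadratic deformation penalty, as in deformable part models.

```python
import numpy as np

def part_score(app_map, anchor, deform_w=0.1, radius=2):
    """app_map: (H, W) per-location appearance scores for one part."""
    ay, ax = anchor
    best = -np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = ay + dy, ax + dx
            if 0 <= y < app_map.shape[0] and 0 <= x < app_map.shape[1]:
                # Appearance reward minus quadratic displacement cost.
                best = max(best, app_map[y, x] - deform_w * (dy * dy + dx * dx))
    return best

# Toy: a 3x3 grid of parts scored over a 12x12 appearance map.
rng = np.random.default_rng(4)
app = rng.normal(size=(12, 12))
anchors = [(y, x) for y in (2, 6, 10) for x in (2, 6, 10)]
print("deformable grid score:", round(sum(part_score(app, a) for a in anchors), 3))
```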

10.
This paper presents a robust framework for tracking complex objects in video sequences. The multiple hypothesis tracking (MHT) algorithm reported in IEEE Trans. Pattern Anal. Mach. Intell. 18(2) (1996) is modified to accommodate high-level representations (2D edge maps, 3D models) of objects for tracking. The framework exploits the advantages of the MHT algorithm, which is capable of resolving data association/uncertainty, and integrates it with object matching techniques to provide robust behavior while tracking complex objects. To track objects in 2D, a 4D feature is used to represent edge/line segments, which are tracked using MHT. In many practical applications 3D models provide more information about the object's pose (i.e., rotation information in the transformation space) that cannot be recovered using 2D edge information. Hence, a 3D model-based object tracking algorithm is also presented. A probabilistic Hausdorff image matching algorithm is incorporated into the framework in order to determine the geometric transformation that best maps the model features onto their corresponding ones in the image plane. The 3D model of the object is used to constrain the tracker to operate in a consistent manner. Experimental results on real and synthetic image sequences are presented to demonstrate the efficacy of the proposed framework.
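A minimal sketch of a partial (quantile-based) Hausdorff distance between model features and image features; the paper's probabilistic formulation is richer, so treat this as an illustration only.

```python
import numpy as np

def partial_hausdorff(model_pts, image_pts, quantile=0.8):
    """k-th ranked nearest-neighbor distance from model to image points."""
    # Pairwise distances: (len(model_pts), len(image_pts)).
    d = np.linalg.norm(model_pts[:, None, :] - image_pts[None, :, :], axis=2)
    nearest = d.min(axis=1)                 # each model point's best match
    return np.quantile(nearest, quantile)   # robust to occluded points

rng = np.random.default_rng(5)
model = rng.uniform(0, 100, (40, 2))
image = np.vstack([model + rng.normal(0, 0.5, model.shape),  # matched, noisy
                   rng.uniform(0, 100, (30, 2))])            # clutter
print("partial Hausdorff:", round(partial_hausdorff(model, image), 3))
```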

11.
Genetic object recognition using combinations of views
This paper investigates the application of genetic algorithms (GAs) for recognizing real 2D or 3D objects from 2D intensity images, assuming that the viewpoint is arbitrary. Our approach is model-based (i.e., we assume a pre-defined set of models), while our recognition strategy relies on the theory of algebraic functions of views. According to this theory, the variety of 2D views depicting an object can be expressed as a combination of a small number of 2D views of the object. This implies a simple and powerful strategy for object recognition: novel 2D views of an object (2D or 3D) can be recognized by simply matching them to combinations of known 2D views of the object. In other words, objects in a scene are recognized by "predicting" their appearance through the combination of known views of the objects. This is an important idea, which is also supported by psychophysical findings indicating that the human visual system works in a similar way. The main difficulty in implementing this idea is determining the parameters of the combination of views. This problem can be solved either in the space of feature matches among the views ("image space") or in the space of parameters ("transformation space"). In general, both of these spaces are very large, making the search very time-consuming. In this paper, we propose using GAs to search these spaces efficiently. To improve the efficiency of genetic searching in the transformation space, we use singular value decomposition and interval arithmetic to restrict the genetic search to the most feasible regions of the transformation space. The effectiveness of the GA approaches is shown on a set of increasingly complex real scenes where exact and near-exact matches are found reliably and quickly.
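A sketch of the core "combination of views" check (an assumed simplification to the orthographic case): coordinates of a novel 2D view are expressed as a linear combination of two known 2D views, and the residual of a least-squares fit measures how well an object hypothesis explains the view. The GA search over matches/parameters is not reproduced.

```python
import numpy as np

def view_combination_residual(view1, view2, novel):
    """Each argument is an (N, 2) array of corresponding point coordinates."""
    n = len(novel)
    # Basis: x and y of both known views, plus a constant (translation) term.
    B = np.hstack([view1, view2, np.ones((n, 1))])
    coeffs, *_ = np.linalg.lstsq(B, novel, rcond=None)
    return np.linalg.norm(B @ coeffs - novel) / np.sqrt(n)

# Toy 3D object seen from three orthographic viewpoints.
rng = np.random.default_rng(6)
P = rng.uniform(-1, 1, (25, 3))

def project(angle):                       # rotate about y, keep (x, y)
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return (P @ R.T)[:, :2]

v1, v2, novel = project(0.0), project(0.4), project(0.9)
print("residual (should be ~0):", view_combination_residual(v1, v2, novel))
```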

12.
Recently, much work has been done on multiple object tracking on the one hand and on reference model adaptation for single-object trackers on the other. In this paper, we do both: tracking of multiple objects (faces of people) in a meeting scenario and online learning to incrementally update the models of the tracked objects to account for appearance changes during tracking. Additionally, we automatically initialize and terminate tracking of individual objects based on low-level features, i.e., face color, face size, and object movement. Unlike our approach, many methods assume that the target region has been initialized by hand in the first frame. For tracking, a particle filter is incorporated to propagate sample distributions over time. We discuss the close relationship between our implemented tracker based on particle filters and genetic algorithms. Numerous experiments on meeting data demonstrate the capabilities of our tracking approach. Additionally, we provide an empirical verification of the reference model learning during tracking of indoor and outdoor scenes, which supports more robust tracking; to this end, we report the average standard deviation of the trajectories over numerous tracking runs as a function of the learning rate.
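A generic sampling-importance-resampling particle filter step, as a hedged sketch: the paper's face tracker uses color/size observation models and online template updating, which are abstracted here into a caller-supplied `likelihood` function.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood, motion_std=2.0,
                         rng=np.random.default_rng(7)):
    # 1. Resample proportionally to the current weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    # 2. Propagate with a simple random-walk motion model.
    particles = particles + rng.normal(0, motion_std, particles.shape)
    # 3. Re-weight by the observation likelihood and normalize.
    weights = np.array([likelihood(p) for p in particles])
    weights = weights / weights.sum()
    return particles, weights

# Toy: track a 2D position whose true value is (50, 30).
true_pos = np.array([50.0, 30.0])
lik = lambda p: np.exp(-np.sum((p - true_pos) ** 2) / (2 * 5.0 ** 2))
parts = np.random.default_rng(8).uniform(0, 100, (200, 2))
w = np.full(200, 1 / 200)
for _ in range(20):
    parts, w = particle_filter_step(parts, w, lik)
print("estimate:", (w[:, None] * parts).sum(axis=0).round(1))
```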

13.
Three-dimensional object recognition on range data and 3D point clouds is becoming increasingly important. Since many real objects have shapes that can be approximated by simple primitives, robust pattern recognition can be used to search for primitive models. For example, the Hough transform is a well-known technique widely adopted in 2D image space. In this paper, we systematically analyze different probabilistic/randomized Hough transform algorithms for spherical object detection in dense point clouds. In particular, we study and compare four variants, characterized by the number of points drawn together to compute a surface in the parametric space, and we formally discuss their models. We also propose a new method that combines the advantages of both single-point and multi-point approaches for faster and more accurate detection. The methods are tested on synthetic and real datasets.
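A sketch (assumed parameters, not the paper's exact variants) of the multi-point randomized Hough idea: repeatedly draw four points, solve the sphere through them in closed form, and vote in a discretized parameter space; the strongest accumulator cell gives the sphere hypothesis.

```python
import numpy as np

def sphere_from_points(p):
    """Unique sphere through 4 non-coplanar points p, shape (4, 3)."""
    # x^2+y^2+z^2 + D x + E y + F z + G = 0 is linear in (D, E, F, G).
    A = np.hstack([p, np.ones((4, 1))])
    b = -np.sum(p ** 2, axis=1)
    D, E, F, G = np.linalg.solve(A, b)
    center = -0.5 * np.array([D, E, F])
    radius = np.sqrt(center @ center - G)
    return center, radius

def detect_sphere(cloud, n_draws=300, bin_size=0.5, rng=np.random.default_rng(9)):
    votes = {}
    for _ in range(n_draws):
        pts = cloud[rng.choice(len(cloud), 4, replace=False)]
        try:
            c, r = sphere_from_points(pts)
        except np.linalg.LinAlgError:
            continue                        # degenerate (coplanar) draw
        if not np.isfinite(r):
            continue
        key = tuple(np.round(np.append(c, r) / bin_size).astype(int))
        votes[key] = votes.get(key, 0) + 1
    best = max(votes, key=votes.get)
    return np.array(best) * bin_size        # (cx, cy, cz, r)

# Toy cloud: a noisy sphere of radius 3 at (1, 2, 0) plus outliers.
rng = np.random.default_rng(10)
d = rng.normal(size=(400, 3)); d /= np.linalg.norm(d, axis=1, keepdims=True)
cloud = np.vstack([np.array([1, 2, 0]) + 3 * d, rng.uniform(-6, 6, (100, 3))])
print("detected (cx, cy, cz, r):", detect_sphere(cloud))
```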

14.
A combined 2D/3D approach is presented that allows for robust tracking of moving people and recognition of actions. It is assumed that the system observes multiple moving objects via a single, uncalibrated video camera. Low-level features are often insufficient for detection, segmentation, and tracking of non-rigid moving objects. Therefore, an improved mechanism is proposed that integrates low-level (image processing), mid-level (recursive 3D trajectory estimation), and high-level (action recognition) processes. A novel extended Kalman filter formulation is used to estimate the relative 3D motion trajectories up to a scale factor. The recursive estimation process provides a prediction and error measure that is exploited in the higher-level stages of action recognition. Conversely, the higher-level mechanisms provide feedback that allows the system to reliably segment and maintain the tracking of moving objects before, during, and after occlusion. Heading-guided recognition (HGR) is proposed as an efficient method for adaptive classification of activity. The HGR approach is demonstrated using "motion history images" that are then recognized via a mixture-of-Gaussians classifier. The system is tested on recognizing various dynamic human outdoor activities: running, walking, rollerblading, and cycling. In addition, experiments with real and synthetic data sets are used to evaluate the stability of the trajectory estimator with respect to noise.
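A compact sketch of the standard "motion history image" recipe used above (the decay constant and toy motion masks are placeholders): recent motion is bright and older motion fades linearly over tau frames, so the intensity gradient encodes heading.

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=30):
    """motion_mask: boolean (H, W) array marking pixels that moved."""
    mhi = np.maximum(mhi - 1, 0)          # fade old motion by one step
    mhi[motion_mask] = tau                # stamp fresh motion at full value
    return mhi

# Toy: a blob sweeping left-to-right across 20 frames.
H, W = 64, 64
mhi = np.zeros((H, W))
for t in range(20):
    mask = np.zeros((H, W), dtype=bool)
    mask[28:36, 3 * t:3 * t + 8] = True
    mhi = update_mhi(mhi, mask)
print("MHI value range:", mhi.min(), mhi.max())
```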

15.
We have constructed a dialog environment between a human and a virtual agent. With commercial off-the-shelf VR technologies, special devices such as a data glove must be used for interaction, and it is difficult for ordinary users to manipulate objects on their own. If a helper with direct access to objects in the virtual space is available, we can ask them for assistance; the question, however, is how to communicate with the helper. The basic idea is to use speech and gesture recognition systems. We have already reported on this, although in the earlier system only the avatar could move a virtual object, and the user could not freely manipulate virtual objects. In this new attempt, we therefore constructed a communication channel between the virtual space and the real world so that virtual objects can be manipulated directly. To develop the new system, we extended the existing one into an Internet meeting system that allows users in different places to interact with each other by voice and by pointing with a finger. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008, where it received the Young Author Award.

16.
Pose refinement is an essential task for computer vision systems that require the calibration and verification of model and camera parameters. Typical domains include the real-time tracking of objects and verification in model-based recognition systems. A technique is presented for recovering model and camera parameters of 3D objects from a single two-dimensional image. This basic problem is further complicated by the incorporation of simple bounds on the model and camera parameters and of linear constraints restricting some subset of object parameters to a specific relationship. It is demonstrated in this paper that, using numerical analysis techniques including active set methods and Lagrange multiplier analysis, this constrained pose-refinement formulation is no more difficult than the original problem. A number of bounded and linearly constrained parametric models are tested, and convergence to proper values occurs from a wide range of initial error, utilizing minimal matching information (relative to the number of parameters and components). The ability to recover model parameters in a constrained search space will thus simplify associated object recognition problems.
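A hedged sketch using SciPy's SLSQP solver: refine parameters under simple bounds plus a linear equality tying two parameters together. The reprojection residual below is a stand-in for a real model-to-image error, and all numbers are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

observed = np.array([1.2, 0.4, 2.1])        # stand-in image measurements

def reprojection_error(params):
    """Placeholder residual; a real system projects 3D model features."""
    predicted = np.array([params[0] + params[2], params[1], 2 * params[2]])
    return np.sum((predicted - observed) ** 2)

x0 = np.zeros(3)
bounds = [(-1.0, 1.0), (-1.0, 1.0), (0.0, 3.0)]               # simple bounds
linear_eq = {"type": "eq", "fun": lambda p: p[0] - 2 * p[1]}  # p0 = 2*p1
res = minimize(reprojection_error, x0, method="SLSQP",
               bounds=bounds, constraints=[linear_eq])
print("refined parameters:", res.x.round(3), "cost:", round(res.fun, 4))
```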

17.
Creating dynamic virtual environments consisting of humans interacting with objects is a fundamental problem in computer graphics. While it is well accepted that agent interactions play an essential role in synthesizing such scenes, most extant techniques exclusively focus on static scenes, leaving the dynamic component out. In this paper, we present a generative model to synthesize plausible multi-step dynamic human-object interactions. Generating multi-step interactions is challenging since the space of such interactions is exponential in the number of objects, activities, and time steps. We propose to handle this combinatorial complexity by learning a lower-dimensional space of plausible human-object interactions. We use action plots to represent interactions as a sequence of discrete actions along with the participating objects and their states. To build action plots, we present an automatic method that uses state-of-the-art computer vision techniques on RGB videos in order to detect individual objects and their states, extract the involved hands, and recognize the actions performed. The action plots are built from observing videos of everyday activities and are used to train a generative model based on a Recurrent Neural Network (RNN). The network learns the causal dependencies and constraints between individual actions and can be used to generate novel and diverse multi-step human-object interactions. Our representation and generative model allow new capabilities in a variety of applications, such as interaction prediction, animation synthesis, and motion planning for a real robotic agent.
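A minimal sketch (assumed architecture and toy vocabulary, not the paper's exact network) of an RNN that learns next-action prediction over discrete action tokens, the core of generating multi-step interaction sequences.

```python
import torch
import torch.nn as nn

N_ACTIONS = 6            # illustrative: pick, pour, stir, place, open, close

class ActionPlotRNN(nn.Module):
    def __init__(self, n_actions, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(n_actions, 16)
        self.rnn = nn.LSTM(16, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, tokens):                   # tokens: (B, T) long
        out, _ = self.rnn(self.embed(tokens))
        return self.head(out)                    # (B, T, n_actions) logits

# Toy training data: a repeated plausible plot pick -> pour -> stir -> place.
seq = torch.tensor([[0, 1, 2, 3] * 5])
model = ActionPlotRNN(N_ACTIONS)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    logits = model(seq[:, :-1])                  # predict each next token
    loss = loss_fn(logits.reshape(-1, N_ACTIONS), seq[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
print("next action after 'stir':", model(seq[:, :3]).argmax(-1)[0, -1].item())
```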

18.
Real, 2001, 7(6): 495–506
Augmented reality requires understanding of the scene to know when, where, and what to display as a response to changes in the surrounding world. This understanding often involves tracking and recognition of multiple objects and locations in real time. Technologies frequently used for multiple object tracking, such as electromagnetic trackers, are very limited in range as well as constraining. The use of computer vision to identify and track multiple objects is very promising. However, the requirements for traditional object recognition using appearance-based or model-based vision are very complex, and its performance is far from real-time. An alternative is to use a set of markers or fiducials for object tracking and recognition. In this paper we present a system of marker coding that, together with an efficient image processing technique, provides a practical method for tracking the marked objects in real time. The technique is based on clustering candidate regions in space using a minimum spanning tree. The markers in the codes also allow the estimation of the three-dimensional pose of the objects. We demonstrate the utility of the marker-based tracking technique in an Augmented Reality application. The application involves superimposing graphics over real industrial parts that are tracked using fiducials and manipulated by a human in order to complete an assembly. The system aids in the evaluation of different assembly sequence possibilities.
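A sketch (assumed cut threshold and toy data) of grouping candidate marker regions with a minimum spanning tree: build the MST over region centroids, cut edges longer than a threshold, and read clusters off the connected components.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def cluster_markers(centroids, cut_dist=10.0):
    """centroids: (N, 2) pixel coordinates of candidate marker regions."""
    dense = squareform(pdist(centroids))         # full distance matrix
    mst = minimum_spanning_tree(dense).toarray()
    mst[mst > cut_dist] = 0                      # cut long MST edges
    n_clusters, labels = connected_components(mst, directed=False)
    return n_clusters, labels

# Toy: two marker codes, each a tight group of 4 fiducials, far apart.
rng = np.random.default_rng(11)
group_a = rng.uniform(0, 5, (4, 2))
group_b = rng.uniform(60, 65, (4, 2))
n, labels = cluster_markers(np.vstack([group_a, group_b]))
print("clusters:", n, "labels:", labels)         # expect 2 clusters
```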

19.
Interpretation of images and videos containing humans interacting with different objects is a daunting task. It involves understanding the scene/event, analyzing human movements, recognizing manipulable objects, and observing the effect of the human movement on those objects. While each of these perceptual tasks can be conducted independently, the recognition rate improves when interactions between them are considered. Motivated by psychological studies of human perception, we present a Bayesian approach that integrates the various perceptual tasks involved in understanding human-object interactions. Previous approaches to object and action recognition rely on static shape/appearance feature matching and on motion analysis, respectively. Our approach goes beyond these traditional approaches and applies spatial and functional constraints on each of the perceptual elements for coherent semantic interpretation. Such constraints allow us to recognize objects and actions when the appearances are not discriminative enough. We also demonstrate the use of such constraints in recognizing actions from static images without using any motion information.
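A toy sketch of the integration idea (all numbers assumed): ambiguous per-task likelihoods for object and action are combined with a functional-compatibility prior, so the joint posterior favors coherent pairs even when the action recognizer alone would choose differently.

```python
import numpy as np

objects = ["cup", "spray_bottle"]
actions = ["drinking", "spraying"]

# Appearance/motion likelihoods from independent recognizers (ambiguous).
p_obj = np.array([0.55, 0.45])                 # P(evidence_o | object)
p_act = np.array([0.48, 0.52])                 # P(evidence_a | action)

# Functional compatibility P(action | object): cups afford drinking, etc.
compat = np.array([[0.9, 0.1],                 # rows: objects, cols: actions
                   [0.1, 0.9]])

# Joint posterior over (object, action), up to normalization.
joint = p_obj[:, None] * p_act[None, :] * compat
joint /= joint.sum()
i, j = np.unravel_index(joint.argmax(), joint.shape)
print("MAP interpretation:", objects[i], "+", actions[j])
print(joint.round(3))
```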

20.