12 similar documents found
1.
Although multiple methods have been proposed for human action recognition, existing multi-view approaches cannot effectively discover meaningful relationships among action categories across different views. To address this problem, this paper proposes a multi-view learning approach for multi-view action recognition. First, the proposed method leverages popular visual representations, bag-of-visual-words (BoVW) and Fisher vectors (FV), to represent the individual videos in each view. Second, a sparse coding algorithm is used to map the low-level features of the various views into a discriminative, high-level semantic space. Third, we employ multi-task learning (MTL) for joint action modeling and discovery of latent relationships among different action categories. Extensive experimental results on the M2I and IXMAS datasets demonstrate the effectiveness of the proposed approach. Moreover, the experiments further show that the discovered latent relationships benefit multi-view model learning and improve action recognition performance.
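As an illustration of the representation step mentioned above, the following is a minimal bag-of-visual-words sketch (k-means codebook plus a normalized histogram of visual words). It is not the authors' implementation; the descriptor type, codebook size, and dimensions are placeholder assumptions, and the Fisher vector, sparse coding, and multi-task learning stages are omitted.

```python
# Minimal bag-of-visual-words (BoVW) sketch for video representation.
# Assumes local descriptors were already extracted per video; codebook size
# and feature dimensions are illustrative, not taken from the paper.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, n_words=256, seed=0):
    """Cluster pooled local descriptors into a visual vocabulary."""
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(all_descriptors)

def bovw_histogram(descriptors, codebook):
    """Quantize one video's descriptors and return an L1-normalized histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy usage: 3 videos, 500 random 64-D descriptors each.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(500, 64)) for _ in range(3)]
codebook = build_codebook(np.vstack(videos), n_words=32)
features = np.stack([bovw_histogram(v, codebook) for v in videos])
print(features.shape)  # (3, 32)
```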
2.
In action recognition, a proper frame sampling method can both reduce redundant video information and improve recognition accuracy. In this paper, an action-density-based non-isometric frame sampling method, named NFS, is proposed to discard redundant video information and sample informative frames for neural networks, where action density indicates the intensity of the actions in a video. In particular, the NFS method comprises an action density determination mechanism, a focused-clips division mechanism, and a reinforcement-learning-based frame sampling (RLFS) mechanism. Evaluations with various neural networks and datasets show that the proposed NFS method is highly effective for frame sampling and yields better action recognition accuracy than existing methods.
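To make the non-isometric idea concrete, here is a hedged sketch of non-uniform frame sampling driven by a simple action-density proxy (mean absolute frame difference). The paper's NFS additionally uses focused-clip division and reinforcement learning (RLFS); none of that is reproduced here, and the density proxy is an assumption for illustration only.

```python
# Non-uniform ("non-isometric") frame sampling driven by a frame-difference
# action-density proxy. Dense (high-motion) regions receive more samples.
import numpy as np

def action_density(frames):
    """Per-frame motion proxy: mean absolute difference to the previous frame."""
    frames = frames.astype(np.float32)
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2, 3))
    return np.concatenate([[diffs[0]], diffs])  # pad the first frame

def sample_frames(frames, n_samples=16):
    """Pick more frames where the density is high, fewer where it is low."""
    density = action_density(frames)
    cdf = np.cumsum(density) / density.sum()
    # Invert the CDF at evenly spaced quantiles -> dense regions get more frames.
    targets = (np.arange(n_samples) + 0.5) / n_samples
    idx = np.searchsorted(cdf, targets)
    return np.clip(idx, 0, len(frames) - 1)

# Toy usage: 120 fake RGB frames of size 32x32.
rng = np.random.default_rng(1)
clip = rng.integers(0, 255, size=(120, 32, 32, 3), dtype=np.uint8)
print(sample_frames(clip, n_samples=16))
```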
3.
Journal of Visual Communication and Image Representation, 2014, 25(1): 12–23
Human actions can be considered as a sequence of body poses over time, usually represented by coordinates corresponding to human skeleton models. Recently, a variety of low-cost devices able to produce markerless real-time pose estimation have been released. Nevertheless, limitations of the incorporated RGB-D sensors can produce inaccuracies, necessitating alternative representation and classification schemes to boost performance. In this context, we propose an action recognition method in which skeletal data are first processed to obtain robust and invariant pose representations, and vectors of dissimilarities to a set of prototype actions are then computed. Recognition is performed in this dissimilarity space using sparse representation. A new publicly available dataset, created for evaluation purposes, is also introduced. The proposed method was further evaluated on other public datasets, and the results are compared with those of similar methods.
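A minimal sketch of the dissimilarity-space idea follows: each sequence is re-described by its distances to a set of prototype actions, and recognition happens in that space. Euclidean distance between fixed-length descriptors stands in for the paper's pose-based dissimilarity, and a linear SVM is used purely as a placeholder for the sparse-representation classifier described in the abstract.

```python
# Dissimilarity-space embedding: represent each sample by its distances to a
# small set of prototypes, then classify in that space.
import numpy as np
from sklearn.svm import LinearSVC

def to_dissimilarity_space(X, prototypes):
    """Rows of X and prototypes are fixed-length sequence descriptors."""
    # Pairwise Euclidean distances: (n_samples, n_prototypes).
    return np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)

# Toy data: 100 training sequences (40-D descriptors), 10 prototypes, 3 classes.
rng = np.random.default_rng(2)
X_train, y_train = rng.normal(size=(100, 40)), rng.integers(0, 3, 100)
prototypes = X_train[rng.choice(100, size=10, replace=False)]

D_train = to_dissimilarity_space(X_train, prototypes)
clf = LinearSVC(max_iter=5000).fit(D_train, y_train)
print(clf.predict(to_dissimilarity_space(rng.normal(size=(5, 40)), prototypes)))
```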
4.
5.
The conditional random fields (CRFs) model, one of the most successful discriminative approaches, has recently received renewed attention for human action recognition. However, existing CRFs formulations typically have limited ability to capture higher-order dependencies among the given states and deeper intermediate representations within the target states, both of which are potentially useful for modeling complex action recognition scenarios. In this paper, we present a novel double-layer CRFs (DL-CRFs) model for human action recognition in a graphical model framework. In the problem formulation, an augmented top layer is designed in the DL-CRFs model as a high-level, global variable that captures higher-order dependencies between the target states from a global perspective. Meanwhile, we exploit additional intermediate variables to explicitly model the intermediate representations between the target states and the observation features. We then decompose the DL-CRFs model into two parts, a top linear-chain CRFs model and a bottom one, to ease inference during both parameter learning and testing. Finally, the DL-CRFs model parameters are learned with a block-coordinate primal–dual Frank–Wolfe algorithm with a gap sampling scheme in a structured support vector machine framework. Experimental results and discussions on two public benchmark datasets demonstrate that the proposed approach outperforms other state-of-the-art methods on several evaluation criteria.
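The full DL-CRFs model, with its augmented top layer, intermediate variables, and Frank–Wolfe learning, is beyond a short sketch. The snippet below shows only the linear-chain building block that the decomposition relies on: Viterbi (MAP) decoding of an action-state sequence given unary and pairwise log-potentials, which are assumed inputs here.

```python
# Viterbi decoding for a linear-chain CRF: the building block of the
# decomposition mentioned in the abstract, not the full DL-CRFs model.
import numpy as np

def viterbi(unary, pairwise):
    """unary: (T, K) per-frame state log-potentials; pairwise: (K, K) transition log-potentials."""
    T, K = unary.shape
    score = unary[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j] = score of ending in state i at t-1 and moving to state j.
        cand = score[:, None] + pairwise
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + unary[t]
    states = np.zeros(T, dtype=int)
    states[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        states[t - 1] = backptr[t, states[t]]
    return states

# Toy usage: 8 frames, 4 action states.
rng = np.random.default_rng(3)
print(viterbi(rng.normal(size=(8, 4)), rng.normal(size=(4, 4))))
```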
6.
Learned dictionaries have been shown to perform better than predefined ones in many application areas. Focusing on synthetic aperture radar (SAR) images, this paper proposes a structure preserving dictionary learning (SPDL) algorithm that captures and preserves both the local and the distant structures of the data for SAR target configuration recognition. Because of the target aspect-angle sensitivity of SAR images, two structure preserving factors are embedded into the SPDL algorithm: one preserves the local structure of the data, and the other preserves its distant structure. Both structures are preserved by the learned dictionary to realize target configuration recognition. Experimental results on the moving and stationary target acquisition and recognition (MSTAR) database demonstrate that the proposed algorithm can handle situations with a limited number of training samples and under noisy conditions.
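For orientation, here is a baseline sketch of per-class dictionary learning followed by reconstruction-residual classification. The SPDL structure-preserving factors (the local and distant structure terms) are not included; atom counts, sparsity levels, and the toy features are assumptions.

```python
# Baseline only: per-class dictionary learning plus reconstruction-residual
# classification; the SPDL structure-preserving regularizers are omitted.
import numpy as np
from sklearn.decomposition import DictionaryLearning

def learn_class_dictionaries(X, y, n_atoms=16, alpha=1.0):
    dicts = {}
    for c in np.unique(y):
        dl = DictionaryLearning(n_components=n_atoms, alpha=alpha,
                                transform_algorithm="omp",
                                transform_n_nonzero_coefs=5, random_state=0)
        dl.fit(X[y == c])
        dicts[c] = dl
    return dicts

def classify(x, dicts):
    """Assign x to the class whose dictionary reconstructs it best."""
    best, best_err = None, np.inf
    for c, dl in dicts.items():
        code = dl.transform(x[None, :])      # sparse code over class atoms
        recon = code @ dl.components_        # back to feature space
        err = np.linalg.norm(x - recon[0])
        if err < best_err:
            best, best_err = c, err
    return best

# Toy usage with random 64-D "SAR features" and 2 target configurations.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (40, 64)), rng.normal(3, 1, (40, 64))])
y = np.array([0] * 40 + [1] * 40)
dicts = learn_class_dictionaries(X, y)
print(classify(rng.normal(3, 1, 64), dicts))
```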
7.
Journal of Visual Communication and Image Representation, 2014, 25(6): 1432–1445
In this paper we introduce a novel method for action/movement recognition in motion capture data. The joint orientation angles and the forward differences of these angles at different temporal scales are used to represent a motion capture sequence. Initially, K-means is applied to the training data to discover the most representative patterns of the orientation angles and their forward differences; for the angles, a novel K-means variant that takes into account the periodic nature of angular data is used. Each frame is then assigned to one or more of these patterns, and histograms describing the frequency of occurrence of the patterns are constructed for each movement. Nearest neighbour and SVM classification are used for action recognition on the test data. The effectiveness and robustness of the method are shown through extensive experiments on four standard motion capture databases and various experimental setups.
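One common way to make K-means respect the periodicity of angles is to embed each angle as (cos, sin) before clustering; the authors' exact variant may differ, so the sketch below is only an illustration of that idea, together with the histogram-of-patterns sequence descriptor.

```python
# Angle-aware K-means via unit-circle embedding, plus per-sequence histograms
# of the resulting pattern assignments.
import numpy as np
from sklearn.cluster import KMeans

def angular_kmeans(angles, n_clusters=20, seed=0):
    """angles: (n_frames, n_joints) in radians; cluster frames on the circle embedding."""
    emb = np.concatenate([np.cos(angles), np.sin(angles)], axis=1)
    return KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(emb)

def sequence_histogram(angles, km):
    labels = km.predict(np.concatenate([np.cos(angles), np.sin(angles)], axis=1))
    hist = np.bincount(labels, minlength=km.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy usage: two sequences of 15 joint angles over 200 frames.
rng = np.random.default_rng(5)
seqs = [rng.uniform(-np.pi, np.pi, size=(200, 15)) for _ in range(2)]
km = angular_kmeans(np.vstack(seqs))
print(np.round(sequence_histogram(seqs[0], km), 3))
```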
8.
Visual attention is effective in differentiating an object from its surroundings. In the top-down color attention (CA) method, color guides attention via a top-down, category-specific attention map. To highlight the entire object uniformly, our color attention map is reconstructed from estimated object patches. The object patches consist of strong patches and false weak patches whose contextual color attention values exceed the optimal threshold of the class-specific contextual color attention. The color attention map constructed from the object color histogram is then used to weight local shape features for object recognition. Extensive experiments show that our method provides state-of-the-art results on several challenging datasets.
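The sketch below illustrates only the core weighting idea in hedged form: a patch's attention value is taken as the similarity between its color histogram and a class-specific color model, and that value re-weights the patch's shape descriptor. The patch-reconstruction and thresholding steps from the abstract are simplified away, and the histogram choices are assumptions.

```python
# Top-down color attention as a re-weighting of local shape descriptors.
import numpy as np

def color_histogram(pixels_hsv, n_bins=16):
    hist, _ = np.histogram(pixels_hsv[..., 0], bins=n_bins, range=(0.0, 1.0))
    hist = hist.astype(float)
    return hist / max(hist.sum(), 1.0)

def attention_value(patch_hist, class_hist):
    """Histogram intersection as a simple similarity measure."""
    return np.minimum(patch_hist, class_hist).sum()

def weighted_shape_features(patches_hsv, shape_descs, class_hist):
    weights = np.array([attention_value(color_histogram(p), class_hist) for p in patches_hsv])
    return shape_descs * weights[:, None]   # attention-weighted local shape

# Toy usage: 50 patches with fake hue channels and 128-D shape descriptors.
rng = np.random.default_rng(6)
patches = rng.uniform(0, 1, size=(50, 8, 8, 3))
shapes = rng.normal(size=(50, 128))
class_hist = color_histogram(rng.uniform(0, 1, size=(500, 500, 3)))
print(weighted_shape_features(patches, shapes, class_hist).shape)  # (50, 128)
```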
9.
Human action recognition in videos remains an important yet challenging task. Existing methods based on RGB images or optical flow are easily affected by clutter and ambiguous backgrounds. In this paper, we propose a novel Pose-Guided Inflated 3D ConvNet framework (PI3D) to address this issue. First, we design a spatial–temporal pose module, consisting of pose estimation and pose-based action recognition, which provides essential clues for the Inflated 3D ConvNet (I3D). Second, for the multi-person estimation task, the introduced pose estimation network can determine the action most relevant to the action category. Third, we propose a hierarchical pose-based network to learn the spatial–temporal features of human pose. Moreover, the pose-based network and the I3D network are fused at the last convolutional layer without loss of performance. Finally, experimental results on four datasets (HMDB-51, SYSU 3D, JHMDB and Sub-JHMDB) demonstrate that the proposed PI3D framework outperforms existing methods on human action recognition. This work also shows that posture cues significantly improve the performance of I3D.
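As a rough illustration of fusion at the last convolutional layer, the following PyTorch sketch concatenates a pose-stream feature map with an I3D-style feature map along the channel dimension and classifies from the fused features. The backbones themselves are not included, and the channel counts and class count are placeholder assumptions, not the PI3D configuration.

```python
# Late fusion of two 3D feature maps by channel concatenation; the I3D and
# pose backbones are assumed to produce aligned (B, C, T, H, W) tensors.
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, rgb_channels=1024, pose_channels=256, n_classes=51):
        super().__init__()
        self.fuse = nn.Conv3d(rgb_channels + pose_channels, 512, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(512, n_classes)

    def forward(self, rgb_feat, pose_feat):
        x = torch.cat([rgb_feat, pose_feat], dim=1)  # fuse along channels
        x = torch.relu(self.fuse(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)

# Toy usage with random feature maps.
head = LateFusionHead()
logits = head(torch.randn(2, 1024, 4, 7, 7), torch.randn(2, 256, 4, 7, 7))
print(logits.shape)  # torch.Size([2, 51])
```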
10.
11.
Journal of Visual Communication and Image Representation, 2014, 25(5): 1082–1092
Sparse representation is a new approach that has received significant attention for image classification and recognition. This paper presents a PCA-based dictionary building method for sparse representation and classification of universal facial expressions. In our method, expressive facial images of each subject are subtracted from a neutral facial image of the same subject, and PCA is then applied to these difference images to model the variations within each class of facial expressions. The learned principal components are used as the atoms of the dictionary. In the classification step, a given test image is sparsely represented as a linear combination of the principal components of the six basic facial expressions. Extensive experiments on several publicly available face datasets (CK+, MMI, and Bosphorus) show that our framework outperforms state-of-the-art techniques in recognition rate by about 6%. The approach is promising and can further be applied to visual object recognition.
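A compact sketch of this pipeline is given below: per-class PCA on difference images (expressive minus neutral) supplies the dictionary atoms, a test difference image is sparse-coded over all atoms with orthogonal matching pursuit, and the class whose atoms reconstruct it best is selected. The number of components, sparsity level, and toy data are illustrative assumptions.

```python
# PCA-based dictionary plus sparse-representation classification by
# per-class reconstruction residual.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import orthogonal_mp

def build_pca_dictionary(diff_images_by_class, n_components=5):
    atoms, labels = [], []
    for c, X in diff_images_by_class.items():          # X: (n_samples, n_pixels)
        pca = PCA(n_components=n_components).fit(X)
        atoms.append(pca.components_)                   # (n_components, n_pixels)
        labels.extend([c] * n_components)
    D = np.vstack(atoms).T                              # (n_pixels, n_atoms)
    D /= np.linalg.norm(D, axis=0, keepdims=True)       # unit-norm atoms
    return D, np.array(labels)

def classify(diff_image, D, labels, n_nonzero=10):
    code = orthogonal_mp(D, diff_image, n_nonzero_coefs=n_nonzero)
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c
        residuals[c] = np.linalg.norm(diff_image - D[:, mask] @ code[mask])
    return min(residuals, key=residuals.get)

# Toy usage: 6 expression classes, 20 difference images each, 32x32 pixels.
rng = np.random.default_rng(7)
data = {c: rng.normal(c, 1.0, size=(20, 32 * 32)) for c in range(6)}
D, labels = build_pca_dictionary(data)
print(classify(rng.normal(2, 1.0, size=32 * 32), D, labels))
```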
12.
Journal of Visual Communication and Image Representation, 2014, 25(1): 24–38
Much of the existing work on action recognition combines simple features with complex classifiers or models to represent an action. The parameters of such models usually have no physical meaning, nor do they provide any qualitative insight relating the action to the actual motion of the body or its parts. In this paper, we propose a new representation of human actions called the sequence of most informative joints (SMIJ), which is extremely easy to interpret. At each time instant, we automatically select a few skeletal joints that are deemed most informative for the current action, based on highly interpretable measures such as the mean or variance of joint angle trajectories, and we then represent the action as the sequence of these most informative joints. Experiments on multiple databases show that the SMIJ representation is discriminative for human action recognition and performs better than several state-of-the-art algorithms.
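The selection step lends itself to a short sketch: within each temporal segment, rank joints by the variance of their angle trajectories and keep the indices of the top-k most informative ones, so the action becomes a sequence of joint-index sets. Segment length and k below are illustrative, not the paper's settings.

```python
# SMIJ-style selection: per-segment ranking of joints by angle variance.
import numpy as np

def smij_sequence(joint_angles, segment_len=10, top_k=3):
    """joint_angles: (n_frames, n_joints). Returns top-k joint indices per segment."""
    seq = []
    for start in range(0, len(joint_angles) - segment_len + 1, segment_len):
        segment = joint_angles[start:start + segment_len]
        variances = segment.var(axis=0)
        seq.append(tuple(np.argsort(variances)[::-1][:top_k]))
    return seq

# Toy usage: 60 frames, 20 joints; joint 7 is made deliberately "active".
rng = np.random.default_rng(8)
angles = rng.normal(scale=0.05, size=(60, 20))
angles[:, 7] += np.sin(np.linspace(0, 6 * np.pi, 60))
print(smij_sequence(angles))
```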