Similar Documents (20 results)
1.
Over the past few years, skeleton-based action recognition has achieved great success because skeleton data is immune to illumination variation, view-point variation, background clutter, scaling, and camera motion. However, effectively modeling the latent information in skeleton data is still a challenging problem. Therefore, in this paper, we propose a novel action embedding combined with a self-attention Transformer network for skeleton-based action recognition. Our proposed method mainly comprises two modules: (i) action embedding and (ii) a self-attention Transformer. The action embedding encodes the relationship between corresponding body joints (e.g., the joints of both hands move together when performing a clapping action) and thus captures the spatial features of the joints, while the temporal features and dependencies of the body joints are modeled with the Transformer architecture. Our method works in a single-stream (end-to-end) fashion, with a multi-layer perceptron (MLP) used for classification. We carry out an ablation study and evaluate the performance of our model on the small-scale SYSU-3D dataset and the large-scale NTU-RGB+D and NTU-RGB+D 120 datasets; the results establish that our method performs better than other state-of-the-art architectures.
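For illustration, the single-stream pipeline sketched in the abstract above (per-frame joint embedding, temporal self-attention, MLP classifier) could look roughly as follows in PyTorch. This is a minimal sketch, not the authors' implementation; the layer sizes, the use of nn.TransformerEncoder, and the mean-pooling over time are assumptions.

    import torch
    import torch.nn as nn

    class SkeletonTransformer(nn.Module):
        """Minimal sketch: embed per-frame joint coordinates, model temporal
        dependencies with self-attention, classify with an MLP head."""
        def __init__(self, num_joints=25, coord_dim=3, embed_dim=128,
                     num_heads=4, num_layers=2, num_classes=60):
            super().__init__()
            # "action embedding": project all joints of one frame to a single token
            self.embed = nn.Linear(num_joints * coord_dim, embed_dim)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
            self.mlp = nn.Sequential(
                nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                nn.Linear(embed_dim, num_classes))

        def forward(self, x):
            # x: (batch, frames, joints, 3)
            b, t, v, c = x.shape
            tokens = self.embed(x.reshape(b, t, v * c))   # (b, t, embed_dim)
            tokens = self.encoder(tokens)                 # temporal self-attention
            return self.mlp(tokens.mean(dim=1))           # pool over time, classify

    x = torch.randn(2, 64, 25, 3)      # 2 clips, 64 frames, 25 joints
    logits = SkeletonTransformer()(x)  # (2, 60)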

2.
To address the insufficient extraction of spatio-temporal features and the difficulty of capturing global context in skeleton-based action recognition, this work studies a human skeleton action recognition scheme that combines a spatio-temporal attention mechanism with an adaptive graph convolutional network. First, a spatio-temporal attention module based on non-local operations is built to help the model focus on the most discriminative frames and regions in the skeleton sequence. Second, an adaptive graph convolutional network is constructed by exploiting the feature-learning capability of Gaussian embedding functions and a lightweight convolutional neural network, while taking into account the influence of human-body prior knowledge at different stages. Finally, with the adaptive graph convolutional network as the backbone and the spatio-temporal attention module embedded in it, a two-stream fusion model is built from joint information, bone information, and their respective motion information. The algorithm reaches accuracies of 90.2% and 96.2% under the two evaluation protocols of the NTU RGB+D dataset, and its generality is demonstrated on the large-scale Kinetics dataset, verifying its superiority in extracting spatio-temporal features and capturing global context.
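The adaptive graph convolution described above (a data-dependent adjacency derived from a Gaussian embedding, added to fixed and learned adjacencies, in the style of 2s-AGCN layers) can be sketched as below. This is an illustrative sketch only; the channel sizes, the time-pooled attention, and the placeholder identity adjacency are assumptions, and the non-local attention module and the two-stream fusion are omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaptiveGraphConv(nn.Module):
        """Sketch of an adaptive graph convolution: the adjacency used for
        aggregation is the sum of a fixed skeleton adjacency A, a freely
        learned matrix B, and a data-dependent matrix C obtained from a
        Gaussian (dot-product) embedding of the input."""
        def __init__(self, in_channels, out_channels, A, embed_channels=16):
            super().__init__()
            self.register_buffer('A', A)                   # (V, V) fixed skeleton graph
            self.B = nn.Parameter(torch.zeros_like(A))     # globally learned offsets
            self.theta = nn.Conv2d(in_channels, embed_channels, 1)
            self.phi = nn.Conv2d(in_channels, embed_channels, 1)
            self.out = nn.Conv2d(in_channels, out_channels, 1)

        def forward(self, x):
            # x: (batch, channels, frames, joints)
            n, c, t, v = x.shape
            q = self.theta(x).mean(2)                      # (n, e, v) pooled over time
            k = self.phi(x).mean(2)                        # (n, e, v)
            C = F.softmax(torch.einsum('nev,neu->nvu', q, k), dim=-1)  # (n, v, v)
            A = self.A + self.B + C                        # adaptive adjacency
            y = torch.einsum('nctv,nvu->nctu', x, A)       # aggregate neighbours
            return self.out(y)

    # toy usage: 25-joint skeleton, identity adjacency as a placeholder
    layer = AdaptiveGraphConv(3, 64, torch.eye(25))
    out = layer(torch.randn(2, 3, 64, 25))                 # (2, 64, 64, 25)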

3.
Graph convolutional networks (GCNs) have proven to be an effective approach for 3D human pose estimation. By naturally modeling the skeleton structure of the human body as a graph, GCNs are able to capture the spatial relationships between joints and learn an efficient representation of the underlying pose. However, most GCN-based methods use a shared weight matrix, making it challenging to accurately capture the different and complex relationships between joints. In this paper, we introduce an iterative graph filtering framework for 3D human pose estimation, which aims to predict the 3D joint positions given a set of 2D joint locations in images. Our approach builds upon the idea of iteratively solving graph filtering with Laplacian regularization via the Gauss–Seidel iterative method. Motivated by this iterative solution, we design a Gauss–Seidel network (GS-Net) architecture, which uses weight and adjacency modulation, skip connections, and a pure convolutional block with layer normalization. Adjacency modulation facilitates the learning of edges that go beyond the inherent connections of body joints, resulting in an adjusted graph structure that reflects the human skeleton, while skip connections help maintain crucial information from the input layer's initial features as the network depth increases. We evaluate our proposed model on two standard benchmark datasets and compare it with a comprehensive set of strong baseline methods for 3D human pose estimation. Our experimental results demonstrate that our approach outperforms the baseline methods on both datasets, achieving state-of-the-art performance. Furthermore, we conduct ablation studies to analyze the contributions of different components of our model architecture and show that the skip connection and adjacency modulation help improve the model performance.
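The iterative view the paper builds on, Laplacian-regularized graph filtering solved with Gauss–Seidel sweeps, can be demonstrated numerically in a few lines of NumPy. This is a worked toy example of the underlying iteration, not the learned GS-Net; the 3-node graph, the regularization weight lam, and the iteration count are arbitrary.

    import numpy as np

    def gauss_seidel_graph_filter(X0, A, lam=1.0, num_iters=20):
        """Laplacian-regularized graph filtering: find X minimizing
        ||X - X0||^2 + lam * tr(X^T L X), i.e. solve (I + lam*L) X = X0,
        with node-by-node Gauss-Seidel sweeps. A is a (V, V) adjacency,
        X0 the (V, F) noisy node features."""
        L = np.diag(A.sum(1)) - A          # combinatorial graph Laplacian
        M = np.eye(len(A)) + lam * L       # system matrix (I + lam*L)
        X = X0.copy()
        for _ in range(num_iters):
            for i in range(len(A)):        # sweep nodes in order (Gauss-Seidel)
                r = X0[i] - M[i] @ X + M[i, i] * X[i]   # residual excluding node i
                X[i] = r / M[i, i]
        return X

    A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node chain
    X0 = np.array([[0.0, 1.0], [2.0, 2.0], [4.0, 3.0]])
    print(gauss_seidel_graph_filter(X0, A))  # smoothed node features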

4.
裴晓敏  范慧杰  唐延东 《红外与激光工程》2018,47(2):203007-0203007(6)
In human action recognition methods based on natural-scene images, factors such as occlusion, background interference, and uneven illumination affect the recognition results; recognition methods built on 3D human skeleton sequences can overcome these drawbacks. First, considering the spatio-temporal characteristics of human actions, a skeleton-based action recognition method using a deep network with fused spatio-temporal features is proposed. Second, a view-invariant feature representation is established from the geometric features of the skeleton: a CNN (Convolutional Neural Network) learns the local spatial features of the skeleton, an LSTM (Long Short-Term Memory) network operating in the spatial domain learns the correlations between skeleton joints, and an LSTM operating in the temporal domain learns the spatio-temporal correlations of the skeleton sequence. Finally, the proposed algorithm is validated on the NTU RGB+D database. Experimental results show that the recognition accuracy is improved and that the method is robust to multi-view skeletons.

5.
3D skeleton sequences contain more effective and discriminative information than RGB video and are more suitable for human action recognition. Accurate extraction of human skeleton information is key to high recognition accuracy. Considering the correlation between joint points, in this work we first propose a skeleton feature extraction method based on complex networks. The relationships between human skeleton points in each frame are coded as a network, and the change of the action over time is described by a time series of networks built from the skeleton points. Network topology attributes are used as feature vectors, and complex network coding is combined with an LSTM to recognize human actions. The method was verified on the NTU RGB+D 60, MSR Action3D and UTKinect-Action3D datasets, achieving good performance on each. This shows that extracting skeleton features based on complex networks can properly distinguish different actions. By considering the temporal information and the relationships between skeleton joints at the same time, the method plays an important role in the accurate recognition of human actions.
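A rough sketch of the pipeline described above: each frame's joints are encoded as a graph, simple topology attributes are taken as the frame's feature vector, and the sequence of feature vectors is classified with an LSTM. The distance-threshold graph construction, the choice of degree and clustering coefficient as attributes, and all dimensions are assumptions for illustration, not the paper's exact coding.

    import numpy as np
    import networkx as nx
    import torch
    import torch.nn as nn

    def frame_topology_features(joints, radius=0.4):
        """Encode one frame's joints as a graph (edge when two joints are
        closer than `radius`) and return per-node topology attributes:
        degree and clustering coefficient."""
        V = len(joints)
        G = nx.Graph()
        G.add_nodes_from(range(V))
        for i in range(V):
            for j in range(i + 1, V):
                if np.linalg.norm(joints[i] - joints[j]) < radius:
                    G.add_edge(i, j)
        deg = np.array([G.degree(i) for i in range(V)], dtype=np.float32)
        clust = np.array([nx.clustering(G, i) for i in range(V)], dtype=np.float32)
        return np.concatenate([deg, clust])          # (2 * V,)

    class TopologyLSTM(nn.Module):
        """Per-frame topology vectors form a sequence fed to an LSTM classifier."""
        def __init__(self, feat_dim, hidden=128, num_classes=60):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, num_classes)

        def forward(self, seq):                       # seq: (batch, frames, feat_dim)
            out, _ = self.lstm(seq)
            return self.fc(out[:, -1])                # classify from the last step

    skeleton_seq = np.random.randn(32, 25, 3).astype(np.float32)   # 32 frames, 25 joints
    feats = np.stack([frame_topology_features(f) for f in skeleton_seq])
    logits = TopologyLSTM(feats.shape[1])(torch.from_numpy(feats)[None])  # (1, 60)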

6.
Modeling and reasoning about the interactions between multiple entities (actors and objects) is beneficial for the action recognition task. In this paper, we propose a 3D Deformable Convolution Temporal Reasoning (DCTR) network to model and reason about the latent relationship dependencies between different entities in videos. The proposed DCTR network consists of a spatial modeling module and a temporal reasoning module. The spatial modeling module uses 3D deformable convolution to capture relationship dependencies between different entities in the same frame, while the temporal reasoning module uses Conv-LSTM to reason about how these relationship dependencies change over time. Experiments on the Moments-in-Time, UCF101 and HMDB51 datasets demonstrate that the proposed method outperforms several state-of-the-art methods.

7.
Most existing Action Quality Assessment (AQA) methods for scoring sports videos focus on evaluating a single action, or several sequentially defined actions, performed in short sports videos such as diving and vault. They typically extract features directly from RGB videos through 3D ConvNets, which mixes the features with ambiguous scene information. To investigate the effectiveness of deep pose feature learning for automatically evaluating the complicated activities in long-duration sports videos, such as figure skating and artistic gymnastics, we propose a skeleton-based deep pose feature learning method. For pose feature extraction, a spatial–temporal pose extraction module (STPE) is built to capture the subtle changes of human body movements and obtain detailed representations of the skeletal data in the space and time dimensions. For temporal information representation, an inter-action temporal relation extraction module (ATRE) is implemented with a recurrent neural network to model the dynamic temporal structure of skeletal subsequences. We evaluate the proposed method on the figure skating activities of the MIT-Skate and FIS-V datasets. The experimental results show that the proposed method is more effective than RGB video-based deep feature learning methods, including SENet and C3D. Significant progress has been achieved for the Spearman Rank Correlation (SRC) on the MIT-Skate dataset. On the FIS-V dataset, better SRC and MSE between the predicted scores and the judges' scores are achieved for both the Total Element Score (TES) and the Program Component Score (PCS), compared with the SENet and C3D feature methods.
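As a small aside on the evaluation metric: the Spearman Rank Correlation (SRC) used above compares only the ranking of predicted scores with the ranking of the judges' scores. A toy example with made-up numbers:

    from scipy.stats import spearmanr

    # Made-up judge scores and model predictions for five routines.
    judge_scores     = [62.3, 71.8, 55.4, 80.1, 68.0]
    predicted_scores = [60.0, 70.5, 58.2, 78.9, 66.1]

    rho, p_value = spearmanr(judge_scores, predicted_scores)
    print(f"SRC = {rho:.3f}")   # 1.000 here: every routine is ranked identically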

8.
Anomaly behavior detection plays a significant role in emergencies such as robbery. Although many works have been proposed to deal with this problem, performance in real applications is still relatively low. Here, to detect abnormal human behavior in videos, we propose a multiscale spatial temporal attention graph convolution network (MSTA-GCN) to capture and cluster the features of the human skeleton. First, based on the human skeleton graph, a multiscale spatial temporal attention graph convolution block (MSTA-GCB) is built, which contains multiscale graphs in the temporal and spatial dimensions. The MSTA-GCB can simulate the motion relations of human body components at different scales, where each scale corresponds to a different granularity of annotation on the human skeleton. Then, static, globally-learned and attention-based adjacency matrices in the graph convolution module are proposed to capture hierarchical representations. Finally, extensive experiments are carried out on the ShanghaiTech Campus and CUHK Avenue datasets; the frame-level AUC/EER results are 0.759/0.311 and 0.876/0.192, respectively. Moreover, the frame-level AUC is 0.768 for the human-related ShanghaiTech subset. These results show that our MSTA-GCN outperforms most methods in video anomaly detection, and we obtain a new state-of-the-art performance in skeleton-based anomaly behavior detection.

9.
Human action analysis has been an active research area in computer vision and has many useful applications such as human-computer interaction. Most state-of-the-art approaches to human action analysis are data-driven and focus on general action recognition. In this paper, we aim to analyze fitness actions from skeleton sequences and propose an efficient and robust fitness action analysis framework. Firstly, fitness actions from 15 subjects are captured and compiled into a fitness action dataset (Fitness-28). Secondly, skeleton information is extracted and aligned with a simplified human skeleton model. Thirdly, the aligned skeleton information is transformed into a uniform human-center coordinate system with the proposed spatial–temporal skeleton encoding method. Finally, an action classifier and a local–global geometrical registration strategy are constructed to analyze the fitness actions. Experimental results demonstrate that our method can effectively assess fitness actions and performs well in an artificial-intelligence fitness system.
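A minimal sketch of the "uniform human-center coordinate system" idea mentioned above: translate each frame so a reference joint (the hip) becomes the origin, then rotate about the vertical axis so the shoulder line has a fixed orientation. The joint indices, the choice of reference bones, and the vertical-axis convention are illustrative assumptions, not taken from the paper.

    import numpy as np

    def to_body_centered(joints, hip_idx=0, l_shoulder_idx=4, r_shoulder_idx=8):
        """Translate so the hip joint is the origin, then rotate about the
        vertical (z) axis so the shoulder line is parallel to the x-axis."""
        centered = joints - joints[hip_idx]                  # hip becomes the origin
        d = centered[r_shoulder_idx] - centered[l_shoulder_idx]
        angle = np.arctan2(d[1], d[0])                       # yaw of the shoulder line
        c, s = np.cos(-angle), np.sin(-angle)
        Rz = np.array([[c, -s, 0.0],
                       [s,  c, 0.0],
                       [0.0, 0.0, 1.0]])
        return centered @ Rz.T                               # rotate every joint

    frame = np.random.randn(25, 3)
    aligned = to_body_centered(frame)
    print(np.allclose(aligned[0], 0.0))                      # hip now sits at the origin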

10.
Depth estimation from light-field images is a key technique in applications such as 3D reconstruction, autonomous driving, and object tracking. However, existing deep learning methods ignore the geometric properties of light-field images and learn poorly in regions such as edges and weak textures, leading to a loss of detail in the depth map. This paper proposes a semantically guided depth estimation network for light-field images that exploits contextual information to handle such difficult regions. An encoder-decoder structure with a semantic-aware module is designed to reconstruct spatial information and better capture object boundaries; a spatial pyramid pooling structure uses dilated (atrous) convolutions to enlarge the receptive field and mine multi-scale contextual information; an adaptive feature attention module without dimensionality reduction performs local cross-channel interaction, removing redundant information while effectively fusing multi-path features; finally, a stacked-hourglass design chains several hourglass modules so that richer contextual information is obtained through their encoder-decoder structures. Experimental results on the HCI 4D light-field dataset show that the method achieves high accuracy and generalization ability, outperforms the compared depth estimation methods, and preserves edge details well.
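The dilated (atrous) spatial-pyramid idea mentioned above, parallel convolutions with growing dilation to enlarge the receptive field and gather multi-scale context, can be sketched as follows; the channel counts and dilation rates are assumptions, and the semantic-aware encoder-decoder, channel attention, and stacked-hourglass parts are omitted.

    import torch
    import torch.nn as nn

    class DilatedPyramid(nn.Module):
        """Parallel 3x3 convolutions with increasing dilation gather
        multi-scale context at the same resolution; a 1x1 convolution
        fuses the branches."""
        def __init__(self, in_ch, out_ch, dilations=(1, 2, 4, 8)):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d)
                for d in dilations])
            self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, 1)

        def forward(self, x):
            feats = [b(x) for b in self.branches]     # same spatial size, growing context
            return self.fuse(torch.cat(feats, dim=1))

    y = DilatedPyramid(64, 32)(torch.randn(1, 64, 48, 48))   # (1, 32, 48, 48)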

11.
The increasing importance of skeleton information in surveillance big-data feature analysis demands significant storage space, and the development of an effective and efficient storage solution is still a challenging task. In this paper, we propose a new framework for the lossless compression of skeleton sequences by exploiting both spatial and temporal prediction and coding redundancies. Firstly, we propose a set of skeleton prediction modes, namely spatial differential-based, motion vector-based, relative motion vector-based, and trajectory-based prediction modes. These modes can effectively handle both the spatial and temporal redundancies present in skeleton sequences. Secondly, we further enhance performance by introducing a novel approach to handle coding redundancy. Our proposed scheme significantly reduces the size of the skeleton data while preserving the skeleton exactly, thanks to the lossless compression approach. Experiments are conducted on standard surveillance and Posetrack action datasets containing challenging test skeleton sequences. Our method clearly outperforms traditional direct coding methods, providing average bit savings of 73% and 66% on the two datasets.
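To make the prediction-plus-residual idea concrete, here is a toy lossless codec that stores the first frame, predicts every later joint from its position in the previous frame (a simple temporal/motion predictor), and entropy-codes the integer residuals with zlib. This only illustrates one prediction mode on integer joint coordinates; the paper's mode selection and coding-redundancy handling are not reproduced.

    import numpy as np
    import zlib

    def compress_skeleton(seq):
        """Store frame 0 as-is, then code only the temporal residuals."""
        seq = np.asarray(seq, dtype=np.int32)          # e.g. pixel-space joint coords
        residual = np.diff(seq, axis=0)                # temporal prediction residuals
        payload = np.concatenate([seq[:1].ravel(), residual.ravel()])
        return seq.shape, zlib.compress(payload.astype(np.int32).tobytes())

    def decompress_skeleton(shape, blob):
        flat = np.frombuffer(zlib.decompress(blob), dtype=np.int32)
        data = flat.reshape(shape[0], -1)
        return np.cumsum(data, axis=0).reshape(shape)  # undo the differencing

    steps = np.random.randint(-2, 3, size=(300, 17, 2))       # slow joint motion
    seq = 960 + np.cumsum(steps, axis=0)                      # 300 frames, 17 2D joints
    shape, blob = compress_skeleton(seq)
    restored = decompress_skeleton(shape, blob)
    print(np.array_equal(restored, seq), len(blob), seq.size * 4)  # lossless, far fewer bytes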

12.

Detecting small-scale pedestrians in aerial images is a challenging task that can be difficult even for humans. We observe that single-image based methods cannot achieve robust performance because of the poor visual cues of small instances. Considering that multiple frames may provide more information for detecting such difficult cases than a single frame alone, we design a novel video-based pedestrian detection method with a two-stream network pipeline that fully utilizes the temporal and contextual information of a video. An aggregated feature map is proposed to absorb the spatial and temporal information with the help of spatial and temporal sub-networks. To better capture motion information, a more refined flow network (SPyNet) is adopted instead of a simple flow net. In the spatial stream sub-network, we modify the backbone network structure by increasing the feature-map resolution with a relatively larger receptive field to make it suitable for small-scale detection. Experimental results on drone video datasets demonstrate that our approach improves detection accuracy for small-scale instances and reduces false positive detections. By exploiting the temporal information and aggregating the feature maps, our two-stream method improves detection performance by 8.48% in mean Average Precision (mAP) over the basic single-stream R-FCN method, and it outperforms the state-of-the-art method by 3.09% on the Okutama Human-action dataset.

13.
刘桂玉  刘佩林  钱久超 《信息技术》2020,(5):121-124,130
Action recognition based on 3D skeletons has become an important means of human-computer interaction. To improve the accuracy of 3D action recognition, this paper proposes a two-stream neural network that fuses 3D skeleton features with 2D image features: one network processes the 3D skeleton sequence and the other processes the 2D images, and the features of the two are then fused to improve recognition accuracy. Compared with action recognition using 3D skeletons alone, the method achieves a considerable accuracy gain on both the NTU_RGBD dataset and the SYSU dataset.

14.
15.
罗元  李丹  张毅 《半导体光电》2020,41(3):414-419
Sign language recognition is widely used in communication between deaf and hearing people. To address the low recognition rate caused by insufficient extraction of spatio-temporal features in sign language recognition, a novel spatio-temporal attention based sign language recognition model is proposed. First, a spatial attention module based on a residual 3D convolutional network (Res3DCNN) is proposed to automatically attend to salient spatial regions; then a temporal attention module based on a convolutional long short-term memory network (ConvLSTM) is proposed to weigh the importance of video frames. The key of the proposed algorithm is to attend to salient regions in space and to automatically select key frames in time. Finally, the effectiveness of the algorithm is verified on the CSL sign language dataset.
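The temporal-attention idea above, scoring frames and letting the softmax-normalized scores weight the clip representation, can be sketched as follows; the plain per-frame scoring network and the feature dimensions are assumptions, whereas the paper builds its temporal attention on ConvLSTM features.

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        """Score each frame's feature vector, normalize the scores over time
        with a softmax, and return the attention-weighted clip feature."""
        def __init__(self, feat_dim):
            super().__init__()
            self.score = nn.Sequential(nn.Linear(feat_dim, feat_dim // 2),
                                       nn.Tanh(),
                                       nn.Linear(feat_dim // 2, 1))

        def forward(self, frames):                 # frames: (batch, time, feat_dim)
            w = torch.softmax(self.score(frames), dim=1)   # (batch, time, 1)
            return (w * frames).sum(dim=1), w.squeeze(-1)  # weighted clip feature

    feats = torch.randn(2, 16, 256)                # 16 frames of 256-d features
    clip, weights = TemporalAttention(256)(feats)  # clip: (2, 256); weights sum to 1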

16.
Human action recognition in videos is still an important yet challenging task. Existing methods based on RGB images or optical flow are easily affected by clutter and ambiguous backgrounds. In this paper, we propose a novel Pose-Guided Inflated 3D ConvNet framework (PI3D) to address this issue. First, we design a spatial–temporal pose module, which provides essential clues for the Inflated 3D ConvNet (I3D); the pose module consists of pose estimation and pose-based action recognition. Second, for the multi-person estimation task, the introduced pose estimation network can determine the action most relevant to the action category. Third, we propose a hierarchical pose-based network to learn the spatial–temporal features of human pose. Moreover, the pose-based network and the I3D network are fused at the last convolutional layer without loss of performance. Finally, experimental results on four datasets (HMDB-51, SYSU 3D, JHMDB and Sub-JHMDB) demonstrate that the proposed PI3D framework outperforms existing methods on human action recognition. This work also shows that posture cues significantly improve the performance of I3D.

17.
刘强  张文英  陈恩庆 《信号处理》2020,36(9):1422-1428
Human action recognition has many applications in fields such as human-computer interaction and video content retrieval, and is an important research direction in multimedia information processing. Most existing two-stream action recognition methods use the same convolutional network on both streams to process RGB and optical-flow data; they make little use of multimodal information and easily lead to network redundancy and the misclassification of similar actions. In recent years, depth video has also been increasingly used for action recognition, but most methods only exploit the spatial information of actions in depth video and ignore the temporal information. To address these problems, this paper proposes a multimodal action recognition method based on heterogeneous multi-stream networks. The method first obtains a temporal feature representation of the action from the depth video, namely depth optical flow, then selects suitable heterogeneous networks for spatio-temporal feature extraction and classification, and finally performs multimodal fusion of the recognition results from the RGB data, the optical flow extracted from RGB, the depth video, and the depth optical flow. Experiments on the widely used large-scale action recognition dataset NTU RGB+D show that the recognition performance of the proposed method is superior to that of existing state-of-the-art methods.
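The final fusion step described above is a score-level (late) fusion of the four streams. A minimal sketch, with made-up softmax outputs and equal (assumed) stream weights:

    import numpy as np

    def late_fuse(stream_scores, weights=None):
        """Average the per-stream softmax score vectors, optionally weighted."""
        scores = np.stack(stream_scores)                     # (streams, classes)
        w = np.ones(len(scores)) if weights is None else np.asarray(weights, float)
        w = w / w.sum()
        return (w[:, None] * scores).sum(axis=0)

    rgb        = np.array([0.10, 0.70, 0.20])     # toy 3-class softmax outputs
    rgb_flow   = np.array([0.05, 0.60, 0.35])
    depth      = np.array([0.20, 0.50, 0.30])
    depth_flow = np.array([0.15, 0.55, 0.30])

    fused = late_fuse([rgb, rgb_flow, depth, depth_flow])
    print(fused, fused.argmax())                  # class 1 wins after fusion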

18.
Vehicular networks play a pivotal role in intelligent transportation system (ITS) and smart city (SC) construction, especially with the coming of 5G. Mobility models are crucial parts of vehicular networks, especially for routing policy evaluation and traffic flow management. Big-data-aided vehicle mobility analysis and design have attracted much research interest with the acceleration of big data technology. In addition, complex network theory reveals the intrinsic temporal and spatial characteristics of vehicular networks, given their dynamic nature. In the following, a large GPS dataset collected in Beijing is introduced and its complex-network features are verified. Novel vehicle and location collaborative mobility schemes are proposed based on the GPS dataset. We evaluate their performance in terms of complex features, such as the duration distribution, the interval-time distribution, and temporal and spatial characteristics. This paper elaborates upon mobility design and graph analysis of vehicular networks.

19.
In recent years, skeleton-based human action recognition has received extensive attention because of the robustness and generalization ability of skeleton data. Among these approaches, graph convolutional networks that model the human skeleton as a spatio-temporal graph have achieved remarkable performance. However, graph convolutions mainly learn long-term interactions through a series of 3D convolutions; such interactions are biased towards locality and limited by the convolution kernel size, so long-range dependencies cannot be captured effectively. This paper proposes a cooperative convolutional Transformer network (Co-ConvT), which introduces the self-attention mechanism of the Transformer to build long-range dependencies and combines it with graph convolutional networks (GCNs) for action recognition, so that the model can extract local information through the GCN while capturing rich long-range dependencies through the Transformer. In addition, since the self-attention of the Transformer is computed at the pixel level and therefore incurs a very high computational cost, the network is divided into two stages: the first stage uses pure convolutions to extract shallow spatial features, and the second stage uses the proposed ConvT block to capture high-level semantic information, which reduces the computational complexity. Furthermore, the linear embedding in the original Transformer is replaced with a convolutional embedding to enhance local spatial information, which allows the positional encoding of the original model to be removed and makes the model lighter. Experiments on two large-scale benchmark datasets, NTU-RGB+D and Kinetics-Skeleton, show that the model…

20.
Individual recognition from locomotion is a challenging task owing to large intra-class and small inter-class variations. In this article, we present a novel metric learning method for individual recognition from skeleton sequences. Firstly, we propose to model the articulated body on a Riemannian manifold to describe the essence of human motion, which can reflect the biometric signatures of the enrolled individuals. Then two spatio-temporal metric learning approaches are proposed, namely Spatio-Temporal Large Margin Nearest Neighbor (ST-LMNN) and Spatio-Temporal Multi-Metric Learning (STMM), to learn discriminant bilinear metrics which can encode the spatio-temporal structure of human motion. Specifically, the ST-LMNN algorithm extends the bilinear model into the classical Large Margin Nearest Neighbor method, learning a low-dimensional local linear embedding in the spatial and temporal domains, respectively. To further capture the unique motion pattern of each individual, the proposed STMM algorithm learns a set of individual-specific spatio-temporal metrics, which make the projected features of the same person closer to their class mean than to those of different classes by a large margin. Beyond that, we present a new publicly available dataset for locomotion recognition to evaluate the influence of both internal and external covariate factors. Experimental results on the three public datasets show that both proposed approaches achieve competitive results in individual recognition.
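The bilinear spatio-temporal metric at the core of ST-LMNN-style methods can be sketched directly: a sequence is a (spatial x temporal) matrix, one projection acts on the spatial axis and one on the temporal axis, and the distance is the squared Frobenius norm of the projected difference. The random U and V below stand in for the learned projections; the margin-based training itself is omitted.

    import numpy as np

    def bilinear_distance(X, Y, U, V):
        """Squared Frobenius norm of the bilinearly projected difference."""
        D = U.T @ (X - Y) @ V
        return float(np.sum(D ** 2))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((75, 40))   # e.g. 25 joints x 3 coords, 40 frames
    Y = rng.standard_normal((75, 40))
    U = rng.standard_normal((75, 10))   # low-dimensional spatial embedding
    V = rng.standard_normal((40, 8))    # low-dimensional temporal embedding
    print(bilinear_distance(X, Y, U, V), bilinear_distance(X, X, U, V))  # > 0, 0.0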
