Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
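In recent years, skeleton-based human action recognition has attracted wide attention owing to the robustness and generalization ability of skeleton data. Graph convolutional networks that model the human skeleton as a spatio-temporal graph have achieved remarkable performance. However, graph convolution learns long-term interactions mainly through a series of 3D convolutions; these interactions are biased toward local neighborhoods and limited by the kernel size, so long-range dependencies cannot be captured effectively. This paper proposes a collaborative convolutional Transformer network (Co-ConvT) that establishes long-range dependencies through the Transformer's self-attention mechanism and combines it with graph convolutional networks (GCNs) for action recognition, so that the model both extracts local information through the GCN and captures rich long-range dependencies through the Transformer. In addition, because the Transformer's self-attention is computed at the pixel level, it incurs a very high computational cost. The model therefore divides the network into two stages: the first uses pure convolution to extract shallow spatial features, and the second uses the proposed ConvT blocks to capture high-level semantic information, which reduces computational complexity. Furthermore, the linear embedding of the original Transformer is replaced with a convolutional embedding, which enhances local spatial information and allows the positional encoding of the original model to be removed, making the model more lightweight. In experiments on two large-scale authoritative datasets, NTU-RGB+D and Kinetics-Skeleton, the model ...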
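
As a rough illustration of the self-attention mechanism this abstract refers to, the sketch below computes scaled dot-product self-attention over a handful of joint feature vectors. It is not the Co-ConvT implementation: the identity query/key/value projections, the function names, and the toy input are assumptions made only to keep the example short.

```python
import math


def softmax(row):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    x: list of n joint feature vectors, each of length d.
    Returns (output, weights), where weights[i][j] is how strongly
    joint i attends to joint j.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    weights = []
    for q in x:
        # Similarity of this joint's query to every joint's key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        weights.append(softmax(scores))
    # Each output row is the attention-weighted average of all value vectors.
    output = [
        [sum(w * v[c] for w, v in zip(w_row, x)) for c in range(d)]
        for w_row in weights
    ]
    return output, weights


# Hypothetical toy input: 4 joints with 2-dimensional features.
joints = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
attended, attn = self_attention(joints)
```

Every joint receives a non-zero weight from every other joint regardless of how far apart they are in the skeleton graph, which is the long-range behavior that a fixed-size convolution kernel lacks.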

2.
3.
Motivated by the powerful capability of deep neural networks in feature learning, a new graph-based neural network is proposed to learn local and global relational information on skeleton sequences represented as spatio-temporal graphs (STGs). The pipeline of our network architecture consists of three main stages. In the first stage, spatial–temporal sub-graphs (sub-STGs) are projected into a latent space in which every point is represented as a linear subspace. The second stage uses message passing to acquire the localized correlated features of the nodes in the latent space. The third stage relies on graph convolutional networks (GCNs) to reason about the long-range spatio-temporal dependencies through a graph representation of the latent space. Finally, an average pooling layer and a softmax classifier are employed to predict the action categories based on the extracted local and global correlations. We validate our model on action recognition using three challenging datasets: the NTU RGB+D, Kinetics Motion, and SBU Kinect Interaction datasets. The experimental results demonstrate the effectiveness of our approach and show that our proposed model outperforms the state-of-the-art methods.

4.
Three-dimensional human pose estimation (3D HPE) has broad application prospects in trajectory prediction, posture tracking, and action analysis. However, frequent self-occlusions and the substantial depth ambiguity of two-dimensional (2D) representations hinder further improvement in accuracy. In this paper, we propose a novel video-based human body geometry-aware network to mitigate these problems. Our network becomes implicitly aware of the geometric constraints of the human body by capturing spatial and temporal context information from 2D skeleton data. Specifically, a novel skeleton attention (SA) mechanism is proposed to model geometric context dependencies among different body joints, thereby improving the spatial feature representation ability of the network. To enhance temporal consistency, a novel multilayer perceptron (MLP)-Mixer-based structure is used to learn temporal context information comprehensively from input sequences. We conduct experiments on challenging publicly available datasets to evaluate the proposed approach. The results outperform the previous best approach by 0.5 mm on the Human3.6M dataset. The approach also shows significant improvements on the HumanEva-I dataset.

5.
Anomalous behavior detection plays a significant role in emergencies such as robbery. Although many methods have been proposed to deal with this problem, performance in real applications is still relatively low. Here, to detect abnormal human behavior in videos, we propose a multiscale spatial temporal attention graph convolution network (MSTA-GCN) to capture and cluster the features of the human skeleton. First, based on the human skeleton graph, a multiscale spatial temporal attention graph convolution block (MSTA-GCB) is built, which contains multiscale graphs in the temporal and spatial dimensions. MSTA-GCB can model the motion relations of human body components at different scales, where each scale corresponds to a different granularity of annotation on the human skeleton. Then, static, globally learned, and attention-based adjacency matrices are proposed in the graph convolution module to capture hierarchical representations. Finally, extensive experiments are carried out on the ShanghaiTech Campus and CUHK Avenue datasets; the frame-level AUC/EER results are 0.759/0.311 and 0.876/0.192, respectively. Moreover, the frame-level AUC is 0.768 on the human-related ShanghaiTech subset. These results show that MSTA-GCN outperforms most methods in video anomaly detection and achieves a new state of the art in skeleton-based anomalous behavior detection.
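
For readers unfamiliar with how an adjacency matrix enters a graph convolution, the following is a minimal sketch of one layer using only a static, symmetrically normalized adjacency matrix. It does not reproduce MSTA-GCN: the globally learned and attention-based adjacency matrices are omitted, and the three-joint chain, the identity weight matrix, and all function names are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square 0/1 adjacency matrix."""
    n = len(adj)
    # Add self-loops so each joint keeps its own features.
    a_hat = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    deg = [sum(row) for row in a_hat]
    return [
        [a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    # Plain matrix product for lists of lists.
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def graph_conv(adj, features, weight):
    """One graph convolution layer: ReLU(A_norm @ X @ W)."""
    h = matmul(matmul(normalize_adjacency(adj), features), weight)
    return [[max(0.0, v) for v in row] for row in h]


# Hypothetical 3-joint chain, e.g. shoulder - elbow - wrist.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, so only the graph mixing is visible
out = graph_conv(adj, features, weight)
```

Each output row mixes a joint's features with those of its direct neighbors only; learned or attention-based adjacency matrices, as described in the abstract, would replace the fixed `adj` with data-dependent connection strengths.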

6.
In this paper, a method is proposed to improve the accuracy of 3D hand pose estimation. Existing methods make poor use of the depth information of hand joints and have difficulty estimating the 3D coordinates accurately. To solve this problem, a method that utilizes the information between adjacent joints of each finger is proposed to estimate the depth coordinates of joints. To make full use of 2D information for depth estimation, this paper divides hand pose estimation into two sub-tasks: 2D hand joint estimation and depth estimation. For depth estimation, a multi-stage network is proposed. We first estimate the depth of a subset of hand joints; with the help of these estimates and the 2D information, the depth coordinates of adjacent joints can then be estimated well. The proposed method is shown to be effective on three public hand pose datasets through self-comparisons. Compared with methods based on 2D CNNs, our method achieves state-of-the-art performance on the ICVL and NYU datasets and also obtains good results on the MSRA dataset.

7.
Hand pose estimation is a challenging task owing to the high flexibility and serious self-occlusion of the hand. Therefore, an optimized convolutional pose machine (OCPM) was proposed in this study to estimate the hand pose accurately. Traditional CPMs have two components: a feature extraction module and an information processing module. First, the backbone network of the feature extraction module was replaced by ResNet-18 to reduce the number of network parameters. Furthermore, an attention module called the convolutional block attention module (CBAM) was embedded into the feature extraction module to enhance information extraction. Then, the structure of the information processing module was adjusted by adding a residual connection to each stage, which consists of a series of consecutive convolutional operations and requires a dense fusion between the outputs of all previous stages and the feature extraction module. The experimental results on two public datasets showed that the OCPM network achieved excellent performance.

8.
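Compared with RGB video data, human skeleton-point data offers better environmental adaptability and action-expression capability, so action recognition algorithms based on skeleton-point data have received increasingly wide attention and study. In recent years, skeleton-based action recognition models built on graph convolutional networks (GCNs) have shown very good performance, but most GCN-based models use a fixed spatial-configuration partitioning strategy and manually set the connections between skeleton points, and therefore cannot adapt well to the varying characteristics of different actions. ...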

9.
To address the challenging task of person re-identification (Re-ID) in occluded scenarios, we propose a novel approach that divides local units by forming high-level semantic information about pedestrians and generates features for occluded parts. The approach uses a CNN and pose estimation to extract the feature map and key points, and a graph convolutional network to learn the relations among key points. Specifically, we design a Generating Local Part (GLP) module to divide the feature map into different units. The partition mode of GLP is highly flexible and varies with the occlusion conditions. The features of the non-occluded parts are clustered into an intermediate node, and the spatially correlated features of the occluded parts are then generated by a de-clustering operation. We conduct experiments on both occluded and holistic datasets to demonstrate the effectiveness of the approach.

10.
Recently, convolutional neural networks (CNNs) have been employed to address the problem of hand pose estimation. In this work, we introduce an end-to-end deep architecture that can accurately estimate hand pose through the joint use of model-based and fine-tuning methods. In the model-based stage, we make use of the prior information in hand model geometry to ensure the geometric validity of the estimated poses. Next, we introduce a fine-tuning approach that learns to correct the errors between the model and the observed hand. Our approach is validated on three challenging public datasets and achieves state-of-the-art performance.

11.
Various artifacts appear whenever an image is compressed by a lossy compression algorithm. Compression tends to eliminate the higher-frequency details present in the image. In certain cases, compression may also introduce small image structures and noise. These effects limit image quality and make images appear much less pleasant to the human eye. Furthermore, the performance of other machine learning tasks, such as object detection, is reduced by compression. In this paper, we introduce a novel deep neural network with densely connected parallel convolutions to remove such artifacts and recover the original image from its perturbed version. The proposed algorithm is named the densely connected parallel convolutional neural network (DPCNN). Parallel convolution provides model parallelism and reduces the training burden. Furthermore, the dense skip connections provide short paths for gradient back-propagation and alleviate the vanishing gradient problem. Moreover, skip connections reduce feature redundancy by combining features from different levels and increase learning efficiency. However, these skip connections increase model complexity. A bottleneck layer is therefore used to keep the model compact and to reduce its complexity. The proposed approach can be used as a preprocessing step in computer vision tasks where images are degraded by compression. Unlike other methods, the proposed method can remove compression artifacts generated at any quality factor (QF). Experiments on benchmark datasets show the superiority of the proposed method over other methods both quantitatively and qualitatively.

12.
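Ye Jun, Zhang Yun. 《光电子.激光》 (Journal of Optoelectronics·Laser), 2022, (12): 1306-1314
At present, common three-dimensional (3D) human pose estimation algorithms achieve good results in representation learning, but problems such as poor estimation accuracy at the joints of the human skeleton remain. How to effectively exploit the redundant spatio-temporal information of two-dimensional (2D) pose sequences from monocular RGB images to estimate human pose is therefore a difficult research problem. This paper proposes a 3D human pose estimation algorithm based on a spatio-temporal multi-feature fusion network. Specifically, it uses a hierarchical spatio-temporal multi-feature fusion method that combines image appearance information with temporal motion information; the method uses a compact convolutional neural network (CNN) to learn spatio-temporal information and maps 2D joint positions to 3D joint positions. Experimental results show that the proposed method achieves relatively advanced end-to-end pose estimation accuracy without any pose refinement in a post-processing stage, and that the average accuracy of the estimated poses is effectively improved, demonstrating that the method can effectively improve the accuracy of human pose estimation.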

13.
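Synchronization difficulty in low signal-to-noise-ratio (SNR) communication environments is a problem that any coded system must solve. To address it, a code-aided iterative phase estimation algorithm based on soft information is proposed. The method processes carrier phase synchronization, demodulation, and decoding jointly, feeding the reliability measures of the code bits output by the decoder directly back to the estimation unit to improve synchronization accuracy. Because current systems mostly use non-systematic convolutional codes, the paper applies the principles of log-likelihood algebra to derive a soft-output Viterbi decoding algorithm suitable for iteration with non-systematic codes, which effectively reduces computational complexity. Simulation results show that with only 5 iterations the system bit error rate reaches 10^-6 at an SNR of 3 dB.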
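
The basic operation of log-likelihood algebra, which this abstract builds on, is the "box-plus" combination of two log-likelihood ratios (LLRs). The sketch below shows the exact form and the common min-sum approximation. It is not the paper's soft-output Viterbi derivation; the function names and sample values are illustrative, and the convention assumed is L = ln(P(bit = 0) / P(bit = 1)).

```python
import math


def boxplus(l1, l2):
    """Exact LLR of the XOR of two independent bits with LLRs l1 and l2.

    Note: for very large |l1| and |l2| the tanh product rounds to 1.0 and
    atanh raises ValueError; practical decoders clip their LLRs first.
    """
    return 2.0 * math.atanh(math.tanh(l1 / 2.0) * math.tanh(l2 / 2.0))


def boxplus_minsum(l1, l2):
    """Min-sum approximation: sign(l1) * sign(l2) * min(|l1|, |l2|)."""
    sign = math.copysign(1.0, l1) * math.copysign(1.0, l2)
    return sign * min(abs(l1), abs(l2))


# Hypothetical LLRs: one bit is probably 0, the other probably 1.
exact = boxplus(2.0, -3.0)
approx = boxplus_minsum(2.0, -3.0)
```

The approximation keeps the sign of the exact result and never understates its magnitude, which is why it is a popular way to trade a little accuracy for lower computational complexity in soft-output decoders.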

14.
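Aspect-based sentiment analysis aims to identify the sentiment polarity of a specific aspect in a sentence and is a fine-grained sentiment analysis task. Traditional attention-based methods perform only a single kind of semantic interaction between words and do not establish syntactic interaction between aspect words and context words, so aspect words mistakenly attend to context words that are syntactically unrelated to them. In addition, a word's positional distance feature and syntactic distance feature reflect its position in the linear form of the sentence and in the syntactic dependency tree, respectively, yet methods that process syntactic information with graph convolutional networks ignore these distance features, so irrelevant information far from the aspect word interferes with its sentiment analysis. To address these problems, this paper proposes a multi-interaction graph convolutional network (MIGCN). First, the positional distance features of context words are fed into each graph convolution layer; at the same time, the syntactic distance features of context words in the dependency tree are used to weight the adjacency matrix of the graph convolutional network; finally, semantic interaction and syntactic interaction are designed to process the semantic and syntactic information between words, respectively. Experimental results show that on public datasets both accuracy and macro-F1 exceed those of the baseline models.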

15.
Human action recognition in videos remains an important yet challenging task. Existing methods based on RGB images or optical flow are easily affected by clutter and ambiguous backgrounds. In this paper, we propose a novel Pose-Guided Inflated 3D ConvNet framework (PI3D) to address this issue. First, we design a spatial–temporal pose module, which provides essential clues for the Inflated 3D ConvNet (I3D). The pose module consists of pose estimation and pose-based action recognition. Second, for the multi-person estimation task, the introduced pose estimation network can determine the action most relevant to the action category. Third, we propose a hierarchical pose-based network to learn the spatial–temporal features of human pose. Moreover, the pose-based network and the I3D network are fused at the last convolutional layer without loss of performance. Finally, experimental results on four datasets (HMDB-51, SYSU 3D, JHMDB, and Sub-JHMDB) demonstrate that the proposed PI3D framework outperforms existing methods on human action recognition. This work also shows that posture cues significantly improve the performance of I3D.

16.
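To address the problem that infrared video lacks texture detail, which makes it difficult to balance computational complexity and recognition accuracy in human action recognition, an infrared video action recognition method based on global bilinear attention is proposed. To compute human actions in infrared video efficiently, a joint extraction module based on a two-stage detection network is designed to obtain human joint information, and the resulting 3D joint heatmaps are used, as a novel contribution, as the input features of the infrared video action recognition network. To further improve recognition accuracy while keeping computation lightweight, a 3D convolutional network with global bilinear attention is proposed, which strengthens attention modeling in both the spatial and channel dimensions and captures global structural information. Experimental results on the InfAR and IITR-IAR datasets demonstrate the effectiveness of the method for infrared video action recognition.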

17.
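The fast super-resolution convolutional neural network (FSRCNN) has few convolutional layers, and the feature information of adjacent convolutional layers lacks correlation; it therefore struggles to extract deep image information, which leads to poor super-resolution reconstruction. To address this problem, this paper proposes a super-resolution reconstruction method based on a deep residual network with multi-level skip connections. First, a residual block with multi-level skip connections is designed, and a deep residual network is built from such blocks, solving the problem that the feature information of adjacent convolutional layers lacks correlation. Then, the network is trained with stochastic gradient descent (SGD) under an adjustable learning-rate schedule to obtain the super-resolution reconstruction model. Finally, a low-resolution image is fed into the model, the residual blocks with multi-level skip connections produce the predicted residual features, and the residual image is combined with the low-resolution image to form the high-resolution image. The method is compared with the bicubic, A+, SRCNN, FSRCNN, and ESPCN algorithms on the Set5 and Set14 test sets, and it outperforms them in both visual quality and evaluation metrics.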

18.
19.
In this paper, an end-to-end convolutional neural network named the Attention-Based Multi-Stream Feature Fusion Network (AMSFF-Net) is proposed to recover haze-free images. The network is built on an encoder-decoder structure. The encoder generates features at three resolution levels. The multi-stream features are extracted using residual dense blocks and fused by feature fusion blocks. AMSFF-Net uses a pixel attention mechanism to pay more attention to informative features at different resolution levels. A sharp image can be recovered through good kernel estimation. Further, AMSFF-Net captures semantic and sharp textural details from the extracted features and restores a high-quality image from coarse to fine using a mixed-convolution attention mechanism in the decoder. The skip connections decrease the loss of image details caused by the larger receptive fields. Moreover, a deep semantic loss function places more emphasis on semantic information in deep features. Experimental findings show that the proposed method achieves superior performance on synthetic and real-world images.

20.
In clinical analysis and diagnosis, high-resolution (HR) computed tomography (CT) images are required for proper treatment of a patient. Acquiring HR medical images with X-ray CT devices requires extended radiation exposure at large doses, putting the patient at potential risk of cancer. Radiation exposure should therefore be reduced. However, photon starvation and beam hardening in low-dose X-rays cause severe artifacts. Thus, an accurate reconstruction of low-dose X-ray CT images is required. To this end, we propose a wavelet-based multi-channel and multi-scale cross-connected residual-in-dense grouped convolutional neural network (WCRDGCNN) for accurate super-resolution (SR) of medical images. The adopted filter groups reduce the connection weights, thereby reducing computational complexity. The vanishing gradient problem is tackled by using residual and dense skip connections. Extensive experimental results on benchmark datasets show that our method outperforms the state-of-the-art SR methods.
