Crowd counting is a challenging task in computer vision owing to scale variations, perspective distortions, and complex backgrounds. Existing research usually adopts dilated convolutional networks to enlarge the receptive field and thereby handle scale variations. However, these methods easily bring background information into the large receptive fields, which produces poor-quality density maps. To address this problem, we propose a novel backbone called Context-guided Dense Attentional Dilated Network (CDADNet). CDADNet contains three components: an attentional module, a context-guided module, and a dense attentional dilated module. The attentional module provides attention maps that suppress background information, while the context-guided module extracts multi-scale contextual information. Moreover, the dense attentional dilated module aims to generate high-granularity density maps, and a cascaded strategy is used to preserve information across changing scales. To verify the feasibility of our method, we compare it with existing approaches on five crowd counting datasets (ShanghaiTech Part_A and Part_B, WorldEXPO’10, UCSD, and UCF_CC_50). The comparison results demonstrate that CDADNet is effective and robust across various scenes.

For reasons of public security, modeling large crowd distributions for counting or density estimation has attracted significant research interest in recent years. Existing crowd counting algorithms rely on predefined features and regression to estimate the crowd size. However, most of them have two limitations: (1) they can handle crowds of a few tens of individuals, but for crowds of hundreds or thousands they can only estimate the crowd density rather than the crowd count; (2) they usually rely on temporal sequences in crowd videos, which are not available for still images. To address these problems, in this paper we investigate the use of a deep-learning approach to estimate the number of individuals present in a mid-level or high-level crowd visible in a single image. First, a ConvNet structure is used to extract crowd features. Then two supervisory signals, crowd count and crowd density, are employed to learn crowd features and estimate the count. We test our approach on a dataset containing 107 crowd images with 45,000 annotated humans, with head counts per image ranging from 58 to 2201. The efficacy of the proposed approach is demonstrated in extensive experiments that quantify counting performance through multiple evaluation criteria.

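Scale variation, occlusion, and complex backgrounds make crowd count estimation in congested scenes a challenging task. To cope with scale variation in crowd images and with the limited scale range and feature similarity of existing multi-column networks, this paper proposes a Multi-Scale Interactive Attention crowd counting Network (MSIANet). First, a multi-scale attention module is designed; it uses four branches with different receptive fields to extract features at different scales and lets the scale features extracted by the branches interact, while an attention mechanism limits the feature-similarity problem of multi-column networks. Second, building on the multi-scale attention module, a semantic information fusion module is designed; it lets semantic information from different levels of the backbone network interact and stacks multi-scale attention modules hierarchically to make full use of multi-level semantic information. Finally, the multi-scale interactive attention crowd counting network is built from the multi-scale attention module and the semantic information fusion module; it makes full use of multi-level semantic information and multi-scale information to generate high-quality crowd density maps. Experimental results show that, compared with existing representative crowd counting methods, the proposed MSIANet effectively improves the accuracy and robustness of crowd counting.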

Crowd counting algorithms have recently achieved significant progress by incorporating attention mechanisms into convolutional neural networks (CNNs). The channel attention model (CAM), a popular attention mechanism, calculates a set of probability weights to select important channel-wise feature responses. However, most CAMs coarsely assign a single weight to an entire channel-wise map, so useful and useless information are treated indiscriminately, which limits the representational capacity of the network. In this paper, we propose a multi-scale and spatial position-based channel attention network (MS-SPCANet), which integrates spatial position-based channel attention models (SPCAMs) at multiple scales into a CNN. SPCAM assigns different channel attention weights to different positions of the channel-wise maps to capture more informative features. Furthermore, an adaptive loss, which uses adaptive coefficients to combine a density map loss and a headcount loss, is constructed to improve network performance in sparse crowd scenes. Experimental results on four public datasets verify the superiority of the scheme.

Crowd counting with density estimation has been an active research area due to its significant applications in public security, video surveillance, and traffic monitoring. However, crowd counting in congested scenes often suffers from obstacles including severe occlusions, large scale variations, and noise interference. In this paper, using the first ten layers of a modified VGG16 and dilated convolution layers as the framework, we propose a CNN-based crowd counting and density estimation model improved by attention-aware modules with residual connections. To tackle the problem of noise interference, convolutional block attention modules are introduced into the deep network to separate foreground from background and focus on the information of interest, refining deeper features of the input image. To improve information transmission and reuse, residual connections link three attention blocks. Meanwhile, dilated convolution layers maintain larger receptive fields and yield high-resolution density maps. The proposed method is evaluated on three public benchmarks, Shanghai Tech A & B, UCF-QNRF, and MALL, achieving mean absolute errors of 64.6 & 8.3, 113.8, and 1.68, respectively. These results outperform several strong existing approaches, indicating that the proposed model has high accuracy and better robustness and is suitable for crowd counting and density estimation in various congested scenes.
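Several of the abstracts above rely on dilated convolution to enlarge the receptive field. The following minimal sketch shows only the standard arithmetic behind that claim: a kernel of size k with dilation d covers k + (k - 1)(d - 1) input positions along each axis. The function names and the example dilation rates are illustrative assumptions, not values taken from any of the papers.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (per axis) covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(layers) -> int:
    """Receptive field of a stack of stride-1 convolutions.

    `layers` is an iterable of (kernel_size, dilation) pairs. Each stride-1
    layer widens the receptive field by (effective kernel size - 1).
    """
    receptive_field = 1
    for kernel_size, dilation in layers:
        receptive_field += effective_kernel_size(kernel_size, dilation) - 1
    return receptive_field


# Hypothetical comparison: three 3x3 layers, plain versus dilation rate 2.
plain = stacked_receptive_field([(3, 1)] * 3)
dilated = stacked_receptive_field([(3, 2)] * 3)
print(plain, dilated)
```

With the same number of weights per layer, the dilated stack sees a wider window, which is also why more background can leak into the features unless something such as an attention map filters it out.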

Convolutional neural networks (CNNs) have dominated the field of computer vision for nearly a decade. However, due to their limited receptive field, CNNs fail to model the global context. On the other hand, transformers, an attention-based architecture, can model the global context easily. Despite this, there are limited studies that investigate the effectiveness of transformers in crowd counting. In addition, the majority of the existing crowd-counting methods are based on the regression of density maps which requires point-level annotation of each person present in the scene. This annotation task is laborious and also error-prone. This has led to an increased focus on weakly-supervised crowd-counting methods, which require only count-level annotations. In this paper, we propose a weakly-supervised method for crowd counting using a pyramid vision transformer. We have conducted extensive evaluations to validate the effectiveness of the proposed method. Our method achieves state-of-the-art performance. More importantly, it shows remarkable generalizability.

Attention modules embedded in deep networks mediate the selection of informative regions for object recognition. In addition, combining features learned from different branches of a network can enhance the discriminative power of these features. However, fusing features with inconsistent scales is a less-studied problem. In this paper, we first propose a multi-scale channel attention network with an adaptive feature fusion strategy (MSCAN-AFF) for face recognition (FR), which fuses the relevant feature channels and improves the network’s representational power. In FR, face alignment is performed independently prior to recognition and requires efficient localization of facial landmarks, which might be unavailable in uncontrolled scenarios such as low resolution and occlusion. Therefore, we propose using our MSCAN-AFF to guide the Spatial Transformer Network (MSCAN-STN) to align feature maps learned from an unaligned training set in an end-to-end manner. Experiments on benchmark datasets demonstrate the effectiveness of the proposed MSCAN-AFF and MSCAN-STN.

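Existing crowd counting methods are not fully applicable to rail transit scenes; to address this, a crowd counting model based on a convolutional neural network is proposed. The model uses VGG16 as the front-end network to extract shallow features and introduces M-Inception, a structure improved from the Inception structure, which is combined with dilated convolution to form the back-end network, enlarging the receptive field and adapting to pedestrian targets of different sizes under multiple surveillance angles. A weighted loss function that fuses the total pedestrian count estimation loss and the density map loss is also proposed. The model is compared experimentally with four existing models; the results show that the proposed crowd counting algorithm achieves a mean absolute error and mean squared error of only 1.46 and 2.13 in subway scenes, outperforming the four compared models. With practical application in mind, the model is deployed on a HiSilicon embedded chip; measured results show that the model achieves high computing speed and accuracy on the embedded chip, meeting the needs of practical application scenarios.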
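The weighted loss mentioned above combines a density map term with a total-count term. The abstract gives neither the exact terms nor the weight, so the following is only a sketch under assumptions: pixel-wise mean squared error for the density map, absolute difference of summed densities for the count, and a hypothetical fixed weight `count_weight`.

```python
def density_map_loss(pred, target):
    """Mean squared error over all pixels of two equally sized 2-D maps."""
    total = 0.0
    pixels = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            pixels += 1
    return total / pixels


def count_loss(pred, target):
    """Absolute difference between the counts obtained by summing each map."""
    pred_count = sum(sum(row) for row in pred)
    target_count = sum(sum(row) for row in target)
    return abs(pred_count - target_count)


def weighted_loss(pred, target, count_weight=0.1):
    """Density term plus a weighted count term (the weight is an assumption)."""
    return density_map_loss(pred, target) + count_weight * count_loss(pred, target)


# Toy 2x2 maps: the prediction over-counts by two people.
predicted = [[1.0, 1.0], [1.0, 1.0]]
ground_truth = [[1.0, 0.0], [0.0, 1.0]]
print(weighted_loss(predicted, ground_truth))
```

The count term penalizes a map whose total is wrong even when individual pixel errors are small, which is the usual motivation for adding it to a pure density loss.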

In recent years, removing rain streaks from a single image has been a significant issue for outdoor vision tasks. In this paper, we propose a novel recursive residual atrous spatial pyramid pooling network to directly recover the clear image from a rainy image. Specifically, we adopt a residual atrous spatial pyramid pooling (ResASPP) module, constructed by alternately cascading a ResASPP block with a residual block, to exploit multi-scale rain information. In addition, to account for the dependencies of deep features across stages, a recurrent layer is introduced into ResASPP to model the multi-stage processing procedure from coarse to fine. For each stage in our recursive network, we concatenate the stage-wise output with the original rainy image and feed the result into the next stage. Furthermore, the negative SSIM loss and a perceptual loss are employed to train the proposed network. Extensive experiments on both synthetic and real-world rainy datasets demonstrate that the proposed method outperforms state-of-the-art deraining methods.

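Dense crowd counting is a classic problem in computer vision and is still affected by factors such as non-uniform scale, noise, and occlusion. This paper proposes a dense crowd counting method based on a novel multi-scale attention mechanism. The deep network consists of a backbone network, a feature extraction network, and a feature fusion network. The feature extraction network comprises a feature branch and an attention branch and adopts a novel multi-scale module composed of parallel convolution kernels, which better captures crowd features at different scales and thus adapts to the non-uniform scale of dense crowd distributions. The feature fusion network uses an attention fusion module to enhance the output features of the feature extraction network, effectively fusing attention features with image features and improving counting accuracy. Experiments on public datasets including ShanghaiTech, UCF_CC_50, Mall, and UCSD show that the proposed method outperforms existing methods on both MAE and MSE.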
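MAE and MSE are the two metrics reported throughout these abstracts. As a hedged illustration, the sketch below computes both over per-image counts. Many crowd counting papers report the square root of the mean squared error under the name "MSE"; the abstracts here do not say which definition they use, so the function returns both values. The counts in the example are made up.

```python
import math


def counting_errors(predicted_counts, true_counts):
    """Return (MAE, mean squared error, root mean squared error) over images."""
    differences = [p - t for p, t in zip(predicted_counts, true_counts)]
    n = len(differences)
    mae = sum(abs(d) for d in differences) / n
    mse = sum(d * d for d in differences) / n
    return mae, mse, math.sqrt(mse)


# Hypothetical per-image counts, for illustration only.
mae, mse, rmse = counting_errors([10, 20, 30], [12, 18, 30])
print(mae, mse, rmse)
```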

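To address the differences and complementarity among multi-source remote sensing images, this paper proposes a feature fusion classification method for optical and SAR images based on spatial and spectral attention. First, convolutional neural networks extract features from the optical image and the SAR image separately. An attention module composed of spatial attention and spectral attention is then designed to analyze feature importance and generate weights for the different features, which enhances feature fusion while reducing attention to invalid information, thereby improving the fusion classification accuracy for optical and SAR images. Comparative experiments on two optical and SAR image datasets show that the proposed method achieves higher fusion classification accuracy.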

Ge Bin, Peng Xichen, Sun Qianqian, Yuan Zheng. Journal of Optoelectronics·Laser (《光电子.激光》), 2023, 34(10): 1111-1090
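Coronavirus disease 2019 (COVID-19) has seriously affected social and economic development and threatens human health. To screen infected patients more accurately and quickly, convolutional neural network (CNN) methods are used to recognize COVID-19 chest X-ray images for automated computer-aided diagnosis. However, because recognition accuracy is not high, it is difficult to determine accurately whether a patient is infected with COVID-19. To improve the recognition performance of network models on COVID-19 chest X-ray images, an attention steered trapezoid pyramid fusion network (ASTPNet) is first proposed; it can be attached to different CNNs and effectively exploits the characteristics of the deep and shallow layers of the model. Second, an attention steered block (AS Block) is proposed, which uses channel and spatial attention to emphasize effective semantic information in the channel and spatial dimensions, weaken invalid interfering information, and aggregate the weighted information efficiently. The final experimental results show that attaching ASTPNet to VGG16/19, ResNet34/50, and ResNeXt significantly improves recognition accuracy; applied to the self-built C...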

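To address the classification confusion in some scenes caused by large intra-class variation and high inter-class similarity in remote sensing image scene classification, this paper proposes a strongly discriminative feature representation method based on a dual attention mechanism. Because the features represented by different channels differ in importance and different local regions differ in saliency, a channel-wise attention module and a spatial attention module are designed on top of the high-level features extracted by a convolutional neural network, using the contextual information extraction capability of recurrent neural networks...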

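To let the network capture more effective content for pedestrian discrimination, this paper proposes a multi-branch network based on a stair-shaped feature space segmentation and local branch attention network (SLANet) mechanism to attend to the salient information in local image regions. First, a stair-shaped branch attention module is introduced into the network; it partitions the feature map horizontally in a stair-shaped manner and uses branch attention to assign a different weight to each branch. Second, a multi-scale adaptive attention module is introduced; it processes local features, adaptively adjusts the receptive field size to suit images of different scales, and fuses channel attention and spatial attention to select the important image features. In terms of network design, a multi-granularity network combines global and local features. Finally, the method is validated on three widely used person re-identification datasets: Market-1501, DukeMTMC-reID, and CUHK03. On Market-1501, mAP and Rank-1 reach 88.1% and 95.6%, respectively. The experimental results show that the proposed network model improves person re-identification accuracy.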

Existing deraining methods based on convolutional neural networks (CNNs) have achieved great success, but residual rain streaks can still degrade images drastically. In this work, we propose an end-to-end multi-scale context information and attention network, called MSCIANet. The proposed network consists of two subnetworks: multi-scale feature extraction (MSFE) and multi-receptive fields feature extraction (MRFFE). First, the MSFE picks up features of rain streaks at different scales and propagates deep features of the two layers across stages through skip connections. Second, the MRFFE refines background details using an attention mechanism and depthwise separable convolutions with different receptive fields at different scales. Finally, fusing the outputs of the two subnetworks reconstructs the clean background image. Extensive experimental results show that the proposed network performs well on the deraining task on synthetic and real-world datasets. A demo is available at https://github.com/CoderLi365/MSCIANet.

Siamese trackers have attracted considerable attention in the field of object tracking because of their high precision and speed. However, one of their main disadvantages is that the feature extraction network lacks diversity: they often use AlexNet or ResNet50 as the backbone network. AlexNet is shallow and thus cannot easily extract abundant semantic information, whereas ResNet50 has many convolutional layers, which reduces the real-time performance of Siamese trackers. We propose a multi-branch feature aggregation network with different designs in the shallow and deep convolutional layers. We use the residual module to build the shallow convolutional layers, which extract textural and edge features. The deep convolutional layers, designed as two independent branches, are built with residual and parallel modules to extract different semantic features. The proposed network has a depth of only nine modules, making it simple and effective. We then apply the network to a Siamese tracker to form SiamMBFAN. We design multi-layer classification and regression subnetworks in the Siamese tracker by aggregating the last three modules of the two branches, which improves the localization ability of the tracker. Our tracker achieves a better balance between performance and speed. Finally, SiamMBFAN is tested on four challenging benchmarks: OTB100, VOT2016, VOT2018, and UAV123. Compared with other trackers, our tracker improves by 7% on OTB100.

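The Single Shot MultiBox Detector (SSD) is an object detection algorithm that strikes a good balance among simplicity, speed, and accuracy. The way the SSD network structure uses each detection layer in isolation leaves feature information underused, which makes small-object detection insufficiently robust. This paper proposes ASSD, an attention-based single shot multibox detector algorithm. ASSD first uses the proposed bidirectional feature fusion module to fuse feature information and obtain feature layers containing rich detail and semantic information, then...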

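Remote sensing images are rich in content. When general deep models extract features from them, they are easily disturbed by complex backgrounds, extract key features poorly, and have difficulty expressing the spatial information of the image. This paper proposes a deep convolutional neural network based on multi-scale pooling and norm attention mechanisms that adaptively weights salient features at the channel level and the spatial level. First, in the multi-scale pooling channel attention module, following the idea of spatial pyramid pooling, max pooling at different scales is applied to the feature map of each channel. Next, adaptive average pooling converts the feature maps of different sizes to a uniform size so that salient features at different scales can be attended to through pixel-wise addition. Then, in the norm spatial attention module, the pixels at the same spatial position across channels form a vector, and computing the L1 norm and L2 norm of this set of vectors yields feature maps that carry spatial information. Finally, cascaded pooling is used to optimize the high-level features, which are then used for remote sensing image retrieval. Experiments on three datasets, UC Merced, AID, and NWPU-RESISC45, show that the proposed attention model attends to salient features at different scales, incorporates spatial information, and improves retrieval performance.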
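The norm spatial attention step described above can be sketched as follows: for each spatial position, the values across all channels form a vector, and its L1 and L2 norms give two single-channel maps. This is only an illustration of that norm computation on nested lists; how the paper turns these maps into attention weights is not stated in the abstract, and the function name and toy input are assumptions.

```python
import math


def channel_norm_maps(features):
    """Compute per-position L1 and L2 norms across channels.

    `features` is a list of C channels, each an H x W nested list.
    Returns two H x W nested lists: (l1_map, l2_map).
    """
    height = len(features[0])
    width = len(features[0][0])
    l1_map = [[0.0] * width for _ in range(height)]
    l2_map = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Vector of all channel responses at spatial position (y, x).
            vector = [channel[y][x] for channel in features]
            l1_map[y][x] = sum(abs(v) for v in vector)
            l2_map[y][x] = math.sqrt(sum(v * v for v in vector))
    return l1_map, l2_map


# Toy input: 2 channels, 1 x 2 spatial grid.
toy_features = [[[3.0, 0.0]], [[-4.0, 1.0]]]
l1, l2 = channel_norm_maps(toy_features)
print(l1, l2)
```

Positions where many channels respond strongly get large norms, so both maps act as a rough saliency signal over the spatial grid.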

