Similar Documents
Found 20 similar documents (search time: 15 ms)
1.
Recently, there has been a trend in tracking to use a more refined segmentation mask instead of a coarse bounding box to represent the target object. Some trackers have added segmentation branches to the tracking framework while maintaining real-time speed. However, those trackers use a simple FCN structure and lack edge information modeling, which makes their performance quite unsatisfactory. In this paper, we propose an edge-aware segmentation network, which uses the complementarity between target information and edge information to provide a more refined representation of the target. First, we fuse the high-level features of the tracking backbone network with the correlation features of the classification branch of the tracking framework, supervising the result simultaneously with the target edge and the target segmentation mask to obtain optimized high-level features carrying rough edge information and target information. Second, we use the optimized high-level features to guide the low-level features of the tracking backbone network to generate more refined edge features. Finally, we fuse the refined edge features with the target features of each layer to generate the final mask. Our approach achieves leading performance on the recent pixel-wise object tracking benchmark VOT2020 and the segmentation datasets DAVIS2016 and DAVIS2017 while running at 47 fps. Code is available at https://github.com/TJUMMG/EATtracker.
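As an illustration of the edge-guided fusion described above, here is a minimal PyTorch sketch (not the authors' EATtracker code; module names and channel sizes are hypothetical) in which an optimized high-level feature, supervised by both mask and edge labels, guides low-level backbone features toward refined edge features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeGuidedFusion(nn.Module):
    def __init__(self, high_ch, low_ch, mid_ch=64):
        super().__init__()
        self.reduce_high = nn.Conv2d(high_ch, mid_ch, 1)
        self.reduce_low = nn.Conv2d(low_ch, mid_ch, 1)
        self.edge_head = nn.Conv2d(mid_ch, 1, 3, padding=1)  # supervised by target edges
        self.mask_head = nn.Conv2d(mid_ch, 1, 3, padding=1)  # supervised by target mask

    def forward(self, high_feat, low_feat):
        high = self.reduce_high(high_feat)
        # upsample the optimized high-level feature to guide low-level features
        high_up = F.interpolate(high, size=low_feat.shape[-2:],
                                mode='bilinear', align_corners=False)
        refined = self.reduce_low(low_feat) * torch.sigmoid(high_up) + high_up
        edge_logit = self.edge_head(refined)
        # fuse the refined edge features back into the mask prediction
        mask_logit = self.mask_head(refined * (1 + torch.sigmoid(edge_logit)))
        return mask_logit, edge_logit

fusion = EdgeGuidedFusion(high_ch=256, low_ch=64)
mask, edge = fusion(torch.randn(1, 256, 16, 16), torch.randn(1, 64, 64, 64))
print(mask.shape, edge.shape)  # both torch.Size([1, 1, 64, 64])
```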

2.
Current state-of-the-art two-stage models for instance segmentation suffer from several types of imbalances. In this paper, we address the Intersection over Union (IoU) distribution imbalance of positive input Regions of Interest (RoIs) during the training of the second stage. Our Self-Balanced R-CNN (SBR-CNN), an evolved version of the Hybrid Task Cascade (HTC) model, brings brand-new loop mechanisms of bounding box and mask refinement. With an improved Generic RoI Extraction (GRoIE), we also address the feature-level imbalance at the Feature Pyramid Network (FPN) level, caused by the non-uniform integration of low- and high-level features from the backbone layers. In addition, the redesign of the architecture heads toward a fully convolutional approach with FCC further reduces the number of parameters and offers more insight into the connection between the task to solve and the layers used. Moreover, our SBR-CNN shows the same or even better improvements when adopted in conjunction with other state-of-the-art models. In fact, with a lightweight ResNet-50 backbone, evaluated on the COCO minival 2017 dataset, our model reaches 45.3% and 41.5% AP for object detection and instance segmentation, with 12 epochs and without extra tricks. The code is available at https://github.com/IMPLabUniPr/mmdetection/tree/sbr_cnn.
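The GRoIE idea of uniform low/high-level integration can be sketched as pooling each RoI from every FPN level and aggregating, rather than assigning it to a single level. This is a hedged illustration under assumed shapes, not the SBR-CNN release (which also supports learned aggregation):

```python
import torch
from torchvision.ops import roi_align

def generic_roi_extract(fpn_feats, spatial_scales, boxes, out_size=7):
    # fpn_feats: list of [N, C, Hi, Wi]; boxes: [K, 5] rows of (batch_idx, x1, y1, x2, y2)
    pooled = [
        roi_align(f, boxes, output_size=out_size, spatial_scale=s, aligned=True)
        for f, s in zip(fpn_feats, spatial_scales)
    ]
    return torch.stack(pooled).sum(dim=0)  # sum-aggregate across all levels

feats = [torch.randn(1, 256, 64 // 2**i, 64 // 2**i) for i in range(4)]
rois = torch.tensor([[0.0, 8.0, 8.0, 120.0, 120.0]])
out = generic_roi_extract(feats, [1/4, 1/8, 1/16, 1/32], rois)
print(out.shape)  # torch.Size([1, 256, 7, 7])
```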

3.
Object detection across different scales is challenging due to the variance of object scales. Thus, a novel detection network, the Top-Down Feature Fusion Single Shot MultiBox Detector (TDFSSD), is proposed. The network is based on the Single Shot MultiBox Detector (SSD) with a VGG-16 backbone and a novel, simple yet efficient feature fusion module, namely the Top-Down Feature Fusion Module. The proposed module iteratively fuses higher-level features, which contain semantic information, into lower-level features, which contain boundary information. Extensive experiments have been conducted on the PASCAL VOC2007, PASCAL VOC2012, and MS COCO datasets to demonstrate the effectiveness of the proposed method. The TDFSSD network is trained end to end and outperforms state-of-the-art methods across the three datasets. It achieves 81.7% and 80.1% mAP on VOC2007 and VOC2012 respectively, outperforming the reported best results of both one-stage and two-stage frameworks. Meanwhile, it achieves 33.4% mAP on MS COCO test-dev, notably 17.2% average precision (AP) on small objects. All results demonstrate the effectiveness of the proposed method for object detection. Code and models are available at: https://github.com/dongfengxijian/TDFSSD.
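The top-down module can be pictured as an FPN-style pass. Below is a hedged sketch (illustrative channel sizes, not the released TDFSSD code) in which upsampled semantic features are iteratively merged into boundary-rich lower levels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    def __init__(self, channels=(512, 1024, 512, 256), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in channels)

    def forward(self, feats):                # feats ordered low -> high level
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        fused = [lat[-1]]
        for f in reversed(lat[:-1]):         # iterate top-down
            up = F.interpolate(fused[-1], size=f.shape[-2:], mode='nearest')
            fused.append(f + up)             # inject semantics into lower level
        return [s(f) for s, f in zip(self.smooth, fused[::-1])]

feats = [torch.randn(1, c, 2**(6 - i), 2**(6 - i))
         for i, c in enumerate((512, 1024, 512, 256))]
print([o.shape[-1] for o in TopDownFusion()(feats)])  # [64, 32, 16, 8]
```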

4.
Keypoint-based object detection achieves good performance without positioning calculations and extensive predictions. However, such detectors rely on heavy backbones, and restoring high resolution by upsampling yields unreliable features. We propose a self-constrained parallelism keypoint-based lightweight object detection network (SCPNet), which speeds up inference, reduces parameters, widens receptive fields, and makes predictions more accurate. Specifically, the parallel multi-scale fusion module (PMFM) with parallel shuffle blocks (PSB) adopts a parallel structure to obtain reliable features and reduce depth, and adopts repeated multi-scale fusion to avoid too many parallel branches. The self-constrained detection module (SCDM) has a two-branch structure: one branch predicts corners and employs entad offsets to match high-quality corner pairs, while the other branch predicts center keypoints. The distances between the geometric centers of paired corners and the predicted center keypoints are used for self-constrained detection. On MS COCO 2017 and PASCAL VOC, SCPNet's results are competitive with state-of-the-art lightweight object detectors. https://github.com/mengdie-wang/SCPNet.git.
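The self-constraint can be expressed as a simple geometric consistency test. A hedged sketch (hypothetical function, not the SCPNet release) that keeps a corner pair only if its geometric center lies near a predicted center keypoint:

```python
import torch

def self_constrained_filter(tl, br, centers, tol=0.05):
    # tl, br: [K, 2] matched top-left/bottom-right corners (x, y)
    # centers: [M, 2] predicted center keypoints; tol: relative tolerance (assumed)
    geo_center = (tl + br) / 2                        # [K, 2]
    diag = (br - tl).norm(dim=1)                      # box diagonal length
    d = torch.cdist(geo_center, centers)              # [K, M] pairwise distances
    return d.min(dim=1).values < tol * diag           # keep mask per corner pair

tl = torch.tensor([[10., 10.]])
br = torch.tensor([[50., 60.]])
centers = torch.tensor([[30., 35.], [100., 100.]])
print(self_constrained_filter(tl, br, centers))       # tensor([True])
```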

5.
Images captured under weak illumination suffer from seriously degraded quality. Addressing the various degradations of low-light images can effectively improve both their visual quality and the performance of high-level visual tasks. In this study, a novel Retinex-based Real-low to Real-normal Network (R2RNet) is proposed for low-light image enhancement. It comprises three subnets: a Decom-Net, a Denoise-Net, and a Relight-Net, used for decomposition, denoising, and contrast enhancement with detail preservation, respectively. R2RNet uses not only the spatial information of the image to improve contrast but also its frequency information to preserve details, so the model achieves more robust results across all degraded images. Unlike most previous methods, which were trained on synthetic images, we collected the first large-scale real-world paired low/normal-light image dataset (the LSRW dataset) to satisfy the training requirements and give our model better generalization in real-world scenes. Extensive experiments on publicly available datasets demonstrate that our method outperforms existing state-of-the-art methods both quantitatively and visually. In addition, our results show that the performance of a high-level visual task (i.e., face detection) in low-light conditions can be effectively improved by using the enhanced results produced by our method. Our code and the LSRW dataset are available at: https://github.com/JianghaiSCU/R2RNet.
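A minimal sketch of the Retinex-style decomposition behind this pipeline (a toy stand-in, not the released R2RNet subnets): the Decom-Net splits an image into reflectance and illumination, which the Denoise-Net and Relight-Net would then process before recombination:

```python
import torch
import torch.nn as nn

class TinyDecom(nn.Module):                        # toy stand-in for Decom-Net
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, 3, padding=1)  # 3 reflectance + 1 illumination
    def forward(self, x):
        out = torch.sigmoid(self.conv(x))
        return out[:, :3], out[:, 3:]              # R (reflectance), L (illumination)

low = torch.rand(1, 3, 64, 64)                     # a low-light input
R, L = TinyDecom()(low)
# Retinex model: the observed image is reflectance times illumination, so
# enhancement amounts to denoising R and relighting L before recombining.
enhanced = R * L
print(enhanced.shape)                              # torch.Size([1, 3, 64, 64])
```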

6.
In the security field, faces in images captured by outdoor surveillance cameras are usually blurry, occluded, small, and in diverse poses, being affected by external factors such as camera pose and range, weather conditions, etc. This can be described as a hard face detection problem in natural images. To solve this problem, we propose a deep convolutional neural network named the feature hierarchy encoder–decoder network (FHEDN). It is motivated by two observations concerning contextual semantic information and the mechanism of multi-scale face detection. The proposed network is a single-stage, scale-variant architecture composed of encoder and decoder subnetworks. Based on the assumption that the contextual semantic information around a face aids detection, we introduce a residual mechanism to fuse context-prior information into face features and formulate a learning chain to train each encoder–decoder pair. In addition, we discuss important implementation details, such as the distribution of the training dataset, the scale of the feature hierarchy, and anchor box sizes, all of which affect the detection performance of the final network. Compared with state-of-the-art algorithms, our method achieves promising performance on the popular benchmarks AFW, PASCAL FACE, FDDB, and WIDER FACE. Consequently, the proposed approach can be efficiently implemented and routinely applied to detect faces with severe occlusion and arbitrary pose variations in unconstrained scenes. Our code and results are available at https://github.com/zzxcoder/EvaluationFHEDN.
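The residual context fusion can be sketched as follows (a hedged toy module, not the FHEDN code): a wider-window convolution gathers context around the face feature, which is added back through a residual connection:

```python
import torch
import torch.nn as nn

class ContextResidual(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.context = nn.Conv2d(ch, ch, 5, padding=2)  # wider context window
        self.proj = nn.Conv2d(ch, ch, 1)
    def forward(self, face_feat):
        # residual fusion of context-prior information into the face feature
        return face_feat + self.proj(self.context(face_feat))

print(ContextResidual(64)(torch.randn(1, 64, 32, 32)).shape)
```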

7.
To increase the richness of the extracted text-modality feature information and deeply explore the semantic similarity between modalities, in this paper we propose a novel method named adaptive weight multi-channel center similar deep hashing (AMCDH). The algorithm first utilizes three channels with different configurations to extract feature information from the text modality, then adds them according to learned weight ratios to increase the richness of the information. We also introduce the Jaccard coefficient to measure the semantic similarity between modalities on a scale from 0 to 1, and utilize it as the penalty coefficient of the cross-entropy loss function to increase its role in backpropagation. Besides, we propose a method of constructing center similarity, which pulls the hash codes of similar data pairs close to the same center point and scatters dissimilar data pairs across different center points, generating high-quality hash codes. Extensive experimental evaluations on four benchmark datasets show that our proposed AMCDH significantly outperforms other competing baselines. The code can be obtained from https://github.com/DaveLiu6/AMCDH.git.
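The Jaccard-weighted loss can be illustrated concretely. In this hedged sketch (the floor weight and pair-target rule are assumptions, not the AMCDH release), the Jaccard coefficient of two samples' label sets scales the cross-entropy term so that pair similarity directly modulates backpropagation:

```python
import torch
import torch.nn.functional as F

def jaccard(labels_a, labels_b):
    # labels_*: [N, C] binary multi-label matrices; output in [0, 1]
    inter = (labels_a * labels_b).sum(dim=1)
    union = ((labels_a + labels_b) > 0).float().sum(dim=1).clamp(min=1)
    return inter / union

def weighted_pair_loss(sim_logits, labels_a, labels_b):
    j = jaccard(labels_a, labels_b)               # semantic similarity level
    target = (j > 0).float()                      # similar if any label is shared
    bce = F.binary_cross_entropy_with_logits(sim_logits, target, reduction='none')
    return (j.clamp(min=0.1) * bce).mean()        # hypothetical floor for dissimilar pairs

la = torch.tensor([[1., 0., 1.], [0., 1., 0.]])
lb = torch.tensor([[1., 0., 0.], [0., 1., 1.]])
print(weighted_pair_loss(torch.tensor([2.0, -1.0]), la, lb))
```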

8.
Infrared dim and small target detection is a key technology for space-based infrared search and tracking systems. Traditional detection methods have a high false alarm rate and fail to handle complex background and high-noise scenarios; nor can they effectively detect targets at small scales. In this paper, a U-Transformer method is proposed, introducing the transformer into infrared dim and small target detection. First, a U-shaped network is constructed. In the encoder, the self-attention mechanism is used for feature extraction of infrared dim and small targets, which helps alleviate the loss of such features in deep networks. Meanwhile, through the encoding–decoding structure, infrared dim and small target features are filtered from the complex background while the shallow features and semantic information of the target are retained. Experiments show that anchor-free designs and transformers have great potential for infrared dim and small target detection. On datasets with complex backgrounds, our method outperforms state-of-the-art detectors while meeting the real-time requirement. The code is publicly available at https://github.com/Linaom1214/U-Transformer.
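The encoder's self-attention stage can be sketched as below (illustrative sizes, not the released U-Transformer): attention operates on flattened feature tokens, and a skip connection carries shallow features past the downsampling so small-target detail survives to the decoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnStage(nn.Module):
    def __init__(self, ch, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(ch)
    def forward(self, x):                         # x: [B, C, H, W]
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)          # [B, HW, C] token sequence
        t = self.norm(t + self.attn(t, t, t)[0])  # residual self-attention
        return t.transpose(1, 2).reshape(B, C, H, W)

x = torch.randn(1, 32, 32, 32)
skip = AttnStage(32)(x)                   # shallow encoder features
down = F.max_pool2d(skip, 2)              # encoder downsampling
up = F.interpolate(down, scale_factor=2)  # decoder upsampling
print((up + skip).shape)                  # skip connection preserves detail
```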

9.
Recently, the vision transformer has achieved a breakthrough in image recognition. Its multi-head self-attention mechanism (MSA) can extract discriminative token information from different patches to improve image classification accuracy. However, the classification token in its deep layers ignores local features between layers. In addition, the patch embedding layer feeds fixed-size patches into the network, which inevitably introduces additional image noise. Therefore, we propose a hierarchical attention vision transformer (HAVT) built on the transformer framework. We present a data augmentation method of attention cropping to crop out and drop image noise, forcing the network to learn key features. Second, we propose the hierarchical attention selection (HAS) module, which improves the network's ability to learn discriminative tokens between layers by filtering and fusing tokens across layers. Experimental results show that the proposed HAVT outperforms state-of-the-art approaches, improving accuracy to 91.8% and 91.0% on CUB-200-2011 and Stanford Dogs, respectively. We have released our source code on GitHub: https://github.com/OhJackHu/HAVT.git.
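Token selection between layers can be sketched as keeping, at each layer, the patches the class token attends to most, then fusing them across layers. A hedged illustration (the top-k rule is an assumption, not the HAVT release):

```python
import torch

def select_tokens(tokens, attn, k=8):
    # tokens: [B, N, D] with class token first; attn: [B, heads, N, N]
    cls_attn = attn.mean(dim=1)[:, 0, 1:]        # class-token -> patch weights
    idx = cls_attn.topk(k, dim=1).indices        # top-k discriminative patches
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return torch.gather(tokens[:, 1:], 1, idx)   # [B, k, D]

layer_tokens = [torch.randn(2, 197, 384) for _ in range(3)]
layer_attn = [torch.softmax(torch.randn(2, 6, 197, 197), -1) for _ in range(3)]
fused = torch.cat([select_tokens(t, a)
                   for t, a in zip(layer_tokens, layer_attn)], dim=1)
print(fused.shape)                               # torch.Size([2, 24, 384])
```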

10.
Existing deraining methods based on convolutional neural networks (CNNs) have achieved great success, but remaining rain streaks can still degrade images drastically. In this work, we propose an end-to-end multi-scale context information and attention network, called MSCIANet. The proposed network consists of a multi-scale feature extraction (MSFE) subnetwork and a multi-receptive-field feature extraction (MRFFE) subnetwork. First, the MSFE picks up features of rain streaks at different scales and propagates deep features of the two layers across stages by skip connections. Second, the MRFFE refines details of the background through an attention mechanism and depthwise separable convolutions with different receptive fields at different scales. Finally, the fused outputs of the two subnetworks reconstruct the clean background image. Extensive experimental results show that the proposed network performs well on the deraining task on both synthetic and real-world datasets. A demo is available at https://github.com/CoderLi365/MSCIANet.
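The multi-receptive-field branch can be sketched with depthwise separable convolutions at several dilation rates (a hedged toy version; the real MRFFE also applies attention):

```python
import torch
import torch.nn as nn

class MRFBranch(nn.Module):
    def __init__(self, ch, dilation):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, 3, padding=dilation,
                            dilation=dilation, groups=ch)  # depthwise
        self.pw = nn.Conv2d(ch, ch, 1)                     # pointwise
    def forward(self, x):
        return self.pw(self.dw(x))

class MRFFE(nn.Module):
    def __init__(self, ch=32, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(MRFBranch(ch, d) for d in dilations)
        self.fuse = nn.Conv2d(ch * len(dilations), ch, 1)
    def forward(self, x):
        # fuse detail refined at several receptive fields
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

print(MRFFE()(torch.randn(1, 32, 32, 32)).shape)  # torch.Size([1, 32, 32, 32])
```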

11.
Saliency prediction precision has improved rapidly with the development of deep learning, but inference speed has slowed as networks deepen. Hence, this paper proposes a fast saliency prediction model. Concretely, a siamese network backbone based on a tailored EfficientNetV2 accelerates inference while maintaining high performance, and the shared-parameter strategy further curbs parameter growth. Furthermore, we add multi-channel activation maps to optimize fine features across different channels and low-level visual features, which improves the interpretability of the model. Extensive experiments show that the proposed model achieves competitive performance on standard benchmark datasets and demonstrate the effectiveness of our method in striking a balance between prediction accuracy and inference speed. Moreover, the small model size allows our method to be deployed on edge devices. The code is available at: https://github.com/lscumt/fast-fixation-prediction.
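The shared-parameter strategy is easy to illustrate: one backbone processes both siamese inputs, so the parameter count is independent of the number of branches. A minimal sketch with a toy backbone (not the tailored EfficientNetV2):

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(8))
a, b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
fa, fb = backbone(a), backbone(b)   # one parameter set, two forward passes
print(sum(p.numel() for p in backbone.parameters()))  # unchanged by branch count
```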

12.
13.
Video super-resolution aims at restoring the spatial resolution of a reference frame from consecutive low-resolution (LR) input frames. Existing implicit alignment-based video super-resolution methods commonly utilize convolutional LSTM (ConvLSTM) to handle sequential input frames. However, vanilla ConvLSTM processes input features and hidden states independently in its operations and has limited ability to handle inter-frame temporal redundancy in low-resolution fields. In this paper, we propose a multi-stage spatio-temporal adaptive network (MS-STAN). A spatio-temporal adaptive ConvLSTM (STAC) module is proposed to handle input features in low-resolution fields. The STAC module exploits the correlation between input features and hidden states in the ConvLSTM unit and adaptively modulates the hidden states conditioned on fused spatio-temporal features. A residual stacked bidirectional (RSB) architecture is further proposed to fully exploit the processing ability of the STAC unit. The proposed STAC module and RSB architecture strengthen the vanilla ConvLSTM's ability to exploit inter-frame correlations, thus improving reconstruction quality. Furthermore, unlike existing methods that aggregate features from the temporal branch only once at a specified stage of the network, the proposed network is organized in a multi-stage manner, so the temporal correlations in features at different stages can be fully exploited. Experimental results on the Vimeo-90K-T and UMD10 datasets show that the proposed method performs comparably with current video super-resolution methods. The code is available at https://github.com/yhjoker/MS-STAN.
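The core STAC idea, modulating the hidden state using fused spatio-temporal information before the ConvLSTM gates, can be sketched as follows (a hedged toy module with assumed names, not the MS-STAN release):

```python
import torch
import torch.nn as nn

class HiddenModulator(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)
    def forward(self, x_t, h_prev):
        # fuse current input features with the previous hidden state, then
        # gate the hidden state so it adapts to the current frame
        gate = torch.sigmoid(self.fuse(torch.cat([x_t, h_prev], dim=1)))
        return h_prev * gate

mod = HiddenModulator(16)
print(mod(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32)).shape)
```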

14.
Owing to the storage and retrieval efficiency of hashing and the highly discriminative features extracted by deep neural networks, deep cross-modal hashing retrieval has attracted increasing attention in recent years. However, most existing deep cross-modal hashing methods simply employ single labels to directly measure the semantic relevance across modalities, neglecting the potential contributions of multiple category labels. Aiming to improve the accuracy of cross-modal hashing retrieval by fully exploring the semantic relevance given by the multiple labels of the training data, in this paper we propose a multi-label semantics preserving based deep cross-modal hashing (MLSPH) method. MLSPH first utilizes the multi-labels of instances to calculate the semantic similarity of the original data. Subsequently, a memory bank mechanism is introduced to preserve the multi-label semantic similarity constraints and enforce the distinctiveness of the learned hash representations over the whole training batch. Extensive experiments on several benchmark datasets reveal that the proposed MLSPH surpasses prominent baselines and reaches state-of-the-art performance in cross-modal hashing retrieval. Code is available at: https://github.com/SWU-CS-MediaLab/MLSPH.
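The multi-label similarity itself is simple to state: instead of a binary "share any label" signal, two instances' relevance is the normalized overlap of their full label vectors. A minimal sketch (a standard Jaccard-style formulation, assumed rather than taken from the MLSPH release):

```python
import torch

def multilabel_similarity(labels):
    # labels: [N, C] binary multi-label matrix
    inter = labels @ labels.t()                  # shared label counts
    counts = labels.sum(dim=1, keepdim=True)
    union = counts + counts.t() - inter
    return inter / union.clamp(min=1)            # [N, N] similarities in [0, 1]

labels = torch.tensor([[1., 0., 1.],
                       [1., 1., 0.],
                       [0., 0., 1.]])
print(multilabel_similarity(labels))             # graded, not just 0/1
```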

15.
16.
林森, 赵振禹, 任晓奎, 陶志勇. 《红外与激光工程》 (Infrared and Laser Engineering), 2022, 51(8): 20210702-1-20210702-12
3D point cloud data processing plays an important role in fields such as object segmentation, medical image segmentation, and virtual reality. However, existing 3D point cloud learning networks have a limited global feature extraction range and struggle to describe local high-level semantic information, which leaves the point cloud feature representation incomplete. To address these problems, an object point cloud classification and segmentation network that compensates global features with semantic information is proposed. First, the input point cloud is aligned to a canonical space as an input-transformation preprocessing step. Then, a dilated edge convolution module extracts the features of the transformed data at each layer, and these are stacked to form the global features. During local feature extraction, the extracted low-level semantic information is used to describe high-level semantic information and effective geometric features, compensating for point cloud features missed in the global features. Finally, the global features and local high-level semantic information are fused to obtain the overall features of the point cloud. Experimental results show that the proposed method outperforms current classic and novel algorithms in both classification and segmentation performance.
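The dilated edge convolution can be sketched as an EdgeConv whose neighborhood is a dilated k-NN set, enlarging the receptive field without extra points. A hedged illustration (an assumed formulation in the DGCNN style, not the paper's code):

```python
import torch

def dilated_knn(x, k=4, d=2):
    # x: [B, N, C]; take every d-th of the k*d nearest points -> dilated neighbors
    dist = torch.cdist(x, x)                          # [B, N, N]
    idx = dist.topk(k * d, largest=False).indices
    return idx[..., ::d]                              # [B, N, k]

def edge_features(x, idx):
    B, N, C = x.shape
    nbrs = torch.gather(x.unsqueeze(1).expand(B, N, N, C), 2,
                        idx.unsqueeze(-1).expand(-1, -1, -1, C))
    center = x.unsqueeze(2).expand_as(nbrs)
    return torch.cat([nbrs - center, center], dim=-1)  # edge feature (x_j - x_i, x_i)

pts = torch.randn(2, 128, 3)
print(edge_features(pts, dilated_knn(pts)).shape)      # torch.Size([2, 128, 4, 6])
```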

17.
18.
In the field of weakly supervised semantic segmentation (WSSS), Class Activation Maps (CAM) are typically adopted to generate pseudo masks. Yet we find that the crux of the unsatisfactory pseudo masks is the incompleteness of CAM: because convolutional neural networks tend to be dominated by specific regions in the high-confidence channels of feature maps during prediction, the extracted CAM covers only parts of the object. To address this issue, we propose Disturbed CAM (DCAM), a simple yet effective method for WSSS. Following CAM, we adopt a binary cross-entropy (BCE) loss to train a multi-label classification model. Then we disturb the feature map with retraining to enhance the high-confidence channels. In addition, a softmax cross-entropy (SCE) loss branch is employed to increase the model's attention to the target classes. Once converged, we extract DCAM in the same way as CAM. Evaluation on both PASCAL VOC and MS COCO shows that DCAM not only generates higher-quality masks (6.2% and 1.4% higher than the benchmark models) but also enables more accurate activation in object regions. The code is available at https://github.com/gyyang23/DCAM.
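For reference, the standard CAM extraction that DCAM builds on can be sketched as projecting the final feature map through the classifier weights (a generic formulation, not the DCAM-specific disturbance step):

```python
import torch

def extract_cam(feature_map, fc_weight, class_idx):
    # feature_map: [B, C, H, W]; fc_weight: [num_classes, C]
    cam = torch.einsum('bchw,c->bhw', feature_map, fc_weight[class_idx])
    cam = torch.relu(cam)
    flat = cam.flatten(1)
    lo = flat.min(dim=1).values.view(-1, 1, 1)
    hi = flat.max(dim=1).values.view(-1, 1, 1)
    return (cam - lo) / (hi - lo + 1e-6)     # normalize each map to [0, 1]

cam = extract_cam(torch.randn(2, 512, 7, 7), torch.randn(20, 512), class_idx=3)
print(cam.shape)                             # torch.Size([2, 7, 7])
```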

19.
Recent semantic segmentation frameworks usually combine low-level and high-level context information to achieve improved performance, and post-level context information is also considered. In this study, we present a Context ReFinement Network (CRFNet) and its training method to improve the semantic predictions of encoder–decoder segmentation models. Our study builds on postprocessing approaches that directly consider the relationship between spatially neighboring pixels of a label map, such as Markov random fields and conditional random fields. CRFNet comprises two modules: a refiner and a combiner, which respectively refine the context information from the output features of a conventional semantic segmentation model and combine the refined features with intermediate features from the decoding process to produce the final output. To train CRFNet to refine semantic predictions more accurately, we propose a sequential training scheme. Using various backbone networks (ENet, ERFNet, and HyperSeg), we extensively evaluated our model on three large-scale, real-world datasets to demonstrate the effectiveness of our approach.
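The refiner/combiner pair can be sketched as a small post-network over the segmentation logits (assumed structure and channel sizes, not the CRFNet release):

```python
import torch
import torch.nn as nn

class CRFNetLite(nn.Module):
    def __init__(self, num_classes, dec_ch):
        super().__init__()
        self.refiner = nn.Sequential(              # refine context of predictions
            nn.Conv2d(num_classes, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1))
        self.combiner = nn.Conv2d(64 + dec_ch, num_classes, 1)
    def forward(self, seg_logits, dec_feat):
        refined = self.refiner(seg_logits)
        # combine refined features with intermediate decoder features
        return self.combiner(torch.cat([refined, dec_feat], dim=1))

net = CRFNetLite(num_classes=19, dec_ch=128)
out = net(torch.randn(1, 19, 64, 64), torch.randn(1, 128, 64, 64))
print(out.shape)                                   # torch.Size([1, 19, 64, 64])
```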

20.
Spatial–temporal information is easy to obtain in a practical surveillance scene, yet it is neglected by most current person re-identification (ReID) methods. Employing spatial–temporal information as a constraint has been verified as beneficial for ReID; however, existing methods lack effective modeling of pedestrian movement patterns. In this paper, we present a ReID framework with internal and external spatial–temporal constraints, termed IESC-ReID. A novel residual spatial attention module is proposed to build the spatial–temporal constraint and increase robustness to partial occlusions and camera viewpoint changes. A Laplace-based spatial–temporal constraint is also introduced to eliminate irrelevant gallery images gathered by the internal learning network. IESC-ReID constrains the attention within the functioning range of the channel space and utilizes additional spatial–temporal constraints to further refine the results. Intensive experiments show that these constraints consistently improve performance. Extensive experimental results on numerous publicly available datasets show that the proposed method outperforms several state-of-the-art ReID algorithms. Our code is publicly available at https://github.com/jiaming-wang/IESC.
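One common way to realize a Laplace-based spatial–temporal constraint is to weight appearance similarity by a Laplace-shaped transfer-time prior per camera pair, so that implausible gallery candidates are suppressed. This is a hedged sketch of that general scheme (the fusion rule is an assumption, not the IESC-ReID code):

```python
import torch

def st_score(visual_sim, dt, mu, b):
    # visual_sim: [N] appearance similarities to gallery images
    # dt: [N] time gaps; mu, b: Laplace location/scale for this camera pair,
    # assumed estimated from training-set transfer times
    st_prob = torch.exp(-torch.abs(dt - mu) / b) / (2 * b)
    return visual_sim * st_prob / st_prob.max()   # down-weight implausible gaps

sim = torch.tensor([0.9, 0.8, 0.7])
dt = torch.tensor([30.0, 300.0, 35.0])            # seconds between cameras
print(st_score(sim, dt, mu=32.0, b=10.0))         # middle candidate suppressed
```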
