Similar Documents
20 similar documents retrieved.
1.
To enrich the feature information extracted from the text modality and to explore the semantic similarity between modalities in greater depth, this paper proposes a novel method named adaptive weight multi-channel center similar deep hashing (AMCDH). The algorithm first utilizes three channels with different configurations to extract feature information from the text modality, and then combines them according to learned weight ratios to increase the richness of the information. We also introduce the Jaccard coefficient to measure the semantic similarity between modalities on a scale from 0 to 1, and utilize it as the penalty coefficient of the cross-entropy loss function to increase its role in backpropagation. Besides, we propose a method of constructing center similarity, which pulls the hash codes of similar data pairs toward the same center point and scatters dissimilar data pairs across different center points, generating high-quality hash codes. Extensive experimental evaluations on four benchmark datasets show that the proposed AMCDH model significantly outperforms other competing baselines. The code can be obtained from https://github.com/DaveLiu6/AMCDH.git.
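The abstract does not give the exact loss formula; the following is a minimal sketch, under stated assumptions, of how a Jaccard coefficient over multi-hot label vectors could weight a pairwise cross-entropy loss. All names and the weighting scheme are illustrative, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def jaccard_weighted_bce(sim_logits, labels_a, labels_b):
    """Hypothetical sketch: weight a pairwise cross-entropy loss by the
    Jaccard coefficient of the two samples' multi-hot label vectors."""
    # Jaccard coefficient in [0, 1] between multi-hot label vectors
    inter = (labels_a * labels_b).sum(dim=1)
    union = ((labels_a + labels_b) > 0).float().sum(dim=1).clamp(min=1)
    jaccard = inter / union
    # Pairs sharing at least one label are treated as similar
    target = (jaccard > 0).float()
    bce = F.binary_cross_entropy_with_logits(sim_logits, target, reduction="none")
    # Use the Jaccard coefficient as the per-pair penalty weight (assumption)
    weight = torch.where(target > 0, jaccard, torch.ones_like(jaccard))
    return (weight * bce).mean()
```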

2.
Weakly supervised semantic segmentation (WSSS) with only image-level supervision is a challenging task. Most existing methods exploit Class Activation Maps (CAM) to generate pixel-level pseudo labels for supervised training. However, due to the local receptive field of Convolutional Neural Networks (CNNs), CAM applied to CNNs often suffers from partial activation, highlighting the most discriminative part instead of the entire object area. To capture both local features and global representations, the Conformer has been proposed to combine a visual transformer branch with a CNN branch. In this paper, we propose TransCAM, a Conformer-based solution to WSSS that explicitly leverages the attention weights from the transformer branch of the Conformer to refine the CAM generated from the CNN branch. TransCAM is motivated by our observation that attention weights from shallow transformer blocks capture low-level spatial feature similarities, while attention weights from deep transformer blocks capture high-level semantic context. Despite its simplicity, TransCAM achieves competitive performance of 69.3% and 69.6% mIoU on the PASCAL VOC 2012 validation and test sets, respectively, showing the effectiveness of transformer attention-based refinement of CAM for WSSS.
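A minimal sketch of attention-based CAM refinement, assuming the attention weights have already been averaged over heads and blocks and that patch tokens align with CAM pixels; variable names are illustrative, not TransCAM's actual code.

```python
import torch

def refine_cam_with_attention(cam, attn):
    """Sketch: refine a CAM with transformer attention weights.

    cam:  (C, H, W) class activation maps from the CNN branch
    attn: (N, N) attention weights averaged over heads/blocks,
          where N == H * W patch tokens (illustrative assumption)
    """
    C, H, W = cam.shape
    cam_flat = cam.reshape(C, H * W)           # (C, N)
    refined = cam_flat @ attn.T                # propagate activations between patches
    refined = refined.reshape(C, H, W)
    # Normalize each class map to [0, 1]
    refined = refined - refined.amin(dim=(1, 2), keepdim=True)
    refined = refined / refined.amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)
    return refined
```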

3.
Aerators are essential auxiliary devices in intensive aquaculture, especially in industrial aquaculture in China. In this paper, we propose a real-time expert system for anomaly detection of aerators based on computer vision and existing surveillance cameras. The expert system includes two modules, i.e., object region detection and working-state detection. First, we present a small-object region detection method based on the region-proposal idea. Moreover, we propose a novel algorithm, reference frame Kanade-Lucas-Tomasi (RF-KLT), for motion feature extraction in fixed regions. Then, we describe a time-series dimension-reduction method for establishing a feature dataset with obvious boundaries between classes. Finally, we use machine learning algorithms to build the feature classifier. The proposed expert system realizes real-time, robust and cost-free anomaly detection of aerators on both the actual video dataset and the augmented video dataset. A demo is available at https://youtu.be/xThHRwu_cnI.
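A minimal sketch of the reference-frame KLT idea using standard OpenCV calls: corners are detected once in a fixed reference frame and tracked into each new frame, with the mean displacement taken as a motion feature. The region handling, parameters, and summary statistic are assumptions for illustration.

```python
import cv2
import numpy as np

def rf_klt_motion_feature(reference_gray, frame_gray, region):
    """Sketch: track corners from a fixed reference frame into the
    current frame and summarize displacement as a motion feature."""
    x, y, w, h = region                       # fixed aerator region (assumed given)
    ref_roi = reference_gray[y:y + h, x:x + w]
    cur_roi = frame_gray[y:y + h, x:x + w]
    pts = cv2.goodFeaturesToTrack(ref_roi, maxCorners=100,
                                  qualityLevel=0.01, minDistance=5)
    if pts is None:
        return 0.0
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(ref_roi, cur_roi, pts, None)
    ok = status.ravel() == 1
    if not ok.any():
        return 0.0
    disp = np.linalg.norm((nxt - pts)[ok], axis=2)   # per-point displacement
    return float(disp.mean())                         # scalar motion feature
```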

4.
Deep networks have become a new favorite for person re-identification (Re-ID), where the research focus is how to effectively extract discriminative feature representations of pedestrians. In this paper, we propose a novel Re-ID network named improved ReIDNet (iReIDNet), which effectively extracts local and global multi-granular feature representations of pedestrians through a well-designed spatial feature transform and coordinate attention (SFTCA) mechanism together with an improved global pooling (IGP) method. SFTCA utilizes channel adaptability and spatial location to infer a 2D attention map and helps iReIDNet focus on the salient information contained in pedestrian images. IGP enables iReIDNet to capture the global information of the whole human body more effectively. Besides, to boost recognition accuracy, we develop a weighted joint loss to guide the training of iReIDNet. Comprehensive experiments demonstrate the effectiveness and superiority of iReIDNet over other Re-ID methods. The code is available at https://github.com/XuRuyu66/iReIDNet.
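The abstract does not specify the terms of the weighted joint loss; a common Re-ID combination is an identity cross-entropy term plus a batch-hard triplet term, sketched below under that assumption. The weights, margin, and mining scheme are all illustrative.

```python
import torch
import torch.nn.functional as F

def weighted_joint_loss(logits, embeddings, labels, w_id=1.0, w_tri=1.0):
    """Sketch of a weighted joint Re-ID loss (assumed composition)."""
    id_loss = F.cross_entropy(logits, labels)
    # Batch-hard triplet loss over pairwise embedding distances
    dist = torch.cdist(embeddings, embeddings)          # (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # (B, B) bool
    hardest_pos = (dist * same.float()).max(dim=1).values
    inf = torch.full_like(dist, float("inf"))
    hardest_neg = torch.where(same, inf, dist).min(dim=1).values
    tri_loss = F.relu(hardest_pos - hardest_neg + 0.3).mean()  # margin 0.3 is illustrative
    return w_id * id_loss + w_tri * tri_loss
```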

5.
In skeleton-based action recognition, CNN-based methods represent the skeleton data as a pseudo-image for processing. However, how to construct the pseudo-image to model the spatial dependencies of the skeletal data remains a critical issue. To address it, we propose a novel convolutional neural network with an adaptive inferential framework (AIF-CNN) to exploit the dependencies among skeleton joints. We particularly investigate several initialization strategies to make the AIF effective, each strategy introducing different prior knowledge. Extensive experiments on the NTU RGB+D and Kinetics-Skeleton datasets demonstrate that performance improves significantly when the different prior information is integrated. The source code is available at: https://github.com/hhe-distance/AIF-CNN.
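A minimal sketch of the common pseudo-image construction the abstract refers to: coordinates become channels, frames become rows, joints become columns. The optional joint re-ordering stands in for prior knowledge about joint adjacency; it is an illustrative assumption, not the paper's AIF mechanism.

```python
import numpy as np

def skeleton_to_pseudo_image(skeleton, joint_order=None):
    """Sketch: arrange skeleton data (T frames, J joints, 3 coords)
    into a pseudo-image of shape (3, T, J) for a CNN."""
    T, J, C = skeleton.shape
    if joint_order is not None:
        skeleton = skeleton[:, joint_order, :]  # re-order joints by a prior
    # Channels = coordinates, rows = time, columns = joints
    image = skeleton.transpose(2, 0, 1)         # (3, T, J)
    # Normalize each channel to [0, 1] for network input
    mn = image.min(axis=(1, 2), keepdims=True)
    mx = image.max(axis=(1, 2), keepdims=True)
    return (image - mn) / np.maximum(mx - mn, 1e-6)
```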

6.
7.
Although convolutional neural networks (CNNs) have recently shown considerable progress in motion deblurring, most existing methods that adopt multi-scale input schemes still struggle to accurately restore heavily blurred regions in blurry images. Several recent methods aim to further improve the deblurring effect using larger and more complex models, but these inevitably incur huge computing costs. To address the performance-complexity trade-off, we propose a multi-stage feature-fusion dense network (MFFDNet) for motion deblurring. Each sub-network of MFFDNet has a similar structure and the same input scale. Meanwhile, we propose a feature-fusion dense connection structure to reuse the extracted features, thereby improving the deblurring effect. Moreover, instead of using a multi-scale loss function, we only calculate the loss at the output of the last stage, since the input scale of our sub-networks is invariant. Experimental results show that MFFDNet maintains a relatively small computing cost while outperforming state-of-the-art motion-deblurring methods. The source code is publicly available at: https://github.com/CaiGuoHS/MFFDNet_release.
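A minimal sketch of a feature-fusion dense connection, assuming the DenseNet-style pattern the abstract's wording suggests: each stage consumes the concatenation of all earlier features, fused back to a fixed width by a 1x1 convolution. Layer counts and widths are illustrative.

```python
import torch
import torch.nn as nn

class DenseFeatureFusion(nn.Module):
    """Sketch: dense connections that reuse all earlier stage features."""

    def __init__(self, channels=64, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_stages))
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels * (i + 1), channels, 1) for i in range(num_stages))

    def forward(self, x):
        feats = []
        for stage, fuse in zip(self.stages, self.fuse):
            feats.append(x)
            # Fuse all accumulated features, then extract the next stage
            x = torch.relu(stage(fuse(torch.cat(feats, dim=1))))
        return x
```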

8.
Recently, there has been a trend in tracking to use a more refined segmentation mask instead of a coarse bounding box to represent the target object. Some trackers add segmentation branches to the tracking framework while maintaining real-time speed. However, these trackers use a simple FCN structure and lack edge-information modeling, which makes their performance unsatisfactory. In this paper, we propose an edge-aware segmentation network, which uses the complementarity between target information and edge information to provide a more refined representation of the target. First, we fuse the high-level features of the tracking backbone with the correlation features of the tracking framework's classification branch, supervised simultaneously by the target edge and the target segmentation mask, to obtain optimized high-level features carrying rough edge and target information. Second, we use the optimized high-level features to guide the low-level features of the tracking backbone to generate more refined edge features. Finally, we fuse the refined edge features with the target features of each layer to generate the final mask. Our approach achieves leading performance on the recent pixel-wise object tracking benchmark VOT2020 and the segmentation datasets DAVIS2016 and DAVIS2017 while running at 47 fps. Code is available at https://github.com/TJUMMG/EATtracker.
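A minimal sketch of the final fusion step, assuming a simple concatenate-and-convolve design for combining edge features with target features into mask logits; the channel widths and layer choices are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EdgeTargetFusion(nn.Module):
    """Sketch: fuse edge features with target features to predict a mask."""

    def __init__(self, channels=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1))  # single-channel mask logits

    def forward(self, edge_feat, target_feat):
        return self.fuse(torch.cat([edge_feat, target_feat], dim=1))
```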

9.
Current state-of-the-art two-stage models for instance segmentation suffer from several types of imbalance. In this paper, we address the Intersection over Union (IoU) distribution imbalance of positive input Regions of Interest (RoIs) during the training of the second stage. Our Self-Balanced R-CNN (SBR-CNN), an evolved version of the Hybrid Task Cascade (HTC) model, introduces new loop mechanisms for bounding-box and mask refinement. With an improved Generic RoI Extraction (GRoIE), we also address the feature-level imbalance at the Feature Pyramid Network (FPN) level, which originates from a non-uniform integration of low- and high-level features from the backbone layers. In addition, redesigning the architecture heads toward a fully convolutional approach with FCC further reduces the number of parameters and provides more insight into the connection between the task to solve and the layers used. Moreover, our SBR-CNN model shows the same or even better improvements when adopted in conjunction with other state-of-the-art models. With a lightweight ResNet-50 backbone, evaluated on the COCO minival 2017 dataset, our model reaches 45.3% and 41.5% AP for object detection and instance segmentation, with 12 epochs and without extra tricks. The code is available at https://github.com/IMPLabUniPr/mmdetection/tree/sbr_cnn.
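A minimal sketch of one way to flatten an IoU distribution of positive RoIs: sample an equal number from each IoU interval. The binning scheme is an illustrative assumption, not SBR-CNN's actual mechanism.

```python
import numpy as np

def iou_balanced_sample(rois, ious, num_samples, bins=3, lo=0.5, hi=1.0):
    """Sketch: draw equally from each IoU bin to balance positive RoIs."""
    edges = np.linspace(lo, hi, bins + 1)
    per_bin = num_samples // bins
    chosen = []
    for b in range(bins):
        mask = (ious >= edges[b]) & (ious < edges[b + 1])
        idx = np.flatnonzero(mask)
        if len(idx) > 0:
            take = min(per_bin, len(idx))
            chosen.append(np.random.choice(idx, take, replace=False))
    return rois[np.concatenate(chosen)] if chosen else rois[:0]
```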

10.
In the security field, faces captured by outdoor surveillance cameras are usually blurry, occluded, small, and in diverse poses, affected by external factors such as camera pose, camera range, and weather conditions. This can be described as a hard face detection problem in natural images. To solve it, we propose a deep convolutional neural network named feature hierarchy encoder–decoder network (FHEDN). It is motivated by two observations concerning contextual semantic information and the mechanism of multi-scale face detection. The proposed network is a single-stage, scale-variant architecture composed of encoder and decoder subnetworks. Based on the assumption that contextual semantic information around a face aids detection, we introduce a residual mechanism to fuse context prior-based information into the face features and formulate a learning chain to train each encoder–decoder pair. In addition, we discuss some important implementation details, such as the distribution of the training dataset, the scale of the feature hierarchy, and anchor box sizes, which affect the detection performance of the final network. Compared with some state-of-the-art algorithms, our method achieves promising performance on the popular benchmarks, including AFW, PASCAL FACE, FDDB, and WIDER FACE. Consequently, the proposed approach can be efficiently implemented and routinely applied to detect faces with severe occlusion and arbitrary pose variations in unconstrained scenes. Our code and results are available at https://github.com/zzxcoder/EvaluationFHEDN.
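A minimal sketch of residually fusing a context prior into face features, as the abstract describes at a high level; the projection layer and activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualContextFusion(nn.Module):
    """Sketch: face features plus a projected context prior (residual)."""

    def __init__(self, channels=64):
        super().__init__()
        self.context_proj = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, face_feat, context_feat):
        # Residual mechanism: add the projected context onto the face feature
        return face_feat + torch.relu(self.context_proj(context_feat))
```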

11.
Object detection across different scales is challenging because of the large variance in object scales. Thus, a novel detection network, the Top-Down Feature Fusion Single Shot MultiBox Detector (TDFSSD), is proposed. The network is based on the Single Shot MultiBox Detector (SSD) with a VGG-16 backbone and a novel, simple yet efficient feature fusion module, namely the Top-Down Feature Fusion Module. The module iteratively fuses higher-level features, which contain semantic information, into lower-level features, which contain boundary information. Extensive experiments have been conducted on the PASCAL VOC2007, PASCAL VOC2012, and MS COCO datasets to demonstrate the efficiency of the proposed method. The TDFSSD network is trained end to end and outperforms state-of-the-art methods across the three datasets. It achieves 81.7% and 80.1% mAP on VOC2007 and VOC2012 respectively, outperforming the reported best results of both one-stage and two-stage frameworks. Meanwhile, it achieves 33.4% mAP on MS COCO test-dev, notably 17.2% average precision (AP) on small objects. All results show the effectiveness of the proposed method for object detection. Code and model are available at: https://github.com/dongfengxijian/TDFSSD.
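A minimal sketch of one top-down fusion step in the FPN style the module's name suggests: the higher-level (semantic) map is channel-aligned, upsampled, and merged into the lower-level (boundary) map. Channel widths and the merge operator are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Sketch: upsample a semantic feature map and fuse it into a
    lower-level boundary feature map."""

    def __init__(self, high_ch, low_ch):
        super().__init__()
        self.lateral = nn.Conv2d(high_ch, low_ch, 1)   # align channel widths
        self.smooth = nn.Conv2d(low_ch, low_ch, 3, padding=1)

    def forward(self, high, low):
        up = F.interpolate(self.lateral(high), size=low.shape[-2:], mode="nearest")
        return self.smooth(low + up)
```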

12.
Many image co-segmentation algorithms have been proposed over the last decade. In this paper, we present a new dataset for evaluating co-segmentation algorithms, which contains 889 image groups of 18 images each, with pixel-wise hand-annotated ground truths. The dataset is characterized by simple backgrounds of nearly a single color. It looks simple but is actually very challenging for current co-segmentation algorithms because of four difficult cases it contains: foreground easily confused with the background, transparent regions in objects, minor holes in objects, and shadows. To test the usefulness of our dataset, we review the state-of-the-art co-segmentation algorithms and evaluate seven of them on it. The performance obtained by each algorithm is compared with that previously reported on datasets with complex backgrounds. The results prove that our dataset is valuable for the development of co-segmentation techniques: it is more feasible to solve the four difficulties above on simple backgrounds first and then extend the solutions to complex-background problems. Our dataset can be freely downloaded from: http://www.iscbit.org/source/MLMR-COS.zip.
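For reference, a minimal sketch of the standard Jaccard (IoU) metric commonly used to score segmentation masks against pixel-wise ground truth; the abstract does not state which metric the evaluation uses, so this is an assumed example.

```python
import numpy as np

def jaccard_index(pred_mask, gt_mask):
    """Jaccard (IoU) between a predicted and a ground-truth binary mask."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0
```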

13.
Keypoint-based object detection achieves good performance without positioning calculations and extensive predictions. However, such detectors rely on heavy backbones, and high resolution is restored via upsampling, which yields unreliable features. We propose a self-constrained parallelism keypoint-based lightweight object detection network (SCPNet), which speeds up inference, reduces parameters, widens receptive fields, and makes predictions more accurate. Specifically, the parallel multi-scale fusion module (PMFM) with parallel shuffle blocks (PSB) adopts a parallel structure to obtain reliable features and reduce depth, and uses repeated multi-scale fusion to avoid too many parallel branches. The self-constrained detection module (SCDM) has a two-branch structure: one branch predicts corners and employs an entad offset to match high-quality corner pairs, while the other predicts center keypoints. The distances between the paired corners' geometric centers and the center keypoints are used for self-constrained detection. On MS-COCO 2017 and PASCAL VOC, SCPNet's results are competitive with state-of-the-art lightweight object detectors. Code: https://github.com/mengdie-wang/SCPNet.git.
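A minimal sketch of the self-constraint described in the last step: a (top-left, bottom-right) corner pair is kept only if its geometric center lies close to some predicted center keypoint. The brute-force pairing, shapes, and threshold are illustrative assumptions.

```python
import numpy as np

def self_constrained_pairs(tl_corners, br_corners, centers, tol=5.0):
    """Sketch: filter corner pairs by distance from their geometric
    center to the nearest predicted center keypoint."""
    boxes = []
    for tl in tl_corners:                 # each corner: array (x, y)
        for br in br_corners:
            if br[0] <= tl[0] or br[1] <= tl[1]:
                continue                  # not a valid box
            mid = (tl + br) / 2.0
            # Distance to the nearest center keypoint
            d = np.linalg.norm(centers - mid, axis=1).min()
            if d < tol:
                boxes.append((*tl, *br))
    return np.array(boxes)
```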

14.
Video object segmentation, which aims to segment the foreground objects given the annotation of the first frame, has been attracting increasing attention. Many state-of-the-art approaches achieve great performance by relying on online model updating or mask-propagation techniques. However, most online models require high computational cost due to model fine-tuning during inference, while most mask-propagation-based models are faster but perform relatively poorly because they fail to adapt to object appearance variation. In this paper, we aim to design a new model that strikes a good balance between speed and performance. We propose a model, called NPMCA-net, which directly localizes foreground objects based on mask propagation and a non-local technique, matching pixels between reference and target frames. Since we bring in information from both the first and the previous frames, our network is robust to large object appearance variation and can better adapt to occlusions. Extensive experiments show that our approach achieves a new state-of-the-art performance with fast speed (86.5% IoU on DAVIS-2016 and 72.2% IoU on DAVIS-2017, at 0.11 s per frame) under comparable settings. Source code is available at https://github.com/siyueyu/NPMCA-net.
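A minimal sketch of non-local pixel matching between a reference and a target frame: every target pixel attends to all reference pixels, and the reference mask is propagated through the attention weights. Shapes, normalization, and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nonlocal_pixel_matching(ref_feat, ref_mask, tgt_feat):
    """Sketch: propagate a reference mask to the target frame via
    non-local feature matching. feats: (C, H, W); mask: (1, H, W)."""
    C, H, W = ref_feat.shape
    ref = ref_feat.reshape(C, -1)                      # (C, N)
    tgt = tgt_feat.reshape(C, -1)                      # (C, N)
    attn = F.softmax(tgt.T @ ref / C ** 0.5, dim=1)    # (N, N) target -> reference
    mask = ref_mask.reshape(1, -1)                     # (1, N)
    propagated = attn @ mask.T                         # (N, 1)
    return propagated.T.reshape(1, H, W)
```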

15.
Multi-label classification with region-free labels is attracting increasing attention compared with region-based labels, since manual region labeling is time-consuming. Existing methods usually employ attention-based techniques to discover conspicuous label-related regions in a weakly supervised manner with only image-level region-free labels, but the region coverage is imprecise without exploring global clues from multi-level features. To address this issue, a novel Global-guided Weakly-Supervised Learning (GWSL) method for multi-label classification is proposed. GWSL first extracts multi-level features to estimate their global correlation map, which is then utilized to guide feature disentanglement in the proposed Feature Disentanglement and Localization (FDL) networks. Specifically, the FDL networks adaptively combine the differently correlated features and localize fine-grained features for identifying multiple labels. The proposed method is optimized end to end under weak supervision with only image-level labels. Experimental results demonstrate that the proposed method outperforms the state of the art for multi-label learning on several publicly available image datasets. To facilitate future research, the code is available online at https://github.com/Yong-DAI/GWSL.
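A minimal sketch of one way to form a global correlation map from multi-level features: resize each level to a common grid, compute pairwise cosine similarity among spatial positions, and average. This construction is an illustrative assumption, not GWSL's actual estimator.

```python
import torch
import torch.nn.functional as F

def global_correlation_map(feats):
    """Sketch: average per-level spatial correlation into one guidance map.

    feats: list of feature maps, each of shape (1, C_i, H_i, W_i).
    """
    size = feats[-1].shape[-2:]
    resized = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
               for f in feats]
    maps = []
    for f in resized:
        x = F.normalize(f.flatten(2).squeeze(0), dim=0)  # (C_i, N), unit columns
        corr = x.T @ x                                   # (N, N) cosine similarity
        maps.append(corr.mean(dim=1).reshape(size))      # global correlation per pixel
    return torch.stack(maps).mean(dim=0)                 # (H, W) guidance map
```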

16.
Existing deraining methods based on convolutional neural networks (CNNs) have achieved great success, but residual rain streaks can still degrade images drastically. In this work, we propose an end-to-end multi-scale context information and attention network, called MSCIANet. The proposed network consists of a multi-scale feature extraction (MSFE) subnetwork and a multi-receptive-field feature extraction (MRFFE) subnetwork. First, the MSFE picks up features of rain streaks at different scales and propagates the deep features of the two layers across stages through skip connections. Second, the MRFFE refines details of the background via an attention mechanism and depthwise separable convolutions with different receptive fields at different scales. Finally, the fused outputs of the two subnetworks reconstruct the clean background image. Extensive experimental results show that the proposed network performs well on the deraining task on both synthetic and real-world datasets. A demo is available at https://github.com/CoderLi365/MSCIANet.
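A minimal sketch of obtaining different receptive fields with depthwise separable convolutions, assuming dilation is the mechanism that varies the receptive field; widths, dilations, and the fusion layer are illustrative.

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldBlock(nn.Module):
    """Sketch: parallel depthwise separable convs with different
    dilation rates, fused by a 1x1 convolution."""

    def __init__(self, channels=32, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels),          # depthwise
                nn.Conv2d(channels, channels, 1))    # pointwise
            for d in dilations)
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```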

17.
Saliency prediction accuracy has improved rapidly with the development of deep learning, but inference speed remains slow because networks keep getting deeper. Hence, this paper proposes a fast saliency prediction model. Concretely, a siamese network backbone based on a tailored EfficientNetV2 accelerates inference while maintaining high performance, and the shared-parameter strategy further curbs parameter growth. Furthermore, we add multi-channel activation maps to refine the fine features across different channels and low-level visual features, which improves the interpretability of the model. Extensive experiments show that the proposed model achieves competitive performance on standard benchmark datasets and demonstrate the effectiveness of our method in striking a balance between prediction accuracy and inference speed. Moreover, the small model size allows our method to be deployed on edge devices. The code is available at: https://github.com/lscumt/fast-fixation-prediction.
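A minimal sketch of the shared-parameter siamese idea: one set of backbone weights processes both inputs, so parameters do not grow with the number of branches. The tiny conv stack below merely stands in for a tailored EfficientNetV2; everything here is illustrative.

```python
import torch
import torch.nn as nn

class SiameseSaliencyBackbone(nn.Module):
    """Sketch: two branches reuse the same backbone weights."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, x_a, x_b):
        # Both inputs share one set of parameters
        return self.backbone(x_a), self.backbone(x_b)
```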

18.
We develop a full-reference (FR) video quality assessment framework that integrates analysis of space–time slices (STSs) with frame-based image quality assessment (IQA) to form a high-performance video quality predictor. The approach first arranges the reference and test video sequences into a space–time slice representation. To more comprehensively characterize space–time distortions, a collection of distortion-aware maps is computed on each reference–test video pair. These reference-distorted maps are then processed using a standard image quality model, such as peak signal-to-noise ratio (PSNR) or Structural Similarity (SSIM). A simple learned pooling strategy combines the multiple IQA outputs into a final video quality score. This leads to an algorithm called Space–Time Slice PSNR (STS-PSNR), which we thoroughly tested on three publicly available video quality assessment databases and found to deliver significantly better performance than state-of-the-art video quality models. Source code for STS-PSNR is freely available at: http://live.ece.utexas.edu/research/Quality/STS-PSNR_release.zip.
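A minimal sketch of the two canonical space-time slices of a grayscale video volume, plus the standard PSNR that would score a reference slice against a distorted one. The middle-row/middle-column choice is an illustrative simplification.

```python
import numpy as np

def space_time_slices(video):
    """Sketch: x-t and y-t slices from a video volume of shape (T, H, W)."""
    T, H, W = video.shape
    xt = video[:, H // 2, :]    # (T, W) horizontal slice over time
    yt = video[:, :, W // 2]    # (T, H) vertical slice over time
    return xt, yt

def psnr(ref, dist, peak=255.0):
    """Standard PSNR between a reference and a distorted array."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```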

19.
Semantic segmentation aims to map each pixel of an image to its corresponding semantic label. Most existing methods either concentrate mainly on high-level features or simply combine low-level and high-level features from backbone convolutional networks, which may weaken or even ignore the complementarity between different levels. To effectively take advantage of both shallow (textural) and deep (semantic) features, this paper proposes a novel plug-and-play module, namely the feature enhancement module (FEM). The proposed FEM first uses an information extractor to extract the desired details or semantics from different stages, and then enhances the target features by incorporating the extracted information. Two types of FEM, i.e., detail FEM and semantic FEM, can be customized. Concretely, the former strengthens textural information to protect key but tiny/low-contrast details from suppression or removal, while the latter highlights structural information to boost segmentation performance. By equipping a given backbone network with FEMs, two information flows are formed, i.e., a detail flow and a semantic flow. Extensive experiments on the Cityscapes, ADE20K, and PASCAL Context datasets validate the effectiveness of our design. The code has been released at https://github.com/SuperZ-Liu/FENet.
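A minimal sketch of the extract-then-enhance pattern the abstract describes: a message is extracted from a source stage, resized to the target stage, and used to enhance the target features. The gating design is an illustrative assumption, not the paper's exact FEM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhancementModule(nn.Module):
    """Sketch: extract a message from one stage and enhance another."""

    def __init__(self, src_ch, tgt_ch):
        super().__init__()
        self.extract = nn.Conv2d(src_ch, tgt_ch, 1)   # information extractor
        self.gate = nn.Conv2d(tgt_ch, tgt_ch, 3, padding=1)

    def forward(self, src_feat, tgt_feat):
        msg = F.interpolate(self.extract(src_feat), size=tgt_feat.shape[-2:],
                            mode="bilinear", align_corners=False)
        # Enhance the target features with the gated extracted message
        return tgt_feat + torch.sigmoid(self.gate(msg)) * msg
```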

20.
Knowledge distillation has become a key technique for building smart, light-weight networks through model compression and transfer learning. Unlike previous methods that applied knowledge distillation to classification tasks, we propose a decomposition-and-replacement based distillation scheme for depth estimation from a single RGB color image. To this end, Laplacian pyramid-based knowledge distillation is first presented in this paper. The key idea of the proposed method is to transfer the rich knowledge of scene depth, which is well encoded by the teacher network, to the student network in a structured way by decomposing it into global context and local details. This helps the student network restore the depth layout more accurately with limited resources. Moreover, we propose a new guidance concept for knowledge distillation, called ReplaceBlock, which replaces randomly selected blocks in the decoded feature of the student network with those of the teacher network. ReplaceBlock smooths the learning of the teacher network's feature distribution by considering spatial contiguity in the feature space, and also helps restore the depth layout clearly without significant computational cost. Experimental results on benchmark datasets demonstrate in detail the effectiveness of our distillation scheme for monocular depth estimation. The code and model are publicly available at: https://github.com/tjqansthd/Lap_Rep_KD_Depth.
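A minimal sketch of the ReplaceBlock idea as the abstract states it: spatially contiguous blocks of the student's decoded feature are randomly swapped with the teacher's. The block size, probability, and the assumption that both features share a shape are illustrative.

```python
import torch

def replace_block(student_feat, teacher_feat, block=4, p=0.3):
    """Sketch: randomly replace spatial blocks of the student feature
    with the teacher's (features assumed to have identical shapes)."""
    B, C, H, W = student_feat.shape
    out = student_feat.clone()
    for y in range(0, H, block):
        for x in range(0, W, block):
            if torch.rand(1).item() < p:
                out[:, :, y:y + block, x:x + block] = \
                    teacher_feat[:, :, y:y + block, x:x + block]
    return out
```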
