Similar literature
1.
Owing to the storage and retrieval efficiency of hashing and the highly discriminative features extracted by deep neural networks, deep cross-modal hashing retrieval has attracted increasing attention in recent years. However, most existing deep cross-modal hashing methods simply use a single label to measure the semantic relevance across modalities directly and neglect the potential contributions of multiple category labels. To improve the accuracy of cross-modal hashing retrieval by fully exploring the semantic relevance based on the multiple labels of the training data, we propose a multi-label semantics preserving based deep cross-modal hashing (MLSPH) method. MLSPH first uses the multi-labels of instances to calculate the semantic similarity of the original data. Subsequently, a memory bank mechanism is introduced to preserve the multi-label semantic similarity constraints and enforce the distinctiveness of the learned hash representations over the whole training batch. Extensive experiments on several benchmark datasets show that MLSPH surpasses prominent baselines and reaches state-of-the-art performance in cross-modal hashing retrieval. Code is available at: https://github.com/SWU-CS-MediaLab/MLSPH.
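
The abstract does not give the formula MLSPH uses for multi-label semantic similarity. As a minimal sketch under that caveat, one common choice is the cosine similarity between multi-hot label vectors, which yields a graded score instead of a binary "shares a label or not" flag. The function name `label_similarity` and the toy label vectors are assumptions made for illustration.

```python
import math


def label_similarity(labels_a, labels_b):
    """Cosine similarity between two multi-hot label vectors (lists of 0/1).

    Hypothetical helper: the abstract only says that multi-labels are used
    to compute semantic similarity, not which measure is applied.
    """
    dot = sum(a * b for a, b in zip(labels_a, labels_b))
    norm_a = math.sqrt(sum(a * a for a in labels_a))
    norm_b = math.sqrt(sum(b * b for b in labels_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an instance without labels is treated as unrelated
    return dot / (norm_a * norm_b)


# An image and a text that each carry two labels and share one of them.
image_labels = [1, 1, 0, 0]
text_labels = [1, 0, 1, 0]
sim = label_similarity(image_labels, text_labels)
print(round(sim, 3))
```

A single-label scheme would score this pair as either fully similar or fully dissimilar; the graded value reflects the partial overlap.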

2.
In the field of weakly supervised semantic segmentation (WSSS), Class Activation Maps (CAM) are typically adopted to generate pseudo masks. Yet, we find that the crux of the unsatisfactory pseudo masks is the incomplete CAM. Specifically, as convolutional neural networks tend to be dominated by the specific regions in the high-confidence channels of feature maps during prediction, the extracted CAM contains only parts of the object. To address this issue, we propose the Disturbed CAM (DCAM), a simple yet effective method for WSSS. Following CAM, we adopt a binary cross-entropy (BCE) loss to train a multi-label classification model. Then, we disturb the feature map with retraining to enhance the high-confidence channels. In addition, a softmax cross-entropy (SCE) loss branch is employed to increase the model's attention to the target classes. Once converged, we extract DCAM in the same way as in CAM. The evaluation on both PASCAL VOC and MS COCO shows that DCAM not only generates high-quality masks (6.2% and 1.4% higher than the benchmark models), but also enables more accurate activation in object regions. The code is available at https://github.com/gyyang23/DCAM.
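
For readers unfamiliar with the extraction step the abstract refers to, the sketch below shows plain CAM only: the map for one class is the sum of the final feature channels weighted by that class's classifier weights, followed by ReLU and max-normalization. The disturbance and retraining steps of DCAM are not shown, and the function name and toy inputs are assumptions.

```python
def class_activation_map(feature_maps, class_weights):
    """Plain CAM for one class.

    feature_maps: list of C channels, each an H x W list of lists.
    class_weights: list of C classifier weights for the target class.
    Returns an H x W map scaled to [0, 1].
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    # Weighted sum over channels.
    for channel, weight in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * channel[i][j]
    # Keep positive evidence only, then normalize by the peak value.
    cam = [[max(value, 0.0) for value in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Two toy 2x2 channels that respond to different corners of the image.
channels = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
cam = class_activation_map(channels, [1.0, 0.5])
print(cam)
```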

3.
As the demand for realistic representation and its applications increases rapidly, 3D human modeling from a single RGB image has become an essential technique. Owing to the great success of deep neural networks, various learning-based approaches have been introduced for this task. However, partial occlusions still make it difficult to estimate the 3D human model accurately. In this letter, we propose the part-attentive kinematic regressor for 3D human modeling. The key idea of the proposed method is to predict body part attentions based on each body center position and estimate the parameters of the 3D human model from the corresponding attentive features through the kinematic chain-based decoder in a one-stage fashion. One important advantage is that the proposed method yields natural shapes and poses even under severe occlusions. Experimental results on benchmark datasets show that the proposed method is effective for 3D human modeling in complicated real-world environments. The code and model are publicly available at: https://github.com/DCVL-3D/PKCN_release

4.
Infrared dim and small target detection is a key technology for space-based infrared search and tracking systems. Traditional detection methods have a high false alarm rate and fail to handle complex backgrounds and high-noise scenarios. They also cannot effectively detect small-scale targets. In this paper, a U-Transformer method is proposed, and a transformer is introduced into infrared dim and small target detection. First, a U-shaped network is constructed. In the encoder part, the self-attention mechanism is used for infrared dim and small target feature extraction, which helps to solve the problem of deep networks losing dim and small target features. Meanwhile, by using the encoding and decoding structure, infrared dim and small target features are filtered from the complex background while the shallow features and semantic information of the target are retained. Experiments show that anchor-free and transformer have great potential for infrared dim and small target detection. On the datasets with a complex background, our method outperforms the state-of-the-art detectors and meets the real-time requirement. The code is publicly available at https://github.com/Linaom1214/U-Transformer.

5.
Knowledge distillation has become a key technique for making smart and light-weight networks through model compression and transfer learning. Unlike previous methods that applied knowledge distillation to the classification task, we propose to exploit a decomposition-and-replacement based distillation scheme for depth estimation from a single RGB color image. To do this, Laplacian pyramid-based knowledge distillation is first presented in this paper. The key idea of the proposed method is to transfer the rich knowledge of the scene depth, which is well encoded by the teacher network, to the student network in a structured way by decomposing it into the global context and local details. This helps the student network restore the depth layout more accurately with limited resources. Moreover, we propose a new guidance concept for knowledge distillation, called ReplaceBlock, which replaces randomly selected blocks in the decoded feature of the student network with those of the teacher network. ReplaceBlock has a smoothing effect on learning the feature distribution of the teacher network by considering spatial contiguity in the feature space. This process also helps to restore the depth layout clearly without significant computational cost. Various experimental results on benchmark datasets demonstrate in detail the effectiveness of our distillation scheme for monocular depth estimation. The code and model are publicly available at: https://github.com/tjqansthd/Lap_Rep_KD_Depth.
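
The ReplaceBlock idea described above can be illustrated on a single 2D feature map. This is a rough sketch under stated assumptions, not the authors' implementation: the grid of non-overlapping square blocks, the per-block replacement probability, and the names `replace_block`, `block_size`, and `replace_prob` are all hypothetical, since the abstract only says that randomly selected blocks are swapped.

```python
import random


def replace_block(student, teacher, block_size, replace_prob, rng):
    """Return a copy of `student` with random blocks taken from `teacher`.

    student, teacher: H x W lists of lists with identical shapes.
    block_size, replace_prob: hypothetical knobs chosen for this sketch.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    height, width = len(student), len(student[0])
    mixed = [row[:] for row in student]
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            # Decide once per block so that replaced regions stay contiguous.
            if rng.random() < replace_prob:
                for i in range(top, min(top + block_size, height)):
                    for j in range(left, min(left + block_size, width)):
                        mixed[i][j] = teacher[i][j]
    return mixed


student = [[0.0] * 4 for _ in range(4)]
teacher = [[1.0] * 4 for _ in range(4)]
mixed = replace_block(student, teacher, block_size=2, replace_prob=0.5,
                      rng=random.Random(0))
for row in mixed:
    print(row)
```

Replacing whole blocks rather than individual positions keeps neighbouring features from the same network, which matches the abstract's emphasis on spatial contiguity.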

6.
The precision of saliency prediction has improved rapidly with the development of deep learning technology, but inference is slow because networks keep getting deeper. Hence, this paper proposes a fast saliency prediction model. Concretely, the siamese network backbone based on a tailored EfficientNetV2 accelerates inference while maintaining high performance. The shared parameters strategy further curbs parameter growth. Furthermore, we add multi-channel activation maps to optimize the fine features considering different channels and low-level visual features, which improves the interpretability of the model. Extensive experiments show that the proposed model achieves competitive performance on the standard benchmark datasets and demonstrate the effectiveness of our method in striking a balance between prediction accuracy and inference speed. Moreover, the small model size allows our method to be applied on edge devices. The code is available at: https://github.com/lscumt/fast-fixation-prediction.

7.
Existing deraining methods based on convolutional neural networks (CNNs) have achieved great success, but remaining rain streaks can still degrade images drastically. In this work, we propose an end-to-end multi-scale context information and attention network, called MSCIANet. The proposed network consists of multi-scale feature extraction (MSFE) and multi-receptive fields feature extraction (MRFFE). First, the MSFE picks up features of rain streaks at different scales and propagates deep features of the two layers across stages through skip connections. Second, the MRFFE refines details of the background using an attention mechanism and depthwise separable convolutions with different receptive fields at different scales. Finally, fusing the outputs of the two subnetworks reconstructs the clean background image. Extensive experimental results show that the proposed network performs well on the deraining task on synthetic and real-world datasets. The demo is available at https://github.com/CoderLi365/MSCIANet.

8.
This paper presents a novel No-Reference Video Quality Assessment (NR-VQA) model that uses the proposed 3D steerable wavelet transform-based Natural Video Statistics (NVS) features as well as human perceptual features. Additionally, we propose a novel two-stage regression scheme that significantly improves the overall performance of quality estimation. In the first stage, transform-based NVS and human perceptual features are separately passed through the proposed hybrid regression scheme: Support Vector Regression (SVR) followed by polynomial curve fitting. The two visual quality scores predicted in the first stage are then used as features for a similar second stage, which predicts the final quality scores of distorted videos and thereby achieves score-level fusion. Extensive experiments were conducted using five authentic and four synthetic distortion databases. Experimental results demonstrate that the proposed method outperforms other published state-of-the-art benchmark methods on synthetic distortion databases and is among the top performers on authentic distortion databases. The source code is available at https://github.com/anishVNIT/two-stage-vqa.

9.
Recently, the vision transformer has achieved a breakthrough in image recognition. Its self-attention mechanism (MSA) can extract discriminative token information from different patches to improve image classification accuracy. However, the classification token in its deep layers ignores the local features between layers. In addition, the patch embedding layer feeds fixed-size patches into the network, which inevitably introduces additional image noise. Therefore, we propose a hierarchical attention vision transformer (HAVT) based on the transformer framework. First, we present an attention cropping data augmentation method that crops and drops image noise and forces the network to learn key features. Second, the hierarchical attention selection (HAS) module is proposed, which improves the network's ability to learn discriminative tokens between layers by filtering and fusing tokens between layers. Experimental results show that the proposed HAVT outperforms state-of-the-art approaches and significantly improves the accuracy to 91.8% and 91.0% on CUB-200-2011 and Stanford Dogs, respectively. We have released our source code on GitHub: https://github.com/OhJackHu/HAVT.git.

10.
Generative Adversarial Networks (GANs) have opened a new direction for tackling the image-to-image transformation problem. Different GANs use generator and discriminator networks with different losses in the objective function. Still, there is a gap to fill in terms of both the quality of the generated images and their closeness to the ground-truth images. In this work, we introduce a new image-to-image transformation network named Cyclic Discriminative Generative Adversarial Networks (CDGAN) that fills these gaps. The proposed CDGAN generates higher-quality and more realistic images by adding discriminator networks for cycled images to the original CycleGAN architecture. CDGAN is tested on three image-to-image transformation datasets, and the quantitative and qualitative results are analyzed and compared with those of state-of-the-art methods. CDGAN outperforms the state-of-the-art methods on all three baseline image-to-image transformation datasets. The code is available at https://github.com/KishanKancharagunta/CDGAN.

11.
Video super-resolution aims at restoring the spatial resolution of the reference frame based on consecutive input low-resolution (LR) frames. Existing implicit alignment-based video super-resolution methods commonly use convolutional LSTM (ConvLSTM) to handle sequential input frames. However, vanilla ConvLSTM processes input features and hidden states independently and has limited ability to handle the inter-frame temporal redundancy in low-resolution fields. In this paper, we propose a multi-stage spatio-temporal adaptive network (MS-STAN). A spatio-temporal adaptive ConvLSTM (STAC) module is proposed to handle input features in low-resolution fields. The STAC module uses the correlation between input features and hidden states in the ConvLSTM unit and modulates the hidden states adaptively, conditioned on fused spatio-temporal features. A residual stacked bidirectional (RSB) architecture is further proposed to fully exploit the processing ability of the STAC unit. The STAC and RSB architecture improve the vanilla ConvLSTM’s ability to exploit inter-frame correlations, thus improving the reconstruction quality. Furthermore, unlike existing methods that aggregate features from the temporal branch only once at a specified stage of the network, the proposed network is organized in a multi-stage manner, so the temporal correlation in features at different stages can be fully exploited. Experimental results on the Vimeo-90K-T and UMD10 datasets show that the proposed method has comparable performance with current video super-resolution methods. The code is available at https://github.com/yhjoker/MS-STAN.

12.
Object detection across different scales is challenging because of the variance in object scales. Thus, a novel detection network, Top-Down Feature Fusion Single Shot MultiBox Detector (TDFSSD), is proposed. The proposed network is based on the Single Shot MultiBox Detector (SSD), using VGG-16 as the backbone, with a novel, simple yet efficient feature fusion module, namely the Top-Down Feature Fusion Module. The proposed module iteratively fuses higher-level features, which contain semantic information, into lower-level features, which contain boundary information. Extensive experiments have been conducted on the PASCAL VOC2007, PASCAL VOC2012, and MS COCO datasets to demonstrate the efficiency of the proposed method. The TDFSSD network is trained end to end and outperforms the state-of-the-art methods across the three datasets. It achieves 81.7% and 80.1% mAP on VOC2007 and VOC2012 respectively, which outperforms the best reported results of both one-stage and two-stage frameworks. It also achieves 33.4% mAP on MS COCO test-dev, including 17.2% average precision (AP) on small objects. These results show the efficiency of the proposed method for object detection. Code and model are available at: https://github.com/dongfengxijian/TDFSSD.

13.
14.
Recently, there has been a trend in tracking toward using a more refined segmentation mask instead of a coarse bounding box to represent the target object. Some trackers add segmentation branches to the tracking framework and maintain real-time speed. However, those trackers use a simple FCN structure and lack edge information modeling, which makes their performance quite unsatisfactory. In this paper, we propose an edge-aware segmentation network, which uses the complementarity between target information and edge information to provide a more refined representation of the target. First, we fuse the high-level features of the tracking backbone network with the correlation features of the classification branch of the tracking framework, and supervise them simultaneously with the target edge and the target segmentation mask to obtain an optimized high-level feature carrying rough edge information and target information. Second, we use the optimized high-level features to guide the low-level features of the tracking backbone network to generate more refined edge features. Finally, we fuse the refined edge features with the target features of each layer to generate the final mask. Our approach achieves leading performance on the recent pixel-wise object tracking benchmark VOT2020 and the segmentation datasets DAVIS2016 and DAVIS2017 while running at 47 fps. Code is available at https://github.com/TJUMMG/EATtracker.

15.
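To address the problems of current metric-learning-based few-shot methods, namely single-scale feature extraction, inaccurate class feature learning, and similarity computation that relies on standard metrics, this paper proposes a multi-level attention feature network. First, the image is rescaled to obtain images at multiple scales. Second, an image-level attention mechanism fuses the features extracted from the multi-scale images to obtain image-level attention features. On this basis, a class-level attention mechanism learns a class-level attention feature for each class. Finally, a network computes similarity scores between the sample features and the class-level attention feature of each class to predict the classification. The effectiveness of the multi-level attention feature network is verified on the Omniglot and MiniImageNet datasets. Experimental results show that, compared with single-scale image features and mean class prototypes, the multi-level attention feature network further improves classification accuracy under few-shot conditions.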
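
To make the contrast with a mean class prototype concrete, the sketch below builds a class feature as an attention-weighted sum of the support features of one class instead of a plain average. The scoring rule (dot product with the class mean followed by a softmax) and the names `softmax` and `attention_prototype` are assumptions for illustration; the abstract does not specify how the class-level attention is computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention_prototype(support_features):
    """Attention-weighted class feature for one class.

    support_features: list of feature vectors (lists of floats) from the
    support samples of the class. The weighting rule is a stand-in
    assumption, not the mechanism used in the paper.
    """
    count = len(support_features)
    dim = len(support_features[0])
    mean = [sum(f[d] for f in support_features) / count for d in range(dim)]
    # Score each support sample by its agreement with the class mean.
    scores = [sum(f[d] * mean[d] for d in range(dim)) for f in support_features]
    weights = softmax(scores)
    return [sum(w * f[d] for w, f in zip(weights, support_features))
            for d in range(dim)]


# Two typical samples and one outlier for the same class.
support = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]
prototype = attention_prototype(support)
print([round(value, 3) for value in prototype])
```

With these toy inputs the outlier receives a smaller weight than it would in a plain mean, so the class feature stays closer to the typical samples.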

16.
A colored point cloud (PC) inevitably suffers distortion during acquisition, processing, coding and transmission, which may affect its visual quality. Therefore, it is necessary to design an effective tool for colored PC quality assessment (PCQA). In this paper, considering the mapping relationship of perception between the colored PC and its corresponding projection images, we propose a novel PCQA method based on texture and geometry projection (denoted as TGP-PCQA). The main idea of the proposed TGP-PCQA method is to obtain texture and geometry projection maps from different perspectives for evaluating the colored PC. Specifically, 4D tensor decomposition is used to obtain the combination and difference information between the reference and distorted texture projection maps, mainly to characterize the texture distortion of the colored PC. Meanwhile, the edge features of the geometry projection map are calculated to measure the global or local geometry distortion. All of the extracted features are combined to predict the overall quality of the colored PC. In addition, this paper establishes a multi-distorted colored PC database named CPCD2.0 with compression distortions and Gaussian noise, which targets the influence of both geometry and texture components in distortion. Experimental results on two open subjective evaluation databases (IRPC and SJTU-PCQA) and the self-built CPCD2.0 database show that the proposed TGP-PCQA method outperforms the state-of-the-art PCQA methods. We also provide the self-built CPCD2.0 database free of charge at https://github.com/cherry0415/CPCD2.0.

17.
Semantic segmentation aims to map each pixel of an image to its corresponding semantic label. Most existing methods either concentrate mainly on high-level features or simply combine low-level and high-level features from backbone convolutional networks, which may weaken or even ignore the compensation between different levels. To take advantage of both shallow (textural) and deep (semantic) features effectively, this paper proposes a novel plug-and-play module, namely the feature enhancement module (FEM). The proposed FEM first uses an information extractor to extract the desired details or semantics from different stages, and then enhances the target features by taking in the extracted message. Two types of FEM, i.e., detail FEM and semantic FEM, can be customized. Concretely, the former strengthens textural information to protect key but tiny or low-contrast details from suppression or removal, while the latter highlights structural information to boost segmentation performance. A backbone network equipped with FEMs may contain two information flows, i.e., a detail flow and a semantic flow. Extensive experiments on the Cityscapes, ADE20K and PASCAL Context datasets are conducted to validate the effectiveness of our design. The code has been released at https://github.com/SuperZ-Liu/FENet.

18.
For fashion outfits to be considered aesthetically pleasing, the garments that constitute them need to be compatible in terms of visual aspects, such as style, category and color. Previous works have defined visual compatibility as a binary classification task with items in a garment being considered as fully compatible or fully incompatible. However, this is not applicable to Outfit Maker applications where users create their own outfits and need to know which specific items may be incompatible with the rest of the outfit. To address this, we propose the Visual InCompatibility TransfORmer (VICTOR), which is optimized for two tasks: 1) overall compatibility as regression and 2) the detection of mismatching items. We also use fashion-specific contrastive language-image pre-training to fine-tune computer vision neural networks on fashion imagery. We build upon the Polyvore outfit benchmark to generate partially mismatching outfits, creating a new dataset termed Polyvore-MISFITs, which is used to train VICTOR. A series of ablation and comparative analyses show that the proposed architecture can compete with and even surpass the current state-of-the-art on Polyvore datasets while reducing the instance-wise floating-point operations by 88%, striking a balance between high performance and efficiency. We release our code at https://github.com/stevejpapad/Visual-InCompatibility-Transformer

19.
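Label-specific feature learning avoids predicting all labels from the same features; it is a framework that extracts the features specific to each label for classification and is widely used in multi-label learning. However, when the label dimensionality is high or the label distribution density is imbalanced, existing multi-label learning algorithms based on label-specific features are generally time-consuming and have low classification accuracy. To improve multi-label classification performance, this paper proposes a group label-specific feature learning method based on a label-density classification margin (GLSFL-LDCM). First, cosine similarity is used to build a label correlation matrix, and spectral clustering groups the labels so that label-specific features are extracted per label group, which reduces the time needed to compute label-specific features for all labels. Then, the density of each label is computed to update the label space matrix; adding label density information to the original labels enlarges the margin between positive and negative labels, and this label-density classification margin approach effectively addresses the imbalance in label distribution density. Finally, the group label-specific features and the label density matrix are fed into an extreme learning machine to obtain the final classification model. Comparative experiments fully verify the feasibility and stability of the proposed algorithm.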
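
The first step of the method, building a label correlation matrix with cosine similarity, can be sketched as follows. The function name `label_correlation` and the toy label matrix are assumptions; the later steps (spectral clustering of the labels, the density update, and the extreme learning machine) are not shown.

```python
import math


def label_correlation(label_matrix):
    """Cosine similarity between every pair of label columns.

    label_matrix: list of samples, each a list of 0/1 label indicators.
    Returns a q x q matrix for q labels; a label that never occurs gets
    zero similarity to every label, including itself.
    """
    num_labels = len(label_matrix[0])
    columns = [[row[j] for row in label_matrix] for j in range(num_labels)]
    norms = [math.sqrt(sum(v * v for v in column)) for column in columns]
    corr = [[0.0] * num_labels for _ in range(num_labels)]
    for j in range(num_labels):
        for k in range(num_labels):
            if norms[j] > 0 and norms[k] > 0:
                dot = sum(a * b for a, b in zip(columns[j], columns[k]))
                corr[j][k] = dot / (norms[j] * norms[k])
    return corr


# Three samples, three labels: labels 0 and 1 always co-occur, label 2 never
# appears together with them.
labels = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
corr = label_correlation(labels)
for row in corr:
    print([round(value, 3) for value in row])
```

A matrix of this kind would then serve as the affinity input to a spectral clustering routine, which in this toy case would be expected to group labels 0 and 1 together.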

20.