Similar literature
1.
Due to the storage and retrieval efficiency of hashing, as well as the highly discriminative feature extraction of deep neural networks, deep cross-modal hashing retrieval has attracted increasing attention in recent years. However, most existing deep cross-modal hashing methods use a single label to measure the semantic relevance across modalities and neglect the potential contribution of multiple category labels. To improve the accuracy of cross-modal hashing retrieval by fully exploring the semantic relevance carried by the multiple labels of the training data, we propose a multi-label semantics preserving deep cross-modal hashing (MLSPH) method. MLSPH first uses the multiple labels of instances to calculate the semantic similarity of the original data. A memory bank mechanism is then introduced to preserve the multi-label semantic similarity constraints and enforce the distinctiveness of the learned hash representations over the whole training batch. Extensive experiments on several benchmark datasets show that the proposed MLSPH surpasses prominent baselines and reaches state-of-the-art performance in cross-modal hashing retrieval. Code is available at: https://github.com/SWU-CS-MediaLab/MLSPH.
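
The difference between single-label and multi-label relevance can be illustrated with a small sketch. The abstract does not state the similarity formula, so the cosine similarity between multi-hot label vectors used here is an assumption, and the function and variable names are hypothetical.

```python
import math

def label_similarity(a, b):
    """Cosine similarity between two multi-hot label vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical label vectors over four categories.
image_labels = [1, 1, 0, 1]
text_labels = [1, 0, 0, 1]
unrelated = [0, 0, 1, 0]

# Graded relevance: grows with the number of shared labels.
graded = label_similarity(image_labels, text_labels)

# Binary relevance: 1 if any label is shared, regardless of how many.
binary = 1.0 if any(x and y for x, y in zip(image_labels, text_labels)) else 0.0
```

With a binary measure the pair above is simply "relevant"; the graded value also records that one of the three image labels is missing from the text.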

2.
To overcome the barriers of storage and computation, hashing has recently been widely used for nearest neighbor search in multimedia retrieval applications. In particular, cross-modal retrieval, which searches across different modalities, has become an active but challenging problem. Although numerous cross-modal hashing algorithms have been proposed to yield compact binary codes, exhaustive search is impractical for large-scale datasets, and Hamming distance computation can give inaccurate results. In this paper, we propose a novel search method that uses a probability-based index scheme over binary hash codes in cross-modal retrieval. The proposed indexing scheme employs a few binary bits from the hash code as the index code. We construct an inverted index table based on the index codes, and train a neural network for ranking and indexing to improve the retrieval accuracy. Experiments are performed on two benchmark datasets for retrieval across image and text modalities, where hash codes are generated by several state-of-the-art cross-modal hashing methods and compared. Results show that the proposed method effectively improves search accuracy, computation cost, and memory consumption across these datasets and hashing methods. The source code is available at https://github.com/msarawut/HCI.
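
The inverted index over index codes can be sketched as follows. This is only an illustration under stated assumptions: the index code is taken to be the first `n_bits` of the hash code, and candidates are ranked by plain Hamming distance instead of the neural ranking network the abstract describes. All names are hypothetical.

```python
from collections import defaultdict

def index_code(code, n_bits):
    """Assumed index code: the first n_bits of the binary hash code."""
    return code[:n_bits]

def build_inverted_index(codes, n_bits):
    """Map each index code to the ids of the items that share it."""
    table = defaultdict(list)
    for item_id, code in enumerate(codes):
        table[index_code(code, n_bits)].append(item_id)
    return table

def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def search(query, codes, table, n_bits, top_k=2):
    """Look up one bucket, then rank only its candidates by Hamming distance."""
    candidates = table.get(index_code(query, n_bits), [])
    return sorted(candidates, key=lambda i: hamming(query, codes[i]))[:top_k]

codes = ["00101100", "00111100", "11000011", "00100000"]
table = build_inverted_index(codes, n_bits=2)
result = search("00101101", codes, table, n_bits=2)
```

Only the items in the query's bucket are compared, which is what avoids the exhaustive scan over all hash codes.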

3.
4.
In the field of security, faces captured by outdoor surveillance cameras are usually blurry, occluded, small, and varied in pose, owing to external factors such as camera pose and range and weather conditions. This can be described as a problem of hard face detection in natural images. To solve this problem, we propose a deep convolutional neural network named feature hierarchy encoder–decoder network (FHEDN). It is motivated by two observations, from contextual semantic information and from the mechanism of multi-scale face detection. The proposed network is a single-stage, scale-variant architecture composed of encoder and decoder subnetworks. Assuming that contextual semantic information around a face helps to detect it, we introduce a residual mechanism to fuse context prior-based information into the face features and formulate the learning chain to train each encoder–decoder pair. In addition, we discuss some important implementation details, such as the distribution of the training dataset, the scale of the feature hierarchy, and the anchor box size, which have some impact on the detection performance of the final network. Compared with some state-of-the-art algorithms, our method achieves promising performance on popular benchmarks including AFW, PASCAL FACE, FDDB, and WIDER FACE. Consequently, the proposed approach can be efficiently implemented and routinely applied to detect faces with severe occlusion and arbitrary pose variations in unconstrained scenes. Our code and results are available at https://github.com/zzxcoder/EvaluationFHEDN.

5.
Semantic segmentation aims to map each pixel of an image to its corresponding semantic label. Most existing methods either concentrate mainly on high-level features or simply combine low-level and high-level features from backbone convolutional networks, which may weaken or even ignore the complementarity between different levels. To take advantage of both shallow (textural) and deep (semantic) features, this paper proposes a novel plug-and-play module, the feature enhancement module (FEM). The proposed FEM first uses an information extractor to extract the desired details or semantics from different stages, and then enhances the target features by taking in the extracted information. Two types of FEM, i.e., detail FEM and semantic FEM, can be customized. Concretely, the former strengthens textural information to protect key but tiny or low-contrast details from suppression or removal, while the latter highlights structural information to boost segmentation performance. A backbone network equipped with FEMs may contain two information flows, i.e., a detail flow and a semantic flow. Extensive experiments on the Cityscapes, ADE20K and PASCAL Context datasets are conducted to validate the effectiveness of our design. The code has been released at https://github.com/SuperZ-Liu/FENet.

6.
In the field of weakly supervised semantic segmentation (WSSS), Class Activation Maps (CAM) are typically adopted to generate pseudo masks. Yet, we find that the crux of unsatisfactory pseudo masks is an incomplete CAM. Specifically, because convolutional neural networks tend to be dominated by specific regions in the high-confidence channels of the feature maps during prediction, the extracted CAM covers only parts of the object. To address this issue, we propose the Disturbed CAM (DCAM), a simple yet effective method for WSSS. Following CAM, we adopt a binary cross-entropy (BCE) loss to train a multi-label classification model. Then, we disturb the feature map with retraining to enhance the high-confidence channels. In addition, a softmax cross-entropy (SCE) loss branch is employed to increase the model's attention to the target classes. Once the model has converged, we extract DCAM in the same way as CAM. The evaluation on both PASCAL VOC and MS COCO shows that DCAM not only generates high-quality masks (6.2% and 1.4% higher than the benchmark models), but also enables more accurate activation in object regions. The code is available at https://github.com/gyyang23/DCAM.
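
As background for the abstract above, a standard CAM is the sum of the final feature-map channels weighted by the classifier weights of one class. The sketch below shows only that baseline computation, not the disturbance or retraining steps of DCAM; the ReLU and peak normalisation are common practice rather than something the abstract states, and all names are hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of C feature-map channels (each an H x W list of lists)."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for channel, weight in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * channel[i][j]
    # Keep positive evidence only, then scale so the peak is 1 (assumed post-processing).
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    return [[v / peak if peak else 0.0 for v in row] for row in cam]

# Two hypothetical 2 x 2 channels, each active in a different corner.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
cam = class_activation_map(feature_maps, class_weights=[1.0, 0.5])
```

Because each channel contributes in proportion to its weight, a few high-weight channels can dominate the map, which is the incompleteness the abstract refers to.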

7.
Deep networks have become a popular choice for person re-identification (Re-ID), where the research focus is how to effectively extract discriminative feature representations of pedestrians. In this paper, we propose a novel Re-ID network named improved ReIDNet (iReIDNet), which effectively extracts local and global multi-granular feature representations of pedestrians through a well-designed spatial feature transform and coordinate attention (SFTCA) mechanism together with an improved global pooling (IGP) method. SFTCA uses channel adaptability and spatial location to infer a 2D attention map and helps iReIDNet focus on the salient information contained in pedestrian images. IGP enables iReIDNet to capture the global information of the whole human body more effectively. In addition, to boost recognition accuracy, we develop a weighted joint loss to guide the training of iReIDNet. Comprehensive experiments demonstrate the effectiveness and superiority of iReIDNet over other Re-ID methods. The code is available at https://github.com/XuRuyu66/iReIDNet.

8.
9.
Keypoint-based object detection achieves better performance without positioning calculations and extensive prediction. However, such detectors have heavy backbones, and they restore high resolution by upsampling, which yields unreliable features. We propose a self-constrained parallelism keypoint-based lightweight object detection network (SCPNet), which speeds up inference, reduces parameters, widens receptive fields, and makes prediction accurate. Specifically, the parallel multi-scale fusion module (PMFM) with parallel shuffle blocks (PSB) adopts a parallel structure to obtain reliable features and reduce depth, and adopts repeated multi-scale fusion to avoid too many parallel branches. The self-constrained detection module (SCDM) has a two-branch structure: one branch predicts corners and employs entad offset to match high-quality corner pairs, and the other branch predicts center keypoints. The distances between the paired corners’ geometric centers and the center keypoints are used for self-constrained detection. On MS-COCO 2017 and PASCAL VOC, SCPNet’s results are competitive with state-of-the-art lightweight object detectors. https://github.com/mengdie-wang/SCPNet.git.
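
The self-constraint between corner pairs and center keypoints can be sketched as a distance check. The abstract does not say how the distances are used, so the rule below (keep a corner pair only if some predicted center keypoint lies within `max_dist` of the pair's geometric center) and the threshold value are assumptions; the names are hypothetical.

```python
import math

def geometric_center(top_left, bottom_right):
    """Midpoint of the box spanned by a top-left and a bottom-right corner."""
    return ((top_left[0] + bottom_right[0]) / 2,
            (top_left[1] + bottom_right[1]) / 2)

def passes_self_constraint(top_left, bottom_right, center_keypoints, max_dist):
    """True if any predicted center keypoint is close to the pair's geometric center."""
    cx, cy = geometric_center(top_left, bottom_right)
    return any(math.hypot(cx - kx, cy - ky) <= max_dist
               for kx, ky in center_keypoints)

# Hypothetical corner pair whose geometric center is (20, 30).
pair = ((10, 10), (30, 50))
kept = passes_self_constraint(*pair, center_keypoints=[(21, 31)], max_dist=3.0)
dropped = passes_self_constraint(*pair, center_keypoints=[(40, 40)], max_dist=3.0)
```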

10.
Aerators are essential auxiliary devices in intensive culture, especially in industrial culture in China. In this paper, we propose a real-time expert system for anomaly detection of aerators based on computer vision technology and existing surveillance cameras. The expert system includes two modules, i.e., object region detection and working state detection. First, we present a small object region detection method based on the region proposal idea. Next, we propose a novel algorithm, the reference frame Kanade-Lucas-Tomasi (RF-KLT) algorithm, for motion feature extraction in fixed regions. Then, we describe a dimension reduction method for time series that establishes a feature dataset with clear boundaries between classes. Finally, we use machine learning algorithms to build the feature classifier. The proposed expert system achieves real-time, robust and cost-free anomaly detection of aerators on both the actual video dataset and the augmented video dataset. A demo is available at https://youtu.be/xThHRwu_cnI.

11.
Recently, there has been a trend in tracking toward using a more refined segmentation mask instead of a coarse bounding box to represent the target object. Some trackers add segmentation branches to the tracking framework while maintaining real-time speed. However, those trackers use a simple FCN structure and lack edge information modeling, which makes their performance quite unsatisfactory. In this paper, we propose an edge-aware segmentation network, which uses the complementarity between target information and edge information to provide a more refined representation of the target. First, we fuse the high-level features of the tracking backbone network with the correlation features of the classification branch of the tracking framework, and supervise them simultaneously with the target edge and the target segmentation mask to obtain an optimized high-level feature with rough edge information and target information. Second, we use the optimized high-level features to guide the low-level features of the tracking backbone network to generate more refined edge features. Finally, we fuse the refined edge features with the target features of each layer to generate the final mask. Our approach achieves leading performance on the recent pixel-wise object tracking benchmark VOT2020 and the segmentation datasets DAVIS2016 and DAVIS2017 while running at 47 fps. Code is available at https://github.com/TJUMMG/EATtracker.

12.
Object detection across different scales is challenging because of the variance in object scales. Thus, a novel detection network, Top-Down Feature Fusion Single Shot MultiBox Detector (TDFSSD), is proposed. The proposed network is based on the Single Shot MultiBox Detector (SSD) with VGG-16 as the backbone and adds a novel, simple yet efficient feature fusion module, the Top-Down Feature Fusion Module. The proposed module iteratively fuses higher-level features, which contain semantic information, into lower-level features, which contain boundary information. Extensive experiments have been conducted on the PASCAL VOC2007, PASCAL VOC2012, and MS COCO datasets to demonstrate the efficiency of the proposed method. The proposed TDFSSD network is trained end to end and outperforms the state-of-the-art methods across the three datasets. It achieves 81.7% and 80.1% mAP on VOC2007 and VOC2012 respectively, which outperforms the best reported results of both one-stage and two-stage frameworks. It also achieves 33.4% mAP on MS COCO test-dev, including 17.2% average precision (AP) on small objects. These results show the efficiency of the proposed method for object detection. Code and model are available at: https://github.com/dongfengxijian/TDFSSD.
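
The iterative top-down fusion can be sketched on toy feature maps. The abstract does not give the fusion operator, so nearest-neighbour 2x upsampling followed by element-wise addition is an assumption here, as are the function names; real feature maps would also have a channel dimension.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of an H x W list of lists."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add(a, b):
    """Element-wise sum of two equally sized maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def top_down_fuse(pyramid):
    """pyramid runs from the lowest level (largest map) to the highest (smallest).

    Starting at the top, each fused map is upsampled and added to the next
    lower level, so semantic information flows down step by step.
    """
    fused = [pyramid[-1]]
    for lower in reversed(pyramid[:-1]):
        fused.append(add(lower, upsample2x(fused[-1])))
    return fused[::-1]

# Hypothetical single-channel pyramid: 4 x 4, 2 x 2 and 1 x 1 maps.
pyramid = [
    [[1] * 4 for _ in range(4)],
    [[10] * 2 for _ in range(2)],
    [[100]],
]
fused = top_down_fuse(pyramid)
```

In this toy case the lowest level ends up holding 111 everywhere, i.e. its own value plus the contributions passed down from both higher levels.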

13.
Infrared dim and small target detection is a key technology for space-based infrared search and tracking systems. Traditional detection methods have a high false alarm rate and fail to handle complex backgrounds and high-noise scenarios. They also cannot effectively detect targets at a small scale. In this paper, a U-Transformer method is proposed, which introduces a transformer into infrared dim and small target detection. First, a U-shaped network is constructed. In the encoder, the self-attention mechanism is used for infrared dim and small target feature extraction, which helps to solve the problem of deep networks losing dim and small target features. Meanwhile, by using the encoding and decoding structure, infrared dim and small target features are filtered from the complex background while the shallow features and semantic information of the target are retained. Experiments show that anchor-free detection and transformers have great potential for infrared dim and small target detection. On datasets with complex backgrounds, our method outperforms state-of-the-art detectors and meets the real-time requirement. The code is publicly available at https://github.com/Linaom1214/U-Transformer.

14.
In skeleton-based action recognition, CNN-based methods represent the skeleton data as a pseudo image for processing. However, how to construct the pseudo image so that it models the spatial dependencies of the skeletal data remains a critical issue. To address this issue, we propose a novel convolutional neural network with an adaptive inferential framework (AIF-CNN) to exploit the dependencies among the skeleton joints. In particular, we investigate several initialization strategies to make the AIF effective, with each strategy introducing different prior knowledge. Extensive experiments on the NTU RGB+D and Kinetics-Skeleton datasets demonstrate that performance improves significantly when the different prior information is integrated. The source code is available at: https://github.com/hhe-distance/AIF-CNN.

15.
Knowledge distillation has become a key technique for building smart and lightweight networks through model compression and transfer learning. Unlike previous methods that applied knowledge distillation to the classification task, we propose to exploit a decomposition-and-replacement based distillation scheme for depth estimation from a single RGB color image. To do this, Laplacian pyramid-based knowledge distillation is first presented in this paper. The key idea of the proposed method is to transfer the rich knowledge of the scene depth, which is well encoded by the teacher network, to the student network in a structured way by decomposing it into the global context and local details. This helps the student network restore the depth layout more accurately with limited resources. Moreover, we propose a new guidance concept for knowledge distillation, called ReplaceBlock, which replaces randomly selected blocks in the decoded feature of the student network with those of the teacher network. ReplaceBlock has a smoothing effect on learning the feature distribution of the teacher network by considering the spatial contiguity in the feature space. This process also helps to restore the depth layout clearly without significant computational cost. Based on various experimental results on benchmark datasets, the effectiveness of our distillation scheme for monocular depth estimation is demonstrated in detail. The code and model are publicly available at: https://github.com/tjqansthd/Lap_Rep_KD_Depth.
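
The split into global context and local details can be illustrated with a one-level Laplacian-style decomposition. For brevity this sketch works on a single row of depth values and uses pair averaging and value repetition in place of the Gaussian filtering of a true Laplacian pyramid; those simplifications and all names are assumptions, not the authors' implementation.

```python
def downsample(signal):
    """Halve the resolution by averaging neighbouring pairs (even length assumed)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]

def upsample(signal):
    """Double the resolution by repeating each value."""
    return [v for v in signal for _ in range(2)]

def laplacian_split(signal):
    """Return (coarse, detail): the global context and the local residual."""
    coarse = downsample(signal)
    detail = [s - u for s, u in zip(signal, upsample(coarse))]
    return coarse, detail

def reconstruct(coarse, detail):
    """Invert the split: upsampled context plus residual detail."""
    return [u + d for u, d in zip(upsample(coarse), detail)]

depth_row = [2.0, 4.0, 4.0, 8.0]  # hypothetical depth values
coarse, detail = laplacian_split(depth_row)
restored = reconstruct(coarse, detail)
```

The coarse part carries the overall depth layout and the detail part carries the residual structure, so each can be supervised separately while their sum still recovers the original signal.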

16.
Multi-label classification with region-free labels is attracting increasing attention compared with classification using region-based labels, because manual region labeling is time-consuming. Existing methods usually employ attention-based techniques to discover the conspicuous label-related regions in a weakly supervised manner with only image-level region-free labels, but the region coverage is imprecise because global clues from multi-level features are not explored. To address this issue, a novel Global-guided Weakly-Supervised Learning (GWSL) method for multi-label classification is proposed. GWSL first extracts multi-level features to estimate their global correlation map, which is then used to guide feature disentanglement in the proposed Feature Disentanglement and Localization (FDL) networks. Specifically, the FDL networks adaptively combine the different correlated features and localize the fine-grained features for identifying multiple labels. The proposed method is optimized in an end-to-end manner under weak supervision with only image-level labels. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods for multi-label learning problems on several publicly available image datasets. To facilitate similar research in the future, the code is available online at https://github.com/Yong-DAI/GWSL.

17.
Existing deraining methods based on convolutional neural networks (CNNs) have achieved great success, but remaining rain streaks can still degrade images drastically. In this work, we propose an end-to-end multi-scale context information and attention network, called MSCIANet. The proposed network consists of multi-scale feature extraction (MSFE) and multi-receptive fields feature extraction (MRFFE). First, the MSFE picks up features of rain streaks at different scales and propagates deep features of the two layers across stages through skip connections. Second, the MRFFE refines details of the background using an attention mechanism and depthwise separable convolutions with receptive fields of different scales. Finally, fusing the outputs of the two subnetworks reconstructs the clean background image. Extensive experimental results show that the proposed network performs well on the deraining task on synthetic and real-world datasets. A demo is available at https://github.com/CoderLi365/MSCIANet.

18.
19.
As the demand for realistic representation and its applications increases rapidly, 3D human modeling from a single RGB image has become an essential technique. Owing to the great success of deep neural networks, various learning-based approaches have been introduced for this task. However, partial occlusions still make it difficult to accurately estimate the 3D human model. In this letter, we propose the part-attentive kinematic regressor for 3D human modeling. The key idea of the proposed method is to predict body part attentions based on each body center position and to estimate the parameters of the 3D human model from the corresponding attentive features through a kinematic chain-based decoder in a one-stage fashion. One important advantage is that the proposed method yields natural shapes and poses even under severe occlusions. Experimental results on benchmark datasets show that the proposed method is effective for 3D human modeling in complicated real-world environments. The code and model are publicly available at: https://github.com/DCVL-3D/PKCN_release

20.
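Most multimodal emotion recognition methods seek an effective fusion mechanism that builds features from heterogeneous modalities in order to learn feature representations with semantic consistency. However, these methods usually ignore the differences in emotional semantics between modalities. To address this problem, a multi-task learning framework is proposed that jointly trains one multimodal task and three unimodal tasks, which learn, respectively, the emotional semantic consistency among multimodal features and the emotional semantic differences contained in each modality. First, to learn the emotional semantic consistency information, a temporal attention mechanism (TAM) based on a multi-layer recurrent neural network is proposed, which describes the contribution of emotional features by assigning different weights to the feature vectors of a time series. Then, for multimodal fusion, fine-grained feature fusion is performed per semantic dimension in the semantic space. Next, to effectively learn the emotional semantic differences contained in each modality, a self-supervised unimodal label automatic generation strategy (ULAG) based on the similarity of feature vectors between modalities is proposed. Extensive experiments on three datasets, CMU-MOSI, CMU-MOSEI, and CH-SIMS, confirm that the proposed TAM-ULAG model is highly competitive: it improves on the baseline models in both the classification metrics ($\mathrm{Acc}_2$, $F_1$) and the regression metrics (MAE, Corr); the binary classification accuracy is 87.2% and 85.8% on the CMU-MOSI and CMU-MOSEI datasets respectively, and reaches 81.47% on the CH-SIMS dataset. These results show that simultaneously learning the emotional semantic consistency among modalities and the emotional semantic differences of each modality helps improve the performance of self-supervised multimodal emotion recognition methods.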
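
The idea of weighting time-step feature vectors by their contribution can be sketched as softmax attention pooling. The abstract does not describe how TAM computes its scores, so the dot product with a query vector used here is an assumption, the input features stand in for recurrent network outputs, and all names are hypothetical.

```python
import math

def temporal_attention(features, query):
    """Return (weights, pooled) for a sequence of per-time-step feature vectors."""
    # One relevance score per time step (assumed scoring: dot product with a query).
    scores = [sum(f * q for f, q in zip(step, query)) for step in features]
    # Numerically stable softmax over time.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the feature vectors.
    dim = len(features[0])
    pooled = [sum(w * step[d] for w, step in zip(weights, features))
              for d in range(dim)]
    return weights, pooled

# Three hypothetical time steps with two-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
weights, pooled = temporal_attention(features, query=[1.0, 0.0])
```

Time steps whose features align with the query receive larger weights, so they contribute more to the pooled representation.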
