首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Due to the storage and retrieval efficiency of hashing, as well as the highly discriminative feature extraction by deep neural networks, deep cross-modal hashing retrieval has been attracting increasing attention in recent years. However, most of existing deep cross-modal hashing methods simply employ single-label to directly measure the semantic relevance across different modalities, but neglect the potential contributions from multiple category labels. With the aim to improve the accuracy of cross-modal hashing retrieval by fully exploring the semantic relevance based on multiple labels of training data, in this paper, we propose a multi-label semantics preserving based deep cross-modal hashing (MLSPH) method. MLSPH firstly utilizes multi-labels of instances to calculate semantic similarity of the original data. Subsequently, a memory bank mechanism is introduced to preserve the multiple labels semantic similarity constraints and enforce the distinctiveness of learned hash representations over the whole training batch. Extensive experiments on several benchmark datasets reveal that the proposed MLSPH surpasses prominent baselines and reaches the state-of-the-art performance in the field of cross-modal hashing retrieval. Code is available at: https://github.com/SWU-CS-MediaLab/MLSPH.  相似文献   

2.
To overcome the barrier of storage and computation, the hashing technique has been widely used for nearest neighbor search in multimedia retrieval applications recently. Particularly, cross-modal retrieval that searches across different modalities becomes an active but challenging problem. Although numerous of cross-modal hashing algorithms are proposed to yield compact binary codes, exhaustive search is impractical for large-scale datasets, and Hamming distance computation suffers inaccurate results. In this paper, we propose a novel search method that utilizes a probability-based index scheme over binary hash codes in cross-modal retrieval. The proposed indexing scheme employs a few binary bits from the hash code as the index code. We construct an inverted index table based on the index codes, and train a neural network for ranking and indexing to improve the retrieval accuracy. Experiments are performed on two benchmark datasets for retrieval across image and text modalities, where hash codes are generated and compared with several state-of-the-art cross-modal hashing methods. Results show the proposed method effectively boosts the performance on search accuracy, computation cost, and memory consumption in these datasets and hashing methods. The source code is available on https://github.com/msarawut/HCI.  相似文献   

3.
To increase the richness of the extracted text modality feature information and deeply explore the semantic similarity between the modalities. In this paper, we propose a novel method, named adaptive weight multi-channel center similar deep hashing (AMCDH). The algorithm first utilizes three channels with different configurations to extract feature information from the text modality; and then adds them according to the learned weight ratio to increase the richness of the information. We also introduce the Jaccard coefficient to measure the semantic similarity level between modalities from 0 to 1, and utilize it as the penalty coefficient of the cross-entropy loss function to increase its role in backpropagation. Besides, we propose a method of constructing center similarity, which makes the hash codes of similar data pairs close to the same center point, and dissimilar data pairs are scattered at different center points to generate high-quality hash codes. Extensive experimental evaluations on four benchmark datasets show that the performance of our proposed model AMCDH is significantly better than other competing baselines. The code can be obtained from https://github.com/DaveLiu6/AMCDH.git.  相似文献   

4.
Retrieving 3D shapes with 2D images has become a popular research area nowadays, and a great deal of work has been devoted to reducing the discrepancy between 3D shapes and 2D images to improve retrieval performance. However, most approaches ignore the semantic information and decision boundaries of the two domains, and cannot achieve both domain alignment and category alignment in one module. In this paper, a novel Collaborative Distribution Alignment (CDA) model is developed to address the above existing challenges. Specifically, we first adopt a dual-stream CNN, following a similarity guided constraint module, to generate discriminative embeddings for input 2D images and 3D shapes (described as multiple views). Subsequently, we explicitly introduce a joint domain-class alignment module to dynamically learn a class-discriminative and domain-agnostic feature space, which can narrow the distance between 2D image and 3D shape instances of the same underlying category, while pushing apart the instances from different categories. Furthermore, we apply a decision boundary refinement module to avoid generating class-ambiguity embeddings by dynamically adjusting inconsistencies between two discriminators. Extensive experiments and evaluations on two challenging benchmarks, MI3DOR and MI3DOR-2, demonstrate the superiority of the proposed CDA method for 2D image-based 3D shape retrieval task.  相似文献   

5.
2D image-based 3D model retrieval has become a hotspot topic in recent years. However, the current existing methods are limited by two aspects. Firstly, they are mostly based on the supervised learning, which limits their application because of the high time and cost consuming of manual annotation. Secondly, the mainstream methods narrow the discrepancy between 2D and 3D domains mainly by the image-level alignment, which may bring the additional noise during the image transformation and influence cross-domain effect. Consequently, we propose a Wasserstein distance feature alignment learning (WDFAL) for this retrieval task. First of all, we describe 3D models through a series of virtual views and use CNNs to extract features. Secondly, we design a domain critic network based on the Wasserstein distance to narrow the discrepancy between two domains. Compared to the image-level alignment, we reduce the domain gap by the feature-level distribution alignment to avoid introducing additional noise. Finally, we extract the visual features from 2D and 3D domains, and calculate their similarity by utilizing Euclidean distance. The extensive experiments can validate the superiority of the WDFAL method.  相似文献   

6.
Deep network has become a new favorite for person re-identification (Re-ID), whose research focus is how to effectively extract the discriminative feature representation for pedestrians. In the paper, we propose a novel Re-ID network named as improved ReIDNet (iReIDNet), which can effectively extract the local and global multi-granular feature representations of pedestrians by a well-designed spatial feature transform and coordinate attention (SFTCA) mechanism together with improved global pooling (IGP) method. SFTCA utilizes channel adaptability and spatial location to infer a 2D attention map and can help iReIDNet to focus on the salient information contained in pedestrian images. IGP makes iReIDNet capture more effectively the global information of the whole human body. Besides, to boost the recognition accuracy, we develop a weighted joint loss to guide the training of iReIDNet. Comprehensive experiments demonstrate the availability and superiority of iReIDNet over other Re-ID methods. The code is available at https://github.com/XuRuyu66/ iReIDNet.  相似文献   

7.
Aerators are essential and crucial auxiliary devices in intensive culture, especially in industrial culture in China. In this paper, we propose a real-time expert system for anomaly detection of aerators based on computer vision technology and existing surveillance cameras. The expert system includes two modules, i.e., object region detection and working state detection. First, we present a small object region detection method based on the region proposal idea. Moreover, we propose a novel algorithm called reference frame Kanade-Lucas-Tomasi (RF-KLT) algorithm for motion feature extraction in fixed regions. Then, we describe a dimension reduction method of time series for establishing a feature dataset with obvious boundaries between classes. Finally, we use machine learning algorithms to build the feature classifier. The proposed expert system can realize real-time, robust and cost-free anomaly detection of aerators in both the actual video dataset and the augmented video dataset. Demo is available at https://youtu.be/xThHRwu_cnI.  相似文献   

8.
As the demand for realistic representation and its applications increases rapidly, 3D human modeling via a single RGB image has become the essential technique. Owing to the great success of deep neural networks, various learning-based approaches have been introduced for this task. However, partial occlusions still give the difficulty to accurately estimate the 3D human model. In this letter, we propose the part-attentive kinematic regressor for 3D human modeling. The key idea of the proposed method is to predict body part attentions based on each body center position and estimate parameters of the 3D human model via corresponding attentive features through the kinematic chain-based decoder in a one-stage fashion. One important advantage is that the proposed method has a good ability to yield natural shapes and poses even with severe occlusions. Experimental results on benchmark datasets show that the proposed method is effective for 3D human modeling under complicated real-world environments. The code and model are publicly available at: https://github.com/DCVL-3D/PKCN_release  相似文献   

9.
Recently, there has been a trend in tracking to use more refined segmentation mask instead of coarse bounding box to represent the target object. Some trackers proposed segmentation branches based on the tracking framework and maintain real-time speed. However, those trackers use a simple FCNs structure and lack of the edge information modeling. This makes performance quite unsatisfactory. In this paper, we propose an edge-aware segmentation network, which uses the complementarity between target information and edge information to provide a more refined representation of the target. Firstly, We use the high-level features of the tracking backbone network and the correlation features of the classification branch of the tracking framework to fuse, and use the target edge and target segmentation mask for simultaneous supervision to obtain an optimized high-level feature with rough edge information and target information. Secondly, we use the optimized high-level features to guide the low-level features of the tracking backbone network to generate more refined edge features. Finally, we use the refined edge features to fuse with the target features of each layer to generate the final mask. Our approach has achieved leading performance on recent pixel-wise object tracking benchmark VOT2020 and segmentation datasets DAVIS2016 and DAVIS2017 while running on 47 fps. Code is available at https://github.com/TJUMMG/EATtracker.  相似文献   

10.
The existing deraining methods based on convolutional neural networks (CNNs) have made great success, but some remaining rain streaks can degrade images drastically. In this work, we proposed an end-to-end multi-scale context information and attention network, called MSCIANet. The proposed network consists of multi-scale feature extraction (MSFE) and multi-receptive fields feature extraction (MRFFE). Firstly, the MSFE can pick up features of rain streaks in different scales and propagate deep features of the two layers across stages by skip connections. Secondly, the MRFFE can refine details of the background by attention mechanism and the depthwise separable convolution of different receptive fields with different scales. Finally, the fusion of these outputs of two subnetworks can reconstruct the clean background image. Extensive experimental results have shown that the proposed network achieves a good effect on the deraining task on synthetic and real-world datasets. The demo can be available at https://github.com/CoderLi365/MSCIANet.  相似文献   

11.
基于内容的图像检索一直是一个受关注的研究热点,这里利用图像的颜色和形状特征,将基于内容的图像检索应用于电子购物领域。提出先利用不变距与傅里叶描述子相结合的方法对图像形状特征进行检索,再利用改进的颜色直方图进行二次检索的检索方法。在检索前引入图像背景消除法消除图像中背景信息的影响。最后通过实验验证了基于颜色和形状特征的服装图像检索效果以及利用图像背景去除对检索效果的影响。  相似文献   

12.
In the task of skeleton-based action recognition, CNN-based methods represent the skeleton data as a pseudo image for processing. However, it still remains as a critical issue of how to construct the pseudo image to model the spatial dependencies of the skeletal data. To address this issue, we propose a novel convolutional neural network with adaptive inferential framework (AIF-CNN) to exploit the dependencies among the skeleton joints. We particularly investigate several initialization strategies to make the AIF effective with each strategy introducing the different prior knowledge. Extensive experiments on the dataset of NTU RGB+D and Kinetics-Skeleton demonstrate that the performance is improved significantly by integrating the different prior information. The source code is available at: https://github.com/hhe-distance/AIF-CNN.  相似文献   

13.
14.
基于形状的图像检索技术是基于内容的图像检索技术的一个重要组成部分。现有的形状特征检索技术主要集中在形状特征的提取及相似性度量、形状特征与颜色和纹理特征结合、形状特征与高层的语义特征结合的研究。在分析现有的基于形状的图像检索技术的一些关键技术的基础上,对基于小波-傅里叶特征(WFD)的形状检索方法进行了研究,并提出了一些改进算法。结合Matlab和ACCESS实现了一个基于形状的图像检索实验系统,建立了用户界面,选取与设计了4个图像测试集,使用检索性能评价方法对形状特征的检索结果进行了客观的评价。实验结果表明,利用本文所提出改进的形状特征进行检索取得了较好的检索效果。  相似文献   

15.
16.
In the field of security, faces are usually blurry, occluded, diverse pose and small in the image captured by an outdoor surveillance camera, which is affected by the external environment such as the camera pose and range, weather conditions, etc. It can be described as a problem of hard face detection in natural images. To solve this problem, we propose a deep convolutional neural network named feature hierarchy encoder–decoder network (FHEDN). It is motivated by two observations from contextual semantic information and the mechanism of multi-scale face detection. The proposed network is a scale-variant style architecture and single stage, which are composed of encoder and decoder subnetworks. Based on the assumption that contextual semantic information around face being auxiliary to detect faces, we introduce a residual mechanism to fuse context prior-based information into face feature and formulate the learning chain to train each encoder–decoder pair. In addition, we discuss some important factors in implement details such as the distribution of training dataset, the scale of feature hierarchy, and anchor box size, etc. They have some impact on the detection performance of the final network. Compared with some state-of-the-art algorithms, our method achieves promising performance on the popular benchmarks including AFW, PASCAL FACE, FDDB, and WIDER FACE. Consequently, the proposed approach can be efficiently implemented and routinely applied to detect faces with severe occlusion and arbitrary pose variations in unconstrained scenes. Our code and results are available on https://github.com/zzxcoder/EvaluationFHEDN.  相似文献   

17.
Recently, vision transformer has gained a breakthrough in image recognition. Its self-attention mechanism (MSA) can extract discriminative tokens information from different patches to improve image classification accuracy. However, the classification token in its deep layer ignore the local features between layers. In addition, the patch embedding layer feeds fixed-size patches into the network, which inevitably introduces additional image noise. Therefore, we propose a hierarchical attention vision transformer (HAVT) based on the transformer framework. We present a data augmentation method for attention cropping to crop and drop image noise and force the network to learn key features. Second, the hierarchical attention selection (HAS) module is proposed, which improves the network's ability to learn discriminative tokens between layers by filtering and fusing tokens between layers. Experimental results show that the proposed HAVT outperforms state-of-the-art approaches and significantly improves the accuracy to 91.8% and 91.0% on CUB-200–2011 and Stanford Dogs, respectively. We have released our source code on GitHub https://github.com/OhJackHu/HAVT.git.  相似文献   

18.
Knowledge distillation has become a key technique for making smart and light-weight networks through model compression and transfer learning. Unlike previous methods that applied knowledge distillation to the classification task, we propose to exploit the decomposition-and-replacement based distillation scheme for depth estimation from a single RGB color image. To do this, Laplacian pyramid-based knowledge distillation is firstly presented in this paper. The key idea of the proposed method is to transfer the rich knowledge of the scene depth, which is well encoded through the teacher network, to the student network in a structured way by decomposing it into the global context and local details. This is fairly desirable for the student network to restore the depth layout more accurately with limited resources. Moreover, we also propose a new guidance concept for knowledge distillation, so-called ReplaceBlock, which replaces blocks randomly selected in the decoded feature of the student network with those of the teacher network. Our ReplaceBlock gives a smoothing effect in learning the feature distribution of the teacher network by considering the spatial contiguity in the feature space. This process is also helpful to clearly restore the depth layout without the significant computational cost. Based on various experimental results on benchmark datasets, the effectiveness of our distillation scheme for monocular depth estimation is demonstrated in details. The code and model are publicly available at : https://github.com/tjqansthd/Lap_Rep_KD_Depth.  相似文献   

19.
In the field of weakly supervised semantic segmentation (WSSS), Class Activation Maps (CAM) are typically adopted to generate pseudo masks. Yet, we find that the crux of the unsatisfactory pseudo masks is the incomplete CAM. Specifically, as convolutional neural networks tend to be dominated by the specific regions in the high-confidence channels of feature maps during prediction, the extracted CAM contains only parts of the object. To address this issue, we propose the Disturbed CAM (DCAM), a simple yet effective method for WSSS. Following CAM, we adopt a binary cross-entropy (BCE) loss to train a multi-label classification model. Then, we disturb the feature map with retraining to enhance the high-confidence channels. In addition, a softmax cross-entropy (SCE) loss branch is employed to increase the model attention to the target classes. Once converged, we extract DCAM in the same way as in CAM. The evaluation on both PASCAL VOC and MS COCO shows that DCAM not only generates high-quality masks (6.2% and 1.4% higher than the benchmark models), but also enables more accurate activation in object regions. The code is available at https://github.com/gyyang23/DCAM.  相似文献   

20.
张天  靳聪  帖云  李小兵 《信号处理》2020,36(6):966-976
跨模态检索旨在通过以某一模态的数据为查询词,使人们能够得到与之相关的其他不同模态数据的检索结果的新型检索方法,这已成为多媒体和信息检索领域中一个有趣的研究问题。但是,目前大多数的研究成果集中于文本到图像、文本到视频以及歌词到音频等跨模态相关任务上,而关于如何为特定的视频通过跨模态检索得到合适的音乐这一跨模态的相关研究却很有限。此外,大多现有的关于视频和音频跨模态的研究依赖于元数据(例如关键字,标签或描述)。本文介绍了一种基于音频和视频这两种模态数据内容的跨模态检索的方法,该方法以新型的双流处理网络为框架,并通过神经网络学习两模态数据在公共子空间的特征表达,以计算音频和视频数据之间的相似度。本文所提出的方法的创新点主要在以下三个方面:1)在原有的提取各模态特征的模型基础上引入注意力机制,以此得到了视频和音频的特征选择模型,并筛选出相应的特征表达。2)使用了样本挖掘机制,剔除了无效样本,使得数据的训练更加高效。3)从计算模态间相似性和保持模态内结构不变两方面出发,设计了相应的损失函数进行模型的训练。且所提出的模型在VEGAS数据集和自建数据集上都取得了较高的准确度。   相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号