Similar literature
Found 20 similar documents (search time: 218 ms)
1.
In the field of weakly supervised semantic segmentation (WSSS), Class Activation Maps (CAM) are typically adopted to generate pseudo masks. Yet, we find that the crux of the unsatisfactory pseudo masks is the incomplete CAM. Specifically, as convolutional neural networks tend to be dominated by specific regions in the high-confidence channels of feature maps during prediction, the extracted CAM contains only parts of the object. To address this issue, we propose the Disturbed CAM (DCAM), a simple yet effective method for WSSS. Following CAM, we adopt a binary cross-entropy (BCE) loss to train a multi-label classification model. Then, we disturb the feature map with retraining to enhance the high-confidence channels. In addition, a softmax cross-entropy (SCE) loss branch is employed to increase the model's attention to the target classes. Once converged, we extract DCAM in the same way as in CAM. The evaluation on both PASCAL VOC and MS COCO shows that DCAM not only generates high-quality masks (6.2% and 1.4% higher than the benchmark models), but also enables more accurate activation in object regions. The code is available at https://github.com/gyyang23/DCAM.
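As a rough illustration of the baseline this abstract builds on, the sketch below computes a plain class activation map and thresholds it into a pseudo mask. It is standard CAM, not DCAM: the disturbance and SCE branch are not reproduced. The function names, the ReLU-plus-peak normalisation, and the 0.5 threshold are assumptions made for this example.

```python
def class_activation_map(features, weights):
    """Plain CAM: weight each feature channel by the classifier weight of one class.

    features: C x H x W nested lists (last convolutional feature map).
    weights:  length-C list of classifier weights for the target class.
    """
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for channel, weight in zip(features, weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * channel[y][x]
    # Assumed post-processing: clip negatives, then scale so the peak is 1.
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


def pseudo_mask(cam, threshold=0.5):
    """Binarise a normalised CAM; the threshold value is hypothetical."""
    return [[1 if v >= threshold else 0 for v in row] for row in cam]


features = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 2.0], [0.0, 0.0]]]
cam = class_activation_map(features, weights=[1.0, 0.5])
mask = pseudo_mask(cam)
```

When only a few channels carry large weights, the map lights up only the regions those channels respond to, which is the incompleteness the abstract describes.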

2.
Recently, the vision transformer has achieved a breakthrough in image recognition. Its multi-head self-attention mechanism (MSA) can extract discriminative token information from different patches to improve image classification accuracy. However, the classification token in its deep layers ignores the local features between layers. In addition, the patch embedding layer feeds fixed-size patches into the network, which inevitably introduces additional image noise. Therefore, we propose a hierarchical attention vision transformer (HAVT) based on the transformer framework. First, we present an attention-cropping data augmentation method that crops and drops image noise and forces the network to learn key features. Second, the hierarchical attention selection (HAS) module is proposed, which improves the network's ability to learn discriminative tokens between layers by filtering and fusing tokens between layers. Experimental results show that the proposed HAVT outperforms state-of-the-art approaches and significantly improves the accuracy to 91.8% and 91.0% on CUB-200-2011 and Stanford Dogs, respectively. We have released our source code on GitHub: https://github.com/OhJackHu/HAVT.git.

3.
Multi-label classification with region-free labels is attracting increasing attention compared to that with region-based labels due to the time-consuming manual region-labeling process. Existing methods usually employ attention-based technology to discover the conspicuous label-related regions in a weakly supervised manner with only image-level region-free labels, but the region coverage is not precise without exploring global clues from multi-level features. To address this issue, a novel Global-guided Weakly-Supervised Learning (GWSL) method for multi-label classification is proposed. GWSL first extracts the multi-level features to estimate their global correlation map, which is then used to guide feature disentanglement in the proposed Feature Disentanglement and Localization (FDL) networks. Specifically, the FDL networks adaptively combine the different correlated features and localize the fine-grained features for identifying multiple labels. The proposed method is optimized in an end-to-end manner under weak supervision with only image-level labels. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods for multi-label learning problems on several publicly available image datasets. To facilitate similar research in the future, the code is available online at https://github.com/Yong-DAI/GWSL.

4.
We develop a full-reference (FR) video quality assessment framework that integrates analysis of space–time slices (STSs) with frame-based image quality measurement (IQA) to form a high-performance video quality predictor. The approach first arranges the reference and test video sequences into a space–time slice representation. To more comprehensively characterize space–time distortions, a collection of distortion-aware maps is computed on each reference–test video pair. These reference-distorted maps are then processed using a standard image quality model, such as peak signal-to-noise ratio (PSNR) or Structural Similarity (SSIM). A simple learned pooling strategy is used to combine the multiple IQA outputs to generate a final video quality score. This leads to an algorithm called Space–Time Slice PSNR (STS-PSNR), which we thoroughly tested on three publicly available video quality assessment databases and found to deliver significantly elevated performance relative to state-of-the-art video quality models. Source code for STS-PSNR is freely available at: http://live.ece.utexas.edu/research/Quality/STS-PSNR_release.zip.
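A minimal sketch of the two basic ingredients named above: cutting a space–time slice out of a video and scoring a reference/test slice pair with PSNR. The distortion-aware maps and the learned pooling of STS-PSNR are not reproduced here. The function names and the choice of a horizontal slice (one image row followed over time) are assumptions for illustration.

```python
import math


def horizontal_slice(video, row):
    """Return the space-time slice (time x width) taken at one image row.

    video: T x H x W nested lists of pixel intensities.
    """
    return [frame[row] for frame in video]


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    ref = [p for line in reference for p in line]
    tst = [p for line in test for p in line]
    if not ref or len(ref) != len(tst):
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - t) ** 2 for r, t in zip(ref, tst)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Two frames of a 2 x 2 video; only row 0 of the test video is distorted.
reference_video = [[[0, 0], [10, 10]], [[0, 0], [10, 10]]]
test_video = [[[255, 255], [10, 10]], [[255, 255], [10, 10]]]
score_row0 = psnr(horizontal_slice(reference_video, 0),
                  horizontal_slice(test_video, 0))
```

Scoring slices rather than frames lets a frame-based metric such as PSNR react to distortions that evolve over time, since each slice has time as one of its axes.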

5.
Existing deraining methods based on convolutional neural networks (CNNs) have achieved great success, but remaining rain streaks can still degrade images drastically. In this work, we propose an end-to-end multi-scale context information and attention network, called MSCIANet. The proposed network consists of multi-scale feature extraction (MSFE) and multi-receptive fields feature extraction (MRFFE). First, the MSFE picks up features of rain streaks at different scales and propagates deep features of the two layers across stages by skip connections. Second, the MRFFE refines details of the background using an attention mechanism and depthwise separable convolutions of different receptive fields with different scales. Finally, fusing the outputs of the two subnetworks reconstructs the clean background image. Extensive experimental results show that the proposed network performs well on the deraining task on synthetic and real-world datasets. The demo is available at https://github.com/CoderLi365/MSCIANet.

6.
Digital image watermarking has become a necessity in many applications such as data authentication, broadcast monitoring on the Internet and ownership identification. Various watermarking schemes have been proposed to protect the copyright information. There are three indispensable, yet contrasting requirements for a watermarking scheme: imperceptibility, robustness and payload. Therefore, a watermarking scheme should provide a trade-off among these requirements from the information-theoretic perspective. Generally, in order to enhance the imperceptibility, robustness and payload simultaneously, the human visual system (HVS) and the statistical properties of the image signal should be fully taken into account. The statistical model-based transform domain multiplicative watermarking scheme embodies the above ideas, and therefore the detection and extraction of the multiplicative watermarks have received a great deal of attention. The performance of a statistical model-based watermark detector or decoder is highly influenced by the accuracy of the statistical model itself and the applicability of the decision rule. In this paper, we first propose a new hidden Markov tree (HMT) statistical model in the Contourlet domain, namely the Cauchy mixtures-based vector HMT (vector CMM–HMT), by describing the marginal distribution with a Cauchy mixture model (CMM) and grouping Contourlet coefficients into a vector, which can capture both the subband marginal distributions and the strong dependencies across scales and orientations of the Contourlet coefficients. Then, by modeling the Contourlet coefficients with the vector CMM–HMT and employing the locally most powerful (LMP) test, we develop a locally optimum image watermark decoder in the Contourlet domain. We conduct extensive experiments to evaluate the performance of the proposed blind watermark decoder, in which encouraging results validate the effectiveness of the proposed technique in comparison with state-of-the-art approaches recently proposed in the literature.

7.
8.
To increase the richness of the extracted text modality feature information and deeply explore the semantic similarity between modalities, we propose a novel method named adaptive weight multi-channel center similar deep hashing (AMCDH). The algorithm first utilizes three channels with different configurations to extract feature information from the text modality, and then adds them according to the learned weight ratio to increase the richness of the information. We also introduce the Jaccard coefficient to measure the semantic similarity level between modalities from 0 to 1, and utilize it as the penalty coefficient of the cross-entropy loss function to increase its role in backpropagation. Besides, we propose a method of constructing center similarity, which pulls the hash codes of similar data pairs toward the same center point and scatters dissimilar data pairs across different center points to generate high-quality hash codes. Extensive experimental evaluations on four benchmark datasets show that the performance of our proposed model AMCDH is significantly better than other competing baselines. The code can be obtained from https://github.com/DaveLiu6/AMCDH.git.
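The Jaccard coefficient mentioned above can be sketched as follows for two label sets. The abstract does not spell out how the coefficient enters the loss, so `weighted_bce` is only a generic weighted binary cross-entropy with a hypothetical `weight` argument, not AMCDH's actual loss.

```python
import math


def jaccard(labels_a, labels_b):
    """Jaccard coefficient |A and B| / |A or B| of two label sets, in [0, 1]."""
    a, b = set(labels_a), set(labels_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty label sets
    return len(a & b) / len(union)


def weighted_bce(probability, target, weight, eps=1e-12):
    """Binary cross-entropy for one pair, scaled by a penalty coefficient.

    probability: predicted probability that the pair is similar.
    target:      1 if the pair is similar, else 0.
    weight:      penalty coefficient (hypothetical use of the Jaccard value).
    """
    p = min(max(probability, eps), 1.0 - eps)  # avoid log(0)
    return -weight * (target * math.log(p) + (1 - target) * math.log(1.0 - p))


image_labels = {"cat", "dog"}
text_labels = {"dog", "bird"}
similarity = jaccard(image_labels, text_labels)  # one shared label out of three
loss = weighted_bce(0.5, target=1, weight=1.0 + similarity)
```

Unlike a binary "shares any label" indicator, the coefficient grades similarity between multi-label pairs, which is the property the abstract relies on.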

9.
Semantic segmentation aims to map each pixel of an image to its corresponding semantic label. Most existing methods either concentrate mainly on high-level features or simply combine low-level and high-level features from backbone convolutional networks, which may weaken or even ignore the compensation between different levels. To effectively take advantage of both shallow (textural) and deep (semantic) features, this paper proposes a novel plug-and-play module, namely the feature enhancement module (FEM). The proposed FEM first uses an information extractor to extract the desired details or semantics from different stages, and then enhances target features by taking in the extracted message. Two types of FEM, i.e., detail FEM and semantic FEM, can be customized. Concretely, the former strengthens textural information to protect key but tiny/low-contrast details from suppression/removal, while the latter highlights structural information to boost segmentation performance. When a given backbone network is equipped with FEMs, there may be two information flows, i.e., a detail flow and a semantic flow. Extensive experiments on the Cityscapes, ADE20K and PASCAL Context datasets are conducted to validate the effectiveness of our design. The code has been released at https://github.com/SuperZ-Liu/FENet.

10.
Underwater image enhancement has attracted much attention due to the rise of marine resource development in recent years. Benefiting from the powerful representation capabilities of Convolutional Neural Networks (CNNs), multiple underwater image enhancement algorithms based on CNNs have been proposed in the past few years. However, almost all of these algorithms employ the RGB color space setting, which is insensitive to image properties such as luminance and saturation. To address this problem, we propose the Underwater Image Enhancement Convolutional Neural Network using 2 Color Spaces (UIEC^2-Net), which efficiently and effectively integrates both the RGB color space and the HSV color space in one single CNN. To the best of our knowledge, this method is the first to use the HSV color space for underwater image enhancement based on deep learning. UIEC^2-Net is an end-to-end trainable network consisting of three blocks: an RGB pixel-level block that implements fundamental operations such as denoising and removing color cast; an HSV global-adjust block that globally adjusts underwater image luminance, color and saturation by adopting a novel neural curve layer; and an attention map block that combines the advantages of the RGB and HSV block output images by distributing weight to each pixel. Experimental results on synthetic and real-world underwater images show that the proposed method performs well in both subjective comparisons and objective metrics. The code is available at https://github.com/BIGWangYuDong/UWEnhancement.
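To make the RGB-versus-HSV point concrete, the sketch below converts a pixel to HSV with Python's standard `colorsys` module and rescales saturation and value directly. The fixed gains are a hypothetical stand-in for the learned neural curve layer, which is not reproduced here; the function names are likewise assumptions.

```python
import colorsys


def rgb_pixel_to_hsv(r, g, b):
    """Convert an 8-bit RGB pixel to (hue, saturation, value), each in [0, 1]."""
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)


def adjust_saturation_value(pixel, sat_gain=1.0, val_gain=1.0):
    """Scale saturation and value in HSV space and return an 8-bit RGB pixel.

    sat_gain and val_gain are fixed, hand-chosen factors (an assumption);
    results are clamped to 1.0 before converting back.
    """
    h, s, v = rgb_pixel_to_hsv(*pixel)
    s = min(1.0, s * sat_gain)
    v = min(1.0, v * val_gain)
    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    return (round(r * 255), round(g * 255), round(b * 255))


# Brighten a mid-grey pixel by doubling its value channel.
brightened = adjust_saturation_value((128, 128, 128), val_gain=2.0)
```

In HSV, luminance and saturation are separate channels, so a global adjustment touches one number per pixel instead of three coupled RGB values.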

11.
Due to the storage and retrieval efficiency of hashing, as well as the highly discriminative feature extraction by deep neural networks, deep cross-modal hashing retrieval has been attracting increasing attention in recent years. However, most existing deep cross-modal hashing methods simply employ a single label to directly measure the semantic relevance across different modalities, but neglect the potential contributions from multiple category labels. With the aim of improving the accuracy of cross-modal hashing retrieval by fully exploring the semantic relevance based on multiple labels of training data, in this paper we propose a multi-label semantics preserving based deep cross-modal hashing (MLSPH) method. MLSPH first utilizes the multi-labels of instances to calculate the semantic similarity of the original data. Subsequently, a memory bank mechanism is introduced to preserve the multi-label semantic similarity constraints and enforce the distinctiveness of learned hash representations over the whole training batch. Extensive experiments on several benchmark datasets reveal that the proposed MLSPH surpasses prominent baselines and reaches state-of-the-art performance in the field of cross-modal hashing retrieval. Code is available at: https://github.com/SWU-CS-MediaLab/MLSPH.

12.
Many image co-segmentation algorithms have been proposed over the last decade. In this paper, we present a new dataset for evaluating co-segmentation algorithms, which contains 889 image groups with 18 images in each, together with pixel-wise hand-annotated ground truths. The dataset is characterized by a simple background of nearly a single color. It looks simple but is actually very challenging for current co-segmentation algorithms because of four difficult cases in it: foreground easily confused with background, transparent regions in objects, minor holes in objects, and shadows. In order to test the usefulness of our dataset, we review the state-of-the-art co-segmentation algorithms and evaluate seven algorithms on our dataset. The obtained performance of each algorithm is compared with that previously reported on datasets with complex backgrounds. The results show that our dataset is valuable for the development of co-segmentation techniques. It is more feasible to solve the four difficulties above on the simple background and then extend the solutions to complex background problems. Our dataset can be freely downloaded from: http://www.iscbit.org/source/MLMR-COS.zip.

13.
14.
As the demand for realistic representation and its applications increases rapidly, 3D human modeling from a single RGB image has become an essential technique. Owing to the great success of deep neural networks, various learning-based approaches have been introduced for this task. However, partial occlusions still make it difficult to accurately estimate the 3D human model. In this letter, we propose the part-attentive kinematic regressor for 3D human modeling. The key idea of the proposed method is to predict body part attentions based on each body center position and estimate parameters of the 3D human model via corresponding attentive features through the kinematic chain-based decoder in a one-stage fashion. One important advantage is that the proposed method can yield natural shapes and poses even with severe occlusions. Experimental results on benchmark datasets show that the proposed method is effective for 3D human modeling under complicated real-world environments. The code and model are publicly available at: https://github.com/DCVL-3D/PKCN_release

15.
To overcome the barrier of storage and computation, the hashing technique has recently been widely used for nearest neighbor search in multimedia retrieval applications. In particular, cross-modal retrieval, which searches across different modalities, has become an active but challenging problem. Although numerous cross-modal hashing algorithms have been proposed to yield compact binary codes, exhaustive search is impractical for large-scale datasets, and Hamming distance computation suffers from inaccurate results. In this paper, we propose a novel search method that utilizes a probability-based index scheme over binary hash codes in cross-modal retrieval. The proposed indexing scheme employs a few binary bits from the hash code as the index code. We construct an inverted index table based on the index codes, and train a neural network for ranking and indexing to improve the retrieval accuracy. Experiments are performed on two benchmark datasets for retrieval across image and text modalities, where hash codes are generated and compared with several state-of-the-art cross-modal hashing methods. Results show that the proposed method effectively improves search accuracy, computation cost, and memory consumption across these datasets and hashing methods. The source code is available at https://github.com/msarawut/HCI.
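The indexing idea can be sketched as an inverted table keyed by a few bits of each hash code, so that a query scans one bucket instead of the whole database. Using the leading bits as the index code and ranking candidates by plain Hamming distance are assumptions of this example. The probability-based bit selection and the neural ranking network from the abstract are not reproduced.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of differing bits between two integer-encoded hash codes."""
    return bin(a ^ b).count("1")


def index_code(code, code_bits, index_bits):
    """Use the leading index_bits of a code as its bucket key (assumed choice)."""
    return code >> (code_bits - index_bits)


def build_inverted_index(codes, code_bits, index_bits):
    """Map each index code to the list of item ids whose hash code shares it."""
    table = defaultdict(list)
    for item_id, code in enumerate(codes):
        table[index_code(code, code_bits, index_bits)].append(item_id)
    return table


def search(query, codes, table, code_bits, index_bits, top_k=3):
    """Rank only the items in the query's bucket by Hamming distance."""
    candidates = table.get(index_code(query, code_bits, index_bits), [])
    ranked = sorted(candidates, key=lambda i: (hamming(query, codes[i]), i))
    return ranked[:top_k]


database = [0b10110010, 0b10111111, 0b01000001, 0b10100000]
table = build_inverted_index(database, code_bits=8, index_bits=2)
hits = search(0b10110000, database, table, code_bits=8, index_bits=2, top_k=2)
```

The trade-off is that items whose index bits differ from the query's are never examined, which is presumably why the original work adds learned ranking and indexing on top.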

16.
Saliency prediction precision has improved rapidly with the development of deep learning technology, but inference is slow due to the continuous deepening of networks. Hence, this paper proposes a fast saliency prediction model. Concretely, the Siamese network backbone based on a tailored EfficientNetV2 accelerates inference while maintaining high performance. The shared parameters strategy further curbs parameter growth. Furthermore, we add multi-channel activation maps to optimize the fine features considering different channels and low-level visual features, which improves the interpretability of the model. Extensive experiments show that the proposed model achieves competitive performance on the standard benchmark datasets and demonstrate the effectiveness of our method in striking a balance between prediction accuracy and inference speed. Moreover, the small model size allows our method to be applied on edge devices. The code is available at: https://github.com/lscumt/fast-fixation-prediction.

17.
Weak illumination conditions can seriously degrade the quality of captured images. Addressing the resulting degradations of low-light images can effectively improve the visual quality of images and the performance of high-level visual tasks. In this study, a novel Retinex-based Real-low to Real-normal Network (R2RNet) is proposed for low-light image enhancement, which includes three subnets: a Decom-Net, a Denoise-Net, and a Relight-Net. These three subnets are used for decomposition, denoising, and contrast enhancement with detail preservation, respectively. Our R2RNet not only uses the spatial information of the image to improve the contrast but also uses the frequency information to preserve the details. Therefore, our model achieves more robust results for all degraded images. Unlike most previous methods, which were trained on synthetic images, we collected the first Large-Scale Real-World paired low/normal-light image dataset (LSRW dataset) to satisfy the training requirements and give our model better generalization performance in real-world scenes. Extensive experiments on publicly available datasets demonstrate that our method outperforms the existing state-of-the-art methods both quantitatively and visually. In addition, our results show that the performance of a high-level visual task (i.e., face detection) can be effectively improved by using the enhanced results obtained by our method in low-light conditions. Our code and the LSRW dataset are available at: https://github.com/JianghaiSCU/R2RNet.

18.
This paper describes an ultra high definition (UHD) video dataset named DVL2021 for the perceptual study of video quality assessment (VQA). To our knowledge, DVL2021 is the first authentically distorted 4K (3840 × 2160) UHD video quality dataset. The dataset contains 206 versatile 4K UHD video sequences, all collected in the wild. Each sequence is captured at 50 frames per second (fps), stored in raw 10-bit 4:2:0 YUV format, and has a duration of 10 s. Following the subjective evaluation method for TV image quality specified in ITU-R BT.500-13, 32 unique participants took part in the manual annotation process, with ages ranging from the teens to the sixties (32.7 years old on average). DVL2021 has the following merits: (1) an enormous variety of video content, (2) capture by different types of cameras, (3) complex types and multiple levels of authentic distortion, (4) broadly distributed temporal/spatial information, and (5) a wide spectrum of mean opinion score (MOS) distribution. Furthermore, we conduct a benchmark experiment by evaluating several mainstream VQA methods on DVL2021. The baseline results are higher than 0.75 in the Spearman's rank order correlation coefficient (SROCC) metric. Our study provides a basis for the UHD VQA problem. DVL2021 is publicly available at https://github.com/GZHU-DVL/DVL2021.
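For readers unfamiliar with the SROCC metric quoted above, the following sketch computes it as the Pearson correlation of rank-transformed scores, with tied values sharing their average rank. The function names and the toy scores are illustrative assumptions, not data from DVL2021.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def srocc(predicted, mos):
    """Spearman rank-order correlation between predictions and MOS values."""
    return pearson(average_ranks(predicted), average_ranks(mos))


# Toy example: the predictions order the four videos exactly as the MOS does.
monotonic = srocc([0.1, 0.4, 0.35, 0.8], [1.0, 3.0, 2.0, 4.5])
```

Because only ranks matter, SROCC rewards a VQA model for ordering videos correctly even when its raw scores are on a different scale from the MOS.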

19.
This paper presents a novel No-Reference Video Quality Assessment (NR-VQA) model that utilizes the proposed 3D steerable wavelet transform-based Natural Video Statistics (NVS) features as well as human perceptual features. Additionally, we propose a novel two-stage regression scheme that significantly improves the overall performance of quality estimation. In the first stage, transform-based NVS and human perceptual features are separately passed through the proposed hybrid regression scheme: Support Vector Regression (SVR) followed by polynomial curve fitting. The two visual quality scores predicted in the first stage are then used as features for a similar second stage, which predicts the final quality scores of distorted videos by achieving score-level fusion. Extensive experiments were conducted using five authentic and four synthetic distortion databases. Experimental results demonstrate that the proposed method outperforms other published state-of-the-art benchmark methods on synthetic distortion databases and is among the top performers on authentic distortion databases. The source code is available at https://github.com/anishVNIT/two-stage-vqa.

20.
Recently, deep learning-based methods have achieved excellent performance on License Plate (LP) detection and recognition tasks. However, it is still challenging to build a robust model for Chinese LPs since there are not enough large and representative datasets. In this work, we propose a new dataset named Chinese Road Plate Dataset (CRPD) that contains multi-objective Chinese LP images as a supplement to the existing public benchmarks. The images are mainly captured with electronic monitoring systems and come with detailed annotations. To our knowledge, CRPD is the largest public multi-objective Chinese LP dataset with annotations of vertices. With CRPD, a unified detection and recognition network with high efficiency is presented as the baseline. The network is end-to-end trainable with real-time inference efficiency (30 fps at 640p). The experiments on several public benchmarks demonstrate that our method has reached competitive performance. The code and dataset will be publicly available at https://github.com/yxgong0/CRPD.
