Similar Documents
20 similar documents found (search time: 0 ms)
1.
Existing image captioning models based on visual attention or on textual attention cannot attend to fine image details and the image as a whole at the same time. To address this problem, we propose an evolutionary deep learning model for image captioning (EDLMIC). The model comprises three sub-modules: an image encoder, an evolutionary neural network, and an adaptive fusion decoder. It effectively fuses visual and textual information by automatically computing the proportion of each at every time step, and then generates a caption for the given image from the fused visual-textual information. Experiments on two public datasets, Flickr30K and COCO2014, show that EDLMIC outperforms the baseline models on four metrics (METEOR, ROUGE-L, CIDEr, and SPICE) and performs well across a variety of everyday scenes.
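
As an illustration of the adaptive-fusion idea, a learned sigmoid gate can compute the per-time-step proportion of visual versus textual context. This is a minimal sketch under that assumption; the module name, dimensions, and gating form below are illustrative, not the paper's implementation.

```python
# Minimal sketch of per-time-step visual/textual fusion (illustrative, not the
# paper's EDLMIC code): a learned sigmoid gate decides, at each decoding step,
# how much of the visual context versus the textual context enters the decoder.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)  # scores the two contexts jointly

    def forward(self, visual_ctx: torch.Tensor, text_ctx: torch.Tensor) -> torch.Tensor:
        # visual_ctx, text_ctx: (batch, dim) contexts for the current time step
        beta = torch.sigmoid(self.gate(torch.cat([visual_ctx, text_ctx], dim=-1)))
        return beta * visual_ctx + (1.0 - beta) * text_ctx  # fused context

fusion = AdaptiveFusion(dim=512)
fused = fusion(torch.randn(4, 512), torch.randn(4, 512))  # (4, 512)
```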

2.
Image captioning describes the visual content of a given image in natural-language sentences and plays a key role in fusing and exploiting image features. However, in existing image captioning models the decoder sometimes fails to capture the relationships between image features efficiently because it lacks sequential dependencies among them. In this paper, we propose a Relational-Convergent Transformer (RCT) network to obtain complex intra-modality representations for image captioning. In RCT, a Relational Fusion Module (RFM) captures the local and global information of an image through recursive fusion. A Relational-Convergent Attention (RCA) is then proposed, composed of self-attention and a hierarchical fusion module that aggregates global relational information to extract a more comprehensive intra-modal contextual representation. To validate the effectiveness of the proposed model, extensive experiments are conducted on the MSCOCO dataset. The results show that the proposed method outperforms some state-of-the-art methods.
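
A hedged sketch of what a relational-convergent-style block might look like: self-attention relates region features to one another, then a fusion step folds a global (mean-pooled) summary back into each region. All names and sizes here are illustrative, not the authors' implementation.

```python
# Sketch of a "relational-convergent" style block (not the authors' code):
# self-attention relates region features, then a small fusion step folds the
# global relation summary back into every region.
import torch
import torch.nn as nn

class RelationalConvergentBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.norm = nn.LayerNorm(dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, n_regions, dim) image-region features
        related, _ = self.attn(regions, regions, regions)  # intra-modal relations
        global_ctx = related.mean(dim=1, keepdim=True)     # convergent global summary
        global_ctx = global_ctx.expand_as(related)
        return self.norm(regions + self.fuse(torch.cat([related, global_ctx], dim=-1)))

block = RelationalConvergentBlock(dim=512)
out = block(torch.randn(2, 36, 512))  # (2, 36, 512)
```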

3.
Wang Shiwei, Lan Long, Zhang Xiang, Dong Guohua, Luo Zhigang. Multimedia Tools and Applications, 2020, 79(3-4): 2013-2030
Multimedia Tools and Applications - In image captioning, exploring advanced semantic concepts is very important for boosting captioning performance. Although much progress has been made in this...

4.
5.
The absolute positions used by attention-based recommendation models during feature extraction are static and isolated. To overcome this drawback, we propose a recommendation model with a relative-position attention mechanism built on a translation (encoder-decoder) architecture. User behavior histories are ordered chronologically and relative-position representations are constructed; these representations are injected both into the attention-weight computation and into the attention output. The attention encoder and decoder layers are deepened, and average attention is used for preprocessing. Experimental results show that, compared with attention-based baselines, the proposed model better captures the dynamic changes of user preferences, mines deeper information, and is better suited to long sequences.
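
The relative-position mechanism can be sketched along the lines of Shaw et al.'s relative-position self-attention, which matches the abstract's description of injecting relative-position representations into both the attention weights and the attention output. The clipping distance and dimensions below are assumptions.

```python
# Illustrative relative-position self-attention for a behavior sequence (after
# Shaw et al., 2018): learned embeddings for clipped relative offsets are added
# both to the attention logits and to the attention output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosSelfAttention(nn.Module):
    def __init__(self, dim: int, max_dist: int = 16):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.max_dist = max_dist
        n = 2 * max_dist + 1
        self.rel_k = nn.Embedding(n, dim)  # relative keys (for the logits)
        self.rel_v = nn.Embedding(n, dim)  # relative values (for the output)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) chronologically ordered behavior embeddings
        b, t, d = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        idx = torch.arange(t)
        rel = (idx[None, :] - idx[:, None]).clamp(-self.max_dist, self.max_dist) + self.max_dist
        logits = q @ k.transpose(-2, -1)  # (b, t, t) content term
        logits = logits + torch.einsum('btd,tsd->bts', q, self.rel_k(rel))
        w = F.softmax(logits / d ** 0.5, dim=-1)
        return w @ v + torch.einsum('bts,tsd->btd', w, self.rel_v(rel))

attn = RelPosSelfAttention(dim=64)
print(attn(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```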

6.

In this work, we present a novel multi-scale feature fusion network (M-FFN) for the image captioning task that incorporates discriminative features and scene-contextual information of an image. We construct the network by combining a spatial transformation network and a multi-scale feature pyramid network via a feature fusion block to enrich spatial and global semantic information. In particular, the multi-scale feature pyramid network incorporates global contextual information by applying atrous convolutions on the top layers of a convolutional neural network (CNN), while the spatial transformation network operates on early CNN layers to remove intra-class variability caused by spatial transformations. The feature fusion block then integrates the global contextual information and spatial features to encode the visual information of an input image. A spatial-semantic attention module is further incorporated to learn attentive contextual features that guide the captioning module. The efficacy of the proposed model is evaluated on the COCO dataset.
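
The "atrous convolutions on top CNN layers" step resembles an ASPP-style pyramid. The following is a minimal, hedged sketch; the dilation rates and channel widths are assumptions, not M-FFN's actual configuration.

```python
# ASPP-style multi-scale pyramid over a top-layer CNN feature map: parallel
# atrous (dilated) convolutions capture context at several receptive fields,
# and a 1x1 convolution fuses the branches.
import torch
import torch.nn as nn

class AtrousPyramid(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (batch, in_ch, H, W) top-layer CNN feature map
        multi_scale = torch.cat([b(feat) for b in self.branches], dim=1)
        return self.project(multi_scale)  # fused global-context features

aspp = AtrousPyramid(in_ch=2048, out_ch=256)
print(aspp(torch.randn(1, 2048, 14, 14)).shape)  # torch.Size([1, 256, 14, 14])
```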

7.
This paper analyzes the predictions of image captioning models with attention mechanisms beyond visualizing the attention itself. We develop variants of Layer-wise Relevance Propagation (LRP) and gradient-based explanation methods tailored to image captioning models with attention mechanisms. We systematically compare the interpretability of attention heatmaps against the explanations produced by methods such as LRP, Grad-CAM, and Guided Grad-CAM. We show that explanation methods simultaneously provide pixel-wise image explanations (supporting and opposing pixels of the input image) and linguistic explanations (supporting and opposing words of the preceding sequence) for each word in the predicted captions. We demonstrate with extensive experiments that explanation methods (1) can reveal additional evidence, beyond attention, that the model uses to make decisions; (2) correlate with object locations with high precision; and (3) help "debug" the model, e.g., by analyzing the causes of hallucinated object words. Building on the observed properties of these explanations, we further design an LRP-inference fine-tuning strategy that reduces object hallucination in image captioning models while maintaining sentence fluency. We conduct experiments with two widely used attention mechanisms: adaptive attention computed with additive attention, and multi-head attention computed with the scaled dot product.
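
The basic building block behind LRP is easy to state concretely. Below is the generic epsilon-rule for a single linear layer (standard LRP, not the paper's attention-specific variants): relevance is redistributed to the inputs in proportion to their contribution to each output activation.

```python
# LRP epsilon-rule for one linear layer: z holds the forward pre-activations,
# the epsilon stabilizer avoids division by ~0, and the input relevance is
# x_j * sum_k(w_kj * R_k / z_k).
import torch

def lrp_epsilon_linear(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor,
                       relevance_out: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # x: (batch, in), weight: (out, in), relevance_out: (batch, out)
    z = x @ weight.t() + bias            # forward pre-activations
    z = z + eps * torch.sign(z)          # stabilizer
    s = relevance_out / z                # per-output relevance density
    return x * (s @ weight)              # relevance on the inputs

x = torch.randn(2, 8)
w, b = torch.randn(4, 8), torch.randn(4)
r_in = lrp_epsilon_linear(x, w, b, relevance_out=torch.rand(2, 4))
print(r_in.shape)  # torch.Size([2, 8]); relevance mapped back to the inputs
```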

8.
9.
To improve the performance of semantic image classifiers, we propose a hierarchical association-rule classifier for semantic images based on axiomatic fuzzy sets (AFS). First, to improve accuracy, after extracting features from the image dataset, AFS theory is used to construct AFS attribute representations of the fuzzy concepts in the image set, improving attribute discriminability. Second, to improve computational efficiency, hierarchical association rules are used to build the semantic image classifier, exploiting the ontology information between concepts to improve parallel classification. Finally, parameter studies and comparative experiments show that the proposed algorithm achieves high accuracy and high computational efficiency.

10.
To make photo editing easier for non-expert users, we propose TMGAN, a Transformer-based image editing model that lets users modify image attributes automatically through natural-language descriptions. TMGAN adopts a generative adversarial framework. The generator uses a Transformer-encoder structure to extract global context, addressing the problem of insufficiently realistic generated images; the discriminator comprises a Transformer-based multi-scale discriminator and a word-level discriminator, which give the generator fine-grained feedback so that it generates target images matching the text description while preserving the text-irrelevant content of the original image. On the CUB Bird dataset, the model achieves an inception score (IS) of 9.07, a Fréchet inception distance (FID) of 8.64, and a manipulation precision (MP) of 0.081. Compared with existing models, TMGAN performs better: the generated images both satisfy the attributes specified in the text and exhibit strong semantic fidelity.
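
A word-level discriminator of the kind described can be sketched as per-word dot-product attention over image regions followed by a per-word realness score. This is illustrative only; TMGAN's actual discriminator is not reproduced here.

```python
# Hypothetical word-level discriminator sketch: each word embedding attends to
# image-region features, and a per-word score says how well that word is
# reflected in the edited image (fine-grained feedback for the generator).
import torch
import torch.nn as nn

class WordLevelDiscriminator(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, words: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # words: (batch, n_words, dim); regions: (batch, n_regions, dim)
        attn = torch.softmax(
            words @ regions.transpose(-2, -1) / words.size(-1) ** 0.5, dim=-1)
        word_ctx = attn @ regions  # image evidence gathered per word
        return torch.sigmoid(self.score(words * word_ctx)).squeeze(-1)  # (batch, n_words)

disc = WordLevelDiscriminator(dim=256)
scores = disc(torch.randn(2, 12, 256), torch.randn(2, 49, 256))  # per-word realness
```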

11.
12.
In order to cope with the ambiguity of spatial relative-position concepts, we propose a new definition of the relative position between two objects in a fuzzy-set framework. This definition is based on a morphological and fuzzy pattern-matching approach and consists of comparing an object to a fuzzy landscape representing the degree of satisfaction of a directional relationship to a reference object. It has good formal properties, it is flexible, it fits intuition, and it can be used for structural pattern recognition under imprecision. Moreover, it also applies in 3D and to fuzzy objects derived from images.
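
The fuzzy-landscape construction can be made concrete with a brute-force sketch: for each pixel, the degree of lying, say, "to the right of" the reference object is f(θmin) with f(θ) = max(0, 1 − 2θ/π), the standard choice in this morphological/fuzzy pattern-matching approach. The implementation below is illustrative and unoptimized.

```python
# Brute-force fuzzy directional landscape: for every pixel outside the reference
# object, take the smallest angle between the chosen direction and the vector
# from any reference pixel, and map it through f(theta) = max(0, 1 - 2*theta/pi).
import numpy as np

def fuzzy_landscape(ref_mask: np.ndarray, direction=(0.0, 1.0)) -> np.ndarray:
    h, w = ref_mask.shape
    u = np.asarray(direction, dtype=float)
    u /= np.linalg.norm(u)
    ys, xs = np.nonzero(ref_mask)                # reference-object pixels
    land = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            if ref_mask[i, j]:
                land[i, j] = 1.0                 # inside the reference object
                continue
            d = np.stack([i - ys, j - xs], axis=1).astype(float)
            cos = (d @ u) / np.linalg.norm(d, axis=1)
            theta = np.arccos(np.clip(cos, -1.0, 1.0))
            land[i, j] = max(0.0, 1.0 - 2.0 * theta.min() / np.pi)
    return land  # per-pixel degree of satisfaction of the directional relation

mask = np.zeros((16, 16), dtype=bool)
mask[6:10, 2:5] = True                           # a small reference object
print(fuzzy_landscape(mask, direction=(0, 1)).round(2)[8])  # row through the object
```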

13.
Objective: Attention mechanisms are a common component of image captioning models; they automatically attend to different image regions while dynamically generating the textual description. However, they commonly fail to focus properly: when generating a word, the model sometimes attends to unimportant regions of an object, sometimes to the object's context, and sometimes misses important targets altogether, making the caption inaccurate. To address this, we propose an image captioning model that combines a multi-level decoder with a dynamic fusion mechanism to improve captioning accuracy. Method: We extend the Transformer architecture into a model with three modules: image feature encoding, multi-level text decoding, and adaptive fusion. The multi-level text decoding structure progressively refines the predicted text, providing reliable feedback that continually corrects the attention mechanism so that it generates more accurate captions. A text fusion module adaptively fuses the coarse-to-fine captions so that the outputs of the lower-level decoders participate directly in word prediction, which both alleviates vanishing gradients during training and keeps the output rich in detail and grammatically diverse. Results: The model is evaluated with several metrics on the MS COCO (Microsoft common objects in context) and Flickr30K datasets and compared with 12 representative methods; it outperforms all of them. On MS C...
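
The adaptive fusion of coarse-to-fine decoder outputs can be illustrated with a small head that mixes per-level word logits under learned softmax weights. This is a sketch of the idea only; the paper's module is more elaborate.

```python
# Sketch: each decoder level predicts word logits, and learned softmax weights
# mix them so low-level decoders contribute directly to the final prediction.
import torch
import torch.nn as nn

class LevelFusionHead(nn.Module):
    def __init__(self, dim: int, vocab: int, n_levels: int):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_levels))
        self.level_logits = nn.Parameter(torch.zeros(n_levels))  # learned mixing weights

    def forward(self, level_states: list) -> torch.Tensor:
        # level_states: n_levels tensors of shape (batch, seq_len, dim)
        w = torch.softmax(self.level_logits, dim=0)
        logits = [head(h) for head, h in zip(self.heads, level_states)]
        return sum(wi * li for wi, li in zip(w, logits))  # fused word logits

head = LevelFusionHead(dim=512, vocab=10000, n_levels=3)
states = [torch.randn(2, 7, 512) for _ in range(3)]
print(head(states).shape)  # torch.Size([2, 7, 10000])
```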

14.
Hyperspectral images contain rich spatial and spectral information, which provides a strong basis for distinguishing different land-cover objects, so hyperspectral image (HSI) classification has been a hot research topic. With the advent of deep learning, convolutional neural networks (CNNs) have become a popular method for HSI classification. However, CNNs have strong local feature-extraction ability yet handle long-range dependencies poorly. The Vision Transformer (ViT) is a recent development that addresses this limitation, but it is less effective at extracting local features and has low computational efficiency. To overcome these drawbacks, we propose a hybrid classification network that combines the strengths of both CNN and ViT, named the Spatial-Spectral Former (SSF). The shallow layers employ 3D convolution to extract local features and reduce data dimensionality; the deep layers employ a spectral-spatial transformer module for global feature extraction and information enhancement along the spectral and spatial dimensions. The proposed model achieves promising results on widely used public HSI datasets compared with other deep learning methods, including CNN, ViT, and hybrid models.
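
The shallow-3D-convolution-plus-Transformer pattern can be sketched as follows; layer sizes, kernel shapes, and head counts are assumptions, not SSF's.

```python
# Hybrid CNN + ViT sketch for HSI patches: a 3D convolution extracts local
# spectral-spatial features, then a Transformer encoder models long-range
# dependencies over the per-pixel tokens.
import torch
import torch.nn as nn

class HybridHSIClassifier(nn.Module):
    def __init__(self, bands: int, n_classes: int, dim: int = 64):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(3, 1, 1)),  # spectral x spatial
            nn.ReLU(),
        )
        self.proj = nn.Linear(8 * bands, dim)  # one token per pixel
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, patch: torch.Tensor) -> torch.Tensor:
        # patch: (batch, 1, bands, H, W) hyperspectral patch
        f = self.conv3d(patch)                            # (batch, 8, bands, H, W)
        b, c, s, h, w = f.shape
        tokens = f.permute(0, 3, 4, 1, 2).reshape(b, h * w, c * s)
        tokens = self.encoder(self.proj(tokens))          # global dependencies
        return self.cls(tokens.mean(dim=1))               # per-patch class logits

model = HybridHSIClassifier(bands=30, n_classes=16)
print(model(torch.randn(2, 1, 30, 9, 9)).shape)  # torch.Size([2, 16])
```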

15.
李康康, 张静. 《计算机应用》, 2021, 41(9): 2504-2509
Image captioning is an important branch of image understanding: it requires not only recognizing the image content correctly, but also generating sentences that are grammatically and semantically correct. Traditional encoder-decoder models cannot fully exploit image features and use a single, fixed decoding scheme. To address these problems, we propose an attention-based image captioning model with multi-level encoding and decoding. First, Faster R-CNN (Faster Region-base...

16.
The socio-economic development of the World Wide Web gained strong momentum through Web 2.0, and the Web of Data is now adding a further technological driver. The tremendous growth of media data combined with structured (linked) data promises further opportunities for digital marketplaces. Although the integration of media content with linked data is only beginning, there are already working groups and projects addressing the issue. We show how media and existing datasets can be seamlessly integrated, enabling an extended user experience when interacting with media content on the web. We focus on automatic semantic enhancement services that can link arbitrary, openly accessible data and introduce opportunities for media annotation, fragmentation, and presentation. Our use-case scenario is based on the Red Bull Content Pool, a media management system for videos, images, and articles about Red Bull related content covering a multitude of sports events.

17.
Deep-learning-based autonomous driving for the Internet of Vehicles (IoV) has achieved great success. In tunnel environments, however, computer-vision-based IoV systems may fail because of low illumination. To handle this issue, this paper deploys an image-enhancement module at the IoV terminal to mitigate the effect of low illumination; the enhanced images can then be submitted through the IoT to a cloud server for further processing. The core enhancement algorithm is a dynamic graph-embedded Transformer network trained with federated learning, which fully utilizes the data resources of multiple devices in the IoV and improves generalization. Extensive comparative experiments are conducted on a publicly available dataset and on a self-built dataset collected in a tunnel environment. Compared with other deep models, all results confirm that the proposed graph-embedded Transformer effectively enhances the detail of low-light images, which greatly benefits subsequent IoV tasks.
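
At its core, the federated training reduces to federated averaging of client weights. Below is a minimal, generic FedAvg sketch (not the paper's exact protocol): each vehicle trains locally on its own low-light images, and the server builds a sample-weighted average of the resulting weights.

```python
# Generic FedAvg: weighted average of client state_dicts, weighted by local
# sample counts. Works directly for float parameters of any nn.Module.
import copy

def fedavg(client_states: list, client_sizes: list) -> dict:
    # client_states: per-client model state_dicts; client_sizes: local sample counts
    total = float(sum(client_sizes))
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(s[key] * (n / total) for s, n in zip(client_states, client_sizes))
    return avg  # global weights for the next communication round

# Usage with any nn.Module:
# global_model.load_state_dict(
#     fedavg([m1.state_dict(), m2.state_dict()], client_sizes=[120, 80]))
```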

18.
Distinguishing the subtle differences among fine-grained images from subordinate concepts of a concept hierarchy is a challenging task. In this paper, we propose a Siamese transformer with hierarchical concept embedding (STr HCE), which contains two transformer subnetworks sharing all configurations, and each subnetwork is equipped with hierarchical semantic information at different concept levels for fine-grained image embeddings. In particular, one subnetwork is for coarse-scale patches to l...

19.
20.
Objective: Infrared images play an important role in industry, but for technical reasons their resolution is generally low, which limits their applicability. Many low-resolution infrared sensors are deployed alongside high-resolution visible-light sensors, so a natural approach is to use the high-resolution image captured by the visible-light sensor to guide super-resolution reconstruction of the infrared image. Method: We propose a neural network that performs infrared image super-resolution guided by a high-resolution visible image. It contains two modules: a guided-Transformer module and a super-resolution reconstruction module. Since infrared/visible image pairs usually exhibit some parallax and are not perfectly aligned, we use a guided-Transformer-based information guidance and fusion method that searches the high-resolution visible image for relevant texture information and fuses it with the low-resolution infrared features into a composite feature. This composite feature then passes through the super-resolution reconstruction sub-network to produce the final super-resolved infrared image. In the reconstruction module we adopt a channel-splitting strategy to eliminate redundant features in the deep model, reducing computation and improving performance. Results: The method is compared with other representative image super-resolution methods on the FLIR-aligned dataset; the experimental results show that it achieves superior super-resolution performance. In objective terms, compared with other guided infrared super-resolution methods, it achieves a higher peak signal-to-noise ratio (pea...
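
The "search relevant texture in the visible image" step can be illustrated as cross-attention in which low-resolution infrared tokens query high-resolution visible tokens. This is a hedged sketch; the paper's guided Transformer and channel-splitting details are not reproduced.

```python
# Cross-attention texture search: infrared features (queries) attend over
# visible-light features (keys/values); the attended texture is concatenated
# with the infrared features to form the composite feature.
import torch
import torch.nn as nn

class GuidedTextureFusion(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, ir_tokens: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # ir_tokens: (batch, n_ir, dim) from the LR infrared image (queries)
        # vis_tokens: (batch, n_vis, dim) from the HR visible image (keys/values)
        texture, _ = self.cross(ir_tokens, vis_tokens, vis_tokens)
        return self.merge(torch.cat([ir_tokens, texture], dim=-1))  # composite feature

fuse = GuidedTextureFusion(dim=128)
out = fuse(torch.randn(2, 64, 128), torch.randn(2, 256, 128))  # (2, 64, 128)
```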
