Similar literature
 Found 20 similar documents (search time: 62 ms)
1.
High-spatial-resolution (HSR) remote sensing images serve as carriers of geographic information. Exploring geo-objects and their geospatial relations is fundamental to understanding HSR remote sensing images. To this end, this study proposes an intelligent semantic understanding method for HSR remote sensing images via geospatial relation captions. First, we propose a method of geospatial relation expression to convey the topological, directional and distance relations of geo-objects in HSR images. Second, on the basis of images and their geospatial relation captions, an image dataset is constructed for model training. Finally, geospatial relation captioning is implemented for HSR images by using an attention-based deep neural network model. Experimental results demonstrate that the proposed captioning method can effectively provide geospatial semantics for HSR image understanding.

2.
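Most current image captioning models consist of an image encoder based on a convolutional neural network (CNN) and a caption decoder based on a recurrent neural network (RNN). The image encoder extracts visual features from the image, and the caption decoder generates the caption from those features through an attention mechanism. The problem with an attention-based RNN, however, is that although the decoder can model attention over the interaction between image features and the caption, it ignores self-attention over the interactions within the caption itself. For the image captioning task, this paper therefore proposes a model that combines the advantages of recurrent networks and self-attention networks. On the one hand, the model uses self-attention to capture intra-modal and inter-modal interactions simultaneously within a unified attention region; on the other hand, it retains the inherent advantages of recurrent networks. Experiments on the MSCOCO dataset show that the CIDEr score increases from 1.135 to 1.166, indicating that the proposed method effectively improves image captioning performance.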
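As a rough illustration of the self-attention idea mentioned in this abstract, the following minimal sketch computes scaled dot-product self-attention over a short sequence of token vectors. It is not the model from the paper: the function names are hypothetical, and the query, key and value projections are assumed to be the identity so that the example stays small.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections (an assumption).

    tokens: list of equal-length float vectors, one per caption token.
    Returns (outputs, weights), where weights[i][j] is how much token i
    attends to token j, and outputs[i] is the weighted mix of all tokens.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this token to every token in the same sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)]
        )
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the input tokens. A real captioning model would learn separate projection matrices and combine this with the image features.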

3.
To automatically mine the underlying relationships between famous people in daily news (for example, to build a network of news persons that uses faces as icons to facilitate face-based person finding), we need a tool that automatically labels faces in new images with their real names. This paper studies the problem of linking names with faces in large-scale collections of captioned news images. In our previous work, we proposed a method called Person-based Subset Clustering, which is mainly based on clustering all face images derived from the same name. The location where a name appears in a caption, as well as the visual structural information within a news image, provides informative cues about who actually appears in the associated image. By combining the domain knowledge from the captions and the corresponding images, we propose a novel cross-modality approach to further improve the performance of linking names with faces. The experiments are performed on data sets comprising approximately half a million news images from Yahoo! news, and the results show that the proposed method achieves significant improvement over clustering-only methods.

4.
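The image captioning task uses a computer to automatically generate, for a given image, a complete and fluent descriptive sentence suited to the corresponding scene, achieving a cross-modal conversion from image to text. With the wide application of deep learning, both the accuracy and the inference speed of image captioning algorithms have improved greatly. Based on an extensive literature survey, this paper divides research on deep-learning-based image captioning into two levels: building the basic capability of image captioning, and studying its effectiveness in applications. These two levels can be further broken down into core technical challenges: conveying richer feature information, addressing exposure bias, generating diverse captions, making captions controllable, and improving inference speed. For these challenges, the paper analyzes methods for conveying richer feature information from the perspectives of attention mechanisms, pre-trained models and multimodal models; methods for addressing exposure bias from the perspectives of reinforcement learning, non-autoregressive models, and curriculum learning with scheduled sampling; methods for generating diverse captions from the perspectives of graph convolutional networks, generative adversarial networks and data augmentation; methods for caption controllability from the perspectives of content control and style control; and methods for improving inference speed from the perspectives of non-autoregressive models, grid-based visual features and CNN-based decoders. In addition, the paper describes in detail the common datasets, evaluation metrics and performance of existing algorithms in the image captioning field, and, regarding the open problems and future research in image captioning...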

5.
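Video-based caption retrieval and extraction   Total citations: 2, self-citations: 0, citations by others: 2
Many video streams, such as news programs and VCDs, contain captions that carry rich semantic information. Based on the distinctive characteristics of captions, this paper proposes a video-based method for caption retrieval and extraction, with satisfactory experimental results. The method may also serve as a reference for retrieving captions in other languages, such as Japanese and Korean.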

6.
7.
Given an unstructured collection of captioned images of cluttered scenes featuring a variety of objects, our goal is to simultaneously learn the names and appearances of the objects. Only a small fraction of local features within any given image are associated with a particular caption word, and captions may contain irrelevant words not associated with any image object. We propose a novel algorithm that uses the repetition of feature neighborhoods across training images and a measure of correspondence with caption words to learn meaningful feature configurations (representing named objects). We also introduce a graph-based appearance model that captures some of the structure of an object by encoding the spatial relationships among the local visual features. In an iterative procedure, we use language (the words) to drive a perceptual grouping process that assembles an appearance model for a named object. Results of applying our method to three data sets in a variety of conditions demonstrate that, from complex, cluttered, real-world scenes with noisy captions, we can learn both the names and appearances of objects, resulting in a set of models invariant to translation, scale, orientation, occlusion, and minor changes in viewpoint or articulation. These named models, in turn, are used to automatically annotate new, uncaptioned images, thereby facilitating keyword-based image retrieval.

8.
9.
In this paper, we present methods for face recognition using a collection of images with captions. We consider two tasks: retrieving all faces of a particular person in a data set, and establishing the correct association between the names in the captions and the faces in the images. This is challenging because of the very large appearance variation in the images, as well as the potential mismatch between images and their captions.

10.
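To address mismatching and error accumulation in microscopic image stitching, image alignment errors are divided into a first class and a second class of errors, and a new image stitching method is proposed. First, a global alignment model is built from the local alignment constraints of all overlapping image pairs, which eliminates the error accumulation caused by first-class errors. Then, based on the distribution of the global alignment error, a minimum-loop consistency method is proposed to eliminate second-class errors. Experiments show that the stitching method is computationally simple and effective, and is suitable for large-scale microscopic image stitching.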

11.
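Objective: Images in the biomedical literature are often compound figures containing multiple modalities. Automatically labeling their categories helps improve image retrieval performance and supports medical research and teaching. Methods: Information from two modalities, image content and caption text, is fused, with a multi-label classification model based on deep convolutional neural networks built for each. The visual classification model borrows natural images and single-label simple biomedical images to perform heterogeneous and homogeneous transfer learning, capturing general features of the generic domain and specific features of the biomedical domain, while the text classification model uses the captions of simple biomedical images to perform homogeneous transfer learning. A staged fusion strategy then combines the outputs of the two modality models to identify the relevant modalities of multi-label medical images. Results: The proposed cross-modal multi-label classification algorithm is evaluated on the dataset of the ImageCLEF2016 biomedical image multi-label classification task. The hybrid transfer learning method based on image content achieves a lower Hamming loss and a higher macro-averaged F1 than the method using heterogeneous transfer learning alone. Introducing homogeneous transfer learning into the text classification model clearly improves label classification performance. Finally, the multi-label classification model that fuses both modalities obtains a Hamming loss close to the best result of the evaluation task, while the macro-averaged F1 rises from 0.320 to 0.488, an increase of about 52.5%. Conclusion: The experimental results show that the cross-modal multi-label classification algorithm for biomedical images, by fusing image content and caption text and introducing homogeneous and heterogeneous data for transfer learning, alleviates the small scale and imbalanced label distribution of annotated data in the biomedical image domain, identifies the modality information in compound medical images more effectively, and thereby improves image retrieval performance.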
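The two metrics reported in this abstract can be made concrete with a small sketch. The code below computes Hamming loss and macro-averaged F1 for multi-label predictions given as 0/1 lists, and checks the stated relative gain from 0.320 to 0.488. The function names and the toy labels are illustrative assumptions, as is the convention of scoring a label's F1 as 0 when it has no true or predicted positives; this is not the evaluation script of the ImageCLEF task.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when the label never occurs or is predicted.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Toy data: two images, three candidate labels each (hypothetical values).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
loss = hamming_loss(y_true, y_pred)   # 2 wrong decisions out of 6
f1 = macro_f1(y_true, y_pred)         # per-label F1: 1.0, 1.0, 0.0

# Relative improvement quoted in the abstract: (0.488 - 0.320) / 0.320
relative_gain = (0.488 - 0.320) / 0.320
```

Macro averaging weights every label equally, which is why it is sensitive to rare labels and suits the imbalanced label distribution the abstract describes.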

12.
Images play an important role in the representation and acquisition of specialized knowledge. Not surprisingly, terminological knowledge bases (TKBs) often include images as a way to enhance the information in concept entries. However, the selection of these images should not be random, but rather based on specific guidelines that take into account the type and nature of the concept being described. This paper presents a proposal on how to combine the features of images with the conceptual propositions in EcoLexicon, a multilingual TKB on the environment. This proposal is based on the following: (1) the combinatory possibilities of concept types; (2) image types, such as photographs, drawings and flow charts; (3) morphological features or visual knowledge patterns (VKPs), such as labels, colours, arrows, and their effect on the functional nature of each image type. Currently, images are stored in association with concept entries according to the semantic content of their definitions, but they are not described or annotated according to the parameters that guided their selection, which would undoubtedly contribute to the systematization and automation of the process. First, the images included in EcoLexicon were analyzed in terms of their adequacy, the semantic relations expressed, the concept types and their VKPs. Then, with these data, guidelines for image selection and annotation were created. The final aim is twofold: (1) to systematize the selection of images and (2) to start annotating old and new images so that the system can automatically allocate them to different concept entries based on shared conceptual propositions.

13.
Image captioning (IC) is a crucial part of visual data processing and aims to understand an image in order to provide captions that verbalize its important elements. In existing work, however, the complexity of images, neglect of the major relations between the objects in an image, poor image quality and labelling remain big problems for researchers. The main objective of this work is therefore to overcome these challenges by proposing a novel framework for IC. The framework consists of three phases: image understanding, textual understanding and decoding. First, the image understanding phase begins with image pre-processing to enhance image quality. Objects are then detected using IYV3MMDs (Improved YoloV3 Multishot Multibox Detectors) in order to relate the image to its objects, followed by MBFOCNNs (Modified Bacterial Foraging Optimization in Convolution Neural Networks), which encode the image and provide the final feature vectors. Second, the textual understanding phase begins with text preprocessing, in which unwanted words, phrases and punctuation are removed to provide clean text. It is followed by MGloVEs (Modified Global Vectors for Word Representation), which provide word embeddings of features with the highest priority given to the objects present in the image. Finally, the decoding phase decodes the image, whether it is a normal or a complex scene, and produces accurate text through its learning ability using MDAA (Modified Deliberate Adaptive Attention). The experimental results show an accuracy of 96.24%, better than that of existing and similar methods for generating image captions.

14.
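Video caption detection and extraction is one of the key technologies for video understanding. This paper proposes a two-stage caption detection and extraction algorithm that detects caption frames and caption regions separately, improving both detection efficiency and accuracy. The first stage detects caption frames: first, motion detection based on the inter-frame difference algorithm makes a preliminary judgment about captions and yields a binarized image sequence; then the sequence is filtered a second time according to the dynamic characteristics of ordinary captions and scrolling captions, yielding the caption frames. The second stage detects and extracts caption regions in the caption frames: first, the Sobel edge detection algorithm makes an initial detection of text regions; then constraints such as height are used to remove the background, and the aspect ratio is used to distinguish vertical from horizontal captions, yielding all captions in the caption frames, namely static captions, ordinary captions and scrolling captions. The method reduces the number of frames that need to be examined and improves caption detection efficiency by about 11%. Comparative experiments show that, compared with using inter-frame difference and edge detection alone, the method improves the F-measure by about 9%.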
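The two low-level operations named in this abstract, inter-frame differencing and Sobel edge detection, can be sketched on tiny grayscale frames stored as nested lists. This is only an illustration of the primitives, not the paper's algorithm: the function names, the threshold of 30 and the use of |gx| + |gy| as the gradient magnitude are assumptions, and the caption-specific filtering rules are not reproduced.

```python
def frame_difference_mask(prev, curr, threshold=30):
    """Binarize the absolute difference between two grayscale frames.

    A pixel is 1 where the intensity changed by more than `threshold`
    (an assumed value), otherwise 0.
    """
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def sobel_magnitude(img):
    """Approximate Sobel gradient magnitude |gx| + |gy|; border pixels stay 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]) - (
                img[y - 1][x - 1] + 2 * img[y][x - 1] + img[y + 1][x - 1]
            )
            # Vertical gradient: bottom row minus top row.
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]) - (
                img[y - 1][x - 1] + 2 * img[y - 1][x] + img[y - 1][x + 1]
            )
            out[y][x] = abs(gx) + abs(gy)
    return out


# One pixel changes between frames, so exactly one mask entry is set.
prev_frame = [[0] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[1][1] = 200
mask = frame_difference_mask(prev_frame, curr_frame)

# A vertical step edge produces a strong horizontal gradient.
edges = sobel_magnitude([[0, 0, 255, 255] for _ in range(3)])
```

In a pipeline like the one described, the difference mask would be used to decide which frames are worth passing to the more expensive edge-based region detection.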

15.
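With the rapid development and wide application of multimedia technology, image quality assessment (IQA) has attracted growing attention for its important role in multimedia processing, including image data screening and the selection and optimization of algorithm parameters. Depending on whether reference information is needed at application time, IQA can be divided into full-reference, reduced-reference and no-reference IQA: the first two require complete and partial reference information respectively, while the third requires none. In all three cases, image distortion strongly affects IQA, mainly in two respects: the construction of IQA databases and the design of IQA models. From the perspective of image distortion, this paper surveys IQA models published in China and abroad from 2011 to 2021, covering full-reference, reduced-reference and no-reference models. By distortion type, IQA models are divided into models for synthetic distortion, models for authentic distortion and models for algorithm-related distortion. Synthetic distortion refers to artificially added noise, such as Gaussian noise and blur, and is usually uniformly distributed. Authentic distortion refers to distortion introduced during image acquisition by factors such as the environment, the capture device or improper operation. Compared with synthetic distortion, authentic distortion is more complex, may comprise one or more distortions, and makes data collection harder. Algorithm-related distortion refers to the case where image processing or computer vision algorithms, in processing images...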

16.
17.
18.
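Current research on image caption generation is largely limited to a single language (such as English), thanks to large-scale corpora of images with manually annotated English captions. This paper explores Chinese image caption generation with zero annotated resources, using English as the pivot language. Specifically, drawing on neural machine translation, the paper proposes and compares two methods for generating Chinese image captions: (1) the pipeline method, which first generates an English caption for the image and then translates it into Chinese; and (2) the pseudo-training-corpus method, which first translates the English captions of the training images into Chinese to obtain a pseudo-annotated corpus of images and Chinese captions, and then trains a Chinese image captioning model. For the second method, the paper also compares word-based and character-based Chinese caption generation models. Experimental results show that the pseudo-training-corpus method outperforms the pipeline method, and that the character-based model outperforms the word-based model, reaching a BLEU_4 score of 0.341.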

19.
A generic algorithm is presented for the automatic extraction of buildings and roads from complex urban environments in high-resolution satellite images, where extracting both object types at the same time enhances performance. The proposed approach exploits spectral properties in conjunction with spatial properties, which provide complementary information to each other. First, a high-resolution pansharpened colour image is obtained by merging the high-resolution panchromatic (PAN) and the low-resolution multispectral images, yielding a colour image at the resolution of the PAN band. Natural and man-made regions are classified and segmented by the Normalized Difference Vegetation Index (NDVI). Shadow regions are detected by the chromaticity to intensity ratio in the YIQ colour space. After the classification of the vegetation and the shadow areas, the rest of the image consists of man-made areas only. The man-made areas are partitioned by mean shift segmentation, where some resulting segments are irrelevant to buildings in terms of shape. These artefacts are eliminated in two steps. First, each segment is thinned using morphological operations and its length is compared to a threshold that is determined according to the empirical length of the buildings. As a result, long segments, which most probably represent roads, are masked out. Second, the erroneous thin artefacts, which are classified by principal component analysis (PCA), are removed. In parallel to PCA, small artefacts are also wiped out based on morphological processes. The resultant man-made mask image is overlaid on the ground-truth image, in which the buildings have previously been labelled, for the accuracy assessment of the methodology. The method is applied to Quickbird images (2.4 m multispectral R, G, B, near-infrared (NIR) bands and 0.6 m PAN band) of eight different urban regions, each of which includes different properties of surface objects. The images range from simple to complex urban areas. The simple image type includes a regular urban area with low density and a regular building pattern. The complex image type involves almost all kinds of challenges, such as small and large buildings, regions with bare soil, vegetation areas, shadows and so on. Although the performance of the algorithm changes slightly for various urban complexity levels, it performs well for all types of urban areas.
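The vegetation step in this abstract relies on the Normalized Difference Vegetation Index, NDVI = (NIR - Red) / (NIR + Red). The sketch below computes it per pixel and thresholds it into a vegetation mask. The function names, the threshold of 0.3 and the choice to return 0.0 when both bands are zero are assumptions for illustration; the abstract does not state the threshold the authors used.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    Returns 0.0 when both reflectances are zero (an assumed convention
    to avoid division by zero).
    """
    denom = nir + red
    return (nir - red) / denom if denom else 0.0


def vegetation_mask(nir_band, red_band, threshold=0.3):
    """Mark pixels whose NDVI exceeds `threshold` (hypothetical value) as vegetation."""
    return [
        [ndvi(n, r) > threshold for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Toy reflectance values: the first pixel looks like vegetation, the second does not.
nir_band = [[0.6, 0.2]]
red_band = [[0.2, 0.3]]
mask = vegetation_mask(nir_band, red_band)
```

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so its NDVI is high, while built surfaces and bare soil tend to give values near or below zero.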

20.
Most written materials consist of multimedia (MM) information because, besides text, they usually contain image information. Present information retrieval and filtering systems use only the text parts of documents or, at best, images represented by keywords or image captions. Why not use both the text and the image features of documents, and so exploit the document information content more completely in the retrieval or filtering process? Can such an approach increase the effectiveness of retrieval and filtering? There is very little difference between retrieval and filtering at an abstract level. In this paper, we discuss some possible similarities and differences between them at the application level, taking into account experiments in the retrieval and filtering of multimedia mineral information.
