Similar Documents
20 similar documents found (search time: 484 ms)
1.
To automatically mine the underlying relationships between famous people in daily news (for example, to build a news-person network with faces as icons that facilitates face-based person finding), we need a tool that automatically labels the faces in news images with their real names. This paper studies the problem of linking names with faces in large-scale news images with captions. In our previous work, we proposed a method called Person-based Subset Clustering, which is mainly based on clustering all face images associated with the same name. The location where a name appears in a caption, as well as the visual structural information within a news image, provides informative cues about who actually appears in the associated image. By combining domain knowledge from the captions with the corresponding images, we propose a novel cross-modality approach that further improves the performance of linking names with faces. Experiments on a dataset of approximately half a million news images from Yahoo! News show that the proposed method achieves significant improvement over clustering-only methods.
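As a rough illustration of the clustering step this builds on, the sketch below clusters all faces that co-occur with one name and keeps the dominant cluster as that person's faces. The face embeddings, the cosine-distance threshold, and the use of scikit-learn's agglomerative clustering are assumptions for illustration, not details taken from the paper.

```python
# Per-name face clustering in the spirit of Person-based Subset Clustering:
# faces from images whose captions mention a given name are clustered, and
# the largest cluster is assumed to depict that person.
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2

def dominant_cluster(face_embeddings: np.ndarray, distance_threshold: float = 0.4):
    """face_embeddings: (n_faces, dim) array of precomputed face descriptors.
    Returns the indices of the largest cluster."""
    if len(face_embeddings) == 1:
        return np.array([0])
    labels = AgglomerativeClustering(
        n_clusters=None,                      # let the threshold decide
        distance_threshold=distance_threshold,
        linkage="average",
        metric="cosine",
    ).fit_predict(face_embeddings)
    counts = np.bincount(labels)
    return np.where(labels == counts.argmax())[0]
```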

2.
In this paper we report on our experiments on aligning names and faces as found in the images and captions of online news websites. Accurate technologies for linking names and faces are valuable when retrieving or mining information from multimedia collections. We perform exhaustive and systematic experiments exploiting the (a)symmetry between the visual and textual modalities, leading to different schemes for assigning names to faces, assigning faces to names, and establishing name-face link pairs. On top of that, we investigate generic approaches that use textual and visual structural information to predict whether the corresponding entity is present in the other modality. The proposed methods are completely unsupervised and are inspired by methods for aligning phrases and words across texts in different languages, originally developed for constructing machine-translation dictionaries. The results are competitive with state-of-the-art recall on the "Labeled Faces in the Wild" dataset (now reported on the complete dataset), include excellent precision values, and show the value of text and image analysis for estimating the probability of being pictured or named during alignment.
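One simple way to realize the "name-face link pairs" scheme within a single image-caption pair, sketched under stated assumptions (the similarity matrix and the null score are assumed inputs; the paper's actual alignment procedure is more elaborate):

```python
# Assign caption names to detected faces via an assignment problem, with one
# padding column per name so that a name can remain unmatched ("not pictured").
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_names_faces(sim: np.ndarray, null_score: float = 0.3):
    """sim[i, j]: similarity of name i to face j.
    Returns {name_index: face_index or None}."""
    n_names, n_faces = sim.shape
    padded = np.hstack([sim, np.full((n_names, n_names), null_score)])
    rows, cols = linear_sum_assignment(padded, maximize=True)
    return {r: (c if c < n_faces else None) for r, c in zip(rows, cols)}
```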

3.
News Caption Detection Based on Edge Detection and Line Features. Cited: 2 (self-citations: 0, other citations: 2)
Captions in news video carry rich semantic information and are important for understanding the current video content, so detecting them accurately matters. After analyzing the characteristics of news captions, this paper proposes a news caption detection method based on edge detection and line features. The algorithm first converts the image to grayscale to remove redundant color information, then performs edge detection and line filtering to discard lines that do not match character features, and finally detects and merges caption regions to extract the captions. Experiments on news video frames from different channels, compared against other methods, show that the proposed algorithm achieves high detection recall and precision.
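A minimal OpenCV sketch of the kind of pipeline described above: grayscale conversion, edge detection, merging of character strokes, and geometric filtering of caption-like regions. The Canny thresholds, kernel size, and aspect-ratio rule are illustrative guesses, not the paper's parameters.

```python
import cv2

def detect_caption_boxes(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)   # drop color information
    edges = cv2.Canny(gray, 100, 200)                    # edge detection
    # Close small gaps so that character strokes merge into text blocks.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 3))
    blocks = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(blocks, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w > 3 * h and h > 10:   # wide, short regions typical of caption lines
            boxes.append((x, y, w, h))
    return boxes
```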

4.
Most current image caption generation models consist of an image encoder based on a convolutional neural network (CNN) and a caption decoder based on a recurrent neural network (RNN). The encoder extracts the visual features of the image, and the decoder generates a caption from those features through an attention mechanism. The problem with attention-based RNN decoders, however, is that while they model attention over the interaction between image features and the caption, they ignore the self-attention over interactions within the caption itself. For the image caption generation task, this paper therefore proposes a model that combines the advantages of recurrent networks and self-attention networks: it captures both intra-modal and inter-modal interactions within a unified attention region via the self-attention model, while retaining the inherent strengths of the recurrent network. Experimental results on the MSCOCO dataset show that the CIDEr score improves from 1.135 to 1.166, demonstrating that the proposed method effectively improves image caption generation.
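A minimal PyTorch sketch of the general idea: a recurrent decoder step whose input is augmented with cross-attention over image regions (inter-modal) and self-attention over previously generated word states (intra-modal). Dimensions, fusion by concatenation, and the module layout are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentiveDecoderCell(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lstm = nn.LSTMCell(3 * d_model, d_model)

    def forward(self, word_emb, prev_states, img_feats, h, c):
        # word_emb: (B, D); prev_states: (B, T, D) with T >= 1 (include the
        # start token's state on the first step); img_feats: (B, R, D) regions.
        q = word_emb.unsqueeze(1)
        visual, _ = self.cross_attn(q, img_feats, img_feats)       # inter-modal
        textual, _ = self.self_attn(q, prev_states, prev_states)   # intra-modal
        x = torch.cat([word_emb, visual.squeeze(1), textual.squeeze(1)], dim=-1)
        return self.lstm(x, (h, c))   # next hidden and cell states
```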

5.
High-spatial-resolution (HSR) remote sensing images serve as carriers of geographic information. Exploring geo-objects and their geospatial relations is fundamental in understanding HSR remote sensing images. To this end, this study proposes an intelligent semantic understanding method for HSR remote sensing images via geospatial relation captions. Firstly, we propose a method of geospatial relation expression to convey the topological, directional and distance relations of geo-objects in HSR images. Secondly, on the basis of images and their geospatial relation captions, an image dataset is constructed for model training. Finally, geospatial relation captioning is implemented for HSR images by using an attention-based deep neural network model. Experimental results demonstrate that the proposed captioning method can effectively provide geospatial semantics for HSR image understanding.
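As one small, concrete ingredient of such geospatial relation expressions, the sketch below derives an eight-way directional relation between two geo-objects from their bounding-box centroids; the discretization and the image-coordinate convention are assumptions, not the paper's exact formulation.

```python
import math

DIRS = ["east", "northeast", "north", "northwest",
        "west", "southwest", "south", "southeast"]

def directional_relation(box_a, box_b):
    """Boxes are (x_min, y_min, x_max, y_max) in image coordinates (y down).
    Returns the direction of object A relative to object B."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    # Negate dy so that "north" corresponds to up in the image.
    angle = math.degrees(math.atan2(-(ay - by), ax - bx)) % 360
    return DIRS[int(((angle + 22.5) % 360) // 45)]
```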

6.
The SLIF project combines text-mining and image processing to extract structured information from biomedical literature. SLIF extracts images and their captions from published papers. The captions are automatically parsed for relevant biological entities (protein and cell type names), while the images are classified according to their type (e.g., micrograph or gel). Fluorescence microscopy images are further processed and classified according to the depicted subcellular localization. The results of this process can be queried online using either a user-friendly web interface or an XML-based web service. As an alternative to the targeted query paradigm, SLIF also supports browsing the collection based on latent topic models which are derived from both the annotated text and the image data. The SLIF web application, as well as the labeled datasets used for training system components, is publicly available at http://slif.cbi.cmu.edu.

7.
This paper proposes a movie summarization algorithm that fuses visual and audio features. Close-up face detection and the detection of tense, intense shots serve as the basis for selecting important video segments. To address the difficulty of detecting speech endpoints in films, the movie's subtitle file is used to extract the start and end times of each utterance along with its content, enabling accurate speech endpoint detection. Experiments show that the movie summaries generated by this method are effective.
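A sketch of the subtitle-based endpointing idea: speech start/end times and the spoken content are read directly from the film's subtitle file instead of being detected acoustically. The abstract does not name a subtitle format, so standard SRT is assumed here.

```python
import re

SRT_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})")

def parse_srt(path):
    """Yield (start_seconds, end_seconds, text) for each subtitle cue."""
    for block in open(path, encoding="utf-8").read().split("\n\n"):
        lines = [l for l in block.splitlines() if l.strip()]
        for i, line in enumerate(lines):
            m = SRT_TIME.search(line)
            if m:
                v = [int(g) for g in m.groups()]
                start = v[0] * 3600 + v[1] * 60 + v[2] + v[3] / 1000
                end = v[4] * 3600 + v[5] * 60 + v[6] + v[7] / 1000
                yield start, end, " ".join(lines[i + 1:])
                break
```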

8.
Video-Based Caption Retrieval and Extraction. Cited: 2 (self-citations: 0, other citations: 2)
Many video streams, such as news programs and VCDs, contain captions that carry rich semantic information. Exploiting the distinctive properties of captions, this paper proposes a video-based caption retrieval and extraction method; the experimental results are satisfactory. The method also offers a useful reference for caption retrieval in other languages such as Japanese and Korean.

11.
Image captioning uses a computer to automatically generate a complete, fluent sentence appropriate to the scene of a given image, realizing a cross-modal transformation from image to text. With the widespread adoption of deep learning, both the accuracy and the inference speed of image captioning algorithms have improved greatly. Based on an extensive literature survey, this paper organizes research on deep-learning-based image captioning at two levels: building the basic captioning capability, and studying the effectiveness of captioning in applications. These levels break down into the core technical challenges of conveying richer feature information, resolving the exposure-bias problem, generating diverse captions, making caption generation controllable, and speeding up caption inference. For these challenges, the paper analyzes methods for conveying richer feature information from the perspectives of attention mechanisms, pre-trained models, and multimodal models; methods for resolving exposure bias from the perspectives of reinforcement learning, non-autoregressive models, and curriculum learning with scheduled sampling; methods for generating diverse captions from the perspectives of graph convolutional networks, generative adversarial networks, and data augmentation; methods for controllable captioning from the perspectives of content control and style control; and methods for faster inference from the perspectives of non-autoregressive models, grid-based visual features, and CNN-based decoders. In addition, the paper reviews the common datasets, evaluation metrics, and the performance of existing algorithms in detail, and discusses the open problems in image captioning and future research...

13.
Given an unstructured collection of captioned images of cluttered scenes featuring a variety of objects, our goal is to simultaneously learn the names and appearances of the objects. Only a small fraction of local features within any given image are associated with a particular caption word, and captions may contain irrelevant words not associated with any image object. We propose a novel algorithm that uses the repetition of feature neighborhoods across training images and a measure of correspondence with caption words to learn meaningful feature configurations (representing named objects). We also introduce a graph-based appearance model that captures some of the structure of an object by encoding the spatial relationships among the local visual features. In an iterative procedure, we use language (the words) to drive a perceptual grouping process that assembles an appearance model for a named object. Results of applying our method to three data sets in a variety of conditions demonstrate that, from complex, cluttered, real-world scenes with noisy captions, we can learn both the names and appearances of objects, resulting in a set of models invariant to translation, scale, orientation, occlusion, and minor changes in viewpoint or articulation. These named models, in turn, are used to automatically annotate new, uncaptioned images, thereby facilitating keyword-based image retrieval.
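As an illustrative stand-in for the paper's word-feature correspondence measure, the sketch below scores how strongly a caption word and a visual feature cluster co-occur across training images using plain pointwise mutual information; the authors' actual measure is more involved.

```python
import math

def pmi(n_images: int, n_word: int, n_cluster: int, n_both: int) -> float:
    """PMI between a caption word and a visual cluster from image-level counts:
    n_word images mention the word, n_cluster images contain the cluster,
    n_both contain both, out of n_images total."""
    if n_both == 0:
        return float("-inf")   # never co-occur
    p_w, p_c, p_wc = n_word / n_images, n_cluster / n_images, n_both / n_images
    return math.log(p_wc / (p_w * p_c))
```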

14.
Face recognition by independent component analysis. Cited: 74 (self-citations: 0, other citations: 0)
A number of current face recognition algorithms use face representations found by unsupervised statistical methods. Typically these methods find a set of basis images and represent faces as a linear combination of those images. Principal component analysis (PCA) is a popular example of such methods. The basis images found by PCA depend only on pairwise relationships between pixels in the image database. In a task such as face recognition, in which important information may be contained in the high-order relationships among pixels, it seems reasonable to expect that better basis images may be found by methods sensitive to these high-order statistics. Independent component analysis (ICA), a generalization of PCA, is one such method. We used a version of ICA derived from the principle of optimal information transfer through sigmoidal neurons. ICA was performed on face images in the FERET database under two different architectures, one which treated the images as random variables and the pixels as outcomes, and a second which treated the pixels as random variables and the images as outcomes. The first architecture found spatially local basis images for the faces. The second architecture produced a factorial face code. Both ICA representations were superior to representations based on PCA for recognizing faces across days and changes in expression. A classifier that combined the two ICA representations gave the best performance.
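A compact sketch of the two architectures, using scikit-learn's FastICA as a stand-in for the infomax ICA used in the paper; X holds vectorized face images, one per row, and the paper's PCA preprocessing is omitted.

```python
import numpy as np
from sklearn.decomposition import FastICA

def two_ica_architectures(X: np.ndarray, n_components: int = 50, seed: int = 0):
    """X: (n_images, n_pixels). Returns per-face representations under both
    architectures (a rough mapping of the paper's setups)."""
    # Architecture I: images as random variables, pixels as outcomes.
    # Sources along the pixel axis act as spatially local basis images.
    ica1 = FastICA(n_components=n_components, random_state=seed,
                   whiten="unit-variance")
    basis_images = ica1.fit_transform(X.T).T         # (n_components, n_pixels)
    coeffs = X @ np.linalg.pinv(basis_images)        # face coefficients

    # Architecture II: pixels as random variables, images as outcomes.
    # Sources along the image axis yield a factorial code per face.
    ica2 = FastICA(n_components=n_components, random_state=seed,
                   whiten="unit-variance")
    factorial_code = ica2.fit_transform(X)           # (n_images, n_components)
    return coeffs, factorial_code
```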

15.
Synthesis of Novel Views from a Single Face Image. Cited: 8 (self-citations: 3, other citations: 5)

16.
Most written materials consist of multimedia (MM) information, because besides text they usually contain images. Present information retrieval and filtering systems use only the text parts of documents or, at best, images represented by keywords or image captions. Why not use both the text and image features of documents and exploit the documents' information content more completely in the retrieval or filtering process? Can such an approach increase the effectiveness of retrieval and filtering? At an abstract level there is very little difference between retrieval and filtering. In this paper, we discuss some possible similarities and differences between them at the application level, taking into account experiments in the retrieval and filtering of multimedia mineral information.

17.
Human faces are remarkably similar in global properties, including size, aspect ratio, and location of the main features, but can vary considerably in details across individuals, gender, race, or facial expression. We propose a novel method for 3D shape recovery of faces that exploits this similarity. Our method takes as input a single image and uses only a single 3D reference model of a different person's face. Classical reconstruction methods from single images, i.e., shape-from-shading, require knowledge of the reflectance properties and lighting, as well as depth values for boundary conditions. Recent methods circumvent these requirements by representing input faces as combinations of (hundreds of) stored 3D models. We propose instead to use the input image as a guide to "mold" a single reference model into a reconstruction of the sought 3D shape. Our method assumes Lambertian reflectance and uses harmonic representations of lighting. It has been tested on images taken under controlled viewing conditions as well as on uncontrolled images downloaded from the Internet, demonstrating its accuracy and robustness under a variety of imaging conditions and overcoming significant differences in shape between the input and reference individuals, including differences in facial expression, gender, and race.
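One concrete step implied by the Lambertian and harmonic-lighting assumptions is recovering low-order lighting coefficients by least squares against the reference model's normals and albedo. The first-order basis below is a simplification for illustration; the full method (higher-order harmonics, the molding of depth, boundary handling) is not shown.

```python
import numpy as np

def fit_first_order_lighting(intensities, albedo, normals):
    """intensities: (N,) observed pixel values; albedo: (N,);
    normals: (N, 3) unit normals taken from the reference model.
    Under Lambertian reflectance, I is approx. albedo * (l . [1, nx, ny, nz])."""
    Y = np.hstack([np.ones((len(normals), 1)), normals])  # harmonic basis
    A = albedo[:, None] * Y
    l, *_ = np.linalg.lstsq(A, intensities, rcond=None)
    return l   # four first-order lighting coefficients
```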

18.
Image captioning (IC) is a crucial part of visual data processing; it aims to understand an image and provide captions that verbalize its important elements. In existing work, however, image complexity, neglect of the major relations between the objects in an image, and poor image quality make labeling a hard problem for researchers. This work attempts to overcome these challenges with a novel IC framework consisting of three phases: image understanding, textual understanding, and decoding. The image understanding phase starts with image pre-processing to enhance image quality; objects are then detected using IYV3MMDs (Improved YoloV3 Multishot Multibox Detectors) to relate the image and its objects, followed by MBFOCNNs (Modified Bacterial Foraging Optimization in Convolution Neural Networks), which encode and produce the final feature vectors. The textual understanding phase begins with text pre-processing, in which unwanted words, phrases, and punctuation are removed, and is followed by MGloVEs (Modified Global Vectors for Word Representation), which provide word embeddings of the features with the highest priority toward the objects present in an image. Finally, the decoding phase handles both normal and complex scene images and produces accurate text through its learning ability, using MDAA (Modified Deliberate Adaptive Attention). Experiments show an accuracy of 96.24% when generating captions for images, better than existing and similar methods.

19.
Research on image caption generation has so far been largely limited to a single language (typically English), thanks to large-scale corpora of images manually annotated with English captions. This paper explores Chinese image caption generation with zero annotated Chinese resources, using English as a pivot language. Specifically, drawing on neural machine translation, it proposes and compares two methods: (1) a pipeline method, which first generates an English caption for the image and then translates it into Chinese; and (2) a pseudo-training-corpus method, which first translates the English captions of the training images into Chinese to obtain a pseudo-annotated image-to-Chinese-caption corpus and then trains a Chinese captioning model on it. For the second method, the paper further compares word-based and character-based Chinese caption generation models. Experimental results show that the pseudo-corpus method outperforms the pipeline method and that the character-based model outperforms the word-based one, reaching a BLEU-4 score of 0.341.
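A sketch of the pseudo-training-corpus method under stated assumptions: the English captions of the training images are machine-translated into Chinese to form image-Chinese pairs, and the character-based variant simply tokenizes captions into characters. `translate_en_to_zh` is a hypothetical stand-in for whichever MT system is used.

```python
def build_pseudo_corpus(train_pairs, translate_en_to_zh):
    """train_pairs: iterable of (image_id, english_caption).
    Returns (image_id, chinese_caption) pairs for training a captioner."""
    return [(img, translate_en_to_zh(cap)) for img, cap in train_pairs]

def to_char_tokens(zh_caption: str):
    # Character-based modeling (which the paper found to outperform the
    # word-based variant) treats every Chinese character as one token.
    return list(zh_caption.replace(" ", ""))
```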

20.
征察, 吉立新, 李邵梅, 高超. Journal of Computer Applications (《计算机应用》), 2017, 37(10): 3006-3011
Traditional face annotation methods for news images rely mainly on face-similarity information and are poor at distinguishing noise faces from non-noise faces and at labeling the non-noise ones. To address this, a news-image face annotation method based on multimodal information fusion is proposed. First, based on the co-occurrence of faces and names, an improved k-nearest-neighbor algorithm produces a face-name matching degree from face-similarity information. Then, face size and position are extracted from the image to characterize face importance, and name position is extracted from the text to characterize name importance. Finally, a back-propagation neural network fuses this information to infer face labels, and a label-correction strategy further improves the annotation results. On the Labeled Yahoo! News dataset, the method achieves an annotation accuracy of 77.11%, a precision of 73.58%, and a recall of 78.75%; compared with algorithms based on face similarity alone, it is better at distinguishing noise from non-noise faces and at labeling the non-noise faces.
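A sketch of the fusion step under stated assumptions: the k-NN matching degree, the face-importance cues (size, position), and the name-position cue are concatenated and fed to a small back-propagation network that classifies face-name pairs. scikit-learn's MLPClassifier stands in for the paper's BP network; the feature layout and network size are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_fusion_net(match_degree, face_size, face_pos, name_pos, labels):
    """Each argument is a length-n array of per-(face, name)-pair features;
    labels marks whether the pair is a correct annotation."""
    X = np.column_stack([match_degree, face_size, face_pos, name_pos])
    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
    return net.fit(X, labels)   # net.predict_proba scores new pairs
```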
