Similar literature
 Found 20 similar documents; search took 15 ms
1.
We address the problem of automatically learning the recurring associations between the visual structures in images and the words in their associated captions, yielding a set of named object models that can be used for subsequent image annotation. In previous work, we used language to drive the perceptual grouping of local features into configurations that capture small parts (patches) of an object. However, model scope was poor, leading to poor object localization during detection (annotation), and ambiguity was high when part detections were weak. We extend and significantly revise our previous framework by using language to drive the perceptual grouping of parts, each a configuration in the previous framework, into hierarchical configurations that offer greater spatial extent and flexibility. The resulting hierarchical multipart models remain scale, translation and rotation invariant, but are more reliable detectors and provide better localization. Moreover, unlike typical frameworks for learning object models, our approach requires no bounding boxes around the objects to be learned, can handle heavily cluttered training scenes, and is robust in the face of noisy captions, i.e., where objects in an image may not be named in the caption, and objects named in the caption may not appear in the image. We demonstrate improved precision and recall in annotation over the non-hierarchical technique and also show extended spatial coverage of detected objects.

2.
Probabilistic Models of Appearance for 3-D Object Recognition   (total citations: 6; self-citations: 0; citations by others: 6)
We describe how to model the appearance of a 3-D object using multiple views, learn such a model from training images, and use the model for object recognition. The model uses probability distributions to describe the range of possible variation in the object's appearance. These distributions are organized on two levels. Large variations are handled by partitioning training images into clusters corresponding to distinctly different views of the object. Within each cluster, smaller variations are represented by distributions characterizing uncertainty in the presence, position, and measurements of various discrete features of appearance. Many types of features are used, ranging in abstraction from edge segments to perceptual groupings and regions. A matching procedure uses the feature uncertainty information to guide the search for a match between model and image. Hypothesized feature pairings are used to estimate a viewpoint transformation taking account of feature uncertainty. These methods have been implemented in an object recognition system, OLIVER. Experiments show that OLIVER is capable of learning to recognize complex objects in cluttered images, while acquiring models that represent those objects using relatively few views.

3.
We present an approach to the recognition of complex-shaped objects in cluttered environments based on edge information. We first use example images of a target object in typical environments to train a classifier cascade that determines whether edge pixels in an image belong to an instance of the desired object or the clutter. Presented with a novel image, we use the cascade to discard clutter edge pixels and group the object edge pixels into overall detections of the object. The features used for the edge pixel classification are localized, sparse edge density operations. Experiments validate the effectiveness of the technique for recognition of a set of complex objects in a variety of cluttered indoor scenes under arbitrary out-of-image-plane rotation. Furthermore, our experiments suggest that the technique is robust to variations between training and testing environments and is efficient at runtime.

4.
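Object features extracted by object detection algorithms play an important role in image captioning. However, using only detection features as the input to the captioning task loses information other than that about the key objects, and the generated captions fail to express the relationships between objects in the image accurately. To address these shortcomings, an object Transformer encoder is proposed to encode the object features in an image, along with a shifted-window Transformer encoder to encode the relationship features in an image, so that different aspects of the image's information are jointly encoded from different perspectives. The object features encoded by the object Transformer and the relationship features encoded by the shifted-window Transformer are fused by concatenation, combining the image's internal relationship features with its local object features; a Transformer decoder then decodes the fused features into the corresponding image caption. Experiments on the MS-COCO dataset show that the proposed model clearly outperforms the baseline model, with BLEU-4, METEOR, ROUGE-L, and CIDEr scores of 38.6%, 28.7%, 58.2%, and 127.4%, respectively; it also outperforms traditional image captioning network models and generates more detailed and accurate captions.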

5.
The appearance of an object is composed of local structure. This local structure can be described and characterized by a vector of local features measured by local operators such as Gaussian derivatives or Gabor filters. This article presents a technique where appearances of objects are represented by the joint statistics of such local neighborhood operators. As such, this represents a new class of appearance-based techniques for computer vision. Based on joint statistics, the paper develops techniques for the identification of multiple objects at arbitrary positions and orientations in a cluttered scene. Experiments show that these techniques can identify over 100 objects in the presence of major occlusions. Most remarkably, the techniques have low complexity and therefore run in real time.
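
As a rough illustration of the joint-statistics idea in this abstract, the Python sketch below quantizes local feature vectors into a multidimensional histogram and compares two histograms. The function names, the bin range, and the use of histogram intersection as the comparison measure are assumptions made for illustration; the abstract does not specify them. The feature vectors are assumed to have been computed already by some local operator, and the input list is assumed to be non-empty.

```python
from collections import Counter


def joint_histogram(feature_vectors, lo=-1.0, hi=1.0, bins=8):
    """Normalized joint histogram of local feature vectors.

    Each vector is quantized component-wise into `bins` cells over [lo, hi]
    (values outside the range are clamped to the edge cells). The result maps
    a tuple of bin indices to the fraction of vectors that fell into it.
    """
    width = (hi - lo) / bins
    counts = Counter()
    for vec in feature_vectors:
        key = tuple(min(bins - 1, max(0, int((v - lo) / width))) for v in vec)
        counts[key] += 1
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity between two normalized histograms (1.0 means identical)."""
    return sum(min(p, h2.get(key, 0.0)) for key, p in h1.items())
```

A sparse dictionary is used here instead of a dense array because a joint histogram over several feature dimensions is mostly empty.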

6.
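Most current image captioning models consist of an image encoder based on a convolutional neural network (CNN) and a caption decoder based on a recurrent neural network (RNN). The image encoder extracts the visual features of the image, and the caption decoder generates the caption from those features through an attention mechanism. The problem with an attention-based RNN, however, is that although the decoder can model attention over the interactions between image features and the caption, it ignores self-attention over the interactions within the caption itself. For the image captioning task, this paper therefore proposes a model that combines the advantages of recurrent networks and self-attention networks. On the one hand, the model uses self-attention to capture both intra-modal and inter-modal interactions within a unified attention region; on the other hand, it retains the inherent advantages of recurrent networks. Experiments on the MSCOCO dataset show that the CIDEr score improves from 1.135 to 1.166, indicating that the proposed method effectively improves image captioning performance.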

7.
Jia Xin, Wang Yunbo, Peng Yuxin, Chen Shengyong. Multimedia Tools and Applications, 2022, 81(15): 21349-21367

Transformer-based architectures have shown encouraging results in image captioning. They usually use self-attention-based methods to establish the semantic association between objects in an image in order to predict the caption. However, when the appearance features of the candidate object and the query object show weak dependence, self-attention-based methods struggle to capture the semantic association between them. In this paper, a Semantic Association Enhancement Transformer model is proposed to address this challenge. First, an Appearance-Geometry Multi-Head Attention is introduced to model a visual relationship by integrating the geometry features and appearance features of the objects. The visual relationship characterizes the semantic association and relative position among the objects. Second, a Visual Relationship Improving module is presented to weigh the importance of the appearance feature and geometry feature of the query object to the modeled visual relationship. Then, the visual relationship among different objects is adaptively improved according to the constructed importance, especially for objects with weak dependence on appearance features, thereby enhancing their semantic association. Extensive experiments on the MS COCO dataset demonstrate that the proposed method outperforms state-of-the-art methods.

8.
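The image captioning task uses a computer to automatically generate a complete, fluent, and scene-appropriate descriptive sentence for a given image, achieving a cross-modal conversion from image to text. With the widespread application of deep learning, both the accuracy and the inference speed of image captioning algorithms have improved greatly. Based on an extensive literature survey, this paper divides research on deep-learning-based image captioning into two levels: building the basic capabilities of image captioning, and studying the effectiveness of image captioning in applications. These two levels can be further divided into core technical challenges: conveying richer feature information, addressing exposure bias, generating diverse captions, making captions controllable, and speeding up caption inference. For these challenges, the paper analyzes methods for conveying richer feature information from the perspectives of attention mechanisms, pre-trained models, and multimodal models; methods for addressing exposure bias from the perspectives of reinforcement learning, non-autoregressive models, and curriculum learning with scheduled sampling; methods for generating diverse captions from the perspectives of graph convolutional networks, generative adversarial networks, and data augmentation; methods for caption controllability from the perspectives of content control and style control; and methods for speeding up caption inference from the perspectives of non-autoregressive models, grid-based visual features, and convolutional-neural-network decoders. In addition, the paper gives a detailed account of the common datasets, evaluation metrics, and performance of existing algorithms in the field, and on the open problems in image captioning and future research...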

9.
3D object pose estimation for grasping and manipulation is a crucial task in robotic and industrial applications. Robustness and efficiency for robotic manipulation are desirable properties that are still very challenging in complex and cluttered scenes, because 3D objects have different appearances, illumination and occlusion when seen from different viewpoints. This article proposes a Semantic Point Pair Feature (PPF) method for 3D object pose estimation, which combines semantic image segmentation using deep learning with voting-based 3D object pose estimation. The Part Mask RCNN is presented to obtain the semantic object-part segmentation related to the point cloud of the object, which is combined with the PPF method for 3D object pose estimation. In order to reduce the cost of collecting datasets in cluttered scenes, a physically-simulated environment is constructed to generate labeled synthetic semantic datasets. Finally, two robotic bin-picking experiments are demonstrated and the Part Mask RCNN for scene segmentation is evaluated through the constructed 3D object datasets. The experimental results show that the proposed Semantic PPF method improves the robustness and efficiency of 3D object pose estimation in cluttered scenes with partial occlusions.
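
The abstract does not spell out the point pair feature itself. As a minimal sketch, the Python code below computes the commonly used four-component form for two oriented points: the distance between them and three angles involving their normals. The function names and the quantization step sizes are hypothetical, the semantic segmentation and voting stages are not shown, and the two points are assumed to be distinct with non-zero normals.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def _angle(a, b):
    """Angle in radians between two non-zero vectors."""
    cosine = _dot(a, b) / (_norm(a) * _norm(b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cosine)))


def point_pair_feature(p1, n1, p2, n2):
    """Return (|d|, angle(n1, d), angle(n2, d), angle(n1, n2)) with d = p2 - p1."""
    d = _sub(p2, p1)
    return (_norm(d), _angle(n1, d), _angle(n2, d), _angle(n1, n2))


def quantize(feature, dist_step=0.05, angle_step=math.radians(12)):
    """Discretize a feature into integer bins (step sizes are assumed values)."""
    dist, a1, a2, a3 = feature
    return (int(dist / dist_step), int(a1 / angle_step),
            int(a2 / angle_step), int(a3 / angle_step))
```

In voting-based schemes, the quantized tuple typically serves as a hash key under which model point pairs with similar geometry are stored and later retrieved for scene pairs.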

10.
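Object detection in both natural scenes and remote sensing scenes is highly challenging. Although many advanced algorithms have achieved excellent results on natural scenes, the complexity of remote sensing images, the diversity of object scales, and the dense distribution of objects have slowed progress in remote sensing object detection. This paper proposes a novel multi-class object detection model that automatically learns the weights used in feature fusion and highlights object features, so that small objects and densely distributed objects can be detected effectively in complex remote sensing images. Experiments on the public DOTA and NWPU VHR-10 datasets show that the model's detection performance exceeds that of most classical algorithms.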

11.
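To overcome the limitation that traditional image captioning models can describe only known objects, a new image captioning model is proposed that combines a few-shot object detector with a knowledge graph. The few-shot object detector detects objects that the captioning model cannot recognize and provides their names. The knowledge graph supplies background knowledge about the objects; combined with the object information, an attention mechanism guides the model to select appropriate words, generating descriptive sentences that include these objects. Experimental results show that the model's average F1 score is 6.6 percentage points higher than that of the baseline model and that the quality of the generated sentences improves by 2.0 percentage points on the SPICE metric, demonstrating that the approach is effective.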

12.
This article presents a system for texture-based probabilistic classification and localisation of three-dimensional objects in two-dimensional digital images and discusses selected applications. In contrast to shape-based approaches, our texture-based method does not rely on object features extracted using image segmentation techniques. Rather, the objects are described by local feature vectors computed directly from image pixel values using the wavelet transform. Both gray level and colour images can be processed. In the training phase, object features are statistically modelled as normal density functions. In the recognition phase, the system classifies and localises objects in scenes with real heterogeneous backgrounds. Feature vectors are calculated and a maximisation algorithm compares the learned density functions with the extracted feature vectors and yields the classes and poses of objects found in the scene. Experiments carried out on a real dataset of over 40,000 images demonstrate the robustness of the system in terms of classification and localisation accuracy. Finally, two important real application scenarios are discussed, namely recognising museum exhibits from visitors’ own photographs and classification of metallography images.
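
To make the training and recognition phases described here more concrete, the Python sketch below fits one normal density per class and picks the class that maximises the log-likelihood of the observed feature vectors. It is a simplification under stated assumptions: the covariance is taken to be diagonal, feature vectors are treated as independent, pose estimation and the wavelet feature extraction are omitted, and all function names are hypothetical.

```python
import math


def fit_diagonal_gaussian(samples):
    """Estimate per-dimension mean and variance from training feature vectors."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    # A small floor keeps the variance strictly positive.
    var = [max(sum((s[i] - mean[i]) ** 2 for s in samples) / n, 1e-6)
           for i in range(dim)]
    return mean, var


def log_density(x, model):
    """Log of the normal density with diagonal covariance, evaluated at x."""
    mean, var = model
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))


def classify(feature_vectors, models):
    """Return the class name whose density best explains the feature vectors."""
    scores = {name: sum(log_density(x, model) for x in feature_vectors)
              for name, model in models.items()}
    return max(scores, key=scores.get)
```

Working with log-densities rather than raw densities avoids numerical underflow when many feature vectors are combined.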

13.
14.
15.
16.
17.
Face detection from cluttered images is challenging due to the wide variability of face appearances and the complexity of image backgrounds. This paper proposes a classification-based method for locating frontal faces in cluttered images. To improve the detection performance, we extract gradient direction features from local window images as the input of the underlying two-class classifier. The gradient direction representation provides better discrimination ability than the image intensity, and we show that the combination of gradient directionality and intensity outperforms the gradient feature alone. The underlying classifier is a polynomial neural network (PNN) on a reduced feature subspace learned by principal component analysis (PCA). The incorporation of the residual of subspace projection into the PNN was shown to improve the classification performance. The classifier is trained on samples of face and non-face images to discriminate between the two classes. The superior detection performance of the proposed method is demonstrated in experiments on a large number of images.
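
The role of the subspace projection residual can be sketched in a few lines of Python. Given a mean vector and an orthonormal PCA basis (both assumed to have been learned beforehand), the code computes the projection coefficients and the squared distance from the subspace, then appends that residual to the coefficients as one extra classifier input. How the paper actually feeds the residual into the PNN is not stated in the abstract, so this concatenation and the function names are assumptions.

```python
def project_with_residual(x, mean, basis):
    """Project x onto an orthonormal basis and measure what the subspace misses.

    Returns the projection coefficients and the squared reconstruction error,
    computed as |x - mean|^2 minus the energy captured by the coefficients.
    """
    centered = [xi - mi for xi, mi in zip(x, mean)]
    coeffs = [sum(c * b for c, b in zip(centered, vec)) for vec in basis]
    energy = sum(c * c for c in centered)
    # Clamp at zero in case rounding makes the difference slightly negative.
    residual = max(energy - sum(c * c for c in coeffs), 0.0)
    return coeffs, residual


def classifier_input(x, mean, basis):
    """Reduced feature vector with the residual appended as a final component."""
    coeffs, residual = project_with_residual(x, mean, basis)
    return coeffs + [residual]
```

The residual tells the classifier how poorly a window is represented by the learned subspace, information that the coefficients alone discard.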

18.
To automatically mine the underlying relationships between famous people in daily news, for example to build a network of people in the news with faces as icons to facilitate face-based person finding, we need a tool that automatically labels faces in new images with their real names. This paper studies the problem of linking names with faces in large-scale collections of captioned news images. In our previous work, we proposed a method called Person-based Subset Clustering, which is mainly based on clustering all face images derived from the same name. The location where a name appears in a caption, as well as the visual structural information within a news image, provides informative cues, such as who actually appears in the associated image. By combining the domain knowledge from the captions and the corresponding images, we propose a novel cross-modality approach to further improve the performance of linking names with faces. The experiments are performed on data sets comprising approximately half a million news images from Yahoo! news, and the results show that the proposed method achieves significant improvement over the clustering-only methods.

19.
The explosion of the Internet provides us with a tremendous resource of images shared online. It also confronts vision researchers with the problem of finding effective methods to navigate the vast amount of visual information. Semantic image understanding plays a vital role in solving this problem. One important task in image understanding is object recognition, in particular, generic object categorization. Critical to this problem are the issues of learning and dataset. Abundant data helps to train a robust recognition system, while a good object classifier can help to collect a large number of images. This paper presents a novel object recognition algorithm that performs automatic dataset collection and incremental model learning simultaneously. The goal of this work is to use the tremendous resources of the web to learn robust object category models for detecting and searching for objects in real-world cluttered scenes. Humans continuously update their knowledge of objects when new examples are observed. Our framework emulates this human learning process by iteratively accumulating model knowledge and image examples. We adapt a non-parametric latent topic model and propose an incremental learning framework. Our algorithm is capable of automatically collecting much larger object category datasets for 22 randomly selected classes from the Caltech 101 dataset. Furthermore, our system offers not only more images in each object category but also a robust object category model and meaningful image annotation. Our experiments show that OPTIMOL is capable of collecting image datasets that are superior to the well-known manually collected object datasets Caltech 101 and LabelMe.

20.
This correspondence presents a matching algorithm for obtaining feature point correspondences across images containing rigid objects undergoing different motions. First, point features are detected using newly developed feature detectors. Then a variety of constraints are applied, starting with the simplest and proceeding to more informed ones. First, an intensity-based matching algorithm is applied to the feature points to obtain unique point correspondences. This is followed by the application of a sequence of newly developed heuristic tests involving geometry, rigidity, and disparity. The geometric tests match two-dimensional geometrical relationships among the feature points, the rigidity test enforces the three-dimensional rigidity of the object, and the disparity test ensures that no matched feature point in an image could be rematched with another feature if reassigned another disparity value associated with another matched pair or an assumed match on the epipolar line. The computational complexity is proportional to the numbers of detected feature points in the two images. Experimental results with indoor and outdoor images are presented, which show that the algorithm yields only correct matches for scenes containing rigid objects.

Copyright©北京勤云科技发展有限公司  京ICP备09084417号