Similar Documents
20 similar documents found.
1.
Objective: Current text-to-image generation models perform well only on image datasets containing a single object; when an image involves multiple objects and relationships, the generated image becomes cluttered. An existing remedy is to convert the text description into a scene-graph structure that better represents the relationships within the scene and then generate the image from the scene graph, but the images produced by existing scene-graph-to-image models are not sharp enough and lack object detail. To address this, we propose a scene-graph-to-image generation model based on a graph attention network that produces higher-quality images. Method: The model consists of a graph attention network that extracts scene-graph features, an object layout network that synthesizes the scene layout, a cascaded refinement network that converts the scene layout into the generated image, and a discriminator network that improves the quality of the generated image. The graph attention network passes its more expressive output object feature vectors to an improved object layout network, which synthesizes a scene layout closer to the ground-truth labels. In addition, we compute the image loss by feature matching, making the generated image semantically more similar to the real image. Results: Trained to generate 64×64-pixel images on the multi-object COCO-Stuff dataset, the model generates complex scene images containing multiple objects and relationships, with an Inception Score of about 7.8, an improvement of 0.5 over the previous scene-graph-to-image model. Conclusion: The proposed graph-attention-based scene-graph-to-image model not only generates complex scene images with multiple objects and relationships, but also produces higher-quality images with clearer details.
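As an illustration of the graph attention component, below is a minimal single-head graph attention layer in PyTorch in the spirit of GAT; the class name, dimensions, and the fully connected toy adjacency are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        # attention score from concatenated (target, neighbour) features
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (N, in_dim) node features, one per scene-graph node
        # adj: (N, N) adjacency of the scene graph (self-loops included)
        h = self.proj(x)                                    # (N, out_dim)
        n = h.size(0)
        hi = h.unsqueeze(1).expand(n, n, -1)                # target copies
        hj = h.unsqueeze(0).expand(n, n, -1)                # neighbour copies
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        e = e.masked_fill(adj == 0, float("-inf"))          # keep real edges only
        alpha = F.softmax(e, dim=-1)                        # attention weights
        return F.elu(alpha @ h)                             # aggregated node features

# toy usage: 4 scene-graph nodes, fully connected with self-loops
layer = GraphAttentionLayer(in_dim=16, out_dim=32)
out = layer(torch.randn(4, 16), torch.ones(4, 4))           # (4, 32)
```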

2.
Computing the visibility of outdoor scenes is often much harder than of indoor scenes. A typical urban scene, for example, is densely occluded, and it is effective to precompute its visibility space, since from a given point only a small fraction of the scene is visible. The difficulty is that although the majority of objects are hidden, some parts might be visible at a distance in an arbitrary location, and it is not clear how to detect them quickly. In this paper we present a method to partition the viewspace into cells containing a conservative superset of the visible objects. For a given cell the method tests the visibility of all the objects in the scene. For each object it searches for a strong occluder which guarantees that the object is not visible from any point within the cell. We show analytically that in a densely occluded scene, the vast majority of objects are strongly occluded, and the overhead of using conservative visibility (rather than exact visibility) is small. These results are further supported by our experimental results. We also analyze the cost of the method and discuss its effectiveness.
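A toy 2D illustration of checking whether a single candidate occluder hides an object from a view cell, simplified here to testing sight lines from the cell corners to the object corners against one occluder segment; the paper's actual strong-occluder criterion is conservative over the entire cell, which this corner sampling is not.

```python
import numpy as np

def segments_intersect(p1, p2, q1, q2) -> bool:
    """True if segment p1-p2 properly intersects segment q1-q2."""
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    d1, d2 = cross(q1, q2, p1), cross(q1, q2, p2)
    d3, d4 = cross(p1, p2, q1), cross(p1, p2, q2)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def blocked_from_cell(cell_corners, object_corners, occluder) -> bool:
    """Occluder blocks every sampled sight line from the cell to the object."""
    o1, o2 = occluder
    return all(segments_intersect(c, t, o1, o2)
               for c in cell_corners for t in object_corners)

cell = [(0, 0), (1, 0), (1, 1), (0, 1)]          # view-cell corners
obj = [(5, 0.2), (5, 0.8)]                       # far object endpoints
wall = ((3, -2), (3, 3))                         # candidate occluder segment
print(blocked_from_cell(cell, obj, wall))        # True: object hidden from the cell corners
```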

3.
Recognizing scene information in images or videos, such as locating the objects and answering "Where am I?", has attracted much attention in the computer vision research field. Many existing scene recognition methods focus on static images and cannot achieve satisfactory results on videos, which contain more complex scene features than images. In this paper, we propose a robust movie scene recognition approach based on panoramic frames and representative feature patches. More specifically, the movie is first efficiently segmented into video shots and scenes. Secondly, we introduce a novel key-frame extraction method using panoramic frames, and a local feature extraction process is applied to obtain the representative feature patches (RFPs) in each video shot. Thirdly, a latent Dirichlet allocation (LDA) based recognition model is trained to recognize the scene within each individual video scene clip. The correlations between video clips are considered to enhance the recognition performance. When the proposed approach is applied to recognize scenes in realistic movies, the experimental results show that it achieves satisfactory performance.
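A minimal sketch of the LDA-based recognition stage under placeholder data: each shot is represented as a bag-of-visual-words histogram over its representative feature patches, scikit-learn's LDA maps it to topic proportions, and a linear classifier predicts the scene label. Vocabulary size, topic count, and the classifier choice are assumptions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_shots, vocab_size, n_topics = 200, 500, 20
# placeholder bag-of-visual-words counts and scene labels (5 categories)
counts = rng.poisson(1.0, size=(n_shots, vocab_size))
labels = rng.integers(0, 5, size=n_shots)

lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
topic_props = lda.fit_transform(counts)            # (n_shots, n_topics)

clf = LogisticRegression(max_iter=1000).fit(topic_props, labels)
print(clf.predict(topic_props[:3]))                # predicted scene labels
```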

4.
This paper presents a computational model to recover the most likely interpretation of the 3D scene structure from a planar image, where some objects may occlude others. The estimated scene interpretation is obtained by integrating global and local cues and provides both the complete disoccluded objects that form the scene and their ordering according to depth. Our method first computes several distal scenes which are compatible with the proximal planar image. To compute these different hypothesized scenes, we propose a perceptually inspired object disocclusion method, which works by minimizing Euler's elastica as well as by incorporating the relatability of partially occluded contours and the convexity of the disoccluded objects. Then, to estimate the preferred scene, we rely on a Bayesian model and define probabilities taking into account the global complexity of the objects in the hypothesized scenes as well as the effort of bringing these objects into their relative positions in the planar image, which is also measured by an Euler's elastica-based quantity. The model is illustrated with numerical experiments on both synthetic and real images, showing the ability of our model to reconstruct the occluded objects and the preferred perceptual order among them. We also present results on images of the Berkeley dataset with provided figure-ground ground-truth labeling.
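A small numerical sketch of the discrete Euler's elastica energy E = Σ (a + b·κ²)·Δs over a polyline, the quantity minimized when completing occluded contours; the discretization below is one common choice, not necessarily the paper's.

```python
import numpy as np

def elastica_energy(points: np.ndarray, a: float = 1.0, b: float = 1.0) -> float:
    """points: (N, 2) polyline vertices approximating a completed contour."""
    seg = np.diff(points, axis=0)                  # (N-1, 2) edge vectors
    ds = np.linalg.norm(seg, axis=1)               # edge lengths
    t = seg / ds[:, None]                          # unit tangents
    # turning angle between consecutive tangents approximates curvature * ds
    cosang = np.clip((t[:-1] * t[1:]).sum(axis=1), -1.0, 1.0)
    theta = np.arccos(cosang)
    mean_ds = 0.5 * (ds[:-1] + ds[1:])
    kappa = theta / mean_ds                        # discrete curvature
    return float(np.sum(a * mean_ds + b * kappa**2 * mean_ds))

# a straight completion has lower elastica energy than a wiggly one
x = np.linspace(0, 1, 50)
straight = np.stack([x, np.zeros_like(x)], axis=1)
wiggly = np.stack([x, 0.1 * np.sin(8 * np.pi * x)], axis=1)
print(elastica_energy(straight) < elastica_energy(wiggly))   # True
```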

5.
6.
As artificial intelligence enters the cognitive era, computers' ability to perceive and reason about the real physical world urgently needs improvement. Existing work on physical object attributes and motion prediction is mostly limited to simple objects and scenes, so we attempt to extend commonsense reasoning to predicting object scene flow in simulated scenes. First, to compensate for the shortage of datasets in this area, we present ModernCity, a simulation-based dataset that reproduces modern urban street scenes from a commonsense-reasoning perspective and provides multiple labels including RGB images, depth maps, scene flow data, and semantic segmentation maps. In addition, we design an object descriptor decoding model (ODD) that uses object attributes to assist scene-flow prediction. Ablation experiments show that the model can accurately predict objects' future motion trends from their attributes in simulated scenes, and comparisons with other state-of-the-art models verify the model's performance and the reliability of the ModernCity dataset.
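A conceptual stand-in (not the paper's ODD architecture) showing how an object's attribute descriptor can condition a per-object motion prediction; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AttributeConditionedFlow(nn.Module):
    """Toy model: predict a 3D displacement from object attributes and position."""
    def __init__(self, attr_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),              # predicted per-object displacement
        )

    def forward(self, attrs: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # attrs: (B, attr_dim) object attributes (e.g. size, mobility class)
        # positions: (B, 3) current object centroids
        return self.net(torch.cat([attrs, positions], dim=-1))

model = AttributeConditionedFlow()
flow = model(torch.randn(8, 16), torch.randn(8, 3))   # (8, 3) predicted motion
```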

7.
Trivedi, M.M.; Chen, C.; Marapane, S.B. Computer, 1989, 22(6): 91-97
A model-based approach has been proposed to make object recognition computationally tractable. In this approach, models associated with objects expected to appear in the scene are recorded in the system's knowledge base. The system extracts various features from the input images using robust, low-level, general-purpose operators. Finally, matching is performed between the image-derived features and the scene domain models to recognize objects. Factors affecting the successful design and implementation of model-based vision systems include the ability to derive suitable object models, the nature of image features extracted by the operators, a computationally effective matching approach, knowledge representation schemes, and effective control mechanisms for guiding the system's overall operation. The vision system described uses gray-scale images and can successfully handle complex scenes with multiple object types.
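A minimal sketch of the matching stage of such a model-based system: image-derived feature vectors are matched to stored object-model features by nearest neighbour with a ratio test, and votes select the recognized object. The feature extraction operators themselves are outside this sketch, and all data is placeholder.

```python
import numpy as np

def match_to_models(image_feats, model_db, ratio=0.8):
    """image_feats: (Q, D); model_db: dict name -> (M, D). Returns vote counts."""
    votes = {name: 0 for name in model_db}
    for f in image_feats:
        best_name, best, second = None, np.inf, np.inf
        for name, feats in model_db.items():
            dmin = np.linalg.norm(feats - f, axis=1).min()
            if dmin < best:
                best_name, second, best = name, best, dmin
            elif dmin < second:
                second = dmin
        if best < ratio * second:            # keep unambiguous matches only
            votes[best_name] += 1
    return votes

rng = np.random.default_rng(1)
db = {"cup": rng.normal(0, 1, (50, 8)), "box": rng.normal(5, 1, (50, 8))}
query = rng.normal(5, 1, (20, 8))            # features resembling the "box" model
votes = match_to_models(query, db)
print(max(votes, key=votes.get))             # expected: "box"
```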

8.
9.
Scenes are closely related to the kinds of objects that may appear in them. Objects are widely used as features for scene categorization. On the other hand, landscapes with more spatial structures of scenes are representative of scene categories. In this paper, we propose a deep learning based algorithm for scene categorization. Specifically, we design two-pathway convolutional neural networks for exploiting both object attributes and spatial structures of scene images. Different from conventional deep learning methods, which usually focus on only one aspect of images, each pathway of the proposed architecture is tuned to capture a different aspect of images. As a result, complementary information of image contents can be utilized effectively. In addition, to deal with the feature redundancy problem caused by combining features from different sources, we adopt the ℓ2,1 norm during classifier training to control the selectivity of each type of features. Extensive experiments are conducted to evaluate the proposed method. The obtained results demonstrate that the proposed approach achieves superior performance over conventional methods. Moreover, the proposed method is a general framework, which can be easily extended to more pathways and applied to solve other problems.
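A short sketch of the ℓ2,1 regularizer used during classifier training: each row of the weight matrix (one row per feature dimension of the concatenated two-pathway features) is penalized by its ℓ2 norm, encouraging whole features to be switched off. The toy data, loss, and weights are assumptions.

```python
import torch

def l21_norm(W: torch.Tensor) -> torch.Tensor:
    """W: (n_features, n_classes); sum of per-row l2 norms."""
    return W.norm(p=2, dim=1).sum()

torch.manual_seed(0)
X, Y = torch.randn(64, 128), torch.randn(64, 10)       # toy features / targets
W = torch.nn.Parameter(0.01 * torch.randn(128, 10))
opt = torch.optim.SGD([W], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = ((X @ W - Y) ** 2).mean() + 0.05 * l21_norm(W)   # data term + group sparsity
    loss.backward()
    opt.step()

# rows whose l2 norm shrank correspond to features the regularizer suppressed
print((W.detach().norm(dim=1) < 0.05).sum().item(), "of 128 feature rows near zero")
```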

10.
Even though visual attention models using bottom-up saliency can speed up object recognition by predicting object locations, in the presence of multiple salient objects, saliency alone cannot discern target objects from the clutter in a scene. Using a metric named familiarity, we propose a top-down method for guiding attention towards target objects, in addition to bottom-up saliency. To demonstrate the effectiveness of familiarity, the unified visual attention model (UVAM) which combines top-down familiarity and bottom-up saliency is applied to SIFT based object recognition. The UVAM is tested on 3600 artificially generated images containing COIL-100 objects with varying amounts of clutter, and on 126 images of real scenes. The recognition times are reduced by 2.7× and 2×, respectively, with no reduction in recognition accuracy, demonstrating the effectiveness and robustness of the familiarity based UVAM.
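A toy sketch of fusing a bottom-up saliency map with a top-down familiarity map into a single attention priority map and visiting locations in decreasing priority; the multiplicative fusion rule and the random maps are assumptions, not the UVAM's exact combination.

```python
import numpy as np

rng = np.random.default_rng(0)
saliency = rng.random((60, 80))          # bottom-up conspicuity map
familiarity = rng.random((60, 80))       # top-down resemblance to the target

priority = saliency * familiarity        # simple multiplicative fusion (assumption)
flat = np.argsort(priority, axis=None)[::-1]
order = np.column_stack(np.unravel_index(flat, priority.shape))
print(order[:5])                         # first five attended (row, col) positions
```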

11.
This paper presents a symbolic formalism for modeling and retrieving video data via the moving objects contained in the video images. The model integrates the representations of individual moving objects in a scene with the time-varying relationships between them by incorporating both the notions of object tracks and temporal sequences of PIRs (projection interval relationships). The model is supported by a set of operations which form the basis of a moving object algebra. This algebra allows one to retrieve scenes and information from scenes by specifying both spatial and temporal properties of the objects involved. It also provides operations to create new scenes from existing ones. A prototype implementation is described which allows queries to be specified either via an animation sketch or using the moving object algebra.
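A small sketch of classifying the projection interval relationship (PIR) between two objects' 1D projections on one axis at a given frame, using Allen-style relation names; a scene is then described by the temporal sequence of such relations per axis. The relation names are illustrative.

```python
def interval_relation(a, b):
    """a, b: (start, end) projection intervals on one axis."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if a1 < b1 < a2 < b2:
        return "overlaps"
    if a1 == b1 and a2 == b2:
        return "equals"
    if b1 <= a1 and a2 <= b2:
        return "during"
    if a1 <= b1 and b2 <= a2:
        return "contains"
    # otherwise the relation holds in the opposite direction
    return interval_relation(b, a) + "-inverse"

# x-axis projections of two moving objects in one frame
print(interval_relation((0.0, 2.0), (1.5, 4.0)))   # overlaps
```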

12.
There has been a growing interest in exploiting contextual information in addition to local features to detect and localize multiple object categories in an image. A context model can rule out some unlikely combinations or locations of objects and guide detectors to produce a semantically coherent interpretation of a scene. However, the performance benefit of context models has been limited because most of the previous methods were tested on data sets with only a few object categories, in which most images contain one or two object categories. In this paper, we introduce a new data set with images that contain many instances of different object categories, and propose an efficient model that captures the contextual information among more than a hundred object categories using a tree structure. Our model incorporates global image features, dependencies between object categories, and outputs of local detectors into one probabilistic framework. We demonstrate that our context model improves object recognition performance and provides a coherent interpretation of a scene, which enables a reliable image querying system by multiple object categories. In addition, our model can be applied to scene understanding tasks that local detectors alone cannot solve, such as detecting objects out of context or querying for the most typical and the least typical scenes in a data set.
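A sketch of learning a tree-structured dependency over object-category presences in the Chow-Liu spirit: estimate pairwise mutual information from binary co-occurrence data and keep a maximum-weight spanning tree. The data is a random placeholder, and this is not the paper's full probabilistic model.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def pairwise_mi(X):
    """X: (n_images, n_categories) binary presence matrix."""
    k = X.shape[1]
    mi = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            m = 0.0
            for a in (0, 1):
                for b in (0, 1):
                    p_ab = np.mean((X[:, i] == a) & (X[:, j] == b))
                    p_a, p_b = np.mean(X[:, i] == a), np.mean(X[:, j] == b)
                    if p_ab > 0:
                        m += p_ab * np.log(p_ab / (p_a * p_b))
            mi[i, j] = mi[j, i] = m
    return mi

rng = np.random.default_rng(0)
X = (rng.random((500, 8)) < 0.3).astype(int)     # toy presence data, 8 categories
tree = minimum_spanning_tree(-pairwise_mi(X))    # max-MI tree via negated weights
print(np.transpose(np.nonzero(tree.toarray())))  # learned tree edges (i, j)
```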

13.
Dynamic scene occlusion culling
Large, complex 3D scenes are best rendered in an output-sensitive way, i.e., in time largely independent of the entire scene model's complexity. Occlusion culling is one of the key techniques for output-sensitive rendering. We generalize existing occlusion culling algorithms, intended for static scenes, to handle dynamic scenes having numerous moving objects. The data structure used by an occlusion culling method is updated to reflect the objects' possible positions. To avoid updating the structure for every dynamic object at each frame, a temporal bounding volume (TBV) is created for each occluded dynamic object, using some known constraints on the object's motion. The TBV is inserted into the structure instead of the object. Subsequently, the object is ignored as long as the TBV is occluded and guaranteed to contain the object. The generalized algorithms' rendering time is linearly affected only by the scene's visible parts, not by hidden parts or by occluded dynamic objects. Our techniques also save communication in distributed graphics systems, e.g., multiuser virtual environments, by eliminating update messages for hidden dynamic objects. We demonstrate the adaptation of two occlusion culling algorithms to dynamic scenes: hierarchical Z-buffering and BSP tree projection.
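A minimal sketch of a temporal bounding volume for an occluded dynamic object: with a known bound on the object's speed, a sphere grown around its last known position is guaranteed to contain it, so the object can be ignored while that sphere stays occluded. The sphere-based TBV and the field names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class TemporalBoundingSphere:
    center: tuple          # last known object position (x, y, z)
    radius0: float         # radius of the object's own bounding sphere
    max_speed: float       # known bound on the object's speed
    t0: float              # time at which the TBV was created

    def radius_at(self, t: float) -> float:
        """Conservative radius guaranteed to contain the object at time t."""
        return self.radius0 + self.max_speed * (t - self.t0)

tbv = TemporalBoundingSphere(center=(10.0, 0.0, 5.0),
                             radius0=0.5, max_speed=2.0, t0=0.0)
print(tbv.radius_at(1.5))   # 3.5: the sphere to test for occlusion at t = 1.5
```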

14.
Given a large-scale collection of images our aim is to efficiently associate images which contain the same entity, for example a building or object, and to discover the significant entities. To achieve this, we introduce the Geometric Latent Dirichlet Allocation (gLDA) model for unsupervised discovery of particular objects in unordered image collections. This explicitly represents images as mixtures of particular objects or facades, and builds rich latent topic models which incorporate the identity and locations of visual words specific to the topic in a geometrically consistent way. Applying standard inference techniques to this model enables images likely to contain the same object to be probabilistically grouped and ranked.

15.
We present a data-driven method for synthesizing 3D indoor scenes by inserting objects progressively into an initial, possibly empty, scene. Instead of relying on a few hundred hand-crafted 3D scenes, we take advantage of existing large-scale annotated RGB-D datasets, in particular the SUN RGB-D database consisting of 10,000+ depth images of real scenes, to form the prior knowledge for our synthesis task. Our object insertion scheme follows a co-occurrence model and an arrangement model, both learned from the SUN dataset. The former elects a highly probable combination of object categories along with the number of instances per category, while a plausible placement is defined by the latter model. Compared to previous works on probabilistic learning for object placement, we make two contributions. First, we learn various classes of higher-order object-object relations including symmetry, distinct orientation, and proximity from the database. These relations effectively enable considering objects in semantically formed groups rather than individually. Second, while our algorithm inserts objects one at a time, it attains holistic plausibility of the whole current scene while offering controllability through progressive synthesis. We conducted several user studies to compare our scene synthesis performance to results obtained by manual synthesis, state-of-the-art object placement schemes, and variations of parameter settings for the arrangement model.
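A toy sketch of the co-occurrence step: from annotated scenes one can estimate how often each category appears together with categories already placed, and sample the next category to insert proportionally; the categories, counts, and sampling rule below are placeholders, and the arrangement model is not shown.

```python
import numpy as np

categories = ["bed", "nightstand", "lamp", "desk", "chair"]
# cooc[i, j]: how often category j appears in scenes containing category i (toy counts)
cooc = np.array([[0, 40, 30, 10, 8],
                 [40, 0, 25, 5, 4],
                 [30, 25, 0, 12, 10],
                 [10, 5, 12, 0, 35],
                 [8, 4, 10, 35, 0]], dtype=float)

def sample_next(existing_idx, rng):
    """Sample the next category conditioned on objects already placed."""
    scores = cooc[existing_idx].sum(axis=0)
    scores[existing_idx] = 0.0               # avoid re-electing placed categories
    p = scores / scores.sum()
    return rng.choice(len(categories), p=p)

rng = np.random.default_rng(0)
scene = [0]                                  # start from a scene containing a bed
scene.append(sample_next(scene, rng))
print([categories[i] for i in scene])        # e.g. ['bed', 'nightstand']
```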

16.
17.
In this work, we introduce the 'mobility-tree' construct for high-level functional representation of complex 3D indoor scenes. In recent years, digital indoor scenes have become increasingly popular, consisting of detailed geometry and complex functionalities. These scenes often consist of objects that reoccur in various poses and interrelate with each other. In this work we analyse the reoccurrence of objects in the scene and automatically detect their functional mobilities. 'Mobility' analysis denotes the motion capabilities (i.e. degrees of freedom) of an object and its subparts, which typically relate to their indoor functionalities. We compute an object's mobility by analysing its spatial arrangement, repetitions and relations with other objects and store it in a 'mobility-tree'. Repetitive motions in the scenes are grouped in 'mobility-groups', for which we develop a set of sophisticated controllers facilitating semantic high-level editing operations. We show applications of our mobility analysis to interactive scene manipulation and reorganization, and present results for a variety of indoor scenes.
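A minimal data-structure sketch of a mobility tree: each node stores an object or subpart together with its detected motion capability, so editing operations can be dispatched per mobility group. The node fields and mobility labels are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MobilityNode:
    name: str
    mobility: str                      # e.g. "static", "translate", "rotate-hinge"
    children: List["MobilityNode"] = field(default_factory=list)

    def nodes_with(self, mobility: str):
        """Collect this node and all descendants sharing a mobility type."""
        found = [self] if self.mobility == mobility else []
        for c in self.children:
            found += c.nodes_with(mobility)
        return found

cabinet = MobilityNode("cabinet", "static", [
    MobilityNode("drawer_1", "translate"),
    MobilityNode("drawer_2", "translate"),
    MobilityNode("door", "rotate-hinge"),
])
print([n.name for n in cabinet.nodes_with("translate")])   # the drawer mobility group
```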

18.
Since indoor scenes are frequently changed in daily life, such as re-layout of furniture, their 3D reconstructions should be flexible and easy to update. We present an automatic 3D scene update algorithm for indoor scenes that captures scene variation with RGB-D cameras. We assume an initial scene has been reconstructed in advance, manually or in some other semi-automatic way, before the change, and automatically update the reconstruction according to newly captured RGB-D images of the real scene after the update. The method starts with an automatic segmentation process without manual interaction, which benefits from accurate training labels provided by the initial 3D scene. After segmentation, objects captured by the RGB-D camera are extracted to form a local updated scene. We formulate an optimization problem that compares this local scene to the initial scene to locate moved objects. The moved objects are then integrated with the static objects in the initial scene to generate a new 3D scene. We demonstrate the efficiency and robustness of our approach by updating the 3D reconstructions of several real-world scenes.
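A toy sketch of the comparison step: segmented objects from the new RGB-D capture are matched by label to the initial scene, and an object is flagged as moved when its centroid shifts beyond a threshold; real matching and pose estimation are considerably more involved, and the poses below are placeholders.

```python
import numpy as np

initial = {"chair": np.array([1.0, 0.0, 2.0]),
           "table": np.array([0.0, 0.0, 0.0]),
           "lamp":  np.array([2.0, 0.0, 1.0])}
captured = {"chair": np.array([1.05, 0.0, 2.02]),   # essentially unchanged
            "table": np.array([0.9, 0.0, 0.4]),     # re-arranged
            "lamp":  np.array([2.0, 0.0, 1.0])}

# flag objects whose centroid moved more than 20 cm as "moved"
moved = [name for name, p in captured.items()
         if np.linalg.norm(p - initial[name]) > 0.2]
print(moved)   # ['table'] -> only this object is re-integrated at a new pose
```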

19.
Novel view synthesis aims to generate an image of a scene from a new viewpoint given several reference images. In multi-object scenes, however, objects occlude one another and object information is incomplete, so the synthesized novel-view images suffer from artifacts and misalignment. To address this, we propose a novel-view image generation network guided by scene layout maps, and annotate a new multi-object scene dataset (multi-objects novel view Synthesis, MONVS). First, several layout maps of the scene and the corresponding camera poses are fed into a layout prediction module that computes the scene layout under the new viewpoint. Then, the annotated object bounding boxes are used to build per-object sets, and a pixel prediction module generates each object's appearance in the new view. Finally, the predicted novel-view layout and the per-object information are fed into a scene generator that constructs the image under the new viewpoint. Comparisons with several recent methods on the MONVS and ShapeNet cars datasets show, through quantitative results and visualizations, that the proposed method performs well on both datasets and effectively alleviates the artifacts and inaccurate object placement in the generated images.
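A small sketch of the layout-prediction idea: given 3D object boxes in world coordinates and a camera pose for the new viewpoint, projecting the box corners with a pinhole camera yields the 2D layout under that viewpoint. The intrinsics and the box are illustrative assumptions, not the paper's learned layout prediction module.

```python
import numpy as np

K = np.array([[300.0, 0.0, 128.0],       # toy intrinsics for a 256x256 image
              [0.0, 300.0, 128.0],
              [0.0, 0.0, 1.0]])

def project_box(corners_world, R, t):
    """corners_world: (8, 3); R, t: world-to-camera rotation and translation."""
    cam = corners_world @ R.T + t               # (8, 3) points in the camera frame
    uv = cam @ K.T                              # apply intrinsics
    uv = uv[:, :2] / uv[:, 2:3]                 # perspective divide
    return uv.min(axis=0), uv.max(axis=0)       # 2D layout box (min, max corners)

# axis-aligned unit cube centred 4 m in front of the new camera
corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                               for y in (-0.5, 0.5)
                               for z in (3.5, 4.5)])
R, t = np.eye(3), np.zeros(3)
print(project_box(corners, R, t))               # 2D extent of the box in the new view
```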

20.
We address the problem of automatically learning the recurring associations between the visual structures in images and the words in their associated captions, yielding a set of named object models that can be used for subsequent image annotation. In previous work, we used language to drive the perceptual grouping of local features into configurations that capture small parts (patches) of an object. However, model scope was poor, leading to poor object localization during detection (annotation), and ambiguity was high when part detections were weak. We extend and significantly revise our previous framework by using language to drive the perceptual grouping of parts, each a configuration in the previous framework, into hierarchical configurations that offer greater spatial extent and flexibility. The resulting hierarchical multipart models remain scale, translation and rotation invariant, but are more reliable detectors and provide better localization. Moreover, unlike typical frameworks for learning object models, our approach requires no bounding boxes around the objects to be learned, can handle heavily cluttered training scenes, and is robust in the face of noisy captions, i.e., where objects in an image may not be named in the caption, and objects named in the caption may not appear in the image. We demonstrate improved precision and recall in annotation over the non-hierarchical technique and also show extended spatial coverage of detected objects.
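A toy sketch of the word-appearance association idea: count how often a caption word co-occurs with each candidate visual cluster across images and keep the pairings with high conditional probability; the data is a placeholder and the paper's hierarchical part grouping is not shown.

```python
from collections import defaultdict

# per image: (caption words, ids of detected visual clusters)
images = [({"horse", "field"}, {3, 7}),
          ({"horse", "rider"}, {3, 9}),
          ({"car", "street"}, {5, 7}),
          ({"horse"}, {3})]

pair_count, word_count = defaultdict(int), defaultdict(int)
for words, clusters in images:
    for w in words:
        word_count[w] += 1
        for c in clusters:
            pair_count[(w, c)] += 1

# P(cluster | word) for words seen in at least two captions
assoc = {(w, c): n / word_count[w]
         for (w, c), n in pair_count.items() if word_count[w] >= 2}
print(max(assoc, key=assoc.get))   # ('horse', 3): cluster 3 is named "horse"
```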
