Similar Documents
20 similar documents retrieved (search time: 31 ms)
1.
2.
The goal of object categorization is to locate and identify instances of an object category within an image. Recognizing an object in an image is difficult when images include occlusion, poor quality, noise, or background clutter, and the task becomes even more challenging when many objects are present in the same scene. Several models for object categorization use appearance and context information from objects to improve recognition accuracy. Appearance information, based on visual cues, can identify object classes successfully only up to a point. Context information, based on the interaction among objects in the scene or on global scene statistics, can help disambiguate appearance inputs in recognition tasks. In this work we address the problem of incorporating different types of contextual information for robust object categorization in computer vision. We review different ways of using contextual information in the field of object categorization, considering the most common levels at which context is extracted and the different levels of contextual interaction. We also examine common machine learning models that integrate context information into object recognition frameworks and discuss scalability, optimizations, and possible future approaches.

3.
Artificial Intelligence 2007, 171(8-9): 568-585
Head pose and gesture offer several conversational grounding cues and are used extensively in face-to-face interaction among people. To recognize visual feedback accurately, humans often use contextual knowledge from previous and current events to anticipate when feedback is most likely to occur. In this paper we describe how contextual information can be used to predict visual feedback and improve recognition of head gestures in human–computer interfaces. Lexical, prosodic, timing, and gesture features can be used to predict a user's visual feedback during conversational dialog with a robotic or virtual agent. In non-conversational interfaces, context features based on user–interface system events can improve detection of head gestures for dialog box confirmation or document browsing. Our user study with prototype gesture-based components indicates quantitative and qualitative benefits of gesture-based confirmation over conventional alternatives. Using a discriminative approach to contextual prediction and multi-modal integration, head gesture detection was improved with context features even when the topic of the test set differed significantly from the training set.

4.
5.
The automatic recognition of a user's communicative style within a spoken dialog system framework, including its affective aspects, has received increased attention in the past few years. For dialog systems it is important to know not only what was said but also how it was communicated, so that the system can engage the user in a richer and more natural interaction. This paper addresses the problem of automatically detecting "frustration", "politeness", and "neutral" attitudes from a child's speech communication cues, elicited in spontaneous dialog interactions with computer characters. Several information sources, such as acoustic, lexical, and contextual features, as well as their combinations, are used for this purpose. The study is based on a Wizard-of-Oz dialog corpus of 103 children, 7–14 years of age, playing a voice-activated computer game. Three-way classification experiments, as well as pairwise classifications of polite vs. others and frustrated vs. others, were performed. Experimental results show that lexical information has more discriminative power than acoustic and contextual cues for detecting politeness, whereas context and acoustic features perform best for frustration detection. Furthermore, the fusion of acoustic, lexical, and contextual information provided significantly better classification results. Results also showed that classification performance varies with age and gender: for the politeness detection task, higher classification accuracy was achieved for females and for 10–11 year-olds, compared with males and with the other age groups, respectively.
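To make the fusion step concrete, here is a minimal late-fusion sketch in the spirit of the abstract: one classifier per modality, with class posteriors averaged. The synthetic features, dimensions, and uniform fusion weights are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
X_acoustic = rng.normal(size=(n, 12))   # e.g. pitch/energy statistics
X_lexical = rng.normal(size=(n, 50))    # e.g. word-level indicator features
X_context = rng.normal(size=(n, 4))     # e.g. dialog-state features
y = rng.integers(0, 3, size=n)          # 0=neutral, 1=polite, 2=frustrated

# One classifier per modality; fuse by averaging class posteriors.
modalities = (X_acoustic, X_lexical, X_context)
models = [LogisticRegression(max_iter=1000).fit(X, y) for X in modalities]
posteriors = [m.predict_proba(X) for m, X in zip(models, modalities)]
fused = np.mean(posteriors, axis=0)     # uniform fusion weights
predictions = fused.argmax(axis=1)
print(predictions[:10])
```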

6.
Semantic image segmentation aims to partition an image into non-overlapping regions and assign a pre-defined object class label to each region. In this paper, a semantic method combining low-level features and high-level contextual cues is proposed to segment natural scene images. The proposed method first takes the gist representation of an image as its global feature. The image is then over-segmented into many super-pixels, and histogram representations of these super-pixels are used as local features. In addition, co-occurrence and spatial layout relations among object classes are exploited as contextual cues. Finally, the features and cues are integrated into a conditional random field (CRF) inference framework by defining specific potential terms and introducing weighting functions. The proposed method has been compared with state-of-the-art methods on the MSRC database, and the experimental results show its effectiveness.
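In a CRF of this kind, the labeling y for an image x is typically found by minimizing an energy that combines local appearance, the global gist feature, and pairwise contextual terms. The form below is a generic sketch with assumed weighting factors λ, not the paper's exact potentials:

```latex
E(\mathbf{y}\mid\mathbf{x}) = \sum_i \psi_u(y_i;\,\mathbf{x})
 + \lambda_g \sum_i \psi_g\big(y_i;\,\mathrm{gist}(\mathbf{x})\big)
 + \lambda_p \sum_{(i,j)\in\mathcal{E}} \psi_p(y_i, y_j),
 \qquad \mathbf{y}^* = \arg\min_{\mathbf{y}} E(\mathbf{y}\mid\mathbf{x})
```

where ψ_u scores each super-pixel's local histogram features, ψ_g ties labels to the global gist representation, and ψ_p encodes co-occurrence and spatial layout between neighboring regions.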

7.
Automatic image orientation detection for natural images is a useful yet challenging research topic. Humans use scene context and semantic object recognition to identify the correct image orientation, but it is difficult for a computer to perform the task the same way because current object recognition algorithms are extremely limited in scope and robustness. As a result, existing orientation detection methods have been built on low-level vision features such as spatial distributions of color and texture, and discrepant detection rates have been reported for them in the literature. We have developed a probabilistic approach to image orientation detection via confidence-based integration of low-level and semantic cues within a Bayesian framework. Our current accuracy is 90 percent for unconstrained consumer photos, which is impressive given the findings of a recent psychophysical study. The proposed framework is an attempt to bridge the gap between computer and human vision systems and is applicable to other problems involving semantic scene content understanding.
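A minimal sketch of confidence-based Bayesian cue integration of this kind, over the four candidate orientations, is shown below; the likelihood values and confidence exponents are invented placeholders, not the paper's learned models.

```python
import numpy as np

orientations = [0, 90, 180, 270]
prior = np.full(4, 0.25)                       # uniform prior

# Per-cue likelihoods P(cue | orientation) with confidences c_k, e.g.
# from low-level color layout and a semantic (face/sky) detector.
likelihoods = {
    "color_layout": (np.array([0.5, 0.2, 0.2, 0.1]), 0.6),
    "semantic":     (np.array([0.7, 0.1, 0.1, 0.1]), 0.9),
}

# Confidence acts as an exponent: a cue with confidence 0 contributes
# nothing; a fully trusted cue contributes its whole likelihood.
posterior = prior.copy()
for lik, conf in likelihoods.values():
    posterior *= lik ** conf
posterior /= posterior.sum()
print(orientations[posterior.argmax()], posterior)
```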

8.
Rong, Wenzhong; Han, Jin; Liu, Gen. Multimedia Tools and Applications 2022, 81(6): 8617-8632

Leveraging contextual information at the instance level can improve accuracy in object detection. However, state-of-the-art object detection systems still detect each target individually without using contextual information, partly because contextual information is difficult to model. To address this problem, an object relation module built on one-stage object detectors helps the detectors learn the correlations between objects. It extracts and fuses feature maps from various layers, including geometric, categorical, and appearance features; a transformation driven by a visual attention mechanism is then performed to generate instance-level primary object relation features. Furthermore, a lightweight subnet generates a new feature prediction layer based on the primary relation features, which is fused with the original detection layer to improve detection ability. The module does not require excessive computation or additional supervision, and it can be easily ported to different one-stage object detection frameworks. As a demonstration, the relation module is added to several one-stage object detectors (YOLO, RetinaNet, and FCOS) and evaluated on the MS-COCO benchmark dataset after training. The results show that the relation module effectively improves accuracy in one-stage object detection pipelines: it gives a 2.4 AP improvement for YOLOv3, 1.8 AP for RetinaNet, and 1.6 AP for FCOS.
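A minimal sketch of such an instance-level relation computation is given below, in the spirit of attention-based relation modules: appearance features attend to one another with a geometric bias, and the result is fused residually. The shapes, the geometric term, and the residual fusion are simplifying assumptions, not the module's actual layers.

```python
import numpy as np

def relation_features(appearance, geom_bias):
    """appearance: (N, d) per-object features; geom_bias: (N, N) scores."""
    d = appearance.shape[1]
    scores = appearance @ appearance.T / np.sqrt(d) + geom_bias
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over objects
    return weights @ appearance                     # relation features

N, d = 5, 16
rng = np.random.default_rng(0)
feats = rng.normal(size=(N, d))                     # per-object appearance
geom = rng.normal(scale=0.1, size=(N, N))           # e.g. from box geometry
enhanced = feats + relation_features(feats, geom)   # residual fusion
print(enhanced.shape)
```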


9.
Kim, Minkyu; Sentis, Luis. Applied Intelligence 2022, 52(12): 14041-14052

When performing visual servoing or object tracking tasks, active sensor planning is essential for keeping targets in sight or relocating them when they go missing. In particular, when a known target is missing from the sensor's field of view, we propose using prior knowledge related to contextual information to estimate its possible location. To this end, this study proposes a Dynamic Bayesian Network that uses contextual information to search for targets effectively. Monte Carlo particle filtering is employed to approximate the posterior probability of the target's state, from which uncertainty is defined. We define the robot's utility function via an information-theoretic formalism as seeking the optimal action that reduces task uncertainty, prompting robot agents to investigate the location where the target is most likely to be. Using a context state model, we design the agent's high-level decision framework as a Partially Observable Markov Decision Process (POMDP). Based on the belief state of the context estimated via sequential observations, the robot's navigation actions are determined to conduct exploratory and detection tasks. With this multi-modal context model, our agent can effectively handle basic dynamic events, such as obstruction of targets or their absence from the field of view. We implement and demonstrate these capabilities on a mobile robot in real time.
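The following toy sketch illustrates the information-theoretic selection step: uncertainty is measured as the entropy of a (here, discretized) belief, and the action with the lowest expected posterior entropy is chosen. The observation model, detection probability, and four-region belief are invented for illustration and are far simpler than the paper's DBN/POMDP.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

belief = np.array([0.1, 0.6, 0.2, 0.1])   # belief over 4 context regions

def expected_posterior_entropy(belief, region, p_detect=0.8):
    # Binary observation: target seen / not seen when looking at `region`.
    p_seen = p_detect * belief[region]
    post_seen = np.eye(len(belief))[region]       # certainty if seen
    post_miss = belief.copy()
    post_miss[region] *= (1 - p_detect)           # discount searched region
    post_miss /= post_miss.sum()
    return p_seen * entropy(post_seen) + (1 - p_seen) * entropy(post_miss)

best = min(range(len(belief)),
           key=lambda a: expected_posterior_entropy(belief, a))
print("look at region", best)
```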


10.
Many researchers argue that fusing multiple cues increases the reliability and robustness of visual tracking. However, how multi-cue integration is realized during tracking is still an open issue. In this work, we present a novel data fusion approach for multi-cue tracking using a particle filter. Our method differs from previous approaches in a number of ways. First, we carry out the integration of cues both in making predictions about the target object and in verifying them through observations. Our second and more significant contribution is that both stages of integration depend directly on the dynamically changing reliabilities of the visual cues. These two aspects allow the tracker to adapt itself easily to changes in the context and, accordingly, to improve tracking accuracy by resolving ambiguities.
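The abstract does not give the exact weighting form, but a common way to make particle weighting depend on per-cue reliabilities is to exponentiate each cue's likelihood by its current reliability, as in this generic sketch:

```latex
w_t^{(i)} \;\propto\; \prod_{k} p_k\!\left(z_{k,t} \mid x_t^{(i)}\right)^{r_{k,t}},
\qquad \sum_k r_{k,t} = 1,
```

where the reliabilities r_{k,t} are updated online, e.g. from how well each cue has recently agreed with the fused estimate, so an unreliable cue's likelihood is flattened toward having no influence.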

11.
12.
Automatic underwater target detection plays an important role in intelligent marine fishing. To address the low accuracy of existing detection methods on underwater organisms, an underwater target detection method based on a GA-RetinaNet algorithm is proposed. First, since underwater images contain dense targets, grouped convolutions are introduced in place of standard convolutions, yielding more feature maps without increasing parameter complexity and improving detection accuracy. Second, since most underwater organisms are small targets, a context feature pyramid module (AC-FPN) is introduced: its context extraction module obtains multiple receptive fields while preserving high-resolution input, extracting richer contextual information, and its context attention and content attention modules capture useful features from that information to localize targets accurately. Experiments on the URPC2021 dataset show that the improved GA-RetinaNet algorithm raises detection accuracy by 2.3% over the original algorithm. Compared with other mainstream models, it achieves good detection results on different types of underwater targets, with a considerable improvement in detection accuracy.
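As an illustration of the grouped-convolution substitution described above, the following PyTorch sketch compares a standard 3×3 convolution with a grouped one; the channel counts and group number are assumed values, not GA-RetinaNet's actual configuration.

```python
import torch
import torch.nn as nn

in_ch, out_ch = 256, 256
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
grouped = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, groups=8)

x = torch.randn(1, in_ch, 64, 64)
print(standard(x).shape, grouped(x).shape)   # identical output shapes

# The grouped version uses roughly 8x fewer weights in this layer,
# which is what allows more feature maps at the same parameter budget.
print(sum(p.numel() for p in standard.parameters()),
      sum(p.numel() for p in grouped.parameters()))
```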

13.
In recent years a number of psycholinguistic experiments have pointed to the interaction between language and vision, in particular between visual attention and linguistic reference. In parallel, several theories of discourse have attempted to account for the relationship between types of referential expressions on the one hand and degrees of mental activation on the other. Building on both of these traditions, this paper describes an attention-based approach to visually situated reference resolution. The framework uses the relationship between referential form and preferred mode of interpretation as the basis for a weighted integration of linguistic and visual attention scores for each entity in the multimodal context. The resulting integrated attention scores are then used to rank the candidate referents during resolution, and the highest-scoring candidate is selected as the referent. One advantage of this approach is that resolution occurs within the full multimodal context, in that the referent is selected from a complete list of the objects in that context. As a result, situations where the intended target of the reference is erroneously excluded, due to an individual assumption within the resolution process, are avoided. Moreover, the system can recognize situations where attention cues from different modalities make a reference potentially ambiguous.
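A minimal sketch of the weighted integration and ranking described above follows; the referential forms, weights, scores, and the ambiguity margin are all invented for illustration.

```python
FORM_WEIGHTS = {           # (linguistic, visual) weight per referential form
    "pronoun":     (0.3, 0.7),   # pronouns lean on visual attention
    "definite_np": (0.7, 0.3),   # full NPs lean on linguistic attention
}

def resolve(form, candidates):
    """Rank candidates by integrated attention; flag near-ties as ambiguous."""
    wl, wv = FORM_WEIGHTS[form]
    scored = [(wl * c["ling"] + wv * c["vis"], c["id"]) for c in candidates]
    scored.sort(reverse=True)
    top, second = scored[0], scored[1]
    ambiguous = (top[0] - second[0]) < 0.05
    return top[1], ambiguous

entities = [{"id": "cup",  "ling": 0.2, "vis": 0.9},
            {"id": "bowl", "ling": 0.8, "vis": 0.1}]
print(resolve("pronoun", entities))   # -> ('cup', False)
```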

14.
Model-based 3-D object tracking has gained significant importance in areas such as augmented reality, surveillance, visual servoing, and robotic object manipulation and grasping. Key obstacles to robust and precise object tracking are outliers caused by occlusion, self-occlusion, cluttered backgrounds, reflections, and complex appearance properties of the object. Two of the most common remedies have been the use of robust estimators and the integration of visual cues. The tracking system presented in this paper achieves robustness by integrating model-based and model-free cues together with robust estimators. As a model-based cue, a wireframe edge model is used; as model-free cues, automatically generated surface texture features are used. The particular contribution of this work is an integration framework that handles not only polyhedral objects but also spherical, cylindrical, and conical objects, for which the complete pose cannot be estimated from wireframe models alone. Through integration with the model-free features, we show how a full pose estimate can be obtained. Experimental evaluation demonstrates robust system performance in realistic settings with highly textured objects and natural backgrounds.

15.
A biologically inspired visual system capable of motion detection and pursuit is implemented using a Discrete Leaky Integrate-and-Fire (DLIF) neuron model. The system consists of a visual world, a virtual retina, the neural network circuitry (DLIF) that processes the information, and a set of virtual eye muscles that move the retina's input area (visual field) within the visual world. Temporal aspects of the DLIF model are heavily exploited, including spike propagation latency, relative spike timing, and leaky potential integration. A novel technique for motion detection is employed that uses the coincidence detection properties of the DLIF and relative spike timing. The system as a whole encodes information using the relative timing of individual action potentials as well as rate-coded spike trains. Experimental results are presented in which the motion of objects is detected and tracked in real and animated video. Pursuit is successful along linear and sinusoidal paths, including changes in object velocity. The visual system exhibits dynamic overshoot correction that heavily exploits neural network characteristics, and its performance is within the bounds of real-time applications.
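For concreteness, a single discrete leaky integrate-and-fire update of the general kind the DLIF model builds on might look like the sketch below; the leak rate, threshold, and input train are arbitrary choices, not the paper's parameters.

```python
def dlif_step(v, input_current, leak=0.9, threshold=1.0):
    """One discrete time step: leak, integrate, fire-and-reset."""
    v = leak * v + input_current
    if v >= threshold:
        return 0.0, True     # reset potential, emit spike
    return v, False

v, spikes = 0.0, []
for t, i_in in enumerate([0.3, 0.3, 0.3, 0.0, 0.6, 0.6]):
    v, fired = dlif_step(v, i_in)
    spikes.append((t, fired))
print(spikes)
```

The trace shows sub-threshold inputs leaking away unless they arrive close together in time, which is the property a coincidence-based motion detector exploits.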

16.
The mobile Internet introduces new opportunities to gain insight into a user's environment, behavior, and activity. This contextual information can be used as an additional information source to improve traditional recommendation algorithms. This paper describes a framework that detects the user's current context and activity by analyzing data retrieved from the different sensors available on mobile devices. The framework can easily be extended to detect custom activities and is built in a generic way to ensure easy integration with other applications. On top of this framework, a recommender system is built to provide users with a personalized content offer, consisting of relevant information such as points of interest, train schedules, and tourist information, based on the user's current context. An evaluation of the recommender system and the underlying context recognition framework shows that power consumption and data traffic remain within an acceptable range. Users who tested the recommender system via the mobile application confirmed its usability and enjoyed using it; they assessed the recommendations as effective and as helping them discover new places and interesting information.

17.
Liu, Maofu; Shi, Qi; Nie, Liqiang. Journal of Software 2022, 33(9): 3210-3222
Image caption generation has important theoretical significance and application value, and it has received wide attention in both computer vision and natural language processing. Attention-based caption generation methods fuse the current word and visual information at each time step to generate the target word, neglecting visual coherence and contextual information, so the generated captions deviate from the reference captions. To address this problem, this paper proposes an image caption generation method based on visual relevance and context dual attention (VRCDA). The visual relevance attention adds the previous time step's attention vector into conventional visual attention to ensure visual coherence, while the context attention obtains more complete semantic information from the global context so as to make full use of contextual information; together they guide the generation of the final image caption text. Experiments on the MSCOCO and Flickr30k benchmark datasets show that the proposed VRCDA method can effectively generate semantic image captions, achieving considerable improvements on all evaluation metrics compared with mainstream caption generation methods.
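A simplified sketch of the visual-relevance idea, in which the attention scores at each step are conditioned on the previous step's attended context vector, is shown below; the dimensions, linear scoring form, and random parameters are assumptions, not VRCDA's actual layers.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(regions, hidden, prev_context, W_h, W_c):
    # Score each region from the decoder state AND the previous context,
    # so consecutive attention maps stay visually coherent.
    scores = regions @ (W_h @ hidden + W_c @ prev_context)
    alpha = softmax(scores)
    return alpha @ regions        # new attended visual context

k, d = 36, 64                     # 36 image regions, 64-dim features
rng = np.random.default_rng(0)
regions = rng.normal(size=(k, d))
W_h = rng.normal(size=(d, d)) * 0.1
W_c = rng.normal(size=(d, d)) * 0.1
ctx = np.zeros(d)
for hidden in rng.normal(size=(3, d)):    # three decoding steps
    ctx = attend(regions, hidden, ctx, W_h, W_c)
print(ctx.shape)
```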

18.
Information Fusion 2001, 2(1): 49-71
The application of multi-sensor fusion, which aims at recognizing a state among a set of hypotheses for object classification, is of major interest because of the performance improvement brought by sensor complementarity. Nevertheless, this requires taking into account the most accurate information and taking advantage of statistical learning from the previous measurements acquired by the sensors. When previous learning is not representative of the real measurements provided by the sensors, classical probabilistic fusion methods lack performance. The Dempster–Shafer theory is therefore introduced to overcome this disadvantage by integrating further information, namely the context of the sensor acquisitions. In this paper, we propose a model formalism for sensor reliability in context that leads to two methods of integration when all the hypotheses associated with the objects of the scene acquired by the sensors have previously been learned: the first models the integration of this further information in the fusion rule as degrees of trust, and the second models sensor reliability directly as probability mass. Both methods are based on the theory of fuzzy events. Simulations of typical cases are developed in order to define the respective validity domains of the two methods. We then develop the two methods for the case where previous learning is unavailable for an object, from which a global method of contextual information integration can be deduced.
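For reference, Dempster's rule of combination over a small frame of discernment can be sketched as follows; the two mass functions are invented examples (the second with mass shifted toward ignorance, mimicking reduced reliability in an unfavorable context), not the paper's models.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions given as {frozenset: mass} dicts."""
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    norm = 1.0 - conflict           # renormalize by total non-conflict
    return {k: v / norm for k, v in combined.items()}

A, B = frozenset({"car"}), frozenset({"truck"})
theta = A | B                        # full frame = ignorance
m_sensor1 = {A: 0.6, theta: 0.4}     # trusted sensor, commits to "car"
m_sensor2 = {B: 0.3, theta: 0.7}     # mass weakened in a poor context
print(dempster_combine(m_sensor1, m_sensor2))
```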

19.
20.
The feature information extracted in object detection is often insufficient, leading to low accuracy when recognizing small or occluded targets. This paper therefore proposes a multi-layer context convolutional network (MLC-CNN) that performs object detection by extracting multi-layer contextual features and combining them with object features. MLC-CNN consists of two subnetworks: a region proposal network (RPN) and a multi-layer context (MLC) network. The RPN obtains fixed-length feature vectors as object features, while the MLC obtains the corresponding contextual features from feature maps at different layers; the two sets of features are then fused. In addition, hard negative sample training is incorporated to address data imbalance. Experiments on the PASCAL VOC2007 and PASCAL VOC2012 datasets show that MLC-CNN clearly improves mean average precision (mAP).
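As an illustration of the hard-negative training mentioned above, the following sketch keeps all positives and only the highest-loss negatives at a fixed negative-to-positive ratio; the 3:1 ratio and toy losses are assumptions, not the paper's settings.

```python
import numpy as np

def mine_hard_negatives(losses, labels, neg_pos_ratio=3):
    """Return indices of all positives plus the hardest negatives."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    k = min(len(neg), neg_pos_ratio * max(len(pos), 1))
    hard_neg = neg[np.argsort(losses[neg])[::-1][:k]]   # top-loss negatives
    return np.concatenate([pos, hard_neg])

losses = np.array([2.0, 0.1, 0.05, 1.5, 0.4, 0.02])
labels = np.array([1,   0,   0,    0,   0,   0])
print(mine_hard_negatives(losses, labels))   # positive + 3 hardest negatives
```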
