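The visual question answering task gives a machine an image and a related question and expects the computer to answer accurately. For this task, neural network architectures with memory and attention mechanisms are studied in depth; such networks exhibit some of the reasoning ability needed for question answering. Building on an analysis of dynamic memory networks (DMN), a new dynamic memory network is proposed that improves the memory and input modules of the original DMN. Together with these changes, a new image input module is introduced into the visual question answering system. The method is validated on the DAQUAR-ALL, COCO-QA, and VQA datasets. Experimental results show that the proposed dynamic memory model achieves good results and outperforms several classical deep learning methods.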

Zhu Xi, Mao Zhendong, Chen Zhineng, Li Yangyang, Wang Zhaohui, Wang Bin 《Multimedia Tools and Applications》2021,80(11):16247-16265
Visual Question Answering (VQA), an important task to evaluate the cross-modal understanding capability of an Artificial Intelligence model, has been a hot...


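Referring expression comprehension (REC) aims to localize the image region referred to by an input phrase. One of the main challenges is to establish and localize the relationships between the objects described by the phrase in the image. A mainstream approach scores each object according to its own attributes and its relationships with other objects, and takes the highest-scoring object as the predicted referent region. However, such methods usually consider only the relationship between an object and its surroundings and ignore the interactions among the surrounding elements described in the phrase, which considerably impairs the modeling of relationships between objects. To address this, a relationship aggregation network (RAN) is proposed to construct the relationships between objects and then predict what the input phrase refers to. Specifically, a graph attention network models the complete relationships among objects in the image; a cross-modal attention method then selects and aggregates the relationships most relevant to the input phrase; finally, a matching score between the target region and the input phrase is computed. In addition, the erasing method used in REC is improved: adaptively expanding the erased range pushes the model to use more cues to localize the correct region. Extensive experiments on three widely used benchmark datasets demonstrate the superiority of the proposed method.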


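Objective: Most existing visual question answering models adopt a top-down visual attention mechanism that processes image content uniformly without weighting, and so cannot represent image information well. Lacking a long-term memory module, they also cannot store information over long periods, so useful information is lost during answer reasoning and wrong answers are predicted. To address this, a visual question answering model combining a bottom-up attention mechanism with a memory network is proposed, which improves accuracy by strengthening the representation and memory of image content. Method: A pretrained object detection model extracts objects and salient regions as image features, which are fed together with the question representation into a memory network. The memory network retrieves useful information from the input image features according to the question, and iterates and updates several times using the image information and the question representation to produce a final information representation. Finally, the final memory and the question representation are fused to infer the answer. Results: Comparison and ablation experiments against mainstream algorithms on the public large-scale dataset VQA (visual question answering) v2.0 show that the proposed model significantly improves accuracy on the visual question answering task, with an overall accuracy of 64.0%. Compared with the MCB (multimodal compact bilinear) algorithm, overall accuracy improves by 1.7%; compared with the well-performing VQA machine algorithm, overall accuracy improves by 1%, with accuracy on yes/no, counting, and other question types improving by 1.1%, 3.4%, and 0.6% respectively. Overall performance exceeds that of the other compared algorithms, verifying the effectiveness of the proposed algorithm. Conclusion: The proposed model combining bottom-up attention and a memory network is closer to the human visual attention mechanism and reduces information loss during answer reasoning, effectively improving visual question answering accuracy.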

Video question answering (Video QA) involves a thorough understanding of video content and question language, as well as the grounding of textual semantics in the visual content of videos. Thus, to answer questions more accurately, not only should each semantic entity be associated with a visual instance in the video frames, but the action or event in the question should also be localized to a corresponding temporal slot. This makes it a more challenging task that requires reasoning over correlations between instances along temporal frames. In this paper, we propose an instance-sequence reasoning network for video question answering with instance grounding and temporal localization. In our model, both visual instances and textual representations are first embedded into graph nodes, which benefits intra- and inter-modality integration. Then, we propose graph causal convolution (GCC) on graph-structured sequences with a large receptive field to capture more causal connections, which is vital for visual grounding and instance-sequence reasoning. Finally, we evaluate our model on the TVQA+ dataset, which contains the ground truth of instance grounding and temporal localization, three other Video QA datasets, and three multimodal language processing datasets. Extensive experiments demonstrate the effectiveness and generalization of the proposed method. Specifically, our method outperforms the state-of-the-art methods on these benchmarks.

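Z Language Specification of Question Answering System Problems   Cited: 2 (self-citations: 0, by others: 2)
One shortcoming of current question answering systems is analyzed: the lack of a standard framework, which makes it difficult to share resources between systems. A formal method is therefore proposed for building a unified question answering system, using the formal language Z to write requirements specifications for the system's main modules. Z is also used to describe the system's main operation modes, including keyword extraction, question retrieval, and knowledge base updating.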

We present an evolutionary approach for the computation of exact answers to natural language (NL) questions. Answers are extracted directly from the N-best snippets, which have been identified by a standard Web search engine using NL questions. The core idea of our evolutionary approach to Web question answering is to search for those substrings in the snippets whose contexts are most similar to the contexts of already known answers. This context model, together with the words mentioned in the NL question, is used to evaluate the fitness of answer candidates, which are randomly selected substrings from randomly selected sentences of the snippets. New answer candidates are then created by applying specialized operators for crossover and mutation, which either stretch and shrink the substring of an answer candidate or transpose the span to new sentences. Since we have no predefined notion of patterns, our context alignment methods are very dynamic and strictly data-driven. We assessed our system with seven different datasets of question/answer pairs. The results show that this approach is promising, especially when it deals with specific questions.

Neural generative models in question answering (QA) usually employ sequence-to-sequence (Seq2Seq) learning to generate answers based on the user's questions, as opposed to retrieval-based models, which select the best-matched answer from a repository of pre-defined QA pairs. One key challenge of neural generative models in QA is that they tend to generate high-frequency, generic answers regardless of the question, partially due to optimizing a log-likelihood objective function. In this paper, we investigate multitask learning (MTL) in a neural network-based method under a QA scenario. We define our main task as a generative QA via Seq2Seq learning, and our auxiliary task as a discriminative QA via binary QA classification. The main task and the auxiliary task are learned jointly with shared representations, which improves generalization and transfers classification labels as extra evidence to guide the word sequence generation of the answers. Experimental results on both automatic evaluations and human annotations demonstrate the superiority of our proposed method over baselines.

Question answering is an important problem that aims to deliver specific answers to questions posed by humans in natural language. How to efficiently identify the exact answer to a given question has become an active line of research. Previous approaches to factoid question answering typically focus on modeling the semantic relevance or syntactic relationship between a given question and its corresponding answer. Most of these models suffer when a question contains very little content that is indicative of the answer. In this paper, we devise an architecture named the temporality-enhanced knowledge memory network (TE-KMN) and apply the model to a factoid question answering dataset from a trivia competition called quiz bowl. Unlike most existing approaches, our model encodes not only the content of questions and answers, but also the temporal cues in a sequence of ordered sentences that gradually reveal the answer. Moreover, our model collaboratively uses external knowledge for a better understanding of a given question. The experimental results demonstrate that our method achieves better performance than several state-of-the-art methods.

Visual question answering (VQA) has attracted many researchers in recent years. It could potentially be applied to remote consultation for COVID-19. Attention mechanisms provide an effective way of selectively utilizing visual and question information in VQA. The attention methods of existing VQA models generally focus on the spatial dimension. In other words, the attention is modeled as spatial probabilities that re-weight the image region or word token features. However, feature-wise attention cannot be ignored, as image and question representations are organized in both spatial and feature-wise modes. Taking the question “What is the color of the woman’s hair” as an example, identifying the hair color attribute feature is as important as focusing on the hair region. In this paper, we propose a novel neural network module named “multimodal feature-wise attention module” (MulFA) to model the feature-wise attention. Extensive experiments show that MulFA is capable of filtering representations for feature refinement and leads to improved performance. By introducing MulFA modules, we construct an effective union feature-wise and spatial co-attention network (UFSCAN) model for VQA. Our evaluation on two large-scale VQA datasets, VQA 1.0 and VQA 2.0, shows that UFSCAN achieves performance competitive with state-of-the-art models.
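The distinction between spatial and feature-wise attention drawn in this abstract can be illustrated with a minimal sketch. This is not the MulFA module itself, whose architecture the abstract does not specify; the function names, the softmax-over-regions pooling, and the sigmoid gating are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def spatial_attention(regions, region_scores):
    """Spatial attention (assumed form): re-weight whole region vectors
    with probabilities over regions, then sum them into one vector."""
    weights = softmax(region_scores)
    dim = len(regions[0])
    return [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]

def feature_wise_attention(feature, channel_logits):
    """Feature-wise attention (assumed form): gate each feature dimension
    independently, so an attribute channel can be kept or suppressed."""
    return [sigmoid(g) * x for g, x in zip(channel_logits, feature)]

# Two hypothetical image regions with three feature dimensions each.
regions = [[1.0, 0.0, 2.0], [0.0, 4.0, 2.0]]
pooled = spatial_attention(regions, [0.0, 0.0])           # equal scores: plain average
gated = feature_wise_attention(pooled, [0.0, 0.0, 0.0])   # logit 0: every gate is 0.5
```

In a real model the region scores and channel logits would be predicted from the question and image features rather than fixed by hand.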

In this paper, we develop a framework for recommending Question Answering pages (referred to as QA pages). Our proposed framework consists of two modules: an off-line module that determines the importance of QA pages and an on-line module for QA page recommendation. In the off-line module, we claim that the importance of QA pages can be discovered from user click streams. If QA pages are of higher importance, many users will click and spend time on them. Moreover, the relevance relationships among QA pages are captured by the browsing behavior on these pages. As such, we exploit user click streams to model the browsing behavior among QA pages as QA browsing graph structures. The importance of QA pages is derived from our proposed QA browsing graph structures. However, we observe that the QA browsing graph is sparse and that most QA pages do not link to other QA pages. This is referred to as a sparsity problem. To overcome this problem, we utilize the latent browsing relations among QA pages to build a QA Latent Browsing Graph. In light of the QA latent browsing graph, the importance score of QA pages (referred to as Latent Browsing Rank) and the relevance score of QA pages (referred to as Latent Browsing Recommendation Rank) are proposed. These scores demonstrate the use of a QA latent browsing graph not only to determine the importance of QA pages but also to recommend QA pages. We conducted extensive empirical experiments on Yahoo! Asia Knowledge Plus to evaluate our proposed framework.

Wu Wenqing, Zhu Zhenfang, Zhang Guangyuan, Kang Shiyong, Liu Peiyu 《Applied Intelligence》2021,51(7):4515-4524
Multi-relation Question Answering is an important task of knowledge base over question answering (KBQA), multi-relation means that the question contains multiple relations...

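To improve the accuracy of visual question answering (VQA) models on complex image questions, a cross-modal cross-fusion attention network (CCAN) for VQA is proposed. First, an improved residual channel self-attention method attends to the image, finding important regions from the image's global information, and thereby introduces a new joint attention mechanism that combines word attention with image region attention. Second, a "cross-modal cross-fusion" network is proposed to generate multiple features...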

The usage of computer applications in the construction industry is increasing, as is the complexity of software applications and this makes it difficult for project personnel to maintain familiarity. Furthermore, the causes of practical problems, such as project delays and cost over-runs, are often not derivable from the output of most software. A question answering system provides a means for directly extracting knowledge from this output. This paper begins with an examination of issues involved in building such a system. An emerging industry standard, ifcXML, is adopted as the knowledge representation format, thereby reducing the effort that is necessary to build a knowledge base. We then explore the mechanisms that use information in the knowledge base for question understanding. A prototype system has been built and tested to illustrate usefulness for project management applications.

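Objective: Chart question answering is an important research task in multimodal learning for computer vision. The simple pairwise pairing of the traditional relation network (RN) model can cover the relations between all pixels and therefore achieves good results, but it includes redundant information, and the quadratically growing number of relation-pair features places a heavy computational and parameter burden on the subsequent reasoning network. To address this, a relocated relation network model driven by guiding weights and based on fused semantic feature extraction is proposed. Method: Low-level and high-level image features of the scene task are first fused to extract richer semantic information from statistical charts, and an attention-based text encoder is proposed, achieving fused semantic feature extraction. The guiding weights are then sorted to further reconstruct the image positions, yielding the relocated relation network model. Results: Experiments were compared on two datasets. On the FigureQA (an annotated figure dataset for visual reasoning) dataset, compared with IMG+QUES (image+questions), RN, and ARN (appearance and relation networks), the overall accuracy of the proposed method improves by 26.4%, 8.1%, and 0.46% respectively; on the single validation set, compared with LEA...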

Multi-hop Knowledge Base Question Answering (KBQA) aims to predict answers that require multi-hop reasoning from the topic entity in the question over the Knowledge Base (KB). Relation extraction is a core step in KBQA, which extracts the relation path from the topic entity to the answer entity. Compared with single-hop questions, multi-hop ones have more complex syntactic structures to understand, and multi-hop relation paths lead to a larger search space, which makes it much more challenging to extract the correct relation paths. To tackle the above challenges, most existing relation extraction approaches focus on the semantic similarity between questions and relation paths. However, those approaches only consider the word semantics of the relation names and ignore the graph semantics inside the knowledge base. As a result, their generalization ability relies on the naming rules of the relations, making it more difficult to generalize over large knowledge bases. To address the current limitations and take advantage of the graph semantics of relations, we propose a novel translational embedding-based relation extractor that utilizes pretrained embeddings from TransE. In particular, we treat the multi-hop relation path as a translation from the first relation to the last one in the semantic space of TransE. Then we map the question into this space under the supervision of the path embeddings. To take full advantage of the pretrained graph semantics in TransE, we propose a KBQA framework that leverages pretrained relation semantics in relation extraction and pretrained entity semantics in answer selection. Our approach achieves state-of-the-art performance on two benchmark datasets, WebQuestionSP and MetaQA, demonstrating its effectiveness on the multi-hop KBQA task.
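The idea of treating a relation path as a translation can be sketched as follows. TransE models a triple as head + relation ≈ tail, so one natural reading is that a path's embedding is the sum of its relation vectors. That additive composition, the toy vectors, the relation names, and the distance-based ranking below are all assumptions for illustration, not the paper's confirmed formulation.

```python
def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

def l2_distance(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def path_embedding(relation_path, relation_vectors):
    """Compose a multi-hop path by summing its relation vectors
    (assumed composition, motivated by head + relation ≈ tail)."""
    dim = len(next(iter(relation_vectors.values())))
    total = [0.0] * dim
    for name in relation_path:
        total = add(total, relation_vectors[name])
    return total

def rank_paths(question_vector, candidate_paths, relation_vectors):
    """Order candidate paths by distance to the question vector,
    which is assumed to already live in the relation space."""
    return sorted(
        candidate_paths,
        key=lambda p: l2_distance(question_vector, path_embedding(p, relation_vectors)),
    )

# Hypothetical 2-dimensional relation vectors standing in for pretrained ones.
relation_vectors = {
    "directed_by": [1.0, 0.0],
    "born_in": [0.0, 1.0],
    "starred_in": [-1.0, 0.0],
}
candidates = [
    ("directed_by",),
    ("directed_by", "born_in"),
    ("starred_in", "born_in"),
]
question_vector = [0.9, 1.1]  # stand-in for a learned question encoding
ranked = rank_paths(question_vector, candidates, relation_vectors)
```

Here the two-hop path ("directed_by", "born_in") sums to [1.0, 1.0], the closest point to the question vector, so it is ranked first.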

Question Answering Systems (QAS) are receiving increasing attention from IS researchers, particularly those in the information retrieval and natural language processing communities. Evaluation of an IS's success and user satisfaction are important issues, especially for emerging online service systems using the Internet. Although many QAS have been implemented, little work has been done on the development of an evaluation model for them. Our purpose was to develop a validated instrument to measure user satisfaction with QAS (USQAS). The proposed validated instrument was intended as a reference for the design of QAS from a user's perspective.

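Question Relevance Algorithm and System Implementation for Intelligent Question Answering   Cited: 7 (self-citations: 0, by others: 7)
To address the lack of intelligence and the unfriendly human-computer interaction of existing question answering systems, an implementation scheme for an intelligent question answering system is proposed. To improve the accuracy of matching questions to answers, the work focuses on the question relevance algorithm: after automatic word segmentation, the relevance of questions is computed as the similarity of their keyword sets, and a supervised BP machine learning model is used to build a learning model suited to the system that optimizes the word weights. Experiments show that the algorithm helps the intelligent question answering system improve its accuracy and intelligence and has practical value.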
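A keyword-set similarity of the kind described here can be sketched as a weighted overlap ratio. The abstract does not give the actual formula, so the weighted Jaccard form below is an assumption, and the hand-set weights merely stand in for the word weights that the paper learns with a BP model. Word segmentation is assumed to have been done already.

```python
def keyword_similarity(keywords_a, keywords_b, weights=None, default_weight=1.0):
    """Weighted overlap of two keyword sets (assumed weighted-Jaccard form).

    weights maps a keyword to its importance; keywords without an entry
    get default_weight. Returns a score between 0.0 and 1.0.
    """
    a, b = set(keywords_a), set(keywords_b)
    union = a | b
    if not union:
        return 0.0
    lookup = weights or {}
    shared = sum(lookup.get(k, default_weight) for k in a & b)
    total = sum(lookup.get(k, default_weight) for k in union)
    return shared / total if total else 0.0

# Hypothetical segmented questions.
new_question = ["install", "printer", "driver"]
stored_question = ["printer", "driver", "update"]

unweighted = keyword_similarity(new_question, stored_question)
weighted = keyword_similarity(
    new_question,
    stored_question,
    weights={"printer": 3.0, "driver": 2.0},  # placeholder for learned weights
)
```

With uniform weights the two questions share 2 of 4 distinct keywords; raising the weights of the shared keywords raises the score, which is the effect that learning the weights is meant to exploit.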

Answering complex questions involving multiple relations over knowledge bases is a challenging task. Many previous works rely on dependency parsing. However, errors in dependency parsing would influence their performance, in particular for long complex questions. In this paper, we propose a novel skeleton grammar to represent the high-level structure of a complex question. This lightweight formalism and its BERT-based parsing algorithm help to improve the downstream dependency parsing. To show the effectiveness of skeleton, we develop two question answering approaches: skeleton-based semantic parsing (called SSP) and skeleton-based information retrieval (called SIR). In SSP, skeleton helps to improve structured query generation. In SIR, skeleton helps to improve path ranking. Experimental results show that, thanks to skeletons, our approaches achieve state-of-the-art results on three datasets: LC-QuAD 1.0, GraphQuestions, and ComplexWebQuestions 1.1.


Visual Question Answering (VQA) expands on the Turing Test, as it involves the ability to answer questions about visual content. Current efforts in VQA, however, still do not fully consider whether a question about visual content is relevant and, if it is not, how best to edit it to make it answerable. Question relevance has so far been considered only at the level of a whole question, using binary classification and without the capability to edit a question to make it grounded and intelligible. The only exception is our prior research into question part relevance, which allows for relevance and editing based on object nouns. This paper extends previous work on object relevance to determine the relevance of a question action and leverages this capability to edit an irrelevant question to make it relevant. Practical applications of such a capability include answering biometric-related queries across a set of images, including people and their actions (behavioral biometrics). The feasibility of our approach is shown using Context-Collaborative VQA (C2VQA) Action/Relevance/Edit (ARE). Our results show that our proposed approach outperforms all other models for the novel tasks of question action relevance (QAR) and question action editing (QAE) by a significant margin. The ultimate goal for future research is to address full-fledged W5+ types of inquiries (What, Where, When, Why, Who, and How) that are grounded in and reference video using both nouns and verbs in a collaborative context-aware fashion.

