Multi-Module Co-Attention Model for Visual Question Answering
Citation: ZOU Pinrong, XIAO Feng, ZHANG Wenjuan, ZHANG Wanyu, WANG Chenyang. Multi-Module Co-Attention Model for Visual Question Answering[J]. Computer Engineering, 2022, 48(2): 250-260.
Authors: ZOU Pinrong, XIAO Feng, ZHANG Wenjuan, ZHANG Wanyu, WANG Chenyang
Affiliation: 1. School of Armament Science and Technology, Xi'an Technological University, Xi'an 710021, China; 2. School of Computer Science and Engineering, Xi'an Technological University, Xi'an 710021, China; 3. School of Science, Xi'an Technological University, Xi'an 710021, China
Abstract: Visual Question Answering (VQA) is a typical multi-modal problem spanning computer vision and natural language processing. Most existing VQA models ignore the dynamic relationships between the semantic information of the two modalities and the rich spatial structure among image regions. To address this, a novel Multi-Module Co-Attention Network named MMCAN is proposed, which fully captures the dynamic interactions among objects in a visual scene and the contextual representation of the question. Relations between different types of objects are modeled with a graph attention mechanism, and an adaptive relation representation conditioned on the question is learned. The question features and the relation-aware visual features are then encoded through co-attention to strengthen the dependence between question words and the corresponding image regions, and an attention enhancement module is used to improve the model's fitting ability (a minimal code sketch of the graph attention and co-attention steps follows the metadata below). Experimental results on the open datasets VQA 2.0 and VQA-CP v2 show that the proposed model achieves significantly better accuracy than DA-NTN, ReGAT, ODA-GCN, and other baselines on the "Overall", "Yes/No", "Number", and "Other" question categories, effectively improving the accuracy of visual question answering.
Keywords: Visual Question Answering (VQA); attention mechanism; graph attention network; relational reasoning; multimodal learning; feature fusion
Funding: National Natural Science Foundation of China (61572392, 62171361); Science and Technology Program of Shaanxi Province (2020GY-066); Natural Science Basic Research Program of Shaanxi Province (2021JM-440)
Received: 2021-03-16; Revised: 2021-05-25
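
The abstract describes two core mechanisms: graph attention over detected object regions to model inter-object relations, and question-guided co-attention to align the question with image regions. The following is a minimal, self-contained PyTorch sketch of those two steps only; the class names (GraphAttention, CoAttention), tensor shapes, layer sizes, and single-head design are illustrative assumptions, not the authors' MMCAN implementation, which additionally models typed relations and includes an attention enhancement module.

# Minimal sketch of graph attention over object regions followed by
# question-guided co-attention. All names, shapes, and layer sizes are
# illustrative assumptions, not the paper's MMCAN implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Single-head graph attention over K object regions."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, v):                       # v: (B, K, D) region features
        h = self.proj(v)                        # (B, K, D)
        B, K, D = h.shape
        # Pairwise concatenation of node features -> attention logits.
        hi = h.unsqueeze(2).expand(B, K, K, D)  # node i broadcast over j
        hj = h.unsqueeze(1).expand(B, K, K, D)  # node j broadcast over i
        e = self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1)  # (B, K, K)
        a = F.softmax(F.leaky_relu(e), dim=-1)  # attention over neighbours
        return a @ h                            # relation-aware features (B, K, D)

class CoAttention(nn.Module):
    """Question-guided attention over relation-aware region features."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, v, q):                    # v: (B, K, D), q: (B, D)
        joint = v * q.unsqueeze(1)              # fuse each region with the question
        a = F.softmax(self.score(joint), dim=1) # (B, K, 1) region weights
        return (a * v).sum(dim=1)               # attended visual vector (B, D)

# Toy usage: 36 region features of size 512 and one question vector per sample.
v = torch.randn(2, 36, 512)
q = torch.randn(2, 512)
v_rel = GraphAttention(512)(v)
fused = CoAttention(512)(v_rel, q)              # would feed an answer classifier
print(fused.shape)                              # torch.Size([2, 512])

In the full model, the attended visual vector would be fused with the question representation and passed to an answer classifier; multi-head attention and relation-type-specific graphs would replace the single-head layers sketched here.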