Multi-Module Co-Attention Model for Visual Question Answering
Citation: ZOU Pinrong, XIAO Feng, ZHANG Wenjuan, ZHANG Wanyu, WANG Chenyang. Multi-Module Co-Attention Model for Visual Question Answering[J]. Computer Engineering, 2022, 48(2): 250-260.
Authors: ZOU Pinrong, XIAO Feng, ZHANG Wenjuan, ZHANG Wanyu, WANG Chenyang
Affiliation: 1. School of Armament Science and Technology, Xi'an Technological University, Xi'an 710021, China; 2. School of Computer Science and Engineering, Xi'an Technological University, Xi'an 710021, China; 3. School of Science, Xi'an Technological University, Xi'an 710021, China
Abstract: Visual Question Answering (VQA) is a typical multi-modal problem spanning computer vision and natural language processing. Most existing VQA models ignore the dynamic relationships between the semantic information of the two modalities and the rich spatial structure among image regions. To address this, a novel Multi-Module Co-Attention Network named MMCAN is proposed, which fully captures the dynamic interactions among objects in a visual scene and the contextual representation of the question. Relations between different types of objects are modeled with a graph attention mechanism, and an adaptive relation representation conditioned on the question is learned. The question features and the relation-aware visual features are then encoded through co-attention to strengthen the dependence between question words and the corresponding image regions, and an attention enhancement module is used to improve the model's fitting ability (a minimal code sketch of the graph attention and co-attention steps follows the metadata below). Experimental results on the open datasets VQA 2.0 and VQA-CP v2 show that the proposed model achieves significantly better accuracy than DA-NTN, ReGAT, ODA-GCN, and other baselines on the "Overall", "Yes/No", "Number", and "Other" question categories, effectively improving the accuracy of visual question answering.
Keywords: Visual Question Answering (VQA); attention mechanism; graph attention network; relational reasoning; multimodal learning; feature fusion
Funding: National Natural Science Foundation of China (61572392, 62171361); Science and Technology Program of Shaanxi Province (2020GY-066); Natural Science Basic Research Program of Shaanxi Province (2021JM-440)
Received: 2021-03-16; Revised: 2021-05-25
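
The abstract describes two core mechanisms: graph attention over detected object regions to model inter-object relations, and question-guided co-attention to align the question with image regions. The following is a minimal, self-contained PyTorch sketch of those two steps only; the class names (GraphAttention, CoAttention), tensor shapes, layer sizes, and single-head design are illustrative assumptions, not the authors' MMCAN implementation, which additionally models typed relations and includes an attention enhancement module.

# Minimal sketch of graph attention over object regions followed by
# question-guided co-attention. All names, shapes, and layer sizes are
# illustrative assumptions, not the paper's MMCAN implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Single-head graph attention over K object regions."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, v):                       # v: (B, K, D) region features
        h = self.proj(v)                        # (B, K, D)
        B, K, D = h.shape
        # Pairwise concatenation of node features -> attention logits.
        hi = h.unsqueeze(2).expand(B, K, K, D)  # node i broadcast over j
        hj = h.unsqueeze(1).expand(B, K, K, D)  # node j broadcast over i
        e = self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1)  # (B, K, K)
        a = F.softmax(F.leaky_relu(e), dim=-1)  # attention over neighbours
        return a @ h                            # relation-aware features (B, K, D)

class CoAttention(nn.Module):
    """Question-guided attention over relation-aware region features."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, v, q):                    # v: (B, K, D), q: (B, D)
        joint = v * q.unsqueeze(1)              # fuse each region with the question
        a = F.softmax(self.score(joint), dim=1) # (B, K, 1) region weights
        return (a * v).sum(dim=1)               # attended visual vector (B, D)

# Toy usage: 36 region features of size 512 and one question vector per sample.
v = torch.randn(2, 36, 512)
q = torch.randn(2, 512)
v_rel = GraphAttention(512)(v)
fused = CoAttention(512)(v_rel, q)              # would feed an answer classifier
print(fused.shape)                              # torch.Size([2, 512])

In the full model, the attended visual vector would be fused with the question representation and passed to an answer classifier; multi-head attention and relation-type-specific graphs would replace the single-head layers sketched here.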