Funding: National Natural Science Foundation of China (61772576, 61906141); Natural Science Foundation of Shaanxi Province (2020JQ-317); Science and Technology Research Project of the Henan Provincial Department of Science and Technology (182102210126).

Received: 2021-08-12
Revised: 2021-10-04

Visual Question Answering Model Based on Spatial Relation and Frequency Feature
FU Pengcheng,YANG Guan,LIU Xiaoming,LIU Yang,ZHANG Ziming,CHENG Xi.Visual Question Answering Model Based on Spatial Relation and Frequency Feature[J].Computer Engineering,2022,48(9):96-104.
Authors:FU Pengcheng  YANG Guan  LIU Xiaoming  LIU Yang  ZHANG Ziming  CHENG Xi
Affiliation:1. School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China;2. Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhengzhou 450007, China;3. School of Telecommunications Engineering, Xidian University, Xi'an 710071, China
Abstract: As an important task in multimodal data processing, Visual Question Answering (VQA) needs to associate and represent information from different modalities. However, existing VQA models cannot effectively distinguish similar target objects and cannot accurately express the spatial relationships between them, which degrades overall model performance. To fully exploit the fine-grained and spatial-relationship information in VQA images and questions, this study combines spatial-domain and frequency-domain features with the Bottom-Up and Top-Down attention (BUTD) model and the Modular Co-Attention Network (MCAN) model to construct a multi-dimensional enhanced attention model (BUDR) and a modular co-enhanced attention network model (MCDR). The BUDR and MCDR models use the Discrete Cosine Transform (DCT) to obtain frequency information, mitigating the loss of image detail, and a Relation Network (RN) to learn spatial-structure and latent relational information, reducing misalignment between image and question features and strengthening the models' reasoning capability. Experimental results on the VQA v2.0 dataset and the test-dev validation set show that the BUDR and MCDR models enhance fine-grained image recognition and improve the correlation between image and question target objects; compared with the BUTD and MCAN models, their prediction accuracy increases by 0.14 and 0.25 percentage points, respectively.
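The abstract states that DCT is used to recover frequency information and reduce the loss of fine image detail. The paper does not give implementation details, so the following is only a minimal sketch of the general idea: applying a 2-D DCT to an image region and keeping the low-frequency coefficient block as a compact frequency-domain feature vector (the region size and block size `k` here are illustrative assumptions, not values from the paper).

```python
import numpy as np
from scipy.fft import dctn

def dct_frequency_features(region: np.ndarray, k: int = 8) -> np.ndarray:
    """Return the k x k low-frequency 2-D DCT coefficients of a region,
    flattened into a feature vector (illustrative sketch only)."""
    coeffs = dctn(region, norm="ortho")  # 2-D type-II DCT
    return coeffs[:k, :k].flatten()      # keep the low-frequency block

# Toy example: a random 32x32 "region" stands in for a detected object.
region = np.random.rand(32, 32)
feat = dct_frequency_features(region)
print(feat.shape)  # (64,)
```

In practice such frequency features would be concatenated with or fused into the spatial-domain object features before attention, but the exact fusion used by BUDR/MCDR is not specified in this abstract.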
Keywords:Discrete Cosine Transform(DCT)  fine-grained identification  Relation Network(RN)  attention mechanism  feature fusion  
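The Relation Network mentioned above models spatial structure by reasoning over all pairs of object features conditioned on the question. As a rough illustration of that pairwise-aggregation pattern (not the authors' architecture: the toy dimensions, the random projection standing in for a learned MLP, and the function names are all assumptions), an RN-style module can be sketched as:

```python
import numpy as np

def relation_network(objects: np.ndarray, q: np.ndarray, g) -> np.ndarray:
    """Sum a relation function g over all ordered object pairs,
    conditioned on the question embedding q (RN-style aggregation)."""
    n = objects.shape[0]
    return sum(g(objects[i], objects[j], q)
               for i in range(n) for j in range(n))  # fed to a final MLP

# Toy g: concatenate the pair and question, project with a fixed random
# matrix (a stand-in for a learned relation MLP).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 2 * 4 + 3))
g = lambda oi, oj, q: W @ np.concatenate([oi, oj, q])

objs = rng.standard_normal((5, 4))  # 5 objects, 4-dim features each
q = rng.standard_normal(3)          # 3-dim question embedding
out = relation_network(objs, q, g)
print(out.shape)  # (16,)
```

Because the aggregation sums over every object pair, the output is permutation-invariant in the objects, which is what lets the module capture relative spatial structure rather than object order.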