Funding: National Natural Science Foundation of China (61772576, 61906141); Natural Science Foundation of Shaanxi Province (2020JQ-317); Science and Technology Research Project of the Henan Provincial Department of Science and Technology (182102210126).

Received: 2021-08-12
Revised: 2021-10-04

Visual Question Answering Model Based on Spatial Relation and Frequency Feature
FU Pengcheng,YANG Guan,LIU Xiaoming,LIU Yang,ZHANG Ziming,CHENG Xi.Visual Question Answering Model Based on Spatial Relation and Frequency Feature[J].Computer Engineering,2022,48(9):96-104.
Authors:FU Pengcheng  YANG Guan  LIU Xiaoming  LIU Yang  ZHANG Ziming  CHENG Xi
Affiliation:1. School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China;2. Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhengzhou 450007, China;3. School of Telecommunications Engineering, Xidian University, Xi'an 710071, China
Abstract: As an important task in multimodal data processing, Visual Question Answering (VQA) needs to associate and represent information from different modalities. However, existing VQA models cannot effectively distinguish similar target objects and cannot accurately express the spatial relationships between them, which degrades overall model performance. To fully exploit the fine-grained and spatial-relationship information in VQA images and questions, this study combines spatial-domain and frequency-domain features with the Bottom-Up and Top-Down attention (BUTD) model and the Modular Co-Attention Network (MCAN) model to construct a multi-dimensional enhanced attention model (BUDR) and a modular co-enhanced attention network model (MCDR). The BUDR and MCDR models use the Discrete Cosine Transform (DCT) to obtain frequency information, mitigating the loss of image detail, and a Relation Network (RN) to learn spatial-structure and latent relational information, reducing misalignment between image and question features and strengthening the models' reasoning capability. Experimental results on the VQA v2.0 dataset and the test-dev validation set show that the BUDR and MCDR models enhance fine-grained image recognition and improve the correlation between image and question target objects; compared with the BUTD and MCAN models, their prediction accuracy increases by 0.14 and 0.25 percentage points, respectively.
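The abstract states that DCT is used to recover frequency information and reduce the loss of fine image detail. The paper does not give implementation details, so the following is only a minimal sketch of the general idea: applying a 2-D DCT to an image region and keeping the low-frequency coefficient block as a compact frequency-domain feature vector (the region size and block size `k` here are illustrative assumptions, not values from the paper).

```python
import numpy as np
from scipy.fft import dctn

def dct_frequency_features(region: np.ndarray, k: int = 8) -> np.ndarray:
    """Return the k x k low-frequency 2-D DCT coefficients of a region,
    flattened into a feature vector (illustrative sketch only)."""
    coeffs = dctn(region, norm="ortho")  # 2-D type-II DCT
    return coeffs[:k, :k].flatten()      # keep the low-frequency block

# Toy example: a random 32x32 "region" stands in for a detected object.
region = np.random.rand(32, 32)
feat = dct_frequency_features(region)
print(feat.shape)  # (64,)
```

In practice such frequency features would be concatenated with or fused into the spatial-domain object features before attention, but the exact fusion used by BUDR/MCDR is not specified in this abstract.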
Keywords:Discrete Cosine Transform(DCT)  fine-grained identification  Relation Network(RN)  attention mechanism  feature fusion  
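The Relation Network mentioned above models spatial structure by reasoning over all pairs of object features conditioned on the question. As a rough illustration of that pairwise-aggregation pattern (not the authors' architecture: the toy dimensions, the random projection standing in for a learned MLP, and the function names are all assumptions), an RN-style module can be sketched as:

```python
import numpy as np

def relation_network(objects: np.ndarray, q: np.ndarray, g) -> np.ndarray:
    """Sum a relation function g over all ordered object pairs,
    conditioned on the question embedding q (RN-style aggregation)."""
    n = objects.shape[0]
    return sum(g(objects[i], objects[j], q)
               for i in range(n) for j in range(n))  # fed to a final MLP

# Toy g: concatenate the pair and question, project with a fixed random
# matrix (a stand-in for a learned relation MLP).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 2 * 4 + 3))
g = lambda oi, oj, q: W @ np.concatenate([oi, oj, q])

objs = rng.standard_normal((5, 4))  # 5 objects, 4-dim features each
q = rng.standard_normal(3)          # 3-dim question embedding
out = relation_network(objs, q, g)
print(out.shape)  # (16,)
```

Because the aggregation sums over every object pair, the output is permutation-invariant in the objects, which is what lets the module capture relative spatial structure rather than object order.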