Research on visual question answering model based on composite graphic features

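Cite this article:	Qiu Nan, Gu Yuwan, Shi Lin, Li Ning, Zhuang Lihua, Xu Shoukun. Research on visual question answering model based on composite graphic features[J]. Application Research of Computers, 2021, 38(8): 2293-2298.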

Authors:	Qiu Nan, Gu Yuwan, Shi Lin, Li Ning, Zhuang Lihua, Xu Shoukun

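Affiliation:	School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, Changzhou University, Changzhou 213164, Jiangsu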

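Funding:	National Natural Science Foundation of China (61906021); Changzhou Key Laboratory of Urban Big Data Analysis and Application Technology (CM20193007); Jiangsu Province Postgraduate Research Innovation Program (KYCX21_2829)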

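Abstract:	Current mainstream visual question answering (VQA) methods use region features as the image representation, which leads to high training complexity and slow inference. To address this, the paper proposes a convolutional network based on composite vision and language (composite visionlinguistic convnet, CVlCN) to represent the images in VQA tasks. The method combines image features and question semantics into composite image-text features through composite learning, then computes attention distributions over these features along the spatial and channel dimensions, so that visual information relevant to the question semantics is selectively retained. Tests on the VQA-v2 dataset show a clear improvement in accuracy on the VQA task, with an overall accuracy of 64.4%. The model also has lower computational complexity and faster inference.
Keywords:	visual question answering; composite vision-language features; region features; multimodal fusion
Received:	2020/12/15
Revised:	2021/7/10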
Research on visual question answering model based on composite graphic features

Qiu Nan, Gu Yuwan, Shi Lin, Li Ning, Zhuang Lihua, Xu Shoukun. Research on visual question answering model based on composite graphic features[J]. Application Research of Computers, 2021, 38(8): 2293-2298.

Authors:	Qiu Nan, Gu Yuwan, Shi Lin, Li Ning, Zhuang Lihua, Xu Shoukun

Affiliation:	Changzhou University

Abstract:	Current mainstream visual question answering methods use regional features as image representations, which incurs high training complexity and slow inference. To address these problems, this paper proposes a convolutional network based on composite vision and language (composite visionlinguistic ConvNet, CVlCN) to represent the images in visual question answering tasks. The proposed method combines image features and question semantics into composite image-text features through composite learning, and then computes the attention distribution of the composite image-text features over the spatial and channel dimensions to selectively retain the visual information related to the question semantics. Experimental results on the VQA-v2 dataset show that the proposed method clearly improves accuracy on the visual question answering task, reaching an overall accuracy of 64.4%. The model also has lower computational complexity and faster inference.

Keywords:	visual question answering; composite visionlinguistic feature; regional feature; multimodal fusion
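
The abstract names two steps, fusing image features with question semantics and then applying attention over space and channels, but it does not specify the layers involved. The NumPy sketch below is therefore only an illustration under stated assumptions, not the CVlCN architecture: fusion is assumed to be a channel-wise product, channel attention a sigmoid gate on the global average, and spatial attention a softmax over the channel-averaged map. All function names, shapes, and random inputs are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def composite_features(image_feats, question_vec):
    """Fuse a (C, H, W) grid feature map with a (C,) question vector.

    Assumption: fusion is a channel-wise product broadcast over all positions.
    """
    return image_feats * question_vec[:, None, None]


def channel_attention(feats):
    """Scale each channel by a sigmoid gate on its global average (assumed form)."""
    gate = sigmoid(feats.mean(axis=(1, 2)))      # shape (C,)
    return feats * gate[:, None, None], gate


def spatial_attention(feats):
    """Scale each position by a softmax over the channel-averaged map (assumed form)."""
    scores = feats.mean(axis=0)                  # shape (H, W)
    exp = np.exp(scores - scores.max())          # numerically stable softmax
    weights = exp / exp.sum()                    # sums to 1 over all H*W positions
    return feats * weights[None, :, :], weights


# Hypothetical inputs: random values stand in for real CNN and question-encoder outputs.
rng = np.random.default_rng(0)
image_feats = rng.standard_normal((8, 4, 4))    # C=8 channels on a 4x4 grid
question_vec = rng.standard_normal(8)           # question embedding projected to C dims

fused = composite_features(image_feats, question_vec)
fused, channel_gate = channel_attention(fused)
attended, spatial_weights = spatial_attention(fused)
pooled = attended.sum(axis=(1, 2))              # (C,) vector an answer classifier could consume
```

Because the sketch operates on a dense grid feature map rather than on per-object region proposals, it avoids a separate detection stage, which is the kind of cost the abstract attributes to region features.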
This article is indexed in Wanfang Data and other databases.