Research on visual question answering techniques


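Cite this article:	Liu Mingyang, Wang Ruomei, Zhou Fan, Lin Ge. Video Question Answering Scheme Based on Multimodal Knowledge Active Learning[J]. Journal of Computer Research and Development, 2024, 61(4): 889-902. DOI: 10.7544/issn1000-1239.202221008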

Authors:	Liu Mingyang, Wang Ruomei, Zhou Fan, Lin Ge

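Affiliation:	1. National Engineering Research Center of Digital Life, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006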

Funding:	National Key Research and Development Program of China (2021YFF0900900)

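Abstract:	Video question answering is a hot research topic in artificial intelligence. In feature extraction, existing methods fail to capture the motion details of visual objects, which can lead to false causal relationships being established. In addition, in data fusion and reasoning, existing methods lack an effective active learning ability and struggle to acquire prior knowledge beyond what feature extraction provides, which limits the model's deep understanding of multimodal content. To address these problems, we first design an explicit multimodal feature extraction module that builds the motion trajectory of each visual object by capturing the semantic associations among visual objects in the image sequence and their dynamic relationships with the surrounding environment. Supplementing static content with dynamic content then provides a more accurate video feature representation for data fusion and reasoning. Second, we propose a knowledge self-enhanced multimodal data fusion and reasoning model that achieves self-improvement and focused logical reasoning in multimodal information understanding, strengthening deep understanding of multimodal features and reducing dependence on prior knowledge. Finally, we propose a video question answering scheme based on multimodal knowledge active learning. Experimental results show that the scheme outperforms existing state-of-the-art video question answering algorithms, and extensive ablation and visualization experiments also verify its soundness.
Keywords:	video question answering; data fusion and reasoning; multimodal active learning; video detail description extraction; deep learning
Received:	2022-12-16
Revised:	2023-06-26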

Liu Mingyang, Wang Ruomei, Zhou Fan, Lin Ge. Video Question Answering Scheme Based on Multimodal Knowledge Active Learning[J]. Journal of Computer Research and Development, 2024, 61(4): 889-902. DOI: 10.7544/issn1000-1239.202221008

Authors:	Liu Mingyang, Wang Ruomei, Zhou Fan, Lin Ge

Affiliation:	1. National Engineering Research Center of Digital Life, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006

Abstract:	Video question answering requires models to understand, fuse, and reason about the multimodal data in videos so as to help people quickly retrieve, analyze, and summarize complex video scenes, and it has become a hot research topic in artificial intelligence. However, existing methods cannot capture the motion details of visual objects during feature extraction, which may lead to false causality. In addition, in data fusion and reasoning, existing methods lack an effective active learning ability, making it difficult to obtain prior knowledge beyond feature extraction, which limits the model's deep understanding of multimodal content. To address these issues, we propose a video question answering solution based on multimodal knowledge active learning. The solution captures the semantic correlations among visual targets in image sequences and their dynamic relationships with the surrounding environment to establish the motion trajectory of each visual target. Static content is then supplemented with dynamic content to provide a more accurate video feature representation for data fusion and reasoning. Next, through a knowledge auto-enhancement multimodal data fusion and reasoning model, the solution achieves self-improvement and focused logical reasoning in multimodal information understanding, filling the gap in deep understanding of multimodal content. Experimental results show that our scheme outperforms state-of-the-art video question answering algorithms, and extensive ablation and visualization experiments also verify its soundness.
