Comprehensive review of multimodal processing methods for image-text

Citation:	Jiang Limei, Li Binglong. Comprehensive review of multimodal processing methods for image-text[J]. Application Research of Computers, 2024, 41(5).

Authors:	Jiang Limei, Li Binglong

Affiliation:	Department of Computer Science, North China Electric Power University (Baoding), Baoding, Hebei 071003

Funding:	Fundamental Research Funds for the Central Universities, North China Electric Power University (2022MS102)

Abstract:	In deep learning, solving real-world application problems often requires combining information from multiple modalities for reasoning and decision-making, and vision and language are two important modalities in the interaction process. In many application scenarios, multimodal tasks suffer from problems such as overly complex model architectures and inefficient training methods. To address these problems, this paper reviews representative work on image-text multimodal processing from the past five years. It first starts from the mainstream multimodal tasks and introduces the related image-text multimodal datasets and pre-training objectives. Second, focusing on vision-language models built on the Transformer and taking feature extraction methods into account, it analyzes these models from the perspectives of multimodal architecture organization and cross-modal fusion methods, and summarizes and compares the commonalities and differences of the different processing strategies. It then introduces model lightweighting methods from several angles, including data input and structural components. Finally, it discusses future research directions for image-text multimodal methods.

Keywords:	multimodal; architecture; fusion; lightweight

Received:	2023-08-28
Revised:	2024-04-07
