Comprehensive review of visual-language-oriented multimodal pre-training methods
Citation: Zhang Haoyu, Wang Tianbao, Li Mengze, Zhao Zhou, Pu Shiliang, Wu Fei. Comprehensive review of visual-language-oriented multimodal pre-training methods[J]. Journal of Image and Graphics, 2022, 27(9): 2652-2682.
Authors: Zhang Haoyu, Wang Tianbao, Li Mengze, Zhao Zhou, Pu Shiliang, Wu Fei
Affiliation: College of Computer Science and Technology, Zhejiang University, Hangzhou 310013, China; Hangzhou Hikvision Digital Technology Co., Ltd., Hangzhou 310051, China
Funding: National Key Research and Development Program of China (2020YFC0832500); Science and Technology Program of Zhejiang Province (2022C01044)
Abstract: In multimodal machine learning, manually annotated data produced for a specific task is expensive and transfers poorly across tasks, so extensive retraining is required, making the training of multiple tasks inefficient and wasteful of resources. Pre-trained models are instead trained on large-scale data, typically with self-supervised objectives, extracting and fusing information from the different modalities in a dataset to learn the general knowledge representations it contains, so that they can serve a wide range of related downstream visual-language multimodal tasks; this approach has gradually become the mainstream method across the fields of artificial intelligence. Relying on large-scale image-text pairs and video data collected from the Internet, together with advances in pre-training methods represented by self-supervised learning, visual-language multimodal pre-trained models have largely broken down the barriers between different visual-language tasks, improving the efficiency of multi-task training and boosting performance on specific tasks. This paper surveys progress in visual-language multimodal pre-training. It first summarizes common pre-training datasets and pre-training methods, then gives a systematic overview of both the latest and the classic methods, divided by input source into two broad categories, image-text pre-trained models and video-text multimodal models, describes the commonalities and differences among the methods, and collates the experimental results of each model on specific downstream tasks. Finally, it summarizes the challenges facing visual-language pre-training and future development trends.

Keywords: multimodal machine learning; visual-language multimodality; pre-training; self-supervised learning; image-text pre-training; video-text pre-training
Received: 2022-03-10
Revised: 2022-06-15

Extended abstract: Multimodal machine learning is constrained by the cost of labor-intensive manual annotation and by poor transfer across tasks, which demands extensive retraining and results in low efficiency and wasted resources when training multiple tasks. To learn general knowledge representations and serve a wide range of related downstream visual-language multimodal tasks, pre-trained models are trained on large-scale data, typically through self-supervision, extracting and integrating information from the different modalities of a dataset. Because human labels are expensive, research on pre-trained models focuses on cheaply labeled data: a model is first pre-trained on cheap, weakly labeled data and then fine-tuned with a small amount of costly human annotation. Since cheaply labeled data carries less information and more noise, pre-training usually requires large-scale data and long training times. A model pre-trained on large-scale unlabeled data not only transfers more general knowledge to the target task but also provides a better parameter initialization. Future multimodal applications hold potential in areas such as learning from demonstration, sentiment analysis, and task-oriented large-scale human-computer interaction. Multimodal pre-trained models can serve as a pathway from narrow toward more general artificial intelligence, making it possible to transfer multi-task learning results to unsupervised, multi-domain data automatically and quickly. Text-only pre-trained models cover only part of the data available online, leaving richer data underexploited, while multimodal approaches benefit information gathering, context perception, knowledge learning, and demonstration. Toward general-purpose artificial intelligence models, pre-training has been developing from single-modal to multimodal, and since 2019 the rapid growth of pre-trained models has extended to the field of vision-language interaction. Thanks to the large-scale image-text pairs and video data available online and to advances in pre-training techniques such as self-supervised learning, visual-language multimodal pre-trained models have largely bridged the gap between different visual-language tasks, improving the efficiency of multi-task training and the performance of specific tasks. Current multimodal research still faces challenges in organizing intelligent systems, perceiving multimodal information, and bridging the semantic gap. We review existing pre-training datasets and pre-training methods, and give a systematic overview of both the latest and the classic models, grouped by input source into image-text pre-trained models and video-text multimodal models. The commonalities and differences among these methods are critically analyzed, and the experimental results of each model on specific downstream tasks are summarized. Finally, the challenges and future research directions of visual-language pre-training are discussed.
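To make the pre-train/fine-tune paradigm described above concrete, the following is a minimal, illustrative PyTorch sketch, not taken from the paper itself, of one common self-supervised objective: image-text contrastive learning on paired web data, with the pre-trained encoder then reused as initialization for a downstream task head. All names here (ToyImageEncoder, ToyTextEncoder, contrastive_pretrain_step) are hypothetical placeholders, not the architectures surveyed in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    """Stand-in for a visual backbone; maps image features to a shared space."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(2048, dim)
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class ToyTextEncoder(nn.Module):
    """Stand-in for a text backbone; maps text features to the same space."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(512, dim)
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_pretrain_step(img_enc, txt_enc, images, texts, temperature=0.07):
    """One self-supervised step: matched image-text pairs are positives,
    all other pairings in the batch act as negatives (InfoNCE loss)."""
    v = img_enc(images)               # (B, dim) image embeddings
    t = txt_enc(texts)                # (B, dim) text embeddings
    logits = v @ t.T / temperature    # (B, B) similarity matrix
    labels = torch.arange(v.size(0))  # diagonal entries are the true pairs
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Pre-training on (stand-in) web-scale pairs, then reusing the image encoder
# as a better parameter initialization for a small labeled downstream task.
img_enc, txt_enc = ToyImageEncoder(), ToyTextEncoder()
opt = torch.optim.Adam(list(img_enc.parameters()) + list(txt_enc.parameters()), lr=1e-3)
images, texts = torch.randn(8, 2048), torch.randn(8, 512)  # random stand-in features
loss = contrastive_pretrain_step(img_enc, txt_enc, images, texts)
loss.backward()
opt.step()

head = nn.Linear(64, 10)          # downstream head, fine-tuned with few labels
logits = head(img_enc(images))    # pre-trained encoder supplies the features

The design point the sketch illustrates is the one the abstract makes: the expensive, task-specific supervision is confined to the small fine-tuning stage, while the bulk of training consumes cheap, automatically paired web data.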