Survey on Multimodal Visual Language Representation Learning

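Cite this article:	DU Peng-Fei, LI Xiao-Yong, GAO Ya-Li. Survey on Multimodal Visual Language Representation Learning[J]. Journal of Software, 2021, 32(2): 327-348

Authors:	DU Peng-Fei, LI Xiao-Yong, GAO Ya-Li

Affiliation:	Key Laboratory of Trustworthy Distributed Computing and Service of Ministry of Education (Beijing University of Posts and Telecommunications), Beijing 100876; School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876

Funding:	National Natural Science Foundation of China (61370069, 61672111); NSFC-General Technology Basic Research Joint Fund (U1836215); Beijing Natural Science Foundation (4162043); National Key Research and Development Program of China (2016QY03D0605)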

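Abstract:	We live in a multimedia world built from content in many different modalities. Information in different modalities is highly correlated and complementary, and the main goal of multimodal representation learning is to uncover what the modalities have in common and what is specific to each, producing latent vectors that can represent multimodal information. This survey mainly covers research on visual-language representation, which is currently widely applied, including traditional methods based on similarity models and the current mainstream pre-training methods based on language models....
Keywords:	multimodal representation learning; representation learning; multimodal machine learning; deep learning
Received:	2020-05-11
Revised:	2020-06-26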
Survey on Multimodal Visual Language Representation Learning

DU Peng-Fei, LI Xiao-Yong, GAO Ya-Li. Survey on Multimodal Visual Language Representation Learning[J]. Journal of Software, 2021, 32(2): 327-348

Authors:	DU Peng-Fei, LI Xiao-Yong, GAO Ya-Li

Affiliation:	Key Laboratory of Trustworthy Distributed Computing and Service of Ministry of Education (Beijing University of Posts and Telecommunications), Beijing 100876, China; School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China

Abstract:	We live in a multimedia world built from content in many different modalities. Information in different modalities is highly correlated and complementary. The main purpose of multimodal representation learning is to mine the commonalities and the distinct characteristics of the different modalities and to produce latent vectors that can represent multimodal information. This article mainly introduces research on visual-language representation, which is currently widely used, including traditional methods based on similarity models and the current mainstream pre-training methods based on language models. The more effective current approach is to semanticize visual features and then generate joint representations with textual features through a powerful feature extractor. Transformer[1] is currently the mainstream network architecture and is used across representation learning tasks. The article covers the research background, a categorization of existing studies, evaluation methods, and future development trends.

Keywords:	Multimodal Representation Learning; Representation Learning; Multimodal Machine Learning; Deep Learning
This article is indexed in Wanfang Data and other databases.