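Recent Advances in Vision-and-Language Navigation
Cite this article: Sima Shuang-Lin, Huang Yan, He Ke-Ji, An Dong, Yuan Hui, Wang Liang. Recent advances in vision-and-language navigation. Acta Automatica Sinica, 2023, 49(1): 1−14 doi: 10.16383/j.aas.c210352
Authors: SIMA Shuang-Lin  HUANG Yan  HE Ke-Ji  AN Dong  YUAN Hui  WANG Liang
Affiliations: 1. Center of Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; 2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049; 3. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; 4. Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Shanghai 200031; 5. Artificial Intelligence Research, Chinese Academy of Sciences, Jiaozhou 266300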
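Abstract: Vision-and-language navigation is the task in which an agent, placed in an unknown environment, starts from an initial position, analyzes a language instruction together with its surrounding visual environment, and dynamically generates a sequence of actions that finally takes it to the target position. The task has broad application prospects and has received wide attention in multimodal research in recent years. Unlike traditional multimodal tasks such as visual question answering and image captioning, vision-and-language navigation is more challenging in multimodal fusion and reasoning. However, because of the shortcomings of traditional imitation learning and the scarcity of data, models suffer from insufficient generalization ability. This paper systematically reviews research progress in vision-and-language navigation. It first briefly introduces the datasets and baseline models for the task; it then comprehensively presents representative methods, covering four aspects: data augmentation, search strategies, training methods, and action spaces; finally, based on experiments on different datasets, it compares the strengths and weaknesses of the models and discusses possible future research directions.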

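Keywords: Vision-and-language navigation   vision-and-language comprehension   cross-modal matching   embodied artificial intelligence
Received: 2021-04-22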

Recent Advances in Vision-and-language Navigation
Sima Shuang-Lin, Huang Yan, He Ke-Ji, An Dong, Yuan Hui, Wang Liang. Recent advances in vision-and-language navigation. Acta Automatica Sinica, 2023, 49(1): 1−14 doi: 10.16383/j.aas.c210352
Authors:SIMA Shuang-Lin  HUANG Yan  HE Ke-Ji  AN Dong  YUAN Hui  WANG Liang
Affiliation: 1. Center of Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; 2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049; 3. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; 4. Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Shanghai 200031; 5. Artificial Intelligence Research, Chinese Academy of Sciences, Jiaozhou 266300
Abstract: In vision-and-language navigation, an agent in an unknown environment starts from an initial location, analyzes language instructions together with the surrounding visual environment, dynamically generates a series of actions, and finally navigates to the goal location. Because of its broad application prospects, the task has received increasing attention from researchers in recent years, especially in multi-modal research. Unlike traditional multi-modal tasks such as visual question answering and image captioning, vision-and-language navigation is more challenging in terms of dynamic reasoning and multi-modal fusion. However, owing to the limitations of imitation learning and the scarcity of data, models suffer from insufficient generalization. In this paper, we review current advances in research on vision-and-language navigation. First, we briefly introduce the data sets and baseline models used in vision-and-language navigation. Then, we comprehensively introduce the representative models in vision-and-language navigation, covering data augmentation, search strategies, training methods, and action spaces. Finally, based on experiments on different data sets, we analyze the advantages and disadvantages of the existing models and discuss possible future research directions.
Keywords:Vision-and-language navigation  vision-and-language comprehension  cross-modal matching  embodied artificial intelligence