
Funding: Supported by National Key Research and Development Program of China (2019YFB1405100) and National Natural Science Foundation of China (61802380)
Received: 2022-09-26

Context-assisted Transformer for Image Captioning
Lian Zheng, Wang Rui, Li Hai-Chang, Yao Hui, Hu Xiao-Hui. Context-assisted transformer for image captioning. Acta Automatica Sinica, 2023, 49(9): 1889−1903. doi: 10.16383/j.aas.c220767
Authors: LIAN Zheng  WANG Rui  LI Hai-Chang  YAO Hui  HU Xiao-Hui
Affiliation: 1. University of Chinese Academy of Sciences, Beijing 101408, China; 2. Science & Technology on Integrated Information System Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Abstract: The cross attention mechanism has made significant progress in modeling the relationship between semantic queries and image regions in image captioning. However, its visual coherence remains underexplored. To fill this gap, we propose a novel context-assisted cross attention (CACA) mechanism. With the help of a historical context memory (HCM), CACA fully considers the potential influence of previously attended visual cues on the generation of the current attention context. Moreover, we present a regularization method, called adaptive weight constraint (AWC), which restricts the total weight that each CACA module assigns to historical contexts. We apply CACA and AWC to the Transformer model and construct a context-assisted transformer (CAT) for image captioning. Experimental results on the MS COCO (Microsoft common objects in context) dataset demonstrate that our method achieves consistent improvements over current state-of-the-art methods.
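The abstract's core idea — cross attention that also attends over a memory of previously generated attention contexts, with a cap on the weight given to that history — can be illustrated in a few lines. The following is a minimal NumPy sketch, not the authors' implementation: it assumes single-head scaled dot-product attention in which past contexts are pooled with the image-region features, and an AWC-style hinge penalty on the total attention mass placed on historical contexts (the paper's exact regularizer and architecture may differ).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def caca_step(query, regions, memory, d=64):
    """One decoding step of a context-assisted cross attention sketch.

    query:   (d,)   current semantic query from the decoder
    regions: (n, d) image region features, used as keys and values
    memory:  list of (d,) past attention contexts (the historical context memory)
    """
    # Pool image regions and historical contexts into one candidate set,
    # so previously attended visual cues can influence the current context.
    if memory:
        cand = np.vstack([regions] + [m[None] for m in memory])
    else:
        cand = regions
    scores = cand @ query / np.sqrt(d)   # scaled dot-product scores
    weights = softmax(scores)            # attention over regions + history
    context = weights @ cand             # current attention context
    memory.append(context)               # store it for later steps
    return context, weights

def awc_penalty(weights, n_regions, tau=0.3):
    """Hinge penalty on the total attention mass assigned to historical
    contexts beyond a threshold tau (illustrative form only)."""
    excess = weights[n_regions:].sum() - tau
    return max(excess, 0.0)
```

In the full CAT model such a penalty would be added to the training loss of every CACA module; here it only illustrates the constraint on how much each step may rely on history rather than on the image regions themselves.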
Keywords: Image captioning  attention mechanism  transformer  visual coherence
