
Funding: Supported by National Key Research and Development Program of China (2019YFB1405100) and National Natural Science Foundation of China (61802380)
Received: 2022-09-26

Context-assisted Transformer for Image Captioning
Lian Zheng, Wang Rui, Li Hai-Chang, Yao Hui, Hu Xiao-Hui. Context-assisted transformer for image captioning. Acta Automatica Sinica, 2023, 49(9): 1889−1903. doi: 10.16383/j.aas.c220767
Authors: LIAN Zheng  WANG Rui  LI Hai-Chang  YAO Hui  HU Xiao-Hui
Affiliation: 1. University of Chinese Academy of Sciences, Beijing 101408, China; 2. Science & Technology on Integrated Information System Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Abstract: The cross attention mechanism has made significant progress in modeling the relationship between semantic queries and image regions in image captioning. However, its visual coherence remains underexplored. To fill this gap, we propose a novel context-assisted cross attention (CACA) mechanism. With the help of a historical context memory (HCM), CACA fully considers the potential influence of previously attended visual cues on the generation of the current attention context. Moreover, we present a regularization method, called adaptive weight constraint (AWC), which restricts the total weight that each CACA module assigns to historical contexts. We apply CACA and AWC to the Transformer model and construct a context-assisted transformer (CAT) for image captioning. Experimental results on the MS COCO (Microsoft common objects in context) dataset demonstrate that our method achieves consistent improvements over current state-of-the-art methods.
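The abstract's core idea — cross attention that also attends over a memory of previously generated attention contexts, with a cap on the weight given to that history — can be illustrated in a few lines. The following is a minimal NumPy sketch, not the authors' implementation: it assumes single-head scaled dot-product attention in which past contexts are pooled with the image-region features, and an AWC-style hinge penalty on the total attention mass placed on historical contexts (the paper's exact regularizer and architecture may differ).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def caca_step(query, regions, memory, d=64):
    """One decoding step of a context-assisted cross attention sketch.

    query:   (d,)   current semantic query from the decoder
    regions: (n, d) image region features, used as keys and values
    memory:  list of (d,) past attention contexts (the historical context memory)
    """
    # Pool image regions and historical contexts into one candidate set,
    # so previously attended visual cues can influence the current context.
    if memory:
        cand = np.vstack([regions] + [m[None] for m in memory])
    else:
        cand = regions
    scores = cand @ query / np.sqrt(d)   # scaled dot-product scores
    weights = softmax(scores)            # attention over regions + history
    context = weights @ cand             # current attention context
    memory.append(context)               # store it for later steps
    return context, weights

def awc_penalty(weights, n_regions, tau=0.3):
    """Hinge penalty on the total attention mass assigned to historical
    contexts beyond a threshold tau (illustrative form only)."""
    excess = weights[n_regions:].sum() - tau
    return max(excess, 0.0)
```

In the full CAT model such a penalty would be added to the training loss of every CACA module; here it only illustrates the constraint on how much each step may rely on history rather than on the image regions themselves.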
Keywords: Image captioning  attention mechanism  transformer  visual coherence
