Matching with agreement for cross-modal image-text retrieval
Citation: GONG Dahan, CHEN Hui, CHEN Shijiang, BAO Yongjun, DING Guiguang. Matching with agreement for cross-modal image-text retrieval[J]. CAAI Transactions on Intelligent Systems, 2021, 16(6): 1143-1150.
Authors: GONG Dahan, CHEN Hui, CHEN Shijiang, BAO Yongjun, DING Guiguang
Affiliations: 1. School of Software, Tsinghua University, Beijing 100084, China; 2. Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China; 3. Department of Automation, Tsinghua University, Beijing 100084, China; 4. Zhuoxi Institute of Brain and Intelligence, Hangzhou 311121, China; 5. JD Group, Beijing 100176, China
Abstract: Cross-modal image-text retrieval is important for understanding the correspondence between vision and language. Most existing methods use attention modules to mine region-to-word and word-to-region alignments and thereby explore fine-grained cross-modal associations. However, they overlook the inconsistent alignments that such dual attention can produce. This paper therefore proposes a matching-with-agreement method that exploits alignment consistency to improve cross-modal retrieval. Attention is used to establish cross-modal alignments, and on top of these alignments a cross-modal agreement based on competitive voting is designed; this agreement measures the consistency of the alignments and effectively improves image-text retrieval performance. Extensive experiments on the Flickr30K and MS COCO benchmarks demonstrate the effectiveness of the proposed method.

Keywords: artificial intelligence; computer vision; vision and language; cross-modal retrieval; matching with agreement; attention; convolutional neural network; recurrent neural network; gated recurrent unit

Matching with agreement for cross-modal image-text retrieval
GONG Dahan, CHEN Hui, CHEN Shijiang, BAO Yongjun, DING Guiguang. Matching with agreement for cross-modal image-text retrieval[J]. CAAI Transactions on Intelligent Systems, 2021, 16(6): 1143-1150.
Authors:GONG Dahan    CHEN Hui    CHEN Shijiang  BAO Yongjun  DING Guiguang  
Affiliation:1. School of Software, Tsinghua University, Beijing 100084, China;2. Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China;3. Department of Automation, Tsinghua University, Beijing 100084, China;4. Zhuoxi Institute of Brain and Intelligence, Hangzhou 311121, China;5. JD Group, Beijing 100176, China
Abstract:The task of cross-modal image-text retrieval is important to understand the correspondence between vision and language. Most existing methods leverage different attention modules to explore region-to-word and word-to-region alignments and study fine-grained cross-modal correlations. However, the inconsistent alignment problem arising from such dual attention has rarely been considered. This study proposes a matching with agreement (MAG) method, which aims to take advantage of alignment consistency to enhance cross-modal retrieval performance. The attention mechanism is adopted to achieve cross-modal association alignment, which is then used to perform a cross-modal matching agreement with a novel competitive voting strategy. This agreement evaluates cross-modal matching consistency and effectively improves performance. Extensive experiments on two benchmark datasets, Flickr30K and MS COCO, show that the MAG method achieves state-of-the-art performance, demonstrating its effectiveness.
Keywords:artificial intelligence  computer vision  vision and language  cross-modal retrieval  matching with agreement  attention  convolutional neural network  recurrent neural network  gated recurrent unit
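The abstract does not give implementation details, but the core idea it describes, checking whether region-to-word and word-to-region attention alignments agree, can be illustrated with a minimal sketch. The code below is an assumption-laden stand-in, not the authors' method: the function name and the simple mutual-argmax voting rule are hypothetical simplifications of the paper's competitive voting strategy, using hardened (argmax) alignments in place of soft attention.

```python
import numpy as np

def matching_agreement(regions: np.ndarray, words: np.ndarray) -> float:
    """Fraction of region-word matches that are mutually consistent.

    regions: (R, d) image-region features; words: (W, d) word features.
    Hypothetical stand-in for the paper's competitive-voting agreement.
    """
    sim = regions @ words.T    # (R, W) similarity matrix
    r2w = sim.argmax(axis=1)   # region-to-word alignment (attention hardened to argmax)
    w2r = sim.argmax(axis=0)   # word-to-region alignment
    # A match is consistent when region i picks word j AND word j picks region i.
    mutual = sum(1 for i, j in enumerate(r2w) if w2r[j] == i)
    return mutual / len(r2w)   # agreement score in [0, 1]
```

An image-text pair whose two attention directions disagree receives a lower score, which is the kind of consistency signal the abstract says MAG exploits during retrieval.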