Matching with agreement for cross-modal image-text retrieval
Citation: GONG Dahan, CHEN Hui, CHEN Shijiang, BAO Yongjun, DING Guiguang. Matching with agreement for cross-modal image-text retrieval[J]. CAAI Transactions on Intelligent Systems, 2021, 16(6): 1143-1150.
Authors: GONG Dahan, CHEN Hui, CHEN Shijiang, BAO Yongjun, DING Guiguang
Affiliations: 1. School of Software, Tsinghua University, Beijing 100084, China; 2. Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China; 3. Department of Automation, Tsinghua University, Beijing 100084, China; 4. Zhuoxi Institute of Brain and Intelligence, Hangzhou 311121, China; 5. JD Group, Beijing 100176, China
Abstract: Cross-modal image-text retrieval is important for understanding the correspondence between vision and language. Most existing methods use attention modules to mine region-to-word and word-to-region alignments and thereby explore fine-grained cross-modal associations. However, they overlook the inconsistent alignments that such dual attention can produce. This paper therefore proposes a matching-with-agreement method that exploits alignment consistency to improve cross-modal retrieval. Attention is used to establish cross-modal alignments, and on top of these alignments a cross-modal agreement based on competitive voting is designed; this agreement measures the consistency of the alignments and effectively improves image-text retrieval performance. Extensive experiments on the Flickr30K and MS COCO benchmarks demonstrate the effectiveness of the proposed method.

Keywords: artificial intelligence; computer vision; vision and language; cross-modal retrieval; matching with agreement; attention; convolutional neural network; recurrent neural network; gated recurrent unit

Matching with agreement for cross-modal image-text retrieval
GONG Dahan, CHEN Hui, CHEN Shijiang, BAO Yongjun, DING Guiguang. Matching with agreement for cross-modal image-text retrieval[J]. CAAI Transactions on Intelligent Systems, 2021, 16(6): 1143-1150.
Authors:GONG Dahan    CHEN Hui    CHEN Shijiang  BAO Yongjun  DING Guiguang  
Affiliation:1. School of Software, Tsinghua University, Beijing 100084, China;2. Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China;3. Department of Automation, Tsinghua University, Beijing 100084, China;4. Zhuoxi Institute of Brain and Intelligence, Hangzhou 311121, China;5. JD Group, Beijing 100176, China
Abstract:The task of cross-modal image-text retrieval is important to understand the correspondence between vision and language. Most existing methods leverage different attention modules to explore region-to-word and word-to-region alignments and study fine-grained cross-modal correlations. However, the inconsistent alignment problem arising from such dual attention has rarely been considered. This study proposes a matching with agreement (MAG) method, which aims to take advantage of alignment consistency to enhance cross-modal retrieval performance. The attention mechanism is adopted to achieve cross-modal association alignment, which is then used to perform a cross-modal matching agreement with a novel competitive voting strategy. This agreement evaluates cross-modal matching consistency and effectively improves performance. Extensive experiments on two benchmark datasets, Flickr30K and MS COCO, show that the MAG method achieves state-of-the-art performance, demonstrating its effectiveness.
Keywords:artificial intelligence  computer vision  vision and language  cross-modal retrieval  matching with agreement  attention  convolutional neural network  recurrent neural network  gated recurrent unit
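The abstract does not give implementation details, but the core idea it describes, checking whether region-to-word and word-to-region attention alignments agree, can be illustrated with a minimal sketch. The code below is an assumption-laden stand-in, not the authors' method: the function name and the simple mutual-argmax voting rule are hypothetical simplifications of the paper's competitive voting strategy, using hardened (argmax) alignments in place of soft attention.

```python
import numpy as np

def matching_agreement(regions: np.ndarray, words: np.ndarray) -> float:
    """Fraction of region-word matches that are mutually consistent.

    regions: (R, d) image-region features; words: (W, d) word features.
    Hypothetical stand-in for the paper's competitive-voting agreement.
    """
    sim = regions @ words.T    # (R, W) similarity matrix
    r2w = sim.argmax(axis=1)   # region-to-word alignment (attention hardened to argmax)
    w2r = sim.argmax(axis=0)   # word-to-region alignment
    # A match is consistent when region i picks word j AND word j picks region i.
    mutual = sum(1 for i, j in enumerate(r2w) if w2r[j] == i)
    return mutual / len(r2w)   # agreement score in [0, 1]
```

An image-text pair whose two attention directions disagree receives a lower score, which is the kind of consistency signal the abstract says MAG exploits during retrieval.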