Collaborative Attention Network Model for Cross-modal Retrieval
Cite this article: DENG Yi-jiao, ZHANG Feng-li, CHEN Xue-qin, AI Qing, YU Su-zhe. Collaborative Attention Network Model for Cross-modal Retrieval[J]. Computer Science, 2020, 47(4): 54-59
Authors: DENG Yi-jiao  ZHANG Feng-li  CHEN Xue-qin  AI Qing  YU Su-zhe
Affiliation: School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
Funding: Sichuan Science and Technology Program; National Natural Science Foundation of China
Abstract: With the rapid growth of multi-modal network data such as images, text, audio and video, the demand for diversified retrieval is growing rapidly, and cross-modal retrieval in particular has attracted wide attention. However, because of the heterogeneity gap, finding content similarity across different data modalities remains challenging. Most existing methods project heterogeneous data into a common subspace through a mapping matrix or a deep model and mine pairwise correlations, i.e. the global correspondence between images and text, while ignoring the local context within each modality and the fine-grained interactions between modalities, so cross-modal correlations cannot be fully exploited. To address this, this paper proposes a text-image collaborative attention network model (CoAN), which strengthens the measurement of content similarity by selectively attending to the key information in multi-modal data. CoAN uses a pre-trained VGGNet and a recurrent neural network to extract fine-grained image and text features, and a text-visual attention mechanism to capture the subtle interactions between language and vision. The model also learns hash representations of text and images, exploiting the low storage cost and high computational efficiency of hashing to speed up retrieval. Experiments on two widely used cross-modal datasets show that CoAN outperforms all compared methods in mean average precision (mAP), reaching 0.807 for text-to-image retrieval and 0.769 for image-to-text retrieval. These results indicate that CoAN helps detect the key information regions of multi-modal data and the fine-grained interactions between modalities, fully exploits the content similarity of cross-modal data, and improves retrieval accuracy.
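To make the attention step described above concrete, the following is a minimal PyTorch-style sketch of a text-image co-attention layer. It is an illustrative assumption rather than the authors' published CoAN code: the class name CoAttention, the dimensions img_dim/txt_dim/joint_dim and the max-pooled affinity scores are hypothetical choices; only the inputs follow the abstract (region features from a pre-trained VGGNet feature map, word features from a recurrent text encoder).

import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """Hypothetical text-image co-attention layer (not the paper's exact design)."""
    def __init__(self, img_dim=512, txt_dim=512, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, img_regions, txt_tokens):
        # img_regions: (B, R, img_dim), e.g. the 7x7 = 49 locations of a VGG feature map
        # txt_tokens:  (B, T, txt_dim), one RNN hidden state per word
        q_img = self.img_proj(img_regions)                           # (B, R, d)
        q_txt = self.txt_proj(txt_tokens)                            # (B, T, d)
        affinity = torch.bmm(q_img, q_txt.transpose(1, 2))           # (B, R, T) region-word affinities
        img_attn = F.softmax(affinity.max(dim=2).values, dim=1)      # weight per image region
        txt_attn = F.softmax(affinity.max(dim=1).values, dim=1)      # weight per word
        img_vec = (img_attn.unsqueeze(2) * img_regions).sum(dim=1)   # attended image feature
        txt_vec = (txt_attn.unsqueeze(2) * txt_tokens).sum(dim=1)    # attended text feature
        return img_vec, txt_vec

Under this sketch, the two attended vectors would then feed the hash layers mentioned in the abstract, so that the fine-grained region-word interactions influence the final codes.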

Keywords: Cross-modal retrieval  Collaborative attention mechanism  Fine-grained feature extraction  Deep hashing  Multi-modal data

Collaborative Attention Network Model for Cross-modal Retrieval
DENG Yi-jiao, ZHANG Feng-li, CHEN Xue-qin, AI Qing, YU Su-zhe. Collaborative Attention Network Model for Cross-modal Retrieval[J]. Computer Science, 2020, 47(4): 54-59
Authors: DENG Yi-jiao  ZHANG Feng-li  CHEN Xue-qin  AI Qing  YU Su-zhe
Affiliation: School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
Abstract: With the rapid growth of image, text, sound, video and other multi-modal network data, the demand for diversified retrieval is increasingly strong, and cross-modal retrieval has received wide attention. However, there are heterogeneity differences among modalities, and it is still challenging to find content similarity among heterogeneous data. Most existing methods project heterogeneous data into a common subspace by a mapping matrix or a deep model; in this way pairwise correlations are mined, i.e. the global correspondence between image and text, but the local context information within each modality and the fine-grained interaction information between the data are ignored, so the cross-modal correlation cannot be fully mined. Therefore, a text-image collaborative attention network model (CoAN) is proposed. To enhance the measurement of content similarity, it selectively focuses on the key information of multi-modal data. A pre-trained VGGNet model and an LSTM model are used to extract fine-grained features of image and text, and a text-image attention mechanism is used to capture the subtle interactions between text and image. At the same time, the model learns hash representations of text and image respectively, and retrieval speed is improved by exploiting the low storage cost and high computational efficiency of hashing. Experiments show that, on two widely used cross-modal datasets, the mean average precision (mAP) of CoAN is higher than that of all comparison methods, and the mAP values of text-to-image retrieval and image-to-text retrieval reach 0.807 and 0.769, respectively. The experimental results show that CoAN helps detect the key information regions and fine-grained interaction information of multi-modal data, fully mines the content similarity of cross-modal data, and improves retrieval accuracy.
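As a complement to the co-attention sketch above, the following is a minimal sketch of the hashing step mentioned in the abstract, under the common relaxation that tanh replaces the non-differentiable sign during training; the class name HashHead, the 64-bit default and the omission of any specific training loss are assumptions, not the paper's exact formulation.

import torch
import torch.nn as nn

class HashHead(nn.Module):
    """Hypothetical per-modality hashing layer: feature vector -> K-bit code."""
    def __init__(self, feat_dim=512, n_bits=64):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_bits)

    def forward(self, feat):
        # continuous relaxation in (-1, 1), differentiable for training
        return torch.tanh(self.fc(feat))

    @torch.no_grad()
    def binarize(self, feat):
        # discrete {-1, +1} code used at retrieval time
        return torch.sign(self.fc(feat))

At retrieval time the binary codes of the two modalities can be compared with Hamming distance, which is what gives hashing its low storage cost and fast lookup.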
Keywords: Cross-modal retrieval  Collaborative attention mechanism  Fine-grained feature extraction  Deep hashing  Multi-modal data