Funding: National Natural Science Foundation of China (62067004)
Received: 2021-09-14
Revised: 2021-12-20

Cross-modal tensor fusion network based on semantic relation graph for image-text retrieval
Citation: Changhong LIU, Sheng ZENG, Bin ZHANG, Yong CHEN. Cross-modal tensor fusion network based on semantic relation graph for image-text retrieval[J]. Journal of Computer Applications, 2022, 42(10): 3018-3024.
Authors: Changhong LIU, Sheng ZENG, Bin ZHANG, Yong CHEN
Affiliations: School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China; School of Business Administration, Nanchang Institute of Technology, Nanchang, Jiangxi 330029, China
Abstract: The key challenge in cross-modal image-text retrieval is to capture the semantic correlation between images and text effectively. Most existing methods learn either the global semantic correlation between image region features and text features, or the local semantic correlation between objects across modalities, while ignoring the relationships among objects within each modality and the correlation between object relationships across modalities. To address this problem, an image-text retrieval method based on a Cross-Modal Tensor Fusion Network based on Semantic Relation Graph (CMTFN-SRG) was proposed. Firstly, the relationships among image regions and among text words were modeled by a Graph Convolutional Network (GCN) and a Bidirectional Gated Recurrent Unit (Bi-GRU) respectively. Then, the fine-grained semantic correlation between the two modalities was learned by using a tensor fusion network to match the learned semantic relation graph of image regions with that of text words. At the same time, a Gated Recurrent Unit (GRU) was used to learn the global features of the image, and the global features of the image and the text were matched to capture the global inter-modality semantic correlation. The proposed method was compared with the Multi-Modality Cross Attention (MMCA) method on the benchmark datasets Flickr30K and MS-COCO. Experimental results show that the proposed method improves the Recall@1 of the text-to-image retrieval task by 2.6%, 9.0% and 4.1% on the Flickr30K, MS-COCO1K and MS-COCO5K test sets respectively, and improves the mean Recall (mR) by 0.4, 1.3 and 0.1 percentage points respectively, indicating that the method can effectively improve the precision of image-text retrieval.
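The abstract describes a multi-branch matching pipeline: a GCN over image regions, a Bi-GRU over text words, a tensor fusion step that matches the two relation graphs, and a GRU branch for global image-text matching. The following is a minimal PyTorch sketch of such a pipeline, assuming precomputed region features and a given region adjacency matrix; the low-rank bilinear (element-wise product) fusion stands in for the paper's full tensor fusion network, and all module names and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a CMTFN-SRG-style pipeline; names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionGCN(nn.Module):
    """One graph-convolution layer over image region features (semantic relation graph)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, regions, adj):
        # regions: (batch, n_regions, dim); adj: (batch, n_regions, n_regions)
        return F.relu(torch.bmm(adj, self.proj(regions)))  # aggregate related regions

class CMTFNSketch(nn.Module):
    def __init__(self, dim=512, vocab=10000):
        super().__init__()
        self.gcn = RegionGCN(dim)                      # image-side relation graph
        self.embed = nn.Embedding(vocab, dim)
        self.bigru = nn.GRU(dim, dim // 2, batch_first=True,
                            bidirectional=True)        # text-side word relations
        self.global_gru = nn.GRU(dim, dim, batch_first=True)  # global image feature
        # Low-rank bilinear pooling as a stand-in for full (outer-product) tensor fusion.
        self.fuse_v = nn.Linear(dim, dim)
        self.fuse_t = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, regions, adj, tokens):
        v = self.gcn(regions, adj)               # (b, nr, dim) region graph features
        t, _ = self.bigru(self.embed(tokens))    # (b, nw, dim) word relation features
        # Fine-grained matching: fuse every (region, word) pair, then pool.
        pair = self.fuse_v(v).unsqueeze(2) * self.fuse_t(t).unsqueeze(1)  # (b, nr, nw, dim)
        local_sim = self.score(torch.tanh(pair)).squeeze(-1).max(2).values.mean(1)
        # Global matching: GRU over the region sequence vs. mean-pooled text.
        _, g = self.global_gru(regions)
        global_sim = F.cosine_similarity(g.squeeze(0), t.mean(1), dim=-1)
        return local_sim + global_sim            # overall image-text similarity (b,)
```

In the paper the pairwise scores come from a tensor fusion network; the element-wise product above is a common low-rank approximation chosen here only to keep the sketch compact.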
Keywords: cross-modal retrieval; tensor fusion network; Graph Convolutional Network (GCN); semantic correlation; semantic relation graph
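To make the reported metrics concrete, here is a small sketch, under the same hypothetical framing, of how Recall@K and mean Recall (mR) are typically computed for image-text retrieval; the similarity matrix and the one-to-one ground-truth pairing are illustrative assumptions.

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j]: similarity of query i to candidate j; ground truth is j == i."""
    ranks = (-sim).argsort(axis=1)                 # candidates sorted best-first
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return 100.0 * hits.mean()                     # % of queries with a top-k hit

def mean_recall(sim_i2t, sim_t2i):
    """mR: average of R@1/R@5/R@10 over both retrieval directions."""
    scores = [recall_at_k(s, k) for s in (sim_i2t, sim_t2i) for k in (1, 5, 10)]
    return sum(scores) / len(scores)
```

Recall is reported as a percentage, which is why the abstract quotes mR gains in percentage points.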