Funding: National Natural Science Foundation of China (62067004)
Received: 2021-09-14
Revised: 2021-12-20

Cross-modal tensor fusion network based on semantic relation graph for image-text retrieval
Citation: Changhong LIU, Sheng ZENG, Bin ZHANG, Yong CHEN. Cross-modal tensor fusion network based on semantic relation graph for image-text retrieval[J]. Journal of Computer Applications, 2022, 42(10): 3018-3024.
Authors: Changhong LIU, Sheng ZENG, Bin ZHANG, Yong CHEN
Affiliations: School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China; School of Business Administration, Nanchang Institute of Technology, Nanchang, Jiangxi 330029, China
Abstract: The key challenge in cross-modal image-text retrieval is to capture the semantic correlation between images and text effectively. Most existing methods learn either the global semantic correlation between image region features and text features, or the local semantic correlation between objects across modalities, while ignoring the relationships among objects within each modality and the correlation between object relationships across modalities. To address this problem, an image-text retrieval method based on a Cross-Modal Tensor Fusion Network based on Semantic Relation Graph (CMTFN-SRG) was proposed. Firstly, the relationships among image regions and among text words were modeled by a Graph Convolutional Network (GCN) and a Bidirectional Gated Recurrent Unit (Bi-GRU) respectively. Then, the fine-grained semantic correlation between the two modalities was learned by using a tensor fusion network to match the learned semantic relation graph of image regions with that of text words. At the same time, a Gated Recurrent Unit (GRU) was used to learn the global features of the image, and the global features of the image and the text were matched to capture the global inter-modality semantic correlation. The proposed method was compared with the Multi-Modality Cross Attention (MMCA) method on the benchmark datasets Flickr30K and MS-COCO. Experimental results show that the proposed method improves the Recall@1 of the text-to-image retrieval task by 2.6%, 9.0% and 4.1% on the Flickr30K, MS-COCO1K and MS-COCO5K test sets respectively, and improves the mean Recall (mR) by 0.4, 1.3 and 0.1 percentage points respectively, indicating that the method can effectively improve the precision of image-text retrieval.
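The abstract describes a multi-branch matching pipeline: a GCN over image regions, a Bi-GRU over text words, a tensor fusion step that matches the two relation graphs, and a GRU branch for global image-text matching. The following is a minimal PyTorch sketch of such a pipeline, assuming precomputed region features and a given region adjacency matrix; the low-rank bilinear (element-wise product) fusion stands in for the paper's full tensor fusion network, and all module names and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a CMTFN-SRG-style pipeline; names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionGCN(nn.Module):
    """One graph-convolution layer over image region features (semantic relation graph)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, regions, adj):
        # regions: (batch, n_regions, dim); adj: (batch, n_regions, n_regions)
        return F.relu(torch.bmm(adj, self.proj(regions)))  # aggregate related regions

class CMTFNSketch(nn.Module):
    def __init__(self, dim=512, vocab=10000):
        super().__init__()
        self.gcn = RegionGCN(dim)                      # image-side relation graph
        self.embed = nn.Embedding(vocab, dim)
        self.bigru = nn.GRU(dim, dim // 2, batch_first=True,
                            bidirectional=True)        # text-side word relations
        self.global_gru = nn.GRU(dim, dim, batch_first=True)  # global image feature
        # Low-rank bilinear pooling as a stand-in for full (outer-product) tensor fusion.
        self.fuse_v = nn.Linear(dim, dim)
        self.fuse_t = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, regions, adj, tokens):
        v = self.gcn(regions, adj)               # (b, nr, dim) region graph features
        t, _ = self.bigru(self.embed(tokens))    # (b, nw, dim) word relation features
        # Fine-grained matching: fuse every (region, word) pair, then pool.
        pair = self.fuse_v(v).unsqueeze(2) * self.fuse_t(t).unsqueeze(1)  # (b, nr, nw, dim)
        local_sim = self.score(torch.tanh(pair)).squeeze(-1).max(2).values.mean(1)
        # Global matching: GRU over the region sequence vs. mean-pooled text.
        _, g = self.global_gru(regions)
        global_sim = F.cosine_similarity(g.squeeze(0), t.mean(1), dim=-1)
        return local_sim + global_sim            # overall image-text similarity (b,)
```

In the paper the pairwise scores come from a tensor fusion network; the element-wise product above is a common low-rank approximation chosen here only to keep the sketch compact.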
Keywords: cross-modal retrieval; tensor fusion network; Graph Convolutional Network (GCN); semantic correlation; semantic relation graph
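To make the reported metrics concrete, here is a small sketch, under the same hypothetical framing, of how Recall@K and mean Recall (mR) are typically computed for image-text retrieval; the similarity matrix and the one-to-one ground-truth pairing are illustrative assumptions.

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j]: similarity of query i to candidate j; ground truth is j == i."""
    ranks = (-sim).argsort(axis=1)                 # candidates sorted best-first
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return 100.0 * hits.mean()                     # % of queries with a top-k hit

def mean_recall(sim_i2t, sim_t2i):
    """mR: average of R@1/R@5/R@10 over both retrieval directions."""
    scores = [recall_at_k(s, k) for s in (sim_i2t, sim_t2i) for k in (1, 5, 10)]
    return sum(scores) / len(scores)
```

Recall is reported as a percentage, which is why the abstract quotes mR gains in percentage points.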