Cross-modal recipe retrieval method based on modality semantic enhancement
Citation: Li Ming, Zhou Dong, Lei Fang, Cao Buqing. Cross-modal recipe retrieval method based on modality semantic enhancement [J]. Application Research of Computers, 2024, 41(4): 1131-1137.
Authors: Li Ming  Zhou Dong  Lei Fang  Cao Buqing
Affiliation: 1. School of Computer Science and Engineering, Hunan University of Science and Technology; 2. School of Information Science and Technology, Guangdong University of Foreign Studies
Funding: National Natural Science Foundation of China (62376062); Guangdong Province Philosophy and Social Sciences "14th Five-Year Plan" Planning Project (GD23CTS03); Natural Science Foundation of Guangdong Province (2023A1515012718); Natural Science Foundation of Hunan Province (2022JJ30020); Humanities and Social Sciences Research Project of the Ministry of Education (23YJAZH220)
Abstract: Effectively representing the features of each modality is a hot issue in cross-modal recipe retrieval. Current methods generally adopt two independent neural networks to extract image and recipe features separately, achieving retrieval through cross-modal alignment. However, these methods focus mainly on intra-modal information and ignore inter-modal interactions, so some useful modality information is lost. To address this problem, this paper proposed a cross-modal recipe retrieval method that enhances modality semantics through multimodal encoders. First, it used pre-trained models to extract initial semantic features of images and recipes, and applied an adversarial loss to reduce inter-modal differences. Second, it employed pairwise cross-modal attention, in which features from one modality repeatedly reinforce those of the other, to extract further useful information. Third, it used a self-attention mechanism to model the internal features of each modality, capturing rich modality-specific semantics and latent associations. Finally, it introduced a triplet loss to minimize the distance between matching samples, achieving cross-modal retrieval learning. Experimental results on the Recipe 1M dataset show that the proposed method outperforms current mainstream methods in terms of median rank (MedR) and recall at top K (R@K), providing an effective solution for cross-modal retrieval tasks.

Keywords: cross-modal recipe retrieval  feature extraction  modality semantic enhancement  multimodal encoder
Received: 2023-07-23
Revised: 2024-03-13
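The abstract outlines four steps: pre-trained feature extraction with adversarial alignment, pairwise cross-modal attention, intra-modal self-attention, and a triplet loss. Below is a minimal PyTorch sketch of how such an enhancement encoder and triplet objective could fit together. It is an illustration only: the class names, feature dimensions, layer count, pooling, and margin value are assumptions rather than the paper's actual configuration, the inputs are assumed to be generic (batch, seq_len, dim) features from pre-trained extractors, and the adversarial alignment loss is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalEncoderLayer(nn.Module):
    """One enhancement layer: cross-modal attention then self-attention."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, other):
        # Queries come from this modality; keys/values from the other, so
        # features from one modality reinforce the features of the other.
        out, _ = self.cross_attn(x, other, other)
        x = self.norm1(x + out)
        # Self-attention then models the modality's internal structure.
        out, _ = self.self_attn(x, x, x)
        return self.norm2(x + out)


class RecipeRetrievalEncoder(nn.Module):
    """Symmetric stacks of enhancement layers for image and recipe features."""

    def __init__(self, dim=512, num_layers=2):
        super().__init__()
        self.img_layers = nn.ModuleList(
            CrossModalEncoderLayer(dim) for _ in range(num_layers))
        self.rec_layers = nn.ModuleList(
            CrossModalEncoderLayer(dim) for _ in range(num_layers))

    def forward(self, img_feats, rec_feats):
        # img_feats, rec_feats: (batch, seq_len, dim) tensors assumed to come
        # from pre-trained extractors (image patches, recipe tokens).
        for img_layer, rec_layer in zip(self.img_layers, self.rec_layers):
            img_feats, rec_feats = (img_layer(img_feats, rec_feats),
                                    rec_layer(rec_feats, img_feats))
        # Mean-pool each sequence into a single L2-normalised embedding.
        img_emb = F.normalize(img_feats.mean(dim=1), dim=-1)
        rec_emb = F.normalize(rec_feats.mean(dim=1), dim=-1)
        return img_emb, rec_emb


def triplet_loss(anchor, positive, negative, margin=0.3):
    # Pull matching image/recipe pairs together, push mismatched pairs apart.
    pos_dist = 1.0 - F.cosine_similarity(anchor, positive)
    neg_dist = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(pos_dist - neg_dist + margin).mean()


# Toy usage with random stand-ins for pre-extracted features.
encoder = RecipeRetrievalEncoder()
img = torch.randn(4, 49, 512)   # e.g. a 7x7 grid of image patch features
rec = torch.randn(4, 20, 512)   # e.g. 20 recipe token features
img_emb, rec_emb = encoder(img, rec)
# Negatives: roll the batch so each image pairs with a mismatched recipe.
loss = triplet_loss(img_emb, rec_emb, rec_emb.roll(1, dims=0))
print(loss.item())
```

Updating both modalities simultaneously from the previous layer's outputs (the tuple assignment in the loop) keeps the two attention directions symmetric, matching the abstract's description of features from each modality repeatedly reinforcing the other.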
