Query Aware Cross-modal Dual Contrastive Learning Network for Multi-modal Video Moment Retrieval
Cite this article: YIN Meng-Ran, LIANG Mei-Yu, YU Yang, CAO Xiao-Wen, DU Jun-Ping, XUE Zhe. Query Aware Cross-modal Dual Contrastive Learning Network for Multi-modal Video Moment Retrieval[J]. Journal of Software, 2024, 35(5)
Authors: YIN Meng-Ran  LIANG Mei-Yu  YU Yang  CAO Xiao-Wen  DU Jun-Ping  XUE Zhe
Affiliation: Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
Funding: National Natural Science Foundation of China (62192784, U22B2038, 62172056, 62272058); Chinese Association for Artificial Intelligence (CAAI)-Huawei MindSpore Academic Award Fund (No. CAAIXSJLJJ-2021-007B)
Abstract: Recently, a new task named video corpus moment retrieval (VCMR) has been proposed, which aims to retrieve from an unsegmented video corpus a short video moment that corresponds to a given query sentence. Existing cross-modal video-text retrieval methods hinge on aligning and fusing the features of different modalities; however, simply performing cross-modal alignment and fusion cannot ensure that semantically similar data from the same modality stay close in the joint feature space, nor is the semantics of the query sentence taken into account. To address these problems, this paper proposes a query-aware cross-modal dual contrastive learning network (QACLN) for multi-modal video moment retrieval, which obtains unified semantic representations of data from different modalities by combining inter-modal and intra-modal contrastive learning. Specifically, a query-aware cross-modal semantic fusion strategy is proposed, which adaptively fuses the video's multi-modal features, such as visual-modality and caption-modality features, according to the perceived query semantics, yielding a query-aware multi-modal joint representation of the video. Furthermore, an inter-modal and intra-modal dual contrastive learning mechanism for videos and query sentences is proposed to strengthen the semantic alignment and fusion across modalities, thereby improving the discriminability and semantic consistency of the representations of different modalities. Finally, 1D convolutional boundary regression and cross-modal semantic similarity computation are employed to perform moment localization and video retrieval. Extensive experiments on the benchmark dataset demonstrate that the proposed QACLN outperforms the state-of-the-art methods.

Keywords: cross-modal semantic fusion  cross-modal retrieval  video moment localization  contrastive learning
Received: 2023-03-26
Revised: 2023-06-08
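
The abstract does not give implementation details of the query-aware cross-modal semantic fusion strategy. As an illustrative aid only, the following is a minimal PyTorch-style sketch of what such a fusion could look like: the query embedding scores the visual and caption streams, and a softmax over the two modalities yields query-conditioned fusion weights. The module name, dimensions, and gating formulation are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAwareFusion(nn.Module):
    """Illustrative sketch (not the paper's code): fuse visual and caption
    features of a video, weighted by their relevance to the query."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.modal_proj = nn.Linear(dim, dim)

    def forward(self, visual, caption, query):
        # visual, caption: (batch, num_clips, dim); query: (batch, dim)
        q = self.query_proj(query).unsqueeze(1)                      # (batch, 1, dim)
        # Score each modality stream against the query semantics.
        s_v = (self.modal_proj(visual) * q).sum(-1, keepdim=True)    # (batch, clips, 1)
        s_c = (self.modal_proj(caption) * q).sum(-1, keepdim=True)   # (batch, clips, 1)
        # Softmax over the two modalities gives adaptive fusion weights per clip.
        w = F.softmax(torch.cat([s_v, s_c], dim=-1), dim=-1)         # (batch, clips, 2)
        fused = w[..., :1] * visual + w[..., 1:] * caption
        return fused                                                 # query-aware joint representation
```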

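Likewise, the inter-modal and intra-modal dual contrastive learning mechanism described in the abstract can be pictured as two InfoNCE-style terms: one pulling matched video-query pairs together across modalities, and one pulling semantically similar samples together within each modality. The sketch below is a hedged reconstruction under those assumptions; the temperature, loss weights, and the use of augmented views as intra-modal positives are placeholders, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature: float = 0.07):
    """Standard InfoNCE over a batch: the i-th row of `positive` is the only
    match for the i-th row of `anchor`; all other rows act as negatives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature                 # (batch, batch)
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def dual_contrastive_loss(video_emb, query_emb, video_aug_emb, query_aug_emb,
                          alpha: float = 1.0, beta: float = 1.0):
    # Inter-modal term: matched (video, query) pairs should be close in the joint space.
    inter = info_nce(video_emb, query_emb) + info_nce(query_emb, video_emb)
    # Intra-modal term: each sample stays close to a semantically similar
    # (here: augmented) view from its own modality -- an assumed choice of positives.
    intra = info_nce(video_emb, video_aug_emb) + info_nce(query_emb, query_aug_emb)
    return alpha * inter + beta * intra
```

Combining the two terms is what distinguishes such a "dual" objective from plain cross-modal alignment: the intra-modal term keeps semantically similar same-modality data close in the joint feature space, which the abstract identifies as the gap in prior alignment-and-fusion approaches.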