Video summarization algorithm fusing multimodal features and temporal zone detection
Cite this article: Bai Chen, Fan Tao, Wang Wenjing, Wang Guozhong. Video summarization algorithm fusing multimodal features and temporal zone detection[J]. Application Research of Computers, 2023, 40(11): 3276-3281+3288
Authors: Bai Chen  Fan Tao  Wang Wenjing  Wang Guozhong
Affiliation: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science
Funding: National Key Research and Development Program of China, 2019 key special project (2019YFB180270200)
Abstract: To address the problems that traditional video summarization algorithms do not fully exploit the multimodal information in a video and struggle to keep the summary segments temporally consistent, this paper proposed a video summarization algorithm fusing multimodal features and temporal zone detection (MTNet). Firstly, it extracted visual and audio feature representations with pre-trained GoogLeNet and VGGish models, and designed a dimension-smoothing operation to align the two modalities so that the model gained a comprehensive representation of the video. Secondly, since the generated summary should be globally representative, it combined single- and double-layer self-attention mechanisms with residual structures to capture long-range temporal dependencies of the visual and audio features, obtaining a single vector representation over the temporal range. Finally, it predicted the summary boundary and importance score of each temporal segment with separated temporal zone detection and weight sharing, and selected key segments through non-maximum suppression to generate the video summary. Experimental results on the two standard datasets SumMe and TvSum show that MTNet has stronger representation ability and robustness, and that its F1-score improves by 2.4 to 3.9 percentage points over the anchor-free video summarization algorithm DSNet-AF and the shot-importance-prediction-based algorithm VASNet on both datasets.

Keywords: multimodal features  feature fusion  video summarization  temporal zone detection  attention mechanism
Received: 2023-02-23
Revised: 2023-10-13
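
A minimal sketch of the multimodal alignment step described in the abstract above, assuming the GoogLeNet frame features (1024-d) and VGGish audio features (128-d) have already been extracted per segment. The MultimodalAligner module, its shared_dim parameter, and the choice of a linear projection followed by concatenation are illustrative stand-ins for the paper's dimension-smoothing operation, not the authors' implementation.

import torch
import torch.nn as nn

class MultimodalAligner(nn.Module):
    def __init__(self, visual_dim=1024, audio_dim=128, shared_dim=256):
        super().__init__()
        # Project each modality into a feature space of the same width.
        self.visual_proj = nn.Linear(visual_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (T, 1024) GoogLeNet features; audio_feats: (T, 128) VGGish features
        v = self.visual_proj(visual_feats)   # (T, shared_dim)
        a = self.audio_proj(audio_feats)     # (T, shared_dim)
        return torch.cat([v, a], dim=-1)     # (T, 2 * shared_dim) fused segment sequence

# Usage with random placeholders for the pre-extracted features of 120 segments.
aligner = MultimodalAligner()
fused = aligner(torch.randn(120, 1024), torch.randn(120, 128))
print(fused.shape)  # torch.Size([120, 512])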
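A sketch of the long-range temporal modelling step, assuming the fused 512-d segment sequence from the alignment sketch. PyTorch's nn.MultiheadAttention with a residual connection and layer normalization stands in for the paper's single- and double-layer self-attention blocks, whose exact configuration is not given in the abstract.

import torch
import torch.nn as nn

class ResidualSelfAttention(nn.Module):
    # One self-attention layer with a residual connection, used to model
    # long-range temporal dependencies over the fused segment sequence.
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, T, dim)
        attended, _ = self.attn(x, x, x)     # self-attention: query = key = value
        return self.norm(x + attended)       # residual connection, then normalization

seq = torch.randn(1, 120, 512)               # one video, 120 fused segment features
out = ResidualSelfAttention()(seq)
print(out.shape)                             # torch.Size([1, 120, 512])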
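A sketch of the final selection step: greedy non-maximum suppression over predicted temporal zones given as (start, end, importance score). The 0.5 IoU threshold and the 15% summary-length budget are common defaults on SumMe/TvSum and are assumptions here, not values reported by the paper.

def temporal_iou(a, b):
    # Intersection-over-union of two temporal zones given as (start, end, ...) in seconds.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms_select(zones, iou_threshold=0.5, budget_ratio=0.15, video_length=None):
    # Greedily keep the highest-scoring zones that do not overlap the ones already
    # kept, stopping once the summary reaches roughly budget_ratio of the video.
    kept, used = [], 0.0
    for zone in sorted(zones, key=lambda z: z[2], reverse=True):
        if all(temporal_iou(zone, k) < iou_threshold for k in kept):
            kept.append(zone)
            used += zone[1] - zone[0]
            if video_length is not None and used >= budget_ratio * video_length:
                break
    return sorted(kept, key=lambda z: z[0])

# Example: three candidate zones (start, end, importance score) for a 100-second video.
candidates = [(0.0, 10.0, 0.9), (8.0, 18.0, 0.7), (40.0, 55.0, 0.8)]
print(nms_select(candidates, video_length=100.0))  # [(0.0, 10.0, 0.9), (40.0, 55.0, 0.8)]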
