Video Summarization Based on Spatial-Temporal Transform Network

Citation: LI Qun, XIAO Fu, ZHANG Zi-Yi, ZHANG Feng, LI Yan-Chao. Video Summarization Based on Spatial-Temporal Transform Network[J]. Journal of Software, 2022, 33(9): 3195-3209.
Authors: LI Qun  XIAO Fu  ZHANG Zi-Yi  ZHANG Feng  LI Yan-Chao
Affiliation: School of Computer Science, School of Software, and School of Cyberspace Security, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Funding: National Natural Science Foundation of China (Nos. 61906099 and 61906098)
Abstract: Video summarization is an essential task in computer vision. Its goal is to summarize a video by generating a concise yet complete digest from its most informative parts, typically either a set of representative frames (key frames) or a shorter video formed by stitching key segments together in temporal order. Although research on video summarization has made considerable progress, existing methods suffer from a lack of temporal information and from incomplete feature representation, both of which easily compromise the correctness and completeness of the generated summary. To address these problems, this paper proposes a spatial-temporal transform network consisting of three modules: an embedding layer, a feature transformation and fusion layer, and an output layer. The embedding layer embeds spatial and temporal features separately, compensating for the inadequate representation of temporal information in existing models; the feature transformation and fusion layer transforms and fuses multi-modal features, resolving the problem of incomplete feature representation; and the output layer generates the video summary through segment prediction and key-shot selection. Extensive experiments and analyses on two benchmark datasets verify the effectiveness of the proposed model.
Keywords: video summarization  spatial-temporal transform network  ViLBERT  feature fusion  multi-modal
Received: 2021-06-29
Revised: 2021-08-15
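
The abstract describes a three-module pipeline: an embedding layer that embeds spatial and temporal features separately, a feature transformation and fusion layer for multi-modal features, and an output layer that performs segment prediction and key-shot selection. The sketch below shows how such a pipeline might be wired up in PyTorch. The layer sizes, the use of Transformer encoders plus cross-stream (co-)attention for fusion (suggested by the ViLBERT keyword), and the sigmoid frame-importance head are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class SpatialTemporalSummarizer(nn.Module):
    """Illustrative three-module layout: embed, transform-and-fuse, score."""

    def __init__(self, spatial_dim=1024, temporal_dim=1024, d_model=512, n_heads=8):
        super().__init__()
        # (1) Embedding layer: spatial and temporal features embedded separately.
        self.spatial_embed = nn.Linear(spatial_dim, d_model)
        self.temporal_embed = nn.Linear(temporal_dim, d_model)
        # (2) Feature transformation and fusion layer: per-stream Transformer
        # encoders, then cross-stream (co-)attention in the spirit of ViLBERT.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.s_attends_t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.t_attends_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)
        # (3) Output layer: per-frame importance scores in [0, 1].
        self.score = nn.Linear(d_model, 1)

    def forward(self, spatial_feats, temporal_feats):
        # Both inputs: (batch, n_frames, feature_dim).
        s = self.spatial_encoder(self.spatial_embed(spatial_feats))
        t = self.temporal_encoder(self.temporal_embed(temporal_feats))
        # Co-attention: each stream queries the other; the two views are fused.
        s2t, _ = self.s_attends_t(query=s, key=t, value=t)
        t2s, _ = self.t_attends_s(query=t, key=s, value=s)
        fused = torch.relu(self.fuse(torch.cat([s2t, t2s], dim=-1)))
        return torch.sigmoid(self.score(fused)).squeeze(-1)

model = SpatialTemporalSummarizer()
spatial = torch.randn(2, 120, 1024)   # e.g., per-frame appearance (CNN) features
temporal = torch.randn(2, 120, 1024)  # e.g., motion features for the same frames
scores = model(spatial, temporal)     # shape (2, 120): frame-importance scores

Turning these scores into the final summary (the key-shot selection step, commonly posed as a 0/1 knapsack over detected shot boundaries under a length budget) is left out above, since the abstract does not specify how the paper implements segment prediction and shot selection.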
