Research on a multimodal fusion-based video captioning model for urban road scenes
Cite this article: Li Mingxing, Xu Cheng, Li Xuewei, Liu Hongzhe, Yan Chenyang, Liao Wensen. Research on a multimodal fusion-based video captioning model for urban road scenes [J]. Application Research of Computers, 2023, 40(2).
Authors: Li Mingxing, Xu Cheng, Li Xuewei, Liu Hongzhe, Yan Chenyang, Liao Wensen
Affiliation: Beijing Union University (all authors)
Funding: National Natural Science Foundation of China (62171042, 62102033, 61906017, 61802019); Beijing Municipal Key Science and Technology Project (KZ202211417048); Collaborative Innovation Center Project (CYXC2203); Academic Research Projects of Beijing Union University (BPHR2020DZ02, ZB10202003, ZK40202101, ZK120202104)
Abstract: Urban road video captioning typically considers only visual information and ignores the equally important audio information; multimodal fusion is one way to address this. Existing Transformer-based multimodal fusion algorithms suffer from poor fusion performance between modalities and high computational complexity. To improve the interaction between multimodal information, this paper proposes a new Transformer-based video captioning model, multimodal attention bottleneck for video captioning (MABVC). First, pre-trained I3D and VGGish networks extract the visual and audio features of a video, and the extracted features are fed into the Transformer model. The decoder then trains on the two modalities separately before performing multimodal fusion, and its output is finally processed into a text description that people can understand. Comparison experiments were conducted on the public datasets MSR-VTT and MSVD and the self-built dataset BUUISE, and the model was validated with standard evaluation metrics. The results show that the video captioning model based on multimodal attention fusion improves noticeably on all metrics. The model also performs well on the traffic-scene dataset and has strong application prospects in the intelligent driving industry.

Keywords: video captioning; multimodal fusion; attention mechanism; intelligent driving
Received: 2022-06-08
Revised: 2023-01-16

Multimodal fusion for video captioning on urban road scene
Li Mingxing, Xu Cheng, Li Xuewei, Liu Hongzhe, Yan Chenyang, Liao Wensen. Multimodal fusion for video captioning on urban road scene [J]. Application Research of Computers, 2023, 40(2).
Authors: Li Mingxing, Xu Cheng, Li Xuewei, Liu Hongzhe, Yan Chenyang, Liao Wensen
Affiliation: Key Laboratory of Information Service Engineering, Beijing Union University, Beijing, China
Abstract: Multimodal fusion is one way to address the problem that urban road video captioning considers only visual information and ignores the equally important audio information. Existing Transformer-based multimodal fusion algorithms suffer from poor fusion performance between modalities and high computational complexity. To improve the interaction between multimodal information, this paper proposes a new Transformer-based model, multimodal attention bottleneck for video captioning (MABVC). Firstly, pre-trained I3D and VGGish networks extract the visual and audio features of a video, and the extracted features are fed into the Transformer model. Then the decoder trains on the two modalities separately before performing multimodal fusion. Finally, the model processes the decoder output to generate text captions that people can understand. This paper conducts comparison experiments on the public datasets MSR-VTT and MSVD and the self-built dataset BUUISE, and validates the model with standard evaluation metrics. The experimental results show that the video captioning model based on multimodal attention fusion improves noticeably on all metrics. The model still achieves good results on the traffic-scene dataset and has strong application prospects in the intelligent driving industry.
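The attention-bottleneck fusion described in the abstract can be sketched as follows. This is a minimal illustration of the general idea only, assuming PyTorch and hypothetical layer names, dimensions, and token counts; it is not the authors' released implementation. Each modality stream attends to its own tokens plus a small set of shared bottleneck tokens, so cross-modal information is exchanged through the bottleneck rather than through full pairwise attention.

import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One fusion layer: each modality attends to [bottleneck tokens ; its own tokens]."""
    def __init__(self, dim=512, heads=8, n_bottleneck=4):
        super().__init__()
        # Learned bottleneck tokens shared by the two modality streams (hypothetical size).
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        self.visual_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.n_b = n_bottleneck

    def forward(self, visual, audio):
        # visual: (B, Tv, dim) e.g. projected I3D features
        # audio:  (B, Ta, dim) e.g. projected VGGish features
        b = self.bottleneck.expand(visual.size(0), -1, -1)
        v = self.visual_layer(torch.cat([b, visual], dim=1))
        a = self.audio_layer(torch.cat([b, audio], dim=1))
        # Average the bottleneck tokens produced by the two streams so that
        # information learned in one modality becomes visible to the other.
        fused_bottleneck = 0.5 * (v[:, :self.n_b] + a[:, :self.n_b])
        return v[:, self.n_b:], a[:, self.n_b:], fused_bottleneck

# Example usage with random feature tensors:
# visual = torch.randn(2, 32, 512); audio = torch.randn(2, 16, 512)
# v_out, a_out, fused = BottleneckFusionLayer()(visual, audio)

Because the two streams only share a handful of bottleneck tokens, the fusion cost stays far below that of full cross-attention over all visual and audio tokens, which is the efficiency motivation the abstract alludes to.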
Keywords: video captioning; multimodal fusion; attention mechanism; intelligent driving