

D2F: discriminative dense fusion of appearance and motion modalities for end-to-end video classification
Authors: Lin Wang, Xingfu Wang, Ammar Hawbani, Yan Xiong, Xu Zhang
Affiliations: 1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; 2. National Computer Network Emergency Response Technical Center of China, Chengdu 610072, China
Abstract:

Recently, two-stream networks with multi-modality inputs have proven vital to state-of-the-art video understanding. Previous deep systems typically adopt a late fusion strategy; despite its simplicity and effectiveness, late fusion may combine the modalities insufficiently, because it performs fusion across modalities only once and treats each modality equally, without discrimination. In this paper, we propose a Discriminative Dense Fusion (D2F) network that addresses these limitations by densely inserting an attention-based fusion block at every layer. We experiment with two standard action-classification benchmarks and three popular classification backbones, where the proposed module consistently outperforms state-of-the-art baselines by noticeable margins. Specifically, with D2F the two-stream VGG16, ResNet, and I3D achieve accuracies of [93.5%, 69.2%], [94.6%, 70.5%], and [94.1%, 72.3%] on [UCF101, HMDB51], respectively, with absolute gains of [5.5%, 9.8%], [5.13%, 9.91%], and [0.7%, 5.9%] over their late-fusion counterparts. Qualitative results also show that our model learns more informative complementary representations.
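To make the fusion idea concrete, the following is a minimal PyTorch sketch of an attention-based fusion block that could be inserted after each backbone stage, as the abstract describes. The class and variable names (`AttentionFusionBlock`, `rgb_feat`, `flow_feat`) are illustrative assumptions, not taken from the paper's released code; the paper's actual block design may differ.

```python
import torch
import torch.nn as nn


class AttentionFusionBlock(nn.Module):
    """Hypothetical sketch: fuses appearance (RGB) and motion (optical-flow)
    feature maps with a learned per-channel gate, so the two modalities are
    weighted discriminatively rather than treated equally."""

    def __init__(self, channels: int):
        super().__init__()
        # Squeeze both modalities, then predict a per-channel gate in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat: torch.Tensor, flow_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat, flow_feat: (N, C, H, W) feature maps from the two streams.
        pooled = torch.cat(
            [rgb_feat.mean(dim=(2, 3)), flow_feat.mean(dim=(2, 3))], dim=1
        )  # (N, 2C)
        a = self.gate(pooled).unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1)
        # Convex combination: the gate decides how much each modality contributes.
        return a * rgb_feat + (1.0 - a) * flow_feat


# "Dense" fusion would insert one such block after every backbone layer.
block = AttentionFusionBlock(channels=64)
fused = block(torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14))
print(tuple(fused.shape))  # (2, 64, 14, 14)
```

Because the gate output lies in (0, 1) per channel, the fused map stays on the same scale as the inputs, which keeps the block safe to stack densely across layers.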

Keywords:
This article is indexed by SpringerLink and other databases.
