Multi-view motion modelled deep attention networks (M2DA-Net) for video based sign language recognition
Affiliation:
1. Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
2. Department of Computer Science, Prince Sultan University, Riyadh 11586, Saudi Arabia
3. College of Applied Computer Sciences, King Saud University, Saudi Arabia
4. Turabah University College, Computer Sciences Program, Taif University, Taif 21944, Saudi Arabia
5. Software Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
6. Centre of Smart Robotics Research (CS2R), King Saud University, Riyadh 11543, Saudi Arabia
7. Artificial Intelligence Center of Advanced Studies (Thakaa), King Saud University, Saudi Arabia
Abstract: Video-based sign language recognition (SLR) has been studied extensively with deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Combining a multi-view attention mechanism with CNNs is an appealing way to make machine interpretation robust to finger self-occlusions. The proposed multi-stream CNN fuses spatial and motion-modelled video sequences into a low-dimensional feature vector at multiple stages of the CNN pipeline, casting the view-invariance problem as a video classification problem solved with attention-based CNNs. For better performance during training, signs are learned through a motion attention network that focuses on the parts of the sequence most relevant to recognition, while view-based paired pooling is performed by a trainable view pair pooling network (VPPN). The VPPN pairs views to produce maximally discriminative features from all views, improving sign recognition. The results show increased recognition accuracy on 2D video sign language datasets. Because no multi-view sign language dataset exists other than ours, similar results are also reported on benchmark action datasets such as NTU RGB+D, MuHAVi, WEIZMANN and NUMA.
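The abstract does not specify the internals of the view pair pooling network; the sketch below is one plausible PyTorch reading of "pairs views to produce maximally discriminative features", not the authors' implementation. The module name, layer sizes, and the element-wise max over view pairs are all assumptions for illustration.

```python
# Hypothetical sketch of a trainable view-pair pooling module (VPPN-like):
# per-view CNN features are combined in pairs, each pair is transformed by a
# learnable layer, and the pairwise outputs are pooled into one descriptor.
import itertools
import torch
import torch.nn as nn

class ViewPairPooling(nn.Module):
    def __init__(self, feat_dim: int, out_dim: int):
        super().__init__()
        # learnable fusion of a concatenated view pair -> pooled feature
        self.pair_fc = nn.Sequential(
            nn.Linear(2 * feat_dim, out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, feat_dim)
        _, num_views, _ = view_feats.shape
        pair_outputs = []
        for i, j in itertools.combinations(range(num_views), 2):
            pair = torch.cat([view_feats[:, i], view_feats[:, j]], dim=-1)
            pair_outputs.append(self.pair_fc(pair))  # (batch, out_dim)
        # element-wise max over all view pairs keeps the most
        # discriminative response per feature dimension (assumed pooling)
        return torch.stack(pair_outputs, dim=1).max(dim=1).values

# usage: 4 camera views, 512-d per-view features -> (8, 256) fused descriptor
vppn = ViewPairPooling(feat_dim=512, out_dim=256)
fused = vppn(torch.randn(8, 4, 512))
```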
Keywords: Multi-view; Sign language recognition; Deep learning; Attention models; Motion modelled
|