Abstract: | Intelligently tracking objects with varied shapes, color, lighting conditions, and backgrounds is an extremely useful application in many HCI applications, such as human body motion capture, hand gesture recognition, and virtual reality (VR) games. However, accurately tracking different objects under uncontrolled environments is a tough challenge due to the possibly dynamic object parts, varied lighting conditions, and sophisticated backgrounds. In this work, we propose a novel semantically-aware object tracking framework, wherein the key is weakly-supervised learning paradigm that optimally transfers the video-level semantic tags into various regions. More specifically, give a set of training video clips, each of which is associated with multiple video-level semantic tags, we first propose a weakly-supervised learning algorithm to transfer the semantic tags into various video regions. The key is a MIL (Zhong et al., 2020) [1]-based manifold embedding algorithm that maps the entire video regions into a semantic space, wherein the video-level semantic tags are well encoded. Afterward, for each video region, we use the semantic feature combined with the appearance feature as its representation. We designed a multi-view learning algorithm to optimally fuse the above two types of features. Based on the fused feature, we learn a probabilistic Gaussian mixture model to predict the target probability of each candidate window, where the window with the maximal probability is output as the tracking result. Comprehensive comparative results on a challenging pedestrian tracking task as well as the human hand gesture recognition have demonstrated the effectiveness of our method. Moreover, visualized tracking results have shown that non-rigid objects with moderate occlusions can be well localized by our method. |