Visual question answering via Attention-based syntactic structure tree-LSTM
Abstract: Owing to the diverse visual patterns of images and the free-form language of questions, the performance of Visual Question Answering (VQA) remains unsatisfactory. Existing approaches mainly infer answers from low-level visual features and sequential question words, neglecting the syntactic structure of the question sentence and its correlation with the spatial structure of the image. To address these problems, we propose a novel VQA model, the Attention-based Syntactic Structure Tree-LSTM (ASST-LSTM). Specifically, a tree-structured LSTM encodes the syntactic structure of the question sentence, and a spatial-semantic attention model is proposed to learn the visual-textual correlation and the alignment between image regions and question words. Within the attention model, a Siamese network is employed to explore the alignment between visual and textual contents. The tree-structured LSTM and the spatial-semantic attention model are then integrated into a joint deep model, which is trained with multi-task learning for answer inference. Experiments on three widely used VQA benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art approaches.
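The abstract does not give the exact cell equations of ASST-LSTM. As a rough illustration of how a tree-structured LSTM composes a parsed question bottom-up, the sketch below implements a child-sum Tree-LSTM cell in the style of Tai et al. (2015) in PyTorch; the class name, dimensions, and usage are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a child-sum Tree-LSTM cell (Tai et al., 2015 style),
# a standard way to encode a syntactic parse tree bottom-up; the paper's
# exact ASST-LSTM formulation may differ. Dimension choices are assumptions.
import torch
import torch.nn as nn


class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # Joint projections for the input (i), output (o), and update (u) gates.
        self.iou_x = nn.Linear(input_dim, 3 * hidden_dim)
        self.iou_h = nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)
        # The forget gate is computed separately for each child subtree.
        self.f_x = nn.Linear(input_dim, hidden_dim)
        self.f_h = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x, child_h, child_c):
        # x: (input_dim,) embedding of the word at the current tree node
        # child_h, child_c: (num_children, hidden_dim) states of its children
        h_sum = child_h.sum(dim=0)                      # child-sum aggregation
        i, o, u = (self.iou_x(x) + self.iou_h(h_sum)).chunk(3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        # One forget gate per child, so each subtree can be kept or discarded.
        f = torch.sigmoid(self.f_x(x).unsqueeze(0) + self.f_h(child_h))
        c = i * u + (f * child_c).sum(dim=0)            # new cell state
        h = o * torch.tanh(c)                           # new hidden state
        return h, c


# Usage: a leaf node has no children, so pass empty child tensors; an internal
# node aggregates the states of its parsed children.
if __name__ == "__main__":
    cell = ChildSumTreeLSTMCell(input_dim=300, hidden_dim=512)
    leaf_h, leaf_c = cell(torch.randn(300),
                          torch.zeros(0, 512), torch.zeros(0, 512))
    root_h, root_c = cell(torch.randn(300),
                          leaf_h.unsqueeze(0), leaf_c.unsqueeze(0))
```

In such a sketch, the hidden state of the parse-tree root would serve as the question representation that the spatial-semantic attention model aligns with image regions.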
Keywords: Visual question answering; Visual attention; Tree-LSTM; Spatial-semantic correlation
This article has been indexed in ScienceDirect and other databases.