Relational-Convergent Transformer for image captioning |
| |
Affiliation: | 1. Air Force Institute of Technology, 2950 Hobson Way (AFIT/ENV), Wright-Patterson AFB, OH 45433, USA; 2. Department of Oral Surgery, Ninth People's Hospital, College of Stomatology, Shanghai Jiao Tong University School of Medicine, and Shanghai Key Laboratory of Stomatology and Shanghai Research Institute of Stomatology, Shanghai 200011, China; 3. Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai 201100, China; 4. Department of Medical Record Statistics, Ningbo City First Hospital, Ningbo Hospital of Zhejiang University, Ningbo 315010, China |
| |
Abstract: | Image captioning describes the visual content of an image in natural-language sentences and plays a key role in fusing and exploiting image features. However, in existing image captioning models the decoder sometimes fails to capture the relationships between image features efficiently, because those features lack sequential dependencies. In this paper, we propose a Relational-Convergent Transformer (RCT) network to obtain complex intramodality representations for image captioning. In the RCT, a Relational Fusion Module (RFM) is designed to capture the local and global information of an image through recursive fusion. A Relational-Convergent Attention (RCA) mechanism, composed of self-attention and a hierarchical fusion module, then aggregates global relational information to extract a more comprehensive intramodal contextual representation. To validate the effectiveness of the proposed model, extensive experiments are conducted on the MSCOCO dataset. The experimental results show that the proposed method outperforms several state-of-the-art methods. |
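The paper itself is not reproduced in this record, so the exact RCA formulation is unavailable; the following is only a minimal, hypothetical numpy sketch of the idea the abstract describes: self-attention over region features whose output is fused with a global context summary through a learned gate. All weight names (`Wq`, `Wk`, `Wv`, `Wg`) and the gating scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention over region features X of shape (n, d).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V

def relational_convergent_attention(X, Wq, Wk, Wv, Wg):
    # Hypothetical sketch: per-region attention output (local relations) is
    # fused with a mean-pooled global context via a sigmoid gate, so each
    # region's representation "converges" local and global relational cues.
    local = self_attention(X, Wq, Wk, Wv)
    global_ctx = local.mean(axis=0, keepdims=True)   # (1, d) global summary
    gate = 1.0 / (1.0 + np.exp(-(X @ Wg)))           # per-region gate in (0, 1)
    return gate * local + (1.0 - gate) * global_ctx  # (n, d) fused features

rng = np.random.default_rng(0)
n, d = 5, 8                                  # 5 image regions, 8-dim features
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wg = (0.1 * rng.normal(size=(d, d)) for _ in range(4))
out = relational_convergent_attention(X, Wq, Wk, Wv, Wg)
print(out.shape)  # (5, 8)
```

The gated convex combination is one common way to merge local and global streams; the actual RFM/RCA modules in the paper may use a different hierarchical fusion.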
| |
Keywords: | Image captioning; Relational fusion; Relational-Convergent Attention |
This document is indexed in ScienceDirect and other databases. |
|