Transformer visual object tracking algorithm based on mixed attention

Cite this article:	HOU Zhi-qiang, GUO Fan, YANG Xiao-lin, MA Su-gang, FAN Jiu-lun. Transformer visual object tracking algorithm based on mixed attention[J]. Control and Decision, 2024, 39(3): 739-748.

Authors:	HOU Zhi-qiang, GUO Fan, YANG Xiao-lin, MA Su-gang, FAN Jiu-lun

Affiliation:	School of Computer, Xi'an University of Posts & Telecommunications, Xi'an 710121, China; School of Communication and Information Engineering, Xi'an University of Posts & Telecommunications, Xi'an 710121, China

Funding:	National Natural Science Foundation of China (62072370).

Keywords:	computer vision; object tracking; Siamese network; deep learning; attention mechanism; Transformer

Abstract:	Transformer-based visual object trackers capture global information about the target well, but their representation of target features leaves room for improvement. To strengthen that representation, a Transformer visual object tracking algorithm based on mixed attention is proposed. First, a mixed attention module captures the target's features along the spatial and channel dimensions, modeling the contextual dependencies among those features. Second, several parallel dilated convolutions with different dilation rates sample the feature maps to obtain multi-scale image features and enhance the local feature representation. Finally, a purpose-built convolutional position encoding layer is added to the Transformer encoder to give the tracker accurate, length-adaptive position encodings, which improves localization accuracy. Extensive experiments on datasets including OTB100, VOT2018, and LaSOT show that learning the relationships between features with the mixed-attention Transformer network yields a better representation of target features. Compared with other mainstream object tracking algorithms, the proposed algorithm achieves better tracking performance while running in real time at 26 frames per second.
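
As a rough illustration of the second step in the abstract, the sketch below applies several dilated convolutions in parallel to the same input, so that each branch covers a different span of it. The abstract does not state the kernel sizes, the dilation rates, or how the branches are combined. The 1-D input, the all-ones kernel, the rates 1, 2, and 3, and every function name here are therefore assumptions chosen for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Cross-correlate `signal` with `kernel`, with taps `dilation` samples apart.

    Zero padding keeps the output the same length as the input.
    Only odd-length kernels are supported in this sketch.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation  # input index read by this tap
            if 0 <= j < n:  # positions outside the signal count as zero
                acc += w * signal[j]
        out.append(acc)
    return out


def multi_scale_features(signal: Sequence[float], kernel: Sequence[float],
                         dilations: Sequence[int] = (1, 2, 3)) -> List[List[float]]:
    """Run one branch per dilation rate on the same input (assumed rates)."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of input samples spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


if __name__ == "__main__":
    x = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # unit impulse at index 3
    kernel = [1.0, 1.0, 1.0]
    for d, branch in zip((1, 2, 3), multi_scale_features(x, kernel)):
        print(d, receptive_field(len(kernel), d), branch)
```

With a three-tap kernel, the branches with dilation 1, 2, and 3 span 3, 5, and 7 input samples respectively, without adding any weights. This widening of the span at constant cost is the usual motivation for combining several dilation rates.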

