Abstract: | In visual tracking, both convolution and attention are widely used for feature enhancement and fusion. However, convolution operates on local neighborhoods and therefore does not adequately model global dependencies among samples, while attention emphasizes global dependencies at the expense of local ones. It is therefore intrinsically difficult to combine the two methods to integrate global and local information. A recently proposed operator called involution uses kernels that differ across spatial positions but are shared across channels, which makes it possible to take advantage of both convolution and attention. We propose an attention-involution (Att-Inv) model that uses an attention mechanism to generate involution kernels, taking both global and local dependencies of samples into account. To further improve the performance of our tracker, we modify the backbone network, update the template, and regress bounding box distributions. We evaluate our tracker on benchmarks including GOT10k, LaSOT, TrackingNet and OxUvA. Experimental results show that it is competitive with state-of-the-art trackers. |