Transformer tracking with multi-scale dual-attention

Transformer-based trackers greatly improve tracking success and precision rates. The attention mechanism in the Transformer can fully explore the context information across successive frames. Nevertheless, it ignores the equally important local information and structured spatial information, and irrelevant regions may also contaminate the template and search region features. In this work, a multi-scale feature fusion network with box attention and instance attention is designed in a Transformer-based Encoder-Decoder architecture. After feature extraction, the local information and structured spatial information are learned by multi-scale box attention, and the global context information is explored by instance attention. Box attention samples grid features from the region of interest (ROI); it therefore focuses effectively on the ROI and avoids the influence of irrelevant regions during feature extraction. At the same time, instance attention attends to the context information across successive frames and avoids falling into local optima; the long-range feature dependencies are learned at this stage. Extensive experiments on six challenging tracking datasets, including UAV123, GOT-10k, LaSOT, VOT2018, TrackingNet, and NfS, demonstrate the superiority of the proposed tracker MDTT.
In particular, the proposed tracker achieves an AUC score of 64.7% on LaSOT and 78.1% on TrackingNet, and a precision score of 89.2% on UAV123, outperforming the baseline and most recent advanced trackers.


Introduction
Visual tracking has important research significance in computer vision [1]. It has a large number of practical applications, such as video surveillance, autonomous driving, and visual localization. The goal of visual tracking is to use the target information given in a previous frame to predict the position of the target in subsequent frames. Visual tracking still confronts many challenges due to complicating factors in real-world scenes, such as partial occlusion, out-of-view, background clutter, viewpoint change, and scale variation.
Recently, Vaswani et al. [2] first proposed the Transformer, an attention-based architecture for natural language processing. The Transformer explores long-range dependencies in sequences by computing attention weights over triples (i.e., query, key, and value). Based on the excellent ability of the attention mechanism in feature fusion, Transformer structures have been successfully introduced to visual tracking and have achieved encouraging results. Wang et al. [3] propose an encoder-decoder-based tracking framework to explore the rich context information across successive frames. It is a meaningful attempt and achieves great success.
Existing Transformer-based trackers use a CNN (Convolutional Neural Network) as the backbone network for feature extraction. A CNN focuses more on local information and ignores global information and the connections between features. These disadvantages may impact tracking performance, especially in complicated tracking scenes such as severe occlusion, out-of-view, and drastic illumination change, and can even lead to tracking drift or failure. In Transformer-based trackers, the encoder-decoder structure compensates for this deficiency of CNNs, and the global context information is fully explored. However, the structured spatial information is not adequately exploited. How to effectively explore the context information across successive frames without losing useful spatial information becomes a crucial factor in improving tracking performance.
In this paper, a novel multi-scale dual-attention-based tracking method is proposed to further explore structured spatial information. The proposed method is inspired by the encouraging work of TrDiMP [3], which first introduced the Transformer to the tracking field and built a bridge to explore context information across successive frames. Different from TrDiMP, the proposed method uses a novel feature fusion network, which not only explores context information but also fully exploits local information and structured spatial information across successive frames. The proposed method predicts the ROI by applying a geometric transformation to a reference window, so that it can focus more on the predicted regions. In this way, the structured spatial information can be fully explored. In addition, instance attention is introduced to the decoder structure, which focuses more on the global context information across successive frames. This makes the attention module more flexible and able to quickly focus on the region of interest. The proposed tracker performs well on six tracking benchmarks, including UAV123 [4], VOT2018 [5], GOT-10k [6], NfS [7], LaSOT [8], and TrackingNet [9].
In summary, the main contributions of this work are as follows:
• A Transformer-based multi-scale feature fusion network with dual attentions is designed, namely, box attention and instance attention. With the feature fusion network, multiple bounding boxes with high confidence scores are quickly obtained in the Encoder, and the predicted bounding boxes are then refined and obtained in the Decoder.

Siamese-based visual tracking
Recently, trackers based on Siamese networks have achieved a good balance between tracking speed and accuracy. As the pioneering work, SiamFC [10] uses two branches (i.e., a template branch and a search branch) to extract the template image features and search region image features, respectively. It trains an end-to-end tracking network and computes score maps by cross-correlation. SiamFC achieves superior tracking performance on several tracking benchmarks. Based on SiamFC, Dong et al. [11] add a triplet loss to the Siamese network as a training strategy. To save the time of multi-scale testing, SiamRPN [12] introduces the Region Proposal Network (RPN) structure into Siamese tracking. SiamFC and most Siamese-based trackers usually use the shallow AlexNet as the feature extractor. Li et al. [13] propose a layer-based feature aggregation structure to calculate similarity, which helps obtain more accurate similarity maps from multiple layers. Instead of using AlexNet as the backbone, Abdelpakey et al. [14] design a new network structure with Dense blocks that reinforces template features by adding a self-attention mechanism.
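As a concrete illustration, the SiamFC-style similarity matching described above can be sketched as a naive sliding-window cross-correlation between template and search features (a simplified numpy sketch for intuition, not the paper's or SiamFC's actual implementation):

```python
import numpy as np

def cross_correlation(search, template):
    """Naive SiamFC-style cross-correlation: slide the template feature
    map over the search feature map, summing elementwise products to
    produce a similarity score map."""
    Hs, Ws, C = search.shape
    Ht, Wt, _ = template.shape
    out = np.zeros((Hs - Ht + 1, Ws - Wt + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search[i:i + Ht, j:j + Wt] * template)
    return out
```

The peak of the score map indicates the most likely target location; real trackers implement this as a convolution for efficiency.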
Due to the deep convolution operations used in feature extraction, such trackers usually only focus on local regions of images. Additionally, Siamese-based trackers usually adopt the cross-correlation operation as the similarity matching method. It focuses more on local information than global information and easily gets trapped in local optima.

Attention mechanisms in computer vision
In recent years, attention mechanisms have been increasingly used in various fields of computer vision. They focus on important information and ignore irrelevant information. Attention mechanisms can be divided into channel attention, spatial attention, mixed attention, frequency-domain attention, self-attention, and so on.
SENet [15] learns the importance of each channel from the channel dimension. Woo et al. [16] combine channel attention and spatial attention, which effectively helps information transfer across the network by learning to reinforce or suppress relevant feature information. Self-attention uses a particular modeling method to explore the global context information. However, since self-attention needs to capture the global context, it focuses less on local regions. Xiao et al. [17] design a federated learning system, which uses an attention mechanism and long short-term memory to explore the global relationships hidden in the data. Xing et al. [18] propose a robust semi-supervised model, which simplifies semi-supervised learning techniques and achieves excellent performance.
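The channel-attention idea behind SENet can be illustrated with a minimal sketch (the weight shapes and bottleneck size here are illustrative assumptions, not SENet's exact configuration):

```python
import numpy as np

def squeeze_excite(feat, W1, W2):
    """SENet-style channel attention: global-average-pool each channel
    ("squeeze"), pass through a small two-layer bottleneck ("excite"),
    then rescale channels by the resulting sigmoid weights."""
    z = feat.mean(axis=(0, 1))                                  # (C,) squeeze
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0.0))))   # (C,) excite
    return feat * s                                             # channel reweighting
```

Channels judged important receive weights near 1 and are preserved; less relevant channels are suppressed toward 0.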
The Transformer was proposed by Vaswani et al. [2] and first used in NLP (Natural Language Processing). Due to its suitability for parallel computing, it has gradually been adopted in computer vision. Swin Transformer [19] builds a hierarchical Transformer by introducing a hierarchical construction. Based on Swin Transformer, Xia et al. [20] use a flow-field migration strategy so that keys and values focus more on relevant areas, obtaining more context information. Although attention mechanisms are now well equipped to deal with associations between different features, how to combine their advantages to obtain features with stronger representational ability remains an open question.

Transformer in visual tracking
In recent years, Transformer-based tracking algorithms have been proposed and applied in vision fields. Existing Transformer-based trackers use encoder or decoder structures to fuse or enhance the features extracted by a CNN. Wang et al. [3] first apply the Transformer in visual tracking and propose a remarkable tracking method, TrDiMP. TrDiMP uses Transformer encoder and decoder structures to build relationships across successive frames, which explores the rich context information across them. The tracking algorithm in [21] uses a fully convolutional network to predict response maps of the upper-left and lower-right corners and obtains an optimal bounding box for each frame. It does not use any pre-defined anchors for bounding box regression.
Lin et al. [22] propose an attention-based tracking method, SwinTrack. It uses the Transformer for both feature extraction and feature fusion. Zhao et al. [23] use multi-head self-attention and multi-head cross-attention to adequately explore the rich global context instead of using the cross-correlation operation. Inspired by the Transformer, Chen et al. [24] propose a novel attention-based feature fusion network. It directly extracts search region features without using any correlation operations. Mayer et al. [25] propose a Transformer-based tracking structure. It captures global relationships with less inductive bias, enabling it to learn stronger target model predictions. In these Transformer-based trackers, the structured spatial information is not fully exploited.
In this work, a Siamese network architecture is designed. The difference is that an Encoder-Decoder structure is used instead of the cross-correlation layer. Two different efficient attention mechanisms are introduced into the feature fusion network, which can focus more accurately on the region of interest.

Overall architecture
In this section, a novel Transformer-based feature fusion network with dual attentions is designed within a Siamese tracking framework, as shown in Fig. 1.

Box attention in transformer encoder
In this section, the computation of multi-head self-attention is first briefly reviewed. Then, the multi-head box attention is introduced, which can focus more on the region of interest in the feature map. Object proposals with high confidence scores are obtained by geometrically transforming reference windows on the input feature map. By introducing box attention into the Encoder, the proposed tracker MDTT handles appearance variations well, such as occlusion, out-of-view, fast motion, and scale variation.
Multi-head self-attention was first proposed in [2]. Self-attention is computed using the scaled dot product as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where the inputs of the attention function, $Q$, $K$, and $V$, are obtained by linear transformations of the query, key, and value, and $d_k$ is the dimension of the key. The multi-head self-attention (MHSA) with $n$ attention heads is calculated as

$$\mathrm{MHSA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_n)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}),$$

where $W^{O}$ is a learnable projection matrix, $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the per-head projection matrices, and $n$ is the number of attention heads.
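The two formulas above can be sketched directly (a minimal numpy illustration of scaled dot-product attention and MHSA; the weight shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def mhsa(x, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention: project, attend per head, concat, project."""
    d = x.shape[-1]
    d_h = d // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = [attention(Q[:, i * d_h:(i + 1) * d_h],
                       K[:, i * d_h:(i + 1) * d_h],
                       V[:, i * d_h:(i + 1) * d_h]) for i in range(n_heads)]
    return np.concatenate(heads, axis=-1) @ Wo
```

Each head attends in its own subspace of dimension $d_h = d/n$, and the concatenated heads are mixed by the output projection $W^{O}$.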
As shown in Fig. 1, similar to the calculation of MHSA, when computing the attention of the $i$-th head, given the bounding box $b_i \in \mathbb{R}^d$ in the $i$-th head, an $m \times m$ grid feature map $v_i \in \mathbb{R}^{m \times m \times d_h}$ centered on $b_i$ is extracted by bilinear interpolation. After that, the attention on the extracted grid feature map is computed.
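The grid extraction step can be sketched as follows (a simplified numpy version of bilinearly sampling an m × m grid inside a box; the (cx, cy, w, h) box format and single sample per grid cell are simplifying assumptions):

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly interpolate feat (H, W, C) at the continuous point (x, y)."""
    H, W, _ = feat.shape
    x = float(np.clip(x, 0, W - 1)); y = float(np.clip(y, 0, H - 1))
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * feat[y0, x0] + dx * (1 - dy) * feat[y0, x1]
            + (1 - dx) * dy * feat[y1, x0] + dx * dy * feat[y1, x1])

def sample_box_grid(feat, box, m):
    """Extract an m x m grid of features centered on box = (cx, cy, w, h),
    RoIAlign-style: one bilinear sample at the center of each grid cell."""
    cx, cy, w, h = box
    xs = cx - w / 2 + (np.arange(m) + 0.5) * w / m
    ys = cy - h / 2 + (np.arange(m) + 0.5) * h / m
    return np.stack([np.stack([bilinear_sample(feat, x, y) for x in xs])
                     for y in ys])
```

Because the sample points are continuous, the extracted grid avoids the quantization error of integer cropping, which is the property the paper attributes to its RoIAlign-like sampling.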
Here, an important module named Where-to-Attend is applied after generating the $m \times m$ grid feature map $v_i$. This module is an essential part of box attention: it transforms $v_i$ into an attended region through a geometric transformation, so the region of attention can adapt to appearance changes of the target. Finally, the attention weights are generated by computing the matrix multiplication between the query $q$ and the key. Using bilinear interpolation to extract grid features effectively reduces quantization errors in bounding box regression. This operation is essentially identical to RoIAlign [26], which also extracts a finite number of samples within regions of interest. This method captures more accurate target information and obtains more accurate pixel-level information.
After that, the attention scores are normalized by the softmax function. Finally, the box attention output $h_i \in \mathbb{R}^{d_h}$ is obtained by calculating the weighted average of the linearly transformed grid features. When calculating the attention weights, the attention should be focused around the center of the target, and the critical Where-to-Attend module is used. The role of the Where-to-Attend module is to make the box attention focus more on the necessary regions and predict bounding boxes more accurately. The module transforms the reference window of the query vector $q$ into a more accurate region through geometric transformations. It can predict bounding box proposals in the grid feature map using structured spatial information.
As in Fig. 2, $b_q = [x, y, w, h]$ denotes the reference window of $q$, where $x$ and $y$ denote the center coordinates of the reference window, and $w$ and $h$ denote its width and height, respectively.
Here, a translation function $F_t$ is used to convert the reference window $b_q$. $F_t$ takes the query $q$ and $b_q$ as its inputs and adjusts the center of the reference window. The output of $F_t$ is calculated as follows:

$$F_t(q, b_q) = [x + \Delta x, \; y + \Delta y, \; w, \; h],$$

where $\Delta x$ and $\Delta y$ are offsets relative to the center position of the reference window $b_q$. In addition, the reference window $b_q$ is resized by another translation function $F_s$. $F_s$ has the same inputs as $F_t$, and its output is computed as follows:

$$F_s(q, b_q) = [x, \; y, \; w \cdot \Delta w, \; h \cdot \Delta h],$$

where $\Delta w$ and $\Delta h$ are offsets of the size of the reference window $b_q$. The offset parameters $\Delta x$, $\Delta y$, $\Delta w$, and $\Delta h$ are implemented by a linear projection of the query $q$ as follows:

$$[\Delta x, \Delta y] = W_t\, q + [b_x, b_y], \qquad [\Delta w, \Delta h] = \exp\!\left(\frac{W_s\, q + [b_w, b_h]}{\tau}\right),$$

where $W_t$ and $W_s$ are the weights of the linear projections, $\tau$ is a temperature hyperparameter set to 2, and $b_x$, $b_y$, $b_w$, and $b_h$ are bias vectors. The reference window is resized by multiplication, which preserves scale invariance. Finally, the translated reference window is computed with $F_t$ and $F_s$ as follows:

$$b_q' = [x + \Delta x, \; y + \Delta y, \; w \cdot \Delta w, \; h \cdot \Delta h].$$

This completes the box attention calculation for a single head. Furthermore, the calculation is easily extended to multi-head box attention: given multiple attention heads, the boxes of interest $b_i \in \mathbb{R}^d$ in the query $q \in \mathbb{R}^d$ are expanded to a set of $t$ boxes. Next, a grid of features $v_i \in \mathbb{R}^{(t \times m \times m) \times d_h}$ is sampled from each box and the multi-head box attention is computed.

Fig. 2 The Where-to-Attend module. It allows box attention to spotlight the dynamic region of the target and make effective use of limited attention computation
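Under the reconstruction above, the Where-to-Attend transform can be sketched as follows (the exact projection layout is an assumption; the multiplicative, exponential resize follows the scale-invariance remark in the text):

```python
import numpy as np

def where_to_attend(q, box, W, b, tau=2.0):
    """Sketch of the Where-to-Attend transform (names are illustrative):
    a linear projection of the query q predicts four offsets that
    translate the reference window's center (F_t) and rescale its size
    multiplicatively (F_s)."""
    x, y, w, h = box
    dx, dy, dw, dh = W @ q + b          # linear projection of the query
    return (x + dx,                     # F_t: translate the center
            y + dy,
            w * np.exp(dw / tau),       # F_s: multiplicative resize,
            h * np.exp(dh / tau))       #      positive and scale-invariant
```

With zero offsets the window is unchanged, so the module starts from the reference window and learns only the deviation needed to follow the target.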

Instance attention in transformer decoder
The purpose of using box attention in the Encoder is to generate high-quality object proposals. Similarly, instance attention is used in the Decoder to generate accurate bounding boxes. Different from box attention, in the $i$-th attention head, instance attention takes the grid features of the object proposals from the Encoder as input and generates two outputs, $h_i$ and $h_i^{mask}$. Here, only $h_i \in \mathbb{R}^d$ is used for classification to distinguish the foreground from the surrounding background.
Similarly, instance attention is extended to multi-head instance attention with multiple heads. First, $v_i \in \mathbb{R}^{(t \times m \times m) \times d_h}$ is obtained in the same way as in box attention. Before producing $h_i$, the softmax function is used to normalize the $t \times m \times m$ attention scores, which are then applied to $v_i$.
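The joint normalization over all t × m × m grid features can be sketched for a single head as follows (a simplified numpy illustration; the dot-product scoring against the query is an assumption):

```python
import numpy as np

def instance_attention_head(q, v):
    """One instance-attention head, sketched: score every one of the
    t*m*m grid features against the query, normalize all scores jointly
    with softmax, and return the weighted average as the head output."""
    scores = v @ q                                  # (t*m*m,) one score per feature
    scores = scores - scores.max()                  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum() # joint softmax over all features
    return weights @ v                              # (d_h,) attended output h_i
```

Normalizing across all proposals at once lets the head weigh evidence from every proposal's grid jointly rather than within each proposal separately.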

Tracking with box attention and instance attention
As shown in Fig. 1, to enable the model to make full use of sequential information across successive frames, positional encoding is added at the bottom of the Encoder and Decoder as follows:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$

where $d_{model}$ is set to 256, $i$ is the dimension index, and $pos$ is the position. The Encoder encodes the feature maps $\{x_j\}_{j=1}^{t-1}$ ($t = 4$) extracted from the backbone network and obtains the multi-scale contextual representations $\{e_j\}_{j=1}^{t}$. Here, ResNet50 [27] is used as the feature extraction network. In the Transformer structure, each Encoder layer includes box attention, and each feed-forward layer is followed by a normalization layer with a residual connection. The Encoder takes the target template features as input and outputs multiple object proposals with high confidence scores. Experiments show that the Encoder with box attention makes the proposed tracker MDTT more effective in dealing with tracking challenges such as occlusion, scale variation, and fast motion.
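The sinusoidal positional encoding above can be implemented directly, for example:

```python
import numpy as np

def positional_encoding(length, d_model=256):
    """Sinusoidal positional encoding from [2]: sine on even dimensions,
    cosine on odd dimensions, added to the Encoder/Decoder inputs."""
    pos = np.arange(length)[:, None]            # (length, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model/2) dimension index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe
```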
The Decoder predicts bounding boxes and distinguishes the foreground from the background. In the decoder layer, instance attention is used instead of cross-attention. The object proposals from the Encoder are fed into the Decoder, which refines them to obtain more precise proposals. Since the Decoder uses the encoder features with the highest classification scores as its input features, more effective context information is provided to the Decoder. This is crucial for the tracking process, since there is a lot of context information across successive frames.

Fig. 4 Area Under the Curve (AUC) plots on UAV123 for eight challenging aspects: scale variation, partial occlusion, similar object, fast motion, low resolution, aspect ratio change, illumination variation, and camera motion. The proposed MDTT performs well on all these aspects, especially on similar object and fast motion

Experiments
In this section, the implementation details are given first. Then, the proposed tracker MDTT is compared with many recent state-of-the-art trackers on six tracking benchmarks. Finally, an ablation study is conducted and the effects of the key components of the feature fusion network are analyzed.
In particular, the training set and test set of GOT-10k contain different tracking targets. It covers 560 classes of outdoor moving objects in real scenes, and the test set contains 180 video sequences. As with the TrackingNet benchmark, the ground truth of the GOT-10k test set is not publicly available, so the proposed tracker is evaluated through an online evaluation server. The average overlap (AO) and success rate (SR) are used to compare the performance of trackers. As shown in Fig. 6, MDTT is compared against several recent popular trackers, including STARK [21], TrDiMP [3], SuperDiMP [42], SiamLA [49], AutoMatch [43], and TREG [40]. MDTT achieves the best tracking performance on success. Table 1 also presents the tracking performance of the proposed method and recent state-of-the-art trackers on GOT-10k, such as UTT [39], SBT [41], SiamPW-RBO [37], and SiamLA [49]. As shown in Table 1, the proposed tracker outperforms most of them. Compared with the popular tracker TrDiMP, MDTT is higher on AO, $SR_{0.5}$, and $SR_{0.75}$ by 1.6%, 2.5%, and 1.7%, respectively. These results demonstrate that the proposed method adapts to a large number of different scenarios and challenges.
LaSOT [8]: LaSOT is a large-scale and complex single-object dataset. Its test set contains 280 sequences with an average of 2448 frames per sequence. The proposed tracker is evaluated on LaSOT to validate its long-term tracking capability.
Figure 7 shows the success and precision plots of the MDTT tracker and 13 state-of-the-art trackers, ranked according to their AUC and precision scores. From Fig. 7, it can be seen that MDTT achieves the top-ranked AUC score of 64.7% and a precision of 67.5%. Table 2 shows the tracking performance on the success, precision, and normalized precision metrics. Compared with UTT, TrDiMP, and DualTFR, the proposed method improves the AUC score by 0.1%, 0.7%, and 1.2%, respectively. These results demonstrate that MDTT adapts well to long-term tracking and performs well in terms of success, precision, and normalized precision. Table 3 shows more details compared with these trackers. As shown in Table 3, although CGACD achieves an EAO score of 44.9%, MDTT achieves better performance with an EAO of 45.2% and an accuracy of 61.9%, outperforming all other SOTA trackers. In addition, MDTT outperforms the baseline by 1.5% on EAO and 1.9% on accuracy, respectively.

Ablation study and analysis
An ablation study is performed on UAV123 and GOT-10k to verify the effectiveness of box attention and instance attention in the feature fusion network designed for MDTT.
Only using box attention. Here, box attention is used in the Encoder structure, while the Decoder structure still uses cross-attention. MDTT is evaluated on UAV123 and GOT-10k to verify the effect of multi-head box attention. As shown in Tables 6 and 7, when only box attention is used, MDTT improves the success rate by 0.8% and the precision rate by 0.7% on UAV123 compared to the baseline. On GOT-10k, the AO increases by 1.9% compared with the baseline. The experimental results show that box attention plays a key role in the designed structure.
Only using instance attention. The original self-attention is used in the Encoder and instance attention is used in the Decoder. The proposed tracker is evaluated on UAV123 and GOT-10k to verify the influence of instance attention. As shown in Tables 6 and 7, when only instance attention is used, MDTT improves the success rate by 0.6% and the precision rate by 0.5% on UAV123 compared to the baseline. On GOT-10k, the AO increases by 1.5% compared with the baseline.
Compared with the variant that only uses box attention, the variant with only instance attention performs almost the same. Nevertheless, this also demonstrates the effectiveness of instance attention.
Using both. Finally, both box attention and instance attention are introduced to the baseline, and experiments are conducted on UAV123 and GOT-10k. As shown in Tables 6 and 7, the proposed method improves the success rate by 1.7% and the precision rate by 2.0% on UAV123 compared with the baseline. On GOT-10k, the AO increases by 2.6% compared with the baseline. The experimental results show that using both box attention and instance attention greatly improves the tracker's performance. In addition, the proposed method also outperforms TrDiMP, with a 1.6% improvement in AO on GOT-10k and a 1.6% improvement in precision on UAV123.

Speed, FLOPs, and params
ResNet-50 is used as the backbone network of the proposed tracker MDTT. The results in Table 8 are taken from the official websites and the authors' personal homepages. As shown in Table 8, MDTT runs at a real-time tracking speed, and the number of parameters does not increase significantly. In addition, the FLOPs and Params of MDTT are lower than those of SiamRPN++. Although more advanced attention mechanisms are used in the feature fusion network, the increase in FLOPs and Params is not significant. This indicates that the box attention and instance attention allow the tracker MDTT to explore structured spatial information and global information at a low cost.

Limitations and future works
Limitations. Although the proposed feature fusion network is effective, the feature extraction network is not optimized. Figure 9 shows two tracking failure cases of the proposed tracker MDTT on the Bird1 and Car1 video sequences. The tracking results of MDTT and the corresponding ground truths are shown in green and red boxes, respectively. As shown in Fig. 9, when the proposed tracker deals with challenges such as out-of-view and motion blur, it fails to track the targets. This illustrates that the designed feature fusion network is not robust in capturing large appearance variations in some scenes. In addition, the proposed MDTT lacks a target template updating mechanism.
Future works. The proposed tracker captures structured spatial information and global information well. However, when the target disappears or becomes blurred, the structure information captured by the tracker is inaccurate, which leads to tracking drift or failure. In the future, the feature extraction network will be further optimized to improve the feature representation ability. In addition, a target template updating mechanism will be designed to capture target appearance variations. Finally, statistical tests [71] can be used to verify tracking performance; future studies will use such tests for experimental comparisons.

Conclusion
A Transformer-based tracking framework with a multi-scale feature fusion network is proposed. The box attention captures more relevant information and structured spatial information, and pays more attention to the region of interest in template images. Instance attention then exploits the temporal information. By integrating box attention and instance attention in the Encoder-Decoder architecture, the feature fusion network not only attends to the temporal information across successive frames but also focuses more on the ROI, effectively improving the accuracy of classification and regression. The ablation study on UAV123 and GOT-10k verifies the effectiveness of the multi-scale feature fusion network. Experimental results on six challenging tracking datasets show that MDTT outperforms many recent trackers.

Fig. 1
Fig. 1 An overview of the proposed architecture. Given the target template image and the search image in subsequent frames, multi-scale feature maps are extracted from the backbone network. Convolution layers share common network weights. The multi-scale feature maps are then fed into the Encoder-Decoder structure. Unlike TrDiMP, box attention and instance attention are used in the Encoder and Decoder, respectively

Fig. 7 Fig. 8
Fig. 7 Success and precision plots against competitive trackers on the LaSOT dataset

Fig. 9
Fig. 9 Two cases of failure

Table 1
Comparison results of the competing trackers on GOT-10k in terms of average overlap (AO) and success rate (SR). The best two results are highlighted in bold and italic, respectively

Table 2
Results on LaSOT. Trackers are evaluated by the area under the curve (AUC), precision (P), and normalized precision ($P_{Norm}$). The best two results are highlighted in bold and italic, respectively

Table 3
Comparison on VOT2018 in terms of accuracy (A), robustness (R), and expected average overlap (EAO). The best two results are highlighted in bold and italic, respectively

Table 4
Comparison on TrackingNet in terms of the area under the curve (AUC), precision (P), and normalized precision ($P_{Norm}$). The best two results are highlighted in bold and italic, respectively

Table 5
Comparison with SOTA trackers on the NfS and UAV123 datasets in terms of AUC. Bold and italic fonts indicate the top-2 trackers

Table 6
The ablation study on UAV123 in terms of precision (P) and area under the curve (AUC). Box-Att denotes box attention; Ins-Att denotes instance attention

Table 7
The ablation study on GOT-10k in terms of AO, $SR_{0.5}$, and $SR_{0.75}$, respectively. Box-Att denotes box attention; Ins-Att denotes instance attention

Table 8
Comparison about the speed, FLOPs, and Params