Spatiotemporal key region transformer for visual tracking

Visual tracking is an important field of computer vision research. Although transformer-based trackers have achieved remarkable performance, the transformer structure is globally computationally inefficient, it does not screen important patches, and it cannot focus on key target regions. At the same time, temporal motion features are easily overlooked. To solve these problems, this paper proposes a new method, SKRT, that removes the CNN structure and directly uses a transformer as the backbone network to extract multiframe video features. Then, these feature maps are mixed and superimposed to obtain spatiotemporal information. To focus on important parts efficiently, we use key region extraction to obtain a small set of template and search feature map patches and reinput them into the transformer as a cross-correlation computation. Finally, we predict the position of a tracking object through center-corner prediction. To demonstrate the effectiveness of our method, we conduct experiments on challenging benchmark datasets (GOT-10K, TrackingNet, VOT2018, OTB100, LaSOT), and the results show that SKRT is competitive with other state-of-the-art methods.


Introduction
Visual tracking is an important branch in the field of computer vision. Its main research task is to obtain a target in the first frame of a video template and continue to track it in subsequent video frames. At present, various methods of research are mainly applied to the intelligent interaction, autonomous driving, augmented reality, and military fields [1][2][3][4]. Although these methods have made considerable progress, they are still challenged by size changes, rotations, occlusions, fast movements, lighting changes, etc. [5][6][7][8][9][10].
Siamese neural networks, as a general visual tracking structure, are widely used in various tasks. Although they have made a large impact, these frameworks also inherit many defects related to convolutional neural networks (CNNs). Currently, many trackers use CNN structures, which are mainly used as backbone networks and for cross-correlation calculations. However, CNNs not only limit the capture of global features and complex nonlinear interactions between features but also result in the loss of semantic information involving template and search features [11,12]. To address the shortcomings of CNNs, the proposed method uses a transformer to extract and fuse features to better capture global context information and generate more semantic features [13,14]. A transformer, as a global attention network, can effectively focus on long-distance feature relationships [15]. However, the target and the background together contain both useful and useless features. At present, transformer-based tracking methods often pay too much attention to useless background information. Although the patches obtained by using a transformer structure can establish a certain connection between the target and the background, if the patches are too far apart or their correlation response is low, these connections are often redundant and can easily contaminate the features of the tracking object.
In addition, compared with a transformer, the convolution kernel of a CNN is still local, and the performance of a transformer is better than that of a CNN when training with a large number of datasets, which also provides a prerequisite for us to directly replace the CNN with the transformer [13,16]. In many trackers, spatial information is more widely used than temporal information, and temporal information is often overlooked [13]. Although some methods also exploit temporal features, they cannot focus on important temporal regions, lack the interconnection of key locations in multiframe videos, and ignore the tendency of object motion. During the generation of the prediction box, the corner prediction is at the edge of the target and is easily affected by interference information. Although the bounding box can be estimated, it is not sufficiently accurate, and the robustness needs to be improved.
In our work, we need to consider some issues. First, how can a transformer be used to extract spatiotemporal information and establish target temporal motion connections? Second, how can we selectively input transformer patches while maintaining efficiency? Third, how can we select the input patches at the attention highlight position? Fourth, how can we make the predicted box closer to the ground truth box to make the prediction more accurate and robust? A new method named SKRT is proposed, and its performance is validated. Our tracker consists of four parts: a transformer, a spatiotemporal overlay structure, key region extraction and center-corner prediction. As an attention mechanism, a transformer can establish a global relationship, through which it can be used as a cross-correlation operation, thus replacing the previous CNN structure. By stacking the feature maps of 3 consecutive frames, we can highlight the motion trajectories of the target and establish temporal connections. Key region extraction can select the features of the target highlight position, form multiple patches in a certain range of pixel blocks, and input them into the transformer, deepening the connection between the template and the search features and improving the efficiency. Finally, the center-corner prediction of the target is directly estimated to locate it.
In summary, compared with other work, this paper makes four contributions.
In this paper, we replace the backbone network using a CNN with a transformer and use it as a cross-correlation operation. In this way, more global long-distance nonlinear features can be established, and the algorithm performs better under existing dataset training.
To use the transformer architecture to fuse the spatiotemporal information, we overlay feature maps of three consecutive frames to establish motion relationships based on temporal information, which can focus on the temporal variation of important targets.
Many transformer patches are redundant, and they have the potential to negatively impact tracking. Through key region extraction, we can not only extract the key and effective parts but also ensure efficiency.
For a predicted bounding box based on a corner, a center point is added for the prediction, and when the edge of the tracked object encounters interference information, the predicted box is more robust.

Siamese network tracker
In recent years, visual tracking methods based on the Siamese network have been widely used. Extracting features from the template and search branches and then calculating their similarity has become the current mainstream approach. SiamFC [11], which was proposed in 2016, has attracted great attention in the field of visual tracking. This two-branch cross-correlation method had a large impact on subsequent research. Since the SiamFC bounding box is fixed at multiple scales, the tracking accuracy is affected. Thus, an RPN [17] was added, and a new method named SiamRPN [12] was developed. In SiamRPN++ [17], the AlexNet [18] backbone network is replaced with the deeper ResNet [19] backbone network, which can extract and fuse deeper features, improving the accuracy and reducing the number of parameters. SiamDW [20] also focuses on backbone networks. Based on ResNet [19], the outermost pixels of each module feature map are removed to eliminate the padding effect. Gao et al. investigated the impacts of three main aspects of visual tracking, i.e., the backbone network, the attention mechanism, and the detection component, and proposed a Siamese attention keypoint network named SATIN for efficient tracking and accurate localization [21]. PrDiMP [22] incorporates a model predictor in the Siamese network; it predicts the conditional probability density of the target state and trains the regression network by minimizing the KL divergence to improve the performance of the algorithm. Although anchor-based approaches yield good results, anchor-free approaches perform better. Siam R-CNN [23] uses the idea of redetection for tracking and proposes a novel hard example mining method that is specifically trained for difficult distractors. Its tracklet dynamic programming algorithm (TDPA) can simultaneously track all potential targets, including distractors.
In SiamFC++ [24], a classification branch and target motion estimation branch with an unambiguous classification score and a no prior knowledge branch with an estimated quality score were designed, and extensive analysis and extensive research confirmed its effectiveness. SiamCAR [25] adds a discussion of the size of the tracking box, removes the influence of the anchor parameter, and makes the network faster. The anchor-based Siamese tracker has achieved significant progress in terms of accuracy, but further improvements are limited by the robustness of lag tracking. In Ocean [26], a novel object-aware anchor-free network is proposed to solve this problem. It directly predicts the location and scale of target objects in an anchor-free manner and introduces a feature alignment module to learn object-aware features from the predicted bounding boxes. CGACD [27] learns about correlation-guided attention in a two-stage corner detection network, which includes correlation-guided spatial attention in the pixel direction and correlation-guided channel attention in the channel direction, enabling accurate visual tracking. CSART [28] proposes a novel channel and spatial attention-guided residual learning framework for tracking, which can improve the feature representation of Siamese networks by exploiting a self-attention mechanism to capture powerful contextual information.

Vision transformer
The use of a transformer as an attention mechanism first occurred in natural language processing. At present, transformers have also attracted great attention in the field of computer vision. DETR [29] uses a CNN and a transformer to perform end-to-end detection. It obtains the relationship between the target object and the global image context and directly outputs the final prediction result. As a visual classification task, ViT [30] only uses a transformer. It slices the image and builds a sequence as input. When the training dataset is sufficiently large, the accuracy is better than that of a CNN. In the Swin transformer [31], a hierarchical transformer, whose representation is computed by shifting windows, is proposed. The window-shifting scheme improves the efficiency by confining the self-attention computation to nonoverlapping local windows while allowing cross-window connections. On the basis of a vision transformer, Bertasius et al. focused on the time dimension and extended image classification to video classification. This was also the first video classification model that completely abandoned CNNs and only used transformers to build the entire network [32]. Wang et al. [33] proposed a concise and novel transformer-assisted tracking framework. They modified the classic transformer to better explore the transformer's potential and make it more suitable for tracking tasks. TrTr [34] introduces a transformer encoder-decoder architecture, where the explicit cross-correlation between feature maps extracted from templates and search images is replaced by self and cross-attention operations to obtain global and rich contextual correlations. A confidence-based object classification head and a shape-agnostic anchor-based object regression head were developed. At the same time, a plug-in online update module for classification is designed to further enhance the tracking performance. 
TransT [13] uses a transformer to combine template and search area features and designs a feature fusion network based on a self-attention-based self-context augmentation module and a cross-attention cross-feature augmentation module. It adaptively focuses on useful information such as edges and similar objects and establishes associations between distance features, enabling the tracker to obtain better classification and regression results.

Spatiotemporal tracking method
Visual tracking video is composed of multiple frames of images, which increase the temporal dimension and include motion features. Zhang et al. [35] proposed a tracking method that incorporates the spatiotemporal environment, which can enhance the different features of the target and make online adjustments to the target localization based on the background information. Teng et al. [36] proposed a deep temporal and spatial network that can solve sparse optimization problems and collect key historical temporal samples. The temporal network can feed the spatial network back to refine the location of the tracked target. Liu et al. [37] designed a spatiotemporal future prediction method that addresses the occlusion problem by exploiting the current and future possible locations of the target object from its past trajectory. GCT [38] utilizes the appearance model of the spatiotemporal structure to extract the contextual information of historical tracking objects. STGL [39] presents a novel spatiotemporal graph representation and learning model to generate a robust target representation for visual tracking problems. TRAT [40] designs a two-stream network, including a 2D and 3D CNN, and achieves excellent results based on ATOM [41]. In SiamSTM [42], a spatiotemporal matching procedure is proposed to deeply explore the capabilities of four-dimensional matching in space (height, width, and channels) and time. STARK [14] can capture the global features of spatiotemporal information in video sequences. The entire method is an end-to-end method, and it does not require any postprocessing steps, greatly simplifying existing tracking pipelines.
According to previous research, we have learned that visual tracking is a similarity calculation that requires the use of a Siamese network to establish the relationship between a target and background; at the same time, the joint spatiotemporal algorithm helps to predict the motion pose. In addition, the transformer has gradually replaced the CNN as a research hotspot in vision algorithms. To make our related research more suitable for visual tracking, we focus on the following aspects: how to reasonably use a transformer to make it more suitable for visual tracking and how to efficiently select transformer patches and highlight important spatiotemporal information under the premise of ensuring accuracy. To solve these problems, we conduct in-depth research and design a new tracker combined with a transformer.

Methods
We propose a new visual tracker named SKRT, as shown in Fig. 1. It contains 3 important components: a transformer mechanism, key region extraction with spatiotemporal fusion and center-corner prediction. We introduce it below and specifically show the composition of each part.

Transformer
Transformers, as attention mechanisms, can replace CNNs to extract important features [15]. We slice the feature map into multiple patch sequences. From the same sequence, we derive the query (Q), key (K), and value (V) matrices. By comparing Q and K and multiplying the result by V, we obtain the final result, as defined in Eq. 1.
where $Q, K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$. Q and K are multiplied to obtain the similarity between each pair of patches, and the result is divided by $\sqrt{d_k}$ to obtain the scaled attention score. The scaled attention score is then multiplied by V to obtain the weighted sum, which is the final output. Equation 1 is extended by applying different linear projections that map the input to different subspaces, so that the model can understand the input sequence from different perspectives, as shown in Eqs. 2 and 3.
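The attention computation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard scaled dot-product form from [15], including a row-wise softmax normalization that the surrounding text does not spell out, and the function names `softmax` and `scaled_dot_product_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q, K: n x d_k nested lists; V: n x d_v nested list. Returns n x d_v."""
    d_k = len(Q[0])
    d_v = len(V[0])
    output = []
    for q in Q:
        # Similarity between this query patch and every key patch,
        # scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Assumption: scores are normalized with a softmax, as in [15].
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return output
```

When all keys are identical, every patch receives the same weight and each output row is the mean of the value rows. A multihead variant would apply this function to several linearly projected copies of Q, K and V and concatenate the results.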
Fig. 2 shows the structure of the transformer. It contains alternating layers of multihead self-attention and MLP blocks. LayerNorm (LN) is applied before each block, and residual connections are applied after each block. The MLP contains two layers with a GELU nonlinearity. To distinguish the positions in the transformer sequence, we follow DETR [29] and use the sine function to generate the positional encodings, as described by Eqs. 4 and 5.
Here, $X \in \mathbb{R}^{d \times m}$ represents the transformer input sequence, $X_T \in \mathbb{R}^{d \times m}$ represents the transformer output, and $P_x \in \mathbb{R}^{d \times m}$ is the positional encoding. d and m are the number of channels and the sequence length, respectively.
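As an illustration of sine-based positional encodings, the sketch below builds a d × m encoding and adds it to an input sequence of the same shape. It assumes the one-dimensional sinusoidal scheme of [15] with base 10000 and assumes that the encoding is simply added to the input; the exact form of Eqs. 4 and 5 (and the two-dimensional variant used in DETR [29]) may differ. The function names are hypothetical.

```python
import math


def sinusoidal_positional_encoding(d, m, base=10000.0):
    """Return P_x as a d x m nested list (channels x sequence positions)."""
    P = [[0.0] * m for _ in range(d)]
    for pos in range(m):
        for ch in range(d):
            # Channel pairs (2i, 2i + 1) share one frequency.
            freq = base ** (-(2 * (ch // 2)) / d)
            angle = pos * freq
            # Even channels use sine, odd channels use cosine.
            P[ch][pos] = math.sin(angle) if ch % 2 == 0 else math.cos(angle)
    return P


def add_positional_encoding(X, P):
    """Element-wise X + P_x for two d x m nested lists (assumed combination)."""
    return [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, P)]
```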

Key regions extraction
To focus on important spatiotemporal regions, we superimpose the feature maps of 3 frames. As shown in Eqs. 6 and 7, after stacking, $T_F$ represents the template feature maps, $S_F$ represents the search feature maps, and t-1, t-2, and t-3 index the 3 frames whose feature maps are stacked. Over this historical period, the object may move at any time, forming different focuses. We superimpose the historical features to increase the weight of the motion relationship.
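A minimal sketch of the superposition step follows. It assumes a plain, unweighted element-wise sum of equally sized single-channel feature maps; the actual Eqs. 6 and 7 may use a different combination, and `superimpose_frames` is a hypothetical name.

```python
def superimpose_frames(frames):
    """Element-wise sum of equally sized 2D feature maps (lists of rows)."""
    height, width = len(frames[0]), len(frames[0][0])
    for fmap in frames:
        if len(fmap) != height or any(len(row) != width for row in fmap):
            raise ValueError("all feature maps must have the same size")
    # Positions that respond in several frames accumulate a larger weight,
    # so the path followed by a moving target is highlighted.
    return [[sum(f[y][x] for f in frames) for x in range(width)] for y in range(height)]
```

For example, if the strongest response moves one pixel per frame, the stacked map contains a trail of three highlighted pixels along the motion direction.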
As shown in Fig. 3, we obtain M patches, each containing one of the highest-weight pixels in $T_F$, and around each of them we select a square region whose length and width are $S_{hw}$. Similarly, we obtain N patches, each containing one of the highest-weight pixels in $S_F$, and around each of them we also select a square region whose length and width are $S_{hw}$. To minimize overlap, we do not choose center points within regions that have already been extracted. Finally, we concatenate the selected region matrices and use them as the input to the transformer. As described in Eq. 8, $T_F^{S_{hw} \times S_{hw}}$ and $S_F^{S_{hw} \times S_{hw}}$ represent the key regions extracted from the template and search feature maps, respectively. In this paper, we set $S_{hw} = 5$, M = 6, and N = 12.
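The selection procedure described above can be sketched as a greedy loop. This is an illustration under stated assumptions rather than the authors' code: the feature map is a single-channel 2D response map, windows are clamped at the image border, and the name `extract_key_regions` is hypothetical.

```python
def extract_key_regions(fmap, num_regions, size):
    """Greedily pick up to num_regions size x size windows around the
    highest-response pixels of a 2D map, skipping pixels that already lie
    inside an extracted window. Returns a list of ((row, col), patch)."""
    height, width = len(fmap), len(fmap[0])
    if size > height or size > width:
        raise ValueError("region size exceeds the feature map")
    half = size // 2
    covered = [[False] * width for _ in range(height)]
    # All pixels, sorted by descending response.
    order = sorted(
        ((fmap[y][x], y, x) for y in range(height) for x in range(width)),
        key=lambda item: -item[0],
    )
    regions = []
    for _, cy, cx in order:
        if len(regions) == num_regions:
            break
        if covered[cy][cx]:
            # Center points inside an extracted region are not chosen.
            continue
        # Clamp the window so that it stays inside the map (assumption).
        top = min(max(cy - half, 0), height - size)
        left = min(max(cx - half, 0), width - size)
        patch = [row[left:left + size] for row in fmap[top:top + size]]
        regions.append(((cy, cx), patch))
        for y in range(top, top + size):
            for x in range(left, left + size):
                covered[y][x] = True
    return regions
```

With the settings reported in the text, this would be called as `extract_key_regions(template_map, 6, 5)` and `extract_key_regions(search_map, 12, 5)`, and the two patch lists would then be concatenated into one transformer input sequence.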

Center-corner prediction
The FCN consists of Conv-BN-ReLU layers, and it outputs three probability maps: one each for the top-left corner, the center, and the bottom-right corner of the prediction box. We denote them as $P_t(x, y)$, $P_c(x, y)$ and $P_r(x, y)$, respectively, as shown in Eqs. 9, 10 and 11. $(x_t, y_t)$, $(x_r, y_r)$ and $(x_c, y_c)$ represent the corner and center coordinates of the prediction box. The target box could be predicted using only the corner points. However, to increase the influence of the center point and improve the tracking accuracy, as shown in Fig. 4 and Eq. 12, we calculate the midpoint of the two corner points and average it with the predicted center point, where $(x_{pc}, y_{pc})$ is the resulting center point of the prediction box and n = 0.5. Keeping the length and width unchanged, the final bounding box is obtained by applying the resulting offset. As shown in Eq. 13, we use the GIoU loss [43] and the $L_1$ loss [44], where $b_i$ is the ground truth box, $\hat{b}_i$ is the predicted box, and $\gamma_{GIoU}$ and $\gamma_{L_1}$ are hyperparameters.
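The center-corner fusion and the combined box loss can be illustrated as follows. This sketch makes several assumptions that the text does not confirm: the weight n is applied to the corner midpoint and 1 - n to the predicted center (with n = 0.5 the two choices coincide), boxes are given as (x1, y1, x2, y2) with positive area, and the $L_1$ term is the sum of absolute coordinate differences. The function names are hypothetical.

```python
def fuse_center_corner(top_left, bottom_right, center, n=0.5):
    """Shift a corner-predicted box so that its center becomes the blend of
    the corner midpoint and the directly predicted center."""
    (xt, yt), (xr, yr), (xc, yc) = top_left, bottom_right, center
    # Midpoint implied by the two corner predictions.
    xm, ym = (xt + xr) / 2.0, (yt + yr) / 2.0
    # Blend with the predicted center; n = 0.5 gives a plain average.
    xpc = n * xm + (1.0 - n) * xc
    ypc = n * ym + (1.0 - n) * yc
    # Width and height stay unchanged; only the offset is applied.
    dx, dy = xpc - xm, ypc - ym
    return (xt + dx, yt + dy, xr + dx, yr + dy)


def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Area of the smallest box enclosing both inputs.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose


def box_loss(pred, gt, gamma_giou, gamma_l1):
    """Weighted sum of the GIoU loss (1 - GIoU) and the L1 loss."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return gamma_giou * (1.0 - giou(pred, gt)) + gamma_l1 * l1
```

The weights `gamma_giou` and `gamma_l1` are left as required arguments because their values are not stated in the text.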

Implementation details
Our tracker is implemented using Python and PyTorch. SKRT training is conducted on four GeForce RTX 2080Ti GPUs with 11 GB of memory each. The network architecture has 30.9 M parameters and 19.57 G FLOPs. We pretrain the network with ImageNet-22k [45]. The training datasets include LaSOT [46], GOT-10k [47], TrackingNet [48], COCO [49] and ImageNet VID [45]. We resize the template and search images to 128×128 and 320×320, respectively. AdamW [50] is used as our optimization method. During training, the learning rate is set to 1e-5, and the weight decay is set to 0.0005. We train for a total of 500 epochs. The learning rate decreases by a factor of 10 after 400 epochs. In the online tracking stage, we use the latest frame t as the tracking frame. We create a 3-frame video sequence, and if fewer than 3 frames are available, we duplicate the initial frames to complete the sequence. The template frame directly affects the tracking accuracy, and the shape and appearance of the template target change over time, which requires a robust template.
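Two of the details above lend themselves to a short sketch: building the 3-frame window during online tracking and the step learning-rate schedule. Both functions are illustrative assumptions; in particular, padding at the front with copies of the first frame and treating epochs as zero-indexed are interpretations of the text, and the names `frame_window` and `learning_rate` are hypothetical.

```python
def frame_window(frames, length=3):
    """Return the most recent `length` frames; if fewer are available,
    pad at the front with copies of the initial frame."""
    if not frames:
        raise ValueError("at least one frame is required")
    window = list(frames[-length:])
    padding = [frames[0]] * (length - len(window))
    return padding + window


def learning_rate(epoch, base_lr=1e-5, drop_epoch=400, factor=10.0):
    """Step schedule: base_lr for the first drop_epoch epochs (zero-indexed),
    then base_lr / factor for the remaining epochs."""
    return base_lr if epoch < drop_epoch else base_lr / factor
```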
Tracking template update
Following STARK [14], we design an online template update mechanism. As shown in Fig. 5, transformer cross-correlation operations are performed on the initial frame, the new frame, and the template frame to generate a confidence score. When the score is higher than the threshold and the update interval is reached, the template is dynamically updated. We use cross-entropy as the loss function to optimize the similarity, as shown in Eq. 14.
where $y_i$ is the ground truth label and $p_i$ is the predicted confidence.
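The update rule and the loss can be sketched as follows. This is a hedged illustration: the loss is the standard binary cross-entropy averaged over samples, which may differ in sign convention or normalization from Eq. 14, and the threshold and interval are left as arguments because their values are not given in the text. The function names are hypothetical.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between labels y_i and confidences p_i."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clamp to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def should_update_template(score, frame_index, threshold, interval):
    """Update only when the confidence exceeds the threshold and the frame
    index has reached a multiple of the update interval (assumed rule)."""
    return frame_index > 0 and frame_index % interval == 0 and score > threshold
```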

Comparison to state-of-the-art trackers
Experiments on the GOT-10K dataset
GOT-10K [47] is a large object tracking dataset that contains 9335 training videos and 180 testing videos in total. To ensure that trained models generalize well, there is no overlap between the training set and the test set. As shown in Fig. 6 and Table 1, we compare our method with other state-of-the-art methods, including TransT [13], STARK [14], TrDiMP [33], SiamRCNN [23], FCOT [51], Ocean [26], SiamFC++ [24], ATOM [41], SiamRPN++ [17], DaSiamRPN [52], SiamFC [11] and MDNet [53]. The AO, $SR_{0.5}$ and $SR_{0.75}$ of SKRT reach 0.728, 0.837 and 0.688, respectively, ranking first among all methods. The FPS values in Table 1 are measured on the hardware used by each of the listed methods. Compared to other trackers using the transformer structure (TransT [13], STARK [14], TrDiMP [33]), our method achieves the best results. This is because we process spatiotemporal information in a fused manner, and at the same time, we extract the patches of key parts, making the training more focused on the highlighted locations and thus more efficient and refined. Figure 7 shows a visual comparison of SKRT with STARK [14] and TransT [13]. Our method can pay attention to the feature information of boundary locations, while the motion weights improve the temporal prediction of objects. Center-corner prediction is more robust, so the predicted bounding box of our tracker is closer to the ground truth bounding box. In addition, Table 2 shows that the efficiency and performance of SKRT are more balanced than those of STARK [14] and TransT [13]. This is because we remove the CNN backbone with multiple residual layers and add the key region extraction component, which has much lower algorithmic complexity, resulting in a significant reduction in params. Key region extraction focuses on the local features of the tracking object, similar to the local advantages of CNN structures, and the tracking is best when it is combined with transformer structures.
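For readers unfamiliar with the AO and SR metrics reported above, the following sketch shows how they can be computed from per-frame overlaps. It is a simplified per-sequence illustration under the usual definitions (AO as the mean overlap, SR as the fraction of frames whose overlap exceeds a threshold); the official GOT-10K evaluation toolkit may aggregate results differently, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_overlap(overlaps):
    """AO: mean overlap between predicted and ground truth boxes."""
    return sum(overlaps) / len(overlaps)


def success_rate(overlaps, threshold):
    """SR: fraction of frames whose overlap exceeds the threshold."""
    return sum(1 for o in overlaps if o > threshold) / len(overlaps)
```

For the overlaps 0.8, 0.6, 0.4 and 0.9, this gives an AO of 0.675, an $SR_{0.5}$ of 0.75 and an $SR_{0.75}$ of 0.5.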
Experiments on the TrackingNet dataset
TrackingNet [48] is a large-scale tracking dataset whose videos are sampled from YouTube. It provides more than 30K videos with more than 14 million dense bounding box annotations, covering various object classes and scenes. Table 3 shows that SKRT achieves the best performance on this large-scale benchmark.
Experiments on the VOT2018 dataset
The VOT2018 [56] benchmark contains 60 challenging videos, including videos with fast motion, deformation, occlusion, etc. The benchmark uses three metrics. The expected average overlap (EAO) is the expected no-reset overlap of a tracker on a short image sequence. The accuracy (A↑) is the average overlap rate of the tracker on a single test sequence. The robustness (R↓) is the number of tracker failures on a single test sequence, where a failure is declared when the overlap rate drops to 0. As shown in Table 4, SKRT achieves competitive results. Among trackers that use an attention mechanism, SKRT obtains better results than the others (CGACD [27], etc.), which reflects the superiority of our method. First, a transformer is a multihead attention patch structure that captures richer feature information, expands the receptive field of the image and extracts more context information than traditional spatial and channel CNN attention methods. Second, the key region extraction component obtains patches with high response values, which optimizes the patches input into the transformer, making them more refined and efficient.
Experiments on the LaSOT dataset
LaSOT [46] is a large-scale dataset for long-term tracking, and its test set has 280 videos, with an average of 2512 frames per video. Because it focuses on long-term tracking, it is more difficult than other datasets. In the LaSOT [46] experiment, SKRT is compared with STARK [14], TransT [13], TrDiMP [33], Ocean [26], GlobalTrack [58], DiMP [59], SiamCAR [25], DaSiamRPN [52], ATOM [41], SiamRPN++ [17], C-RPN [60] and SiamDW [20]. As shown in Fig. 9, our method achieves the best results, ranking first for both the precision and success plots, with scores of 0.678 and 0.681, respectively. This is because the transformer performs better than the CNN with the current training datasets, and the novel structure of our proposed method is more suitable for the test tracking experiments than that of other trackers. As shown in Figs. 10 and 11, we compare all methods based on 14 attributes of the dataset, including the aspect ratio change, background clutter, camera motion, deformation, fast motion, full occlusion, illumination variation, low resolution, motion blur, out-of-view, partial occlusion, rotation, scale variation, and viewpoint change. Although SKRT achieves excellent performance in most attribute experiments, there are also some shortcomings. In terms of deformation, rotation and out-of-view, the results of SKRT are not as good as those of STARK [14], and in terms of illumination variation, the success score of SKRT is not as good as that of TrDiMP [33]. To determine the reasons for these results, we analyze them from several aspects. First, the temporal method of continuous frame extraction generates a certain inertia when predicting the target position. If the tracked object undergoes irregular changes, such as disordered deformation and rotation, the temporal method cannot predict such abrupt changes, which eventually harms the prediction.
In addition, illumination changes cause the target color feature to change rapidly. The fusion of multiple frames makes this feature less similar to the template, and the superimposed highlights cause more interference. The online template update cannot adapt to this change, resulting in inaccurate tracking. Second, since the transformer sequence selected by our method lies at positions with high response values in the feature map, the selected sequence ignores nonfocus regions. When the tracked object disappears from view, the influence of the nonfocus regions increases, because they help to establish a relationship with the target. These relationships are useful for relocating the tracked object when it reappears. To improve efficiency, our method removes these nonfocus regions, which hurts performance on the out-of-view attribute.

Ablation studies
To verify the effects of various parts of SKRT, we perform ablation experiments with the LaSOT [46] and GOT-10K [47] datasets.
Component parameter setting experiment
Key region extraction is an important module in SKRT, and the number of key patches affects the final result. To determine the appropriate number, we vary M and N; as shown in Table 5, the results are the best when M = 6 and N = 12. This is because when the number is too small, there are few features and it is difficult to extract effective information, whereas when the number is too large, the features are redundant and interference information is extracted.
Defect improvement experiment
As shown in Table 6, to test the influence of the number of input frames on the experimental results, we run experiments with 1-5 frames. The results are best when 3 frames are used. When fewer than 3 frames are used, the limited motion information means that temporal prediction features are lacking, which causes the results to drop. When more than 3 frames are used, much of the motion information is redundant and introduces unnecessary noise features into the prediction, so the results also decrease.
To verify the effect of center-corner prediction compared with corner-only prediction, we perform ablation experiments. As shown in Table 7, center-corner prediction gives the best performance. This is because the center position retains most of the important features of the tracking object, and adding the center point to the corners enhances the influence of the object center position. The information collected in the top-left and bottom-right corners can provide more identifiable information for the central region. If only the corners are used, the central information may be ignored. To fuse the influences of these positions and increase the prediction accuracy, we take their average, which gives the best experimental results. As shown in Table 8, to verify the impacts of the CNN, transformer and key region extraction (KRE), we also conduct ablation experiments. With a large amount of training data, when we replace the transformer structure with a CNN, the success score is 0.568, which shows that the transformer can establish global long-distance relationships and performs better than the CNN. When we only use the transformer and do not select key regions, the attention can easily focus on meaningless background features, causing the weight of the key regions and the tracking results to drop. When we use the CNN (ResNet-101) as the backbone network and combine it with the transformer, although the CNN helps to increase local attention, the experimental improvement is not significant. To address these defects of the transformer input, we design the KRE component. KRE reduces the redundancy of the sequence input to the transformer, focuses more on the parts related to the tracking target, improves efficiency and achieves the best tracking results.
To verify the efficiency of SKRT when combining different components, we calculate the params and FLOPs and test the speed (FPS) with the GOT-10K dataset. As shown in Table 9, using only the transformer gives the lowest complexity, the lowest computation cost and the fastest tracking speed. When we add a CNN (ResNet-101) and combine it with the transformer as the backbone network, the multilayer residual structure of the CNN greatly increases the algorithm complexity, and the tracking speed is the lowest. Because the model complexity and computation cost of key region extraction (KRE) are much lower than those of the CNN and the efficiency of the input patches is improved, our method can run in real time at a tracking speed of more than 50 FPS. Fig. 12 shows the attention weights after the key regions are input to the transformer. We can see that the attention regions are mainly focused on the target, and the interference of the surrounding background information is relatively small. At the same time, the motion features of the tracked objects are obvious, and the target is highlighted to make the tracking more accurate.

Visualization
To demonstrate that KRE remedies the defects of the plain transformer model, we visualize the feature maps. As shown in Fig. 13, we first remove the KRE component and use only the transformer structure. Because its attention is global, the target region of interest is too large, and many locations with little relevance to the target receive high response values, resulting in redundancy. In addition, the transformer focuses on interfering objects that are similar to the target, which adds irrelevant response values. When we add the KRE component, the focus is on the most significant regions of the target, and the response is concentrated efficiently on the relevant parts, eliminating the impact of similar backgrounds.

Conclusions
This paper proposes a new tracker named SKRT, which is a Siamese network structure. It uses a transformer instead of a CNN as the backbone network, which can extract global context information more efficiently. At the same time, three frames of feature maps are superimposed to obtain the spatiotemporal information of the tracking target. Before the transformer similarity calculation is performed on the template and the search feature maps, key region extraction concentrates on the key regions of interest and ignores background interference information, making the method more efficient. Finally, the bounding box is predicted by center-corner prediction, which improves robustness. Our method is tested with the GOT-10K, TrackingNet, VOT2018, OTB100 and LaSOT datasets, and the results show that SKRT achieves competitive results. In future work, we hope to conduct more in-depth research on the transformer to increase the weights of relevant features in the extraction of patch sequences, ignore the parts with low response values, improve the network efficiency, and make its structure more suitable for visual tracking.