Introduction

Visual tracking is a fundamental task in computer vision that aims to estimate the state of a target in each frame of a video sequence. It has extensive practical applications, such as intelligent driving, human-computer interaction and video surveillance. Despite the significant progress achieved in recent years, visual tracking remains an open problem due to challenging factors such as out-of-plane deformation, illumination variation and motion blur.

Deep convolutional neural networks (CNNs) have superior performance in feature learning. Building on the strength of CNN features, Siamese-based trackers such as SiamCDA [1], SiamFC [2], SiamRPN [3] and SiamGAT [4] have been proposed and achieve state-of-the-art tracking performance. These trackers first extract features in the template and search branches to obtain feature maps, and then use cross-correlation to compute the similarity between them. The Siamese backbone network and the cross-correlation operation thus play critical roles in Siamese-based trackers. Despite great progress in tracking performance, two disadvantages remain: (1) In a traditional CNN, the features of input images are extracted by a backbone network with convolutional kernels of fixed sizes. When the scale of the template target changes drastically, the template features may contain background information or miss foreground information, leading to drift during tracking. (2) The correlation operation is a linear fusion for computing the similarity between the template and a search region. It therefore tends to lose semantic information and fall into local optima, and it cannot capture the complicated non-linear interaction between the template and search branches.

Fig. 1 Overview of the proposed tracking framework. We use ResNet50 with feature recognizers (FR) for feature extraction. The multi-scale feature fusion network performs correlation operations on the template feature maps. At first, the proposed multi-scale feature fusion network performs linear projections of the template features and encoder features to obtain image patches of different sizes and classification (CLS) information. Then, cross-attention fuses the CLS information and the image patches from both branches. Finally, stronger image features are produced. The internal details of the cross-attention operation are presented in Sect. 3.3. In the Transformer decoder, we obtain a score map for locating the object

To address the above issues, as shown in Fig. 1, we first modify ResNet50 [5] by adding a Feature Recognizer (FR) after the conv1, conv2\(\_\)x and conv3\(\_\)x blocks in the template and search branches, respectively. The FR generates a 3D attention map that focuses on where and what the important elements are, and dynamically adjusts the weights of the target features. Powerful target features are thus obtained for the subsequent feature fusion and tracking prediction. Inspired by Vision Transformers (ViT) for image classification [6], we further propose a novel tracking algorithm based on a multi-scale feature fusion network (MSF) in a Transformer.

In the template branch, the features of the template patches are fed to the Transformer encoder, and the MSF combines the template features with the corresponding encoded features at different scales. In the search branch, the features of the search patches and the encoded features are fed to the Transformer decoder, and score maps are obtained for locating the targets. We evaluate the proposed SiamFMT algorithm on six benchmarks, including LaSOT [7], GOT-10k [8], OTB-100 [9], TrackingNet [41], UAV123 [10] and VOT2018 [11]. The proposed tracking algorithm achieves superior tracking performance. The main contributions are summarized as follows:

  • We propose a feature recognizer (FR) module and construct hierarchical feature extraction networks by placing the module after different convolutional blocks. The FR focuses on the locations and contents of important elements and obtains robust object features by dynamically adjusting the object feature weights.

  • We propose a multi-scale feature fusion network based on cross-attention to enhance the feature representation ability. Compared with cross-correlation-based methods, our method improves the non-linear interaction between the template and search branches and establishes associations among features at different scales.

  • Extensive experiments on six challenging benchmarks demonstrate that the proposed tracker outperforms many state-of-the-art trackers. In particular, it achieves leading tracking performance on the large-scale datasets TrackingNet and GOT-10k, as well as on UAV123.

Related work

In this section, we briefly review some related methods and techniques including Siamese network-based visual tracking, attention mechanism and Transformer for visual tracking.

Siamese network-based visual tracking

In recent years, Siamese network-based trackers have drawn much attention due to their balance of accuracy and speed [12, 13]. SiamFC, a pioneering work, adopts fully convolutional Siamese networks for feature extraction and utilizes a cross-correlation layer to combine feature maps from the template and search branches. The cross-correlation layer convolves the template features over the search region to obtain response maps. Based on SiamFC, DSiam [14] learns the target appearance variation via an online transformation learning model. SA-Siam [15] utilizes Siamese networks to train a semantic branch and an appearance branch; similarities are computed on the semantic features and the appearance features separately, and the final response map is obtained by combining the two. However, these tracking methods require multi-scale testing to cope with variations in target appearance.

To obtain more accurate tracking results, Li et al. first apply the region proposal network (RPN) to the tracking task and propose the Siamese region proposal network-based tracker (SiamRPN) [3]. In SiamRPN, the Siamese network is followed by two subnetworks, i.e., a classification branch and a regression branch. The classification branch discriminates the target from the surrounding background, and the regression branch refines the output box. Based on SiamRPN, Zhu et al. [16] investigate accurate and long-term tracking with a distractor-aware module. Fan et al. [17] propose to cascade a set of RPNs (C-RPN) from deep high-level layers to shallow low-level layers in Siamese networks. The discriminability of C-RPN is further improved by feature transfer blocks that make full use of multi-level features for each RPN, while exploiting high-level semantic and low-level spatial information.

Apart from deepening the Siamese networks, researchers have proposed anchor-free trackers, such as SiamBAN [18], SiamFC++ [19] and SiamCAR [20], to eliminate the negative effects of anchors. These anchor-free trackers treat tracking as a joint classification and regression problem: one or more prediction heads predict target locations and regress bounding boxes from the response maps in a pixel-wise manner. Guo et al. find that traditional cross-correlation operations retain a large amount of background information, which may cause target features to be misclassified. To solve this issue, they propose a target-aware Siamese Graph Attention network for general object tracking (SiamGAT) [4]. SiamGAT uses a bipartite graph-based matching mechanism between the template features and the search image features.

Attention mechanism

Attention mechanisms have been introduced into computer vision for the dynamic adjustment of feature weights. Hu et al. [21] propose SENet, which pioneered channel attention. An SE block consists of a squeeze module and an excitation module: the squeeze module collects global spatial information and the excitation module captures channel-wise relationships to improve the representation ability of the network. Park et al. [22] propose a simple and lightweight attention module placed at the bottlenecks of CNNs. Efficient attention maps are generated by learning channel and spatial attention, which improves the representational power of the network while reducing the computational cost. Yang et al. [23] propose Gated Channel Transformation (GCT). Unlike previous methods, GCT collects global information by computing the \(L_2\) norm of each channel. It is also lightweight and can be added to every convolutional layer of a CNN.

Attention mechanisms have also been used successfully in visual tracking. Fan et al. [24] propose discriminative spatial attention for short-term visual tracking. Choi et al. [25] introduce an attention mechanism into correlation filter networks for object tracking. CSR-DCF [26] introduces the concept of channel and spatial reliability into discriminative correlation filters. Wang et al. [27] propose a Residual Attentional Siamese Network (RASNet) for object tracking, which includes general attention, residual attention and channel attention. RASNet not only mitigates the overfitting problem in deep network training, but also improves the discriminative capability and adaptability of the network. Yu et al. [28] propose a Deformable Siamese Attention Network (SiamAttn). SiamAttn learns context information through spatial attention and cross-attention, and aggregates rich contextual correlations between the template and search branches. To better exploit the feature extraction capability of Siamese networks, we add a Feature Recognizer (FR) to a traditional CNN to improve the feature attention capability of the backbone network. More details are presented in Sect. 3.

Transformer for visual tracking

Vaswani et al. [29] first propose the Transformer based on the self-attention mechanism. Benefiting from its high representation ability, the Transformer has been applied to visual tracking [30]. Wang et al. [31] introduce the Transformer to object tracking and present a novel transformer-assisted tracking framework (TrDiMP). To better suit the tracking task, TrDiMP includes encoder and decoder branches: the Transformer encoder generates a high-quality tracking model and the Transformer decoder searches for the target.

Fig. 2 The basic Siamese tracking framework

Chen et al. [32] propose a feature fusion network based on a self-attention module and a cross-feature module instead of the traditional correlation operation. The ego-context augment (ECA) module enhances the contextual information of the input, and the cross-feature augment (CFA) module adaptively fuses features from both branches. To improve localization accuracy in complex scenes and enhance the performance of transformers in vision tasks, Cao et al. [33] propose an efficient hierarchical feature transformer (HiFT). HiFT feeds the similarity maps generated by multiple convolutional layers into the feature transformer, achieving an interactive fusion of spatial and semantic information.

In contrast to the traditional transformer-assisted tracking framework, Lin et al. [34] propose a fully attention-based transformer tracker (SwinTrack) that uses the Transformer for both feature extraction and feature fusion. SwinTrack, consisting of a backbone network and a feature fusion network, introduces IoU-aware classification scores into the prediction branch to select more accurate bounding box predictions. Xie et al. [35] propose a Siamese-like dual-branch network (DualTRF). Each branch of DualTRF consists of local attention blocks and global attention blocks, and cross-attention blocks fuse features between the template and search branches. Subsequently, to make the tracking model more flexible, Xie et al. [36] propose a single-branch transformer for tracking (SBT) based on DualTRF. SBT embeds cross-image feature associations in multiple layers of the feature network, which can suppress non-target features and achieve instance-level feature extraction. In addition, SBT is the first work to propose a specialized target-dependent feature network for VOT. Cui et al. [30] propose a tracking framework based on a mixed attention module (MixFormer). MixFormer constructs a feature extraction network by simply stacking multiple mixed attention modules. It can extract target-specific discriminative features and communicate extensively between the target and the search region, resulting in highly efficient tracking performance.

Method

In this section, we describe the proposed SiamFMT framework. As shown in Fig. 1, SiamFMT consists of a Siamese backbone network, a multi-scale feature fusion network and prediction heads. The Siamese backbone network extracts the features of the template image and search images with shared weights. The proposed feature fusion network then propagates a large amount of information from the target template to the search regions.

Overview of Siamese tracking framework

Before describing the proposed tracking algorithm in detail, we briefly review recent popular tracking methods. As shown in Fig. 2, a Siamese network-based tracker consists of a backbone network, a feature fusion network and a prediction head. The mainstream feature fusion methods are mainly correlation-based and Transformer-based networks.

The Siamese network architecture [2] takes the template image z and the search image x as inputs and extracts the corresponding features with weight-shared CNNs. The feature maps are then combined by the correlation operator as follows:

$$\begin{aligned} F(z, x)=\varphi (z) * \varphi (x)+b\cdot 1 \end{aligned}$$
(1)

where \(*\), \(\varphi (\cdot )\) and \(b\cdot 1\) denote the correlation operator, convolutional operations and bias term, respectively.
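
As a concrete illustration of Eq. (1), the following minimal PyTorch sketch treats the template feature map as a convolution kernel slid over the search feature map, as in SiamFC; the tensor shapes and the helper name are illustrative, not taken from any released code.

```python
import torch
import torch.nn.functional as F

def cross_correlation(z_feat: torch.Tensor, x_feat: torch.Tensor, b: float = 0.0) -> torch.Tensor:
    """Eq. (1): slide the template features phi(z) over the search features phi(x)."""
    # z_feat: (1, C, Hz, Wz) template features, x_feat: (1, C, Hx, Wx) search features
    response = F.conv2d(x_feat, z_feat)        # (1, 1, Hx - Hz + 1, Wx - Wz + 1)
    return response + b                        # the bias term b * 1 broadcasts over the map

z = torch.randn(1, 256, 6, 6)                  # phi(z)
x = torch.randn(1, 256, 22, 22)                # phi(x)
print(cross_correlation(z, x).shape)           # torch.Size([1, 1, 17, 17])
```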

The basic Transformer-based tracking framework uses a Transformer instead of the original correlation operation for feature fusion. Both the template and search images are fed into a CNN backbone for feature extraction. The features from the two branches are then sent to two parallel branches of a Siamese-like network consisting of the Transformer encoder and decoder. The core component of the Transformer is self-attention. The attention function is a scaled dot-product attention:

$$\begin{aligned} {\text {Atten}}(Q, K, V)={\text {softmax}}\left( \frac{Q K^{T}}{\sqrt{d_{k}}}\right) V, \end{aligned}$$
(2)

where Q, K and V are the query, key and value matrices, and \(d_{k}\) is the dimension of the keys. As described in [37], linear projections and multi-head attention are introduced so that the mechanism can focus on different aspects of the information. The multi-head variant is defined as follows:

$$\begin{aligned} {\text {MultiHead}}(Q, K, V)={\text {Concat}}\left( H_{1}, \ldots , H_{n_{h}}\right) W^{O}, \end{aligned}$$
(3)
$$\begin{aligned} H_{i}={\text {Attention}}\left( Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) , \end{aligned}$$
(4)

where \(W_{i}^{Q} \in {\mathbb {R}}^{d_{\text{ model }} \times d_{k}}\), \(W_{i}^{K} \in {\mathbb {R}}^{d_{\text{ model }} \times d_{k}}\), \(W_{i}^{V} \in {\mathbb {R}}^{d_{\text{ model }} \times d_{v}}\) and \(W^{O} \in {\mathbb {R}}^{n_{h} d_{v} \times d_{\text{ model }}}\) are parameter matrices, and \(n_{h}\) is the number of heads. In this work, we set \(n_{h}\), \(d_{\text{ model }}\), \(d_{v}\) and \(d_{k}\) to 8, 512, 64 and 64, respectively.
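
The sketch below spells out Eqs. (2)-(4) in PyTorch with the settings listed above (\(n_{h}=8\), \(d_{\text{model}}=512\), \(d_{k}=d_{v}=64\)); it is a generic multi-head attention layer for illustration, not the exact layer used in our implementation.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads   # d_k = d_v = 64
        self.w_q = nn.Linear(d_model, d_model)                 # packs W_i^Q for all heads
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)                 # W^O

    def forward(self, q, k, v):
        B, Lq, Lk = q.size(0), q.size(1), k.size(1)
        # Project and split into heads: (B, n_heads, L, d_k)
        q = self.w_q(q).view(B, Lq, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(B, Lk, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(B, Lk, self.n_heads, self.d_k).transpose(1, 2)
        # Eq. (2): scaled dot-product attention per head
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Lq, -1)    # concatenate heads
        return self.w_o(out)                                   # Eq. (3)

mha = MultiHeadAttention()
x = torch.randn(2, 100, 512)
print(mha(x, x, x).shape)                                      # torch.Size([2, 100, 512])
```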

Siamese backbone network

Convolutional neural networks such as ResNet [5], VGGNet [38] and AlexNet [39] have been successfully applied in Siamese network-based trackers and achieve robust tracking performance.

In the proposed tracker, we modify ResNet50 as the Siamese backbone network by adding Feature Recognizers. First, existing CNN models significantly increase the computational complexity by naively stacking convolutional layers. Second, the features from the lower layers have a limited receptive field. The proposed feature recognizer module alleviates these issues: it is an efficient and lightweight attention mechanism. FR modules follow the conv1, conv2\(\_\)x and conv3\(\_\)x blocks, which allows lower-layer features to benefit from contextual information. With these lightweight modules, the Siamese backbone network is trained in an end-to-end manner. The overall structure of the FR, illustrated in Fig. 3, consists of channel and spatial attention branches.

Fig. 3 The proposed Feature Recognizer (FR) module. It consists of two sub-modules, i.e., the channel and spatial attention modules. It takes the outputs of the conv1, conv2\(\_\)x and conv3\(\_\)x blocks as input and extracts the corresponding channel attention features and spatial attention features. In the spatial attention module, two 3 \(\times \) 3 dilated convolutions are used consecutively to enlarge the receptive field, followed by a 1 \(\times \) 1 convolution

Channel attention module. In the channel attention module, we use global average pooling to aggregate the feature map F in each channel. To indicate the importance of each channel, we use the scaling factor \(\gamma \) of the batch normalization (BN) layer [40]. Taking the c-th channel as an example, the BN layer can be written as follows:

$$\begin{aligned} \begin{aligned}&z_{c}=\frac{BN_{c}^{in}-{\hat{\mu }}_{c}}{\sqrt{{\hat{\sigma }}_{c}^{2}+\epsilon }},\\&BN_{c}^{\text{ out }}=\gamma _{c} \cdot z_{c}+\beta _{c}, \end{aligned} \end{aligned}$$
(5)

where the subscript c indicates the c-th channel, \(\epsilon \) is a small positive value for numerical stability, and \({\hat{\mu }}_{c}\) and \({\hat{\sigma }}_{c}^{2}\) denote the mean and variance of the mini-batch, respectively. \(\beta \) is a learnable shift parameter of the BN layer that affines the normalized feature F.

On this basis, the input features are processed by global average pooling and the BN layer. To highlight the feature responses of the object and suppress the less salient (non-target) responses, we introduce a weight for the feature response of each channel, computed as

$$\begin{aligned} \begin{aligned}&W_{\gamma }=\left\{ w_1, w_2, w_3, \cdots , w_c\right\} , \\&w_i={\gamma _i} / \sum _{i=1}^c {\gamma _i}, \end{aligned} \end{aligned}$$
(6)

where \(w_i\) denotes the weight of each channel, \(\gamma _i\) is the scaling factor in BN, and c denotes the number of channels.

Lastly, the channel attention \(M_{c}(F) \in {\mathbb {R}}^{C \times 1 \times 1}\) is computed as

$$\begin{aligned} M_{c}(F)={\text {sigmoid}}(W_{\gamma }(BN({\text {Avg}} {\text {Pool}}(F)))). \end{aligned}$$
(7)
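
A minimal PyTorch sketch of the channel-attention branch (Eqs. 6 and 7) is given below, assuming the normalized BN scaling factors \(\gamma \) are used directly as per-channel weights; the class and variable names are illustrative and do not come from the released code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c(F) = sigmoid(W_gamma(BN(AvgPool(F)))), Eq. (7)."""
    def __init__(self, channels: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(channels)     # bn.weight holds the scaling factors gamma

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = feat.shape
        pooled = feat.mean(dim=(2, 3))         # global average pooling: (B, C)
        x = self.bn(pooled)
        gamma = self.bn.weight.abs()
        w = gamma / gamma.sum()                # Eq. (6): normalized channel weights
        return torch.sigmoid(x * w).view(b, c, 1, 1)   # M_c(F): (B, C, 1, 1)

ca = ChannelAttention(256)
print(ca(torch.randn(2, 256, 16, 16)).shape)   # torch.Size([2, 256, 1, 1])
```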

Spatial attention module. We exploit the spatial attention module to focus on the important spatial information of a target. It produces a spatial attention map \(M_{s}(F) \in {\mathbb {R}}^{H \times W}\) to emphasize or suppress features at different spatial locations. Contextual information enables the model to better focus on the spatial location of an object. To aggregate this contextual information efficiently, we use two 3 \(\times \) 3 dilated convolutions to enlarge the receptive field. A 1 \(\times \) 1 convolution at the end of the spatial branch then reduces the features to a \({\mathbb {R}}^{H \times W}\) spatial attention map, and a BN layer is applied for scale adjustment of the feature map. Next, to measure the importance of pixels, we also apply the BN scaling factor to the spatial dimension. The weights of the spatial attention module are computed as

$$\begin{aligned} \begin{aligned}&W_{\lambda }=\left\{ w_1, w_2, w_3, \cdots , w_i\right\} , \\&w_i = {\lambda _i} / \sum _{i=1}^{h \times w} {\lambda _i}, \end{aligned} \end{aligned}$$
(8)

where \(w_i\) denotes the weight of each pixel, \(\lambda _i\) is the scaling factor of BN, and \(h \times w\) is the number of pixels.

Finally, the spatial attention \(M_{s}(F)\) is computed as follows:

$$\begin{aligned} M_{s}(F)={\text {sigmoid}}(W_{\lambda }(BN(f_{2}^{3 \times 3}(f_{1}^{3 \times 3}f_{0}^{1 \times 1}(F))))), \end{aligned}$$
(9)

where f is a convolution operation and the superscripts denote the convolutional filter sizes. To reduce both the number of parameters and the computational overhead, we use only three convolution operations.
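
Under the same assumptions, a sketch of the spatial-attention branch (Eqs. 8 and 9) is shown below. It follows the textual description, with two dilated 3 \(\times \) 3 convolutions followed by a 1 \(\times \) 1 convolution that collapses the features to a single H \(\times \) W map; the channel-reduction ratio, dilation rate and the flattened-BN trick for the per-pixel scaling factors \(\lambda \) are our own illustrative choices.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """M_s(F) = sigmoid(W_lambda(BN(convs(F)))), Eqs. (8)-(9)."""
    def __init__(self, channels: int, spatial_size: int, reduction: int = 16, dilation: int = 4):
        super().__init__()
        mid = channels // reduction
        self.convs = nn.Sequential(
            nn.Conv2d(channels, mid, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 1),                       # collapse to a single H x W map
        )
        self.bn = nn.BatchNorm1d(spatial_size)          # bn.weight holds the scaling factors lambda

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, _, h, w = feat.shape
        m = self.convs(feat).view(b, h * w)             # flatten the spatial map
        x = self.bn(m)
        lam = self.bn.weight.abs()
        w_pix = lam / lam.sum()                         # Eq. (8): normalized pixel weights
        return torch.sigmoid(x * w_pix).view(b, 1, h, w)  # M_s(F): (B, 1, H, W)

sa = SpatialAttention(256, spatial_size=16 * 16)
print(sa(torch.randn(2, 256, 16, 16)).shape)            # torch.Size([2, 1, 16, 16])
```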

Overall structure. We adopt a residual mechanism and a logistic function to facilitate gradient flow. First, we compute the channel attention \(M_{c}(F) \in {\mathbb {R}}^{C \times 1 \times 1}\) and the spatial attention \(M_{s}(F) \in {\mathbb {R}}^{1 \times H \times W}\) as two separate modules. Since these two attention maps have different shapes, we expand them to \({\mathbb {R}}^{C \times H \times W}\). Then, we use element-wise summation to combine the channel attention map and the spatial attention map. Finally, the attention map M(F) is computed as:

$$\begin{aligned} M(F)=\sigma (M_{c}(F)+M_{s}(F)), \end{aligned}$$
(10)

where \(\sigma \) is a sigmoid function. For the given input feature map \(F \in {\mathbb {R}}^{C \times H \times W}\), based on the channel and spatial attention modules, a 3D attention map \(M(F) \in {\mathbb {R}}^{C \times H \times W}\) is generated. The final output feature \(F^{\prime }\) is computed as

$$\begin{aligned} F^{\prime }=F+F \otimes M(F), \end{aligned}$$
(11)

where \(\otimes \) denotes element-wise multiplication.
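
Combining the two branches as in Eqs. (10) and (11) then gives the full FR module; this sketch reuses the ChannelAttention and SpatialAttention classes above, broadcasting the two maps to C \(\times \) H \(\times \) W before the summation, outer sigmoid and residual connection.

```python
import torch
import torch.nn as nn

class FeatureRecognizer(nn.Module):
    """F' = F + F * sigmoid(M_c(F) + M_s(F)), Eqs. (10)-(11)."""
    def __init__(self, channels: int, spatial_size: int):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention(channels, spatial_size)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Broadcasting expands (B, C, 1, 1) and (B, 1, H, W) to (B, C, H, W)
        m = torch.sigmoid(self.channel_att(feat) + self.spatial_att(feat))   # Eq. (10)
        return feat + feat * m                                               # Eq. (11)

fr = FeatureRecognizer(256, spatial_size=16 * 16)
print(fr(torch.randn(2, 256, 16, 16)).shape)    # torch.Size([2, 256, 16, 16])
```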

Fig. 4 The architecture of the cross-attention module. It consists of the template feature branch and the encoded feature branch. L cross-attention modules are stacked to increase the frequency of fusion across the two branches. The two branches exchange information with each other through cross-attention to generate high-quality image features

To suppress less salient features and highlight the target features and locations, we add a regularization term to the loss function as follows:

$$\begin{aligned} Loss=\sum _{(F, F^{\prime })} l(f(F, W), F^{\prime })+\xi \sum g(\gamma )+\xi \sum g(\lambda ), \end{aligned}$$
(12)

where F and \(F^{\prime }\) denote the input and output, respectively; W represents the FR module weights; \(l(\cdot )\) is the task loss; \(g(\cdot )\) is the \(l_{1}\)-norm penalty function; and \(\xi \) is the penalty weight that balances \(g(\gamma )\) and \(g(\lambda )\). \(\gamma \) and \(\lambda \) are the scaling factors of the channel attention module and the spatial attention module, respectively. We jointly train the network weights and these scaling factors with the \(l_{1}\) regularization imposed on the scaling factors.
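
A hedged sketch of how Eq. (12) can be assembled in training code is shown below: the \(l_{1}\) penalty is taken over the BN scaling factors of all FR modules and added to the task loss. The function name, the value of \(\xi \) and the assumption that each attention branch exposes its BN layer as .bn follow the sketches above and are illustrative only.

```python
import torch

def fr_regularized_loss(task_loss: torch.Tensor, fr_modules, xi: float = 1e-4) -> torch.Tensor:
    """Eq. (12): task loss plus an l1 penalty on the gamma and lambda scaling factors."""
    reg = torch.zeros((), device=task_loss.device)
    for fr in fr_modules:
        reg = reg + fr.channel_att.bn.weight.abs().sum()   # g(gamma)
        reg = reg + fr.spatial_att.bn.weight.abs().sum()   # g(lambda)
    return task_loss + xi * reg
```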

Multi-scale feature fusion network

In this section, we introduce multi-scale feature representation learning in the Transformer model for object tracking. We propose a simple and effective cross-attention-based multi-scale feature fusion network that produces robust image features. Specifically, to fuse multi-scale features more efficiently, we first project the encoded feature branch and the template feature branch into the same feature space with aligned dimensions. The encoded feature branch is then used as the query, and the template feature branch exchanges information with it through cross-attention over multiple fusion steps. The encoded feature branch learns to abstract information in the Transformer encoder and interacts with the template feature branch to combine features at different scales.

An illustration of the cross-attention operation in a multi-scale feature fusion network is shown in Fig. 4.

\(F_{encoded} \in {\mathbb {R}}^{n \times C \times H \times W}\) denotes the input to the encoded feature branch, and \(T_{i} \in {\mathbb {R}}^{n \times C \times H \times W}\) denotes the input to the template feature branch; the template inputs are concatenated to form the template feature ensemble \(T={\text {Concat}}\left( T_{1}, \ldots , T_{n}\right) \). Specifically, we adopt a projection function to map the features of both branches into the same feature space as follows:

$$\begin{aligned} \begin{aligned}&Q=f^{l}\left( F_{\text{ encoded } }\right) , \\&K=f^{l}\left( \text{ Concat } \left( T_{1}, \cdots , T_{n}\right) \right) , \\&V=f^{l}\left( \text{ Concat } \left( T_{1}, \ldots , T_{n}\right) \right) \otimes M, \end{aligned} \end{aligned}$$
(13)

where \(f^{l}(\cdot )\) is the projection function for dimension alignment and \(\otimes \) is broadcasting element-wise multiplication. \(Q, K, V \in {\mathbb {R}}^{n \times C \times H \times W}\) are the resulting query, key and value features, and M is a mask ensemble. To reduce the interference of similar targets, we construct Gaussian-shaped masks over the template features through \(m(y)=\exp \left( -\frac{\Vert y-c\Vert ^{2}}{2 \sigma ^{2}}\right) \), where c is the ground-truth target position. We then concatenate the masks \(m_{i} \in {\mathbb {R}}^{H \times W}\) to obtain the mask ensemble \(M={\text {Concat}}\left( m_{1}, \cdots , m_{n}\right) \in {\mathbb {R}}^{n \times H \times W}\).
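
The Gaussian-shaped template masks used in Eq. (13) can be generated as in the short sketch below, where the feature-grid size, the value of \(\sigma \) and the function name are illustrative; c is the ground-truth target position mapped onto the feature grid.

```python
import torch

def gaussian_mask(h: int, w: int, center: tuple, sigma: float = 2.0) -> torch.Tensor:
    """m(y) = exp(-||y - c||^2 / (2 * sigma^2)) on an h x w feature grid."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cy, cx = center
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))                  # peak value 1 at the target

# Mask ensemble M = Concat(m_1, ..., m_n), here for n = 2 templates
M = torch.stack([gaussian_mask(16, 16, (8, 8)), gaussian_mask(16, 16, (5, 10))])
print(M.shape)                                                # torch.Size([2, 16, 16])
```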

As shown in Fig. 4, cross-attention is performed between Q and K. The attention map (AM) generated in cross-attention is computed as follows:

$$\begin{aligned} AM={\text {Softmax}}\left( Q \otimes K^{\top } / \sqrt{C / h}\right) , \end{aligned}$$
(14)

where C and h are the embedding dimension and the number of heads. After performing the cross-attention \(C A=A M \otimes V\), we propagate the mask ensemble from the template feature branch to the encoded feature branch. In addition, as in the Transformer, we use multiple heads in cross-attention, denoted Multi-head Cross-Attention (MCA). MCA enables the fusion network to attend to multiple parts of the input features simultaneously, allowing it to capture different types of information and dependencies within the input and leading to better tracking performance. MCA also increases the frequency of fusion across the template feature branch and the encoded feature branch. Finally, the output of the multi-scale feature fusion (MSF) network, with layer normalization and a residual structure, is computed as follows:

$$\begin{aligned} MSF=g^{l}\left[ f^{l}\left( F_{\text{encoded}}\right) +MCA({\text {LN}}(AM \otimes V))\right] , \end{aligned}$$
(15)

where \(g^{l}(\cdot )\) is the back-projection function for the dimension alignment. LN denotes layer normalization. The final output feature is reshaped to the original size as \(MSF \in {\mathbb {R}}^{n \times C \times H \times W}\).
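
To make the data flow of Eqs. (13)-(15) concrete, the sketch below implements one cross-attention fusion layer with the encoded features as the query and the masked template ensemble as key/value; the flattening of the C \(\times \) H \(\times \) W features into token sequences, the use of nn.MultiheadAttention for the MCA step and the layer names are our own simplifications of the block in Fig. 4, not the exact released implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """One MSF layer: encoded features query the mask-weighted template ensemble."""
    def __init__(self, channels: int, d_model: int = 512, n_heads: int = 4):
        super().__init__()
        self.f_proj = nn.Linear(channels, d_model)       # f^l: dimension alignment
        self.g_proj = nn.Linear(d_model, channels)       # g^l: back-projection
        self.norm = nn.LayerNorm(d_model)                # LN
        self.mca = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, f_encoded, templates, masks):
        # f_encoded: (B, N, C) encoded tokens; templates: (B, M, C); masks: (B, M, 1)
        q = self.f_proj(f_encoded)
        k = self.f_proj(templates)
        v = self.f_proj(templates * masks)               # mask ensemble applied to V, Eq. (13)
        attn_out, _ = self.mca(q, self.norm(k), self.norm(v))   # cross-attention, Eqs. (14)-(15)
        return self.g_proj(q + attn_out)                 # residual connection and back-projection

fusion = CrossAttentionFusion(channels=256)
enc = torch.randn(2, 256, 256)                           # e.g. 16 x 16 encoded tokens
tmp = torch.randn(2, 2 * 64, 256)                        # n = 2 templates of 8 x 8 tokens
msk = torch.rand(2, 2 * 64, 1)
print(fusion(enc, tmp, msk).shape)                       # torch.Size([2, 256, 256])
```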

Experiments

In this section, we conduct extensive experiments on six challenging benchmarks including LaSOT [7], GOT-10k [8], OTB-100 [9], TrackingNet [41], UAV123 [10] and VOT2018 [11]. We also compare the proposed tracker with several state-of-the-art trackers on three small-scale datasets in terms of inference speed, as shown in Table 1. To further validate the effectiveness of the proposed Siamese backbone network and multi-scale fusion network, we conduct an ablation study on GOT-10k and UAV123.

Table 1 Comparison with the SOTA trackers on the NFS30 [42], UAV123 [10] and OTB100 [9] datasets in terms of AUC score and the inference speed
Table 2 Ablation study on GOT-10k [8] in terms of Average Overlap (AO)
Table 3 The ablation study on UAV123 [10] in terms of precision (Prec.) and success (AUC)

Implementation details

The proposed SiamFMT is implemented in PyTorch and executed on an Intel(R) Core(TM) i5-10400 CPU @ 2.90 GHz with 16 GB memory and an NVIDIA GTX-1080Ti GPU. We utilize the training splits of LaSOT [7], TrackingNet [41] and GOT-10k [8] for offline training. We apply transformations to the original images to generate image pairs, and common data augmentation (such as translation and brightness jitter) is applied to enlarge the training sets. We set the center jitter factor and the scale jitter factor to 3 and 0.25, respectively. The sizes of the input template and search patches are 128\(\times \)128 and 256\(\times \)256, respectively. Our framework is trained for 50 epochs with 3,571 iterations per epoch, and the batch size is set to 14. We train the model with the ADAM optimizer, setting the initial learning rate to \(1 \textrm{e}\)-3 with a decay factor of 0.2 every 15 epochs. The proposed SiamFMT achieves competitive tracking performance against SOTA trackers.
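
For reference, the training schedule described above maps onto the following PyTorch sketch (ADAM, initial learning rate 1e-3, decay factor 0.2 every 15 epochs, 50 epochs); the model and the data loading are placeholders, not part of the actual training script.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)    # placeholder for the SiamFMT network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.2)

for epoch in range(50):
    # ... one epoch: 3,571 iterations with batch size 14, forward/backward/optimizer step ...
    scheduler.step()       # learning rate is multiplied by 0.2 at epochs 15, 30 and 45
```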

Ablation study and analysis

To verify the effectiveness of the designed Feature Recognizer module and the Multi-Scale feature Fusion network, we conduct an ablation study on the GOT-10k and UAV123 benchmarks.

We use LaSOT [7], GOT-10k [8] and TrackingNet [41] as training sets on a single NVIDIA 1080Ti GPU to train TrDiMP as the baseline. To further validate the generalization of SiamFMT, we choose the GOT-10k and UAV123 test sets to evaluate the proposed tracker. In GOT-10k, there is no overlap of object classes between the training and test sets.

Backbone architecture. We embed the Feature Recognizer module in ResNet50 [5] to constitute a Siamese backbone network for feature extraction. As shown in Tables 2 and 3, we conduct ablation experiments on the GOT-10k and UAV123 test sets, respectively. Compared with the baseline, our method improves the average overlap (AO) by 0.6%. The precision (Prec.) is improved by 1% from 0.853 to 0.863, and the success (AUC) is improved by 0.5% from 0.643 to 0.648. The experimental results show that the proposed method has a positive effect on the tracking results.

Feature fusion network. To show the superiority of the multi-scale feature fusion network (MSF), we add the MSF to the baseline without feature recognizer modules and keep the other components unchanged. The MSF performs multiple fusions of template features and encoded features to combine features at different scales. Compared with the traditional cross-correlation method, our tracker focuses more on target edge information and obtains better robustness. In Table 3, comparing the values in the second and fourth rows, the precision and the success are improved by 1.3% and 1%, respectively, with the same backbone. It is worth noting that, as shown in Table 2, the average overlap (AO) is improved by 1.7% from 66.2% to 67.9% while keeping the other components constant. Meanwhile, our method improves by 0.8% over TrDiMP [31].

Fig. 5 Visualization of tracking results without (second column) or with (third column) the designed Feature Recognizer and Multi-Scale feature Fusion network. The proposed SiamFMT significantly reduces the impact of background complexity, scale changes and similar targets

Overall structure. Finally, we add both the Feature Recognizer module and the multi-scale feature fusion network to the baseline. It is worth pointing out that TrDiMP [31] already achieves outstanding results, while our approach consistently improves on this strong baseline. As shown in Table 2, compared with the baseline, the method with the Feature Recognizer and the multi-scale feature fusion network brings a 2.2% gain in average overlap (AO). Compared with TrDiMP, our approach also achieves a 1.3% improvement in AO. Comparing the second and fourth rows in Table 3, the precision (Prec.) is improved by 3.2% from 0.853 to 0.885, and the success (AUC) increases by 2.4% from 0.643 to 0.667. The results further show that our tracker improves over TrDiMP by 1.9%. These gains come from the designed FR module and MSF.

Table 4 The performance of our method on the test split of GOT-10k when setting the number of cross-attention heads to 2, 4, 6 and 8

As shown in Fig. 5, we visualize some tracking results. As can be seen in the second and third columns, our tracker highlights the locations of the targets well while suppressing background and similar-target information under these distracting factors.

The number of cross-attention heads. In our method, multi-head cross-attention fuses features at different scales and captures the dependencies between different features, so the number of cross-attention heads is important. As shown in Table 4, we list the performance for different numbers of cross-attention heads. The tracking performance gradually improves as the number of heads increases. However, when the number of cross-attention heads exceeds 6, the performance drops. We argue that excess cross-attention heads may lead to model overfitting. In addition, we observe that increasing the number of heads improves tracking accuracy but decreases tracking speed. Therefore, to better balance tracking performance and speed, we set the number of cross-attention heads to 4.

Fig. 6 Comparison with nine state-of-the-art trackers on UAV123 [10]

Fig. 7 AUC scores of different attributes on UAV123 [10]

Evaluation on UAV123

UAV123 [10] is an aerial video dataset consisting of 123 low-altitude aerial video sequences. Different from other tracking benchmarks, the tracked targets in UAV123 are small because of the aerial viewpoint, which makes UAV123 very challenging for trackers. We compare the proposed tracker with nine state-of-the-art real-time trackers, including TrDiMP [31], TransT [32], DiMP50 [45], SiamAttn [28], SiamGAT [4], SiamBAN [18], CLNet [49], STMTrack [50] and SiamCAR [20]. The comparison in terms of precision and success of OPE is shown in Fig. 6. Our tracker reaches a success score of 66.7% and a precision of 88.5%, outperforming the recently proposed TrDiMP [31] by 1.9% in precision and TransT [32] by 0.7% in success.

Figure 7 reports the attribute-based evaluation of the proposed SiamFMT and nine representative state-of-the-art tracking algorithms. The proposed SiamFMT ranks first on the attributes of aspect ratio change, background clutter, camera motion, full occlusion, similar object and scale variation. The results demonstrate that our tracker is robust to complicated appearance variations.

Evaluation on OTB-100

OTB-100 is one of the most classic benchmarks for visual tracking. It consists of 98 video sequences with 11 interference attributes. These attributes include background clutter (BC), low resolution (LR), out-of-view (OV), illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR) and out-of-plane rotation (OPR).

Fig. 8 Comparison with nine state-of-the-art trackers on OTB100 [9] in terms of precision and success of OPE

The comparison with state-of-the-art trackers is shown in Fig. 8 in terms of the success and precision plots of OPE. Our tracker reaches a success score of 69.3% and a precision of 91.2%, surpassing many state-of-the-art trackers. In particular, our tracker significantly improves tracking success and precision under background clutter (BC), occlusion (OCC), out-of-plane rotation (OPR) and out-of-view (OV), which benefits from our multi-scale feature fusion network. As shown in Fig. 9, the proposed tracker ranks first on these challenging attributes.

Fig. 9 Comparison on OTB-100 [9] in terms of challenging aspects: background clutter (BC), occlusion (OCC), out-of-plane rotation (OPR) and out-of-view (OV). Our tracker achieves the best results on all these aspects

Evaluation on VOT2018

VOT2018 is a widely used benchmark for visual tracking and contains 60 video sequences. It evaluates the tracking performance on three metrics including accuracy (A), robustness (R) and expected average overlap (EAO).

As shown in Fig. 10, the proposed tracking algorithm is compared with nine state-of-the-art methods including ATOM [43], LADCF [51], SiamRPN [3], SiamMask [52], UPDT [53], RCO [54], DeepSTRCF [55], CPT [56] and SA-Siam-R [57]. Experimental results demonstrate that the proposed tracker achieves the top EAO score on VOT2018. Compared with the recent trackers SiamMask [52] and ATOM [43], our method improves EAO by 4.3% and 2.3%, respectively.

In Table 5, our tracker achieves 0.617 accuracy, 0.192 robustness and 0.424 EAO on VOT2018. We further compare the proposed SiamFMT in terms of accuracy, robustness and EAO against SOTA trackers including ATOM [43], DRT [58], DeepSTRCF [55], CPT [56], SA-Siam-R [57], LSART [59], ECO [60], CCOT [61] and SiamFC [2]. Compared with these trackers, the proposed tracking algorithm achieves superior tracking performance.

Fig. 10 Expected average overlap (EAO) against SOTA trackers. The proposed tracking algorithm achieves competitive tracking performance on VOT2018

Table 5 Comparison with state-of-the-art trackers on VOT2018

Evaluation on GOT-10k

GOT-10k [8] is a challenging large-scale dataset that consists of more than 10,000 videos. There is no overlap between the object classes of the training and testing sets, so GOT-10k is usually used to evaluate the generalization ability of a tracker. Following the GOT-10k protocol, we train the proposed model on the given training set and test the proposed SiamFMT on the given test set. After uploading the tracking results to the official website, the corresponding results in average overlap (AO) and success rate (\(SR_{0.50}\) and \(SR_{0.75}\)) are obtained.

Fig. 11 Success plots on GOT-10k [8]. Our tracker achieves excellent tracking results compared with some state-of-the-art trackers

As can be seen from Fig. 11, the proposed SiamFMT outperforms many SOTA trackers in success. As shown in Table 6, we evaluate our tracker on GOT-10k and compare it with state-of-the-art trackers including TrDiMP [31], SBT [36], STMTrack [50], KYS [62], PrDiMP [63], SiamGAT [4], DiMP50 [45], D3S [64], SiamFC++ [19], Ocean-offline [65], SiamCAR [20], SiamRPN++ [44], ATOM [43], SPM [66], SiamMask-EU [52], SiamRPN [3] and SiamFC [2]. The proposed tracker delivers excellent performance in terms of average overlap (AO) and success rates (\(SR_{0.50}\) and \(SR_{0.75}\)). In particular, the proposed SiamFMT achieves the second-best performance in \(SR_{0.75}\), behind SBT-small [36], and outperforms the other excellent trackers. Our tracker is 1.6% and 2.5% higher than SBT-small in AO and \(SR_{0.50}\), respectively. In addition, our method is 4.8% and 4.7% higher than KYS in AO and \(SR_{0.50}\), respectively. Compared with SiamGAT, the proposed tracker is 5.7%, 5.5% and 10.2% higher in terms of AO, \(SR_{0.50}\) and \(SR_{0.75}\), respectively. These results demonstrate that our tracker generalizes well.

Table 6 Comparison with state-of-the-art trackers on GOT-10k

Evaluation on LaSOT

LaSOT [7] is a large-scale, densely annotated and challenging single-object tracking dataset. It contains a training set of 1400 sequences and a testing set of 280 sequences. With an average length of over 2500 frames per sequence, LaSOT is more challenging than previous short-term tracking datasets and is used to evaluate a tracker's ability to re-detect a target and to track over the long term. We use the one-pass evaluation, including success rate, precision and normalized precision, to compare different tracking algorithms on the LaSOT testing set, including TrSiam [31], STMTrack [50], SiamGAT [4], Ocean-online [65], SiamBAN [18], CLNet [49], SiamFC++ [19], SiamCAR [20], GlobalTrack [67], ATOM [43], SiamRPN++ [44], D3S [64], DiMP50 [45] and SiamFC [2]. From Fig. 12, we can see that the proposed tracking algorithm achieves superior tracking results against these state-of-the-art trackers. Compared with the recently proposed TrSiam [31], STMTrack [50] and SiamGAT [4], the proposed SiamFMT improves the AUC score by 0.4%, 2.2% and 8.9%, respectively.

Fig. 12 Comparison with state-of-the-art trackers on LaSOT [7] in terms of success rate of OPE. Our tracker achieves superior tracking performance

We also report the success (Succ.), precision (Prec.) and normalized precision (N.Prec.) in Table 7, where the proposed SiamFMT obtains the best tracking results. The proposed SiamFMT outperforms SBT-small by 1.7% in Succ. and 0.6% in Prec. Meanwhile, compared with the Transformer-based tracker TrSiam, the proposed tracker is 4.4% higher in Prec. and 0.7% higher in N.Prec. The results indicate that the proposed tracker is competitive for long-term tracking tasks.

Table 7 Comparison with state-of-the-art trackers on LaSOT [7] in terms of precision (Prec.) and normalized precision (N.Prec.)
Table 8 Comparison with state-of-the-art trackers on TrackingNet in terms of Success (Succ.), precision (Prec.) and normalized precision (N.Prec.)

Evaluation on TrackingNet

TrackingNet is a large-scale dataset whose test set contains 511 video sequences covering different object classes and complex tracking scenarios. We submit the raw tracking results to the online evaluation server to obtain the tracking metrics shown in Table 8. We compare our tracker with state-of-the-art trackers such as STMTrack [50], DualTRF [35], UTT [68], TrDiMP [31], E.T.Track [47] and TrTr [69]. The proposed tracker achieves tracking results that are on par with the current state of the art. Among all the compared trackers, only STMTrack [50] and DualTRF [35] have a slightly higher N.Prec. than our tracker. However, in terms of Succ. and Prec., our tracker obtains 80.4% and 79.8%, respectively, outperforming other excellent Transformer-based trackers such as UTT [68], E.T.Track [47] and TrTr [69].

Conclusion

In this paper, we propose an effective and lightweight tracking framework. The framework includes two main parts: a Siamese backbone architecture based on hierarchical attention, and a multi-scale feature fusion (MSF) network based on cross-attention. The hierarchical attention, consisting of channel and spatial branches, is built on the designed feature recognizers to emphasize important feature elements. The multi-scale feature fusion network fuses template features and encoded features through cross-attention and allows the tracker to adapt to changes in target scale. The MSF bridges the template and search branches and provides stronger encoded features for the subsequent decoder. The ablation study and experiments show that the proposed SiamFMT is robust to cluttered backgrounds, scale variation and similar targets. On several mainstream benchmarks such as OTB100, GOT-10k and UAV123, the proposed tracker performs significantly better than some state-of-the-art trackers.