Abstract
Convolutional neural networks (CNNs) have been the dominant architectures for feature extraction tasks, but CNNs do not look for and focus on specific image features. Correlation operations play an important role in visual tracking. However, the correlation operation retains a large amount of unfavorable background information. In this paper, we propose an effective feature recognizer including channel and spatial attention modules to focus on important object feature information. Thus, the representation power of the feature extraction network is improved. Further, we design a multi-scale feature fusion network. The fusion network performs feature fusion on the template feature and encoded feature branches to establish connections between features at different scales. Experiments on six benchmarks demonstrate that the proposed tracker outperforms state-of-the-art trackers. In particular, the proposed tracker achieves an 80.4% AUC on TrackingNet and a 68.4% AUC on GOT-10k while running at a real-time speed.
Introduction
Visual tracking is a fundamental research task in computer vision, which aims to estimate the state of a target in each frame of a video sequence. It has extensive practical applications, such as intelligent driving, human-computer interaction and video surveillance. Although significant progress has been achieved in recent years, visual tracking is still an open issue due to challenging factors such as out-of-plane deformation, illumination variation and motion blur.
Deep convolutional neural networks (CNNs) have superior performance in feature learning. Based on the strength of CNN features, Siamese-based trackers such as SiamCDA [1], SiamFC [2], SiamRPN [3] and SiamGAT [4] have been proposed and achieve state-of-the-art tracking performance. First, these trackers extract the corresponding features in the template and search branches and obtain feature maps. Then, they use cross-correlation to compute the similarity. The Siamese backbone network and the cross-correlation operation play critical roles in Siamese-based trackers. Despite great progress in tracking performance, there are still some disadvantages: (1) In traditional CNNs, the features of input images are extracted by a backbone network with convolutional kernels of fixed sizes. When the scale of the template target changes drastically, the template features may contain some background information or miss some foreground information, leading to drift during tracking. (2) The correlation operation fuses features linearly when computing the similarity between the template and a search region. Therefore, it easily loses semantic information and falls into a local optimum. Further, the complicated non-linear interaction between the template and search branches is not captured by a correlation operation.
To address the above issues, as shown in Fig. 1, we first modify ResNet50 [5] by adding the Feature Recognizer (FR) after the conv1, conv2\(\_\)x and conv3\(\_\)x blocks in both the template and search branches. The FR generates a 3D attention map to focus on where and what the important elements are, and dynamically adjusts the weights of target features. The resulting target features are then used for the subsequent feature fusion and tracking prediction. Inspired by Vision Transformers (ViT) for image classification [6], we propose a novel tracking algorithm based on a multi-scale feature fusion network (MSF) in a Transformer.
In the template branch, the feature sets of template patches are fed to the Transformer encoder as input. Then, MSF combines the template features and the corresponding encoded features at different sizes. In the search branch, the features of search patches and the encoded features are fed to the Transformer decoder, and the score maps are then obtained for locating the targets. We have evaluated the proposed SiamFMT algorithm on six benchmarks, including LaSOT [7], GOT-10k [8], OTB-100 [9], TrackingNet [41], UAV123 [10] and VOT2018 [11]. The proposed tracking algorithm achieves superior tracking performance. The main contributions are summarized as follows:
- We propose a feature recognizer (FR) module to construct hierarchical feature extraction networks by placing the module after different convolutional blocks. The FR focuses on the location and content of important elements and obtains robust object features by dynamically adjusting the object feature weights.
- We propose a multi-scale feature fusion network based on cross-attention to enhance the feature representation ability. Compared with the cross-correlation based method, our method improves the non-linear interaction between the template and search branches, and establishes the association among features at different scales.
- Extensive experiments on six challenging benchmarks demonstrate that the proposed tracker outperforms many state-of-the-art trackers. In particular, it achieves leading tracking performance on the large-scale datasets TrackingNet and GOT-10k as well as on UAV123.
Related work
In this section, we briefly review some related methods and techniques including Siamese network-based visual tracking, attention mechanism and Transformer for visual tracking.
Siamese network-based visual tracking
In recent years, Siamese network-based trackers have drawn a lot of attention due to their balanced accuracy and speed [12, 13]. SiamFC, a pioneering work, adopts fully convolutional Siamese networks for feature extraction, and utilizes a cross-correlation layer to combine feature maps from the template and search branches. The cross-correlation layer performs convolution operations with template features on the search region to obtain response maps. Based on SiamFC, DSiam [14] learns the target appearance variation via an online transformation learning model. SA-Siam [15] utilizes Siamese networks to train a semantic branch and an appearance branch. The similarities on semantic features and appearance features are computed separately. Then, the final response map is obtained by combining the semantic similarity and the appearance similarity. However, these tracking methods require multi-scale testing to cope with variations in target appearance.
To get more accurate tracking results, Li et al. first apply the region proposal network (RPN) to the tracking task and propose the Siamese region proposal network-based tracker (SiamRPN) [3]. In SiamRPN, Siamese networks are followed by two subnetworks, i.e., a classification branch and a regression branch. The classification branch is used to discriminate the target from the surrounding background, and the regression branch refines the output box. Based on SiamRPN, Zhu et al. [16] investigate accurate and long-term tracking with a distractor-aware module. Fan et al. [17] propose to cascade a set of RPNs (C-RPN) from deep high-level layers to shallow low-level layers in Siamese networks. The discriminability of C-RPN is further improved by feature transfer blocks that make full use of multi-level features for each RPN, while exploiting the high-level semantic and low-level spatial information.
Apart from deepening the Siamese networks, researchers propose some anchor-free trackers, such as SiamBAN [18], SiamFC++ [19] and SiamCAR [20] to eliminate the negative effects of anchors. These anchor-free trackers treat the tracking task as a joint classification and regression problem. The trackers use one or more prediction heads to predict target locations and regress bounding boxes from the response maps in a pixel-by-pixel prediction manner. Guo et al. find that traditional cross-correlation operations retain a large amount of background information, which may misclassify target features. To solve this issue, they propose a target-aware Siamese Graph Attention network for general object tracking (SiamGAT) [4]. SiamGAT uses a bipartite graph-based feature search mechanism to match the template features and search image features.
Attention mechanism
Attention mechanisms are introduced into computer vision for the dynamic adjustment of feature weights. Hu et al. [21] propose SENet, which pioneered channel attention. SE blocks consist of a squeeze module and an excitation module. The squeeze module collects global spatial information and the excitation module captures channel-wise relationships to improve the representation ability of the network. Park et al. [22] propose a simple and light-weight attention module that is placed at the bottleneck of CNNs. Efficient attention maps are generated by learning channel and spatial attention, which improve the representational power of the network and reduce the computational cost. Yang et al. [23] propose a gated channel transformation (GCT). Unlike previous methods, GCT collects global information by calculating the \(L_2\) norm of each channel. It is also lightweight and can be added to each convolutional layer of CNNs.
Attention mechanisms are also successfully used in visual tracking. Fan et al. [24] propose a discriminative spatial attention for short-term visual tracking. Choi et al. [25] introduce an attention mechanism to correlation filter networks for object tracking. CSR-DCF [26] introduces the concept of channel and spatial reliability to discriminative correlation filters. Wang et al. [27] propose a Residual Attentional Siamese Network (RASNet) for object tracking. RASNet combines general attention, residual attention and channel attention. It not only mitigates the overfitting problem in deep network training, but also improves the discriminative capability and adaptability of the network. Yu et al. [28] propose a Deformable Siamese Attention Network (SiamAttn). SiamAttn learns context information through spatial attention and cross-attention, and aggregates rich contextual correlations between the template and search branches. To better exploit the feature extraction capability of Siamese networks, we add the Feature Recognizer (FR) to traditional CNNs to improve the feature attention potential of the backbone network. More details are presented in Sect. 3.
Transformer for Visual Tracking
Vaswani et al. [29] first propose the Transformer based on the self-attention mechanism. Benefiting from its high representation ability, the Transformer has been applied to visual tracking [30]. Wang et al. [31] introduce the Transformer to object tracking, and present a novel Transformer-assisted tracking framework (TrDiMP). To better suit the tracking task, TrDiMP includes encoder and decoder branches. The Transformer encoder is used to generate a high-quality tracking model and the Transformer decoder searches for the target.
Chen et al. [32] propose a feature fusion network based on a self-attention module and a cross-feature module instead of the traditional correlation operation. The ego-context augment (ECA) module is used to enhance the contextual information of the input. The cross-feature augment (CFA) module is used to adaptively fuse features from both branches. To improve the localisation accuracy of the tracker in complex scenes and enhance performance in transformer vision tasks, Cao et al. [33] propose an efficient hierarchical feature transformer (HiFT). HiFT feeds the similarity graphs generated by the multilayer convolutional layers into the feature transformer, and achieves an interactive fusion of space and semantic information.
In contrast to the traditional Transformer-assisted tracking framework, Lin et al. [34] propose a fully attention-based Transformer tracker (SwinTrack). SwinTrack uses the Transformer for both feature extraction and feature fusion. SwinTrack, which consists of a backbone network and a feature fusion network, introduces IoU-aware classification scores into the prediction branch to select more accurate bounding box predictions. Xie et al. [35] propose a Siamese-like dual-branch network (DualTRF). Each branch of DualTRF consists of local attention blocks and global attention blocks, and cross-attention blocks are used to fuse features between the template and search branches. Subsequently, to make the tracking model more flexible, Xie et al. [36] propose a single-branch Transformer for tracking (SBT) based on DualTRF. SBT embeds cross-image feature associations in multiple layers of the feature network, which can suppress non-target features and achieve instantiated feature extraction. In addition, SBT is the first work to propose a specialized target-dependent feature network for VOT. Cui et al. [30] propose a tracking framework based on a mixed attention module (MixFormer). MixFormer constructs a feature extraction network by simply stacking multiple mixed attention modules. It can extract target-specific discriminative features and communicate extensively between the target and the search region, resulting in highly efficient tracking.
Method
In this section, we describe the proposed SiamFMT framework. As shown in Fig. 1, SiamFMT consists of a Siamese backbone network, a multi-scale feature fusion network and prediction heads. The Siamese backbone network extracts the features of the template image and the search images with shared weights. Then, the proposed feature fusion network propagates a large amount of information from the target template to the search regions.
Overview of Siamese tracking framework
Before describing the proposed tracking algorithm in detail, we briefly review recent popular tracking methods. As shown in Fig. 2, a Siamese network-based tracker consists of backbone network, feature fusion network and prediction head. In particular, the mainstream feature fusion methods mainly include the correlation operation based and Transformer-based networks.
The Siamese network architecture [2] takes the template z and the search region x as inputs and extracts the corresponding features with weight-sharing CNNs. Finally, the feature maps are generated by the correlation operator as follows:
\[ f(z, x) = \varphi (z) * \varphi (x) + b \cdot 1 \]
where \(*\), \(\varphi (\cdot )\) and \(b\cdot 1\) denote the correlation operator, convolutional operations and bias term, respectively.
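As a rough illustration of the correlation step described here, the following NumPy sketch slides a template feature over a search feature and adds a scalar bias. The function name `cross_correlate`, the toy shapes and the zero-filled search feature are assumptions made for illustration; they are not taken from the SiamFC or SiamFMT implementations.

```python
import numpy as np

def cross_correlate(template_feat, search_feat, bias=0.0):
    """Correlate a (C, h, w) template feature with a (C, H, W) search feature.

    Returns a response map of shape (H - h + 1, W - w + 1).
    """
    channels, h, w = template_feat.shape
    assert search_feat.shape[0] == channels, "channel counts must match"
    out_h = search_feat.shape[1] - h + 1
    out_w = search_feat.shape[2] - w + 1
    response = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Inner product between the template and the current window.
            window = search_feat[:, i:i + h, j:j + w]
            response[i, j] = np.sum(window * template_feat) + bias
    return response

# Toy data (hypothetical): a strictly positive template embedded in an
# otherwise empty search feature at row 3, column 5.
rng = np.random.default_rng(0)
template = rng.uniform(0.5, 1.5, size=(4, 4, 4))
search = np.zeros((4, 12, 12))
search[:, 3:7, 5:9] = template

response = cross_correlate(template, search)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
```

In this toy setting the response peaks where the template was embedded. Every entry of the response map is a linear function of the search feature, which is the limitation the fusion network is meant to address.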
The basic framework in Transformer-based tracking uses a Transformer instead of the original correlation operation for feature fusion. Both the template and search images are fed into a CNN backbone for feature extraction. After that, the features from the two branches are sent to two parallel branches of a Siamese-like network consisting of the Transformer encoder and decoder. The core component of the Transformer is self-attention. The attention function is the scaled dot-product attention:
\[ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \]
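where Q, K, and V are the query, key and value matrices, and \(d_{k}\) is the dimension of the key. As described in [37], linear projections and multi-head attention are introduced into the attention module so that the mechanism can focus on different aspects of information. The multi-head variant is defined as follows:
\[ \text{MultiHead}(Q, K, V) = \text{Concat}\left( H_{1}, \ldots , H_{n_{h}}\right) W^{O}, \quad H_{i} = \text{Attention}\left( Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \]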
where \(W_{i}^{Q} \in {\mathbb {R}}^{d_{\text{ model }} \times d_{k}}\), \(W_{i}^{K} \in {\mathbb {R}}^{d_{\text{ model }} \times d_{k}}\), \(W_{i}^{V} \in {\mathbb {R}}^{d_{\text{ model }} \times d_{v}}\), and \(W^{O} \in {\mathbb {R}}^{n_{h} d_{v} \times d_{\text{ model }}}\) are parameter matrices, and \(n_{h}\) is the number of heads. In this work, we set \(n_{h}\), \(d_{\text{ model }}\), \(d_{v}\) and \(d_{k}\) to 8, 512, 64 and 64, respectively.
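The attention computation with the stated sizes (8 heads, model width 512, key and value width 64) can be sketched in NumPy as follows. This is a minimal sketch of standard scaled dot-product and multi-head attention; the random weights, the token counts (10 queries, 20 keys) and the function names are placeholders chosen for illustration, not the trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention over the last two axes."""
    d_k = k.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v

def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """q: (Lq, d_model); k, v: (Lk, d_model).

    w_q, w_k: (n_h, d_model, d_k); w_v: (n_h, d_model, d_v);
    w_o: (n_h * d_v, d_model).
    """
    # Broadcasting projects the inputs once per head: (n_h, L, d_k or d_v).
    heads = attention(q @ w_q, k @ w_k, v @ w_v)
    n_h, length, d_v = heads.shape
    # Concatenate the heads along the feature axis, then project back.
    concat = heads.transpose(1, 0, 2).reshape(length, n_h * d_v)
    return concat @ w_o

n_h, d_model, d_k, d_v = 8, 512, 64, 64
rng = np.random.default_rng(0)
w_q = rng.standard_normal((n_h, d_model, d_k)) * 0.02
w_k = rng.standard_normal((n_h, d_model, d_k)) * 0.02
w_v = rng.standard_normal((n_h, d_model, d_v)) * 0.02
w_o = rng.standard_normal((n_h * d_v, d_model)) * 0.02

queries = rng.standard_normal((10, d_model))   # hypothetical search tokens
memory = rng.standard_normal((20, d_model))    # hypothetical template tokens
out = multi_head(queries, memory, memory, w_q, w_k, w_v, w_o)
```

With these sizes, \(n_{h} d_{v} = 512 = d_{\text{model}}\), so the concatenated heads have the same width as the input and the output keeps one 512-dimensional vector per query token.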
Siamese backbone network
Convolutional neural networks such as ResNet [5], VGGNet [38] and AlexNet [39] have been successfully applied in Siamese network-based trackers and achieve robust tracking performance.
In the proposed tracker, we modify ResNet50 to serve as the Siamese backbone network by adding Feature Recognizers. Existing CNN models have two limitations. First, they significantly increase the computational complexity by naively stacking convolutional layers. Second, the features from the lower layers still have a limited receptive field. The proposed feature recognizer module alleviates both issues. Specifically, the FR module is an efficient and lightweight attention mechanism. The FR modules follow the conv1, conv2\(\_\)x and conv3\(\_\)x blocks, which lets lower-layer features benefit from contextual information. Based on these lightweight modules, the Siamese backbone network is trained in an end-to-end manner. The overall structure of FR is illustrated in Fig. 3. The FR module consists of channel and spatial attention branches.
Channel attention module. For the channel attention module, we use global average pooling to aggregate the feature map F in each channel. To indicate the importance of channels, we use the scaling factor \(\gamma \) of the batch normalization (BN) layer [40]. Taking the c-th channel as an example, the BN layer can be written as follows:
\[ \text{BN}(F_{c}) = \gamma _{c} \frac{F_{c} - {\hat{\mu }}_{c}}{\sqrt{{\hat{\sigma }}_{c}^{2} + \epsilon }} + \beta _{c} \]
where the subscript c indicates the parameter of the c-th channel, \(\epsilon \) is a small positive value for numerical stability, and \({\hat{\mu }}_{c}\) and \({\hat{\sigma }}_{c}^{2}\) denote the mean and variance of the mini-batch, respectively. \(\beta \) is the learnable shift parameter of the affine transformation that the BN layer applies to the normalized feature F.
On this basis, the input features are processed after global average pooling and BN layer. To highlight the feature response of the object and suppress the less salient feature responses (non-targets), we introduce a weight value to the feature response of each channel, and the weights are computed as
where \(w_i\) denotes the weight value of each channel, \(\gamma _i\) is the scaling factor in BN, and c denotes the number of channels.
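One plausible reading of the channel weighting is sketched here in NumPy: pooled channel descriptors are batch-normalized, and each channel is then weighted by its BN scaling factor divided by the sum over all channels. The normalization \(w_i = |\gamma _i| / \sum _{j} |\gamma _j|\), the toy values of \(\gamma \) and the function names are assumptions made for illustration, so this should be read as a sketch rather than the exact formulation.

```python
import numpy as np

def batch_norm_channels(x, gamma, beta, eps=1e-5):
    """Per-channel batch normalization of x with shape (N, C),
    using the mean and variance of the mini-batch."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def scaling_weights(gamma):
    """Assumed weighting: each scaling factor divided by the sum of all
    scaling factors (absolute values keep the weights non-negative)."""
    g = np.abs(gamma)
    return g / g.sum()

rng = np.random.default_rng(0)
pooled = rng.standard_normal((14, 4))      # 14 samples, 4 channels after global average pooling
gamma = np.array([2.0, 1.0, 0.5, 0.5])     # hypothetical learned scaling factors
beta = np.zeros(4)

normed = batch_norm_channels(pooled, gamma, beta)
w = scaling_weights(gamma)                 # [0.5, 0.25, 0.125, 0.125]
weighted = normed * w                      # channels with larger gamma respond more strongly
```

Under this reading, a channel whose scaling factor is driven toward zero during training contributes almost nothing to the attention response, which is how less salient channels would be suppressed.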
Lastly, the channel attention \(M_{c}(F) \in {\mathbb {R}}^{C \times 1 \times 1}\) is computed as
Spatial attention module. We exploit the spatial attention module to focus on the important spatial information of a target. It produces a spatial attention map \(M_{s}(F) \in {\mathbb {R}}^{H \times W}\) to emphasize or suppress features at different spatial locations. Contextual information enables the model to better focus on the spatial location of an object. To efficiently aggregate the contextual information, we use two 3 \(\times \) 3 dilated convolutions to enlarge the receptive field. Then, we use a 1 \(\times \) 1 convolution at the end of the spatial branch to reduce the features to an \({\mathbb {R}}^{H \times W}\) spatial attention map, and a BN layer is applied to rescale the feature map. Next, to measure the importance of pixels, we also apply the scaling factor of BN to the spatial dimension. The weights of the spatial attention module are computed as
where \(w_i\) denotes the weight value of each pixel, \(\lambda _i\) is the scaling factor of BN, and \(h \times w\) indicates the number of pixels.
Finally, the spatial attention \(M_{s}(F)\) is computed as follows:
where f is a convolution operation and the superscripts denote the convolutional filter sizes. To save both the number of parameters and computational overhead, we only use three convolution operations.
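To make the effect of the dilated convolutions concrete, the receptive field of a stack of stride-1 convolutions can be computed with a few lines of Python. The dilation value of 4 used here is a hypothetical choice, since the dilation rate is not stated; the helper name `receptive_field` is also illustrative.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    layers: list of (kernel_size, dilation) tuples, first layer first.
    Each layer widens the field by (kernel_size - 1) * dilation.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field

# Two 3x3 convolutions followed by a 1x1 convolution, as in the spatial branch.
plain = receptive_field([(3, 1), (3, 1), (1, 1)])     # no dilation -> 5
dilated = receptive_field([(3, 4), (3, 4), (1, 1)])   # assumed dilation 4 -> 17
```

A dilated 3 \(\times \) 3 kernel still has nine weights per input-output channel pair, so the larger receptive field comes without extra parameters.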
Overall structure. We adopt a residual mechanism and the logistic (sigmoid) function to facilitate gradient flow. First, we compute the channel attention \(M_{c}(F) \in {\mathbb {R}}^{C \times 1 \times 1}\) and the spatial attention \(M_{s}(F) \in {\mathbb {R}}^{1 \times H \times W}\) as two separate modules. Since these two attention maps have different shapes, we expand them to \({\mathbb {R}}^{C \times H \times W}\). Then, we use element-wise summation to combine the channel attention map and the spatial attention map. Finally, the attention map M(F) is computed as:
\[ M(F) = \sigma \left( M_{c}(F) + M_{s}(F)\right) \]
where \(\sigma \) is a sigmoid function. For the given input feature map \(F \in {\mathbb {R}}^{C \times H \times W}\), based on the channel and spatial attention modules, a 3D attention map \(M(F) \in {\mathbb {R}}^{C \times H \times W}\) is generated. The final output feature \(F^{\prime }\) is computed as
where \(\otimes \) denotes element-wise multiplication.
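The combination of the two attention branches can be sketched with NumPy broadcasting. The residual form \(F^{\prime } = F + F \otimes M(F)\), with \(\otimes \) taken element-wise, is an assumption based on the stated residual mechanism; the random inputs and the function name `feature_recognizer_output` are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_recognizer_output(feat, m_c, m_s):
    """feat: (C, H, W); m_c: (C, 1, 1); m_s: (1, H, W).

    Returns the refined feature and the 3D attention map.
    """
    # Broadcasting expands both maps to (C, H, W) before the summation.
    attn = sigmoid(m_c + m_s)
    # Assumed residual form: the identity path keeps the original feature.
    refined = feat + feat * attn
    return refined, attn

rng = np.random.default_rng(0)
channels, height, width = 8, 5, 5
feat = rng.standard_normal((channels, height, width))
m_c = rng.standard_normal((channels, 1, 1))   # stand-in for the channel branch output
m_s = rng.standard_normal((1, height, width)) # stand-in for the spatial branch output

refined, attn = feature_recognizer_output(feat, m_c, m_s)
```

Because the sigmoid keeps every attention value between 0 and 1, the residual form scales each feature entry by a factor between 1 and 2 and never zeroes out the input.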
To suppress less salient features and highlight the target features and target locations, we add a regularization term in the loss function as follows:
where F and \(F^{\prime }\) denote the input and output, respectively; W represents the FR module weights; \(l(\cdot )\) is the loss function; \(g(\cdot )\) is the \(l_{1}\) norm penalty function; \(\xi \) is the penalty that balances \(g(\gamma )\) and \(g(\lambda )\). \(\gamma \) and \(\lambda \) are the scaling factors of the channel attention module and the spatial attention module, respectively. Then, we jointly train the weights and these scaling factors with \(l_{1}\) regularization imposed on the scaling factors.
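A minimal sketch of such a regularized objective is given here, assuming the penalty is added to the task loss as \(\xi \left( \sum |\gamma | + \sum |\lambda |\right) \). Both this additive form and the numeric value of \(\xi \) are assumptions for illustration.

```python
import numpy as np

def regularized_loss(task_loss, gamma, lam, xi=1e-4):
    """Task loss plus an l1 penalty on the BN scaling factors of the
    channel branch (gamma) and the spatial branch (lam)."""
    penalty = np.abs(gamma).sum() + np.abs(lam).sum()
    return task_loss + xi * penalty

gamma = np.array([1.0, -2.0])   # hypothetical channel scaling factors
lam = np.array([0.5])           # hypothetical spatial scaling factor
total = regularized_loss(1.0, gamma, lam, xi=0.1)   # 1.0 + 0.1 * 3.5
```

The \(l_{1}\) penalty pushes unimportant scaling factors toward zero, which is consistent with the stated goal of suppressing less salient features.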
Multi-scale feature fusion network
In this section, we learn multi-scale feature representations in the Transformer model for object tracking. We propose a simple and effective cross-attention based multi-scale feature fusion network that produces robust image features. Specifically, to fuse multi-scale features more efficiently, we first project the encoded feature branch and the template feature branch into the same feature space so that their dimensions are aligned. Then, the encoded feature branch is used as the query. The template feature branch exchanges information with the encoded feature branch through cross-attention for multiple feature fusion. The encoded feature branch learns to abstract information in the Transformer encoder, and interacts with the template feature branch to combine features at different scales.
An illustration of the cross-attention operation in a multi-scale feature fusion network is shown in Fig. 4.
\(F_{encoded} \in {\mathbb {R}}^{n \times C \times H \times W}\) denotes the input to the encoded feature branch. \(T_{i} \in {\mathbb {R}}^{n \times C \times H \times W}\) denotes the input to the template feature branch, which are further concatenated to form the template feature ensemble \(T={\text {Concat}}\left( T_{1}, \ldots , T_{n}\right) \). Specifically, we adopt a projection function to map the features of both branches into the same feature space as follows:
where \(f^{l}(\cdot )\) is the projection function for dimension alignment, \(\otimes \) is the broadcasting element-wise multiplication. \(Q, K, V \in {\mathbb {R}}^{n \times C \times H \times W}\) are learnable parameters. M is a mask ensemble. In visual tracking, to reduce the interference of similar targets, we construct the Gaussian-shaped masks of the template features through \(m(y)=\exp \left( -\frac{\Vert y-c\Vert ^{2}}{2 \sigma ^{2}}\right) \), where c is the ground-truth target position. Then, we concatenate the reconstructed masks \(m_{i} \in {\mathbb {R}}^{H \times W}\) to obtain a mask ensemble \(M={\text {Concat}}\left( m_{1}, \cdots , m_{n}\right) \in {\mathbb {R}}^{n \times H \times W}\).
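The Gaussian-shaped masks and their concatenation into a mask ensemble can be sketched directly from the formula \(m(y)=\exp \left( -\Vert y-c\Vert ^{2} / (2 \sigma ^{2})\right) \). The map size, the target centers and the value of \(\sigma \) used here are hypothetical.

```python
import numpy as np

def gaussian_mask(height, width, center, sigma):
    """Evaluate m(y) = exp(-||y - c||^2 / (2 sigma^2)) on an H x W grid.

    center is the (row, column) ground-truth target position.
    """
    rows, cols = np.mgrid[0:height, 0:width]
    dist_sq = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2))

# One mask per template; stacking gives the ensemble M with shape (n, H, W).
centers = [(3, 3), (4, 5)]   # hypothetical target positions
masks = np.stack([gaussian_mask(8, 8, c, sigma=2.0) for c in centers])
```

Each mask equals 1 at the target center and decays with distance, so multiplying it with the template features down-weights positions where similar distractors may appear.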
As shown in Fig. 4, cross-attention is performed on Q and K. Then, the attention map (AM) generated in cross-attention is obtained as follows:
where C and h are the embedding dimension and the number of heads. After performing the cross-attention \(CA = AM \otimes V\), we propagate the mask ensemble from the template feature branch to the encoded feature branch. In addition, as in the Transformer, we also use multiple heads in cross-attention and refer to it as Multi-head Cross-Attention (MCA). MCA enables the fusion network to attend to multiple parts of the input feature simultaneously. This allows the fusion network to capture different types of information and dependencies within the input, leading to better tracking performance. MCA also increases the frequency of fusion across the template feature branch and the encoded feature branch. Finally, the output of the multi-scale feature fusion (MSF) network with layer normalization and residual structure is computed as follows:
where \(g^{l}(\cdot )\) is the back-projection function for the dimension alignment. LN denotes layer normalization. The final output feature is reshaped to the original size as \(MSF \in {\mathbb {R}}^{n \times C \times H \times W}\).
Experiments
In this section, we conduct extensive experiments on six challenging benchmarks including LaSOT [7], GOT-10k [8], OTB-100 [9], TrackingNet [41], UAV123 [10] and VOT2018 [11]. We also compare the proposed tracker with several state-of-the-art trackers on three small-scale datasets for inference speed, as shown in Table 1. To further validate the effectiveness of the proposed Siamese backbone network and multi-scale fusion network, we conduct an ablation study on GOT-10k and UAV123.
Implementation details
The proposed SiamFMT is implemented in PyTorch and run on an Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz with 16GB of memory and an NVIDIA GTX-1080Ti GPU. We utilize the training splits of LaSOT [7], TrackingNet [41] and GOT-10k [8] for offline training. We apply transformations to the original images to generate image pairs. Common data augmentation (such as translation and brightness jitter) is applied to enlarge the training sets. We set the central jitter factor and the scale jitter factor to 3 and 0.25, respectively. The sizes of the input template and search patches are 128\(\times \)128 and 256\(\times \)256, respectively. Our framework is trained for 50 epochs with 3,571 iterations per epoch and the batch size is set to 14. We train the model with the Adam optimizer, set the initial learning rate to \(1 \times 10^{-3}\) and decay it by a factor of 0.2 every 15 epochs. The proposed SiamFMT achieves competitive tracking performance against SOTA trackers.
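The stated schedule can be written out as a short Python sketch. It assumes a step decay with zero-indexed epochs, so the first decay takes effect at epoch 15; the function name and this indexing convention are assumptions.

```python
import math

def learning_rate(epoch, base_lr=1e-3, decay=0.2, step=15):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

epochs, iters_per_epoch, batch_size = 50, 3571, 14
schedule = [learning_rate(e) for e in range(epochs)]

total_iterations = epochs * iters_per_epoch          # 178,550 optimizer steps
pairs_per_epoch = iters_per_epoch * batch_size       # 49,994 image pairs per epoch
final_lr = schedule[-1]                              # three decays: 1e-3 * 0.2**3
```

In PyTorch, a comparable schedule would typically be obtained with `torch.optim.lr_scheduler.StepLR` using `step_size=15` and `gamma=0.2`, although the scheduler actually used is not specified.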
Ablation study and analysis
To verify the effectiveness of the designed feature recognizer module and multi-scale feature fusion network, we conduct the ablation study on the GOT-10k and UAV123 benchmarks.
We also use LaSOT [7], GOT-10k [8] and TrackingNet [41] as training sets on a single Nvidia 1080Ti GPU to train TrDiMP as the baseline. To further validate the generalization of SiamFMT, we choose the GOT-10k and UAV123 test sets to evaluate the proposed tracker. In GOT-10k, there is no overlap of object classes between the training and test sets.
Backbone architecture. We embed the Feature Recognizer module in ResNet50 [5] to constitute a Siamese backbone network for feature extraction. As shown in Tables 2 and 3, we conduct ablation experiments on the GOT-10k and UAV123 test sets, respectively. Compared with the baseline, our method improves the average overlap (AO) by 0.6%. The precision (Prec.) is improved by 1% from 0.853 to 0.863, and the success (AUC score) is improved by 0.5% from 0.643 to 0.648. The experimental results show that the proposed module has a positive effect on the tracking results.
Feature fusion network. To show the superiority of the multi-scale feature fusion network (MSF), we add the MSF to the baseline without feature recognizer modules and keep the other components unchanged. The multi-scale feature fusion network performs multiple fusions of template features and encoded features to combine features at different scales. Compared to the traditional cross-correlation method, our tracker can focus more on target edge information and is therefore more robust. In Table 3, by comparing the values in the second and fourth rows, we notice that the precision and the success are improved by 1.3% and 1%, respectively, with the same backbone. It is worth noting that, as shown in Table 2, the average overlap (AO) is improved by 1.7% from 66.2% to 67.9%, while keeping other components constant. Meanwhile, our method improves by 0.8% compared to TrDiMP [31].
Overall structure. Finally, we add both the Feature Recognizer module and the multi-scale feature fusion network to the baseline. It is worth pointing out that TrDiMP [31] already achieves outstanding results, while our approach consistently improves such a strong baseline. As shown in Table 2, compared with the baseline, the method with the Feature Recognizer and the multi-scale feature fusion network brings a 2.2% performance gain in average overlap (AO). Compared with TrDiMP, our approach also achieves a 1.3% improvement in average overlap (AO). By comparing the second row and fourth row in Table 3, the precision (Prec.) is improved by 3.2% from 0.853 to 0.885, and the success (AUC score) increases by 2.4% from 0.643 to 0.667. The results further demonstrate that the performance of our tracker has improved by 1.9% over the performance of TrDiMP. This benefits from the designed FR module and MSF.
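The UAV123 gains quoted in the ablation are absolute differences between scores, i.e., percentage points. The following short Python check recomputes them from the reported precision and success values; the dictionary layout and key names are illustrative.

```python
# Precision and success (AUC) on UAV123 as reported in the ablation study.
reported = {
    "baseline": {"precision": 0.853, "success": 0.643},
    "baseline + FR": {"precision": 0.863, "success": 0.648},
    "baseline + FR + MSF": {"precision": 0.885, "success": 0.667},
}

def gain_in_points(variant, metric, reference="baseline"):
    """Absolute gain over the reference, in percentage points."""
    delta = reported[variant][metric] - reported[reference][metric]
    return round(delta * 100, 1)

fr_gains = (gain_in_points("baseline + FR", "precision"),
            gain_in_points("baseline + FR", "success"))
full_gains = (gain_in_points("baseline + FR + MSF", "precision"),
              gain_in_points("baseline + FR + MSF", "success"))
```

Here `fr_gains` evaluates to 1.0 and 0.5 points and `full_gains` to 3.2 and 2.4 points, matching the figures quoted in the text.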
Some tracking results are visualized in Fig. 5. As can be seen in the second and third columns, our tracker highlights the locations of the targets well while suppressing background and similar-target information in the presence of distracting factors.
The number of cross-attention heads. In our method, multi-head cross-attention is able to fuse features at different scales and capture the dependencies between different features. Thus, the number of cross-attention heads is important. Table 4 lists the performance for different numbers of cross-attention heads. The tracking performance gradually improves as the number of heads increases. However, when the number of cross-attention heads is more than 6, the performance drops. We argue that excess cross-attention heads may lead to model overfitting. In addition, we observe that increasing the number of heads improves tracking accuracy, but decreases tracking speed. Therefore, to better balance the tracking performance and speed, we set the number of cross-attention heads to 4.
Evaluation on UAV123
UAV123 [10] is an aerial video dataset consisting of 123 low-altitude aerial video sequences. Different from other tracking benchmarks, the tracked targets in UAV123 are small because the viewpoint is in the air, which makes UAV123 very challenging for trackers. We compare the proposed tracker with 9 state-of-the-art and real-time trackers, including TrDiMP [31], TransT [32], DiMP50 [45], SiamAttn [28], SiamGAT [4], SiamBAN [18], CLNet [49], STMTrack [50] and SiamCAR [20]. A comparison with state-of-the-art trackers is shown in Fig. 6 in terms of the precision and success of OPE. Our tracker reaches a success score of 66.7% and a precision of 88.5%, which outperforms the recently proposed TrDiMP [31] by 1.9% on precision and TransT [32] by 0.7% on success.
Figure 7 reports the attribute-based evaluation of the proposed SiamFMT and nine representative state-of-the-art tracking algorithms. The proposed SiamFMT ranks first on the attributes of aspect ratio change, background clutter, camera motion, full occlusion, similar object and scale variation. The results demonstrate that our tracker is robust to complicated appearance variations.
Evaluation on OTB-100
OTB-100 is one of the most classic benchmarks for visual tracking. It consists of 98 video sequences with 11 interference attributes. These attributes include background clutter (BC), low resolution (LR), out-of-view (OV), illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR) and out-of-plane rotation (OPR).
The comparison with state-of-the-art trackers is shown in Fig. 8 in terms of the success and precision plots of OPE. Our tracker reaches a success score of 69.3% and a precision of 91.2%, surpassing many state-of-the-art trackers. In particular, our tracker significantly improves tracking success and precision on the attributes of background clutter (BC), occlusion (OCC), out-of-plane rotation (OPR) and out-of-view (OV). This benefits from our multi-scale feature fusion network. As shown in Fig. 9, the proposed tracker ranks first on these challenging attributes.
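For readers unfamiliar with the OPE plots, the two summary scores can be sketched as follows. The sketch follows the common OTB convention, which is assumed here rather than stated in the text: precision is the fraction of frames whose center location error is within 20 pixels, and the success score is the area under the success curve over 21 evenly spaced overlap thresholds. The per-frame values are made up for illustration.

```python
import numpy as np

def precision_score(center_errors, threshold=20.0):
    """Fraction of frames whose center location error (pixels) is within the threshold."""
    return float(np.mean(np.asarray(center_errors) <= threshold))

def success_auc(overlaps, num_thresholds=21):
    """Area under the success curve.

    The curve gives, for each overlap threshold in [0, 1], the fraction of
    frames whose IoU exceeds it; the AUC is the mean of the curve.
    """
    overlaps = np.asarray(overlaps)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    curve = np.array([np.mean(overlaps > t) for t in thresholds])
    return float(curve.mean()), curve

# Hypothetical per-frame results for one short sequence.
center_errors = [5.0, 10.0, 25.0, 40.0]
overlaps = [0.9, 0.7, 0.4, 0.0]

precision = precision_score(center_errors)
auc, curve = success_auc(overlaps)
```

Averaging over thresholds makes the success score sensitive to how tightly the predicted boxes fit, not only to whether the target is roughly localized.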
Evaluation on VOT2018
VOT2018 is a widely used benchmark for visual tracking and contains 60 video sequences. It evaluates the tracking performance on three metrics including accuracy (A), robustness (R) and expected average overlap (EAO).
As shown in Fig. 10, the proposed tracking algorithm is compared with nine state-of-the-art methods including ATOM [43], LADCF [51], SiamRPN [3], SiamMask [52], UPDT [53], RCO [54], DeepSTRCF [55], CPT [56] and SA-Siam-R [57]. Experimental results demonstrate that the proposed tracker achieves the top EAO score on VOT2018. Compared with the recent trackers SiamMask [52] and ATOM [43], our method improves EAO by 4.3% and 2.3%, respectively.
As shown in Table 5, our tracker achieves 0.617 accuracy, 0.192 robustness and 0.424 EAO on VOT2018. We further compare the proposed SiamFMT in terms of accuracy, robustness and EAO against state-of-the-art trackers including ATOM [43], DRT [58], DeepSTRCF [55], CPT [56], SA-Siam-R [57], LSART [59], ECO [60], CCOT [61] and SiamFC [2]. Compared with these trackers, the proposed tracking algorithm achieves superior tracking performance.
Evaluation on GOT-10k
GOT-10k [8] is a challenging large-scale dataset that consists of more than 10,000 videos. There is no overlap between the classes of the training and testing datasets. GOT-10k is usually used to evaluate the generalization of a tracker. We follow the GOT-10k protocol and train the proposed model on the given training datasets and test the proposed tracking algorithm SiamFMT on the given testing datasets. After uploading the tracking results to the official website, the corresponding tracking results in average overlap (AO) and success rate (\(SR_{0.50}\) and \(SR_{0.75}\)) are obtained.
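As an illustration of the two GOT-10k metrics, the sketch below computes AO and SR from a list of per-frame overlaps. It is a simplified reading of the metric definitions, not the evaluation server's code: the function names and the example overlaps are hypothetical, and the sketch assumes that all frames are pooled with equal weight.

```python
def average_overlap(overlaps):
    """AO: mean IoU between predicted and ground-truth boxes over all frames."""
    return sum(overlaps) / len(overlaps)

def success_rate(overlaps, threshold):
    """SR: fraction of frames whose IoU exceeds the given threshold."""
    return sum(o > threshold for o in overlaps) / len(overlaps)

# Hypothetical per-frame overlaps for one short sequence.
overlaps = [0.9, 0.8, 0.6, 0.4, 0.0]
ao = average_overlap(overlaps)
sr_50 = success_rate(overlaps, 0.50)
sr_75 = success_rate(overlaps, 0.75)
print(ao, sr_50, sr_75)
```

Because \(SR_{0.75}\) only counts frames with tight localization, it is the stricter of the two success rates and tends to separate trackers that AO ranks closely.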
As can be seen from Fig. 11, the proposed SiamFMT outperforms many state-of-the-art trackers in terms of success. As shown in Table 6, we evaluate our tracker on GOT-10k and compare it with state-of-the-art trackers including TrDiMP [31], SBT [36], STMTrack [50], KYS [62], PrDiMP [63], SiamGAT [4], DiMP50 [45], D3S [64], SiamFC++ [19], Ocean-offline [65], SiamCAR [20], SiamRPN++ [44], ATOM [43], SPM [66], SiamMask-EU [52], SiamRPN [3] and SiamFC [2]. The proposed tracker achieves excellent performance in terms of average overlap (AO) and success rates (\(SR_{0.50}\) and \(SR_{0.75}\)). In particular, the proposed SiamFMT achieves the second-best \(SR_{0.75}\), behind SBT-small [36], and outperforms the other trackers on this metric. Our tracker is 1.6% and 2.5% higher than SBT-small in AO and \(SR_{0.50}\), respectively. In addition, our method is 4.8% and 4.7% higher than KYS in AO and \(SR_{0.50}\), respectively. Compared with SiamGAT, the proposed tracker is 5.7%, 5.5% and 10.2% higher in terms of AO, \(SR_{0.50}\) and \(SR_{0.75}\), respectively. These results demonstrate that our tracker generalizes well.
Evaluation on LaSOT
LaSOT [7] is a large-scale, densely annotated, and challenging single object tracking dataset. The dataset contains 1400 sequences in total, of which 280 form the testing set. With an average length of over 2500 frames per sequence, LaSOT is more challenging than previous short-term tracking datasets. It is used to evaluate a tracker's ability to re-detect a target and to track over the long term. We use the one-pass evaluation, including success rate, precision and normalized precision, to compare different tracking algorithms on the LaSOT testing set, including TrSiam [31], STMTrack [50], SiamGAT [4], Ocean-online [65], SiamBAN [18], CLNet [49], SiamFC++ [19], SiamCAR [20], GlobalTrack [67], ATOM [43], SiamRPN++ [44], D3S [64], DiMP50 [45] and SiamFC [2]. From Fig. 12, we can see that the proposed tracking algorithm achieves superior tracking results against some state-of-the-art trackers. Compared with the recently proposed TrSiam [31], STMTrack [50] and SiamGAT [4], the proposed SiamFMT improves the AUC score by 0.4%, 2.2% and 8.9%, respectively.
We also report the success (Succ.), precision (Prec.) and normalized precision (N.Prec.) in Table 7. As Table 7 shows, the proposed SiamFMT obtains the best tracking result. The proposed SiamFMT outperforms SBT-small by 1.7% in Succ. and 0.6% in Prec. Meanwhile, compared with the Transformer-based tracker TrSiam, the proposed tracker is 4.4% higher in Prec. and 0.7% higher in N.Prec. The results indicate that the proposed tracker is competitive for long-term tracking tasks.
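Normalized precision differs from plain precision in that the centre error is scaled by the size of the ground-truth box, so the score does not depend on image resolution or target size. The sketch below assumes the commonly used convention of dividing the horizontal and vertical offsets by the ground-truth width and height and taking the area under the curve for thresholds between 0 and 0.5. The function names, the `(x, y, w, h)` box format and the 51 thresholds are illustrative assumptions, not the official LaSOT toolkit.

```python
import math

def normalized_center_error(pred, gt):
    """Centre offset scaled by the ground-truth width and height.

    Boxes are (x, y, w, h); the ground-truth box is assumed to have
    positive width and height.
    """
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    dx = ((px + pw / 2) - (gx + gw / 2)) / gw
    dy = ((py + ph / 2) - (gy + gh / 2)) / gh
    return math.hypot(dx, dy)

def normalized_precision_auc(preds, gts, num_thresholds=51):
    """Area under the normalized precision curve (assumed thresholds: 0 to 0.5)."""
    errors = [normalized_center_error(p, g) for p, g in zip(preds, gts)]
    thresholds = [0.5 * i / (num_thresholds - 1) for i in range(num_thresholds)]
    # Precision at each threshold: fraction of frames with error within it.
    rates = [sum(e <= t for e in errors) / len(errors) for t in thresholds]
    return sum(rates) / len(rates)

# Hypothetical example: the same 10-pixel offset on a 20-pixel-wide target.
print(normalized_center_error((10, 0, 20, 10), (0, 0, 20, 10)))
```

Under this convention, the same pixel offset counts as a larger error on a small target than on a large one.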
Evaluation on TrackingNet
TrackingNet [41] is a large-scale dataset with a test set of 511 video sequences covering different object classes and complex tracking scenarios. We submit the raw tracking results to the online evaluation server to obtain the tracking metrics shown in Table 8. We compare our tracker with state-of-the-art trackers such as STMTrack [50], DualTRF [35], UTT [68], TrDiMP [31], E.T.Track [47] and TrTr [69]. The proposed tracker achieves tracking results that are on par with the current state of the art. Among all the compared trackers, only STMTrack [50] and DualTRF [35] have a slightly higher N.Prec. than our tracker. However, in terms of Succ. and Prec., our tracker obtains 80.4% and 79.8%, respectively, outperforming other excellent Transformer-based trackers such as UTT [68], E.T.Track [47] and TrTr [69].
Conclusion
In this paper, we propose an effective and lightweight tracking framework. The framework includes two main parts: a Siamese backbone architecture based on hierarchical attention, and a multi-scale feature fusion (MSF) network based on cross-attention. The hierarchical attention, which consists of channel and spatial branches, is built on the designed feature recognizers to emphasize important feature elements. The multi-scale feature fusion network fuses template features and encoded features through cross-attention and allows the tracker to adapt to changes in target scale. The MSF network bridges the template and search branches and provides stronger encoded features for the subsequent decoders. The ablation study and experiments show that the proposed SiamFMT is robust to cluttered backgrounds, scale variation, and similar targets. On several mainstream benchmarks such as OTB-100, GOT-10k and UAV123, the proposed tracker performs significantly better than some state-of-the-art trackers.
Data availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
References
Zhang T, Liu X, Zhang Q, Han J (2022) Siamcda: complementarity- and distractor-aware rgb-t tracking based on siamese network. IEEE Trans Circ Syst Video Technol 32(3):1403–1417
Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr PH (2016) Fully-convolutional siamese networks for object tracking. In: European conference on computer vision, Springer, pp. 850–865
Li B, Yan J, Wu W, Zhu Z, Hu X (2018) High performance visual tracking with siamese region proposal network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8971–8980
Guo D, Shao Y, Cui Y, Wang Z, Zhang L, Shen C (2021) Graph attention tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9543–9552
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778
Chen C-F, Fan Q, Panda R (2021) Crossvit: Cross-attention multi-scale vision transformer for image classification. arXiv preprint arXiv:2103.14899
Fan H, Lin L, Yang F, Chu P, Deng G, Yu S, Bai H, Xu Y, Liao C, Ling H (2019) Lasot: A high-quality benchmark for large-scale single object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5374–5383
Huang L, Zhao X, Huang K (2019) Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence
Wu Y, Lim J, Yang M-H (2013) Online object tracking: A benchmark. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2411–2418
Mueller M, Smith N, Ghanem B (2016) A benchmark and simulator for uav tracking. In: European conference on computer vision, Springer, pp. 445–461
Kristan M, Leonardis A, Matas J, Felsberg M, Pflugfelder R, Čehovin Zajc L, Vojir T, Bhat G, Lukezic A, Eldesokey A, et al (2018) The sixth visual object tracking vot2018 challenge results. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops
Li X, Huang L, Wei Z (2022) A twofold convolutional regression tracking network with temporal and spatial mechanism. IEEE Trans Circ Syst Video Technol 32(3):1537–1551
Wang Y, Zhang W, Lai C, Wang J (2023) Adaptive temporal feature modeling for visual tracking via cross-channel learning. Knowl-Based Syst 265:110380
Guo Q, Feng W, Zhou C, Huang R, Wan L, Wang S (2017) Learning dynamic siamese network for visual object tracking. In: Proceedings of the IEEE international conference on computer vision, pp. 1763–1771
He A, Luo C, Tian X, Zeng W (2018) A twofold siamese network for real-time object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4834–4843
Zhu Z, Wang Q, Li B, Wu W, Yan J, Hu W (2018) Distractor-aware siamese networks for visual object tracking. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 101–117
Fan H, Ling H (2019) Siamese cascaded region proposal networks for real-time visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7952–7961
Chen Z, Zhong B, Li G, Zhang S, Ji R (2020) Siamese box adaptive network for visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6668–6677
Xu Y, Wang Z, Li Z, Yuan Y, Yu G (2020) Siamfc++: towards robust and accurate visual tracking with target estimation guidelines. Proc AAAI Conf Artificial Intell 34:12549–12556
Guo D, Wang J, Cui Y, Wang Z, Chen S (2020) Siamcar: Siamese fully convolutional classification and regression for visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6269–6277
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141
Park J, Woo S, Lee J-Y, Kweon IS (2020) A simple and light-weight attention module for convolutional neural networks. Int J Comput Vis 128(4):783–798
Yang Z, Zhu L, Wu Y, Yang Y (2020) Gated channel transformation for visual recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11794–11803
Fan J, Wu Y, Dai S (2010) Discriminative spatial attention for robust tracking. In: European Conference on computer vision, Springer, pp. 480–493
Choi J, Jin Chang H, Yun S, Fischer T, Demiris Y, Young Choi J (2017) Attentional correlation filter network for adaptive visual tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4807–4816
Lukezic A, Vojir T, Čehovin Zajc L, Matas J, Kristan M (2017) Discriminative correlation filter with channel and spatial reliability. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6309–6318
Wang Q, Teng Z, Xing J, Gao J, Hu W, Maybank S (2018) Learning attentions: residual attentional siamese network for high performance online visual tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4854–4863
Yu Y, Xiong Y, Huang W, Scott MR (2020) Deformable siamese attention networks for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6728–6737
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp. 5998–6008
Cui Y, Jiang C, Wang L, Wu G (2022) Mixformer: End-to-end tracking with iterative mixed attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13608–13618
Wang N, Zhou W, Wang J, Li H (2021) Transformer meets tracker: Exploiting temporal context for robust visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1571–1580
Chen X, Yan B, Zhu J, Wang D, Yang X, Lu H (2021) Transformer tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8126–8135
Cao Z, Fu C, Ye J, Li B, Li Y (2021) Hift: Hierarchical feature transformer for aerial tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 15457–15466
Lin L, Fan H, Xu Y, Ling H (2021) Swintrack: A simple and strong baseline for transformer tracking. arXiv preprint arXiv:2112.00995
Xie F, Wang C, Wang G, Yang W, Zeng W (2021) Learning tracking representations via dual-branch fully transformer networks. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 2688–2697
Xie F, Wang C, Wang G, Cao Y, Yang W, Zeng W (2022) Correlation-aware deep tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8751–8760
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Advances in neural information processing systems 30
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inform Process Syst 25:1097–1105
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning, PMLR, pp. 448–456
Muller M, Bibi A, Giancola S, Alsubaihi S, Ghanem B (2018) Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In: Proceedings of the European conference on computer vision (ECCV), pp. 300–317
Kiani Galoogahi H, Fagg A, Huang C, Ramanan D, Lucey S (2017) Need for speed: A benchmark for higher frame rate object tracking. In: Proceedings of the IEEE international conference on computer vision, pp. 1125–1134
Danelljan M, Bhat G, Khan FS, Felsberg M (2019) Atom: Accurate tracking by overlap maximization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4660–4669
Li B, Wu W, Wang Q, Zhang F, Xing J, Yan J (2019) Siamrpn++: Evolution of siamese visual tracking with very deep networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4282–4291
Bhat G, Danelljan M, Gool LV, Timofte R (2019) Learning discriminative model prediction for tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6182–6191
Mayer C, Danelljan M, Paudel DP, Van Gool L (2021) Learning target candidate association to keep track of what not to track. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 13444–13454
Blatter P, Kanakis M, Danelljan M, Van Gool L (2023) Efficient visual tracking with exemplar transformers. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1571–1581
Mayer C, Danelljan M, Bhat G, Paul M, Paudel DP, Yu F, Van Gool L (2022) Transforming model prediction for tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8731–8740
Dong X, Shen J, Shao L, Porikli F (2020) Clnet: A compact latent network for fast adjusting siamese trackers. In: Computer vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, Springer, pp. 378–395
Fu Z, Liu Q, Fu Z, Wang Y (2021) Stmtrack: Template-free visual tracking with space-time memory networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13774–13783
Xu T, Feng Z-H, Wu X-J, Kittler J (2019) Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking. IEEE Trans Image Process 28(11):5596–5609
Wang Q, Zhang L, Bertinetto L, Hu W, Torr PH (2019) Fast online object tracking and segmentation: A unifying approach. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1328–1338
Bhat G, Johnander J, Danelljan M, Khan FS, Felsberg M (2018) Unveiling the power of deep tracking. In: Proceedings of the European conference on computer vision (ECCV), pp. 483–498
He Z, Fan Y, Zhuang J, Dong Y, Bai H (2017) Correlation filters with weighted convolution responses. In: Proceedings of the IEEE international conference on computer vision workshops, pp. 1992–2000
Li F, Tian C, Zuo W, Zhang L, Yang M-H (2018) Learning spatial-temporal regularized correlation filters for visual tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4904–4913
Che M, Wang R, Lu Y, Li Y, Zhi H, Xiong C (2018) Channel pruning for visual tracking. In: Proceedings of the European conference on computer vision (ECCV) Workshops
He A, Luo C, Tian X, Zeng W (2018) Towards a better match in siamese network based visual object tracker. In: Proceedings of the European conference on computer vision (ECCV) Workshops
Sun C, Wang D, Lu H, Yang M-H (2018) Correlation tracking via joint discrimination and reliability learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 489–497
Sun C, Wang D, Lu H, Yang M-H (2018) Learning spatial-aware regressions for visual tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8962–8970
Danelljan M, Bhat G, Shahbaz Khan F, Felsberg M (2017) Eco: Efficient convolution operators for tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6638–6646
Danelljan M, Robinson A, Khan FS, Felsberg M (2016) Beyond correlation filters: Learning continuous convolution operators for visual tracking. In: European conference on computer vision, Springer, pp. 472–488
Bhat G, Danelljan M, Van Gool L, Timofte R (2020) Know your surroundings: Exploiting scene information for object tracking. In: European conference on computer vision, Springer, pp. 205–221
Danelljan M, Gool LV, Timofte R (2020) Probabilistic regression for visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7183–7192
Lukezic A, Matas J, Kristan M (2020) D3s-a discriminative single shot segmentation tracker. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7133–7142
Zhang Z, Peng H, Fu J, Li B, Hu W (2020) Ocean: Object-aware anchor-free tracking. In: Computer vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, Springer, pp. 771–787
Wang G, Luo C, Xiong Z, Zeng W (2019) Spm-tracker: series-parallel matching for real-time visual object tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3643–3652
Huang L, Zhao X, Huang K (2020) Globaltrack: a simple and strong baseline for long-term tracking. Proc AAAI Conf Artificial Intell 34:11037–11044
Ma F, Shou MZ, Zhu L, Fan H, Xu Y, Yang Y, Yan Z (2022) Unified transformer tracker for object tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8781–8790
Zhao M, Okada K, Inaba M (2021) Trtr: Visual tracking with transformer. arXiv preprint arXiv:2105.03817
Cui Y, Jiang C, Wang L, Wu G (2021) Target transformed regression for accurate tracking. arXiv preprint arXiv:2104.00403
Shen Q, Qiao L, Guo J, Li P, Li X, Li B, Feng W, Gan W, Wu W, Ouyang W (2022) Unsupervised learning of accurate siamese tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8101–8110
Zheng J, Ma C, Peng H, Yang X (2021) Learning to track objects from unlabeled videos. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 13546–13555
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 61861032).
Author information
Contributions
We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship. We further confirm that the order of authors listed in the manuscript has been approved by all of us. We understand that the corresponding author is the sole contact for the Editorial process. She is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, J., Yin, P., Yang, W. et al. Exploiting multi-scale hierarchical feature representation for visual tracking. Complex Intell. Syst. 10, 3617–3632 (2024). https://doi.org/10.1007/s40747-024-01345-y
DOI: https://doi.org/10.1007/s40747-024-01345-y