1 Introduction

Thermal infrared (TIR) target tracking is attracting growing research interest because it remains effective in dark environments [1,2,3,4]. Although current research has made some progress, tracking results are still far from ideal because a single type of feature cannot adequately represent the target, which remains an open problem [5,6,7].

Recently, influenced by the success of CNN architectures in various visual tasks [8,9,10,11,12,13,14,15,16], several methods have used the powerful representation capabilities of CNNs to improve TIR target tracking performance [2, 17,18,19,20]. The MCFTS [2] tracker uses a pre-trained CNN to extract multi-layer convolutional features of the infrared target and combines them with kernel correlation filters to construct an ensemble TIR tracking method, which achieves good tracking results. Gao et al. [17] introduced pre-trained deep appearance features and deep optical flow features into a structured output support vector machine for TIR target tracking. Li et al. [21] proposed a TIR tracking method using a sparse representation of deep semantic features, which are obtained from a pre-trained convolutional neural network combined with a target feature channel selection module trained in a supervised manner. The HSSNet [20] tracker trains a hierarchical spatial-aware feature model end-to-end within a matching framework to represent the TIR target and designs a matching-based TIR tracking method. To adapt to changes in target appearance, Gundogdu et al. [4] proposed an ensemble TIR tracking method based on correlation filters using convolutional neural network features. Zhang et al. [19] proposed using a generative adversarial network to convert visible-light images into TIR images and used the synthesized TIR images to train a matching-based Siamese network. The Siamese network is then used to extract deep features of the TIR targets, which are integrated for TIR target tracking.
Although the above methods have made some progress, a single type of feature cannot fully characterize the appearance of the target, so the TIR target features obtained from these networks do not yield optimal tracking results.

To address this issue, we develop an adaptive multi-feature fusion model (AMFT) that tracks TIR targets with high efficiency and robustness. Hand-crafted features carry good spatial structure information, which makes it easier to distinguish targets from backgrounds, but their ability to characterize the target is limited. Deep convolutional neural network features carry discriminative semantic information that helps locate the target accurately, but they cannot adapt to changes in the spatial location of the target. Our AMFT tracking method adaptively fuses the advantages of hand-crafted features and deep convolutional neural network features so that it can track targets more accurately and robustly. Figure 1 shows that the tracking results of AMFT are close to the ground-truth labels of the tracked target, demonstrating the effectiveness of the proposed multi-feature fusion model.

Fig. 1 Tracking examples. AMFT_H represents the tracking results with only hand-crafted features, while AMFT_D represents the tracking results with only deep convolutional neural networks features

The main contributions are summarized as follows:

  • An improved tracking method based on multi-feature fusion (AMFT) is proposed for the TIR target tracking task.

  • The AMFT tracker trains a multi-feature fusion model that adaptively integrates the complementary strengths of several features to better characterize the target appearance.

  • Extensive comparative evaluations show that the proposed AMFT tracker outperforms other trackers on the PTB-TIR [22] and LSOTB-TIR [23] benchmarks.

2 Related works

In this section, we introduce the studies most relevant to our tracking method, including related tracking methods [24,25,26,27,28] and multi-feature fusion methods [29,30,31,32,33,34].

Correlation filters (CF) measure the similarity between two signals by performing a correlation operation on them. In target tracking, this amounts to measuring the similarity between the tracked target and the candidates: the candidate in the search area with the greatest similarity to the target is taken as the tracking result. Because CF can be trained quickly on numerous training samples, CF-based trackers have achieved good tracking results [35,36,37,38]. Most trackers based on the CF framework exploit the cyclic structure of the training samples to learn linear filters. The image patches produced by cyclic shifts only approximate translations of the target and cannot simulate the real tracking scene. When the tracking scene is more complicated, the results obtained from the response map are often inaccurate, and the target is lost. To obtain the desired output response map, Bibi et al. [39] replaced the scores of the cyclically shifted samples with the scores of real samples, which compensates for the shortcomings of a manually set response map. Owing to the good properties of the CF-based tracking framework, many attempts have been made to introduce it into the TIR target tracking task [1, 4, 6, 29, 40]. He et al. [1] introduced a weighted correlation filter-based infrared target tracking method to obtain efficient tracking results. Gundogdu et al. [4] verified that good TIR target tracking accuracy can be achieved by using deep convolutional features in a CF-based tracker.

Recently, deep learning-based trackers have achieved good results on the target tracking task. Convolutional neural networks have become more popular in target tracking due to their formidable feature extraction capabilities [41,42,43,44,45,46,47]. In [42], Wang et al. proposed training an unsupervised CNN model on large-scale unlabeled images, which addresses the shortage of labeled training samples. Siamese network-based tracking approaches treat the tracking task as a template-matching problem and return the most similar target candidate as the tracking result by calculating the similarity between the template target and the target candidates. The SiamFC [43] tracker introduces a fully-convolutional Siamese network for the tracking task. The CFNet [44] tracker treats the correlation filter as a network layer in the deep network architecture to obtain a faster tracking speed. Dong et al. [45] introduced a triplet loss, used in place of the pairwise loss when training the Siamese network, to extract expressive deep features for visual tracking. In [48], a structured target-aware model was proposed to improve target tracking performance in TIR scenarios.

Multi-feature fusion is a common way to improve tracker performance in the target tracking task [31,32,33, 49,50,51,52]. Liu et al. [31] proposed simultaneously learning local structural features and global semantic features of TIR images within a matching network framework to enhance the ability of the feature model to discriminate similar distractors. The HDT [32] tracker integrates multiple weak correlation filters based on deep features through a Hedge method, which automatically updates the weight of each weak tracker so as to locate the target more accurately. The MFFT [33] tracker exploits the complementarity between multiple different features to enhance the robustness of the tracking method. In [2], the MCFTS tracker uses a Kullback–Leibler divergence fusion method to integrate multiple convolutional feature-based correlation filters for the TIR target tracking task. Zhang et al. [53] proposed a tracking framework that integrates RGB and TIR images in the RGBT tracking task in an end-to-end way. In a deep RGBT tracking framework, Li et al. [54] described a multi-adapter convolutional network that performs modality-shared, modality-specific, and instance-aware feature learning simultaneously.

Although these trackers have achieved acceptable tracking performance, existing approaches are still unable to obtain optimal results when faced with interference from similar objects, occlusions, and other difficult challenges, owing to the complexity of TIR tracking scenarios. We present a multi-feature fusion-based tracking approach that relies on the complementarity between different types of features to improve tracking performance in these TIR tracking scenarios.

3 The proposed AMFT tracker

To achieve more accurate TIR target tracking, we propose a multi-feature fusion model that characterizes the target appearance more comprehensively. We use the correlation filters-based tracking framework to generate a response map for each type of feature and fuse these complementary response maps to locate the target more accurately. First, we briefly introduce the correlation filters-based tracking framework. After that, we present the multi-feature fusion model for accurate TIR target tracking.

3.1 Correlation filters-based tracking framework

Correlation filters (CF)-based trackers have been extensively studied in recent years, and they have greatly improved tracking speed while maintaining tracking accuracy. CF-based tracking methods usually train a classifier to distinguish the target from the background [35, 36, 46]. We construct a weak target tracker using the correlation filters for every single category of features and then construct a multi-feature fusion tracker by fusing the response maps of the multiple weak trackers, to better handle the challenging problems in the TIR target tracking task. The correlation filters \(w_k\) corresponding to the k-th features can be obtained as follows:

$$\begin{aligned} w_k =\mathop {\arg \min }_{w_k} (||w_k * x_k - y||^2+\lambda ||w_k||^2), \end{aligned}$$
(1)

where \(w_k\) is the trained correlation filters for the k-th features, \(x_k\) denotes the k-th features of the training samples, y denotes the Gaussian-shaped label of the training samples, and \(\lambda\) is a regularization parameter. Let \({\mathscr {F}}\) denote the Fourier transform; solving Eq. (1) in the Fourier domain yields a closed-form solution of the following form:

$$\begin{aligned} w_k = {\mathscr {F}}^{-1}\left( \frac{\hat{x_k} \odot {\hat{y}}}{\hat{x_k}^* \odot \hat{x_k} +\lambda }\right) , \end{aligned}$$
(2)

where \({\mathscr {F}}(x_k)=\hat{x_k}\), \({\mathscr {F}}(y)={\hat{y}}\), and \(\hat{x_k}^*\) is the complex conjugate of \(\hat{x_k}\).
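The closed-form training step can be illustrated with a minimal NumPy sketch. It is not the authors' MATLAB implementation: it assumes a single-channel feature map, keeps the filter in the Fourier domain, and places the conjugate in the numerator, which is one common convention and may differ from Eq. (2) by a conjugation depending on whether \(*\) in Eq. (1) is read as convolution or correlation. The function names `gaussian_label` and `train_filter`, the label width `sigma`, and the value of `lam` are illustrative assumptions.

```python
import numpy as np


def gaussian_label(height, width, sigma=2.0):
    """Gaussian-shaped label y whose peak is rolled to index (0, 0)."""
    rows = np.arange(height) - height // 2
    cols = np.arange(width) - width // 2
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    y = np.exp(-(rr ** 2 + cc ** 2) / (2.0 * sigma ** 2))
    # Move the peak from the patch centre to the origin (a common CF convention).
    return np.roll(y, (-(height // 2), -(width // 2)), axis=(0, 1))


def train_filter(x, y, lam=1e-3):
    """Return the Fourier-domain filter for one single-channel feature map x."""
    x_hat = np.fft.fft2(x)
    y_hat = np.fft.fft2(y)
    # Element-wise ridge-regression solution; the conjugate placement is an assumption.
    return (np.conj(x_hat) * y_hat) / (np.conj(x_hat) * x_hat + lam)


# Toy usage: random numbers stand in for real HOG or CNN feature maps.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
y = gaussian_label(32, 32)
w_hat = train_filter(x, y)
# Applying the filter to its own training patch should approximately reproduce the label.
response = np.real(np.fft.ifft2(np.fft.fft2(x) * w_hat))
```

Because every operation is element-wise in the Fourier domain, training costs only a few FFTs per feature type, which is the source of the speed advantage of CF-based trackers.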

The purpose of the search phase is to obtain the response map of the target position in the search frame. First, given the search area z of a TIR image target, we extract the different types of features \(z_k\). Then, the features \(z_k\) are transformed into the Fourier domain: \({\mathscr {F}}(z_k)=\hat{z_k}\). The response map of the target location based on the k-th features \(z_k\) can then be obtained by the following cross-correlation operation:

$$\begin{aligned} R_k = {\mathscr {F}}^{-1}(\hat{z_k} \otimes \hat{w_k}), \end{aligned}$$
(3)

where \(\otimes\) is the cross-correlation operator and \(R_k\) is the response map of the k-th features \(z_k\).
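As a hedged illustration of the search phase, the sketch below applies a Fourier-domain filter to a search patch and reads off the peak of the response map. It assumes the operator in Eq. (3) can be realised as an element-wise product in the Fourier domain under the filter convention used here; the helper names `response_map` and `peak_location`, the idealised delta label, and the toy data are assumptions for illustration only.

```python
import numpy as np


def response_map(z, w_hat):
    """Response map for search-patch features z given a Fourier-domain filter."""
    z_hat = np.fft.fft2(z)
    # An element-wise product in the Fourier domain is a circular operation in space.
    return np.real(np.fft.ifft2(z_hat * w_hat))


def peak_location(response):
    """Row and column index of the maximum response."""
    r, c = np.unravel_index(np.argmax(response), response.shape)
    return int(r), int(c)


# Toy check: train on a random patch x with an idealised delta label at the origin
# (the paper uses a Gaussian-shaped label), then search a circularly shifted copy.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
y = np.zeros((32, 32))
y[0, 0] = 1.0
x_hat = np.fft.fft2(x)
w_hat = np.conj(x_hat) * np.fft.fft2(y) / (np.conj(x_hat) * x_hat + 1e-3)

z = np.roll(x, (3, 5), axis=(0, 1))  # target moved 3 rows down and 5 columns right
shift = peak_location(response_map(z, w_hat))
```

In this toy setting the peak of the response map sits at the displacement of the target, which is how a CF-based weak tracker converts a response map into a position estimate.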

3.2 Multi-feature fusion model

Given the response maps of the target position generated by the trackers based on the different types of features, the goal of the fusion model is to fuse the response maps \(R_k\) into a stronger response map R and predict the target location. Each response map \(R_k\) can be regarded as a probability distribution over the positions (i, j) of the target (\(\sum r_k^{ij} = 1\), where \(r_k^{ij}\) represents the probability that the target is at position (i, j) in the response map \(R_k\), \(i =\{1, 2,\ldots , N\}\), \(j =\{1, 2,\ldots , M\}\)). The fused response map R should reflect the part on which the individual response maps \(R_k\) agree. The position of the maximum value in the fused response map is taken as the predicted position of the target, so as to locate the target more accurately. Figure 2 shows an overview of the proposed multi-feature fusion model for the TIR target tracking process. To achieve this fusion, the distribution of the response map R is expected to be as close as possible to the distribution of each response map \(R_k\). To measure the difference between the distribution of each response map \(R_k\) and that of the fused response map R, we adopt the Jensen–Shannon (JS) divergence as a generalized distance between them: the smaller the distance, the smaller the difference between the distributions. The JS divergence is a symmetric measure of the similarity of two distributions, which can give full play to the advantages of each response map [55, 56]. By minimizing the JS divergence, we obtain the optimal fused response map:

$$\begin{aligned} \begin{aligned}&R =\mathop {\arg \min }_{R} \sum JS(R_k||R)\\&\hbox {s.t.} \ \ \sum r^{ij} =1, \end{aligned} \end{aligned}$$
(4)

where \(JS(R_k||R) = \frac{1}{2} KL(R_k||M) + \frac{1}{2} KL(R||M)\), \(M = \frac{1}{2} (R_k + R)\), \(KL(R_k||M) = \sum r_k^{ij} \log \frac{r_k^{ij}}{m^{ij}}\), and \(KL(R||M) = \sum r^{ij} \log \frac{r^{ij}}{m^{ij}}\). Here KL denotes the Kullback–Leibler divergence, and \(r_k^{ij}\), \(r^{ij}\), \(m^{ij}\) represent the values at position (i, j) in the response maps \(R_k\), R and M, respectively.
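The JS divergence between two response maps can be computed directly from these definitions. The sketch below is illustrative only: the text does not say how raw response maps are turned into probability distributions, so the min-shift normalisation in `to_distribution` and the small constant `eps` that guards the logarithm are assumptions, as are the function names.

```python
import numpy as np


def to_distribution(response, eps=1e-12):
    """Shift a response map to be non-negative and normalise it to sum to 1."""
    shifted = response - response.min() + eps
    return shifted / shifted.sum()


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum of p * log(p / q) over all positions (natural log)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


def js_divergence(p, q):
    """JS(p || q) = 0.5 * KL(p || m) + 0.5 * KL(q || m) with m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy usage with two random stand-in response maps.
rng = np.random.default_rng(1)
p = to_distribution(rng.standard_normal((8, 8)))
q = to_distribution(rng.standard_normal((8, 8)))
d_pq = js_divergence(p, q)
d_qp = js_divergence(q, p)
```

Unlike the KL divergence, the JS divergence gives the same value when its arguments are swapped and is bounded above by \(\log 2\) when the natural logarithm is used.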

Fig. 2 The overview of the proposed multi-feature fusion model for the TIR target tracking

To effectively exploit the complementarity between different types of features [2, 57], we filter the response maps corresponding to different types of features as follows:

$$\begin{aligned} R_{j,k} = R_j \odot R_k, \end{aligned}$$
(5)

Equation (5) indicates that if two response maps have similar probability distributions in the same area, the filtered response map has a higher response value in that area; otherwise, it has a lower response value. After that, Eq. (4) can be rewritten as follows:

$$\begin{aligned} \begin{aligned}&R =\mathop {\arg \min }_{R} \sum JS(R_{j,k}||R)\\&\hbox {s.t.} \ \ \sum r^{ij} =1. \end{aligned} \end{aligned}$$
(6)

The position with the highest value in the fused response map R is taken as the location of the tracked target.
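The fusion step can be sketched as follows. The text does not state how the minimisation in Eq. (6) is solved, so the solver here is an assumption: since the derivative of \(JS(P||R)\) with respect to \(r^{ij}\) works out to \(\frac{1}{2}\log (r^{ij}/m^{ij})\), the stationarity condition under the sum-to-one constraint makes R proportional to the geometric mean of the mixtures \(\frac{1}{2}(P_n + R)\), and the sketch simply iterates that relation without a convergence proof. Pairing only distinct feature types in Eq. (5), the renormalisation of each product, and all function names are likewise assumptions.

```python
import numpy as np
from itertools import combinations


def normalise(p, eps=1e-12):
    """Clip to non-negative values, add a small floor, and rescale to sum to 1."""
    p = np.clip(p, 0.0, None) + eps
    return p / p.sum()


def pairwise_filtered_maps(maps):
    """Eq. (5): element-wise product of each pair of response maps, renormalised."""
    return [normalise(a * b) for a, b in combinations(maps, 2)]


def fuse_js(dists, iters=50):
    """Heuristic fixed-point iteration for the sum-of-JS objective (cf. Eq. 6)."""
    stack = np.stack(dists)
    r = stack.mean(axis=0)  # start from the arithmetic mean
    for _ in range(iters):
        m = 0.5 * (stack + r)  # one mixture per input distribution
        r = normalise(np.exp(np.log(m).mean(axis=0)))  # geometric mean of mixtures
    return r


def locate(fused):
    """Position of the maximum of the fused response map."""
    r, c = np.unravel_index(np.argmax(fused), fused.shape)
    return int(r), int(c)


def bump(center, sigma, shape=(24, 24)):
    """Toy Gaussian bump standing in for a response map."""
    rr, cc = np.mgrid[:shape[0], :shape[1]]
    return np.exp(-((rr - center[0]) ** 2 + (cc - center[1]) ** 2) / (2 * sigma ** 2))


# Three stand-in response maps agree on (10, 12); the second also has a distractor.
maps = [
    normalise(bump((10, 12), 2.0)),
    normalise(bump((10, 12), 3.0) + 0.8 * bump((3, 4), 1.0)),
    normalise(bump((10, 12), 1.5)),
]
filtered = pairwise_filtered_maps(maps)
fused = fuse_js(filtered)
position = locate(fused)
```

In this toy example the distractor that appears in only one response map is suppressed by the pairwise products, so the fused map peaks where the maps agree.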

3.3 Model update

To adapt to the dynamic changes in the TIR target appearance during the whole tracking process, the correlation filters need to be updated continuously. We follow other correlation filters-based trackers [2, 35], which use a simple but effective linear update method to update the correlation filters:

$$\begin{aligned} w_k^t = (1-\gamma )\, w_k^{t-1} + \gamma \, {\tilde{w}}_k^t, \end{aligned}$$
(7)

where \(\gamma\) represents the learning rate of the filters, \(w_k\) represents the correlation filters corresponding to the k-th features, \({\tilde{w}}_k^t\) represents the filters trained on the t-th frame alone, and \(w_k^t\) represents the updated filters used after the t-th frame.
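A linear update of this kind can be sketched in a few lines. The sketch uses the running-average form common in KCF-style trackers, in which the previous filter is weighted by \(1-\gamma\); that weighting, the function name `update_filter`, and the value of `gamma` in the toy loop are assumptions for illustration rather than the paper's exact settings.

```python
import numpy as np


def update_filter(w_prev, w_new, gamma):
    """Blend the previous filter with the filter trained on the current frame."""
    return (1.0 - gamma) * w_prev + gamma * w_new


# Toy usage: the model drifts slowly from an all-zero filter toward an all-one filter.
w_model = np.zeros((4, 4))
w_frame = np.ones((4, 4))
history = []
for _ in range(3):
    w_model = update_filter(w_model, w_frame, gamma=0.01)
    history.append(float(w_model[0, 0]))
```

A small learning rate keeps the filter stable under brief occlusions, while a larger one adapts faster to appearance changes at the risk of drifting.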

4 Experiments

We verified the tracking performance of the proposed AMFT tracker on the PTB-TIR [22] and LSOTB-TIR [23] benchmarks against several other trackers, namely MCFTS [2], HSSNet [20], SiamFC [43], TADT [58], MLSSNet [31], HCF [49], HDT [32], SRDCF [38], UDT [42], CFNet [44], SiamTri [45], CREST [46], VITAL [59], GFSDCF [60], MDNet [61], Staple [62], and MCCT [63]. The evaluation criteria are the precision score and the success score under One Pass Evaluation (OPE) [22].
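The two criteria are not defined in the text. The sketch below assumes the definitions commonly used in OPE-style benchmarks: the precision score is the fraction of frames whose centre location error is within a pixel threshold, and the success score is the area under the curve of the fraction of frames whose bounding-box overlap exceeds a threshold. The 20-pixel threshold, the 21 overlap thresholds, the (x, y, w, h) box format, and the function names are all assumptions.

```python
import numpy as np


def center_error(pred, gt):
    """Euclidean distance between box centres; boxes are rows of (x, y, w, h)."""
    pred_centres = pred[:, :2] + pred[:, 2:] / 2.0
    gt_centres = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_centres - gt_centres, axis=1)


def iou(pred, gt):
    """Intersection over union for each pair of predicted and ground-truth boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union


def precision_score(pred, gt, threshold=20.0):
    """Fraction of frames whose centre error is within the pixel threshold."""
    return float(np.mean(center_error(pred, gt) <= threshold))


def success_score(pred, gt, num_thresholds=21):
    """Mean over overlap thresholds of the fraction of frames exceeding each one."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([np.mean(overlaps > t) for t in thresholds]))


# Toy usage: the first frame is tracked perfectly, the second is lost.
gt = np.array([[10, 10, 20, 20], [10, 10, 20, 20]], dtype=float)
pred = np.array([[10, 10, 20, 20], [100, 100, 20, 20]], dtype=float)
precision = precision_score(pred, gt)
success = success_score(pred, gt)
```

The precision score only looks at where the box centre is, whereas the success score also penalises wrong box sizes, which is why a tracker can rank differently on the two metrics.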

4.1 Implementation details

The experiments are implemented in MATLAB 2019b on a PC equipped with an i7-10700 2.90 GHz CPU and an Nvidia GTX 1660 GPU, using the MatConvNet 1.0-beta25 toolbox. The tracking speed of the proposed AMFT tracker is around 7 fps. The features used in the AMFT tracker include Color Names (CN) [64], HOG [35], and deep CNN features from ResNet50 [65]. The regularization parameter \(\lambda = 10\)e−4, and the learning rate \(\gamma = 10\)e−2. An interpolation strategy is adopted to estimate the target scale [36, 60]; it is used to predict the target location and scale with a scale factor of 7 and a scale step of 1.01.

4.2 Ablation studies

To demonstrate the effectiveness of each type of feature in the proposed AMFT tracker, we conduct ablation studies on the PTB-TIR [22] benchmark. The experimental results are shown in Table 1. Note that AMFT_H represents the tracking results with only hand-crafted features, while AMFT_D represents the tracking results with only deep convolutional neural network features. Because targets in thermal infrared images lack color and rich texture and have relatively blurred contours, neither deep features alone nor hand-crafted features alone can represent the target well, resulting in low tracking accuracy. Table 1 shows that our AMFT tracker significantly improves tracking performance compared with the single-type feature-based trackers. We also report the tracking speed of the different trackers on the PTB-TIR benchmark. The tracking speeds show that multi-feature fusion slightly increases the amount of computation and reduces the tracking speed.

Table 1 Ablation studies on PTB-TIR [22] benchmark

4.3 Comparative experiments on PTB-TIR benchmark

The comparison between the proposed tracker and other state-of-the-art trackers is shown in Fig. 3. From this figure, we conclude that our AMFT tracker compares favorably with the competing trackers in terms of the precision and success metrics. Compared with the single-type feature-based trackers [38, 43, 46, 62], our AMFT tracker performs markedly better on the tracking evaluation metrics. Compared with the multi-layer fusion trackers [2, 20, 31, 32, 49], our tracker also achieves competitive tracking performance. Although our AMFT tracker performs somewhat worse than the MDNet [61] tracker in the precision metric, it has a markedly higher success score, demonstrating that our tracker is more competitive than the MDNet tracker. Moreover, as shown in Fig. 4, our tracker is much faster than the MDNet tracker. The experimental results show that the multi-feature fusion model can adaptively use the complementarity between different types of features to characterize the target appearance, which is particularly useful in the TIR target tracking task.

Fig. 3 Experimental comparison on PTB-TIR [22] benchmark

Fig. 4 Comparison results of tracking speed and tracking accuracy on the LSOTB-TIR [23] benchmark

Figure 5 compares the performance of the proposed AMFT tracking method with that of several state-of-the-art tracking methods on the PTB-TIR [22] benchmark for several attributes, which further verifies the effectiveness of the proposed multi-feature fusion model in the TIR target tracking task. As Fig. 5 shows, our AMFT tracker obtains good tracking results under these attributes. The comparison on the thermal crossover attribute shows that our tracker can reduce the interference from similar objects. Although the success score of our AMFT tracker is lower than that of the GFSDCF [60] tracker on the scale variation attribute, it is higher on the remaining attributes, which shows that our tracker has better tracking performance. In general, these experimental results demonstrate the effectiveness of our multi-feature fusion method for the TIR tracking task.

Fig. 5 Experimental comparison on PTB-TIR [22] benchmark for some attributes

4.4 Comparative experiments on LSOTB-TIR benchmark

Figure 6 compares the tracking results of our AMFT tracking method and several state-of-the-art tracking methods on the LSOTB-TIR [23] benchmark. According to Fig. 6, our AMFT tracking method achieves the best success score and the second-best precision score. Compared with the group feature selection-based GFSDCF [60] tracker, the proposed AMFT tracker is slightly lower in precision score but higher in success score, which indicates that the proposed AMFT tracker achieves better performance on the LSOTB-TIR benchmark. Compared with the multi-layer deep feature-based trackers (such as MCFTS [2], HDT [32], and HCF [49]), our tracker adopts an adaptive fusion strategy for hand-crafted features and deep features, which yields more accurate TIR target tracking results. Compared with the Siamese network-based trackers (such as SiamFC [43] and SiamTri [45]), our tracker obtains more than \(10\%\) improvement in each evaluation metric. Our AMFT tracking method also achieves better TIR target tracking results than the other tracking methods.

Fig. 6 Experimental comparison on LSOTB-TIR [23] benchmark

Table 2 Success scores (%) comparison on the LSOTB-TIR [23] benchmark for 12 different attributes, which include the scale variation (SV), fast motion (FM), motion blur (MB), distractor (DIS), low resolution (LR), intensity variation (IV), out-of-view (OV), background clutter (BC), deformation (DEF), aspect ratio variation (ARV), occlusion (OCC), and thermal crossover (TC)

To show the effectiveness of our AMFT tracking method, we compare its tracking performance against other state-of-the-art tracking methods on several attributes and scenarios of the LSOTB-TIR [23] benchmark. Table 2 shows that the proposed AMFT tracking method obtains the best success scores on most of the attributes (e.g., fast motion (FM), scale variation (SV), and motion blur (MB)). For the deformation (DEF) and occlusion (OCC) attributes, the success score of the proposed AMFT tracking method is lower than that of the MDNet [61] tracking method, probably because MDNet applies effective and efficient hard negative mining. The success score of the MDNet tracker is lower than that of our tracker on most attributes, which illustrates the effectiveness of the multi-feature fusion model in our AMFT tracker. As shown in Table 3, our proposed AMFT tracking method ranks among the top three in success score in all of these tracking scenarios. In conclusion, our proposed AMFT tracking method outperforms these state-of-the-art tracking methods in the TIR target tracking scenarios.

Table 3 Success scores (%) comparison on the LSOTB-TIR [23] benchmark for 4 different scenarios, which include the handheld camera (HH), vehicle-mounted camera (VM), drone-mounted camera (DM), and surveillance camera (VS)

4.5 Qualitative comparison

Figure 7 shows a visual comparison between our AMFT tracking method and other state-of-the-art tracking methods on several TIR target tracking test video sequences. The MDNet [61] tracking method is easily disturbed by fast motion and scale variation (e.g., dog-D-002 and person-V-007). The GFSDCF [60] tracking method obtains accurate tracking results on the dog-D-002 and leopard-H-001 test video sequences thanks to its group feature selection model. However, its tracking results on other test video sequences (such as street-S-001 and bus-S-004) are still unacceptable. Compared with the other tracking methods, the proposed AMFT tracking method tracks these targets accurately in complex tracking scenarios, which verifies the effectiveness of the proposed multi-feature fusion model.

Fig. 7 Qualitative comparison of our AMFT tracking method and the VITAL [59], GFSDCF [60], MDNet [61], TADT [58], and SRDCF [38] tracking methods on some TIR target tracking test video sequences (from top to bottom: dog-D-002, street-S-001, bus-S-004, person-V-007, and leopard-H-001)

4.6 Failure cases

Figure 8 shows some failure cases of the proposed AMFT tracker. To display the tracking results more intuitively, we also give the ground-truth label of the target as a reference. For the stranger3 testing sequence, the main reason why the proposed AMFT tracker fails to track the target is the low resolution. For the campus2 and saturated testing sequences, similar distractors cause our AMFT tracker to drift to other similar targets, and tracking fails. We will further explore these failure cases in future work.

Fig. 8 Failure cases (from top to bottom: campus2, stranger3, and saturated). The AMFT tracking results are shown in red boxes and the ground truth of the target is shown in green boxes

5 Conclusions

In this paper, we propose a multi-feature fusion model for the TIR target tracking task. The model adaptively integrates hand-crafted features and deep features through the JS divergence and exploits their complementarity to better model the target appearance. Meanwhile, we adopt a model update strategy to adapt to changes in target appearance during the tracking process. Furthermore, we verify the validity of the multi-feature fusion model through ablation studies. Extensive experiments on the PTB-TIR and LSOTB-TIR benchmarks demonstrate that the proposed AMFT tracker has competitive tracking performance compared with other state-of-the-art trackers.