1 Introduction

Tracking is a fundamental task in computer vision with numerous applications in surveillance [41, 47], self-driving vehicles [8, 48], and UAV-based monitoring [43, 59]. It is the task of locating the same moving object in each frame of a video sequence, given only the initial appearance of the target object. Most modern trackers treat this as a classification problem. By learning an appearance model of the target from the initial frame, these trackers distinguish the target from the background using a cross-correlation operation and predict its location in the following frames. Although they achieve impressive performance, these Tracking-by-Detection approaches can fail when the appearance model misidentifies a similar-looking object (a “distractor”) as the target. Even a current state-of-the-art tracker, Super_DiMP (Footnote 1), which fully exploits (through end-to-end offline training and online meta-learning) both target and background appearance information, is still fooled by distractors, as shown in Fig. 1b. In contrast, humans are able to take into account the appearance of other objects in order to distinguish these potential distractors from the target object and successfully infer the position of a tracked object [19].

Fig. 1 A comparison of our approach with a state-of-the-art tracker on two hard scenarios. (a) shows the search region of the current frame; the green rectangle is the ground-truth location of the tracked target object. (b) and (c) are the score maps produced by Super_DiMP alone and by Super_DiMP combined with the proposed method, respectively. Super_DiMP identifies the wrong location as the most likely location for the target in both scenarios, whereas our approach correctly identifies the target location in both.

A leading theory of how such perceptual inference is performed in the brain is provided by predictive coding (PC) [9, 49, 53, 55]. Specifically, PC suggests that the brain learns, from prior experience, an internal model of the world. This internal model encodes possible causes of sensory inputs, and new sensory inputs are then represented in terms of these known causes. Determining which combination of the many possible causes best fits the current sensory data is achieved through an iterative process of minimising the error between the sensory data and the expected sensory inputs predicted by the causes [54]. This inference process performs explaining away [30, 36, 37]: possible causes (i.e. objects) compete to explain the sensory evidence (i.e. the current image), and if one cause explains part of the evidence (i.e. a part of the image), then support from this evidence for alternative explanations (i.e. other objects) is reduced, or explained away.

This paper proposes that explaining away, implemented using a version of PC called DIM [54, 56], can be used to enable a tracker (like a human) to take into account the appearance of distractors when identifying the target in each frame of a video. Specifically, both the target and the distractors (identified in previous frames) are used as possible causes underlying the appearance for the next frame of the video. These causes compete to explain each part of the current frame, and when a distractor provides a better explanation for the appearance of some part of the image, this part of the image is explained away and will not be matched to the target. In this way, the target is less likely to be matched to incorrect, but similar-looking, regions of the image and is more likely to be matched to the correct location, increasing the robustness of tracking.

Our main contributions are summarised as follows:

  • We propose a novel tracking architecture that detects distractors in every frame. These are represented as additional appearance models with the same size as the target appearance model. The predicted location of the target in the next frame takes into consideration not only the tracked object, but also the distractors. As a consequence, matches between the target appearance model and the surrounding background are suppressed, and the identification of the target is more reliable (as illustrated in Fig. 1).

  • The proposed method does not require any retraining of the underlying tracker and could easily integrate with most current trackers that use the Tracking-by-Detection architecture. We demonstrate this by integrating the proposed method with four existing trackers: SiamFC [2], DaSiamRPN [82], Super_DiMP [12], and ARSuper_DiMP [75]. In all cases, the performance of the underlying tracker is improved by the addition of the proposed method. This indicates that the proposed method has good transferability and is potentially a general approach that could be used to improve the performance of most visual trackers.

  • We demonstrate the effectiveness of our general approach by integrating it with the recent state-of-the-art trackers Super_DiMP [12] and ARSuper_DiMP [75]. The resulting trackers achieve results that are competitive with the state of the art on seven benchmark datasets: OTB-100 [67], UAV123 [44], NFS [31], LaSOT [18], Trackingnet [45], GOT-10K [28], and VOT2020 [32].

2 Related work

Contemporary approaches solve the tracking problem by learning the appearance of the target in the first frame. These approaches can be roughly divided into generative trackers [2, 5, 17, 25, 26, 35, 40, 60, 65, 73, 76,77,78,79, 82] and discriminative trackers [3, 4, 7, 11,12,13,14,15,16, 23, 27, 38,39,40, 42, 46, 63, 64, 68,69,72, 75, 79,80,81]. The former formulate object tracking as a cross-correlation problem in deep feature space and take advantage of end-to-end learning by training a Y-shaped network containing two branches, one for the object template and the other for the search region. This approach is exemplified by Siamese network-based trackers, which have gained significant attention due to their promising performance and efficiency. However, they typically employ a fixed target template and do not model background information, which results in incorrect tracking when there is a similar-looking object in the background or a significant change in the target appearance. Although appearance-updating strategies [25, 60] have recently been introduced into Siamese network-based trackers, their performance is still below that of discriminative trackers.

The discriminative trackers are exemplified by discriminative correlation filter (DCF)-based methods, which learn to distinguish the target from the background. Traditionally, these methods [14, 27, 39] have a fast training process in the Fourier domain using the diagonalising transformation of circular convolutions to generate training samples. However, the online learning procedures are complicated and cannot be integrated with end-to-end learning architectures. To solve this problem, DiMP-based trackers [3, 4, 11, 15, 64, 75] employ a meta-learning formulation to predict the weights of the classification layer. This enables DiMP-based trackers to achieve state-of-the-art results on many benchmarks. Despite discriminative trackers learning an appearance model using background information, the appearance model is still unable to deal with cases that contain highly similar-looking distractors (as illustrated in Fig. 1).

Tracking failures caused by a similar-looking location being misidentified as the target indicate that using only the appearance model to identify the tracked object is insufficient to achieve robust results for the popular Tracking-by-Detection-based trackers. Some existing methods attempt to address this issue by taking more visual cues into consideration. For example, Gladh et al. [23] use deep motion features extracted from optical flow images together with appearance features to generate the target model. Wang et al. [63] predict the approximate location of the target by decoupling camera motion and object motion to create an adaptive search region.

Other methods attempt to address this issue by introducing attention mechanisms. For example, RAR [22] employed a hierarchical attention module to leverage both inter- and intra-frame attention at each convolutional layer, which effectively highlighted informative representations and suppressed distractors. SiamGAT [24] employed a graph attention module to replace the cross-correlation operation, which is common in Siamese trackers, with part-to-part matching that effectively passed target information from the template to the search region. The authors of [64] proposed an appearance model generator using a transformer [61]: the transformer-encoder promotes the previous appearance models via attention-based feature reinforcement to acquire more compact target representations, while the transformer-decoder generates the appearance model for the current frame. TransT [6] developed a Transformer-like fusion module to combine the template and search region features solely using attention instead of correlation. STMTrack [20] created a space-time memory network, inspired by non-local self-attention [66], to make full use of historical information about the target and better adapt to appearance variations during tracking.

More closely related to our work are methods that explicitly take into account information about possible distractors. For example, DaSiamRPN [82] proposed a distractor-aware feature learning scheme to boost the discriminative power of the networks during offline training, and also a novel distractor-aware module to suppress distractors during online tracking. Bhat et al. [4] presented an end-to-end learning architecture, KYS, where the encoding of image regions is learned and propagated by appearance-based dense tracking between frames. The final prediction is then obtained by combining the explicit background representation with the appearance model output. Nocal-Siam [58] proposed a target-aware non-local attention module to jointly refine visual features of the target and search branches, which suppressed distractors effectively.

Fig. 2 An overview of the proposed tracking architecture. Distractor objects are located in every frame by identifying non-target peaks in the score map generated by the tracker. Distractor appearance models are obtained by cropping areas from the current image features \(\phi ({\mathbf {X}})\) that are the same size as the target appearance model but centred at the distractor positions. The distractor list contains the distractor appearance models detected in the previous n frames. These models, together with the target model, are combined into a joint appearance model that our predictor uses to compute the best matching location.

Other distractor-suppression techniques have been proposed for specific tracking architectures, but would be difficult to incorporate in modern Tracking-by-Detection approaches. For example, [10] developed an online feature ranking mechanism to select the top-ranked appearance features for the trackers based on colour histogram. TLD [29] proposed the Tracking–Learning–Detection architecture which implemented a P-N learning mechanism to exploit spatio-temporal relationships in the video. Siam R-CNN [62] proposed a Siamese re-detection architecture with a novel Tracklet Dynamic Programming Algorithm to simultaneously track all potential objects and select the best object in the current timestep based on the complete history of all target and distractor object tracklets. [74] proposed a novel hard negative mining method to suppress distractors for long-time tracking which enhanced the target identification ability of a verification network.

In contrast, we present a common distractor-suppression solution applicable to modern Tracking-by-Detection trackers. We design a novel architecture that constructs a joint appearance model for both the tracked and distractor objects. Each object in the appearance model then competes to explain each part of the next video frame. This leads to the score map for the target being suppressed at locations where the appearance of a distractor is a better match to the image and consequently results in more robust predictions about the true location of the target.

3 Proposed method

Figure 2 shows the architecture of the proposed method. A visual tracker is used to generate an initial prediction, which is then used by the proposed detector module (see Sect. 3.1 for details) to locate distractor objects. Once the positions of distractors are determined, the corresponding distractor appearance models are obtained by cropping regions from the current image features. Lastly, the proposed predictor module (see Sect. 3.2 for details) takes the distractor appearance models (detected in previous frames) and the target appearance model into consideration at the same time. These models compete to explain every pixel of the image, which suppresses distractors in the final score map. This score map describes the similarity between the target and each location in the image.

3.1 Distractor detection

During tracking, both generative and discriminative trackers (see Sect. 2) predict a scalar confidence score map \({\mathbf {S}}({\mathbf {X}}) \in {\mathbb {R}}\) given an input image \({\mathbf {X}}\), such that:

$$\begin{aligned} {\mathbf {S}}({\mathbf {X}})={\mathbf {A}} \star \phi ({\mathbf {X}}) \end{aligned}$$

Here, \(\phi ({\mathbf {X}})\) are the features extracted from the search region of the image, commonly by a CNN, and \({\mathbf {A}}\) is the target appearance model. For Siamese trackers, \({\mathbf {A}}\) represents the features of the template \({\mathbf {Z}}\), i.e. \(\phi ({\mathbf {Z}})\). For DCF trackers, it is the convolution kernel which is trained online. \(\star \) represents the cross-correlation operation.
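As a minimal sketch of the cross-correlation above, the following pure-Python function slides a multi-channel appearance model over a feature map and returns a score map of the same size. The function name, the list-of-lists layout, and the treatment of out-of-bounds locations as zero are assumptions made for illustration; real trackers compute this step on CNN feature tensors.

```python
def score_map(features, model):
    """Cross-correlate a multi-channel appearance model with image features.

    features: list of channels, each an H x W list of lists.
    model:    list of channels, each an h x w list of lists (h and w odd).
    Returns an H x W score map; positions outside the image count as zero
    (an assumption of this sketch).
    """
    H, W = len(features[0]), len(features[0][0])
    h, w = len(model[0]), len(model[0][0])
    scores = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            total = 0.0
            for f, a in zip(features, model):  # sum over channels
                for dy in range(h):
                    for dx in range(w):
                        yy, xx = y + dy - h // 2, x + dx - w // 2
                        if 0 <= yy < H and 0 <= xx < W:
                            total += a[dy][dx] * f[yy][xx]
            scores[y][x] = total
    return scores


# Toy example: a plus-shaped pattern centred at row 2, column 3.
plus = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
image = [[0.0] * 5 for _ in range(5)]
for y, x in [(1, 3), (2, 2), (2, 3), (2, 4), (3, 3)]:
    image[y][x] = 1.0
toy_scores = score_map([image], [plus])
```

In this toy example the largest score falls at the position where the pattern was placed, which is how the tracker reads off the target location.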

The score map measures the similarity between an appearance model and the deep features extracted from the current video frame. The tracker estimates the target object’s location by finding the location of the maximum in the score map. If the appearance of the target is distinctive, there will only be one peak in the score map. However, if there are similar-looking distractors in the search region, the score map will have multiple peaks. Hence, distractors can be identified by finding the locations of peaks excluding the one that represents the target. To be specific, a peak is defined as a local maximum (within a 3-by-3 neighbourhood) that has a value over a global threshold which is set to 0.7 times the max value of the score map. The peak corresponding to the target is determined by finding the location of the maximum value in the final score map produced by the proposed predictor.
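The peak definition above can be sketched as follows. The function names are hypothetical, ties between neighbouring values are resolved by requiring a strict local maximum, and the target peak is assumed to coincide exactly with one of the detected peaks; none of these details are specified in the text.

```python
def find_peaks(score, rel_threshold=0.7):
    """Return (row, col) positions of peaks in a 2-D score map.

    A peak is a strict local maximum within its 3-by-3 neighbourhood whose
    value exceeds rel_threshold times the global maximum of the map.
    """
    H, W = len(score), len(score[0])
    threshold = rel_threshold * max(max(row) for row in score)
    peaks = []
    for y in range(H):
        for x in range(W):
            value = score[y][x]
            if value <= threshold:
                continue
            neighbours = [
                score[yy][xx]
                for yy in range(max(0, y - 1), min(H, y + 2))
                for xx in range(max(0, x - 1), min(W, x + 2))
                if (yy, xx) != (y, x)
            ]
            if all(value > nb for nb in neighbours):
                peaks.append((y, x))
    return peaks


def find_distractors(score, target_pos, rel_threshold=0.7):
    """All peaks except the one at the target position."""
    return [p for p in find_peaks(score, rel_threshold) if p != target_pos]
```

Here `target_pos` stands for the maximum of the final score map produced by the predictor, so the detector only needs the tracker's score map and that one position.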

Finally, distractor appearance models are obtained by cropping areas from \(\phi ({\mathbf {X}})\) that are the same size as the target appearance model and are centred at the distractor positions. A list of distractors is updated every frame and contains the distractor appearance models detected in the last n (default value is 5) frames. If there is no distractor in a frame, no additional distractor appearance models are stored in the list.
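A minimal sketch of the cropping step and the rolling distractor list is given below. The names `crop_model` and `DistractorList` are hypothetical, and filling positions outside the feature map with zeros is an assumption, since the text does not say how crops near the border are handled.

```python
from collections import deque


def crop_model(features, centre, size):
    """Crop a (height, width) window from every channel, centred at `centre`.

    Positions that fall outside the feature map are filled with zeros
    (an assumption of this sketch).
    """
    cy, cx = centre
    h, w = size
    H, W = len(features[0]), len(features[0][0])
    top, left = cy - h // 2, cx - w // 2
    return [
        [
            [ch[y][x] if 0 <= y < H and 0 <= x < W else 0.0
             for x in range(left, left + w)]
            for y in range(top, top + h)
        ]
        for ch in features
    ]


class DistractorList:
    """Keeps the distractor appearance models found in the last n frames."""

    def __init__(self, n=5):
        self.frames = deque(maxlen=n)  # one entry per frame, oldest dropped

    def update(self, models):
        # Called once per frame; an empty list is stored when no distractor
        # was found, so that older models still expire after n frames.
        self.frames.append(list(models))

    def models(self):
        return [m for frame in self.frames for m in frame]
```

Using a bounded `deque` means that no explicit bookkeeping of frame indices is needed: appending the current frame's models automatically discards those that are more than n frames old.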

3.2 Appearance model competition

The predictor takes the target appearance model and the distractor appearance models extracted from the preceding n frames. These appearance models compete to match to the features extracted from the search region by the tracker. The competition is achieved by the DIM algorithm which implements explaining away and which is the current state-of-the-art method for image patch matching in both colour [56] and deep feature [21] space. A detailed description of the DIM algorithm can be found in [56], but an introduction is provided below for the convenience of the reader.

The DIM can be thought of as a function, with two input arguments and one output. In the current application, this function operates as follows:

$$\begin{aligned} {\mathbf {S}}_j({\mathbf {X}})=\mathrm {DIM}(Pre({\mathbf {A}}_{j}),Pre(\phi ({\mathbf {X}})))_{\epsilon _{2},\iota } \end{aligned}$$

where \({\mathbf {A}}_{j}\) is a joint appearance model consisting of a stack of the target appearance model and the distractor appearance models detected in the last n frames, and \(\phi ({\mathbf {X}})\) are the features extracted from the search region of the image, \({\mathbf {X}}\). The output of this function, \({\mathbf {S}}_j({\mathbf {X}})\), is a stack of score maps, and each channel, j, is the individual score map for the corresponding appearance model in \({\mathbf {A}}_{j}\). Pre stands for pre-processing and will be described below.

To simplify the notation, we will represent the two inputs to DIM as \({\mathbf {w}}\) and \({\mathbf {I}}\) (i.e. \({\mathbf {w}}_j=Pre({\mathbf {A}}_{j})\) and \({\mathbf {I}}=Pre(\phi ({\mathbf {X}}))\)). Internally, the DIM function performs \(\iota \) iterations of the following three equations:

$$\begin{aligned}&{\mathbf {R}}_{i}=\sum _{j=1}^{p}\left( {\mathbf {v}}_{j i} \star {\mathbf {S}}_{j}\right) \end{aligned}$$
$$\begin{aligned}&{\mathbf {E}}_{i}={\mathbf {I}}_{i} \oslash \left[ {\mathbf {R}}_{i}\right] _{\epsilon _{2}} \end{aligned}$$
$$\begin{aligned}&{\mathbf {S}}_{j} \leftarrow \left[ {\mathbf {S}}_{j}\right] _{\epsilon _{1}} \odot \sum _{i=1}^{k}\left( {\mathbf {w}}_{j i} *{\mathbf {E}}_{i}\right) \end{aligned}$$

where i is an index over the k channels in the input \({\mathbf {I}}\); j is an index over the p different appearance models; \({\mathbf {R}}_{i}\) is a 2-dimensional array representing a reconstruction of \({\mathbf {I}}_i\); \({\mathbf {E}}_{i}\) is a 2-dimensional array representing the discrepancy (or residual error) between \({\mathbf {I}}_{i}\) and \({\mathbf {R}}_{i}\); \({\mathbf {S}}_{j}\) is the individual score map for the corresponding appearance model in \({\mathbf {A}}_{j}\); \({\mathbf {w}}_{j i}\) is a 2-dimensional array representing channel i of the corresponding appearance model after pre-processing (i.e. \(Pre({\mathbf {A}}_{j})_i\)), where the values in each \({\mathbf {w}}_{j}\) are normalised to sum to one; \({\mathbf {v}}_{j i}\) is a 2-dimensional array also representing appearance model values (the values of \({\mathbf {v}}_{j}\) are equal to the corresponding values of \({\mathbf {w}}_{j}\) except that they are normalised to have a maximum value of one); \([\mathbf {\cdot }]_{\epsilon }=max(\mathbf {\cdot },\epsilon )\); \(\oslash \) and \(\odot \) indicate element-wise division and multiplication, respectively; and \(\star \) and \(*\) represent cross-correlation and convolution operations, respectively.
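The three update equations can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the score maps are assumed to start at zero, the maximum used for \(\epsilon _{1}\) is taken over all channels and positions, and the kernel orientation is chosen so that the score update slides each \({\mathbf {w}}_{j i}\) over the error map unflipped while the reconstruction uses the flipped counterpart (for symmetric kernels the two coincide). The function names `slide` and `dim` are hypothetical.

```python
def slide(img, ker, flip=False):
    """'Same'-size 2-D cross-correlation with zero padding (odd kernel sides).

    With flip=True the kernel is rotated by 180 degrees first, which turns
    the operation into a convolution.
    """
    H, W, h, w = len(img), len(img[0]), len(ker), len(ker[0])
    if flip:
        ker = [row[::-1] for row in ker[::-1]]
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            for dy in range(h):
                for dx in range(w):
                    yy, xx = y + dy - h // 2, x + dx - w // 2
                    if 0 <= yy < H and 0 <= xx < W:
                        out[y][x] += ker[dy][dx] * img[yy][xx]
    return out


def add(a, b):
    """Element-wise sum of two 2-D lists."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def dim(models, image, eps2=0.01, iterations=15):
    """Run the DIM updates and return one score map per appearance model.

    models: p models, each a list of k non-negative channels (h x w).
    image:  k non-negative channels (H x W), i.e. pre-processed features.
    """
    H, W = len(image[0]), len(image[0][0])
    w_all, v_all = [], []
    for m in models:
        total = sum(val for ch in m for row in ch for val in row)
        peak = max(val for ch in m for row in ch for val in row)
        # w_j sums to one; v_j holds the same values rescaled to a maximum of one.
        w_all.append([[[val / total for val in row] for row in ch] for ch in m])
        v_all.append([[[val / peak for val in row] for row in ch] for ch in m])

    # eps1 = eps2 / max(sum_j v_ji); the maximum is taken over all channels
    # and positions (an assumption of this sketch).
    h, w = len(models[0][0]), len(models[0][0][0])
    v_sum_max = max(sum(v[i][a][b] for v in v_all)
                    for i in range(len(image)) for a in range(h) for b in range(w))
    eps1 = eps2 / v_sum_max

    S = [[[0.0] * W for _ in range(H)] for _ in models]  # assumed initial state
    for _ in range(iterations):
        E = []
        for i, I_i in enumerate(image):
            R_i = [[0.0] * W for _ in range(H)]
            for S_j, v_j in zip(S, v_all):  # reconstruction from all models
                R_i = add(R_i, slide(S_j, v_j[i], flip=True))
            # Residual: element-wise I_i / max(R_i, eps2).
            E.append([[I_i[y][x] / max(R_i[y][x], eps2) for x in range(W)]
                      for y in range(H)])
        for j, w_j in enumerate(w_all):
            drive = [[0.0] * W for _ in range(H)]
            for E_i, w_ji in zip(E, w_j):
                drive = add(drive, slide(E_i, w_ji))
            # Multiplicative update: max(S_j, eps1) * drive, element-wise.
            S[j] = [[max(S[j][y][x], eps1) * drive[y][x] for x in range(W)]
                    for y in range(H)]
    return S
```

Because the reconstruction \({\mathbf {R}}\) sums the contributions of all models, a region that is already well reconstructed by one model yields a residual near one or below for the others, which is the competition that the text calls explaining away.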

DIM attempts to find a sparse set of elementary components, \({\mathbf {v}}\), that when combined together reconstruct \({\mathbf {I}}\) with minimum error [52]. For the current application, the elementary components are the target appearance model and the distractor appearance models in the distractor list. These appearance models can be thought of as a “dictionary” or “codebook” that can be used to reconstruct many different images. The activation dynamics, described by Eqs. 3, 4 and 5, perform gradient descent on the residual error in order to find values of \({\mathbf {S}}\) that accurately reconstruct \({\mathbf {I}}\) [1, 51, 57]. Specifically, the equations operate to find values for \({\mathbf {S}}\) that minimise the Kullback–Leibler (KL) divergence between \({\mathbf {I}}\) and its reconstruction \({\mathbf {R}}\) [50, 57]. The activation dynamics thus result in the DIM algorithm selecting a subset of dictionary elements that best explain \({\mathbf {I}}\). The strength of an element in \({\mathbf {S}}\) reflects the strength with which the corresponding dictionary entry (i.e. appearance model) is required to be present in order to accurately reconstruct \({\mathbf {I}}\) at that location [56]. Hence, when a distractor appearance model provides a high similarity to the appearance of some part of the image, this part of the image is explained away and will not be matched to the target. In this way, the target appearance model is more likely to be matched to the correct location, increasing the robustness of tracking.

Because DIM minimises the KL divergence between \({\mathbf {I}}\) and its reconstruction \({\mathbf {R}}\) created by the additive combination of elementary image components \({\mathbf {v}}\), both inputs to the DIM function must be nonnegative [56]. However, \({\mathbf {A}}_{j}\) and \(\phi ({\mathbf {X}})\) are activation values extracted from a CNN which can contain negative values. Thus, the pre-processing takes the positive and rectified negative values of \({\mathbf {A}}_{j}\) and \(\phi ({\mathbf {X}})\) and separates them into two parts which are used as separate channels by the DIM algorithm.
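The pre-processing can be sketched as a function that doubles the number of channels. The function name `pre` mirrors the notation Pre used in the text; placing all positive parts before all rectified negative parts is an assumption, as the channel ordering is not specified.

```python
def pre(channels):
    """Split signed feature channels into non-negative channels.

    Returns the positive parts of all channels followed by the rectified
    (sign-flipped) negative parts, so k input channels become 2k outputs.
    """
    positive = [[[v if v > 0 else 0.0 for v in row] for row in ch]
                for ch in channels]
    negative = [[[-v if v < 0 else 0.0 for v in row] for row in ch]
                for ch in channels]
    return positive + negative
```

No information is lost by this step: subtracting the second half of the output from the first half recovers the original signed values.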

\(\epsilon _{1}\) is a parameter that is set to \(\frac{\epsilon _{2}}{\max (\sum \limits _{j=1}^{p} v_{j i})}\). \(\epsilon _{2}\) is a scalar parameter used by the DIM algorithm. It determines the magnitude required for elements of \({\mathbf {I}}\) to have a strong effect on the competition. Hence, if a value of \({\mathbf {I}}\) is smaller than \(\epsilon _{2}\), it is effectively ignored. When DIM is applied to colour images, those images have pixel intensities that typically range from 0 to 1, so the maximum value of \({\mathbf {I}}\) is approximately 1 for every image, and it is possible to use a fixed value of \(\epsilon _{2}\). However, the maximum value of \(Pre(\phi ({\mathbf {X}}))\), which is used as \({\mathbf {I}}\), can vary as it is produced by applying a CNN to different videos. To deal with this variation, the appropriate value for \(\epsilon _{2}\) for any one video is chosen from a set of ten possible values: values ranging from \(1\times 10^{-3}\) to \(9\times 10^{-3}\) in steps of \(8\times 10^{-4}\). When DIM is applied for the first time to a video, it is run once with each candidate \(\epsilon _{2}\) value. The magnitudes of the highest peaks in the resulting score maps for the target object are compared, and the value of \(\epsilon _{2}\) corresponding to the highest peak is used for all subsequent frames of this video. The number of iterations, \(\iota \), performed by the DIM algorithm was set to 15.
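The per-video selection of \(\epsilon _{2}\) amounts to a small search over candidate values. In the sketch below, `target_score_for` is a hypothetical callable that runs DIM on the first frame with a given \(\epsilon _{2}\) and returns the target's score map; the candidate values are supplied by the caller, and keeping the first candidate in the event of a tie is an assumption.

```python
def select_eps2(target_score_for, candidates):
    """Pick the eps2 whose target score map has the highest peak.

    target_score_for: callable mapping an eps2 value to a 2-D score map
                      (hypothetical; it would wrap one DIM run per value).
    candidates:       iterable of eps2 values to try.
    """
    best_eps, best_peak = None, float("-inf")
    for eps in candidates:
        score = target_score_for(eps)
        peak = max(max(row) for row in score)
        if peak > best_peak:  # strict comparison keeps the first on ties
            best_eps, best_peak = eps, peak
    return best_eps
```

The search is run only once per video, so its cost is a handful of extra DIM runs on the first frame in which DIM is applied.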

3.3 Implementation details

DIM requires the appearance model to have dimensions that are odd numbers; otherwise, the reconstruction of \({\mathbf {I}}\) does not align with the actual \({\mathbf {I}}\). Therefore, if the size of the target appearance model employed by a visual tracker is even, the target appearance model is padded with zeros by one column on the right and one row on the bottom, and the new size of the target appearance model is used to generate distractor appearance models, as described in Sect. 3.1.
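The padding step can be sketched as follows. The function name `pad_to_odd` is hypothetical, and padding each dimension only when that dimension is even is an assumption about how non-square models would be treated.

```python
def pad_to_odd(model):
    """Zero-pad every channel so that both spatial dimensions are odd.

    model: list of channels, each an h x w list of lists.
    An extra zero column is appended on the right when w is even, and an
    extra zero row at the bottom when h is even.
    """
    h, w = len(model[0]), len(model[0][0])
    extra_rows, extra_cols = (h + 1) % 2, (w + 1) % 2
    padded = []
    for ch in model:
        rows = [list(row) + [0.0] * extra_cols for row in ch]
        rows += [[0.0] * (w + extra_cols) for _ in range(extra_rows)]
        padded.append(rows)
    return padded
```

With odd dimensions the model has a well-defined central element, so a score at a given location corresponds to a patch centred exactly there.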

If no distractor appearance model has been detected in the preceding five frames, i.e. if \({\mathbf {A}}_{j}\) only contains the target appearance model, then DIM will output a similar result to that produced by Eq. 1. In such circumstances, when the distractor list is empty, DIM is not employed and the score map generated by the original tracker is used as the final score map. This helps to improve the computational efficiency of the proposed method. How often DIM is used by the trackers reported in this paper can be found in Sect. 4.5.

Tracking sometimes fails, for example, when the target is occluded or is out of frame. In this case, the tracker may mistake a distractor for the target. Subsequently, when the tracked object reappears, the proposed detection module will incorrectly regard the reappeared target object as a distractor due to the incorrect matching in the previous frame. If this happens, the appearance of the target object will be included in the list of distractor appearance models and DIM will suppress the score map at the location of the true target. To avoid this phenomenon, we rely on the assumption that the position of the target object does not change significantly between two adjacent frames, while there will be a jump in the predicted position of the target when a distractor is mistaken for the target. Specifically, the Euclidean distance between the location of the highest peak of the final score map in the current frame and that in the previous frame is calculated. If this distance exceeds a threshold d (a value of 3 was used, and the distance was calculated before upsampling the score map), the distractor list is cleared and the proposed predictor does not run for r frames. (A value of 5 was used.) Hence, for those r frames the final score map will be the one generated by the underlying tracker.
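The reset mechanism can be sketched as a small state machine. The class name `JumpReset` is hypothetical, and restarting the reset period when another jump occurs during it is an assumption, since the text does not cover that case.

```python
import math


class JumpReset:
    """Tracks the score-map peak and disables DIM for r frames after a jump."""

    def __init__(self, d=3.0, r=5):
        self.d = d            # distance threshold (score-map units)
        self.r = r            # number of frames for which DIM is skipped
        self.prev = None      # peak position in the previous frame
        self.frames_left = 0  # remaining frames without DIM

    @property
    def dim_enabled(self):
        return self.frames_left == 0

    def step(self, peak):
        """Call once per frame with the (row, col) of the final peak.

        Returns True when the distractor list should be cleared.
        """
        if self.frames_left > 0:
            self.frames_left -= 1  # this frame ran without DIM
        jumped = (
            self.prev is not None
            and math.hypot(peak[0] - self.prev[0], peak[1] - self.prev[1]) > self.d
        )
        self.prev = peak
        if jumped:
            self.frames_left = self.r
        return jumped
```

In use, the tracker would check `dim_enabled` before running the predictor on a frame, and clear the distractor list whenever `step` returns True.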

The complete proposed architecture is summarised by Algorithm 1.


Fig. 3 Success and precision measured using the OTB-100 and UAV123 datasets

4 Experiments

4.1 State-of-the-art comparison

We evaluate our proposed tracking architecture using Super_DiMP [12] and ARSuper_DiMP [75] on seven tracking benchmarks: OTB-100 [67], UAV123 [44], NFS [31], LaSOT [18], Trackingnet [45], GOT-10k [28], and VOT2020 [32]. Due to the stochastic nature of DCF trackers, the results reported for DiMP-based trackers [3, 4, 15, 64, 75] are an average over multiple runs. For OTB-100, NFS, UAV123, and LaSOT, the results were averaged over five runs. As the results for Trackingnet are obtained using an online evaluation server that allows only a limited number of submissions per account, only a single run was used. GOT-10k results are also evaluated with an online server, and the official documentation suggests using three runs for all trackers; hence, three runs were used. The official VOT evaluation toolkit runs a tracker twenty times by default to produce statistically significant results, and hence, twenty runs were used for VOT2020. For a fair comparison, we follow the same approach to test our trackers, termed Super_DIM_DiMP and ARSuper_DIM_DiMP, and the original trackers. As ARSuper_DiMP uses an Alpha-Refine module to improve the accuracy of the bounding boxes predicted by Super_DiMP, the hyper-parameters used by our method were tuned using Super_DiMP and reused for ARSuper_DiMP. Super_DIM_DiMP runs at around 23 FPS on a single Nvidia Tesla V100 GPU. In comparison, Super_DiMP runs at approximately 27 FPS. ARSuper_DIM_DiMP and ARSuper_DiMP run at around 20 and 16 FPS, respectively. Our code is available at https://github.com/iminfine/DIMtracking.

OTB-100 [67]: This dataset has been used extensively to evaluate visual trackers. Our methods are compared with numerous state-of-the-art trackers in Fig. 3a and 3b, including STMTrack [20], PRT [40], DROL-RPN [79], ARSuper_DiMP [75], TrDiMP (Footnote 4) [64], Super_DiMP [12], SiamR-CNN [62], SiamRPN++ [35], PrDiMP50 [15], SiamBAN [7], KYS [4], FCOT [11], TransT [6], and DiMP50 [3]. Despite performance becoming saturated over recent years, the proposed tracker Super_DIM_DiMP still outperforms the baseline, Super_DiMP, by 1\(\%\) in terms of AUC (success score) and 1.3\(\%\) in terms of precision. Similarly, ARSuper_DIM_DiMP outperforms ARSuper_DiMP in both scores and achieves the best AUC score with 72.3\(\%\).

UAV123 [44]: This is a large dataset captured from low-altitude UAVs. It contains 123 videos and over 110K frames. It is quite challenging due to small tracked objects and fast motion. Our methods are compared with PrDiMP50 [15], TransT [6], Super_DiMP [12], ARSuper_DiMP [75], TrDiMP [64], FCOT [11], DiMP50 [3], PrDiMP18 [15], SiamRPN++ [35], DROL-RPN [79], SiamR-CNN [62], STMTrack [20], and DiMP18 [3]. It can be seen from the results shown in Fig. 3c and 3d that Super_DIM_DiMP outperforms the previous best approaches with an AUC of 68.1\(\%\) and precision of 90.6\(\%\).

Table 1 Comparison with state-of-the-art trackers on the NFS dataset
Table 2 Comparison with state-of-the-art trackers on the LaSOT dataset

NFS [31]: This dataset contains 100 videos captured using a high frame rate (240 FPS) camera. We evaluate our tracker on the 30 FPS version of this dataset in which videos have an average length of 479 frames. As shown in Table 1, ARSuper_DIM_DiMP outperforms the previous best approaches with an AUC of 67.5\(\%\) and precision of 82.3\(\%\).

LaSOT [18]: The large-scale LaSOT dataset contains 280 videos in its test set. The video sequences, which have an average length of 2500 frames, are longer than those in other datasets, testing not only the accuracy of the tracker but also its robustness. As shown in Table 2, ARSuper_DIM_DiMP achieves the best AUC score with 65.5\(\%\) and Super_DIM_DiMP outperforms Super_DiMP with relative gains of 0.4\(\%\) in success score and 0.6\(\%\) in normalised precision.

Trackingnet [45]: This dataset provides over 30K videos sampled from YouTube. We report results on its test set, consisting of 511 videos with an average of 441 frames per sequence. As shown in Table 3, our approaches achieve similar results to the baseline trackers Super_DiMP and ARSuper_DiMP.

GOT-10K [28]: This is a recent large-scale dataset consisting of 10k video sequences. With this dataset, trackers are evaluated on 180 videos with 84 object classes and 32 motions that cover a wide range of common moving objects in the wild. The results in terms of average overlap (AO) and success rates (SR\(_{0.50}\) and SR\(_{0.75}\); see Footnote 5) are shown in Table 4. Among Tracking-By-Detection methods, ARSuper_DIM_DiMP achieves the best performance. Meanwhile, Super_DIM_DiMP significantly outperforms Super_DiMP with a relative improvement of 2.5\(\%\) in AO.

VOT2020 [32]: The VOT challenge [32,33,34], held yearly, provides a precisely defined and repeatable way of comparing short-term trackers. VOT2020 contains 60 videos with binary segmentation masks as the ground truth and uses a new evaluation protocol which separates the sequences into short pieces to keep the computational complexity of the evaluation at a moderate level. The results in terms of expected average overlap (EAO), accuracy (A), and robustness (R) are shown in Table 5. ARSuper_DIM_DiMP outperforms the baseline ARSuper_DiMP by 1.2\(\%\) in terms of robustness.

Table 3 Comparison with state-of-the-art trackers on the Trackingnet dataset
Table 4 Comparison with state-of-the-art trackers on the GOT-10K dataset
Table 5 Comparison with state-of-the-art trackers on the VOT2020 dataset
Fig. 4 Qualitative comparison of Super_DiMP and Super_DIM_DiMP on hard scenarios. The first row shows frames from the video Crowds from OTB-100, and the second and third rows show frames from the video group2_1 from UAV123. Note that the annotations provided with the UAV123 data define the bounding-box coordinates as NaNs when the target is out of frame or fully occluded, and thus, the ground-truth bounding box is not shown in frames 98, 104, 113, and 116 of the video group2_1. Note also that cropped regions of each video are shown to improve the visibility of the bounding boxes.

Of all the Tracking-By-Detection trackers tested, ARSuper_DIM_DiMP was the best performing on five datasets (OTB-100, NFS, LaSOT, Trackingnet, and GOT-10k). No other Tracking-By-Detection method was best performing on more than one dataset, and for one of those (UAV123), the best performing tracker was our other method, Super_DIM_DiMP. The results also show DIM to be more effective than other recent distractor-suppression methods such as KYS [4] and Nocal-Siam [58].

4.2 Qualitative evaluations

The effects of explaining away are illustrated in the first row of Fig. 4. In this example, Super_DiMP incorrectly starts to track a similar-looking distractor in frame 30. In contrast, this distractor was detected in previous frames and the proposed tracker is able to use this information to infer the true location of the target. However, the proposed tracker can fail when the tracked object is fully occluded for a long time. In the example in the second and third rows, the target is fully occluded by a pavilion from frame 98 to frame 141. (Four frames are selected in Fig. 4.) Super_DiMP checks the maximum of \({\mathbf {S}}({\mathbf {X}})\) in every frame; if the value is below a threshold, this is interpreted as a tracking failure, and the tracker does not update the location of the target and outputs the same predicted bounding box as in the previous frame. Hence, the bounding boxes predicted by the two trackers in frames 98, 104, and 113 overlap closely. From frame 116, the proposed tracker regards a distractor as the target even when the target reappears. This is because the proposed detection module incorrectly regards the reappeared target object as a distractor, and the appearance of the target object is included in the list of distractor appearance models. Hence, DIM suppresses the score map at the location of the true target. We have developed a reset mechanism to avoid this phenomenon (see Sect. 3.3 for details); however, this mechanism is not activated in this particular scenario, as the proposed tracker regards the distractor as the target for a long time. Nevertheless, this type of failure case (which requires the tracked target to be fully occluded for a long time, and the surrounding background to contain a highly similar distractor) is rare and, hence, has little detrimental effect on performance overall.

4.3 Parameter sensitivity

The proposed method employs a number of hyper-parameters:

  • The global threshold, g, applied to the score map to locate peaks caused by distractors, see Sect. 3.1;

  • The number of previous frames, n, used to detect distractors, see Sect. 3.1;

  • The number of iterations, \(\iota \), performed by the DIM algorithm, see Sect. 3.2;

  • The distance threshold, d, used to identify situations where the target has been lost, see Sect. 3.3;

  • The reset period, r, during which DIM is not used following the distance threshold being exceeded, see Sect. 3.3.

The influence of these hyper-parameters on the performance of Super_DIM_DiMP was evaluated by varying the value of one parameter while keeping the other parameters fixed at their default values. This experiment was conducted using the OTB-100 dataset, and the results are shown in Tables 5 and 6.
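The one-parameter-at-a-time protocol can be sketched as follows. Only the default g = 0.7 is taken from the text (Sect. 4.5); all other default and alternative values, and the names `one_at_a_time`, `DEFAULTS`, and `ALTERNATIVES`, are placeholders chosen for illustration.

```python
def one_at_a_time(defaults, alternatives):
    """Return (parameter_name, config) pairs in which each config differs
    from `defaults` in exactly one parameter."""
    configs = []
    for name, values in alternatives.items():
        for value in values:
            if value == defaults[name]:
                continue  # the default setting is the shared baseline
            config = dict(defaults)  # copy, so the other parameters stay fixed
            config[name] = value
            configs.append((name, config))
    return configs


# Only g = 0.7 comes from the text; every other number is a placeholder.
DEFAULTS = {"g": 0.7, "n": 5, "iota": 10, "d": 50.0, "r": 10}
ALTERNATIVES = {
    "g": [0.5, 0.6, 0.8],
    "n": [1, 10],
    "iota": [5, 20],
    "d": [10.0, 250.0],
    "r": [1, 50],
}
SWEEP = one_at_a_time(DEFAULTS, ALTERNATIVES)
```

Each entry of `SWEEP` would then be passed to one full evaluation run on the benchmark, so the cost grows with the sum, rather than the product, of the numbers of values tried per parameter.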

It can be seen that when the value of g was increased by a factor of 1.15 from its default value, the algorithm still produced state-of-the-art performance. However, increasing this parameter further had a detrimental effect on performance, which is not surprising as very few distractors will be identified if g is too large. Decreasing g also reduced performance, and an extreme reduction in g could lead to worse performance than the underlying tracker alone. This is likely to be due to the DIM algorithm needing more iterations to perform explaining away when there are many distractor appearance models. In addition, small-amplitude peaks in the score map close to the target will result in parts of the target being included in the distractor appearance models.
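The dependence of the number of detected distractors on g can be illustrated with a small sketch. It assumes that distractor candidates are local maxima of the score map whose value exceeds g, and that the global maximum belongs to the target. Both assumptions, and the function name `distractor_peaks`, are simplifications for illustration; the actual detection procedure is the one described in Sect. 3.1.

```python
import numpy as np


def distractor_peaks(score_map, g):
    """Return (row, col) positions of local maxima above g, excluding the
    global maximum, which is assumed here to be the target."""
    s = np.asarray(score_map, dtype=float)
    h, w = s.shape
    # Pad with -inf so that border cells can be compared with 8 neighbours.
    padded = np.pad(s, 1, mode="constant", constant_values=-np.inf)
    is_peak = s > g
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            is_peak &= s >= neighbour
    target = tuple(int(i) for i in np.unravel_index(np.argmax(s), s.shape))
    peaks = [(int(y), int(x)) for y, x in zip(*np.nonzero(is_peak))]
    return [p for p in peaks if p != target]
```

On a toy map with a target peak of 0.9 and two weaker peaks of 0.6 and 0.3, a threshold of 0.7 returns no distractors, 0.5 returns one, and 0.2 returns both, mirroring the trade-off discussed above.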

The algorithm was tolerant to large changes in n, \(\iota \), and r. However, only detecting distractors in one preceding frame (\(n \div 5\)) meant that there were few distractor appearance models, and hence, only a minor performance gain compared to the underlying tracker alone. Performance deteriorated when a large number of iterations was performed, which can be explained by the similarity values becoming sparser as the number of iterations increases [56]. Using a very small r resulted in an AUC only marginally above that of the base tracker, which indicates that a single frame does not give the tracker sufficient time to relocate the target.

The results were particularly sensitive to the value of d. When d was decreased by a factor of 5 from its default value, the proposed method produced similar results to the underlying tracker alone. This is because the small distance threshold prevented DIM from being used most of the time, as it was exceeded even by small displacements of the target from one frame to the next. In contrast, increasing the value of d meant that situations where the target was lost were not identified, and as explained in Sect. 3.3, this can result in the target object being included among the distractor appearance models, and the score map being suppressed at the true location of the target in subsequent frames.
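The interaction between d and r can be sketched as a simple gate. This is an assumed reading of the mechanism: DIM is switched off for r frames whenever the predicted target centre moves by more than d between consecutive frames. The class name `DimGate` and the toy values are hypothetical, and the exact rule is the one given in Sect. 3.3.

```python
import math


class DimGate:
    """Decide, frame by frame, whether DIM should be applied."""

    def __init__(self, d, r):
        self.d = d            # distance threshold (pixels, assumed unit)
        self.r = r            # reset period, in frames
        self.prev = None      # previous predicted centre
        self.cooldown = 0     # frames left in the current reset period

    def use_dim(self, center):
        jumped = (self.prev is not None
                  and math.dist(center, self.prev) > self.d)
        self.prev = center
        if jumped:
            self.cooldown = self.r  # target assumed lost: start a reset
        if self.cooldown > 0:
            self.cooldown -= 1
            return False
        return True
```

With a very small d, ordinary frame-to-frame motion keeps restarting the reset period, so DIM is almost never applied and the tracker behaves like its underlying base tracker, as observed in the experiment.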

4.4 Transferability

The experiments above are based on the discriminative trackers Super_DiMP [12] and ARSuper_DiMP [75]. To test the transferability of our method, we also combined it with two generative trackers: SiamFC [2] and DaSiamRPN [82]. The global threshold g was set to 0.5 for these two trackers, and the other hyper-parameters were kept at their default values. Optimising the parameters carefully for each tracker may result in better performance. The resulting methods were evaluated on the OTB-100 and UAV123 datasets, and the results are shown in Table 7. All speed measurements were made on a single Nvidia Tesla V100 GPU. SiamFC and DaSiamRPN run at around 210 FPS and 103 FPS, respectively. With DIM integrated, their speeds are 175 FPS and 82 FPS, respectively. The results show that our method also improves the tracking performance of both these trackers, which indicates that it has the potential to be a general approach that could improve the performance of most visual trackers.
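The reported frame rates can be converted into the per-frame cost of adding DIM. The sketch below uses only the four FPS figures quoted above; the function name `dim_overhead` is chosen for illustration.

```python
def dim_overhead(base_fps, dim_fps):
    """Return (extra milliseconds per frame, relative slowdown in percent)."""
    extra_ms = 1000.0 / dim_fps - 1000.0 / base_fps
    slowdown_pct = 100.0 * (1.0 - dim_fps / base_fps)
    return extra_ms, slowdown_pct


SIAMFC = dim_overhead(210, 175)      # FPS figures quoted in the text
DASIAMRPN = dim_overhead(103, 82)
```

This gives roughly 0.95 ms of additional time per frame (a 16.7% lower frame rate) for SiamFC and roughly 2.49 ms (20.4%) for DaSiamRPN, so both combined trackers remain well above typical real-time rates.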

Table 6 Evaluation of the sensitivity of the proposed architecture to its parameter values on OTB-100
Table 7 Evaluation of proposed architecture with generative trackers. Note that the authors of DaSiamRPN [82] report an online update module for the target template but they did not release this as part of the official implementation. Thus, the results reported here are lower than those in [82]

4.5 Frequency of DIM use in these trackers

We measured how often DIM was used by each tracker on OTB-100. This dataset contains 59040 frames. DIM was used on 1817, 1723, 15757, and 13244 frames when integrated with Super_DiMP, ARSuper_DiMP, SiamFC, and DaSiamRPN, respectively. DIM is used less frequently in Super_DiMP and ARSuper_DiMP because these trackers are much more robust than the other two, and also because of the use of a higher global threshold g (0.7 for these trackers and 0.5 for the other two). Because Super_DiMP and ARSuper_DiMP fully exploit both target and background appearance information during tracking, they are only fooled on rare occasions when a distractor is highly similar to the tracked object. DIM acts on these rare occasions to provide a useful boost in performance.
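Expressed as fractions of the dataset, these counts make the difference between the trackers easier to compare. The short calculation below uses only the frame counts reported above; the variable names are illustrative.

```python
TOTAL_FRAMES = 59040  # OTB-100 frame count quoted in the text

DIM_FRAMES = {
    "Super_DiMP": 1817,
    "ARSuper_DiMP": 1723,
    "SiamFC": 15757,
    "DaSiamRPN": 13244,
}

# Percentage of frames on which DIM was used, rounded to two decimals.
DIM_USAGE_PCT = {
    name: round(100.0 * count / TOTAL_FRAMES, 2)
    for name, count in DIM_FRAMES.items()
}
```

DIM is therefore active on about 3% of frames for Super_DiMP (3.08%) and ARSuper_DiMP (2.92%), compared with 26.69% for SiamFC and 22.43% for DaSiamRPN.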

5 Conclusions

We propose a novel tracking architecture that can detect distractors in each frame of a video. The distractor appearance models compete with the target appearance model to explain each part of a subsequent frame of the video. Parts of the image that look similar to the target, and might have been misidentified as the target, are explained away by the distractor appearance models. This leads to suppression of the distractors in the score map, and hence, to more robust tracking of the target. The proposed method is general-purpose and has the potential to improve the performance of many existing tracking algorithms; when combined with state-of-the-art discriminative trackers, it is shown to improve tracking results even further.