1 Introduction

Tracking is a fundamental task in computer vision with numerous applications in surveillance [41, 47], self-driving vehicles [8, 48], and UAV-based monitoring [43, 59]. It is the task of locating the same moving object in each frame of a video sequence, given only the initial appearance of the target object. Most modern trackers treat this as a classification problem: by learning an appearance model of the target from the initial frame, they distinguish the target from the background using a cross-correlation operation and predict its location in the following frames. Although these Tracking-by-Detection approaches achieve impressive performance, they can fail when the appearance model misidentifies a similar-looking object (a “distractor”) as the target. Even a current state-of-the-art tracker, Super_DiMP, which fully exploits (through end-to-end offline training and online meta-learning) both target and background appearance information, is still fooled by distractors, as shown in Fig. 1b. In contrast, humans are able to take into account the appearance of other objects in order to distinguish these potential distractors from the target object and successfully infer the position of a tracked object [19].

Fig. 1

A comparison of our approach with a state-of-the-art tracker on two hard scenarios. a shows the search region of the current frame, with the green rectangle marking the ground-truth location of the tracked target object. b and c show the score maps produced by Super_DiMP alone and by Super_DiMP combined with the proposed method, respectively. Super_DiMP identifies the wrong location as the most likely location for the target in both scenarios, whereas our approach correctly identifies the target location in both scenarios

A leading theory of how such perceptual inference is performed in the brain is provided by predictive coding (PC) [9, 49, 53, 55]. Specifically, PC suggests that the brain learns, from prior experience, an internal model of the world. This internal model encodes possible causes of sensory inputs, and new sensory inputs are then represented in terms of these known causes. Determining which combination of the many possible causes best fits the current sensory data is achieved through an iterative process of minimising the error between the sensory data and the expected sensory inputs predicted by the causes [54]. This inference process performs explaining away [30, 36, 37]: possible causes (i.e. objects) compete to explain the sensory evidence (i.e. the current image), and if one cause explains part of the evidence (i.e. a part of the image), then support from this evidence for alternative explanations (i.e. other objects) is reduced, or explained away.

This paper proposes that explaining away, implemented using a version of PC called DIM [54, 56], can be used to enable a tracker (like a human) to take into account the appearance of distractors when identifying the target in each frame of a video. Specifically, both the target and the distractors (identified in previous frames) are used as possible causes underlying the appearance of the next frame of the video. These causes compete to explain each part of the current frame, and when a distractor provides a better explanation for the appearance of some part of the image, that part of the image is explained away and will not be matched to the target. In this way, the target is less likely to be matched to incorrect, but similar-looking, regions of the image and is more likely to be matched to the correct location, increasing the robustness of tracking.

Our main contributions are summarised as follows:

  • We propose a novel tracking architecture that detects distractors in every frame. These are represented as additional appearance models with the same size as the target appearance model. The predicted location of the target in the next frame takes into consideration not only the tracked object, but also the distractors. As a consequence, matches between the target appearance model and the surrounding background are suppressed, and the identification of the target is more reliable (as illustrated in Fig. 1).

  • The proposed method does not require any retraining of the underlying tracker and can easily be integrated with most current trackers that use the Tracking-by-Detection architecture. We demonstrate this by integrating the proposed method with four existing trackers: SiamFC [2], DaSiamRPN [82], Super_DiMP [12], and ARSuper_DiMP [75]. In all cases, the performance of the underlying tracker is improved by the addition of the proposed method. This indicates that the proposed method has good transferability and is potentially a general approach that could be used to improve the performance of most visual trackers.

  • We demonstrate the effectiveness of our general approach by integrating it with the recent state-of-the-art trackers Super_DiMP [12] and ARSuper_DiMP [75]. The resulting trackers achieve results that are competitive with the state of the art on seven benchmark datasets: OTB-100 [67], UAV123 [44], NFS [31], LaSOT [18], Trackingnet [45], GOT-10K [28], and VOT2020 [32].

2 Related work

Contemporary approaches solve the tracking problem by learning the appearance of the target in the first frame. These approaches can be roughly divided into generative trackers [2, 5, 17, 25, 26, 35, 40, 60, 65, 73, 76,77,78,79, 82] and discriminative trackers [3, 4, 7, 11,12,13,14,15,16, 23, 27, 38,39,40, 42, 46, 63, 64, 68,69,72, 75, 79,80,81]. The former formulate object tracking as a cross-correlation problem in deep feature space and take advantage of end-to-end learning by training a Y-shaped network containing two branches, one for the object template and the other for the search region. This approach is exemplified by Siamese network-based trackers, which have gained significant attention due to their promising performance and efficiency. However, they typically employ a fixed target template and do not model background information, which consequently results in incorrect tracking when there is a similar-looking object in the background or a significant change in the target's appearance. Although appearance-updating strategies [25, 60] have recently been introduced into Siamese network-based trackers, their performance is still below that of discriminative trackers.

The discriminative trackers are exemplified by discriminative correlation filter (DCF)-based methods, which learn to distinguish the target from the background. Traditionally, these methods [14, 27, 39] train quickly in the Fourier domain, using circular shifts to generate training samples and exploiting the fact that circular convolutions are diagonalised by the discrete Fourier transform. However, the online learning procedures are complicated and cannot be integrated with end-to-end learning architectures. To solve this problem, DiMP-based trackers [3, 4, 11, 15, 64, 75] employ a meta-learning formulation to predict the weights of the classification layer. This enables DiMP-based trackers to achieve state-of-the-art results on many benchmarks. Although discriminative trackers learn an appearance model using background information, the appearance model is still unable to deal with cases that contain highly similar-looking distractors (as illustrated in Fig. 1).

Tracking failures caused by a similar-looking location being misidentified as the target indicate that using only the appearance model to identify the tracked object is insufficient to achieve robust results for the popular Tracking-by-Detection-based trackers. Some existing methods attempt to address this issue by taking more visual cues into consideration. For example, Gladh et al. [23] use deep motion features extracted from optical flow images together with appearance features to generate the target model. Wang et al. [63] predict the approximate location of the target by decoupling camera motion and object motion to create an adaptive search region.

Some existing methods attempt to address this issue by introducing attention mechanisms. For example, RAR [22] employed a hierarchical attention module to leverage both inter- and intra-frame attention at each convolutional layer, which effectively highlighted informative representations and suppressed distractors. SiamGAT [24] employed a graph attention module to replace the cross-correlation operation that is common in Siamese trackers, enabling part-to-part matching that effectively passes target information from the template to the search region. [64] proposed an appearance-model generator using a transformer [61]: the transformer encoder refines the previous appearance models via attention-based feature reinforcement to acquire more compact target representations, while the transformer decoder generates the appearance model for the current frame. TransT [6] developed a Transformer-like fusion module to combine the template and search-region features solely using attention instead of correlation. STMTrack [20] created a space-time memory network inspired by non-local self-attention [66] to make full use of historical information about the target and better adapt to appearance variations during tracking.

More closely related to our work are methods that explicitly take into account information about possible distractors. For example, DaSiamRPN [82] proposed a distractor-aware feature learning scheme to boost the discriminative power of the networks during offline training, and also a novel distractor-aware module to suppress distractors during online tracking. Bhat et al [4] presented an end-to-end learning architecture, KYS, where the encoding of image regions is learned and propagated by appearance-based dense tracking between frames. The final prediction is then obtained by combining the explicit background representation with the appearance model output. Nocal-Siam [58] proposed a target-aware non-local attention module to jointly refine visual features of the target and search branches which suppressed distractors effectively.

Fig. 2

An overview of the proposed tracking architecture. Distractor objects are located in every frame by identifying non-target peaks in the score map generated by the tracker. Distractor appearance models are obtained by cropping areas from the current image features \(\phi ({\mathbf {X}})\) that are the same size as the target appearance model but centred at the distractor positions. The distractor list contains the distractor appearance models detected in the previous n frames. These models, together with the target model, are combined into a joint appearance model that our predictor uses to compute the best matching location

Other distractor-suppression techniques have been proposed for specific tracking architectures, but would be difficult to incorporate in modern Tracking-by-Detection approaches. For example, [10] developed an online feature ranking mechanism to select the top-ranked appearance features for trackers based on colour histograms. TLD [29] proposed the Tracking–Learning–Detection architecture, which implemented a P-N learning mechanism to exploit spatio-temporal relationships in the video. Siam R-CNN [62] proposed a Siamese re-detection architecture with a novel Tracklet Dynamic Programming Algorithm to simultaneously track all potential objects and select the best object in the current timestep based on the complete history of all target and distractor object tracklets. [74] proposed a novel hard negative mining method to suppress distractors for long-term tracking, which enhanced the target identification ability of a verification network.

In contrast, we present a general distractor-suppression solution applicable to modern Tracking-by-Detection trackers. We design a novel architecture that constructs a joint appearance model for both the tracked object and the distractor objects. Each object in the appearance model then competes to explain each part of the next video frame. This leads to the score map for the target being suppressed at locations where the appearance of a distractor is a better match to the image, and consequently results in more robust predictions about the true location of the target.

3 Proposed method

Figure 2 shows the architecture of the proposed method. A visual tracker is used to generate an initial prediction, which is then used by the proposed detector module (see Sect. 3.1 for details) to locate distractor objects. Once the positions of distractors are determined, the corresponding distractor appearance models are obtained by cropping regions from the current image features. Lastly, the proposed predictor module (see Sect. 3.2 for details) takes the distractor appearance models (detected in previous frames) and the target appearance model into consideration at the same time. These models compete to explain every pixel of the image, which results in the suppression of distractors in the final score map, which describes the similarity between the target and each location in the image.

3.1 Distractor detection

During tracking, both generative and discriminative trackers (see Sect. 2) predict a confidence score map \({\mathbf {S}}({\mathbf {X}})\), each element of which is a scalar score, given an input image \({\mathbf {X}}\), such that:

$$\begin{aligned} {\mathbf {S}}({\mathbf {X}})={\mathbf {A}} \star \phi ({\mathbf {X}}) \end{aligned}$$
(1)

Here, \(\phi ({\mathbf {X}})\) are the features extracted from the search region of the image, commonly by a CNN. \({\mathbf {A}}\) is the target appearance model: for Siamese trackers, it represents the features of the template \({\mathbf {Z}}\), i.e. \(\phi ({\mathbf {Z}})\); for DCF trackers, it is the convolution kernel which is trained online. \(\star \) represents the cross-correlation operation.
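
For illustration, the score map of Eq. 1 can be computed as a channel-wise cross-correlation between the appearance model and the search-region features. The following is a minimal NumPy sketch assuming valid-mode correlation; real trackers implement this with optimised (often batched GPU) convolution routines, and the function name is ours.

```python
import numpy as np

def score_map(A, phi_X):
    """Eq. 1: cross-correlate the appearance model A (k x h x w) with the
    search-region features phi_X (k x H x W), summing over the k channels.
    Returns a (H-h+1) x (W-w+1) confidence score map (valid-mode correlation)."""
    k, h, w = A.shape
    _, H, W = phi_X.shape
    S = np.zeros((H - h + 1, W - w + 1))
    for y in range(S.shape[0]):
        for x in range(S.shape[1]):
            S[y, x] = np.sum(A * phi_X[:, y:y + h, x:x + w])
    return S
```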

The score map measures the similarity between an appearance model and the deep features extracted from the current video frame. The tracker estimates the target object’s location by finding the location of the maximum in the score map. If the appearance of the target is distinctive, there will only be one peak in the score map. However, if there are similar-looking distractors in the search region, the score map will have multiple peaks. Hence, distractors can be identified by finding the locations of peaks excluding the one that represents the target. To be specific, a peak is defined as a local maximum (within a 3-by-3 neighbourhood) that has a value over a global threshold which is set to 0.7 times the max value of the score map. The peak corresponding to the target is determined by finding the location of the maximum value in the final score map produced by the proposed predictor.
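
The peak-finding step can be sketched as follows. This is only an illustration: the function name is ours, the use of scipy.ndimage.maximum_filter is an implementation choice, and excluding the target peak by exact position match is a simplification.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_distractor_peaks(S, target_pos, g=0.7):
    """Return positions of score-map peaks other than the target peak.
    A peak is a local maximum within a 3x3 neighbourhood whose value exceeds
    g times the global maximum of the score map S."""
    local_max = (S == maximum_filter(S, size=3))
    strong = S > g * S.max()
    peaks = np.argwhere(local_max & strong)
    # Exclude the peak corresponding to the target location.
    return [tuple(p) for p in peaks if tuple(p) != tuple(target_pos)]
```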

Finally, distractor appearance models are obtained by cropping areas from \(\phi ({\mathbf {X}})\) that are the same size as the target appearance model and are centred at the distractor positions. A list of distractors is updated every frame and contains the distractor appearance models detected in the last n (default value is 5) frames. If there is no distractor in a frame, no additional distractor appearance models are stored in the list.
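
A minimal sketch of how the distractor list might be maintained is shown below; the class, its method names, and the behaviour at feature-map borders (crops that would fall outside the features are simply discarded) are our assumptions rather than details taken from the released implementation.

```python
from collections import deque

class DistractorList:
    """Stores the distractor appearance models detected in the last n frames."""
    def __init__(self, n=5):
        self.frames = deque(maxlen=n)  # one (possibly empty) list of crops per frame

    def update(self, phi_X, peaks, model_size):
        """Crop a (channels, h, w) model around each detected peak in phi_X."""
        h, w = model_size
        crops = []
        for (py, px) in peaks:
            y0, x0 = py - h // 2, px - w // 2
            if y0 >= 0 and x0 >= 0 and y0 + h <= phi_X.shape[1] and x0 + w <= phi_X.shape[2]:
                crops.append(phi_X[:, y0:y0 + h, x0:x0 + w])
        self.frames.append(crops)

    def models(self):
        """Flattened list of all stored distractor appearance models."""
        return [m for frame in self.frames for m in frame]

    def clear(self):
        self.frames.clear()
```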

3.2 Appearance model competition

The predictor takes the target appearance model and the distractor appearance models extracted from the preceding n frames. These appearance models compete to match to the features extracted from the search region by the tracker. The competition is achieved by the DIM algorithm which implements explaining away and which is the current state-of-the-art method for image patch matching in both colour [56] and deep feature [21] space. A detailed description of the DIM algorithm can be found in [56], but an introduction is provided below for the convenience of the reader.

DIM can be thought of as a function with two input arguments and one output. In the current application, this function operates as follows:

$$\begin{aligned} {\mathbf {S}}_j({\mathbf {X}})=\mathrm {DIM}(Pre({\mathbf {A}}_{j}),Pre(\phi ({\mathbf {X}})))_{\epsilon _{2},\iota } \end{aligned}$$
(2)

where \({\mathbf {A}}_{j}\) is a joint appearance model consisting of a stack of the target appearance model and the distractor appearance models detected in the last n frames, and \(\phi ({\mathbf {X}})\) are the features extracted from the search region of the image, \({\mathbf {X}}\). The output of this function, \({\mathbf {S}}_j({\mathbf {X}})\), is a stack of score maps, and each channel, j, is the individual score map for the corresponding appearance model in \({\mathbf {A}}_{j}\). Pre stands for pre-processing and is described below.

To simplify the notation, we will represent the two inputs to DIM as \({\mathbf {w}}\) and \({\mathbf {I}}\) (i.e. \({\mathbf {w}}_j=Pre({\mathbf {A}}_{j})\) and \({\mathbf {I}}=Pre(\phi ({\mathbf {X}}))\)). Internally, the DIM function performs \(\iota \) iterations of the following three equations:

$$\begin{aligned}&{\mathbf {R}}_{i}=\sum _{j=1}^{p}\left( {\mathbf {v}}_{j i} \star {\mathbf {S}}_{j}\right) \end{aligned}$$
(3)
$$\begin{aligned}&{\mathbf {E}}_{i}={\mathbf {I}}_{i} \oslash \left[ {\mathbf {R}}_{i}\right] _{\epsilon _{2}} \end{aligned}$$
(4)
$$\begin{aligned}&{\mathbf {S}}_{j} \leftarrow \left[ {\mathbf {S}}_{j}\right] _{\epsilon _{1}} \odot \sum _{i=1}^{k}\left( {\mathbf {w}}_{j i} *{\mathbf {E}}_{i}\right) \end{aligned}$$
(5)

where i is an index over the number of channels in the input \({\mathbf {I}}\); j is an index over the number of different appearance models; \({\mathbf {R}}_{i}\) is a 2-dimensional array representing a reconstruction of \({\mathbf {I}}_i\); \({\mathbf {E}}_{i}\) is a 2-dimensional array representing the discrepancy (or residual error) between \({\mathbf {I}}_{i}\) and \({\mathbf {R}}_{i}\); \({\mathbf {S}}_{j}\) is the individual score map for the corresponding appearance model in \({\mathbf {A}}_{j}\); \({\mathbf {w}}_{j i}\) is a 2-dimensional array representing channel i of the corresponding appearance model after pre-processing (i.e. \(Pre({\mathbf {A}}_{j})_i\)), where the values in each \({\mathbf {w}}_{j}\) are normalised to sum to one; \({\mathbf {v}}_{j i}\) is a 2-dimensional array also representing appearance model values (the values of \({\mathbf {v}}_{j}\) are equal to the corresponding values of \({\mathbf {w}}_{j}\), except that they are normalised to have a maximum value of one); \([\mathbf {\cdot }]_{\epsilon }=max(\mathbf {\cdot },\epsilon )\); \(\oslash \) and \(\odot \) indicate element-wise division and multiplication, respectively; and \(\star \) and \(*\) represent cross-correlation and convolution operations, respectively.

DIM attempts to find a sparse set of elementary components, \({\mathbf {v}}\), that when combined together reconstruct \({\mathbf {I}}\) with minimum error [52]. For the current application, the elementary components are the target appearance model and the distractor appearance models in the distractor list. These appearance models can be thought of as a “dictionary” or “codebook” that can be used to reconstruct many different images. The activation dynamics, described by Eqs. 3, 4 and 5, perform gradient descent on the residual error in order to find values of \({\mathbf {S}}\) that accurately reconstruct \({\mathbf {I}}\) [1, 51, 57]. Specifically, the equations operate to find values for \({\mathbf {S}}\) that minimise the Kullback–Leibler (KL) divergence between \({\mathbf {I}}\) and its reconstruction \({\mathbf {R}}\) [50, 57]. The activation dynamics thus result in the DIM algorithm selecting a subset of dictionary elements that best explain \({\mathbf {I}}\). The strength of an element in \({\mathbf {S}}\) reflects the strength with which the corresponding dictionary entry (i.e. appearance model) is required to be present in order to accurately reconstruct \({\mathbf {I}}\) at that location [56]. Hence, when a distractor appearance model closely matches the appearance of some part of the image, this part of the image is explained away and will not be matched to the target. In this way, the target appearance model is more likely to be matched to the correct location, increasing the robustness of tracking.
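
As an illustration, a minimal NumPy sketch of the iterative procedure in Eqs. 3–5 is given below. It assumes that the \(\star \) and \(*\) operations can be realised with SciPy's correlate2d and convolve2d (same-size outputs) and that the score maps are initialised to zero; the released PyTorch implementation is organised differently.

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def dim_iterations(I, w, v, eps1, eps2, iters=15):
    """Sketch of the DIM update in Eqs. 3-5.
    I: nonnegative input features, shape (k, H, W).
    w, v: appearance models, shape (p, k, h, w'); each w[j] is normalised to
    sum to one, each v[j] is normalised to a maximum value of one.
    Returns the stack of score maps S, shape (p, H, W)."""
    p, k = w.shape[0], I.shape[0]
    S = np.zeros((p,) + I.shape[1:])
    for _ in range(iters):
        # Eq. 3: reconstruct each input channel from the current score maps.
        R = np.stack([sum(correlate2d(S[j], v[j, i], mode='same')
                          for j in range(p)) for i in range(k)])
        # Eq. 4: element-wise ratio of the input to the (clipped) reconstruction.
        E = I / np.maximum(R, eps2)
        # Eq. 5: update each score map using the residuals fed back through w.
        S = np.maximum(S, eps1) * np.stack(
            [sum(convolve2d(E[i], w[j, i], mode='same')
                 for i in range(k)) for j in range(p)])
    return S
```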

Because DIM minimises the KL divergence between \({\mathbf {I}}\) and its reconstruction \({\mathbf {R}}\) created by the additive combination of elementary image components \({\mathbf {v}}\), both inputs to the DIM function must be nonnegative [56]. However, \({\mathbf {A}}_{j}\) and \(\phi ({\mathbf {X}})\) are activation values extracted from a CNN which can contain negative values. Thus, the pre-processing takes the positive and rectified negative values of \({\mathbf {A}}_{j}\) and \(\phi ({\mathbf {X}})\) and separates them into two parts which are used as separate channels by the DIM algorithm.
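
A sketch of this pre-processing step (the function name is ours; it simply doubles the number of channels by stacking the positive parts on top of the rectified negative parts):

```python
import numpy as np

def pre(x):
    """Split CNN activations of shape (k, H, W) into nonnegative channels of
    shape (2k, H, W): positive parts first, then rectified negative parts."""
    return np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)], axis=0)
```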

\(\epsilon _{1}\) is a parameter which was given the value \(\frac{\epsilon _{2}}{\max (\sum \limits _{j}^{p} v_{j i})}\). \(\epsilon _{2}\) is a scalar parameter used by the DIM algorithm. It determines the magnitude that elements of \({\mathbf {I}}\) must have in order to exert a strong effect on the competition. Hence, if a value of \({\mathbf {I}}\) is smaller than \(\epsilon _{2}\), it is effectively ignored. When DIM is applied to colour images, those images have pixel intensities that typically range from 0 to 1, so the maximum value of \({\mathbf {I}}\) is approximately 1 for every image, and it is possible to use a fixed value of \(\epsilon _{2}\). However, the maximum value of \(Pre(\phi ({\mathbf {X}}))\), which is used as \({\mathbf {I}}\), can vary as it is produced by applying a CNN to different videos. To deal with this variation, the appropriate value of \(\epsilon _{2}\) for any one video is chosen from a set of ten candidate values ranging from \(1\times 10^{-3}\) to \(9\times 10^{-3}\) in steps of \(8\times 10^{-4}\). When DIM is applied for the first time to a video, it is run ten times, once with each candidate \(\epsilon _{2}\) value. The magnitudes of the highest peaks in the resulting ten score maps for the target object are compared, and the value of \(\epsilon _{2}\) corresponding to the highest peak is used for all subsequent frames of this video. The number of iterations, \(\iota \), performed by the DIM algorithm was set to 15.
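
A sketch of this selection procedure; the helper `dim_target_score_map`, which runs DIM on the first frame with a given \(\epsilon _{2}\) and returns the target score map, is a placeholder, and the candidate grid below simply spans the stated range with ten values.

```python
import numpy as np

def select_eps2(dim_target_score_map, I, candidates=np.linspace(1e-3, 9e-3, 10)):
    """Run DIM on the first frame once per candidate eps2 and keep the value
    whose target score map contains the highest peak."""
    peak_heights = [dim_target_score_map(I, eps2).max() for eps2 in candidates]
    return candidates[int(np.argmax(peak_heights))]
```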

3.3 Implementation details

DIM requires the appearance model to have dimensions that are odd numbers; otherwise, the reconstruction of \({\mathbf {I}}\) does not align with the actual \({\mathbf {I}}\). Therefore, if the size of the target appearance model employed by a visual tracker is even, the target appearance model is padded with a row of zeros at the bottom and a column of zeros on the right, and this new size of the target appearance model is used to generate distractor appearance models, as described in Sect. 3.1.
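
A sketch of this zero-padding step (the function name is ours):

```python
import numpy as np

def pad_to_odd(A):
    """Zero-pad a (channels, h, w) appearance model so that both spatial
    dimensions are odd: one extra row at the bottom and/or column on the right."""
    pad_h = 1 - A.shape[1] % 2  # 1 if the height is even, 0 otherwise
    pad_w = 1 - A.shape[2] % 2  # 1 if the width is even, 0 otherwise
    return np.pad(A, ((0, 0), (0, pad_h), (0, pad_w)))
```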

If no distractor appearance model has been detected in the preceding five frames, i.e. if \({\mathbf {A}}_{j}\) only contains the target appearance model, then DIM will output a similar result to that produced by Eq. 1. In such circumstances, when the distractor list is empty, DIM is not employed and the score map generated by the original tracker is used as the final score map. This helps to improve the computational efficiency of the proposed method. The frequency with which DIM is used by the trackers reported in this paper can be found in Sect. 4.5.

Tracking sometimes fails, for example, when the target is occluded or is out of frame. In this case, the tracker may confuse a distractor for the target. Subsequently, when the tracked object reappears, the proposed detection module will incorrectly regard the reappeared target object as a distractor due to the incorrect matching in the former frame. If this happens, the appearance of the target object will be included in the list of distractor appearance models and DIM will suppress the score map at the location of the true target. To avoid this, we rely on the assumption that the position of the target object does not change significantly between two adjacent frames, whereas there will be a jump in the predicted position of the target when a distractor is confused for the target. Specifically, the Euclidean distance between the location of the highest peak of the final score map in this frame and that in the former frame is calculated. If this distance exceeds a threshold d (a value of 3 was used, and the distance was calculated before upsampling the score map), the distractor list is cleared and the proposed predictor does not run for r frames (a value of 5 was used). Hence, for those r frames the final score map will be the one generated by the underlying tracker.
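
This check can be sketched as follows (the function name and the use of score-map coordinates are ours). If it returns True, the distractor list is cleared and the DIM predictor is bypassed for the next r frames.

```python
import numpy as np

def target_lost(prev_peak, curr_peak, d=3.0):
    """Return True if the predicted target position jumped by more than d
    (Euclidean distance in score-map coordinates, before upsampling),
    which is taken to indicate that the target has probably been lost."""
    return np.linalg.norm(np.asarray(curr_peak, float) - np.asarray(prev_peak, float)) > d
```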

The complete proposed architecture is summarised by Algorithm 1.
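
For readers without access to the algorithm figure, the following Python-style pseudocode gives a high-level sketch of one tracking step that ties together the components described above; all names (the `tracker` methods, `state` fields, and helper functions from the earlier sketches) are our own placeholders, not the released implementation.

```python
import numpy as np

def argmax_2d(S):
    """Location (row, col) of the maximum of a 2-D score map."""
    return np.unravel_index(np.argmax(S), S.shape)

def track_frame(tracker, dim_predictor, distractors, state, X):
    """One step of the proposed architecture (illustrative pseudocode only)."""
    phi_X = tracker.extract_features(X)        # CNN features of the search region
    S_base = tracker.score_map(phi_X)          # Eq. 1
    if state.skip_frames > 0 or not distractors.models():
        # Reset period active or distractor list empty: use the base tracker.
        state.skip_frames = max(0, state.skip_frames - 1)
        S_final = S_base
    else:
        A_joint = [tracker.target_model] + distractors.models()
        S_all = dim_predictor(A_joint, phi_X)  # Eq. 2 (pre-processing + Eqs. 3-5)
        S_final = S_all[0]                     # channel 0: the target score map
    pos = argmax_2d(S_final)
    if state.prev_pos is not None and target_lost(state.prev_pos, pos, d=3.0):
        # Large jump in the prediction: the target was probably lost.
        distractors.clear()
        state.skip_frames = 5                  # bypass DIM for r frames
        S_final, pos = S_base, argmax_2d(S_base)
    else:
        peaks = find_distractor_peaks(S_base, pos, g=0.7)
        distractors.update(phi_X, peaks, tracker.target_model_size)
    state.prev_pos = pos
    return tracker.refine_box(pos, S_final), state
```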



Fig. 3

Success and precision measured using the OTB-100 and UAV123 datasets

4 Experiments

4.1 State-of-the-art comparison

We evaluate our proposed tracking architecture using Super_DiMP [12] and ARSuper_DiMP [75] on seven tracking benchmarks: OTB-100 [67], UAV123 [44], NFS [31], LaSOT [18], Trackingnet [45], GOT-10k [28], and VOT2020 [32]. Due to the stochastic nature of DCF trackers, the results reported for DiMP-based trackers [3, 4, 15, 64, 75] are an average over multiple runs. For OTB-100, NFS, UAV123, and LaSOT, the results were averaged over five runs. As the Trackingnet results are obtained using an online evaluation server that limits the number of submissions per account, only a single run was used. GOT-10k results are also evaluated with an online server, and the official documentation suggests using three runs for all trackers; hence, three runs were used. The official VOT evaluation toolkit runs a tracker twenty times by default to produce statistically significant results, and hence, twenty runs were used for VOT2020. For a fair comparison, we follow the same protocol when testing our trackers, termed Super_DIM_DiMP and ARSuper_DIM_DiMP, and the original trackers. As ARSuper_DiMP uses an Alpha-Refine module to improve the accuracy of the bounding boxes predicted by Super_DiMP, the hyper-parameters used by our method were tuned using Super_DiMP and reused for ARSuper_DiMP. Super_DIM_DiMP runs at around 23 FPS on a single Nvidia Tesla V100 GPU. In comparison, Super_DiMP runs at approximately 27 FPS. ARSuper_DIM_DiMP and ARSuper_DiMP run at around 20 and 16 FPS, respectively. Our code is available at https://github.com/iminfine/DIMtracking.

OTB-100 [67]: This dataset has been used extensively to evaluate visual trackers. Our methods are compared with numerous state-of-the-art trackers in Fig. 3a and b, including STMTrack [20], PRT [40], DROL-RPN [79], ARSuper_DiMP [75], TrDiMP [64], Super_DiMP [12], SiamR-CNN [62], SiamRPN++ [35], PrDiMP50 [15], SiamBAN [7], KYS [4], FCOT [11], TransT [6], and DiMP50 [3]. Despite performance becoming saturated over recent years, the proposed tracker Super_DIM_DiMP still outperforms the baseline, Super_DiMP, by 1\(\%\) in terms of AUC (success score) and 1.3\(\%\) in terms of precision. Similarly, ARSuper_DIM_DiMP outperforms ARSuper_DiMP in both scores and achieves the best AUC score with 72.3\(\%\).

UAV123 [44]: This is a large dataset captured from low-altitude UAVs. It contains over 110K frames and 123 videos. It is quite challenging due to small tracked objects and fast motion. PrDiMP50 [15], TransT [6], Super_DiMP [12], ARSuper_DiMP [75], TrDiMP [64], FCOT [11], DiMP50 [3], PrDiMP18 [15], SiamRPN++ [35], DROL-RPN [79], SiamR-CNN [62], STMTrack [20], and DiMP18 [3] are compared. It can be seen from the results shown in Fig. 3c and d that Super_DIM_DiMP outperforms the previous best approaches with an AUC of 68.1\(\%\) and precision of 90.6\(\%\).

Table 1 Comparison with state-of-the-art trackers on the NFS dataset
Table 2 Comparison with state-of-the-art trackers on the LaSOT dataset

NFS [31]: This dataset contains 100 videos captured using a high frame rate (240 FPS) camera. We evaluate our tracker on the 30 FPS version of this dataset in which videos have an average length of 479 frames. As shown in Table 1, ARSuper_DIM_DiMP outperforms the previous best approaches with an AUC of 67.5\(\%\) and precision of 82.3\(\%\).

LaSOT [18]: The large-scale LaSOT dataset contains 280 videos in its test set. The video sequences, which have an average length of 2500 frames, are longer than those in other datasets, testing not only the accuracy of the tracker but also its robustness. As shown in Table 2, ARSuper_DIM_DiMP achieves the best AUC score with 65.5\(\%\) and Super_DIM_DiMP outperforms Super_DiMP with relative gains of 0.4\(\%\) in success score and 0.6\(\%\) in normalised precision.

Trackingnet [45]: This dataset provides over 30K videos sampled from YouTube. We report results on its test set, consisting of 511 videos with an average of 441 frames per sequence. As shown in Table 3, our approaches achieve similar results to the baseline trackers Super_DiMP and ARSuper_DiMP.

GOT-10K [28]: This is a recent large-scale dataset consisting of 10k video sequences. With this dataset, trackers are evaluated on 180 videos with 84 object classes and 32 motions that cover a wide range of common moving objects in the wild. The results in terms of average overlap (AO) and success rates (SR\(_{0.50}\) and SR\(_{0.75}\)) are shown in Table 4. Among Tracking-By-Detection methods, ARSuper_DIM_DiMP achieves the best performance. Meanwhile, Super_DIM_DiMP significantly outperforms Super_DiMP with a relative improvement of 2.5\(\%\) in AO.

VOT2020 [32]: The VOT challenge [32,33,34], held yearly, provides a precisely defined and repeatable way of comparing short-term trackers. VOT2020 contains 60 videos with binary segmentation masks as the ground truth and uses a new evaluation protocol which separates the sequences into short pieces to keep the computational complexity of the evaluation at a moderate level. The results in terms of expected average overlap (EAO), accuracy (A), and robustness (R) are shown in Table 5. ARSuper_DIM_DiMP outperforms the baseline ARSuper_DiMP by 1.2\(\%\) in terms of robustness.

Table 3 Comparison with state-of-the-art trackers on the Trackingnet dataset
Table 4 Comparison with state-of-the-art trackers on the GOT-10K dataset
Table 5 Comparison with state-of-the-art trackers on the VOT2020 dataset
Fig. 4

Qualitative comparison of Super_DiMP and Super_DIM_DiMP on hard scenarios. The first row shows frames from the video Crowds from OTB-100, and the second and third rows show frames from the video group2_1 from UAV123. Note that the annotations provided with the UAV123 data define the bounding-box coordinates as NaNs when the target is out of frame or fully occluded, and thus, the ground-truth bounding box is not shown in frames 98, 104, 113, and 116 of the video group2_1. Note that cropped regions of each video are shown to improve the visibility of the bounding boxes

Of all the Tracking-By-Detection trackers tested, ARSuper_DIM_DiMP was the best performing on five datasets (OTB-100, NFS, LaSOT, Trackingnet, and GOT-10k). No other Tracking-By-Detection method was best performing on more than one dataset, and for one of those (UAV123), the best performing tracker was our other method, Super_DIM_DiMP. The results also show DIM to be more effective than other recent distractor-suppression methods such as KYS [4] and Nocal-Siam [58].

4.2 Qualitative evaluations

The effects of explaining away are illustrated in the first row of Fig. 4. In this example, Super_DiMP incorrectly starts to track a similar-looking distractor in frame 30. In contrast, this distractor was detected in previous frames and the proposed tracker is able to use this information to infer the true location of the target. However, the proposed tracker can fail when the tracked object is fully occluded for a long time. In the example on the second and third rows, the target is fully occluded by a pavilion from frames 98 to 141 (four of these frames are shown in Fig. 4). Super_DiMP checks the maximum of \({\mathbf {S}}({\mathbf {X}})\) every frame; if this value is below a threshold, this is interpreted as a tracking failure, and the tracker does not update the location of the target, outputting the same predicted bounding box as in the last frame. Hence, the predicted bounding boxes of the two trackers on frames 98, 104, and 113 overlap substantially. From frame 116, the proposed tracker regards a distractor as the target even when the target reappears. This is because the proposed detection module incorrectly regards the reappeared target object as a distractor, and the appearance of the target object is included in the list of distractor appearance models. Hence, DIM suppresses the score map at the location of the true target. We have developed a reset mechanism to avoid this (see Sect. 3.3 for details); however, this mechanism is not activated in this particular scenario as the proposed tracker regards the distractor as the target for a long time. However, this type of failure (which requires the target to be fully occluded for a long time, and the surrounding background to contain a highly similar distractor) is rare and, hence, has little detrimental effect on overall performance.

4.3 Parameter sensitivity

The proposed method employs a number of hyper-parameters:

  • The global threshold, g, applied to the score map to locate peaks caused by distractors, see Sect. 3.1;

  • The number of previous frames, n, used to detect distractors, see Sect. 3.1;

  • The number of iterations, \(\iota \), performed by the DIM algorithm, see Sect. 3.2;

  • The distance threshold, d, used to identify situations where the target has been lost, see Sect. 3.3;

  • The reset period, r, during which DIM is not used following the distance threshold being exceeded, see Sect. 3.3;

The influence of these hyper-parameters on the performance of Super_DIM_DiMP was evaluated by varying the value of one parameter while keeping the other parameters fixed at their default values. This experiment was conducted using the OTB-100 dataset, and the results are shown in Table 6.

It can be seen that when the value of g was increased by a factor of 1.15 from its default value, the algorithm still produced state-of-the-art performance. However, increasing this parameter further had a detrimental effect on performance, which is not surprising as very few distractors will be identified if g is too large. Decreasing g also reduced performance, and an extreme reduction in g could lead to worse performance than the underlying tracker alone. This is likely to be because the DIM algorithm needs more iterations to perform explaining away when there are many distractor appearance models. In addition, small-amplitude peaks in the score map close to the target will result in parts of the target being included in the distractor appearance models.

The algorithm was tolerant to large changes in n, \(\iota \), and r. However, only detecting distractors in one preceding frame (\(n=1\)) meant that there were few distractor appearance models and, hence, only a minor performance gain compared to the underlying tracker alone. Performance deteriorated when a large number of iterations was performed, which can be explained by the similarity values becoming sparser as the number of iterations increases [56]. Using a very small r resulted in an AUC only marginally above that of the base tracker, which indicates that one frame gives the tracker insufficient time to relocate the target.

The results were particularly sensitive to the value of d. When d was decreased by a factor of 5 from its default value, the proposed method produced similar results to the underlying tracker alone. This is because even the small displacements of the target that occur between adjacent frames exceeded this small distance threshold, preventing DIM from being used most of the time. In contrast, increasing the value of d meant that situations where the target was lost were not identified, and as explained in Sect. 3.3, this can result in the target object being included among the distractor appearance models, and the score map being suppressed at the true location of the target in subsequent frames.

4.4 Transferability

The experiments above are based on the discriminative trackers Super_DiMP [12] and ARSuper_DiMP [75]. To test the transferability of our method, we also combined it with two generative trackers: SiamFC [2] and DaSiamRPN [82]. The global threshold g was set to 0.5 for these two trackers, and the other hyper-parameters were kept at their standard values. Optimising the parameters carefully for each tracker may result in better performance. The resulting methods were evaluated on the OTB-100 and UAV123 datasets, and the results are shown in Table 7. We evaluated them on a single Nvidia Tesla V100 GPU. The speeds of SiamFC and DaSiamRPN are around 210 and 103 FPS, respectively. When DIM is integrated, their speeds are 175 and 82 FPS, respectively. The results show that our method can also improve the tracking performance of both these trackers, which indicates that our method has the potential to be a general approach that could improve the performance of most visual trackers.

Table 6 Evaluation of the sensitivity of the proposed architecture to its parameter values on OTB-100
Table 7 Evaluation of the proposed architecture with generative trackers. Note that the authors of DaSiamRPN [82] describe an online update module for the target template, but they did not release this module as part of the official implementation. Thus, the results reported here are lower than those in [82]

4.5 The frequency of DIM used in these trackers

We tested this using OTB-100, which contains 59040 frames. DIM was used on 1817, 1723, 15757, and 13244 frames when integrated with Super_DiMP, ARSuper_DiMP, SiamFC, and DaSiamRPN, respectively. DIM is used less frequently with Super_DiMP and ARSuper_DiMP because these trackers are much more robust than the other two and because a higher global threshold g is used (0.7 for these trackers versus 0.5 for the other two). Because Super_DiMP and ARSuper_DiMP fully exploit both target and background appearance information during tracking, they are only fooled on rare occasions when a distractor is highly similar to the tracked object. DIM acts on these rare occasions to provide a useful boost in performance.

5 Conclusions

We propose a novel tracking architecture that can detect distractors in each frame of a video. The distractor appearance models compete with the target appearance model to explain each part of a subsequent frame of the video. Parts of the image that look similar to the target, and might otherwise have been misidentified as the target, are explained away by the distractor appearance models. This leads to suppression of the distractors in the score map and, hence, to more robust tracking of the target. The proposed method is general-purpose and has the potential to improve the performance of many existing tracking algorithms; when combined with state-of-the-art discriminative trackers, it is shown to improve tracking results even further.