1 Introduction

Visual tracking remains a highly popular research area of computer vision, with the number of motion and tracking papers published at high-profile conferences exceeding 40 annually. The significant activity in the field over the last two decades is reflected in the abundance of review papers [1–9]. In response to the high number of publications, several initiatives emerged to establish a common ground for tracking performance evaluation. The earliest and most influential is PETS [10], the longest-lasting initiative, which proposed frameworks for performance evaluation in relation to surveillance systems applications. Other frameworks have been presented since, with a focus on surveillance systems and event detection, e.g., CAVIAR (Footnote 1), i-LIDS (Footnote 2) and ETISEO (Footnote 3); change detection [11]; sports analytics, e.g., CVBASE (Footnote 4); faces, e.g., FERET [12, 13]; long-term tracking (Footnote 5); and multiple target tracking [14, 15] (Footnote 6).

In 2013 the Visual Object Tracking (VOT) initiative was established to address performance evaluation for short-term visual object trackers. The initiative aims at establishing datasets, performance evaluation measures and toolkits as well as creating a platform for discussing evaluation-related issues. Since its emergence in 2013, three workshops and challenges have been carried out in conjunction with the ICCV2013 (VOT2013 [16]), ECCV2014 (VOT2014 [17]) and ICCV2015 (VOT2015 [18]). This paper discusses the VOT2016 challenge, organized in conjunction with the ECCV2016 Visual object tracking workshop, and the results obtained. Like VOT2013, VOT2014 and VOT2015, the VOT2016 challenge considers single-camera, single-target, model-free, causal trackers, applied to short-term tracking. The model-free property means that the only training example is provided by the bounding box in the first frame. Short-term tracking means that trackers are assumed not to be capable of performing successful re-detection after the target is lost, and they are therefore reset after such an event. Causality means that the tracker does not use any future frames, or frames prior to re-initialization, to infer the object position in the current frame. In the following, we overview the most closely related work and point out the contributions of VOT2016.

1.1 Related Work

Several works that focus on performance evaluation in short-term visual object tracking [16, 17, 19–24] have been published in the last three years. The currently most widely used methodologies for performance evaluation originate from three benchmark papers, in particular the ‘Online tracking benchmark’ (OTB) [21], the ‘Amsterdam Library of Ordinary Videos’ (ALOV) [22] and the ‘Visual object tracking challenge’ (VOT) [16–18].

Performance Measures. The OTB- and ALOV-related methodologies, like [21, 22, 24, 25], evaluate a tracker by initializing it on the first frame and letting it run until the end of the sequence, while the VOT-related methodologies [16–20] reset the tracker once it drifts off the target. In all of these approaches, performance is evaluated by the overlap between the bounding boxes predicted by the tracker and the ground truth bounding boxes. The OTB and ALOV initially considered performance evaluation based on object center estimation as well, but as shown in [26], the center-based measures are highly brittle and overlap-based measures should be preferred. The ALOV measures the tracking performance as the F-measure at 0.5 overlap threshold and a similar measure was proposed by OTB. Recently, it was demonstrated in [19] that such a threshold is over-restrictive, since an overlap below 0.5 does not clearly indicate a tracking failure in practice. The OTB introduced a success plot which represents the percentage of frames for which the overlap measure exceeds a threshold, with respect to different thresholds, and developed an ad-hoc performance measure computed as the area under the curve in this plot. This measure remains one of the most widely used measures in tracking papers. It was later analytically proven by [20, 26] that the ad-hoc measure is equivalent to the average overlap (AO), which can be computed directly without intermediate success plots, giving the measure a clear interpretation. An analytical model was recently proposed [19] to study the average overlap measures with and without resets in terms of a tracking accuracy estimator. The analysis showed that the no-reset AO measures are biased estimators with large variance, while the VOT reset-based average overlap drastically reduces the bias and variance and is not hampered by the varying sequence lengths in the dataset.
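The equivalence between the area under the success plot and the average overlap can be checked numerically. The following is a minimal sketch, not code from the OTB or VOT toolkits: the function names and the per-frame overlaps are made up for illustration, and the success plot is assumed to count frames whose overlap strictly exceeds the threshold.

```python
def success_rate(overlaps, threshold):
    """Fraction of frames whose overlap exceeds the given threshold."""
    return sum(o > threshold for o in overlaps) / len(overlaps)


def success_plot_auc(overlaps, steps=1000):
    """Area under the success plot on [0, 1], approximated by the midpoint rule."""
    width = 1.0 / steps
    return sum(success_rate(overlaps, (i + 0.5) * width) for i in range(steps)) * width


def average_overlap(overlaps):
    """Average overlap (AO), computed directly from the per-frame overlaps."""
    return sum(overlaps) / len(overlaps)


# Hypothetical per-frame overlaps of a single tracker run.
overlaps = [0.0, 0.25, 0.5, 0.75, 1.0, 0.6, 0.1]
auc = success_plot_auc(overlaps)
ao = average_overlap(overlaps)
```

Up to the discretization of the threshold axis, `auc` and `ao` agree, which is why the average overlap can be reported without constructing the intermediate plot.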

Čehovin et al. [20, 26] provided a highly detailed theoretical and experimental analysis of a number of the popular performance measures. Based on that analysis, the VOT2013 [16] selected the average overlap with resets and number of tracking failures as their main performance criteria, measuring geometric accuracy and robustness respectively. The VOT2013 introduced a ranking-based methodology that accounted for statistical significance of the results, which was extended with the tests of practical differences in the VOT2014 [17]. The notion of practical differences is unique to the VOT challenges and relates to the uncertainty of the ground truth annotation. The VOT ranking methodology treats each sequence as a competition among the trackers. Trackers are ranked on each sequence and ranks are averaged over all sequences. This is called the sequence-normalized ranking. An alternative is sequence-pooled ranking [19], which ranks the average performance on all sequences. Accuracy-robustness ranking plots were proposed [16] to visualize the results. A drawback of the AR-rank plots is that they do not show the absolute performance. In VOT2015 [18], the AR-raw plots from [19, 20] were adopted to show the absolute average performance. The VOT2013 [16] and VOT2014 [17] selected the winner of the challenge by averaging the accuracy and robustness ranks, meaning that the accuracy and robustness were treated as equivalent “competitions”. A high average rank means that a tracker was well-performing in accuracy as well as robustness relative to the other trackers. While ranking converts the accuracy and robustness to equal scales, the averaged rank cannot be interpreted in terms of a concrete tracking application result. To address this, the VOT2015 [18] introduced a new measure called the expected average overlap (EAO) that combines the raw values of per-frame accuracies and failures in a principled manner and has a clear practical interpretation. 
The EAO measures the expected no-reset overlap of a tracker run on a short-term sequence. In principle, this measure reflects the same property as the AO [21] measure, but, since it is computed from the VOT reset-based experiment, it does not suffer from the large variance and has a clear definition of what the short-term sequence means. VOT2014 [17] pointed out that speed is an important factor in many applications and introduced a speed measure called the equivalent filter operations (EFO) that partially accounts for the speed of computer used for tracker analysis.
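The difference between sequence-normalized and sequence-pooled ranking can be illustrated on toy numbers. This is a simplified sketch under stated assumptions: it ranks raw scores only, omits the statistical significance and practical difference tests that the VOT methodology applies, and approximates pooling by averaging per-sequence scores. The tracker names, sequence names and accuracy values are hypothetical.

```python
def rank(scores):
    """Rank trackers by score, 1 = best (higher is better); ties are not treated specially."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def sequence_normalized_rank(per_sequence):
    """Rank the trackers on each sequence, then average the ranks over sequences."""
    trackers = list(next(iter(per_sequence.values())))
    totals = {t: 0.0 for t in trackers}
    for scores in per_sequence.values():
        for t, r in rank(scores).items():
            totals[t] += r
    return {t: totals[t] / len(per_sequence) for t in trackers}


def sequence_pooled_rank(per_sequence):
    """Average the scores over all sequences, then rank once."""
    trackers = list(next(iter(per_sequence.values())))
    means = {t: sum(s[t] for s in per_sequence.values()) / len(per_sequence)
             for t in trackers}
    return rank(means)


# Hypothetical per-sequence accuracies for two trackers.
accuracy = {
    "seq1": {"A": 0.60, "B": 0.55},
    "seq2": {"A": 0.60, "B": 0.55},
    "seq3": {"A": 0.10, "B": 0.50},
}
normalized = sequence_normalized_rank(accuracy)
pooled = sequence_pooled_rank(accuracy)
```

In this example tracker A wins two of the three per-sequence "competitions" and so obtains the better averaged rank, while tracker B has the higher average accuracy and is ranked first after pooling.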

The VOT2015 [18] noted that state-of-the-art performance is often misinterpreted as requiring a tracker to score as number one on a benchmark, often leading authors to creatively select sequences and experiments and omit related trackers in scientific papers to reach the apparent top performance. To expose this misconception, the VOT2015 computed the average performance of the participating trackers that were published at top recent conferences. This value is called the VOT2015 state-of-the-art bound and any tracker exceeding this performance on the VOT2015 benchmark should be considered state-of-the-art according to the VOT standards.

Datasets. The current trend in computer vision dataset construction appears to be focused on increasing the number of sequences in the datasets [22–25, 27], but often much less attention is paid to the quality of their content and annotation. For example, some datasets disproportionately mix grayscale and color sequences, and in most datasets attributes like occlusion and illumination change are annotated only globally, even though they may occur in only a small number of frames in a video. The dataset size is commonly assumed to imply quality. In contrast, the VOT2013 [16] argued that large datasets do not necessarily imply diversity or richness in attributes. Over the last three years, the VOT has developed a methodology that automatically constructs a moderately sized dataset from a large pool of sequences. The uniqueness of this methodology is that it explicitly optimizes diversity in visual attributes while focusing on sequences which are difficult to track. In addition, the sequences in the VOT datasets are per-frame annotated by visual attributes, which is in stark contrast to the related datasets that apply global annotation. It was recently shown [19] that performance measures computed from global attribute annotations are significantly biased toward the dominant attributes in the sequences, while the bias is significantly reduced with per-frame annotation, even in the presence of misannotations.

The works most closely related to the work described in this paper are the recent VOT2013 [16], VOT2014 [17] and VOT2015 [18] challenges. Several novelties in benchmarking short-term trackers were introduced through these challenges. They provide a cross-platform evaluation kit with a tracker-toolkit communication protocol, allowing easy integration with third-party trackers, per-frame annotated datasets and a state-of-the-art performance evaluation methodology for in-depth tracker analysis from several performance aspects. The results were published in joint papers [16–18], of which the VOT2015 [18] paper alone exceeded 120 coauthors. The evaluation kit, the dataset, the tracking outputs and the code to reproduce all the results are made freely available from the VOT initiative homepage (Footnote 7). The advances proposed by VOT have also influenced the development of related methodologies and benchmark papers like [23–25].

1.2 The VOT2016 Challenge

VOT2016 follows the VOT2015 challenge and considers the same class of trackers. The dataset and evaluation toolkit are provided by the VOT2016 organizers. The evaluation kit records the output bounding boxes from the tracker and, if it detects a tracking failure, re-initializes the tracker. The authors participating in the challenge were required to integrate their tracker into the VOT2016 evaluation kit, which automatically performed a standardized experiment. The results were analyzed by the VOT2016 evaluation methodology. In addition to the VOT reset-based experiment, the toolkit conducted the main OTB [21] experiment, in which a tracker is initialized in the first frame and left to track until the end of the sequence without resetting. The performance on this experiment is evaluated by the average overlap measure [21].

Participants were expected to submit a single set of results per tracker. Participants who have investigated several trackers submitted a single result per tracker. Changes in the parameters did not constitute a different tracker. The tracker was required to run with fixed parameters on all experiments. The tracking method itself was allowed to internally change specific parameters, but these had to be set automatically by the tracker, e.g., from the image size and the initial size of the bounding box, and were not to be set by detecting a specific test sequence and then selecting the parameters that were hand-tuned to this sequence. The organizers of VOT2016 were allowed to participate in the challenge, but did not compete for the winner of VOT2016 challenge title. Further details are available from the challenge homepage (Footnote 8).

The advances of VOT2016 over VOT2013, VOT2014 and VOT2015 are the following: (i) The ground truth bounding boxes in the VOT2015 dataset have been re-annotated. Each frame in the VOT2015 dataset has been manually per-pixel segmented and bounding boxes have been automatically generated from the segmentation masks. (ii) A new methodology was developed for automatic placement of a bounding box by optimizing a well-defined cost function on manually per-pixel segmented images. (iii) The evaluation system from VOT2015 [18] is extended and the bounding box overlap estimation is constrained to the image region. The toolkit now supports the OTB [21] no-reset experiment and its main performance measures. (iv) The VOT2015 introduced a second sub-challenge, VOT-TIR2015, held under the VOT umbrella, which deals with tracking in infrared and thermal imagery [28]. Similarly, the VOT2016 is accompanied by VOT-TIR2016, and the challenge and its results are discussed in a separate paper submitted to the VOT2016 workshop [29].

The remainder of this paper is structured as follows. In Sect. 2, the new dataset is introduced. The methodology is outlined in Sect. 3, the main results are discussed in Sect. 4 and conclusions are drawn in Sect. 5.

2 The VOT2016 Dataset

VOT2013 [16] and VOT2014 [17] introduced a semi-automatic sequence selection methodology to construct a dataset rich in visual attributes but small enough to keep the time for performing the experiments reasonably low. In VOT2015 [18], the methodology was extended into a fully automated sequence selection with the selection process focusing on challenging sequences. The methodology was applied in VOT2015 [18] to produce a highly challenging VOT2015 dataset.

Results of VOT2015 showed that the dataset was not saturated and the same sequences were used for VOT2016. The VOT2016 dataset thus contains all 60 sequences from VOT2015, where each sequence is per-frame annotated by the following visual attributes: (i) occlusion, (ii) illumination change, (iii) motion change, (iv) size change, (v) camera motion. In case a particular frame did not correspond to any of the five attributes, we denoted it as (vi) unassigned.

In VOT2015, the rotated bounding boxes were manually placed in each frame of the sequence by experts and cross-checked by several groups for quality control. To enforce consistency, annotation rules were specified. Nevertheless, we have noticed that human annotators have difficulty following the annotation rules, which makes it impossible to guarantee annotation consistency. For this reason, we have developed a novel approach for dataset annotation. The new approach takes a pixel-wise segmentation of the tracked object and places a bounding box by optimizing a well-defined cost function. In the following, Sect. 2.1 discusses per-frame segmentation mask construction and the new bounding box generation approach is presented in Sect. 2.2.

2.1 Producing Per-frame Segmentation Masks

The per-frame segmentations were provided for VOT by a research group that applied an interactive annotation tool designed by VOT (Footnote 9) for manual segmentation mask construction. The tool applies GrabCut [30] object segmentation on each frame. The color model is initialized from the VOT2015 ground truth bounding box (first frame) or propagated from the final segmentation in the previous frame. The user can interactively add foreground or background examples to improve the segmentation. Examples of the object segmentations are illustrated in Fig. 1.

2.2 Automatic Bounding Box Computation

The final ground truth bounding box for VOT2016 was automatically computed on each frame from the corresponding segmentation mask. We have designed the following cost function and constraints to reflect the requirement that the bounding box should capture object pixels with minimal amount of background pixels:

$$\begin{aligned}&\underset{\mathbf{b}}{\arg \min } \left\{ C(\mathbf{b}) = \alpha \sum _{\mathbf{x} \notin A(\mathbf{b})} \left[ \mathrm {M}(\mathbf{x}) > 0 \right] + \sum _{\mathbf{x} \in A(\mathbf{b})} \left[ \mathrm {M}(\mathbf{x}) == 0 \right] \right\} , \\&\text {subject to} \quad \frac{1}{\mathrm {M}_f}\sum _{\mathbf{x} \notin A(\mathbf{b})} \left[ \mathrm {M}(\mathbf{x}) > 0 \right] < \varTheta _{f}, \quad \frac{1}{|A(\mathbf{b})|}\sum _{\mathbf{x} \in A(\mathbf{b})} \left[ \mathrm {M}(\mathbf{x}) == 0 \right] < \varTheta _{b}, \end{aligned} \tag{1}$$

where \(\mathbf{b}\) is the vector of bounding box parameters (center, width, height, rotation), \(A(\mathbf{b})\) is the corresponding bounding box, \(\mathrm {M}\) is the segmentation mask which is non-zero for object pixels, \([\cdot ]\) is an operator which returns 1 iff the statement in the operator is true and 0 otherwise, \(\mathrm {M}_f\) is the number of object pixels and \(|\cdot |\) denotes the cardinality. An intuitive interpretation of the cost function is that we want to find a bounding box which minimizes a weighted sum of the number of object pixels outside of the bounding box and the number of background pixels inside the bounding box, with the percentage of excluded object pixels and included background pixels constrained by \(\varTheta _{f}\) and \(\varTheta _{b}\), respectively. The cost (1) was optimized by Interior Point [31] optimization, with three starting points: (i) the VOT2015 ground truth bounding box, (ii) a minimal axis-aligned bounding box containing all object pixels and (iii) a minimal rotated bounding box containing all object pixels. In case a solution satisfying the constraints was not found, a relaxed unconstrained BFGS Quasi-Newton method [32] was applied. Such cases occurred for highly articulated objects. The bounding box tightness is controlled by the parameter \(\alpha \). Several values, i.e., \(\alpha = \{1,4,7,10\}\), were tested on randomly chosen sequences and the final value \(\alpha = 4\) was selected since its bounding boxes were visually assessed to be the best-fitting. The constraints \(\varTheta _{f} = 0.1\) and \(\varTheta _{b} = 0.4\) were set to the values defined in previous VOT challenges. Examples of the automatically estimated ground truth bounding boxes are shown in Fig. 1.
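To make the cost (1) and its constraints concrete, the sketch below evaluates them for a candidate box on a small binary mask. It is an illustration only and makes several simplifying assumptions that are not part of the VOT2016 procedure: the box is axis-aligned (no rotation), given as hypothetical `(x0, y0, x1, y1)` pixel bounds with exclusive upper limits, lies fully inside the image, and no optimizer is run. The function names and the toy mask are made up.

```python
def box_cost(mask, box, alpha=4.0):
    """Return (cost, fg_out, bg_in) for an axis-aligned box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    fg_out = 0  # object pixels left outside the box
    bg_in = 0   # background pixels captured inside the box
    for y, row in enumerate(mask):
        for x, m in enumerate(row):
            inside = x0 <= x < x1 and y0 <= y < y1
            if m > 0 and not inside:
                fg_out += 1
            elif m == 0 and inside:
                bg_in += 1
    return alpha * fg_out + bg_in, fg_out, bg_in


def satisfies_constraints(mask, box, theta_f=0.1, theta_b=0.4):
    """Check the two fraction constraints; assumes the box lies inside the image."""
    _, fg_out, bg_in = box_cost(mask, box)
    num_object = sum(m > 0 for row in mask for m in row)
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    return fg_out / num_object < theta_f and bg_in / area < theta_b


# Toy mask with 10 object pixels.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
full_box = (1, 1, 5, 4)   # contains all object pixels plus 2 background pixels
tight_box = (1, 1, 4, 4)  # no background inside, but cuts off 1 object pixel
```

With \(\alpha = 4\), `full_box` has a lower cost than `tight_box`, since excluding a single object pixel is penalized four times as much as including a background pixel. `tight_box` also violates the foreground constraint, because exactly \(10\,\%\) of the object pixels fall outside it and the inequality is strict.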

All bounding boxes were visually verified to avoid poor fits due to potential segmentation errors. We identified \(12\,\%\) of such cases and reverted to the VOT2015 ground truth for those. During the challenge, the community identified four frames where the new ground truth is incorrect and those errors were not caught by the verification. In these cases, the bounding box within the image bounds was properly estimated, but extended out of the image bounds disproportionately. These errors will be corrected in the next version of the dataset, and we checked during result processing that they did not significantly influence the challenge results. Table 1 summarizes the comparison of the VOT2016 automatic ground truth with the VOT2015 ground truth in terms of portions of object and background pixels inside the bounding boxes. The statistics were computed over the whole dataset, excluding the \(12\,\%\) of frames where the segmentation was marked as incorrect. The VOT2016 ground truth improves in all aspects over the VOT2015. It is interesting to note that the average overlap between the VOT2015 and VOT2016 ground truth is 0.74.

Table 1. The first two columns show the percentage and number of frames annotated by the VOT2016 and VOT2015 methodology, respectively. The fg-out and bg-in denote the average percentage of object pixels outside and percentage of background pixels inside the GT, respectively. The average overlap with the VOT2015 annotations is denoted by Avg. overlap, while the #opt. failures denotes the number of frames in which the algorithm switched from constrained to unconstrained optimization.

2.3 Uncertainty of Optimal Bounding Box Fits

The cost function described in Sect. 2.2 avoids the subjectivity of manual bounding box fitting, but does not specify how well constrained the solution is. The level of constraint strength can be expressed in terms of the average overlap of bounding boxes in the vicinity of the cost function (1) optimum, where we define the vicinity as a variation of bounding boxes within a maximum increase of the cost function around the optimum. The relative maximum increase of the cost function, i.e., the increase divided by the optimal value, is related to the annotation uncertainty in the per-pixel segmentation masks and can be estimated by the following rule of thumb.

Let \(S_f\) and \(S_b\) denote the number of object and background pixels inside and outside of the bounding box, respectively. According to the central limit theorem, we can assume that \(S_f\) and \(S_b\) are normally distributed, i.e., \(\mathcal {N}(\mu _f, \sigma _f^2)\) and \(\mathcal {N}(\mu _b, \sigma _b^2)\), since they are sums of many random variables (per-pixel labels). In this respect, the value of the cost function C in (1) can be treated as a random variable as well and it is easy to show the following relation \(\text {var}(C) = \sigma _c^2 = \alpha ^2 \sigma _f^2 + \sigma _b^2\). The variance of the cost function is implicitly affected by the per-pixel annotation uncertainty through the variances \(\sigma _f^2\) and \(\sigma _b^2\). Assume that at most \(x \mu _f\) and \(x \mu _b\) pixels are incorrectly labeled on average. Since nearly all variation in a Gaussian is captured by three standard deviations, the variances are \(\sigma _f^2 = ({x \mu _f/3})^2\) and \(\sigma _b^2 = ({x \mu _b/3})^2\). Applying the three-sigma rule to the variance of the cost C, and using the definition of the foreground and background variances, gives an estimator of the maximal cost function change \(\varDelta _c = 3\sigma _c = x \sqrt{\alpha ^2 \mu _f^2 + \mu _b^2}\). Our goal is to estimate the maximal relative cost function change in the vicinity of its optimum \(C_\mathrm {opt}\), i.e., \(r_\mathrm {max} = \frac{\varDelta _c}{C_\mathrm {opt}}\). Using the definition of the maximal change \(\varDelta _c\), the rule of thumb for the maximal relative change is

$$\begin{aligned} r_\mathrm {max} = \frac{ x \sqrt{\alpha ^2 \mu _f^2 + \mu _b^2} }{ \mu _f + \mu _b}. \end{aligned} \tag{2}$$
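The derivation of the rule of thumb can be followed step by step in code. The sketch below is only an illustration: the function name is made up and the pixel counts passed to it are hypothetical placeholders, not statistics from the VOT2016 dataset.

```python
import math


def max_relative_cost_change(mu_f, mu_b, x, alpha=4.0):
    """Rule-of-thumb estimate of the maximal relative change of the cost function."""
    # At most x * mu pixels are mislabeled; three sigma are assumed to cover that range.
    sigma_f = x * mu_f / 3.0
    sigma_b = x * mu_b / 3.0
    # var(C) = alpha^2 * sigma_f^2 + sigma_b^2
    sigma_c = math.sqrt(alpha ** 2 * sigma_f ** 2 + sigma_b ** 2)
    # Three-sigma rule applied to the cost itself gives the maximal change.
    delta_c = 3.0 * sigma_c
    # Relative change with respect to the optimum mu_f + mu_b.
    return delta_c / (mu_f + mu_b)


# Hypothetical mean pixel counts and a 10 % labeling error rate.
mu_f, mu_b, x, alpha = 20.0, 100.0, 0.1, 4.0
r_max = max_relative_cost_change(mu_f, mu_b, x, alpha)
closed_form = x * math.sqrt(alpha ** 2 * mu_f ** 2 + mu_b ** 2) / (mu_f + mu_b)
```

The step-by-step value `r_max` coincides with the closed-form expression, since the factors of three introduced by the two applications of the three-sigma rule cancel.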

3 Performance Evaluation Methodology

Since VOT2015 [18], three primary measures are used to analyze tracking performance: accuracy (A), robustness (R) and expected average overlap (EAO). In the following these are briefly overviewed and we refer to [18–20] for further details. The VOT challenges apply a reset-based methodology. Whenever a tracker predicts a bounding box with zero overlap with the ground truth, a failure is detected and the tracker is re-initialized five frames after the failure. Čehovin et al. [20] identified two highly interpretable weakly correlated performance measures to analyze tracking behavior in reset-based experiments: (i) accuracy and (ii) robustness. The accuracy is the average overlap between the predicted and ground truth bounding boxes during successful tracking periods. On the other hand, the robustness measures how many times the tracker loses the target (fails) during tracking. The potential bias due to resets is reduced by ignoring ten frames after re-initialization in the accuracy measure, which is quite a conservative margin [19]. Stochastic trackers are run 15 times on each sequence to reduce the variance of their results. The per-frame accuracy is obtained as an average over these runs. Averaging per-frame accuracies gives per-sequence accuracy, while per-sequence robustness is computed by averaging failure rates over different runs. The third primary measure, called the expected average overlap (EAO), is an estimator of the average overlap a tracker is expected to attain on a large collection of short-term sequences with the same visual properties as the given dataset. This measure addresses the problem of increased variance and bias of the AO [21] measure due to variable sequence lengths on practical datasets. Please see [18] for further details on the expected average overlap measure.
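The reset-based protocol can be summarized in a short simulation. This is a sketch under stated assumptions rather than the toolkit implementation: the tracker is replaced by a hypothetical callback `overlap_fn(frame, init_frame)` that returns the overlap obtained at a frame given the frame of the last (re-)initialization, the ten ignored frames are assumed to include the re-initialization frame, and no frames are ignored after the very first initialization.

```python
def reset_based_run(num_frames, overlap_fn, skip=5, burn_in=10):
    """Simulate one reset-based run; returns (accuracy, failures, counted_frames)."""
    counted = []          # overlaps that enter the accuracy measure
    failures = 0
    init_frame = 0
    reinitialized = False
    t = 0
    while t < num_frames:
        overlap = overlap_fn(t, init_frame)
        if overlap <= 0.0:
            # Zero overlap: register a failure and re-initialize `skip` frames later.
            failures += 1
            init_frame = t + skip
            reinitialized = True
            t = init_frame
            continue
        # Ignore the first `burn_in` frames after a re-initialization.
        in_burn_in = reinitialized and (t - init_frame) < burn_in
        if not in_burn_in:
            counted.append(overlap)
        t += 1
    accuracy = sum(counted) / len(counted) if counted else 0.0
    return accuracy, failures, len(counted)


def toy_tracker(frame, init_frame):
    """Hypothetical tracker: constant overlap 0.6, one failure at frame 20."""
    return 0.0 if (frame == 20 and init_frame == 0) else 0.6


accuracy, failures, counted_frames = reset_based_run(50, toy_tracker)
```

In this toy run the tracker fails once at frame 20 and is re-initialized at frame 25. Frames 25 to 34 are excluded from the accuracy, so 35 of the 50 frames contribute to the average overlap.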

We adopt the VOT2015 ranking methodology that accounts for statistical significance and practical differences to rank trackers separately with respect to the accuracy and robustness [18, 19]. Apart from accuracy, robustness and expected overlaps, the tracking speed is also an important property that indicates the practical usefulness of trackers in particular applications. To reduce the influence of hardware, the VOT2014 [17] introduced a new unit for reporting the tracking speed called equivalent filter operations (EFO), which reports the tracker speed in terms of a predefined filtering operation that the toolkit automatically carries out prior to running the experiments. The same tracking speed measure is used in VOT2016.

In addition to the standard reset-based VOT experiment, the VOT2016 toolkit carried out the OTB [21] no-reset experiment. The tracking performance on this experiment was evaluated by the primary OTB measure, average overlap (AO).

4 Analysis and Results

4.1 Practical Difference Estimation

As noted in Sect. 2.3, the variation in the per-pixel segmentation masks introduces the uncertainty of the optimally fitted ground truth bounding boxes. We expressed this uncertainty as the average overlap of the optimal bounding box with the bounding boxes sampled in vicinity of the optimum, which is implicitly defined as the maximal allowed cost increase. Assuming that on average, at most \(10\,\%\) of pixels might be incorrectly assigned in the object mask, the rule of thumb (2) estimates an increase of cost function by at most \(7\,\%\). The average overlap specified in this way was used in the VOT2016 as an estimate of the per-sequence practical differences.

The following approach was thus applied to estimate the practical difference thresholds. Thirty uniformly dispersed frames were selected per sequence. For each frame a set of 3125 ground truth bounding box perturbations were generated by varying the ground truth regions by \( {{\varvec{\Delta }}_\mathbf{b }} = \left[ \varDelta _x, \varDelta _y, \varDelta _w, \varDelta _h, \varDelta _\varTheta \right] \), where all \(\varDelta \) are sampled uniformly (5 samples) from ranges \(\pm 5\,\%\) of ground truth width (height) for \(\varDelta _x\)(\(\varDelta _y\)), \(\pm 10\,\%\) of ground truth width (height) for \(\varDelta _w\)(\(\varDelta _h\)) and \(\pm 4^\circ \) for \(\varDelta _\varTheta \). These ranges were chosen such that the cost function is well explored near the optimal solution and the amount of bounding box perturbations can be computed reasonably fast. The examples of bounding boxes generated in this way are shown in Fig. 1. An average overlap was computed between the ground truth bounding box and the bounding boxes that did not exceed the optimal cost value by more than \(7\,\%\). The average of the average overlaps computed in thirty frames was taken as the estimate of the practical difference threshold for a given sequence. The boxplots in Fig. 1 visualize the distributions of average overlaps with respect to the sequences.
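The perturbation grid and the thresholding step can be sketched as follows. The sketch assumes that "sampled uniformly (5 samples)" means five evenly spaced values per parameter, which yields \(5^5 = 3125\) combinations. The function names are hypothetical, and the overlap and cost of each perturbed box are assumed to be computed elsewhere and passed in as pairs.

```python
from itertools import product


def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi, inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def perturbations(width, height, n=5):
    """All combinations of (dx, dy, dw, dh, dtheta) offsets for one ground truth box."""
    dx = linspace(-0.05 * width, 0.05 * width, n)     # +-5 % of width
    dy = linspace(-0.05 * height, 0.05 * height, n)   # +-5 % of height
    dw = linspace(-0.10 * width, 0.10 * width, n)     # +-10 % of width
    dh = linspace(-0.10 * height, 0.10 * height, n)   # +-10 % of height
    dtheta = linspace(-4.0, 4.0, n)                   # +-4 degrees
    return list(product(dx, dy, dw, dh, dtheta))


def practical_difference(samples, c_opt, r_max=0.07):
    """Average overlap of perturbed boxes whose cost stays within r_max of the optimum.

    samples: list of (overlap_with_ground_truth, cost) pairs, computed elsewhere.
    """
    kept = [o for o, c in samples if c <= (1.0 + r_max) * c_opt]
    return sum(kept) / len(kept)


grid = perturbations(width=100.0, height=50.0)
# Hypothetical (overlap, cost) pairs for three perturbed boxes, optimum cost 10.
threshold = practical_difference([(0.9, 10.5), (0.8, 10.6), (0.5, 12.0)], c_opt=10.0)
```

Averaging the value returned by `practical_difference` over the thirty selected frames would then give the per-sequence threshold described above.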

Fig. 1. Box plots of per-sequence overlap dispersion at \(7\,\%\) cost change (left), and examples of such bounding boxes (right). The optimal bounding box is depicted in red, while the \(7\,\%\) cost change bounding boxes are shown in green. (Color figure online)

4.2 Trackers Submitted

In total, 48 valid entries were submitted to the VOT2016 challenge. Each submission included the binaries/source code that was used by the VOT2016 committee for results verification. The VOT2016 committee and associates additionally contributed 22 baseline trackers. For these, the default parameters were selected, or, when not available, were set to reasonable values. Thus, 70 trackers were tested in the VOT2016 challenge. In the following we briefly overview the entries and provide the references to original papers in Appendix A where available.

Eight trackers were based on convolutional neural network architectures for target localization, MLDF (A.19), SiamFC-R (A.23), SiamFC-A (A.25), TCNN (A.44), DNT (A.41), SO-DLT (A.8), MDNet-N (A.46) and SSAT (A.12), where MDNet-N (A.46) and SSAT (A.12) were extensions of the VOT2015 winner MDNet [33]. Thirteen trackers were variations of correlation filters, SRDCF (A.58), SWCF (A.3), FCF (A.7), GCF (A.36), ART-DSST (A.45), DSST2014 (A.50), SMACF (A.14), STC (A.66), DFST (A.39), KCF2014 (A.53), SAMF2014 (A.54), OEST (A.31) and sKCF (A.40). Seven trackers combined correlation filter outputs with color, Staple (A.28), Staple+ (A.22), MvCFT (A.15), NSAMF (A.21), SSKCF (A.27), ACT (A.56) and ColorKCF (A.29), and six trackers applied CNN features in the correlation filters, deepMKCF (A.16), HCF (A.60), DDC (A.17), DeepSRDCF (A.57), C-COT (A.26) and RFD-CF2 (A.47). Two trackers were based on structured SVM, Struck2011 (A.55) and EBT (A.2), which applied region proposals as well. Three trackers were based purely on color, DAT (A.5), SRBT (A.34) and ASMS (A.49), and one tracker was based on a fusion of basic features, LoFT-Lite (A.38). One tracker was based on subspace learning, IVT (A.64), one tracker was based on boosting, MIL (A.68), one tracker was based on the complex cells approach, CCCT (A.20), one on distribution fields, DFT (A.59), one tracker was based on Gaussian process regressors, TGPR (A.67), and one tracker was the basic normalized cross correlation tracker NCC (A.61). Nineteen submissions can be categorized as part-based trackers, DPCF (A.1), LT-FLO (A.43), SHCT (A.24), GGTv2 (A.18), MatFlow (A.10), Matrioska (A.11), CDTT (A.13), BST (A.30), TRIC-track (A.32), DPT (A.35), SMPR (A.48), CMT (A.70), HT (A.65), LGT (A.62), ANT (A.63), FoT (A.51), FCT (A.37), FT (A.69), and BDF (A.9). Several submissions were based on a combination of base trackers, PKLTF (A.4), MAD (A.6), CTF (A.33), SCT (A.42) and HMMTxD (A.52).

4.3 Results

The results are summarized in sequence-pooled and attribute-normalized AR-raw plots in Fig. 2. The sequence-pooled AR-rank plot is obtained by concatenating the results from all sequences and creating a single rank list, while the attribute-normalized AR-rank plot is created by ranking the trackers over each attribute and averaging the rank lists. The AR-raw plots were constructed in similar fashion. The expected average overlap curves and expected average overlap scores are shown in Fig. 3. The raw values for the sequence-pooled results and the average overlap scores are also given in Table 2.

The top ten trackers come from various classes. The TCNN (A.44), SSAT (A.12), MLDF (A.19) and DNT (A.41) are derived from CNNs, the C-COT (A.26), DDC (A.17), Staple (A.28) and Staple+ (A.22) are variations of correlation filters with more or less complex features, the EBT (A.2) is a structured SVM edge-feature tracker, while the SRBT (A.34) is a color-based saliency detection tracker. The following five trackers appear either very robust or very accurate: C-COT (A.26), TCNN (A.44), SSAT (A.12), MLDF (A.19) and EBT (A.2). The C-COT (A.26) is a new correlation filter which uses a large variety of state-of-the-art features, i.e., HOG [34], color-names [35] and the vgg-m-2048 CNN features pretrained on Imagenet (Footnote 10). The TCNN (A.44) samples target locations and scores them by several CNNs, which are organized into a tree structure for efficiency and are evolved/pruned during tracking. SSAT (A.12) is based on MDNet [33] and applies segmentation and scale regression, followed by occlusion detection to prevent training from corrupt samples. The MLDF (A.19) applies a pre-trained VGG network [36] which is followed by another, adaptive, network with Euclidean loss to regress to the target position. According to the EAO measure, the top performing tracker was C-COT (A.26) [37], closely followed by the TCNN (A.44). Detailed analysis of the AR-raw plots shows that the TCNN (A.44) produced slightly greater average overlap (0.55) than C-COT (A.26) (0.54), but failed slightly more often (by six failures). The best overlap was achieved by SSAT (A.12) (0.58), which might be attributed to the combination of segmentation and scale regression this tracker applies. The smallest number of failures was achieved by the MLDF (A.19), which outperformed C-COT (A.26) by a single failure, but obtained a much smaller overlap (0.49). Under the VOT strict ranking protocol, the SSAT (A.12) is ranked number one in accuracy, meaning the overlap was clearly higher than for any other tracker. The second-best ranked tracker in accuracy is Staple+ (A.22) and several trackers share the third rank, SHCT (A.24), deepMKCF (A.16) and FCF (A.7), meaning that the null hypothesis of difference between these trackers in accuracy could not be rejected. In terms of robustness, trackers MDNet-N (A.46), C-COT (A.26), MLDF (A.19) and EBT (A.2) share the first place, which means that the null hypothesis of difference in their robustness could not be rejected. The second and third ranks in robustness are occupied by TCNN (A.44) and SSAT (A.12), respectively.

Fig. 2. The AR-rank plots and AR-raw plots generated by sequence pooling (left) and attribute normalization (right).

Fig. 3. Expected average overlap curve (left) and expected average overlap graph (right) with trackers ranked from right to left. The right-most tracker is the top-performing according to the VOT2016 expected average overlap values. See Fig. 2 for legend. The dashed horizontal line denotes the average performance of fourteen state-of-the-art trackers published in 2015 and 2016 at major computer vision venues. These trackers are denoted by gray circles in the bottom part of the graph.

It is worth pointing out that some EAO results appear to contradict the AR-raw measures at first glance. For example, the Staple obtains a higher EAO measure than Staple+, even though the Staple+ achieves a slightly better average accuracy and in fact improves on Staple by two failures, indicating a greater robustness. The reason is that failures early in a sequence globally contribute more to the penalty than failures that occur at the end of the sequence (see [18] for the definition of EAO). For example, if a tracker fails once and is re-initialized in the sequence, it generates two sub-sequences for computing the overlap measure at sequence length N. The first sub-sequence ends with the failure and will contribute to any sequence length N since zero overlaps are added after the failure. But the second sub-sequence ends with the sequence end and zeros cannot be added after that point. Thus the second sub-sequence only contributes to the overlap computations for sequence lengths N smaller than its length. This means that re-inits very close to the sequence end (tens of frames) do not affect the EAO.
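The effect of the failure position can be reproduced with a small Python sketch of the padding rule described above. It is a simplified illustration rather than the toolkit implementation (see [18] for the exact definition): the function names, the constant overlap of 0.6, the five frames skipped before re-initialization and the length interval from 50 to 100 are all assumptions chosen for the example, and a sub-sequence that reaches the sequence end is assumed to contribute for lengths up to and including its own.

```python
def expected_overlap_at(segments, n):
    """Average overlap at sequence length n.

    segments: list of (overlaps, ended_in_failure) sub-sequences.
    A sub-sequence that ended in a failure is zero-padded to length n.
    A sub-sequence that reached the sequence end cannot be padded and
    is used only if it is at least n frames long.
    """
    values = []
    for overlaps, ended_in_failure in segments:
        if ended_in_failure:
            padded = (overlaps + [0.0] * n)[:n]
        elif len(overlaps) >= n:
            padded = overlaps[:n]
        else:
            continue  # too short and not ended by a failure
        values.append(sum(padded) / n)
    return sum(values) / len(values) if values else float("nan")


def expected_average_overlap(segments, n_lo, n_hi):
    """Mean of the expected overlap curve over lengths n_lo..n_hi."""
    curve = [expected_overlap_at(segments, n) for n in range(n_lo, n_hi + 1)]
    return sum(curve) / len(curve)


# Hypothetical 100-frame sequence, overlap 0.6 while tracking,
# five frames skipped between a failure and the re-initialization.
early_failure = [([0.6] * 10, True), ([0.6] * 85, False)]
late_failure = [([0.6] * 90, True), ([0.6] * 5, False)]

eao_early = expected_average_overlap(early_failure, 50, 100)
eao_late = expected_average_overlap(late_failure, 50, 100)
```

Both hypothetical trackers fail exactly once and have the same overlap while tracking, yet the tracker that fails late obtains the higher score, since its short final sub-sequence never enters the computation and its first sub-sequence needs little zero padding.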

Note that the trackers that are usually used as baselines, i.e., MIL (A.68) and IVT (A.64), are positioned at the lower part of the AR-plots and the EAO ranks, which indicates that the majority of submitted trackers are considered state-of-the-art. In fact, fourteen tested trackers have been recently (in 2015 and 2016) published at major computer vision conferences and journals. These trackers are indicated in Fig. 3, along with the average state-of-the-art performance computed from the average performance of these trackers, which constitutes a very strict VOT2016 state-of-the-art bound. Approximately 22\(\%\) of submitted trackers exceed this bound.

Table 2. The table shows expected average overlap (EAO), accuracy and robustness raw values (A,R) and ranks (\(A_\mathrm {rank},R_\mathrm {rank}\)), the no-reset average overlap AO [21], the speed (in EFO units) and implementation details (M is Matlab, C is C or C++, P is Python). Trackers marked with * have been verified by the VOT2016 committee. A dash “-” indicates the EFO measurements were invalid.

The number of failures with respect to the visual attributes is shown in Fig. 4. On the camera motion attribute, the tracker that fails least often is the EBT A.2, on illumination change the top position is shared by RFD\(\_\)CF2 A.47 and SRBT A.34, on motion change the top position is shared by EBT A.2 and MLDF A.19, on occlusion the top position is shared by MDNet\(\_\)N A.46 and C-COT A.26, on the size change attribute the tracker MLDF A.19 produces the fewest failures, while on the unassigned attribute the TCNN A.44 fails the least often. The overall accuracy and robustness averaged over the attributes are shown in Fig. 2. The attribute-normalized AR plots are similar to the pooled plots, but the top trackers (TCNN A.44, SSAT A.12, MDNet\(\_\)N A.46 and C-COT A.26) are pulled close together, which is evident from the ranking plots.

Fig. 4. The expected average overlap with respect to the visual attributes (left). Expected average overlap scores w.r.t. the tracking speed in EFO units (right). The dashed vertical line denotes the estimated real-time performance threshold of 20 EFO units. See Fig. 2 for legend.

We have evaluated the difficulty level of each attribute by computing the median of robustness and accuracy over each attribute. According to the results in Table 3, the most challenging attributes in terms of failures are occlusion, motion change and illumination change, followed by size change and camera motion.

Table 3. Tracking difficulty with respect to the following visual attributes: camera motion (cam. mot.), illumination change (ill. ch.), motion change (mot. ch.), occlusion (occl.) and size change (scal. ch.).

In addition to the baseline reset-based VOT experiment, the VOT2016 toolkit also performed the OTB [21] no-reset (OPE) experiment. Figure 5 shows the OPE plots, while the AO overall measure is given in Table 2. According to the AO measure, the three top performing trackers are SSAT (A.12), TCNN (A.44) and C-COT (A.26), which is similar to the EAO ranking, with the main difference that SSAT and C-COT exchange places. The reason for this switch can be deduced from the AR plots (Fig. 2), which show that the C-COT is more robust than the other two trackers, while the SSAT is more accurate. Since the AO measure does not apply resets, it does not enhance the differences among the trackers on difficult sequences, where one tracker might fail more often than the other, whereas the EAO is affected by these. Thus among the trackers with similar accuracy and robustness, the EAO prefers trackers with higher robustness, while the AO prefers more accurate trackers. To establish a visual relation among the EAO and AO rankings, each tracker is shown in a 2D plot in terms of the EAO and AO measures in Fig. 5. Broadly speaking, the measures are correlated and EAO is usually lower than AO, but the local ordering with these measures is different, which is due to the different treatment of failures.
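The link between an OPE success plot and the AO measure can be checked numerically with a short Python sketch. It relies on the general identity that the area under the success curve (the fraction of frames whose overlap exceeds a threshold, integrated over thresholds from 0 to 1) equals the mean overlap; the function names, the number of thresholds and the overlap values below are hypothetical and serve only as an illustration.

```python
def success_curve(overlaps, num_thresholds=1001):
    """Fraction of frames whose overlap exceeds each threshold in [0, 1]."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return thresholds, rates


def area_under_curve(xs, ys):
    """Trapezoidal approximation of the area under a sampled curve."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))


# Hypothetical per-frame overlaps from a single no-reset run; the zero
# stands for a frame in which the tracker has drifted off the target.
overlaps = [0.0, 0.2, 0.5, 0.9]

thresholds, rates = success_curve(overlaps)
auc = area_under_curve(thresholds, rates)
average_overlap = sum(overlaps) / len(overlaps)
```

With a sufficiently fine threshold grid, `auc` and `average_overlap` agree up to the discretization error, so the AO value summarizes the same information as the area under the success plot.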

Fig. 5. The OPE no-reset plots (left) and the EAO-AO scatter plot (right).

Apart from tracking accuracy, robustness and the EAO measure, the tracking speed is also crucial in many realistic tracking applications. We therefore visualize the EAO score with respect to the tracking speed measured in EFO units in Fig. 4. To put EFO units into perspective, a C++ implementation of an NCC tracker provided in the toolkit runs at an average of 140 frames per second on a laptop with an Intel Core i5-2557M processor, which equals approximately 200 EFO units. All trackers that scored top EAO performed below realtime, while the top EFO was achieved by NCC (A.61), BDF (A.9) and FoT (A.51). Among the trackers within the VOT2016 realtime bound, the top two trackers in terms of EAO score were Staple+ (A.22) and SSKCF (A.27). The former is a modification of the Staple (A.28), while the latter is a modification of the Sumshift [38] tracker. Both approaches combine a correlation filter output with color histogram backprojection. According to the AR-raw plot in Fig. 2, the SSKCF (A.27) tracks with a decent average overlap during successful tracking periods (\(\sim 0.55\)) and produces decently long tracks. For example, the probability of SSKCF still tracking the target after \(S=100\) frames is approximately 0.69. The Staple+ (A.22) tracks with a similar overlap (\(\sim 0.56\)) and tracks the target after 100 frames with probability 0.70. In the detailed analysis of the results we have found some discrepancies between the reported EFO units and the trackers' speed in seconds for the Matlab trackers. The toolkit did not ignore the Matlab start time, which can vary significantly across different trackers, which is why the EFO units of some Matlab trackers might be significantly underestimated.
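The tracking probabilities quoted above can be related to a failure rate with a brief Python sketch. It assumes the exponential reliability model \(R_S = \exp (-S F/N)\) used in earlier VOT reports, where F is the number of failures and N the number of frames; that model is not defined in this section, so both the formula and the function names are assumptions here.

```python
import math


def reliability(failures, frames, s=100):
    """Probability of still tracking s frames after the last failure,
    under the assumed exponential model R_S = exp(-s * failures / frames)."""
    return math.exp(-s * failures / frames)


def failures_per_frame(reliability_value, s=100):
    """Invert the assumed model to recover the failure rate F / N."""
    return -math.log(reliability_value) / s


# Failure rates implied by the probabilities quoted in the text
# (0.69 for SSKCF and 0.70 for Staple+ at S = 100 frames).
rate_sskcf = failures_per_frame(0.69)
rate_staple_plus = failures_per_frame(0.70)
```

Under this assumption the two reliabilities correspond to nearly identical failure rates, with Staple+ failing marginally less often per frame.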

5 Conclusion

This paper reviewed the VOT2016 challenge and its results. The challenge contains an annotated dataset of sixty sequences in which targets are denoted by rotated bounding boxes to aid a precise analysis of the tracking results. All the sequences are the same as in the VOT2015 challenge and the per-frame visual attributes are the same as well. A new methodology was developed to automatically place the bounding boxes in each frame by optimizing a well-defined cost function. In addition, a rule-of-thumb approach was developed to estimate the uniqueness of the automatically placed bounding boxes under the expected bound on the per-pixel annotation error. A set of 70 trackers have been evaluated. A large percentage of trackers submitted have been published at recent conferences and top journals, including ICCV, CVPR, TIP and TPAMI, and some trackers have not yet been published (available at arXiv). For example, fourteen trackers alone have been published at major computer vision venues in 2015 and 2016 so far.

The results of VOT2016 indicate that the top performing tracker of the challenge according to the EAO score is the C-COT (A.26) tracker [37]. This is a correlation-filter-based tracker that applies a number of state-of-the-art features. The tracker performed very well in accuracy as well as robustness, and the trade-off between the two is reflected in the EAO. The C-COT (A.26) tracker is closely followed by TCNN (A.44) and SSAT (A.12), which are close in terms of accuracy, robustness and the EAO. These trackers come from a different class: they are pure CNN trackers based on the winning tracker of VOT2015, the MDNet [33]. It is impossible to conclusively decide whether the improvements of C-COT (A.26) over other top-performing trackers come from the features or the approach. Nevertheless, the results of the top trackers conclusively show that features play a significant role in the final performance. All trackers that scored the top EAO perform below real-time. Among the realtime trackers, the top performing trackers were Staple+ (A.22) and SSKCF (A.27), which implement a simple combination of the correlation filter output and histogram backprojection.

The main goal of VOT is establishing a community-based common platform for discussion of tracking performance evaluation and contributing to the tracking community with verified annotated datasets, performance measures and evaluation toolkits. The VOT2016 was the fourth attempt toward this, following the very successful VOT2013, VOT2014 and VOT2015. The VOT2016 also introduced a second sub-challenge, VOT-TIR2016, that concerns tracking in thermal and infrared imagery. The results of that sub-challenge are described in a separate paper [29] that was presented at the VOT2016 workshop. Our future work will be focused on revising the evaluation kit, dataset and performance measures, and possibly launching other sub-challenges focused on narrow application domains, depending on the feedback and interest expressed by the community.