A method for performance diagnosis and evaluation of video trackers

Several measures for evaluating multi-target video trackers exist that generally aim at providing ‘end performance.’ End performance is important particularly for ranking and comparing trackers. However, for a deeper insight into trackers’ performance it would also be desirable to analyze key contributory factors (false positives, false negatives, ID changes) that (implicitly or explicitly) lead to the attainment of a certain end performance. Specifically, this paper proposes a new approach to enable a diagnosis of the performance of multi-target trackers as well as providing a means to determine the end performance to still enable their comparison in a video sequence. Diagnosis involves analyzing probability density functions of false positives, false negatives and ID changes of trackers in a sequence. End performance is obtained in terms of the extracted performance scores related to false positives, false negatives and ID changes. In the experiments, we used four state-of-the-art trackers on challenging real-world public datasets to show the effectiveness of the proposed approach.


Introduction
Evaluation measures [4-6,9,15,19,22] provide a means to compare the performance of different multi-target tracking algorithms [3,17,18,20,21]. These measures generally aim to determine the end performance of trackers. End performance provides an overall quantification of the goodness or badness of a tracker's results in the form of a score at frame level [12,19] or sequence level [4,15], without separately and explicitly analyzing the key factors (i.e., false positives, false negatives, ID changes [4]) that contribute to the achievement of a certain performance score. Analysis of these contributory factors may indeed be needed to interpret the performance behavior of tracking algorithms across a variety of datasets. It would therefore be desirable from a researcher's perspective to obtain a deeper insight into these factors in addition to the end performance.
Existing measures are broadly made up of composite error counts [4,15], tracking success counts [5,12], tracking failure counts [11,14] and temporal averaging of scores [13,15]; they provide tracking quality measurements without giving an explicit insight as to why the performance of a tracker is less than perfect. Consider, for example, two cases where a well-known existing measure, Multiple Object Tracking Accuracy (MOTA) [4], provides a comparable performance for a pair of multi-target trackers on one dataset, and ranks one tracker better than the other on another dataset (Fig. 1a). The measure provides an end performance comparison in the form of a score for each tracker but does not reveal why the trackers obtain those end performances. There appears to be a need to also perform a diagnosis aimed at revealing and dissecting different aspects of tracking performance, which could help in understanding why a certain performance is achieved. Such an approach may be used

Problem definition
Let $X$ be a set of tracks estimated by a multi-target tracker in a video sequence $V$: $X = \{X_j\}_{j=1}^{J}$, where $J$ is the total number of estimated tracks. $X_j$ is the estimated track for target $j$: $X_j = (X_{k,j})_{k=k^j_{start}}^{k^j_{end}}$, where $k^j_{start}$ and $k^j_{end}$ are the first and final frame numbers of $X_j$, respectively. $X_{k,j}$ is the estimated state of target $j$ at frame $k$, with $k = 1, \ldots, K$ and $K$ the total number of frames in $V$. $X_{k,j} = (x_{k,j}, y_{k,j}, A_{k,j}, l_j)$, where $(x_{k,j}, y_{k,j})$ and $A_{k,j}$ denote the position and occupied area of target $j$ on the image plane at frame $k$, respectively, and $l_j$ defines its ID. $A_{k,j}$ may use rectangular (bounding box) [18], elliptical [10], or contour [1] representations. The number of estimated targets at frame $k$ is denoted as $n_k$, and the targets are defined as $\{X_{k,1}, \ldots, X_{k,j}, \ldots, X_{k,n_k}\}$. Likewise, the notations for the ground-truth quantities corresponding to $X$, $X_j$, $J$, $X_{k,j}$, $k^j_{start}$, $k^j_{end}$, $x_{k,j}$, $y_{k,j}$, $A_{k,j}$, $l_j$ and $n_k$ are $\bar{X}$, $\bar{X}_i$, $I$, $\bar{X}_{k,i}$, $\bar{k}^i_{start}$, $\bar{k}^i_{end}$, $\bar{x}_{k,i}$, $\bar{y}_{k,i}$, $\bar{A}_{k,i}$, $\bar{l}_i$ and $\bar{n}_k$, respectively.
A typical diagnostic procedure for a system starts as a result of the identification of symptom(s) that may allude to a deterioration in the system's performance. For a multi-target tracking system, deterioration may refer to the deviation of $X$ from $\bar{X}$ [16], which is computed as a discrepancy between $X$ and $\bar{X}$. The deterioration of performance in a system results from the occurrence of fault(s) in it. In the case of a tracking system, the basic set of faults may include ID change, false positive, and false negative, referring to the error in maintaining a unique target ID, incorrect estimation, and missed estimation at frame $k$, respectively (see Fig. 1b). Here, we consider these three frame-level faults (ID change, false positive, false negative) as they often implicitly or explicitly form a basis for, or contribute to, estimating existing track-level assessment proposals [4-6,15,19,22] (see details in Sect. 2). A diagnostic procedure involves performing fault diagnosis. Therefore, for a tracking system, diagnosis may include analyzing the occurrence of false positives, false negatives, and ID changes across the frames of a sequence, which is expected to reveal more about how a certain end performance is achieved.

Contributions
In this paper, we present a new approach that, instead of presenting only the end performance of trackers, also aims at diagnosis by providing a more revealing picture of the performance of multi-target trackers. It involves analyzing probability density functions (PDFs), in addition to extracting performance scores for each fault type (false positives, false negatives, ID changes) in a video sequence. Performance scores quantify the per frame concentration and robustness of a tracker to each fault type. We show the usefulness of the proposed method using state-of-the-art trackers on four challenging publicly available datasets.

Related work
Measures exist that implicitly account for faults (false positives, false negatives, ID changes) in their formulation to provide end tracking performance with respect to the ground-truth information [15,19]. The Optimal Sub-Pattern Assignment (OSPA) metric [19] provides a frame-level target positional evaluation by combining accuracy and cardinality errors. The cardinality error (difference between the number of estimated and ground-truth targets) is the number of unassociated targets at frame $k$; hence, it encapsulates the information about false positives and false negatives. Inspired by OSPA, Multiple Extended-target Tracking Error (METE) [15] also quantifies frame-level performance by combining accuracy and cardinality errors, also taking into account the information about the occupied target region. Multiple Extended-target Lost-Track ratio (MELT) [15] quantifies the performance at sequence level based on the use of the lost-track ratio. Given an associated pair of estimated and ground-truth tracks, the lost-track ratio is computed as the ratio of the number of frames with an overlap $O(X_{k,j}, \bar{X}_{k,i}) = |A_{k,j} \cap \bar{A}_{k,i}| / |A_{k,j} \cup \bar{A}_{k,i}|$ ($|\cdot|$ is the cardinality of a set) between a pair of $X_{k,j}$ and $\bar{X}_{k,i}$ less than a predefined threshold value, to the number of frames in the ground-truth track $i$. When $O(\cdot)$ is less than the threshold, it may point toward the presence of a false positive or a false negative in a frame. Some measures explicitly use information about the faults and combine them to quantify the end performance [2,4,5].
False Alarm Rate (FAR) [2,5], Specificity, Positive Prediction, and False Positive Rate [2] use the number of false positives with other quantities in the evaluation procedure at frame level. Negative Prediction and False Negative Rate [2] use information about the number of false negatives with other quantities in evaluation at frame level. Multiple Object Tracking Accuracy (MOTA) [4] estimates performance by combining information about the number of false positives, false negatives, and ID changes at each frame and normalizing across the sequence.
Measures that quantify performance by separately using the information from a specific fault include Normalized ID changes (NIDC) [15], False Positive track matches and False Negative track matches [6], False Alarm Track (FAT) and Track Detection Failure (TDF) [22], Track Fragmentation (TF) [5], and ID Changes (IDC) [22]. NIDC normalizes the number of ID changes by the length of the track in which they occur. False Positive track matches and FAT use information about false positives across frames. False Negative track matches and TDF use information about false negatives across frames. TF and IDC count the number of ID changes across frames of an individual track and all tracks, respectively.
As reviewed above, existing measures (OSPA, METE, MELT, MOTA, FAR, Specificity, Positive and Negative Predictions, False Positive and Negative Rates) focus on evaluating the end performance of trackers without separately providing an explicit insight into each fault type that could be needed to understand the attainment of a certain end performance. Some measures (NIDC, FAT, TDF, TF, IDC, False Positive and False Negative track matches) do provide a separate evaluation for each fault type; however, counting (or combining) false positives, false negatives, or ID changes would still provide an end performance evaluation that may not enable understanding a tracker's performance behavior in terms of its ability to deal with each fault type. In this paper, we address this limitation by proposing an approach that involves dissecting a tracker's performance by separately analyzing the behavior of each fault type, while still enabling the end performance evaluation.

Tracking performance diagnosis and evaluation
Without loss of generality, $A_{k,j}$ is considered in the form of a bounding box, in which case $X_{k,j}$ can be rewritten as $X_{k,j} = (x_{k,j}, y_{k,j}, w_{k,j}, h_{k,j}, l_j)$, where $w_{k,j}$ and $h_{k,j}$ denote the width and height of the bounding box for target $j$ at frame $k$. The notations for the ground-truth quantities corresponding to $w_{k,j}$ and $h_{k,j}$ are $\bar{w}_{k,i}$ and $\bar{h}_{k,i}$, respectively. Given a set of estimated states $\{X_{k,1}, \ldots, X_{k,j}, \ldots, X_{k,n_k}\}$ and a set of ground-truth states $\{\bar{X}_{k,1}, \ldots, \bar{X}_{k,i}, \ldots, \bar{X}_{k,\bar{n}_k}\}$ at frame $k$, the association between the elements of the two sets is established using the Hungarian algorithm by minimizing the overlap cost $1 - O(\cdot)$, where $O(\cdot)$ defines the amount of overlap between a pair of $X_{k,j}$ and $\bar{X}_{k,i}$, as described in Sect. 2. $\mathrm{FP}_k$, the false positives, are the number of associated pairs of estimated and ground-truth targets with $O(\cdot) < \tau$ (where $\tau$ is a threshold value) plus the number of unassociated estimated targets at frame $k$. $\mathrm{FN}_k$, the false negatives, are the number of ground-truth targets that are missed by a tracker at frame $k$. $\mathrm{IDC}_k$, the ID changes, are the number of changed associations corresponding to the ground-truth tracks at frame $k$. See also Fig. 1b. Next, we describe the proposed method including performance diagnosis (Sect. 3.1) and evaluation (Sect. 3.2), followed by highlighting the advantages of using the proposed method.
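The frame-level counting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: boxes are assumed to be (x, y, w, h) with (x, y) the top-left corner; O(·) is taken to be intersection over union; the minimum-cost association is found by exhaustive search, a simple stand-in for the Hungarian algorithm that is only practical for a handful of targets; a ground-truth target is assumed to be missed when it has no associated estimate with O(·) ≥ τ; and an ID change is assumed to occur when a ground-truth track is associated with a different estimated ID than at its previous valid association. The names `iou`, `associate`, and `frame_faults` are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Overlap of two boxes (x, y, w, h); (x, y) is assumed to be the top-left corner."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def associate(est, gt):
    """One-to-one association minimizing the summed cost 1 - O(.).

    Exhaustive search is used here in place of the Hungarian algorithm.
    Returns a list of (est_index, gt_index) pairs.
    """
    swap = len(est) > len(gt)
    small, large = (gt, est) if swap else (est, gt)
    best_cost, best = float("inf"), []
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(1.0 - iou(small[s], large[l]) for s, l in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, list(enumerate(perm))
    return [(l, s) for s, l in best] if swap else best


def frame_faults(est, gt, last_match, tau=0.5):
    """Return (FP_k, FN_k, IDC_k) for one frame.

    est, gt: dicts {target_id: box}. last_match: dict {gt_id: est_id} holding the
    most recent valid association of each ground-truth track (updated in place).
    """
    est_ids, gt_ids = list(est), list(gt)
    pairs = associate([est[i] for i in est_ids], [gt[i] for i in gt_ids])
    valid = [(est_ids[e], gt_ids[g]) for e, g in pairs
             if iou(est[est_ids[e]], gt[gt_ids[g]]) >= tau]
    fp = len(est) - len(valid)  # low-overlap pairs plus unassociated estimates
    fn = len(gt) - len(valid)   # assumption: ground truth without a valid estimate
    idc = 0
    for e_id, g_id in valid:
        if g_id in last_match and last_match[g_id] != e_id:
            idc += 1            # the ground-truth track changed its estimated ID
        last_match[g_id] = e_id
    return fp, fn, idc


# Hypothetical two-frame example.
last = {}
frame1 = frame_faults({1: (0, 0, 10, 10), 2: (50, 50, 10, 10)},
                      {"a": (1, 0, 10, 10), "b": (100, 100, 10, 10)}, last)
frame2 = frame_faults({3: (0, 0, 10, 10)}, {"a": (0, 0, 10, 10)}, last)
print(frame1, frame2)  # (1, 1, 0) (0, 0, 1)
```

For realistic numbers of targets, the exhaustive search would be replaced by a proper Hungarian solver; `scipy.optimize.linear_sum_assignment` is one option if SciPy is available.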

Performance diagnosis
Analyzing the occurrence of $\mathrm{FP}_k$, $\mathrm{FN}_k$, and $\mathrm{IDC}_k$ at each frame can be cumbersome for longer sequences, and also makes it difficult to analyze and compare trackers' performance across sequences of different lengths. Additionally, as discussed in Sect. 2, looking solely at the total numbers of false positives, false negatives, or ID changes across a sequence is still an end performance evaluation, and a deeper insight would be desirable for performance diagnosis. Instead, the analysis of the distributions of false positives, false negatives, and ID changes in a sequence is expected to provide a more revealing picture of a tracker's performance behavior for a fault type, irrespective of sequence length. Moreover, the analysis of distributions (in normalized form) could enable inferring trends about the performance of trackers across different datasets.
We therefore compute probability density functions (PDFs) for false positives, false negatives, and ID changes in a sequence. A PDF is computed as a normalized histogram for a particular fault type; hence, the area under each PDF equals 1, i.e., the sum of bin values on the y-axis is equal to 1. Figure 2 shows the PDFs of false positives, false negatives, and ID changes for existing trackers with the datasets used in this study. For example, $\Pr[\mathrm{FP}_k = 0]$ is read as a probability in terms of the percentage of frames in which the tracker produces zero false positives; similarly, $\Pr[\mathrm{FN}_k = 2]$ refers to a probability in terms of the percentage of frames in which the tracker produces two false negatives; likewise, $\Pr[\mathrm{IDC}_k > 2]$ means a probability in terms of the percentage of frames in which the tracker produces more than two ID changes. To show the usefulness of analyzing PDFs, consider the case of the Towncentre dataset [3], where PDFs for false positives of a pair of trackers (a tracker from Benfold and Reid (BenfoldTracker) [3] and a tracker from Pirsiavash et al. (PirsiavashTracker) [17]) are generated for full-body tracking in Fig. 2a (PDFs are shown with solid lines). The total number of false positives in the sequence ($\sum_{k=1}^{K} \mathrm{FP}_k$) only reveals that PirsiavashTracker (10,118 false positives) is better than BenfoldTracker (12,162 false positives); however, their corresponding PDFs (Fig. 2a) provide a deeper insight as follows. The PDFs reveal that $\Pr[\mathrm{FP}_k = 0]$ for BenfoldTracker is higher (better) than $\Pr[\mathrm{FP}_k = 0]$ for PirsiavashTracker. This shows that BenfoldTracker is more robust than PirsiavashTracker, because there are more frames in which the former did not produce any false positives. The PDFs further reveal that, on the contrary, BenfoldTracker shows a greater tendency than PirsiavashTracker to produce a higher concentration of false positives (i.e., for $\mathrm{FP}_k > 3$) in a frame (see Fig. 2a), i.e., $\Pr[\mathrm{FP}_k > 3]$ for BenfoldTracker is higher (worse) than that for PirsiavashTracker. Therefore, analysis of a PDF offers a more detailed and dissected picture of a tracker's performance by revealing its robustness and per frame concentration for an individual fault type, which is not explicitly available from a simple fault count. To further aid the analysis and to facilitate end performance comparison of trackers, we next define two performance scores that account for the two aspects above for each fault type.
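The normalized-histogram construction of a PDF can be sketched as follows. This is an illustrative sketch only: the per-frame counts are made-up values, and the helper names `fault_pdf` and `prob` are hypothetical rather than taken from the paper.

```python
from collections import Counter


def fault_pdf(counts):
    """Normalized histogram of per-frame fault counts: value -> fraction of frames."""
    if not counts:
        raise ValueError("at least one frame is required")
    total_frames = len(counts)
    hist = Counter(counts)
    return {value: hist[value] / total_frames for value in sorted(hist)}


def prob(pdf, predicate):
    """Probability mass of all fault counts that satisfy the predicate."""
    return sum(p for value, p in pdf.items() if predicate(value))


# Hypothetical false-positive counts FP_k for an 8-frame sequence.
fp_per_frame = [0, 0, 1, 4, 0, 2, 5, 0]
pdf = fault_pdf(fp_per_frame)

print(pdf[0])                       # Pr[FP_k = 0] -> 0.5
print(prob(pdf, lambda v: v > 3))   # Pr[FP_k > 3] -> 0.25
print(sum(pdf.values()))            # bin values sum to 1 -> 1.0
```

Because the histogram is normalized by the number of frames, PDFs from sequences of different lengths can be placed on the same axes.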

Performance evaluation
The first score expresses the ability of a tracker to track without producing a fault across a sequence, and is called robustness to a fault type ($R$): $R_{fp} = 1 - K_{fp}/K$, $R_{fn} = 1 - K_{fn}/K$, and $R_{idc} = 1 - K_{idc}/K$, such that $K_{fp}$ is the number of frames containing false positive(s), $K_{fn}$ is the number of frames containing false negative(s), and $K_{idc}$ is the number of frames containing ID change(s). $R_{fp} \in [0, 1]$, $R_{fn} \in [0, 1]$, $R_{idc} \in [0, 1]$: the higher the value ($R_{fp}$/$R_{fn}$/$R_{idc}$), the better the ability. NB: $R_{fp}$/$R_{fn}$/$R_{idc}$ differs in formulation from MOTA [4]. When only the false positive, false negative, or ID change term is retained in its numerator, MOTA reduces to providing the performance separately in terms of false positives, false negatives, or ID changes, respectively. Unlike MOTA, which in this 'reduced' form uses information about the number of false positives, false negatives, or ID changes, $R_{fp}$, $R_{fn}$ and $R_{idc}$ instead use information about the number of frames having false positive(s), false negative(s), and ID change(s), respectively, to provide robustness.
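The second score expresses the tendency of a tracker to produce a fault type per frame, and is called per frame concentration of a fault type (PFC): $\mathrm{PFC}_{fp} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{FP}_k$, $\mathrm{PFC}_{fn} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{FN}_k$, and $\mathrm{PFC}_{idc} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{IDC}_k$. $\mathrm{PFC}_{fp} \geq 0$, $\mathrm{PFC}_{fn} \geq 0$, and $\mathrm{PFC}_{idc} \geq 0$: the lower the value ($\mathrm{PFC}_{fp}$/$\mathrm{PFC}_{fn}$/$\mathrm{PFC}_{idc}$), the lower the tendency of producing a fault type per frame.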
NB: $\mathrm{PFC}_{idc}$ differs from NIDC [15], in that the latter penalizes the number of ID changes by the length of the track in which they occur, whereas the former quantifies per frame concentration by averaging the number of ID changes across the whole sequence. Likewise, $\mathrm{PFC}_{fp}$ and $\mathrm{PFC}_{fn}$ differ from MELT [15], which encapsulates lost-track ratio information (as explained in Sect. 2). The lost-track value could indeed reflect the number of frames having false positives and/or false negatives in a track. Unlike MELT, $\mathrm{PFC}_{fp}$ ($\mathrm{PFC}_{fn}$) quantifies per frame concentration by averaging the number of false positives (false negatives) across the whole sequence.
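The two scores can be illustrated with a short sketch. It assumes, based on the descriptions above, that robustness is one minus the fraction of frames containing at least one fault (R = 1 - K_fault / K) and that per frame concentration is the number of faults averaged over all K frames of the sequence; the exact normalization used by the authors may differ. The function names and the per-frame counts are hypothetical.

```python
def robustness(counts):
    """Assumed form R = 1 - K_fault / K; K_fault = frames with at least one fault."""
    total_frames = len(counts)
    frames_with_fault = sum(1 for c in counts if c > 0)
    return 1.0 - frames_with_fault / total_frames


def per_frame_concentration(counts):
    """Assumed form PFC = (1 / K) * sum of per-frame fault counts."""
    return sum(counts) / len(counts)


# Hypothetical false-positive counts for two trackers on the same 6-frame sequence.
tracker_a = [0, 0, 0, 0, 6, 6]   # faults are rare but come in bursts
tracker_b = [1, 2, 1, 2, 1, 1]   # a few faults in every frame

for name, counts in (("A", tracker_a), ("B", tracker_b)):
    print(name, round(robustness(counts), 3), round(per_frame_concentration(counts), 3))
# A 0.667 2.0
# B 0.0 1.333
```

In this toy case tracker A is more robust (more fault-free frames) while tracker B has the lower per frame concentration, which is the kind of trade-off a single fault total would hide.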

Advantages
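This section shows the advantages of using the proposed method over the widely used measure, MOTA. To this end, for clarity, we plot in Fig. 3 the numerator term of MOTA (which we here refer to as $\mathrm{MOTA}_k$) for BenfoldTracker [3] on a segment of the Towncentre dataset. $\mathrm{MOTA}_k$ combines the contributions of $\mathrm{FP}_k$, $\mathrm{FN}_k$, and $\mathrm{IDC}_k$ at frame $k$. Indeed, the same value of $\mathrm{MOTA}_k$ could be caused by different combinations of $\mathrm{FP}_k$, $\mathrm{FN}_k$, and $\mathrm{IDC}_k$; for example, $\mathrm{MOTA}_k = 8$ at $k = 1348$ and $k = 1393$, although the values of $\mathrm{FP}_k$, $\mathrm{FN}_k$, and $\mathrm{IDC}_k$ are different in these frames (see Fig. 3). Therefore, MOTA alone might not be revealing enough, as it does not provide an explicit insight into the individual fault types (false positives, false negatives, ID changes) that could be beneficial for a deeper understanding of performance. Differently, the proposed method enables a separate analysis of the behavior of individual fault types for a tracker in terms of the respective PDFs (Sect. 3.1), as well as its end performance in terms of the extracted robustness (R) and per frame concentration (PFC) scores for each fault type (Sect. 3.2).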
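The ambiguity of a combined per-frame error can be made concrete with a small sketch. It assumes an unweighted sum of the three fault counts for the numerator term (the paper's exact weighting is not reproduced here), and the two frames and their counts are hypothetical, not values read from Fig. 3.

```python
def mota_k(fp, fn, idc):
    """Assumed unweighted per-frame numerator term: FP_k + FN_k + IDC_k."""
    return fp + fn + idc


# Hypothetical (FP_k, FN_k, IDC_k) triples for two frames.
frames = {
    "frame_a": (5, 3, 0),  # dominated by false positives
    "frame_b": (2, 5, 1),  # dominated by false negatives, plus one ID change
}

for name, (fp, fn, idc) in frames.items():
    print(name, "MOTA_k =", mota_k(fp, fn, idc), "from", (fp, fn, idc))
# frame_a MOTA_k = 8 from (5, 3, 0)
# frame_b MOTA_k = 8 from (2, 5, 1)
```

Both frames yield the same combined value even though the underlying faults differ, which is why the per-fault PDFs and scores are analyzed separately.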

Experimental validation
This section demonstrates the usefulness of the proposed method using state-of-the-art trackers on real-world publicly available datasets. Section 4.1 describes the setup including trackers and datasets, followed by the performance analysis of trackers using the proposed method and (for comparison) existing measures in Sect. 4.2, and a discussion in Sect. 4.3.

Trackers and datasets
Table 1 provides a summary of the trackers and datasets used in the experiments. We used the available ground truth generated for every frame of the sequences. We used trackers from Pirsiavash et al. (PirsiavashTracker) [17], Yang and Nevatia (YangTracker) [21], Benfold and Reid (BenfoldTracker) [3], and Poiesi et al. (PoiesiTracker) [18]. The parameters of the trackers are the same as in the original papers. We use head and full-body tracks in the experiments. Moreover, we chose four challenging datasets: Towncentre [3], iLids Easy [7], ETH Bahnhof [8], and ETH Sunnyday [8]. Towncentre and iLids Easy are recorded from a static camera, whereas ETH Bahnhof and Sunnyday involve a moving camera. On Towncentre, trackers are tested for head tracking (BenfoldTracker-H, PoiesiTracker-H) and full-body tracking (BenfoldTracker, PirsiavashTracker); on iLids Easy, trackers are used for full-body tracking (BenfoldTracker, PoiesiTracker, PirsiavashTracker); and on ETH Bahnhof and Sunnyday, trackers are tested for full-body tracking (YangTracker, PoiesiTracker, PirsiavashTracker). We use $\tau = 0.25$ for head tracking, and $\tau = 0.5$ for full-body tracking [3]. Figure 2 shows the PDFs of trackers for each fault type on all datasets. Table 2 presents the performance scores ($\mathrm{PFC}_{fp}$, $\mathrm{PFC}_{fn}$, $\mathrm{PFC}_{idc}$, $R_{fp}$, $R_{fn}$, $R_{idc}$) and existing measures (MOTA, mean METE, MELT) for trackers on all datasets, as well as their numbers of false positives, false negatives, and ID changes to aid analysis.
For full-body tracking on Towncentre, the proposed method provides a greater understanding of the trackers' performance by enabling a separate analysis of their per frame concentration and robustness to faults using the corresponding PDFs, as well as PFC and R scores, as follows. Based on false positives, the PDFs of BenfoldTracker and PirsiavashTracker (shown with solid lines in Fig. 2a) reveal that there are more frames in which PirsiavashTracker produced false positive(s) than BenfoldTracker. This shows that the latter is more robust to false positives than the former, which is also confirmed by a better $R_{fp}$ for BenfoldTracker (Table 2); see also qualitative results where PirsiavashTracker produced false positives, but BenfoldTracker did not (Fig. 4b). At the same time, it can also be noticed that BenfoldTracker shows a greater tendency to produce a higher per frame concentration of false positives than PirsiavashTracker (i.e., for $\mathrm{FP}_k > 3$, Fig. 2a), which is also shown by a better $\mathrm{PFC}_{fp}$ for PirsiavashTracker (Table 2). Based on false negatives and ID changes, BenfoldTracker outperforms PirsiavashTracker, as shown in general in their PDFs (Fig. 2b, c), $\mathrm{PFC}_{fn}$, $\mathrm{PFC}_{idc}$, $R_{fn}$, and $R_{idc}$ (Table 2).
Likewise, for head tracking on this dataset, the values of MOTA, mean METE and MELT simply show that PoiesiTracker-H is better than BenfoldTracker-H (Table 2). The proposed method enables a more detailed analysis of the trackers' performance as follows. PDFs of the trackers (BenfoldTracker-H, PoiesiTracker-H) are shown with dotted lines in Fig. 2a-c. For false positives (Fig. 2a), the results show that $\Pr[\mathrm{FP}_k = 0]$ for PoiesiTracker-H is higher than that for BenfoldTracker-H, showing an enhanced robustness of the former to false positives that is also noticeable in its superior $R_{fp}$ (Table 2). As for per frame concentration, the PDFs show no clear winner between BenfoldTracker-H and PoiesiTracker-H for $\mathrm{FP}_k > 0$ (Fig. 2a): BenfoldTracker-H outperforms PoiesiTracker-H for $0 < \mathrm{FP}_k \leq 3$, PoiesiTracker-H is better than BenfoldTracker-H for $3 < \mathrm{FP}_k < 7$, and both trackers generally perform comparably thereafter across their PDFs. Overall, in terms of $\mathrm{PFC}_{fp}$, however, PoiesiTracker-H performs better than BenfoldTracker-H (Table 2). For false negatives, PoiesiTracker-H shows more robustness than BenfoldTracker-H, which is noticeable in the higher $\Pr[\mathrm{FN}_k = 0]$ (Fig. 2b) and $R_{fn}$ (Table 2) of the former. As for per frame concentration, BenfoldTracker-H shows a better $\mathrm{PFC}_{fn}$ than PoiesiTracker-H (Table 2); this is also reflected by a mostly better performance of the former across their PDFs, i.e., for $\mathrm{FN}_k > 3$ (Fig. 2b). See qualitative results in a sample frame showing several false negatives for PoiesiTracker-H, and fewer for BenfoldTracker-H (Fig. 4c). For ID changes, overall the results based on PDFs (Fig. 2c) and $\mathrm{PFC}_{idc}$, $R_{idc}$ (Table 2) reveal that PoiesiTracker-H is better based on per frame concentration of ID changes, whereas BenfoldTracker-H is more robust to producing ID changes.

iLids Easy
Based on MOTA, mean METE and MELT, BenfoldTracker is the best, followed by PirsiavashTracker and PoiesiTracker (Table 2). The proposed method produces a different ranking based on false positives ($\mathrm{PFC}_{fp}$, $R_{fp}$) and ID changes ($\mathrm{PFC}_{idc}$, $R_{idc}$) by ranking PoiesiTracker as the best, followed by PirsiavashTracker and BenfoldTracker (Table 2). Indeed, Fig. 2d, f also shows that PoiesiTracker generally outperforms PirsiavashTracker and BenfoldTracker across their PDFs of false positives and ID changes. On the other hand, based on false negatives, the performance trends of trackers using the proposed method are similar to those produced by MOTA, mean METE and MELT, i.e., BenfoldTracker outperforms PirsiavashTracker and PoiesiTracker across their PDFs (Fig. 2e), as well as based on their $\mathrm{PFC}_{fn}$, $R_{fn}$ (Table 2). The qualitative results show that BenfoldTracker produces more false positives (Fig. 4d, f) and ID changes (Fig. 4d, e) than others, and PoiesiTracker produces more false negatives (Fig. 4d) than others.

In Table 2, the total number of false positives (FP), false negatives (FN), ID changes (IDC), MOTA, mean METE and MELT of trackers are also listed. For Towncentre, 'BenfoldTracker-H' and 'PoiesiTracker-H' refer to the use of these trackers for head tracking. On each dataset, the best tracking scores are shown in bold.

ETH Bahnhof and Sunnyday
On ETH Bahnhof, existing measures (MOTA, mean METE, MELT) consider YangTracker as the best, followed by PoiesiTracker and PirsiavashTracker. The proposed method provides additional information and useful insights into the performance based on different fault types as follows. Based on false positives, PirsiavashTracker is found to be the most robust and shows the least per frame concentration, followed by YangTracker and PoiesiTracker, as confirmed by their $\mathrm{PFC}_{fp}$ and $R_{fp}$ scores (Table 2), and across their PDFs (Fig. 2g). For example, Fig. 4g, i shows the qualitative tracking results with more false positives for PoiesiTracker than others. Based on false negatives, PoiesiTracker is the best, followed by YangTracker and PirsiavashTracker; this is reflected by their $\mathrm{PFC}_{fn}$ and $R_{fn}$ (Table 2), as well as in general across their PDFs (Fig. 2h). See, for example, qualitative results in a sample frame with more false negatives for PirsiavashTracker than others (Fig. 4i). Based on ID changes, YangTracker shows an increased robustness and a better per frame concentration, as compared to PirsiavashTracker and PoiesiTracker, as confirmed by their $\mathrm{PFC}_{idc}$ and $R_{idc}$ scores (Table 2), and generally across their PDFs (Fig. 2i). On ETH Sunnyday, the trends and rankings of trackers (YangTracker, PirsiavashTracker, PoiesiTracker) based on PFC and R scores (Table 2), and PDFs (Fig. 2j-l) for all fault types are interestingly similar to those reported above for ETH Bahnhof. See also qualitative results on ETH Sunnyday in Fig. 4j-l.

Discussion
The proposed method could be used to provide formative feedback that could help researchers in addressing shortcomings in tracking algorithms. In fact, the analysis based on false positives could enable analyzing the impact on tracking performance originating from the detection stage. For example, BenfoldTracker has generally shown inferior $\mathrm{PFC}_{fp}$ and $R_{fp}$ on Towncentre and iLids Easy than others, and PoiesiTracker has shown inferior $\mathrm{PFC}_{fp}$ and $R_{fp}$ on ETH Bahnhof and Sunnyday than others. Indeed, on ETH Bahnhof and Sunnyday, the possible reason for the worst $\mathrm{PFC}_{fp}$ and $R_{fp}$ scores of PoiesiTracker is that its person detector has a limited ability to deal with varying illumination conditions in these datasets [15]. Therefore, the results show a particular need for improvement at the detection stage of BenfoldTracker and PoiesiTracker. Similarly, inferior scores related to false negatives (i.e., $\mathrm{PFC}_{fn}$, $R_{fn}$) can point toward a need to improve the detection stage, and/or an inability to temporally link small tracks ('tracklets') in an effective manner. For example, PirsiavashTracker has shown inferior $\mathrm{PFC}_{fn}$ and $R_{fn}$ on most datasets (Towncentre, ETH Bahnhof, ETH Sunnyday) than others. This is likely due to the absence of an effective dedicated strategy to link tracklets in PirsiavashTracker [17], which other trackers (e.g., PoiesiTracker, YangTracker) possess. In fact, it is due to this limited ability that PirsiavashTracker also reported the highest cardinality error (that can be caused by false negatives) on these datasets in an earlier study [15]. Likewise, the analysis based on ID changes provides formative feedback vis-a-vis the tracking stage. For instance, YangTracker consistently shows better $\mathrm{PFC}_{idc}$ and $R_{idc}$ than PoiesiTracker and PirsiavashTracker on ETH Bahnhof and Sunnyday, which also confirms the conclusions of [15] that YangTracker outperforms other trackers in terms of ID changes on the same datasets. Indeed, this is because YangTracker uses an effective ID management strategy, employing motion and appearance affinities to avoid confusion between IDs of targets that are close to each other [21]. Hence, a researcher could pay more attention to improving the ID management in PoiesiTracker and PirsiavashTracker.

Conclusions
We presented a new method that, instead of just providing the usual end performance evaluation, also aims at performance diagnosis of a multi-target tracker in a video sequence. Existing tracking evaluation proposals generally focus only on end performance assessment, which is important for drawing performance comparisons. Instead, the proposed approach enables a more detailed performance analysis using probability density functions (PDFs) of key frame-level faults that a tracker can make (i.e., false positives, false negatives, ID changes). To complement this analysis, the extracted performance scores further offer a separate evaluation in terms of per frame concentration and robustness of trackers for each fault type. We used state-of-the-art trackers on real-world publicly available datasets to validate the proposed method by showing its effectiveness over existing proposals, and its use in identifying algorithmic shortcomings of trackers. While the proposed method accounts for multi-target tracking, it could still be partly suitable for single-target trackers; however, for single-target tracking, ID change is generally not an issue and, hence, could be ignored. Additionally, the proposed method could be applied to any target type, provided the target model contains the position and occupied area (on a 2D image plane) as parameters. Moreover, the proposed method is based on analyzing frame-level faults. An explicit inclusion of track-level faults could also be of interest and is left to future work.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Fig. 1 a In the given examples, MOTA provides a comparable performance for a tracker by Poiesi et al. (PoiesiTracker) [18] and a tracker by Pirsiavash et al. (PirsiavashTracker) [17] on the ETH Sunnyday dataset [8], and ranks the latter better than the former on the iLids Easy dataset [7]. b Definition of a false positive (FP), a false negative (FN) and an ID change (IDC) in a frame. Ground truth: dotted bounding box; tracker's result: solid bounding box. Bounding box color represents a unique ID

Fig. 3 The numerator term of MOTA (here referred to as $\mathrm{MOTA}_k$) is plotted across a segment of the Towncentre dataset for BenfoldTracker. Additionally, $\mathrm{FP}_k$, $\mathrm{FN}_k$ and $\mathrm{IDC}_k$ are also plotted alongside. $\mathrm{MOTA}_k$ combines the contributions of $\mathrm{FP}_k$, $\mathrm{FN}_k$, and $\mathrm{IDC}_k$ at frame $k$; the same value of $\mathrm{MOTA}_k$ could be caused by different combinations of them, for example, $\mathrm{MOTA}_k = 8$ at $k = 1348$ and $k = 1393$, although the values of $\mathrm{FP}_k$, $\mathrm{FN}_k$, and $\mathrm{IDC}_k$ are different in these frames

Fig. 4 Qualitative results of trackers on the Towncentre [3] (a-c), iLids Easy [7] (d-f), ETH Bahnhof [8] (g-i), and ETH Sunnyday [8] (j-l) datasets. Key: red BenfoldTracker, blue PirsiavashTracker, green PoiesiTracker, black YangTracker

Table 1
Summary of datasets

Table 2
Performance evaluation of trackers on different datasets using $\mathrm{PFC}_{fp}$, $\mathrm{PFC}_{fn}$, $\mathrm{PFC}_{idc}$, $R_{fp}$, $R_{fn}$, and $R_{idc}$