1 Introduction

Fig. 1 Value of HOTA metric for various tracking methods on MOT17 and MOT20 test sets under private detection protocol. Blue circles represent online trackers

Multiple object tracking (MOT) is one of the most important problems in computer vision, with applications in autonomous robotics [20, 50], autonomous driving [12, 24, 43, 52] and smart cities [8, 44, 52, 72]. The problem consists of determining the position and identity of each object of interest (e.g. a pedestrian) in every frame of a video. This is usually done in a tracking-by-detection paradigm, by applying detection and tracking steps to every input frame. Given a set of detections, the goal of the tracking step is to assign detections to tracked objects. Due to occlusions, some objects cannot be detected even though they are still in the scene. When an object reappears, it should not be treated as a new object but rather matched to an existing one. The Kalman filter [27] is usually used as the tracking algorithm to overcome missing detections and occlusions and to provide estimates of the object state. The assignment problem can be formulated as a bipartite matching task between detections and tracklets and solved using the Hungarian algorithm [28]. Intersection over union (IoU) is an effective measure of similarity between detections and existing tracklets and can be used to create the cost matrix needed for the Hungarian algorithm. To better deal with occlusions and crowded scenes, appearance similarity is usually used in addition to IoU or other motion cues. Computing appearance similarity requires the extraction of visual features; however, using high-quality feature extractors (e.g. FastReID [25]) increases execution time and limits real-time application [1].

To reduce the number of false positives and ghost tracks, low-confidence detections are usually filtered out, and only a subset of detections is used for association. However, not all low-confidence detections are false positives. ByteTrack [69] uses a two-stage assignment in which high-confidence detections are used in the first stage, while the remaining detections and unassociated tracklets are used in the second. Following ByteTrack, several works have used low-confidence detections in the second assignment stage [1, 21, 30, 37, 58, 62]. On the other hand, matching low-confidence detections to the remaining tracklets in the second stage can result in identity switches (IDSWs). Suppose two objects' bounding boxes overlap heavily and only one has a high detection confidence score. In that case, we may assign that detection to the wrong object (forcing the incorrect association of the other detection in the second stage), whereas if we used both detections in the first stage, we could match them correctly [55].

Several papers  [11, 13, 60] have adopted multiple-stage association where more recently updated tracklets (or the more confident ones) are associated first. Note that any multiple-stage association can introduce identity switches.

Ideally, an MOT algorithm should be online and operate in real time, the similarity measure should be discriminative enough to enable a correct match of all tracklet-detection pairs in a one-stage assignment, and all true positive and none of the false positive detections should be used.

To approach the described ideal solution, in this paper we present BoostTrack, a simple tracking-by-detection system built on top of SORT [6] that uses several lightweight plug-and-play additions which can significantly improve performance.

To avoid two-stage assignment and still utilize low-confidence detections, we propose to increase (boost) the confidence of two groups of low-confidence detection bounding boxes:

  1. the bounding boxes where we predict an object should be,

  2. counterintuitively, the bounding boxes where currently tracked objects should not be.

When an object is partially occluded, its detected bounding box confidence can be low, but the IoU between the predicted position and the bounding box can be high. We propose to increase the detection confidence of such detections. On the other hand, low-confidence detections positioned where we do not predict an object should be could be noise, but can often be a new object that is only partially visible (e.g. entering the scene or standing at the edge). We use the Mahalanobis distance measure [38] to discover these outliers and find that increasing the confidence of these detections also improves performance.

To utilize the benefits of multiple-stage assignment and avoid its drawbacks, we introduce detection-tracklet confidence, which can be used to scale any similarity measure and implicitly favour high-confidence tracklet, high-confidence detection pairs in a one-stage assignment.

IoU alone can lead to many identity switches in crowded scenes, and recent algorithms use appearance features in addition to IoU and other motion features. However, using an additional visual embedding module increases time complexity, reducing FPS and limiting the possibility of real-time application. We propose three lightweight plug-and-play additions that can improve association performance:

  1. We use detection-tracklet confidence scores to scale IoU and increase the similarity of high-confidence detection-tracklet pairs. A high-variance prediction that gives a high IoU (or any other similarity measure) with a relatively low-confidence detection should not have the same weight as the overlap between a low-variance prediction and a high-confidence detection.

  2. Mahalanobis distance [38] can be used as a similarity measure to account for the estimated tracklet variance. Admissible values depend on the dimensionality of the tracklet and the chosen confidence interval, and any change requires a different scaling parameter. We introduce a more robust way of using Mahalanobis distance as a similarity measure.

  3. To reduce the possibility of identity switches in crowded scenes, we introduce a shape similarity, motivated by the fact that a mismatch can happen due to the high IoU overlap of moving objects, while the shape of the objects (i.e. width and height) should remain relatively constant over a short time frame.

In the rest of the paper, we refer to our methods for improving the estimation of the detection confidence and improving the calculation of the similarity matrix as detection confidence boosting and similarity boosting, respectively.

We demonstrate the effectiveness of the proposed additions on the MOT17 [41] and MOT20 [14] datasets. It has become standard practice to apply camera motion compensation (CMC) [1, 4, 16, 17, 37, 54] and interpolation of fragmented tracks [1, 17, 67, 69] in MOT. By integrating CMC and gradient boosting interpolation from [67], we achieve results comparable with state-of-the-art methods without using time-costly visual features, running at 65.45 FPS on MOT17 and 32.79 FPS on MOT20 on a desktop with one NVIDIA GeForce RTX 3090 GPU and an AMD Ryzen 9 5950X 16-core CPU. Furthermore, by adding visual embedding to our system, which we refer to as BoostTrack+, at the expense of longer run-time (15.35 FPS on MOT17 and 3.05 FPS on MOT20), we outperform all standard benchmark solutions. BoostTrack+ ranks first among online methods in HOTA score on the MOT17 test set and first among all methods in HOTA score on the MOT20 test set (see figure 1 for a visual comparison).

Fig. 2 Overview of our BoostTrack method. We use existing tracklets and detected bounding boxes to increase detection confidence scores before filtering out low confidence detections. We use the remaining bounding boxes to calculate the base (main) similarity measure and to improve it by adding our lightweight similarity boost

In summary, we make the following contributions:

  • We introduce two detection confidence boosting techniques to utilize low-confidence detections in a one-stage assignment,

  • We define detection-tracklet confidence and use it to give more weight to association pairs with high detection confidence and high tracklet confidence, avoiding the need for the multiple-stage association used in some previous works,

  • We propose a novel way of incorporating Mahalanobis distance and shape similarity into the similarity matrix,

  • We perform a detailed ablation study on MOT17 and MOT20 validation sets to show the effectiveness of the proposed methods. Our appearance-free BoostTrack method outperforms standard benchmark solutions and achieves comparable performance with the most recent methods on MOT17 and MOT20 test sets. Our BoostTrack+ method ranks first among online methods in HOTA score on both MOT17 and MOT20 test sets under private detection protocol.

We give an overview of our method in figure 2. The rest of the paper is structured as follows: in section 2, we review related work, focusing on various multiple-stage association procedures, tracklet confidence, and the different similarity measures used in previous works. Section 3 introduces detection-tracklet confidence and the three proposed similarity matrix boosting techniques. In section 4, we discuss our two detection confidence boosting strategies, namely increasing the detection confidence of likely objects based on IoU and increasing the detection confidence of unlikely objects based on Mahalanobis distance. We discuss experiments, show the results of the ablation study and compare our results with benchmark methods in section 5. We conclude our work in section 6.

2 Related work

2.1 SORT

Solving MOT online using the Kalman filter for tracking, IoU as a similarity measure, and the Hungarian algorithm for the assignment was first introduced in SORT [6]. SORT uses a linear constant velocity model, and in every step the Kalman filter is used to predict the state of the tracklet:

$$\begin{aligned} \textbf{x} = [u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^T, \end{aligned}$$
(1)

where u, v, s and r represent the coordinates of the bounding box center, its area and its aspect ratio, respectively, and \( \dot{u}, \dot{v}, \dot{s}\) the corresponding velocities (the authors assume the aspect ratio to be constant).

Assignment cost is calculated as \(-1\cdot \text {IoU}(D, T)\), and only assignments with IoU greater than a specified threshold, \(\tau _{IoU}\), are considered admissible.
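For illustration, the SORT assignment step can be sketched in a few lines of Python; the `iou_matrix` helper (assumed to return pairwise IoU values between boxes) and the variable names are illustrative, not part of the original SORT code.

```python
from scipy.optimize import linear_sum_assignment

def associate(dets, trks, iou_matrix, tau_iou=0.3):
    """SORT-style one-stage assignment: cost = -IoU, solved with the
    Hungarian algorithm; pairs with IoU below tau_iou are rejected."""
    if len(dets) == 0 or len(trks) == 0:
        return [], list(range(len(dets))), list(range(len(trks)))
    iou = iou_matrix(dets, trks)              # (n, m) similarity matrix
    rows, cols = linear_sum_assignment(-iou)  # minimizing -IoU maximizes IoU
    matches = [(r, c) for r, c in zip(rows, cols) if iou[r, c] >= tau_iou]
    matched_d = {r for r, _ in matches}
    matched_t = {c for _, c in matches}
    unmatched_d = [i for i in range(len(dets)) if i not in matched_d]
    unmatched_t = [j for j in range(len(trks)) if j not in matched_t]
    return matches, unmatched_d, unmatched_t
```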

2.2 Working with unreliable detections

Various strategies have been used to deal with unreliable detections, i.e. to identify and discard false positive detections [46]. In [61], an SVM [7] model was trained to classify detections into a tracked or an inactive class. In [42], all detections are used for the association, but the association is done in a two-stage manner: detections and tracklets with a greater similarity are associated first, and the rest are associated in the second stage. Filtering out unreliable detections based on detection confidence, i.e. thresholding, is the most common practice [17, 60, 65]. However, some low-confidence detections can correspond to partially occluded objects, and using these detections can increase performance.

In [51], high-confidence detections are used for trajectory initialization and tracking, while low-confidence detections are used for tracking only. ByteTrack [69] uses all detection boxes in a two-stage assignment where high-confidence boxes are used in the first stage and the remaining boxes and tracklets in the second. Following ByteTrack, the same two-stage assignment was adopted in several works (e.g. [1, 30, 37, 58, 62]). In [39], the authors proposed an offline tracking algorithm that uses all detection boxes. LG-Track [40] uses localization and classification confidence scores from the detector and divides detections into four groups based on thresholds for the two scores. The association is performed in four stages using different cost matrices, which are scaled differently by detection confidence scores in different association stages. ImprAsso [55] splits detections into high- and low-confidence sets and calculates association distances for both sets, which are combined into a single matrix and used in a single association step.

2.3 Tracklet confidence

Being a general term, tracklet confidence has been defined differently and used for different purposes in several prior works [2, 3, 11, 13, 42, 63]. In [2], tracklet confidence is defined as an intuitive measure of similarity between the constructed and the real object trajectory and used to split trackers into low- and high-confidence groups, which are associated differently based on the group they belong to. In [11], the scene is split into a \(k \times k\) grid, and both detections and predicted bounding boxes are used as candidates for the association. Tracklet confidence is defined and used to calculate the probability of the object being in a given area of the image and to filter out unreliable predictions. In [3], a tracklet score is calculated based on detection scores and used as a tracklet termination criterion. In [42], tracklet confidence is designed and used to detect occlusions. Since a low-confidence detection, in the case of a true positive, usually means the object is partially visible or occluded, in [63] tracklet confidence is defined as a measure of object visibility and predicted as a part of the object state. The difference between the detection box confidence and the predicted tracklet confidence is used as an additional similarity measure [63].

2.4 Similarity measures

Various improvements and additions to IoU have been proposed to improve matching performance. In [30], Generalized IoU (GIoU) [48] is used. Normalized IoU is proposed in [42] to include differences between bounding box size and center. To account for object motion, a momentum term was added to IoU in [9]. Width and height information is used in several previous works, e.g. [32, 35, 65], and in [63] Height Modulated IoU is introduced to explicitly incorporate height similarity into the IoU matrix.

Since DeepSORT [60], using appearance features to associate detections with tracklets has become a popular approach in tracking-by-detection MOT [1, 17, 57, 59]. Specifically, the cosine distance between visual embedding vectors is used to construct the association cost matrix. Mahalanobis distance is used as a gating mechanism to discard inadmissible associations. Since uncertainty increases when a tracklet is not updated (due to occlusion or a missing detection), the assignment is done in a cascade, in increasing order of the number of steps since the last update [60]. Following DeepSORT, a weighted sum of Mahalanobis distance and cosine distance is used in several other works (e.g. [1, 17, 68]). In [33], Mahalanobis distance is smoothed by adding \(\alpha \cdot I\) to the covariance matrix when calculating the distance.

2.5 Our approach

To the best of our knowledge, no prior work used detection-tracklet confidence to implicitly prioritize high-confidence detection or high-confidence tracklets in a single-stage association. In  [35, 60], recently updated tracklets (i.e. the more confident ones) are explicitly favoured in cascade matching, and works  [1, 30, 37, 42, 58, 62, 69] use two-stage matching prioritizing high-confidence detections.

We adopted and modified the shape similarity from [32]. In [32] (and [65]), shape similarity (similarity between height and width) is used in conjunction with Euclidean distance and visual similarity to construct an association cost matrix, and it cannot be used as a standalone metric or as an addition to the association cost. In our work, shape similarity is used together with detection-tracklet confidence to create a standalone addition to the similarity matrix designed to reduce possible ambiguity arising from the IoU measure.

Prior works used Mahalanobis distance directly to define the similarity matrix  [17, 68] or to discard inadmissible associations  [60]. In our work, we convert Mahalanobis distance into probabilities, creating a more intuitive, robust, and universal metric.

In  [63], Height Modulated IoU, velocity and confidence cost are added to the appearance cost to create a final cost matrix used for association. In our terminology, this can be seen as boosting the appearance similarity by adding additional lightweight measures. In our work, we propose and use different similarity measures.

In  [55], one-stage association is performed using all detection boxes and normalization parameter \(\beta \) is used to control the association cost of low-confidence detections. Our detection confidence boosting strategy does not use all detection boxes in the association step or require any change in creating an assignment cost matrix but rather increases the confidence of some boxes before filtering out low-confidence detections. No prior work, to our knowledge, explicitly targeted and used detections that seem to be outliers. Our detection confidence boosting method attempts to use more true positive detection bounding boxes without the need for sequence-specific hyperparameter tuning used in some previous works (e.g.  [1, 55, 69]).

3 Similarity matrix boost techniques

In this section, we introduce our similarity matrix boost techniques. Proposed improvements are orthogonal to existing approaches and can be added to any similarity matrix \(S_{base}\) (e.g. \(S_{base}\) can be calculated as IoU between detected bounding boxes and existing tracklets) to improve assignment performance.

Note that, to use the similarity matrix as the assignment cost matrix needed for the Hungarian algorithm [28], we need to “reverse” the values, i.e. the greater the similarity between a given detection-tracklet pair, the lower the corresponding assignment cost. As in SORT [6], we obtain the assignment cost matrix by multiplying the similarity matrix by \(-1\).

3.1 Detection-tracklet confidence similarity boost

To benefit from hierarchical assignments that favour high-confidence detections [1, 30, 37, 58, 62, 69] or recently updated tracklets [35, 60], and to avoid the drawbacks of such approaches, we design detection-tracklet confidence as a scaling factor that favours high-confidence detection, high-confidence tracklet pairs in a one-stage assignment.

Let \(D = \{D_1, D_2, \ldots , D_n\}\) and \(T = \{T_1, T_2, \ldots , T_m\}\) be the set of detections and the set of tracklets, respectively, and \(c_{d_1}, c_{d_2}, \ldots , c_{d_n}\) and \(c_{t_1}, c_{t_2}, \ldots c_{t_m}\) their corresponding confidence scores. \(T_1, T_2, \ldots , T_m\) are obtained as outputs of the Kalman prediction step.

Recently updated tracklets (i.e. active tracklets that were recently assigned a detection and executed a Kalman update step) should have a more reliable state prediction and higher confidence. Due to initially noisy predictions, new tracklets should have lower confidence. Let \(\text {age}(T_j)\) and \(\text {last\_update}(T_j)\) be the number of steps since the creation and the number of steps since the last update of tracklet \(T_j\), respectively. We define the tracklet confidence \(c_{t_j}\) as:

$$\begin{aligned} c_{t_j} = {\left\{ \begin{array}{ll} \beta ^ {s_{init}-\text {age}(T_j)},&{} \text {if } \text {age}(T_j) < s_{init}\\ \beta ^ {\text {last\_update}(T_j)-1}, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(2)

for \(j \in \{1, 2, \ldots , m\}\), where \(\beta \in (0, 1)\) is the tracklet confidence decay hyperparameter and \(s_{init}\) is the number of steps for which we consider the tracklet “new”, i.e. having initial unreliable predictions.
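As a minimal sketch, equation (2) can be computed per tracklet as follows (function and argument names are illustrative):

```python
def tracklet_confidence(age, last_update, beta=0.9, s_init=7):
    """Tracklet confidence from equation (2): new tracklets (age < s_init) and
    tracklets that have not been updated recently receive lower confidence."""
    if age < s_init:
        return beta ** (s_init - age)
    return beta ** (last_update - 1)
```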

Note that detection confidence scores \(c_{d_1}, c_{d_2}, \ldots , c_{d_n}\) are available as the output of the detector.

We define detection-tracklet confidence of detection \(D_i\) and tracklet \(T_j\), \(c_{d_i,t_j}\), as a product \(c_{d_i} \cdot c_{t_j}\). To encourage admissible associations only (e.g. \(\text {IoU}(D_i, T_j) \ge \tau _{IoU}\)) we set \(c_{d_i,t_j}\) to 0 for inadmissible associations. In summary, we define \(c_{d_i,t_j}\) as

$$\begin{aligned} c_{d_i,t_j} = {\left\{ \begin{array}{ll} c_{d_i} \cdot c_{t_j},&{} \text { if } (D_i, T_j) \text { is admissible}\\ 0,&{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(3)

We define confidence matrix C as \(C = [c_{d_i,t_j}]_{n \times m}\).

Using the detection-tracklet confidence scores, we can boost the similarity matrix \(S_{base}\) by adding the confidence-scaled \(\text {IoU}(D, T)\):

$$\begin{aligned} S_{boost} = S_{base} + \lambda _{IoU} \cdot C\odot \text {IoU}(D, T), \end{aligned}$$
(4)

where by \(\odot \) we denote element-wise matrix product, and \(\lambda _{IoU}\) is a hyperparameter. Note that any similarity score can be used in place of the IoU.
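A minimal NumPy sketch of equations (3) and (4), assuming `det_conf` and `trk_conf` hold the detection and tracklet confidence scores and `iou` is the precomputed IoU matrix (names are illustrative):

```python
import numpy as np

def dtc_boost(s_base, iou, det_conf, trk_conf, tau_iou=0.3, lambda_iou=0.5):
    """Detection-tracklet confidence boost (equations (3)-(4))."""
    c = np.outer(det_conf, trk_conf)   # c_{d_i} * c_{t_j} for every pair
    c[iou < tau_iou] = 0.0             # zero out inadmissible associations
    s_boost = s_base + lambda_iou * c * iou
    return c, s_boost
```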

3.2 Mahalanobis distance similarity boost

Mahalanobis distance  [38] is used as a similarity measure in some previous works (e.g.  [1, 17, 68]).

Note that the Kalman filter, used here as a state estimator, provides rigorous and optimal performance guarantees that do not rely on any assumptions about the process or observation noise other than that their mean and covariance are known, and it is an optimal minimum mean square error estimator [27]. However, if the process and observation noise are assumed to be Gaussian, then the estimated mean and state covariance parameterize Gaussian distributions.

In that case, Mahalanobis distance values are chi-squared distributed, and relevant values depend on the degrees of freedom and chosen confidence interval boundary. Usually, a 95% confidence interval and 4 degrees of freedom are used, giving a range of admissible values of \((0, \, 9.4877)\) and requiring a relatively low weight factor \(\lambda = 0.02\)  [1, 17, 68]. In 3D MOT, detections are given as a tuple of 7 parameters, and in both 2D and 3D MOT we may choose to use only the box center for calculating Mahalanobis distance. However, changing the confidence interval or the degrees of freedom gives a different range of values and would require a different \(\lambda \) value.

On the other hand, only a relative difference between Mahalanobis distance values is relevant for the assignment task. From the perspective of a given tracklet, we can think of Mahalanobis distances between the detections as unnormalized probabilities. Motivated by this, we apply the softmax function to normalize Mahalanobis distance. First, we clip distances to a max limit value (e.g. 9.4877). Then, we subtract each value from the limit value. Finally, we apply softmax, and to avoid giving weight to inadmissible associations, we set the similarity measure (i.e. “probability") to 0 for detections beyond the limit value. Note that we apply softmax for each column of the similarity matrix. A pseudocode of our procedure is illustrated in algorithm  1 in Appendix A.
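The following is a minimal sketch of that normalization (corresponding to algorithm 1), assuming `mhd` is the matrix of Mahalanobis distances with detections as rows and tracklets as columns; the limit value is a parameter (e.g. 9.4877 or 13.2767, depending on the chosen confidence interval):

```python
import numpy as np

def mahalanobis_similarity(mhd, limit=13.2767, temperature=1.0):
    """Convert Mahalanobis distances to per-tracklet 'probabilities':
    clip to the limit, reverse, apply a column-wise softmax, and zero out
    detections whose distance exceeds the limit."""
    beyond = mhd > limit
    scores = (limit - np.clip(mhd, 0.0, limit)) / temperature
    exp = np.exp(scores - scores.max(axis=0, keepdims=True))  # stabilized softmax
    s_mhd = exp / exp.sum(axis=0, keepdims=True)              # softmax per column
    s_mhd[beyond] = 0.0
    return s_mhd
```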

After obtaining the \(S^{MhD}\) similarity matrix in the described way, we can boost the initial similarity measure \(S_{base}\) by:

$$\begin{aligned} S_{boost} = S_{base} + \lambda _{MhD} \cdot S^{MhD}(D, T), \end{aligned}$$
(5)

where by \(\lambda _{MhD}\) we denote the weight of Mahalanobis distance similarity boost.

Normalizing Mahalanobis distances provides greater robustness to dimensionality changes (there is no need to change the weight, only the clip threshold) and enables direct comparison with other similarity measures. When a few detection bounding boxes have a similar Mahalanobis distance to a given tracklet, the softmax can effectively reduce the impact of the Mahalanobis distance similarity and make the \(S_{base}\) similarity more decisive (however, in the case of \(S_{base}\) ambiguity, \(S^{MhD}\) may provide new information and enable a correct assignment). Furthermore, adjusting the softmax temperature gives us more control in handling ambiguous assignments. Note that we keep the temperature parameter equal to 1 for simplicity.

3.3 Shape similarity boost

To avoid possible ambiguity arising from the other similarity metrics (e.g. IoU), we reintroduce the shape similarity metric. Consider a scenario where two objects overlap heavily (e.g. pedestrians passing by one another). The corresponding tracklets can have a greater IoU with the wrong detection boxes, leading to an identity switch. However, over a short time frame the objects' shape (width and height) should remain relatively constant, and using shape information could reduce this ambiguity.

We should not rely too much on the shape information of a tracklet that was not recently updated. For example, a person may move their hands and appear wider, but only temporarily. Even the height can change as the object moves closer to or farther from the camera. On the other hand, we should also consider detection confidence, because comparing shapes with unreliable detections could reduce the reliability of the shape similarity. To account for this, we scale the shape similarity by the detection-tracklet confidence scores introduced in subsection 3.1. Let \(ds_{i,j}\) be the shape difference between the detection \(D_i\) and the tracklet \(T_j\), defined as

$$\begin{aligned} ds_{i,j}=\frac{|D_i^w - T_j^w|}{\text {max}(D_i^w, T_j^w)} + \frac{|D_i^h - T_j^h|}{\text {max}(D_i^h, T_j^h)}, \end{aligned}$$
(6)

where by w and h in superscript we denote width and height, respectively. We define our shape similarity measure between detection \(D_i\) and tracklet \(T_j\) as

$$\begin{aligned} S^{shape}_{d_i, t_j} = c_{d_i,t_j} \cdot \text {exp}\big (- ds_{i,j}\big ). \end{aligned}$$
(7)

We can boost similarity measure \(S_{base}\) by:

$$\begin{aligned} S_{boost} = S_{base} + \lambda _{shape} \cdot S^{shape}(D, T), \end{aligned}$$
(8)

where \(\lambda _{shape}\) is a hyperparameter used as the weight of the shape similarity.
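A minimal sketch of equations (6) and (7), assuming `det_wh` and `trk_wh` are arrays of detection and tracklet widths and heights and `c` is the confidence matrix from equation (3); the boost of equation (8) then simply adds \(\lambda _{shape}\) times the result to \(S_{base}\):

```python
import numpy as np

def shape_similarity(det_wh, trk_wh, c):
    """Confidence-scaled shape similarity (equations (6)-(7)).
    det_wh: (n, 2) widths/heights of detections; trk_wh: (m, 2) of tracklets;
    c: (n, m) detection-tracklet confidence matrix."""
    dw = np.abs(det_wh[:, None, 0] - trk_wh[None, :, 0])
    dh = np.abs(det_wh[:, None, 1] - trk_wh[None, :, 1])
    ds = dw / np.maximum(det_wh[:, None, 0], trk_wh[None, :, 0]) \
       + dh / np.maximum(det_wh[:, None, 1], trk_wh[None, :, 1])
    return c * np.exp(-ds)
```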

Combining the three proposed similarity boost techniques we get:

$$\begin{aligned} S_{boost} = S_{base}&+ \lambda _{IoU} \cdot C\odot \text {IoU}(D, T) \nonumber \\&+ \lambda _{MhD} \cdot S^{MhD}(D, T) \nonumber \\&+\lambda _{shape} \cdot S^{shape}(D, T). \end{aligned}$$
(9)
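Putting the pieces together, equation (9) amounts to a weighted sum of the three boost terms on top of the base similarity; a sketch using the helper functions from the previous subsections:

```python
def boost_similarity(s_base, iou, c, s_mhd, s_shape,
                     lambda_iou=0.5, lambda_mhd=0.25, lambda_shape=0.25):
    """Equation (9): combine the three proposed similarity boost terms."""
    return (s_base
            + lambda_iou * c * iou
            + lambda_mhd * s_mhd
            + lambda_shape * s_shape)
```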

4 Detection confidence boosting techniques

Not all low-confidence detections are false positives. In this section, we describe our proposed methods of utilizing two groups of low-confidence detections by boosting their confidence score.

4.1 Detecting likely objects

When an object is partially occluded, it can sometimes still be detected. The detection confidence of such partially visible objects can be low, and their detected bounding boxes can be discarded for having a confidence score below the threshold \(\tau _D\). However, having a tracking module, i.e. the Kalman filter, enables us to predict where the object should be. If the IoU score between the tracklet and the detection box is high, we propose to increase the confidence of that detection box, enabling it to be used in the later association. For each detection box \(D_i\), we calculate the IoU between the detection box and all tracklets and obtain the boosted confidence \(\hat{c}_{d_i}\) using the maximum value of the calculated IoUs:

$$\begin{aligned} \hat{c}_{d_i} = \text {max}\big (c_{d_i}, \beta _c\cdot \max _{j}(\text {IoU}(D_i, T_j))\big ). \end{aligned}$$
(10)

Hyperparameters \(\beta _c\) and \(\tau _D\) implicitly define the IoU threshold for the detections to be used for the association, even if the original detection confidence \(c_{d_i}\) is low. This way, we also increase the confidence of some detections where \(c_{d_i} > \tau _D\), which we found to slightly increase the performance combined with our detection-tracklet confidence from subsection 3.1.
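A one-line NumPy sketch of equation (10), with `det_conf` the original confidences and `iou` the detection-tracklet IoU matrix (names are illustrative):

```python
import numpy as np

def boost_likely(det_conf, iou, beta_c=0.65):
    """Detecting likely objects (equation (10)): raise each detection's
    confidence to beta_c times its best IoU with any existing tracklet."""
    if iou.size == 0:
        return det_conf
    return np.maximum(det_conf, beta_c * iou.max(axis=1))
```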

Figure 3 shows an example of a highly occluded person with a low detection confidence bounding box (in red), \(c_d = 0.1748\), that has a high IoU (0.949) with the predicted bounding box. Our method increases the confidence of this detection and uses it for association.

Fig. 3 Detecting likely objects based on IoU value. Blue bounding boxes represent tracklets. The red bounding box is the detection with the original confidence score of 0.1748, which is increased based on its high IoU with the predicted bounding box

Fig. 4 Detecting “unlikely” objects based on Mahalanobis distance. Blue bounding boxes represent tracklets. The yellow bounding box is the detection with the original confidence score of 0.2252, which is increased based on its high Mahalanobis distance to all existing tracklets

4.2 Detecting “unlikely” objects

The previous confidence boosting strategy aimed to increase the detection confidence for detections where a tracked object is likely to be. However, some objects may never have been detected in the first place. They can be partially occluded during the entire video or positioned at the edge of the scene and only partially visible. We propose a method to detect some of these objects. In particular, we note that false positive low-confidence detections typically occur near the tracked objects (with the exception discussed in the previous subsection). If we detect an object far from where any tracked object is supposed to be, we can assume that this detection is not the result of motion and noise produced by a currently tracked object. Such outliers can actually be previously undetected objects.

Mahalanobis distance can be used to detect outliers  [23]. As previously noted, Mahalanobis distance values are chi-square distributed and values greater than a certain threshold \(\tau _{MhD}\) are considered outliers. We set the threshold \(\tau _{MhD}\) to 13.2767, corresponding to a \(99\%\) confidence interval bound for 4 degrees of freedom chi-square distribution.

For a given detection \(D_i\), we compute the distance between \(D_i\) and every tracklet \(T_j\) and consider \(D_i\) an outlier if the distance between \(D_i\) and the closest tracklet is greater than \(\tau _{MhD}\), i.e. if

$$\begin{aligned} \min _{j}(MhD(D_i, T_j)) > \tau _{MhD}. \end{aligned}$$
(11)

If a given detection \(D_i\) is an outlier with respect to the state distributions of all currently tracked objects, we consider it to correspond to a previously undetected object. However, some of these outliers can still be false positives. To all detections for which inequality (11) holds, we apply non-maximum suppression with a threshold of \(\tau _{NMS}=0.3\) to remove overlapping detections and reduce the number of used false positive detections. We provide details on the influence of \(\tau _{NMS}\) in Appendix B. We set the detection confidence of the remaining detections to \(\tau _D\), allowing them to be used in the association step.
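A minimal sketch of the procedure, assuming a standard IoU-based `nms` helper that returns the indices kept after non-maximum suppression (the helper and variable names are illustrative):

```python
import numpy as np

def boost_unlikely(det_conf, boxes, mhd, nms,
                   tau_mhd=13.2767, tau_nms=0.3, tau_d=0.6):
    """Detecting "unlikely" objects: detections whose smallest Mahalanobis
    distance to every tracklet exceeds tau_mhd (equation (11)) are kept after
    NMS and their confidence is raised to tau_d."""
    det_conf = det_conf.copy()
    if mhd.size == 0:
        return det_conf
    outliers = np.where(mhd.min(axis=1) > tau_mhd)[0]
    if outliers.size == 0:
        return det_conf
    keep = outliers[nms(boxes[outliers], det_conf[outliers], tau_nms)]
    det_conf[keep] = np.maximum(det_conf[keep], tau_d)
    return det_conf
```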

Figure 4 shows an example of a low detection confidence bounding box (in yellow) with original detection confidence \(c_d = 0.2252\). The smallest Mahalanobis distance to the existing tracklets is 2178.23, so we assume it is unlikely to be a false positive generated by the currently tracked objects. We increase the detection confidence of such detections. Note that our method can increase the confidence of false positive detections. Still, in the ablation study in subsection 5.3, we show that the number of used IDs does not increase significantly and that our method improves MOT performance.

5 Experiments

5.1 Datasets and metrics

Datasets. We use standard MOT benchmark datasets, MOT17  [41] and MOT20  [14], to conduct experiments. As in several popular benchmarks  [1, 17, 37, 69], we replace original dataset detections with detections from YOLOX-X  [22] and conduct experiments under private detection protocol.

MOT17 contains static and moving camera videos of pedestrians. Videos are filmed with different FPS settings (ranging from 14 to 30 FPS) and split into training and test sets. Following previous works  [1, 17, 37, 69, 71], we construct a validation set using the second half of each training sequence and use it for the ablation study (note that using a validation set is important because the detector and feature extractor are trained on the first half of the training data).

MOT20 contains 8 sequences (4 for training and the remaining 4 as the test set) of crowded scenes filmed with a static camera. Similarly, we use the second half of each sequence as the validation set.

Metrics. We evaluate tracking performance using the widely accepted CLEAR metrics (we focus primarily on MOTA, IDs and IDSWs) [5], IDF1 [49] and HOTA [36]. MOTA (Multi-Object Tracking Accuracy) is defined using the number of false positives (FP), false negatives (FN), identity switches (IDSW) and the total number of ground truth detections (gtDet) as:

$$\begin{aligned} \text {MOTA} = 1 - \frac{|\text {FP}|+|\text {FN}|+|\text {IDSW}|}{|\text {gtDet}|}. \end{aligned}$$
(12)

Note that FP and FN are ID agnostic. As such, MOTA does not penalize wrong associations heavily (only the IDSW term accounts for association mismatches, but it is insignificant compared to \(|\text {FP}|+|\text {FN}|\)) and is mainly used as a metric for detection performance. The IDF1 metric computes the matching at the identity level and can be used to measure the association performance. HOTA combines detection, association and localization accuracy and attempts to assess the overall tracking performance. It is calculated as the geometric mean of the detection accuracy (DetA) and the association accuracy (AssA), integrated over different localization thresholds (approximated as a finite sum for \(\alpha \in \{0.05, 0.1, \ldots , 0.95 \}\)) [36].
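For intuition, with hypothetical counts of 300 false positives, 700 false negatives, 50 identity switches and 10,000 ground truth detections, the metric evaluates to:

$$\begin{aligned} \text {MOTA} = 1 - \frac{300 + 700 + 50}{10000} = 0.895. \end{aligned}$$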

Since our detection confidence boosting techniques can introduce possible false positive detections and new identities, we explicitly monitor IDs and IDSWs in addition to MOTA, IDF1 and HOTA when discussing the impact of our confidence boosting methods.

5.2 Implementation details

Kalman filter. As in [17, 60, 68, 69], we define the state as an eight-dimensional vector \(\textbf{x} = [u, v, h, r, \dot{u}, \dot{v}, \dot{h}, \dot{r}]^T\), where (u, v) are the coordinates of the bounding box center and h and r the height and aspect ratio of the bounding box, respectively, while \(\dot{u}, \dot{v}, \dot{h}, \dot{r}\) represent their corresponding velocities. We retain the constant process and measurement noise from [6].

MOT specific settings. As in [6], we report the tracklet state only in the case of 3 consecutive matches (i.e. Kalman updates) and use \(\tau _{IoU} = 0.3\) as the criterion to discard inadmissible associations. As in [9, 37], we set the detection confidence threshold \(\tau _D\) to 0.6 for MOT17 and 0.4 for MOT20. In previous works (e.g. [1, 60, 69]), unassociated tracklets are kept for \(A_{\text {max}}=30\) frames. Since our detection confidence boost techniques rely on the tracklet predicted position, we keep tracklets alive for a longer period. Because different sequences can have different frame rates, a fixed value of 30 steps corresponds to different clock times (e.g. 2.14 seconds for the MOT17-05 sequence and 1 second for the MOT17-09 sequence). We use the sequence-specific value \(A_{\text {max}}=\text {max}(30, 2\cdot \text {FPS})\), i.e. we keep unassociated tracklets alive for at least 2 seconds. As in previous works (e.g. [19, 31, 37, 70]), we resize images from MOT17 and MOT20 to \(1440 \times 800\) and \(1600 \times 896\), respectively.

BoostTrack specific settings. As the base similarity measure, we use IoU, i.e. \(S_{base}=\text {IoU}\) in equation  (9).

We run a grid search to find the values of \(\lambda _{IoU}, \lambda _{MhD}\) and \(\lambda _{shape}\). For each \(\lambda \) we tested values from the set \(\{0, 0.25, 0.5, 0.75, 1\}\). As the best trade-off between different metrics on MOT17 and MOT20, we choose the values \(\lambda _{IoU}=0.5,\, \lambda _{MhD}=0.25,\, \lambda _{shape}=0.25\). We observed that any setting where \(\lambda _{MhD}\) is not dominant improves the performance compared to the baseline.

We tested various \((\beta , s_{init})\) settings for \((\beta , s_{init}) \in [0.7, 0.95] \times \{0, 1, \ldots 10\}\) on the MOT17 validation set. We found that any setting improves the performance of the baseline and set \(\beta =0.9,\, s_{init}=7\) for the best trade-off between different metrics. We provide more details on the influence of \(\beta \) and \(s_{init}\) in Appendix C.

We set the limit value for Mahalanobis distance for similarity boost and outliers detection to 13.2767, corresponding to a \(99\%\) confidence interval boundary for 4 degrees of freedom chi-square distribution. For boosting the detection confidence of likely objects, we set \(\beta _c=0.65\) for MOT17 and \(\beta _c=0.5\) for the MOT20 dataset, corresponding to the required IoU of 0.923 and 0.8 for low-confidence detection to surpass the threshold \(\tau _D\).

We evaluate results using TrackEval  [26].

Additional modules. For YOLOX-X [22], we use the weights from [69]. We use Enhanced Correlation Coefficient maximization from [18] for CMC and rely on the implementation from [17]. We keep the same settings as in [17], but resize the images to 350 pixels in width (and proportional height). We use the linear interpolation implementation from [69] and do not set a maximum interval for interpolation. Tracks with fewer than 25 detections are not interpolated (the default value from [69]), and we further improve interpolation results by applying gradient boosting interpolation from [67]. As in [1, 37, 69], we use FastReID [25] as the visual embedding model and apply the Dynamic Appearance embedding update from [37]. When using appearance similarity, we add \(\lambda _{app}\cdot S_{app}(D_i, T_j)\) to the overall similarity measure proposed in equation (9), where \(S_{app}(D_i, T_j)\) represents the cosine similarity between visual embedding vectors. Since the non-appearance similarity terms have a total weight of 2, we set \(\lambda _{app}=3\) to give more weight to the appearance similarity. In addition to the admissibility condition \(\text {IoU}(D_i, T_j)\ge \tau _{IoU}\), we allow assignments between detection box \(D_i\) and tracklet \(T_j\) if

$$\begin{aligned} \text {IoU}(D_i, T_j)\ge \frac{\tau _{IoU}}{2} \wedge S_{app}(D_i, T_j)\ge \frac{3}{4}. \end{aligned}$$
(13)
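In code, the relaxed admissibility check of equation (13) combined with the standard IoU gate is a simple predicate (a sketch; names are illustrative):

```python
def admissible(iou_ij, s_app_ij, tau_iou=0.3):
    """Pair (D_i, T_j) is admissible if the usual IoU gate passes, or if
    IoU >= tau_iou / 2 and the appearance (cosine) similarity >= 3/4."""
    return iou_ij >= tau_iou or (iou_ij >= tau_iou / 2 and s_app_ij >= 0.75)
```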

Hardware. We run all the experiments on the desktop with AMD Ryzen 9 5950X 16-Core CPU and NVIDIA GeForce RTX 3090 GPU.

Software. Our implementation is developed on top of publicly available codes  [6, 17, 37, 67, 69].

5.3 Ablation study

Similarity boost. To test the impact of each component of the proposed similarity boost technique, we conduct a detailed ablation study on the MOT17 and MOT20 validation sets. Table 1 shows results for every combination of the proposed components: detection-tracklet confidence (DTC) boost, Mahalanobis distance (MhD) boost and shape similarity boost. The first row corresponds to the case where \(S = \text {IoU}(D, T)\) and represents the baseline. We set \(\lambda _{IoU}=0.5,\, \lambda _{MhD}=0.25,\, \lambda _{shape}=0.25\) (see BoostTrack specific settings in subsection 5.2). The DTC and shape similarity boosts improve performance both as standalone additions and combined. Adding the MhD similarity boost alone slightly decreases the overall performance, but combined with DTC and shape it results in a significant performance gain on MOT17 and MOT20. We set the \(\lambda \) values based on the grid search and found that any setting where \(\lambda _{MhD}\) is not dominant improves the performance. Since the similarity boost should improve the association performance, we consider IDF1 the most important metric for showing the advantage of the proposed methods. We achieve +1.346 IDF1 on MOT17 and +0.953 IDF1 on MOT20 compared to the baseline. Note that we also achieve improvements in the other metrics.

Table 1 Ablation study on the MOT17 and MOT20 validation sets for different similarity boost settings (best in bold)

Detection confidence boost. We test the effectiveness of our detection confidence boost (DCB) techniques and show results in table 2. We tested our detecting likely objects (DLO) and detecting “unlikely" objects (DUO) strategies combined with our similarity boost (SB) components (when SB is used, we assume all three components, i.e. the last row of table 1). In addition to previous metrics, we display the total number of used IDs because boosting detection confidence can introduce new identities. Note that MOT17 and MOT20 validation sets contain 339 and 1418 ground truth identities, respectively.

Our study shows that both DCB techniques improve the MOT performance, both as standalone additions and combined with our SB techniques. Since DCB introduces new detections and SB aims to improve association, we discuss improvements in HOTA  [36] metric to summarize both detection and association performance. Without SB, we get +0.584 HOTA on MOT17 and +1.925 HOTA on MOT20. If we include SB, we get an overall performance increase of +1.546 HOTA on MOT17 and +2.327 HOTA on MOT20. This shows that proposed boosting techniques complement each other and can be used jointly. We also get a significant increase in IDF1 and MOTA metrics. IDSW cannot be trivially compared because the number of used IDs can be significantly different.

Table 2 Ablation study on the MOT17 and MOT20 validation sets for different detection confidence boost settings (best in bold)

Additional modules. As is standard practice, BoostTrack uses camera motion compensation (CMC) and interpolation to connect fragmented tracks. We use gradient boosting interpolation (GBI) from [67], but we also show results obtained using linear interpolation (LI). Finally, we add appearance similarity (AS) and show the effect of the added components on the MOT17 and MOT20 datasets in table 3. We did not include run-time speed in tables 1 and 2 because all those experiments run at approximately the same FPS. Adding CMC or AS affects execution speed, so we report FPS in addition to the previously used metrics. To make the comparison easier, we divide table 3 into three parts. The first part shows the results of adding CMC, GBI and AS to the baseline. To distinguish between the fast appearance-free method and the slower method that uses AS, we label the latter as BoostTrack+ and show the results of BoostTrack and BoostTrack+ in the second and third parts of the table, respectively.

Table 3 Ablation study on the MOT17 and MOT20 validation sets for different additional modules

Adding gradient boosting interpolation greatly improves results on MOT17, and we achieve 70.647 HOTA (+2.969), 79.8 MOTA (+4.786) and 82.323 IDF1 (+2.469). By using CMC combined with GBI, we achieve 71.63 HOTA, 80.692 MOTA and 83.959 IDF1 on the MOT17 validation set, retaining real-time execution speed of 65.45 FPS. Adding appearance similarity further improves the performance, and we get +0.781 HOTA and +1.426 IDF1 with a slight decrease in MOTA (\(-\)0.022).

On MOT20, GBI results in +2.553 HOTA, +4.155 MOTA and +1.476 IDF1 improvement. Since videos in MOT20 are filmed with a static camera, using CMC has no significant impact on performance. Adding appearance similarity further improves the performance: +0.834 HOTA, +0.289 MOTA, +1.504 IDF1, at the expense of increased computation time (3.05 FPS).

In the case of MOT17, our proposed additions effectively replace the need for AS, while adding AS increases the performance further. AS has a more significant impact on association performance in the crowded scenes of MOT20, and adding our techniques slightly reduces the HOTA and IDF1 values (\(-\)0.008 HOTA and \(-\)0.349 IDF1) but increases the MOTA value (+0.504 MOTA). For a fair comparison, we keep the same ratio of non-AS to AS similarity when adding AS to the baseline (no SB+DCB) and set \(\tau _{AS} = 1.5\).

5.4 Comparison with benchmark methods

We show the evaluation results on the MOT17 and the MOT20 test sets under private detection protocol in tables  4 and  5, respectively.

Table 4 Comparison with other MOT methods on the MOT17 test set (best in bold). We mark offline methods with ’*’
Table 5 Comparison with other MOT methods on the MOT20 test set (best in bold). We mark offline methods with ’*’

On the MOT17 and MOT20 test sets, our fast appearance-free BoostTrack method shows comparable performance. On MOT17, BoostTrack ranks fourth among online methods in the HOTA metric, while on MOT20 it shows comparable results (note that it still outperforms standard benchmark solutions such as StrongSORT [17] or ByteTrack [69]).

Our BoostTrack+ method effectively outperforms standard benchmark solutions on both datasets. Among online trackers, BoostTrack+ ranks first in the HOTA metric and second in the IDF1 metric on the MOT17 test set. On the MOT20 test set, BoostTrack+ ranks first in the HOTA metric among all methods and first in the IDF1 metric among online methods. Our method achieves comparable results in the MOTA metric on both datasets.

6 Conclusions

In this paper, we presented three techniques for improving the similarity measure between detections and tracklets and two techniques for increasing the confidence score of low-score detection bounding boxes. Our method uses a simple one-stage association and, combined with camera motion compensation and gradient boosting interpolation, achieves performance comparable with state-of-the-art methods on the MOT17 and MOT20 datasets while operating in real time. Adding appearance similarity further increases the performance of our method, and our BoostTrack+ ranks as the best online method in HOTA score on the MOT17 and MOT20 datasets.