1 Introduction

For automated vehicles, accurate detection in a single frame is not enough; a fluent and coherent understanding of the continuous time series of perception data is required. Therefore, most autonomous driving perception systems implement an online multiple object tracking (MOT) algorithm, which is responsible for tracking multiple targets of interest, recording their trajectories and maintaining their labels [1]. The output of an MOT algorithm generally contains two parts: the trajectory and the label of each tracked object. MOT becomes extremely challenging in noisy and crowded environments, where data association between measurements and states is hard to achieve. Without correct association, the quality of the state update can barely be guaranteed [1]. Figure 1 shows a general, widely used tracking framework based on measurement-state matching.

Fig. 1

General multi-object tracking framework based on the measurement-state matching data association paradigm. Matched pairs of measurement and state are fed into the motion prediction module in a filtering process and finally managed by the trajectory manager, while unmatched states are judged as birth or death by the lifetime manager

Data association is the process of matching individual observations and states. Much research has been conducted in this area. The most commonly implemented approach to data association is the global nearest neighbor (GNN) algorithm [2], whose main idea is to convert the matching of observations and states into finding the minimum distance between two sets. Popular realizations of GNN include the greedy algorithm, implemented in CenterPoint [3] and CenterTrack [4], and the Hungarian algorithm, implemented in AB3DMOT [5] and StanfordTRI Mahalanobis 3D [6]. These GNN-based tracking methods are quite efficient and have achieved great performance. However, GNN shows degraded performance in environments of high clutter density. To overcome this problem, Joint Probability Data Association (JPDA) [7] was proposed. However, due to its exponential complexity, JPDA is rarely implemented in the tracking systems of automated vehicles. In summary, both GNN and JPDA try to solve the optimal matching for a given observation and provide a deterministic matching result.
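As a minimal sketch of the greedy form of GNN association described above, the following Python snippet matches tracks to detections in order of increasing center distance. The function name `greedy_gnn`, the 2D point representation and the gate of 2.0 are illustrative assumptions and are not taken from any of the cited trackers.

```python
import math

def greedy_gnn(tracks, detections, gate=2.0):
    """Greedy nearest-neighbour matching on 2D centre distance.

    `gate` is a hypothetical distance threshold; pairs farther apart
    than the gate are never matched.
    """
    pairs = []
    for ti, t in enumerate(tracks):
        for di, d in enumerate(detections):
            dist = math.hypot(t[0] - d[0], t[1] - d[1])
            if dist <= gate:
                pairs.append((dist, ti, di))
    pairs.sort()  # closest pairs are committed first

    used_t, used_d, matches = set(), set(), []
    for _, ti, di in pairs:
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)

    unmatched_tracks = [i for i in range(len(tracks)) if i not in used_t]
    unmatched_dets = [i for i in range(len(detections)) if i not in used_d]
    return matches, unmatched_tracks, unmatched_dets

tracks = [(0.0, 0.0), (5.0, 5.0)]
detections = [(5.2, 4.9), (0.3, -0.1), (20.0, 20.0)]
matches, lost, new = greedy_gnn(tracks, detections)
# matches pairs track 1 with detection 0 and track 0 with detection 1;
# detection 2 is left unmatched and would be a candidate for a new track.
```

The result is a single deterministic assignment, which is the property the probabilistic approach in the rest of the paper relaxes.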

Unlike GNN and JPDA, the random finite set (RFS) framework, first proposed in Ref. [8], is a probabilistic approach to data association. An RFS-based tracker is able to estimate the states and cardinality of multiple targets from a set of observations with clutter. Owing to the application of Bayesian probability, the RFS-based tracker can associate measurements and states while avoiding high-complexity deterministic data association [9]. Some closed-form realizations of RFS include the probability hypothesis density (PHD) [10], the cardinalized probability hypothesis density (CPHD) [11], labeled multi-Bernoulli (LMB) [12] and Poisson multi-Bernoulli mixture (PMBM) [13]. PMBM has been shown to have superior performance and simpler parameterization than the generalized labeled multi-Bernoulli (GLMB) filter. The PMBM filter has been implemented for vision-based tracking on the KITTI dataset [14], and for lidar tracking on the Argoverse and Waymo datasets [15].

Probabilistic tracking approaches, such as RFS-based trackers, have the advantage of higher efficiency in data association. Nevertheless, they may cause frequent track fragmentation and track discontinuity, because early RFS research does not consider the constraint on the continuity of the trajectory. There is a tendency to extend the state of the RFS tracker to consider the trajectories of a target [16]. Xia et al. extend the concept of PMBM to multi-scan PMBM based on sets of trajectories. However, computing the probability density of trajectories has high computational complexity and takes considerable computation time [17]. How to perform trajectory PMBM in an environment with strong real-time requirements remains under-explored.

The constraint of a trajectory requires considering the continuity among states over a period of time. In contrast, the classic filtering method only considers the states and observations of two frames, the previous frame and the current frame. To consider more consecutive frames during online tracking, there is a growing tendency to analyze the update of a short trajectory, called a tracklet. PC-TCNN [18] detects tracklets directly from raw point clouds and performs data association between tracklets. In the concept of the tracklet, forward filtering considers multiple frames of historical information, while backward smoothing estimates the probability that the tracklet exists. This probability of existence is defined as confidence. In contrast to traditional count-based methods, a method that estimates confidence is called a confidence-based method. CBMOT [19] derives scores from detection scores and performs score decay through tracklets. PC-3T [20] performs score update via a prediction confidence model. ByteTrack [21] designs a simple structure, “BYTE”, which can associate detections with either high-confidence or low-confidence tracklets. In summary, the score imposed on a tracklet encodes the intuition that targets observed in the field of view over a long period will not suddenly disappear. Existing methods design different hand-crafted processes to manage the change of confidence of each tracklet. The interpretation, quantification and utilization of confidence remain under-explored.

This paper proposes PTMOT, a confidence-based probabilistic tracklet multi-object tracker, which aims to realize probabilistic association between tracklets and measurements to reduce the computational burden, and to integrate confidence-based estimation into the RFS framework. Moreover, this paper adopts a confidence-based approach to adjust the confidence of tracklets.

Compared with deterministic association methods, the proposed method estimates multiple association hypotheses and can therefore correct occasional errors in certain frames. Prior RFS-based methods are prone to frequent ID switches caused by switching among multiple hypotheses, which damages tracking continuity. The proposed method combines confidence estimation with the PMBM model, which greatly improves the tracking continuity of the random finite set method. The proposed tracker achieves high performance in the nuScenes tracking challenge using different modalities: lidar, cameras and radars. Specifically, the contributions of this paper are as follows:

  1. A probabilistic MOT tracker which maintains both trajectory and label is developed by integrating the PMBM filter and a tracklet confidence estimation method. The proposed probabilistic MOT tracker is applicable to different detection inputs and achieves satisfactory performance.

  2. A confidence-based method is proposed which continuously estimates and updates the confidence of the tracklets. The single target hypotheses (STHs) of a target at different moments update the confidence of the leaf nodes following a bottom-up tree structure. Global hypotheses are remapped to enhance the confidence of the most likely STHs.

The rest of this paper is organized as follows: Sect. 2 introduces the architecture of the proposed tracker. Section 3 presents the process of the tracking algorithm and details of modules. Section 4 discusses the experiment, quantitative results and ablation study on datasets. Section 5 concludes this paper.

2 Probabilistic Tracker Architecture

The multi-object tracker jointly estimates the position and label of each target of interest, while filtering out false detections and supplementing missed detections. By correlating the targets in the time series, the past trajectory of each target is obtained, and its velocity is computed. The past trajectories are the most important input for trajectory prediction; therefore, multi-object tracking is the foundation of trajectory prediction.

The pipeline of the proposed probabilistic 3D multi-object tracker has the structure shown in Fig. 2. The three core modules are the Poisson process for tracking position, the multi-Bernoulli mixture process for tracking label and the confidence estimation module. The pipeline consists of five stages: input formatting, Poisson process, multi-Bernoulli mixture process, confidence estimation and output formatting. Compared to deterministic trackers, the proposed tracker only replaces the original deterministic observation-track matching process, without changing the underlying operation of the filtering-based tracking method, including matching metrics and filtering methods. It can work with any filtering algorithm, such as the Kalman filter, extended Kalman filter or unscented Kalman filter, and with any matching metric, such as 3D-IOU, Mahalanobis distance or distance in an embedded feature space.

Fig. 2

Pipeline of the probabilistic multiple object tracker. After formatting detection results as point targets, the proposed tracker executes Poisson process for initialization of possible new tracklets. Then, the MBM process searches several most likely GHs and updates STHs. The confidence module refines the confidence score of STHs. The targets and unique labels of the most likely hypothesis are formatted as tracking outputs

2.1 Probabilistic Tracking of Position

First, a standard tracking-by-detection approach is adopted. At each time step, the 3D object detector outputs detection results of n objects \(Z=\left( D_1,D_2,\ldots ,D_n\right)\). The detection results are encoded as point targets and fed into the tracker. Note that the detection results contain clutter, and the position tracking part is responsible for distinguishing and extracting true targets and their initial confidence from noisy measurements with clutter.

The Poisson component is responsible for tracking the position of newly initiated targets. It considers the density of undetected targets, manages the number of hypotheses of potential targets and gives birth to newly detected targets. The Poisson density is defined as:

$$\begin{aligned} f^\mathrm {p}\left( X\right) =e^{-\int \mu \left( x\right) \mathrm {d}x}\left[ \mu \left( \cdot \right) \right] ^X \end{aligned}$$
(1)

In Eq. 1, \(f^\mathrm {p}\left( X\right)\) denotes the posterior Poisson density of the whole set X of undetected objects, \(\mu (x)\) denotes the Poisson intensity, and \(\left[ \mu \left( \cdot \right) \right] ^X\) is shorthand for the product of \(\mu (x)\) over all \(x\in X\).
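To make Eq. 1 concrete, the following sketch evaluates the logarithm of the Poisson density for an intensity modeled as a Gaussian mixture. It assumes a one-dimensional state for brevity, and the component list and function names are hypothetical. For a Gaussian mixture, the integral of \(\mu (x)\) equals the sum of the component weights, which is the expected number of undetected objects.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def intensity(x, components):
    """Poisson intensity mu(x) as a weighted Gaussian mixture.

    Each component is a (weight, mean, std) tuple.
    """
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

def log_poisson_density(X, components):
    """log f^p(X) = -integral(mu) + sum over x in X of log mu(x)."""
    expected_count = sum(w for w, _, _ in components)  # integral of mu over the state space
    return -expected_count + sum(math.log(intensity(x, components)) for x in X)

# Two illustrative intensity components: (weight, mean, std)
components = [(0.5, 0.0, 1.0), (0.2, 10.0, 2.0)]

log_empty = log_poisson_density([], components)     # no undetected object
log_near = log_poisson_density([0.0], components)   # one object at an intensity peak
log_far = log_poisson_density([5.0], components)    # one object between the peaks
```

An object located at a peak of the intensity yields a higher density than one located between the peaks, which is why new tracklets are spawned at the peaks.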

To obtain a closed-form solution, the Poisson intensity is often modeled as a multi-modal Gaussian distribution. Figure 3 illustrates an example of the Poisson process graphically. At the previous timestamp, there are four intensity components. At the current timestamp, they are updated with several point measurements, and new tracklets are initialized at the peaks of the updated intensity. The predicted Poisson intensity and the birth intensity are updated in a quasi-convolutional manner, i.e., every intensity component interacts with every birth intensity component. In this way, no explicit one-to-one data association is required, and new targets are spawned based on the mixture of multiple Gaussian distributions.

Fig. 3

A simple example of the Poisson process of tracking location and spawning new tracks. First, generate intensity according to detection scores from measurements, then, correlate with predicted intensity to get the updated intensity, and finally initialize new tracklets whose confidence is beyond an adaptive threshold

2.2 Probabilistic Tracking of Label

The multi-Bernoulli mixture (MBM) process considers the global hypotheses and single target hypotheses of data association. The MBM density is defined as a sum over global hypotheses j:

$$\begin{aligned} f^{\mathrm {mbm}}(X)\propto \sum _{j}\sum _{X_1\uplus \ldots \uplus X_n=X}\prod _{i=1}^{n}w_{j,i}f_{j,i}\left( X_i\right) \end{aligned}$$
(2)

In Eq. 2, \(f^{\mathrm {mbm}}(X)\) is the posterior MBM density, X is the whole set of detected objects, \(w_{j,i}\) denotes the weight of the Bernoulli component \(X_i\) in global hypothesis (GH) j, and \(f_{j,i}\left( X_i\right)\) is the probability density of the single Bernoulli component \(X_i\) in global hypothesis j:

$$\begin{aligned} f_{j,i}(X)=\left\{ \begin{matrix}1-r_{j,i},&{}X=\emptyset \\ r_{j,i}p_{j,i}(x),&{}X=\{x\}\\ 0,&{}\mathrm {otherwise}\\ \end{matrix}\right. \end{aligned}$$
(3)

In Eq. 3, \(r_{j,i}\) denotes the existence probability of object x in STH X, and \(p_{j,i}(x)\) denotes its state density. Each STH represents either a missed detection or one detected object.
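The three cases of Eq. 3 can be written as a short function. This is only an illustrative sketch: the set X is represented as a Python list, and the state density `p` is passed in as a callable; here a hypothetical uniform density on the interval [0, 10] is used.

```python
def bernoulli_density(X, r, p):
    """Bernoulli set density of Eq. 3.

    X: list of states (the set), r: existence probability,
    p: callable returning the state density p(x).
    """
    if len(X) == 0:
        return 1.0 - r       # the object does not exist
    if len(X) == 1:
        return r * p(X[0])   # the object exists with state x
    return 0.0               # a Bernoulli component holds at most one object

def uniform_density(x):
    """Hypothetical state density: uniform on [0, 10]."""
    return 0.1 if 0.0 <= x <= 10.0 else 0.0

f_empty = bernoulli_density([], 0.8, uniform_density)
f_single = bernoulli_density([3.0], 0.8, uniform_density)
f_pair = bernoulli_density([3.0, 4.0], 0.8, uniform_density)
```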

The multi-Bernoulli process predicts the existing tracklets, updates the probabilities of multiple data associations, and proposes single target hypotheses and multiple global hypotheses. An illustration of the MBM process is shown in Fig. 4. The tracking results of the current frame are selected from the most likely GH.

Fig. 4

Multi-Bernoulli mixture process of tracking labels

2.3 Tracklet Confidence Estimation

Unlike traditional count-based methods, confidence-based methods manage the birth and death of tracklets based on confidence score. Confidence-based methods enable a continuous estimation of tracklets. A target is able to have a continuous existence probability, even if the detector occasionally loses the observation of the target. The tracklet is smoothed in terms of confidence.

Confidence serves the data association process. The confidence is defined as the Bernoulli parameter \(r_i\). The initial value originates from detection score. The confidence changes in the process of generating new child STHs in a tree structure. The confidence is filled into elements in the cost matrix.

Confidence estimation needs two important functions, score update and score decay. The score decay module performs a tree-structured backward smoothing of confidence to maintain a continuous estimation. The score update module enhances the confidence of existing STHs using the most likely GHs.

3 Proposed Confidence-Based Tracker

3.1 Overall Process

The proposed tracker adopts Bayesian filtering theory to formalize the multi-object tracking process. MOT can be modeled as a multi-variable estimation problem. \(x_t^i\) represents the state of the i-th object at the t-th timestamp, and \(\varvec{X_t} =(x_t^1,x_t^2,\ldots ,x_t^{M_t})\) is the state of all targets in the t-th frame. \(z_t^i\) represents the observation of the i-th target at the t-th timestamp, and \(\varvec{Z_t} =(z_t^1,z_t^2,\ldots ,z_t^{M_t} )\) represents the observations of all targets in the t-th frame.

The pseudo-code of the proposed tracker at a timestamp k is shown in Algorithm 1. The whole process is divided into six stages: prediction, update, reduction, score decay, score update and target estimation.

Algorithm 1 Pseudo-code of the proposed tracker at timestamp k

The main difference between the proposed tracker and typical PMBM filter lies in the prediction step. Unlike typical PMBM, the proposed tracker does not marginalize out the previous states. In general, the prediction step follows the Chapman–Kolmogorov equation:

$$\begin{aligned} f_{k+1 \mid k}(x_{k+1}) = \int g_{k+1}(x_{k+1}\mid x)f_{k \mid k}(x)\mathrm {d}x \end{aligned}$$
(4)

In Eq. 4, \(f_{k+1 \mid k}(x_{k+1})\) is the prediction of the next timestamp, \(g_{k+1}(x_{k+1} \mid x)\) is the single-target transition density at time \(k+1\) and the previous state is marginalized out to obtain the state \(x_{k+1}\) at timestamp \(k+1\).

The state variable of the PMBM tracker, the tracklet, is a concatenation of object states at consecutive timestamps in a sliding window. To meet the real-time requirements of online tracking, the tracklet is approximated as a series of first-order Markovian aggregations, and only the states of multiple frames are considered when estimating the confidence of the tracklet.

The smoothing of confidence is implemented through an online smoothing-while-filtering approach. The definition and usage of confidence are illustrated in Sect. 3.3.1. Two important functions in the score decay process are introduced in Sect. 3.3.2, while score update is introduced in Sect. 3.3.3.

3.2 PMBM Modules

3.2.1 PMBM Prediction

The prediction of PMBM consists of two stages, the prediction of Poisson components and the Bernoulli components.

Each Poisson component follows the state transition function \(g(p_{k+1}\mid p_k)\) and its weight \(w_i\) is multiplied by probability of survival \(P_{\mathrm {s}}\):

$$\begin{aligned}&p_{k+1}=g(p_{k+1}\mid p_k) p_k \end{aligned}$$
(5)
$$\begin{aligned}&w_i=w_i\cdot \ P_{\mathrm {s}} \end{aligned}$$
(6)

The prediction of each Bernoulli component \(f_{j,i}\left( X_i\right)\) follows the state transition function \(g(x_{k+1}\mid x_k)\), while the confidence is multiplied by probability of survival \(P_{\mathrm {s}}\):

$$\begin{aligned}&x_{k+1}=g(x_{k+1}\mid x_k)x_k \end{aligned}$$
(7)
$$\begin{aligned}&r_{j,i}=r_{j,i}\cdot \ P_{\mathrm {s}} \end{aligned}$$
(8)

The STHs after prediction are written to the current timestamp of the state sequence.

The state transition function \(g(x_{k+1}\mid x_k)\) follows a first-order constant velocity and constant turning rate (CVCT) motion model and is applied in a Kalman filter. The important state variables in 3D MOT include position \(p_x\), \(p_y\), scale, orientation \(\theta\), velocity \(v_x\), \(v_y\) and yaw rate \(\dot{\theta }\). The state is expressed as

$$\begin{aligned} {\varvec x} = \left[ p_{x}, p_{y},\theta ,v_{x},v_{y},\dot{\theta } \right] ^\mathrm {T} \end{aligned}$$
(9)

State-of-the-art detectors predict the velocity of each target, so the observations include velocity prediction \(v_x\) and \(v_y\).

$$\begin{aligned} {\varvec z}=\left[ p_x,p_y,\theta ,v_x,v_y\right] ^\mathrm {T} \end{aligned}$$
(10)

The state transition equation is:

$$\begin{aligned} {\varvec x}_{k+1}=\left[ \begin{matrix}1&{}0&{}0&{}\mathrm {\Delta t}&{}0&{}0\\ 0&{}1&{}0&{}0&{}\mathrm {\Delta t}&{}0\\ 0&{}0&{}1&{}0&{}0&{}\mathrm {\Delta t}\\ 0&{}0&{}0&{}1&{}0&{}0\\ 0&{}0&{}0&{}0&{}1&{}0\\ 0&{}0&{}0&{}0&{}0&{}1\\ \end{matrix}\right] \cdot \left[ \begin{matrix}p_x\\ p_y\\ \theta \\ v_x\\ v_y\\ \dot{\theta }\\ \end{matrix}\right] \end{aligned}$$
(11)

Given random measurement noise \({v}_k\), the observation model follows

$$\begin{aligned} z_k=\left[ \begin{matrix}1&{}0&{}0&{}0&{}0&{}0\\ 0&{}1&{}0&{}0&{}0&{}0\\ 0&{}0&{}1&{}0&{}0&{}0\\ 0&{}0&{}0&{}1&{}0&{}0\\ 0&{}0&{}0&{}0&{}1&{}0\\ \end{matrix}\right] \cdot {\varvec x}_k+{{\varvec v}}_k \end{aligned}$$
(12)

The position and orientation of 3D bounding boxes are tracked, while the dimensions and other properties are inherited from measurements.
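The prediction of a single Bernoulli component (Eqs. 7, 8, 11 and 12) can be sketched in a few lines of plain Python. The sketch below covers only the mean of the state and the confidence; covariance propagation of the Kalman filter is omitted. The survival probability of 0.9 is an assumed value for illustration, while the time step of 0.5 s corresponds to the sampling time between nuScenes samples.

```python
def transition_matrix(dt):
    """6x6 transition matrix of Eq. 11 for the state [p_x, p_y, theta, v_x, v_y, theta_dot]."""
    F = [[1.0 if row == col else 0.0 for col in range(6)] for row in range(6)]
    F[0][3] = dt  # p_x advances by v_x * dt
    F[1][4] = dt  # p_y advances by v_y * dt
    F[2][5] = dt  # theta advances by theta_dot * dt
    return F

# 5x6 observation matrix of Eq. 12: the yaw rate is not observed
H = [[1.0 if row == col else 0.0 for col in range(6)] for row in range(5)]

def matvec(A, x):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def predict(x, r, dt, p_survival):
    """Predict the state mean (Eq. 11) and scale the confidence (Eq. 8)."""
    return matvec(transition_matrix(dt), x), r * p_survival

x = [10.0, 2.0, 0.1, 4.0, -1.0, 0.2]
x_pred, r_pred = predict(x, r=0.8, dt=0.5, p_survival=0.9)  # p_survival is an assumed value
z_pred = matvec(H, x_pred)  # expected measurement without noise
```

The predicted measurement `z_pred` contains the first five entries of the predicted state, which is what a matching metric would compare against an incoming detection.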

3.2.2 PMBM Update

Algorithm 2 Pseudo-code of the PMBM update

The pseudo-code of the PMBM update is shown in Algorithm 2. Similar to the prediction stage, the PMBM update consists of two stages, the Poisson update and the Bernoulli update. For each Poisson component, the weight \(w_i\) of undetected objects that remain undetected is reduced according to the detection probability \(P_\mathrm {d}\).

$$\begin{aligned} w_i=(1-P_\mathrm {d})w_i \end{aligned}$$
(13)

The Poisson update is also in charge of the initialization of new tracklets. First, gating is performed between each measurement and the Poisson densities to select the neighboring Poisson components. Next, the selected components are merged to obtain the mixed state \(X_{\mathrm {mix}}\). Last, a new track is spawned whose first STH is the Bernoulli component \(f_{\mathrm {ini}}\left( X\right)\). The initial state of \(f_{\mathrm {ini}}\left( X\right)\) is \(X_{\mathrm {mix}}\), and the initial confidence is set to the detection score.
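The Poisson update can be illustrated with the following sketch, which applies Eq. 13 and then initializes a tracklet from gated Poisson components. Two simplifications are assumed here and are not taken from the paper: gating uses a Euclidean distance threshold, and merging is a weight-averaged mean of the gated component means. All names are hypothetical.

```python
import math

def undetected_weight(w, p_detect):
    """Eq. 13: weight of a Poisson component that stays undetected."""
    return (1.0 - p_detect) * w

def init_tracklet(measurement, det_score, components, gate=2.0):
    """Spawn a new tracklet from the Poisson components near a measurement.

    components: list of (weight, mean) pairs, mean being a tuple of coordinates.
    Returns None when no component falls inside the (assumed Euclidean) gate.
    """
    gated = [(w, mean) for w, mean in components
             if math.dist(mean, measurement) <= gate]
    if not gated:
        return None
    total = sum(w for w, _ in gated)
    # Weighted mean of the gated components plays the role of X_mix
    x_mix = tuple(sum(w * mean[d] for w, mean in gated) / total
                  for d in range(len(measurement)))
    # The initial confidence is the detection score
    return {"state": x_mix, "confidence": det_score}

components = [(0.2, (0.0, 0.0)), (0.6, (1.0, 0.0)), (0.5, (10.0, 10.0))]
track = init_tracklet((0.8, 0.1), 0.85, components)
no_track = init_tracklet((50.0, 50.0), 0.85, components)
```

In this example the first two components fall inside the gate, so the mixed state lies between them and closer to the heavier component.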

The core hypothesis-based data association is carried out in the Bernoulli update. First, for each STH, gating is performed between the measurements and the STH, and the STH is then updated with each neighboring measurement, producing several child STHs. Second, if there is no GH yet, the first one is spawned using the current STHs. If GHs already exist, new global hypotheses are spawned recursively based on the current ones. For each current GH with j STHs, and with i current measurements, a cost matrix \(C_{mn}\) of size \([i,i+j]\) is created. Each element is filled with the negative of the confidence of the corresponding child STH. Last, the Murty solver is run to obtain the weight of each GH.

3.2.3 PMBM Reduction

Reduction is a necessary step for PMBM to prune low-confidence candidate targets and keep the target candidate pool from overflowing. The pruning process is divided into three stages. First, the STHs whose confidence is below a certain threshold are pruned; rather than being discarded, the pruned states are recycled back to the Poisson mixture density. Second, global hypotheses whose weight is under a certain threshold are pruned, as well as hypotheses which include the removed STHs. Third, STHs that are not included in any global hypothesis are pruned. Since new global hypotheses are produced recursively, that is, each new GH is created from one global hypothesis of the last timestamp, an STH that is not included in any GH at this timestamp will never appear again.
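The three pruning stages can be sketched as follows. The data layout (a dictionary of STH confidences and a list of weighted GHs, each holding a set of STH identifiers) and the two threshold values are assumptions made for this illustration only.

```python
def reduce_hypotheses(sth_conf, global_hyps, r_min=0.1, w_min=0.05):
    """Three-stage pruning sketch.

    sth_conf: dict mapping STH id -> confidence.
    global_hyps: list of (weight, set of STH ids).
    r_min, w_min: hypothetical pruning thresholds.
    """
    # Stage 1: low-confidence STHs are removed and handed back for recycling
    recycled = {k for k, r in sth_conf.items() if r < r_min}

    # Stage 2: drop GHs with low weight or containing a removed STH
    kept_ghs = [(w, ids) for w, ids in global_hyps
                if w >= w_min and not (ids & recycled)]

    # Stage 3: drop STHs that no surviving GH refers to
    referenced = set().union(*(ids for _, ids in kept_ghs))
    kept_sths = {k: r for k, r in sth_conf.items() if k in referenced}
    return kept_sths, kept_ghs, recycled

sth_conf = {"a": 0.9, "b": 0.05, "c": 0.6, "d": 0.4}
global_hyps = [(0.7, {"a", "c"}), (0.27, {"a", "b"}), (0.03, {"a", "d"})]
kept_sths, kept_ghs, recycled = reduce_hypotheses(sth_conf, global_hyps)
```

Here STH "b" is recycled in the first stage, which removes the second GH, the third GH falls below the weight threshold, and STH "d" is then dropped because no surviving GH contains it.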

3.3 Confidence-Based Operations

3.3.1 Confidence in Probabilistic Data Association

Confidence is defined as the Bernoulli parameter \(r_i\) of each STH, which takes the form of a Bernoulli component. The confidence serves as an important reference for data association. In the PMBM structure, the probabilistic data association takes place in the update process described in Sect. 3.2.2. The initial confidence \(r_i^0\) of a newly spawned tracklet comes from the detection score \(s_{\mathrm {D}}\):

$$\begin{aligned} r_i^0 = s_{\mathrm {D}} \end{aligned}$$
(14)

The confidence \(r_{i,j}^t\) of a new STH \(s_{i,j}^t\) matched with an observation \(z_j\) is generated from the confidence \(r_i^{t-1}\) of its parent STH \(s_i^{t-1}\) and takes into account the likelihood between observation \(z_j\) and state \(x_i\):

$$\begin{aligned} r_{i,j}^t = r_i^{t-1}\cdot P_\mathrm {d}+\mathrm {likelihood}(x_i,z_j) \end{aligned}$$
(15)

The likelihood function computes the normalized likelihood according to a given matching metric. The confidence \(r_{i}^t\) of a new STH \(s_{i}^t\) indicating that the target is missed is generated only from the confidence \(r_{i}^{t-1}\) of its parent STH \(s_{i}^{t-1}\) and the detection probability \(P_\mathrm {d}\):

$$\begin{aligned} r_{i}^t=\frac{r_{i}^{t-1}\cdot (1-P_\mathrm {d})}{1-r_{i}^{t-1}+r_{i}^{t-1}\cdot (1-P_\mathrm {d})} \end{aligned}$$
(16)
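The two confidence updates of Eqs. 15 and 16 can be checked numerically with the sketch below. It assumes that the first term of Eq. 15 uses the confidence of the parent STH, and that the likelihood is already normalized by the caller; the function names and the input values are illustrative.

```python
def confidence_detected(r_parent, p_detect, likelihood):
    """Eq. 15: confidence of a child STH matched with an observation.

    `likelihood` is assumed to be a normalized likelihood supplied by the caller.
    """
    return r_parent * p_detect + likelihood

def confidence_missed(r_parent, p_detect):
    """Eq. 16: confidence of a child STH when the target is missed."""
    miss = r_parent * (1.0 - p_detect)
    return miss / (1.0 - r_parent + miss)

r_hit = confidence_detected(0.8, 0.9, 0.1)   # 0.8 * 0.9 + 0.1
r_miss = confidence_missed(0.8, 0.9)         # 0.08 / 0.28, roughly 0.286
```

With a parent confidence of 0.8 and a detection probability of 0.9, a single missed detection lowers the confidence to roughly 0.29 rather than removing the tracklet, which is what allows a target to survive occasional detection losses.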

The cost matrix \(\varvec{C_{M\times N}}\) is used to find several optimal association alternatives. M is the number of measurements, and N is the number of candidate tracklets, i.e., the \(N-M\) existing tracklets plus one potential new tracklet per measurement. The matrix is divided into two sub-matrices: \(\varvec{C_{{1:M}, {1:N-M}}}\) holds the cost for targets detected in the previous time step, while \(\varvec{C_{{1:M}, {N-M+1:N}}}\) holds the cost for newly detected targets. For elements in sub-matrix \(\varvec{C_{{1:M}, {1:N-M}}}\):

$$\begin{aligned} \varvec{C_{i,j}}=-r_{i,j} \end{aligned}$$
(17)

For elements in sub-matrix \(\varvec{C_{{1:M}, {N-M+1:N}}}\):

$$\begin{aligned} \varvec{C_{i,N-M+i}}=-r_{i}^0 \end{aligned}$$
(18)

The Murty solver finds the several best paths in the cost matrix. The confidence is negated in the formulas because the Murty solver searches for the shortest path, i.e., the minimum total cost. In this process, only the relative confidence of different tracklets matters. Confidence therefore has a great impact on the probabilistic data association procedure in the tracker.
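The construction of the cost matrix in Eqs. 17 and 18 and the search for several best assignments can be sketched as follows. Note that the sketch does not implement the Murty algorithm: it substitutes a brute-force enumeration of all assignments, which is only feasible for very small matrices, and serves purely to show what the ranked output looks like. Entries that are gated out are assumed to have infinite cost.

```python
import itertools
import math

def build_cost_matrix(child_conf, det_scores):
    """Build the M x N cost matrix, with N = existing tracklets + M.

    child_conf[i][j]: confidence of the child STH pairing measurement i with
    existing tracklet j, or None if the pair is gated out (assumption).
    det_scores[i]: detection score of measurement i, used for a new tracklet.
    """
    m = len(det_scores)
    n_old = len(child_conf[0]) if m else 0
    cost = [[math.inf] * (n_old + m) for _ in range(m)]
    for i in range(m):
        for j in range(n_old):
            if child_conf[i][j] is not None:
                cost[i][j] = -child_conf[i][j]   # Eq. 17
        cost[i][n_old + i] = -det_scores[i]      # Eq. 18
    return cost

def k_best_assignments(cost, k):
    """Brute-force stand-in for the Murty solver: rank all feasible assignments."""
    if not cost:
        return []
    m, n = len(cost), len(cost[0])
    scored = []
    for cols in itertools.permutations(range(n), m):
        total = sum(cost[i][c] for i, c in enumerate(cols))
        if math.isfinite(total):
            scored.append((total, cols))
    return sorted(scored)[:k]

child_conf = [[0.9, None], [0.3, 0.7]]   # 2 measurements x 2 existing tracklets
det_scores = [0.5, 0.4]
cost = build_cost_matrix(child_conf, det_scores)
best = k_best_assignments(cost, k=3)
best_columns = [cols for _, cols in best]
```

In this example the best assignment matches both measurements to existing tracklets, while the second and third alternatives each treat one measurement as a new target. Each ranked alternative corresponds to one candidate global hypothesis.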

3.3.2 Score Decay

The pseudo-code of score decay process is shown in Algorithm 3.

Algorithm 3 Pseudo-code of the score decay process

The score decay is performed in the form of tracklet smoothing. The confidence of the tracklets is smoothed in a near-online manner by re-scoring STHs in a temporal window. The state of a tracklet is defined as a tuple:

$$\begin{aligned} \varvec{X}=(\alpha ,\beta ,x_{\alpha :\beta }) \end{aligned}$$
(19)

where \(\beta\) denotes the discrete time of the tracklet’s most recent state and \(\alpha\) denotes the timestamp at which the tracklet starts to be estimated. Only the tracklets that are currently active are taken into consideration, so the sequence of states is \([x_\alpha ,x_{\alpha +1},\ldots ,x_{\beta -1},x_\beta ]\). Each state has a corresponding confidence score. The approach is to compute an average score over the tracklet, with a certain decay applied to the older states, and to assign this score to the most recent state.

The main difference between prior confidence-based trackers and the proposed tracker is that one tracklet has more than one STH at a time. This means that the tracklet needs to be managed as a tree structure rather than as a list. A simple example of score decay is illustrated in Fig. 5. Given the decay factor \(\eta =0.75\), the score estimate is \(r=(1/4)\cdot (0.60+0.75\cdot 0.15+0.75^2\cdot 0.25+0.75^3\cdot 0.95)=0.31\).

$$\begin{aligned} \begin{array}{c} r_{j,i}=\frac{1}{N}(r_{j,i}+\eta Pa\left( r_{j,i}\right) +\eta ^2{ Pa}^{\left( 2\right) }\left( r_{j,i}\right) \\ + \ldots +\eta ^{N-1}{ Pa}^{\left( N-1\right) }\left( r_{j,i}\right) ) \end{array} \end{aligned}$$
(20)

In Eq. 20, N is the length of the tracklet, \(Pa(\cdot )\) denotes the parent of the current STH, and \(r_{j,i}\) is the confidence of the leaf STH node. In this manner, the decay determines how strongly previous timestamps in the tracklet contribute. The scores of the state sequence in the tracklet are smoothed and later utilized in data association.
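Eq. 20 amounts to walking from a leaf STH up through its parents and accumulating geometrically decayed confidences. The sketch below reproduces the worked example with \(\eta =0.75\); the `STH` class, the `window` parameter and the second leaf are assumptions made for illustration.

```python
class STH:
    """Minimal single target hypothesis node with a link to its parent."""
    def __init__(self, confidence, parent=None):
        self.confidence = confidence
        self.parent = parent

def decayed_score(leaf, eta=0.75, window=4):
    """Eq. 20: average of the leaf's and its ancestors' confidences,
    where the k-th ancestor is weighted by eta**k.

    `window` caps how many nodes are visited (hypothetical sliding window).
    """
    total, weight, node, n = 0.0, 1.0, leaf, 0
    while node is not None and n < window:
        total += weight * node.confidence
        weight *= eta
        node = node.parent
        n += 1
    return total / n

# Chain from the worked example: 0.95 -> 0.25 -> 0.15 -> leaf 0.60
root = STH(0.95)
mid_a = STH(0.25, parent=root)
mid_b = STH(0.15, parent=mid_a)
leaf = STH(0.60, parent=mid_b)
sibling = STH(0.40, parent=mid_b)   # a second leaf sharing the same ancestors

score_leaf = decayed_score(leaf)        # about 0.31, as in the text
score_sibling = decayed_score(sibling)  # about 0.26
```

Because every leaf stores only a reference to its parent, several leaves of the same tracklet can be re-scored independently while sharing their common history.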

Fig. 5

An example of tracing parent nodes to perform tracklet smoothing of confidence

3.3.3 Score Update

Algorithm 4 Pseudo-code of the score update process

In a typical PMBM filter, the process of computing GHs from various STHs is a one-way route. The resulting most likely global hypothesis is chosen for target estimation. This section aims to establish a two-way influence between STH and GH, i.e., to remap the weights of the GHs to the weight of each STH. The pseudo-code of score update is shown in Algorithm 4.

Each Bernoulli component is controlled by two parameters \(\{r_i,p_i\}\), where \(r_i\) stands for the probability of existence of the target. The score update process is:

$$\begin{aligned} r_{j,i}=\sum _{i\in G_j} w_j\cdot r_{j,i} \end{aligned}$$
(21)

In Eq. 21, \(G_j\) stands for the j-th GH and \(w_j\) is its weight. In the pseudo-code, a remapping attribute \(\delta _i\) is added to each STH. For each Bernoulli component, its existence is queried in the five most likely GHs, whose weights have been normalized. If the STH is included in a GH \(G_j\) with weight \(w_j\), that weight is added to the remapping attribute, so that \(\delta _i\) is the sum of \(w_j\) over the top GHs that contain the STH.

For example, if the STH appears in all top 5 global hypotheses, the remapping attribute reaches its maximum of 1. Next, the Bernoulli confidence \(r_i\) is multiplied by \(\delta _i\) to give the new enhanced confidence. The new confidence will affect the elements of the cost matrix in the next data association process. During this process, most STHs are weakened, but since the data association considers the relative cost of different STHs, the remapping process clearly separates the STHs that are frequently included in different global hypotheses from the rest.
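The remapping described above can be sketched as follows. The data layout matches no particular implementation: STH confidences are held in a dictionary, and each GH is a pair of a weight and a set of STH identifiers. Normalizing the weights over the selected top GHs is one possible reading of "whose weights have been normalized" and is an assumption of this sketch.

```python
def score_update(sth_conf, global_hyps, top_k=5):
    """Remap GH weights onto STH confidences.

    sth_conf: dict mapping STH id -> confidence r_i.
    global_hyps: list of (weight, set of STH ids).
    Returns the updated confidences and the remapping attributes delta_i.
    """
    top = sorted(global_hyps, key=lambda gh: gh[0], reverse=True)[:top_k]
    total = sum(w for w, _ in top)          # normalize over the top GHs (assumption)
    delta = {k: 0.0 for k in sth_conf}
    for w, ids in top:
        for k in ids:
            if k in delta:
                delta[k] += w / total       # accumulate the weight of each GH containing k
    updated = {k: r * delta[k] for k, r in sth_conf.items()}
    return updated, delta

sth_conf = {"a": 0.9, "b": 0.6, "c": 0.5}
global_hyps = [(0.6, {"a", "b"}), (0.3, {"a", "c"}), (0.1, {"a"})]
updated, delta = score_update(sth_conf, global_hyps)
```

STH "a" appears in every GH, so its remapping attribute is 1 and its confidence is unchanged, whereas "b" and "c" are scaled by 0.6 and 0.3, which widens the gap between frequently and rarely selected STHs.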

4 MOT Tests

4.1 Dataset and Evaluation Metrics

The dataset used for the MOT tests is the nuScenes tracking dataset [22], which contains diverse sensor data from a top lidar, radars and cameras. The dataset consists of 700 scenes in the train split, 150 scenes in the val split and 150 scenes in the test split. Each scene lasts for 20 s. The frames include annotated samples (every 0.5 s) and unannotated sweeps. For evaluation, the detectors and the proposed multi-object tracker run from sample to sample, which means the sampling time is 0.5 s. Input data include lidar and radar point clouds and images, accompanied by GPS/IMU localization.

The evaluation of the nuScenes tracking challenge follows the widely used CLEAR MOT metrics [23] (including AMOTA, AMOTP, IDS and FRAG), the Mostly Tracked (MT) / Mostly Lost (ML) metrics and the newly proposed metrics in Ref. [22]. Note that nuScenes evaluates AMOTP based on the Euclidean distance between the center points of the ground truth bounding box and the predicted bounding box; therefore, a smaller AMOTP represents a higher accuracy.

4.2 Implementation Details

4.2.1 Environmental Setup

The tests are implemented in Python 3 on a Dell Inspiron 7591 laptop with an Intel Core i7-9750H CPU, 8 GB of RAM and a GeForce GTX 1650 GPU, running Ubuntu 18.04 LTS. The average latency of the proposed tracker in unaccelerated Python code is less than 100 ms, which meets the requirement of real-time online tracking.

4.2.2 3D Object Detection

A tracking-by-detection approach is adopted. The key information of one 3D bounding box includes:

$$\begin{aligned} {\text {bbox3D}}=[x,y,z,h,w,l,{\text {yaw}},{\text {score}},{\text {type}},v_x,v_y] \end{aligned}$$
(22)

In Eq. 22, \([x,y,z]\) is the position of the center of each bounding box in the global coordinate frame, and \([h,w,l]\) denote its height, width and length. The yaw angle denotes the orientation, while the pitch and roll angles are not considered. The detection method shall provide a detection score and a semantic object type for each detected object. \([v_x,v_y]\) denotes the planar velocity of the center point in the global frame. CenterPoint [3] is used as the 3D detector using lidar, and CenterFusion [24] is used as another detector using cameras and radars.
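As an illustration of the input formatting stage, the fields of Eq. 22 can be held in a small data class, from which the measurement vector of Eq. 10 is extracted. The class name, field names and example values are hypothetical and do not correspond to the nuScenes devkit or to any detector's output format.

```python
from dataclasses import dataclass

@dataclass
class BBox3D:
    """Detection fields in the order of Eq. 22."""
    x: float
    y: float
    z: float
    h: float
    w: float
    l: float
    yaw: float
    score: float
    obj_type: str
    vx: float
    vy: float

    def to_measurement(self):
        """Measurement vector z = [p_x, p_y, theta, v_x, v_y] of Eq. 10."""
        return [self.x, self.y, self.yaw, self.vx, self.vy]

# One illustrative detection, listed in the order of Eq. 22
raw = [1.0, 2.0, 0.5, 1.6, 1.9, 4.5, 0.3, 0.85, "car", 3.0, -0.5]
box = BBox3D(*raw)
z = box.to_measurement()
initial_confidence = box.score   # used as the initial confidence of a new tracklet
```

The dimensions and the object type are kept on the detection itself, consistent with the statement that only position and orientation are tracked while other properties are inherited from the measurements.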

4.2.3 Parameter Setting

An optimal parameter setting is shown in Table 1. The matching metric of affinity applies the Mahalanobis distance. The parameter settings are mostly based on empirical study. The hyper-parameters in the table have little influence on the results given a \(\pm 20\%\) fluctuation.

Table 1 An optimal parameter setting for the proposed method on nuScenes dataset

4.3 Tracking Performance Evaluation

The proposed tracker is evaluated on the nuScenes val and test subsets. The evaluation is divided into modalities using lidar and modalities without lidar. Qualitative results of the proposed tracker using CenterPoint detection are shown in Fig. 6. Six typical scenes from the test subset of nuScenes are selected, each with three consecutive samples, to show the effectiveness of the tracker in handling difficult cases such as occasional detection loss. Each color corresponds to one label and instance.

Fig. 6

Visualization of 3D tracking results of the proposed tracker. Different colors of the bounding boxes indicate unique labels of tracked objects

4.3.1 Baseline Evaluation

Since the method uses only motion features of 3D bounding boxes and not appearance features from a feature extraction backbone network, standard implementations of AB3DMOT [5] and Mahalanobis 3D [6] are used as baselines. Table 2 compares the proposed tracker with the baseline methods. For the sake of fairness, all trackers share the same object detection results from CenterPoint [3] implemented in mmdetection3D [25]. The baseline methods are implemented with the open-source code provided by Mahalanobis 3D [6].

Table 2 Quantitative comparison of 3D MOT evaluation results with baseline methods on validation and test splits of nuScenes dataset

Table 2 shows the improvement of the proposed tracker over other motion-only methods. A significant improvement of 4.5% in AMOTA score over the lidar-based Mahalanobis 3D tracker is achieved. Compared to the standard PMBM filter, the proposed tracker improves AMOTP by 0.1 m on average.

The standard PMBM filter suffers from frequent ID switches and fragmentation because PMBM does not provide explicit continuity between time steps. The proposed method considers the STHs in a sliding window and remaps global hypotheses to STHs, so it strengthens the notion of a trajectory. It reduces IDS by 28% and FRAG by 47%, and achieves the lowest FRAG among the mentioned motion-only filtering methods. The results indicate that the proposed tracker successfully tackles the problem of track fragmentation.

Similarly, on the nuScenes test subset, the proposed tracker reduces IDS by 27.4% and FRAG by 46.3% compared to the standard PMBM filter, which indicates that the proposed tracker achieves enhanced track continuity. The improvement also corresponds to more mostly tracked trajectories and more true positives.

4.3.2 Comparison Regarding Radar-Camera Modalities

How well a multi-object tracker performs when given an inaccurate detector indicates the generality of the tracker. CenterFusion, a typical detector using cameras and radars, reaches only about half the accuracy of lidar detectors in terms of nuScenes Detection Score (NDS). The proposed tracker is compared with other camera-based and radar-based trackers in Table 3; only modalities including cameras and/or radars are considered. The proposed tracker takes CenterFusion detections generated with the open-source code as input. Results of the other trackers are taken from the Eval.AI leaderboard. In terms of modalities, ‘C’ denotes camera, ‘R’ denotes radar, and ‘R-C’ denotes radar-camera.

Table 3 Quantitative comparison of 3D MOT evaluation results on nuScenes test dataset

In Table 3, the proposed method is compared with CenterTrack [4], the best camera-based tracker QD-3DT [26], and CFTrack [27]. It achieves a considerable AMOTA improvement of 3.9% over the camera-based tracker QD-3DT and of 5.6% over CFTrack, which is also based on CenterFusion detection results. Moreover, the method has the lowest track fragmentation among the compared methods, with a FRAG of 2116. These results show that the proposed tracker has satisfactory generality for domain adaptation. In summary, the proposed tracker achieves superior tracking performance with low-cost sensors such as cameras and radars.

4.3.3 Ablation Study

The ablation study is conducted on the nuScenes val set. It analyzes the three modules described in Sect. 2 to better understand their contributions to the proposed tracker: the Poisson position tracking module, the MBM label tracking module, and the confidence estimation module. The results are shown in Table 4, where “P” denotes the Poisson position tracking module, “M” the MBM label tracking module, “C” the confidence estimation module, and “N” the length of a tracklet.

Table 4 An ablation study of the three important modules in the proposed method

The MBM module is the core module of the proposed method. The Poisson module yields a consistent improvement in precision, while the confidence module greatly improves track continuity over the baseline. Moreover, longer tracklets lead to less fragmentation, while shorter tracklets tend to yield higher accuracy.

4.4 Discussion

One of the most important advantages of the proposed tracker is its track continuity, which is reflected in track fragmentation being reduced to about half of that of the standard PMBM filter. Less track fragmentation and improved continuity in turn lead to better AMOTP.

The parameter setting of the proposed method is briefly discussed here in comparison with data-driven methods. The proposed approach has two advantages over data-driven approaches: generality across different detectors, and physically defined hyper-parameters that can be tuned according to some guidelines. Data-driven methods have an advantage over inference-based methods in their number of learnable parameters, but certain hyper-parameters still have to be tuned manually, which makes domain adaptation of the model challenging when not enough training data are available. In contrast, the proposed method needs no training data because its hyper-parameters have well-defined physical meanings, and it achieves satisfactory performance with different detector inputs under a similar parameter setting.

5 Conclusion

RFS-based tracking algorithms are attracting growing attention because they automatically manage the birth and death of trajectories and handle the data association problem in an online multi-hypothesis tracking manner. This paper proposes PTMOT, a probabilistic tracklet multi-object tracker, which introduces confidence-based methods into the PMBM filter. The confidence estimation of STHs and GHs is handled by the “score update” and “score decay” functions. The experiments demonstrate that the proposed tracker significantly reduces track fragmentation and ensures the continuity of trajectories over sliding windows. It also achieves superior tracking performance with the radar-camera modality. In the future, the authors will extend the research on RFS-based trackers to multi-modal MOT algorithms.