Introduction

Multi-Object Tracking (MOT) is a challenging and popular research topic in computer vision and forms the basis of many other vision techniques, such as video analysis [30], behavior recognition [11], and autonomous driving [15]. The main task of an MOT algorithm is to assign candidate detections to an existing trajectory or to generate a new trajectory according to established rules [9]. MOT is extremely difficult in dense crowds due to partial occlusion, similar-looking objects, camera shake, and varying object speeds [21].

There are currently two main MOT frameworks: tracking-by-detection and joint-detection-and-tracking [20]. The tracking-by-detection framework divides MOT into two phases, object detection and data association, which are performed sequentially by two separate models [11]. First, the detector model produces detections in each frame [12]; then, the association model extracts re-ID features from the detections and assigns re-ID information to each detection. The detector model and the association model therefore duplicate the task of extracting object information, resulting in low tracking efficiency in dense scenes. With the development of deep learning [38, 40, 44] and multi-task learning [3], many works [46, 47] have proposed joint-detection-and-tracking to improve the performance and efficiency of tracking algorithms by reducing the number of times object information is extracted.

Building on object detection algorithms, the current predominant strategy for implementing joint-detection-and-tracking is joint detection and re-ID [36, 46]. It obtains more accurate local re-ID features by sharing a backbone network between the detection and re-ID stages, and then uses a simple similarity metric to fuse re-ID features and perform data association. This strategy integrates the re-ID feature extraction of the association model into the detection model, which simplifies the time-consuming association model. However, partial occlusion of objects and deviation of detections cause re-ID features to fail in high-density pedestrian scenes. Therefore, joint detection and re-ID, which uses low-robustness similarity metrics to fuse low-confidence re-ID features for data association, is prone to identity transfer, as illustrated in Fig. 1.

Fig. 1

Identity transfer. Illustration of tracklets generated by FairMOT on two clips. Because of occlusion, the center of the detection changes, resulting in identity transfer. a The ID changes between frames 702 and 710. b The ID changes between frames 1048 and 1049

To address the low robustness of data association caused by the failure of a single feature, many works [10, 31] have proposed improving the robustness of the association model by fusing different types of features (e.g., appearance features, motion features, and topological structure features) using a minimum-cost probabilistic graph model. Since a probabilistic graphical model can represent different categories of information in the manifold space, the effective fusion of different information can be achieved while optimizing the energy function, compensating for the loss of any single piece of information. Existing association models based on the minimum-cost-flow probability graph treat the detections of several successive frames as nodes and stepwise match candidate detections of the current frame against historical detections, resulting in a Markov-chain trajectory along the time dimension [6, 7, 14]. The classical minimum-cost-flow graph model focuses on the relationships between pairs of nodes, and is prone to identity transfer and cumulative-error problems in dense scenarios. To resolve identity transfer while preserving the tracking efficiency of joint detection and re-ID, this paper discards the classical probabilistic graph model's idea of coupling candidate detections with historical detections, instead matches candidate detections against current trajectories, and transforms ID assignment into an energy minimization problem on a CRF.

Probabilistic graphical models usually require iterative methods to find the optimal solution, making them difficult to generalize to online MOT. In this paper, the original CRF model is approximated by a combination of multiple CRF models with closed-form solutions. These closed-form CRF models avoid iterative solvers: the optimal solution can be obtained directly by matrix multiplication. In addition, the energy and resource consumption of computers is a major concern for mobile and embedded devices. To deploy the algorithm on mobile devices, this research designs an edge-cloud model that combines a lightweight network on an edge device (the NVIDIA Jetson TX2) with a heavyweight network on a server. Quasi-real-time tracking is possible on the edge device while maintaining performance.

The main contributions of this paper are as follows:

  • Within the joint detection and re-ID strategy, a CRF association model relating candidate detections and trajectories is proposed. Unlike classical graph models, it avoids the cumulative deviation of long Markov chains in the association process and transforms the task of ID assignment into an energy minimization problem on a CRF.

  • The original CRF model is approximated by a combination of models with closed-form solutions, avoiding iterative algorithms for finding the optimal solution, thereby improving efficiency and reducing computer energy and resource consumption.

  • An edge-cloud model is designed to combine edge devices and servers for quasi-real-time tracking.

Related work

Joint-detection-and-tracking is realized through two main strategies: joint detection and motion prediction [3, 47], and joint detection and re-ID [36, 46].

Joint detection and motion prediction uses a detector to regress the coordinates of detections. Chained-Tracker [24] proposes an end-to-end model that takes two adjacent frames as network input and regresses boxes. Tracktor [3] utilizes the Faster R-CNN detection head to directly regress boxes or terminate trajectories. However, these box-regression methods assume a high overlap rate between adjacent frames and therefore perform poorly on low-frame-rate video. In contrast, CenterTrack [47] uses information from two adjacent frames to predict the offset of detection centers and then uses center distance to complete the association. Meanwhile, it generates a trajectory heat map that helps match detections anywhere, even with low overlap between object detections. These methods only associate objects in adjacent frames without reinitializing lost tracks, which complicates the handling of occlusion.

Joint detection and re-ID first uses a single network to produce detections and re-ID features in one step and then uses a simple model to complete data association. Track-RCNN [33] adds a re-ID feature extraction head to Mask R-CNN and obtains detections and re-ID features for each object. JDE [36] constructs a re-ID feature extraction head on top of YOLOv3 and completes the association based on appearance similarity, enabling near-video-rate inference. FairMOT [46] argues that anchor-based detectors do not extract re-ID features well, and suggests extracting the features with anchor-free networks such as CenterNet. However, these methods focus on efficiently obtaining high-confidence re-ID features, and they easily cause identity transfer in dense scenes. To solve this problem, this paper uses a probabilistic-graph-based association model to complete data association within the joint detection and re-ID strategy, fusing multiple types of features during energy function optimization to improve tracking performance.

Common association models fall into two main categories: end-to-end models based on deep learning and probabilistic graph models. An end-to-end model obtains the similarity matrix between detections in a data-driven manner. For example, DAN [30] constructs a Deep Affinity Network to learn the similarity matrix, and the paper [6] uses a graph neural network (GNN) to learn it. Since the inputs and outputs of an end-to-end model are fixed, it does not retain all possible candidate detections, unlike a probabilistic graph model. The classical probabilistic graphical model is mainly based on minimum cost flow. For example, the paper [45] treats detections as nodes and connects nodes of adjacent frames in sequence to form the edges of the graph, with positive weights on edges whose nodes belong to the same track and negative weights otherwise. However, this method only considers the relationship between adjacent frames and cannot handle occlusion. To address occlusion, LMP [32] introduces two kinds of edges, regular and lifted, according to the length of the time interval, and constrains the solution on regular edges using the lifted edges. The paper [17] uses more information (e.g., appearance features, motion features, and topological structure features) as constraints to reduce tracking error. Although more complex and reliable, these graph models focus on node-to-node relationships, ignore the relationship between nodes and trajectories, and are therefore prone to cumulative error. Some papers propose CRF models [16, 41]. The paper [16] treats the relationship of two trajectories as a node and assigns each node a label indicating whether the two trajectories belong to the same trajectory, taking only the values 0 and 1. It thus focuses on node-to-node relationships and does not consider the relationship between nodes and the overall model. This cannot reflect the relationship between nodes and all trajectories well, and targets with similar appearance and similar location are prone to identity transfer. To solve this, this paper adopts a computation scheme over candidate detections and current trajectories and obtains the relationship between them.

To improve tracking speed, many papers [2, 45] iteratively solve for the global optimal solution by adding complex boundary conditions, which cannot be applied to a wide range of tracking scenarios. As the number of objects grows, the number of iterations increases, which affects tracking efficiency. In this paper, to ensure both tracking performance and efficiency, instead of adding complex boundary conditions, we use a combination of multiple CRF models to obtain a local optimal solution. Due to the complexity of MOT, few papers propose deploying MOT on mobile devices. Inspired by papers [13, 27, 35] that use different devices to balance speed and performance, we distill the CNN network and design an edge-cloud model that uses different networks on different devices during the detection process to achieve quasi-real-time tracking.

Method

In this section, this paper introduces the construction of the proposed algorithm; the overall flow is illustrated in Fig. 2. To avoid accumulated errors of the minimum cost flow model, this paper abstracts trajectories into historical trajectory nodes and constructs a CRF model of the relationship between nodes and trajectories. Because the solution space of the CRF model is complex, this paper constructs a combination of multiple CRF models with closed-form solutions, which reduces the original complexity and improves tracking efficiency. First, some definitions are given as follows:

Fig. 2

The general flowchart of the proposed algorithm. FairMOT is the baseline of the proposed algorithm, which provides re-ID features \(E_t\) and detections \(P_t\). First, the coordinate information \(P_t\) of the candidate detections and the detections \(P_{tr}\) and ID information \(ID_{tr}\) of the historical trajectories are used to encode each node. Then, the model determines the relationship matrix F between the candidate detections and the historical trajectories using the indicator encodings (candidate nodes' \(Z_t\), trajectory nodes' \(Z_{tr}\)) and re-ID information (candidate nodes' \(E_{t}\), trajectory nodes' \(E_{tr}\))

Notations:

  • \(x_i\) represents the node function of node i.

  • \(y_{ij}\) represents the edge function of edge ij.

  • \(V_{tr}\) represents the historical trajectory nodes.

  • \(v_{tr_i}\) represents the \(i\text{- }th\) historical trajectory node.

  • \(V_O\) represents the candidate detection nodes.

  • \(\mathbf {F_i}\) represents the relationship between node i and all trajectories in the CRF model.

  • \(\mathbf {Z_i}\) represents the \(i\text{- }th\) node’s indicator vector in the CRF model.

  • \(e_i\) represents the re-ID features of the node i.

  • \(e_{tr_i}\) represents the re-ID features of the \(i\text{- }th\) trajectory.

  • \(w_{ij}\) represents the appearance similarity between node i and node j.

  • KM represents the Kalman Filter.

  • \(p_{t-1_i}\) represents the detection of \(i\text{- }th\) object at frame \(t\text{- }1\).

  • \(U_{ik}\) represents the overlap rate of the \(i\text{- }th\) candidate detection node’s detection and the \(k\text{- }th\) historical trajectory node’s prediction detection.

  • \(H_{t-1}\) represents the detection nodes in frame \(t\text{- }1\).

  • \(f_{k_i}\) represents the relationship between the \(i\text{- }th\) node and the \(k\text{- }th\) trajectory in the \(k\text{- }th\) trajectory’s subgraph model.

  • \(z_{k_i}\) represents the \(i\text{- }th\) node’s indicator scalar in the \(k\text{- }th\) trajectory’s subgraph model.

  • \(f_{k}\) represents the relationship between all nodes and the \(k\text{- }th\) trajectory.

  • \(F^*\) represents the approximate solution of the CRF model.

The classical minimum cost flow probability graph

The number of objects changes dynamically over time, while the input and output of a convolutional neural network are fixed. Therefore, compared to end-to-end convolutional neural networks, a model based on the classical probability graph is more suitable for dense scenes. The minimum cost flow model is a classical association model among probability graphs. It treats each detection as a node; the node function \(x_i\) is 1 if the node is not a false detection and 0 otherwise. The nodes of two adjacent frames are connected along the time dimension to form the edges of the graph model, and the edge function \(y_{ij}\) takes the value 1 if nodes i and j belong to the same track and 0 otherwise, as illustrated in Fig. 3a, b. Then, the following energy function is an instance of the minimum cost flow:

$$\begin{aligned} \begin{aligned}&\min _{x \in \{0,1\}^V,y \in \{0,1\}^E} \sum _{i\in V} x_{i} + \sum _{(ij)\in E} \frac{1}{2}w_{ij}y_{ij},\\&w_{ij}=1-cosine(e_i,e_j), \end{aligned} \end{aligned}$$
(1)

where \(e_i\) represents the node’s appearance feature. The relationships between nodes can be obtained by solving the above model. According to the edge function, nodes are linked to form Markov-chain trajectories along the time dimension.
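As a concrete illustration of the pairwise term in formula (1), the appearance weight \(w_{ij}=1-cosine(e_i,e_j)\) can be computed for all node pairs at once. A minimal numpy sketch (the function name `edge_weights` is ours, not from the paper):

```python
import numpy as np

def edge_weights(features):
    """Pairwise appearance weights w_ij = 1 - cosine(e_i, e_j), as in formula (1).

    features: (n, d) array with one re-ID feature vector per detection node.
    A small w_ij means the two nodes look alike, so linking them into the
    same Markov-chain trajectory is cheap.
    """
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    return 1.0 - norm @ norm.T  # pairwise cosine similarities, flipped to costs

# Two near-identical detections and one dissimilar one.
w = edge_weights(np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]))
```

Here the similar pair (nodes 0 and 1) receives a much cheaper edge than the dissimilar pair (nodes 0 and 2).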

The above model only considers node-to-node relationships during association and ignores the relationship between nodes and trajectories, which easily causes cumulative errors. In a dense scenario, nodes with similar appearance but very different positions are prone to identity transfer, as illustrated in Fig. 3a, and the error propagates backward along the Markov chain, producing cumulative error. As illustrated in Fig. 3b, the green trajectory contains a wrong node, and this error causes the green node not to be tracked in the green trajectory, resulting in a cumulative error. To solve these problems, the CRF model incorporates trajectories into the model and constructs the relationship between trajectories and candidate detections. The CRF model abstracts trajectories into historical trajectory nodes, such as the colored squares illustrated in Fig. 3c, d. The trajectory nodes expand the temporal field of the CRF model to avoid identity transfer, as illustrated in Fig. 3c. At the same time, the error tolerance of historical trajectory nodes is higher than that of normal nodes, as illustrated in Fig. 3d: although there are errors in the trajectory, the trajectory contains more information, so the errors do not propagate backward. The CRF model abstracts trajectories into historical trajectory nodes \(V_{tr}=\{v_{tr1},v_{tr2},...,v_{trm}\}\). The historical trajectory nodes \(V_{tr}\) and the candidate detection nodes \(V_O\) form a complete graph \(G=(V=(V_{tr},V_O),E\in V \times V)\). The function variables are no longer simple 0–1 encodings, but are related to the trajectories. The node function is defined as \(V=\{i\in V \mid x_{i}=\left\| \mathbf {F_i}-\mathbf {Z_i}\right\| ,\mathbf {F_i} \in R^{1 \times m},\mathbf {Z_i} \in R^{1 \times m}\}\). \(\mathbf {F_i}\) denotes the relationship between node i and all trajectories, and \(\mathbf {Z_i}\) is the indicator vector that guides the direction of solving the energy function. The specific construction of the indicator vector is explained in the section “The energy function for the CRF model”. Therefore, compared to minimum cost flow, the CRF model uses all candidate detection nodes in the energy function, which increases the spatial field of the model. The edge function is defined as \(E=\{ij\in E \mid y_{ij}= \left\| \mathbf {F_i}-\mathbf {F_j}\right\| , \mathbf {F_i} \in R^{1 \times m},\mathbf {F_j} \in R^{1 \times m}\}\), and represents the relationship between different nodes. In the edge function, the CRF model utilizes not only the relationship between candidate detection nodes and trajectory nodes, but also the relationships among trajectory nodes and among candidate detection nodes. Therefore, compared to minimum cost flow, the CRF model is not concerned with whether two nodes belong to the same track; rather, more emphasis is placed on the relationship of each node to the other nodes as a whole. It can effectively use the topological structure between nodes and avoid identity transfer when nodes are similar in appearance but differ greatly in location.

Fig. 3

The CRF model and the minimum cost flow model. Figures c, d represent the CRF model; figures a, b represent the minimum cost flow graph model. Nodes in the same track have the same color, and nodes connected by straight lines form a trajectory. The minimum cost flow model only utilizes node-to-node relationships; when objects are similar in appearance, it easily produces errors, as illustrated in figure (a), where a black node appears on the green trajectory. The black node is then only matched against black nodes, without considering the green nodes, producing cumulative errors. The CRF model is concerned with the relationship between nodes and trajectories. The trajectories are abstracted into nodes, such as the squares illustrated in figures (c), (d), which expand the field of the model and reduce identity transfer. At the same time, the trajectory nodes are more tolerant, as illustrated in figure (d): although there is a black node in the green trajectory, the green node can still be tracked

The energy function for the CRF model

In this paper, the association problem of MOT is transformed into the problem of solving a CRF model. The energy function of the CRF model, patterned after energy function (1), is defined as follows:

$$\begin{aligned} \begin{aligned}&\min \sum _{i\in V} \left\| \mathbf {F_i}-\mathbf {Z_i}\right\| + \sum _{ij\in E} \frac{1}{2}w_{ij}\left\| \mathbf {F_i}-\mathbf {F_j}\right\| ,\\&w_{ij}=cosine(e_i,e_j). \end{aligned} \end{aligned}$$
(2)

As illustrated in Fig. 2, this paper uses FairMOT’s re-ID features to define the re-ID features of the candidate detection nodes. Inspired by the paper [46], this paper fuses the re-ID features of the \(i\text{- }th\) historical trajectory \(e_{tr_i}\) according to the following linear formulas:

$$\begin{aligned} \begin{aligned}&e_{t-1_i} =\frac{e_{t-1_i}}{\left\| e_{t-1_i}\right\| },\\&e =(1-\alpha )e_{t-1_i} + \alpha e,\\&e_{tr_i}= e,\\&\alpha =0.9,\\&e_{tr_i} =\frac{e_{tr_i}}{\left\| e_{tr_i}\right\| }; \end{aligned} \end{aligned}$$
(3)

\(e_{t-1_i}\) represents the re-ID features of the \(i\text{- }th\) object at frame \(t\text{- }1\), and e the re-ID features of the matched candidate at frame t. This approach quickly fuses the historical re-ID features of the same object, increasing tracking speed.
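The fusion in formula (3) amounts to a normalized linear blend of the historical and current features. A small sketch under our own naming (`fuse_track_feature` is illustrative, not from the paper):

```python
import numpy as np

def fuse_track_feature(e_prev, e_new, alpha=0.9):
    """Linear re-ID fusion of formula (3): normalize the historical feature,
    blend it with the candidate feature, and renormalize the result."""
    e_prev = e_prev / np.linalg.norm(e_prev)      # normalize history
    e = (1.0 - alpha) * e_prev + alpha * e_new    # linear fusion, alpha = 0.9
    return e / np.linalg.norm(e)                  # unit-norm trajectory feature

e_tr = fuse_track_feature(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

With \(\alpha=0.9\), the fused trajectory feature leans heavily toward the newest observation while retaining a fraction of the history.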

\(\mathbf {Z_i}\) represents the indicator vector of node i, which guides the process of solving the energy function. Different nodes have different indicator vectors, following the vector encoding process in Fig. 2. To highlight the uniqueness of historical trajectory nodes, this paper uses one-hot coding for their indicator vectors, defined by the historical trajectory identification (ID) number \(ID_{i}\). The specific formula is as follows:

$$\begin{aligned} Z_i[k] = \left\{ \begin{array}{lr} 1,\,\,\, &{} ID_{i}=k\\ 0,\,\,\, &{} else.\\ \end{array} \right. \end{aligned}$$
(4)

To avoid identity transfer arising from similar appearance but very different locations, this paper uses location topology information to define the indicator vectors of candidate detection nodes. Inspired by the papers [18, 46], this paper first uses the Kalman filter (KM) to obtain the prediction detections \(p_{tr_k}\) from the historical trajectory nodes, as follows:

$$\begin{aligned} \begin{aligned}&p =KM(p_{t-1_k},p),\\&p_{tr_k}= p,\\ \end{aligned} \end{aligned}$$
(5)

\(p_{t-1_k}\) represents the detection of the \(k\text{- }th\) object at frame \(t\text{- }1\). \(U_{ik}\) represents the overlap rate between the \(i\text{- }th\) candidate detection node’s detection and the \(k\text{- }th\) historical trajectory node’s prediction detection, and is used to define the indicator vector of the \(i\text{- }th\) candidate detection, as follows:

$$\begin{aligned} Z_i[k]=U_{ik}. \end{aligned}$$
(6)
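Putting formulas (4) and (6) together, the full indicator matrix stacks one-hot rows for trajectory and detection nodes with IoU rows for candidate nodes. A sketch, assuming trajectory IDs are re-indexed to 0..m-1 and boxes are given as (x1, y1, x2, y2); both function names are ours:

```python
import numpy as np

def iou(box_a, box_b):
    """Overlap rate (IoU) of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def indicator_vectors(track_ids, cand_boxes, pred_boxes):
    """One-hot rows, formula (4), for nodes carrying IDs `track_ids`;
    IoU rows U_ik, formula (6), for candidate detection nodes."""
    z_tr = np.eye(len(track_ids))[track_ids]                         # one-hot coding
    z_cand = np.array([[iou(c, p) for p in pred_boxes] for c in cand_boxes])
    return np.vstack([z_tr, z_cand])

# Two trajectories; two candidates, each overlapping exactly one prediction.
Z = indicator_vectors([0, 1],
                      cand_boxes=[(0, 0, 2, 2), (10, 10, 12, 12)],
                      pred_boxes=[(0, 0, 2, 2), (10, 10, 12, 12)])
```

Each row of `Z` is one node's indicator vector; the candidate rows peak on the trajectory whose Kalman prediction they overlap most.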

By solving formula (2), this paper obtains the solution of the CRF model, and thus the relationship between nodes and all trajectories.

Combination of graph models based on conditional random fields with closed-form solutions

Fig. 4

The combination of multiple CRF models. Based on the historical trajectory nodes, the CRF model is split into a combination of multiple CRF models, and all subgraphs are constructed. A subgraph represents the relationship between the nodes and one trajectory. A square represents a trajectory node, a circle with a number represents a detection node, and a circle without a number represents a candidate detection node

The complexity of iteratively solving formula (2) is \(O(n^2)\). In dense scenarios, this solving approach is not conducive to online tracking. Therefore, this paper looks for approximate solutions. Inspired by the solution process of the paper [42], the formula is written in matrix form so that a closed-form solution can be found directly, reducing the complexity to O(1) because its variables are scalars. Therefore, this paper decomposes the node–trajectory similarity \(\mathbf {F_i}\): based on the number of historical trajectories, the CRF model is split into a combination of multiple CRF models. As illustrated in Fig. 4, each subgraph represents the relationship between all nodes and one trajectory. Each subgraph model contains only one historical trajectory node; the other trajectory nodes are transformed into ordinary detection nodes containing re-ID features, the position in one frame, and the trajectory ID. Because all subgraphs are constructed the same way, this paper only needs the closed-form solution of one subgraph; iterating over the other subgraphs yields the approximate solution of the CRF model. For the \(k\text{- }th\) subgraph model \(G_k = (V_k=\{v_{tr_k},H_{t-1}-v_k,V_O\},E_k \in V_k \times V_k\})\), the node function is \(V_k=\{ i\in V \mid x_{i}= (f_{k_i}-z_{k_i})^2\}\) and the edge function is \(E_k=\{ ij\in E \mid y_{ij}= (f_{k_i}-f_{k_j})^2 \}\), where \(H_{t-1}\) represents the detection nodes with IDs at frame \(t\text{- }1\), \(f_{k_i}\) represents the similarity between node i and trajectory k, and \(z_{k_i}\) represents the indicator scalar of node i. Consistent with the CRF model, the detection nodes and the historical trajectory node of the \(k\text{- }th\) subgraph model define the indicator scalar according to the nodes’ \(ID_i\):

$$\begin{aligned} z_{k_i} = \left\{ \begin{array}{lr} 1,\,\,\, &{} ID_i=k,\\ 0,\,\,\, &{} else.\\ \end{array} \right. \end{aligned}$$
(7)

Candidate detection node i uses the overlap rate between its detection and the prediction detection \(p_{tr_k}\) of the \(k\text{- }th\) trajectory node to define its indicator scalar, as follows:

$$\begin{aligned} z_{k_i} =U_{ik}. \end{aligned}$$
(8)

The energy function for the \(k\text{- }th\) subgraph model is defined as follows:

$$\begin{aligned} \begin{aligned}&\min _{\mathbf {f_k}}\sum _{i\in V_k}(f_{k_i}-z_{k_i})^2+\sum _{(i,j)\in E_k}\frac{1}{2}w_{ij}(f_{k_i}-f_{k_j})^2,\\&\quad w_{ij}=cosine(e_i,e_j),\\&\quad k=1,2...m.\\ \end{aligned} \end{aligned}$$
(9)

Inspired by the paper [42], writing formula (9) in matrix form yields the following result:

$$\begin{aligned} \begin{aligned} \min _{\mathbf {f_k}} E(\mathbf {f_k})&=\min _{\mathbf {f_k}}(\mathbf {f_k}-\mathbf {z_k)^T}(\mathbf {f_k}-\mathbf {z_k})\\&\quad +\mathbf {f^T_{k}Df_k}-\mathbf {f^T_{k}Wf_k}, \end{aligned} \end{aligned}$$
(10)

where \(\textbf{D}\) and \(\textbf{W}\) are defined as follows:

$$\begin{aligned} d_{ii}&= \sum _{j=1}^{s}w_{ij}, \end{aligned}$$
(11)
$$\begin{aligned}&\begin{aligned} \textbf{D}&=diag\{d_{11},...,d_{ss}\} = \begin{pmatrix} d_{11} &{}\quad ... &{}\quad 0 \\ ... &{}\quad ... &{}\quad ... \\ 0 &{}\quad ... &{}\quad d_{ss} \end{pmatrix} ,\textbf{D}\in R^{s \times s}, \end{aligned} \end{aligned}$$
(12)
$$\begin{aligned} \textbf{W}&= \begin{pmatrix} w_{11} &{}\quad ... &{}\quad w_{1s} \\ ... &{}\quad ... &{}\quad ... \\ w_{s1} &{}\quad ... &{}\quad w_{ss} \end{pmatrix} ,\textbf{W}\in R^{s \times s}, \end{aligned}$$
(13)

where s represents the number of nodes. Differentiating formula (10) and setting the derivative to zero yields the closed-form solution of the \(k\text{- }th\) subgraph model as follows:

$$\begin{aligned} \mathbf {f_k} = \mathbf {(I+D+W)^{-1}z_k},k=1,2,...,m, \end{aligned}$$
(14)

where \(\textbf{I}\) represents the identity matrix. Therefore, by solving formula (14), the solution to the \(k\text{- }th\) subgraph can be obtained with complexity O(1), and by iteration, the complexity of solving multiple subgraphs is O(n). Thus, the combination of CRF models indeed reduces the complexity of the original CRF model.

Analysis of formula (14) reveals that, by stacking all indicator scalars into a matrix, the solution of the combination of CRF models can be obtained with a single matrix multiplication, as follows:

$$\begin{aligned} \textbf{F}^{*}&=(I+D+W)^{-1}L, \end{aligned}$$
(15)
$$\begin{aligned} \textbf{Z}&= \begin{pmatrix} z_{11} &{}\quad ... &{}\quad z_{1s} \\ ... &{}\quad ... &{}\quad ... \\ z_{m1} &{}\quad ... &{}\quad z_{ms} \end{pmatrix}, \nonumber \\ \textbf{L}&= Z^{T}. \end{aligned}$$
(16)

Therefore, regardless of the number of trajectories, the solution of the combination of CRF models can be obtained at once using formula (15), which effectively improves tracking efficiency.
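In code, formulas (14)–(16) reduce to building \(\textbf{W}\), \(\textbf{D}\), and \(\textbf{Z}\) and performing one linear solve. A minimal numpy sketch that follows the paper's formulas verbatim (the function name is ours; `np.linalg.solve` replaces the explicit inverse for numerical stability):

```python
import numpy as np

def solve_crf(features, Z):
    """F* = (I + D + W)^{-1} Z^T, formula (15), followed verbatim.

    features: (s, d) re-ID features of all s nodes.
    Z:        (m, s) stacked indicator vectors, one row per trajectory.
    Returns an (s, m) matrix: F*[i, k] relates node i to trajectory k.
    """
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = norm @ norm.T                       # w_ij = cosine(e_i, e_j), formula (2)
    D = np.diag(W.sum(axis=1))              # degree matrix, formulas (11)-(12)
    I = np.eye(W.shape[0])
    return np.linalg.solve(I + D + W, Z.T)  # all subgraphs in one solve

# Toy instance: 3 nodes, 2 trajectories.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
Z = np.array([[1.0, 0.3, 0.0],
              [0.0, 0.1, 1.0]])
F = solve_crf(feats, Z)
```

Each column of `F` equals the per-subgraph solution \(\mathbf {f_k}\) of formula (14), so the batched solve and the subgraph-by-subgraph iteration agree.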

Fig. 5

The loss chart of MOT20. train_hm_loss is the detection loss, train_id_loss the re-ID feature loss, and train_loss the total loss. At epoch 20, the train_id_loss curve gradually flattens. The train_hm_loss still has a decreasing trend at epoch 20, but flattens out without a decreasing trend by epoch 26. The train_loss is smooth and almost converges to 0 at epoch 26, so training is stopped there

The limitations of the method

As can be seen from formula (14), the solution of this model relies on the indicator scalars: a set of indicator scalars yields the relationship between one trajectory and the nodes. It is clear from the definition in formula (8) that the indicator scalar of each candidate detection node relies on simple positional IoU information. Therefore, it is difficult to obtain distinguishable indicator scalars when the coordinates of the nodes and the prediction detections of multiple trajectories are close to each other, which may make it hard for the model to obtain a distinguishable solution. Therefore, this paper plans to use deep learning networks to learn deeper positional information and thus learn distinguishable indicator scalars for each node.

Table 1 Ablation experiment with Fairmot as the baseline to compare the effect of long trajectory information with short trajectory information
Table 2 Comparison of ours and Fairmot for ablation experiments on a private detector

Experiments

We conduct experiments on MOTChallenge and DanceTrack [28] to demonstrate the algorithm’s effectiveness. First, we provide a brief introduction to the relevant datasets, metrics, and devices, and then analyze our algorithm’s advantages.

Fig. 6

Under a moving camera. A small camera shake causes FairMOT’s tracking to fail: object 699 with the green box becomes object 711 with the purple box in the next frame. However, our algorithm tracks the object well

Fig. 7

Under a static camera. The object is heavily occluded, causing FairMOT’s tracking to fail: object 12 with the green box becomes object 15 with the purple box in the next frame. However, our algorithm tracks the object well

Datasets and evaluation criteria

Datasets: We conduct experiments on MOTChallenge and DanceTrack [28]. MOT15, MOT16, MOT17, and MOT20 belong to MOTChallenge. MOT15 is a mixed dataset of pedestrians and vehicles; MOT16, MOT17, and MOT20 are pedestrian-only datasets. MOT16 and MOT17 contain the same videos: 7 training videos and 7 test videos. However, MOT16 provides only one detector, while MOT17 provides three detectors: DPM, Faster R-CNN, and SDP. The scenes of each video vary greatly in background, lighting, camera state, and crowd density, which makes these datasets very challenging. MOT20, on the other hand, is a high-density crowd dataset with a higher crowd density than the others; its videos are also longer, which makes it more difficult. In DanceTrack, the targets’ movement patterns are complex and diverse, showing obvious non-linearity and often accompanied by a variety of body movements.

Table 3 The performance is compared between Yolov5s and DLA34 on a private detector
Fig. 8

The edge-cloud model: to use computing resources efficiently, we use different networks on different devices in the detection process. Three frames make up a cycle, in which the first and third frames are detected on the Jetson device and the second on the RTX 2080 Ti GPUs

Fig. 9

The cyclic process of the edge-cloud model. Our model uses a three-frame cycle, in which the first and third frames complete the detection process on the Jetson and the second frame is processed on the server. To ensure efficiency and performance, we do not use a two-frame cycle with one frame on the Jetson and the other on the server. Orange represents the time when the device acquires frames, and yellow the time when frames are transferred from the Jetson to the server. Green represents detection time and red tracking time

Evaluation criteria: Because only the training set labels are provided, all tracking results must be submitted to the evaluation platform for scoring. Judging tracking performance is a difficult task in which many cases must be considered, so several metrics are provided. The most important is MOTA, which represents the accuracy of the whole tracking process and is defined in terms of FP, FN, and IDS. According to [4], FP and FN are mainly influenced by the detector. IDS counts the number of identity changes during tracking and reflects the continuity of the tracks. To jointly express detection and tracking performance, [4] combines the three metrics to define \(MOTA=1-\frac{\sum _t(FP_t+FN_t+IDS_t)}{\sum _t{GT_t}}\). The criterion for tracking efficiency is FPS, i.e., the number of frames tracked per second; we measure it on a 1080 Ti GPU for the test datasets. Devices: We use 1080 Ti GPUs to evaluate our model's performance. To obtain the best performance in real scenarios, the NVIDIA Jetson TX2 (the jetson) and an RTX 2080 Ti GPU are chosen to deploy our edge-cloud model. The jetson is a single-module AI supercomputer based on the NVIDIA Pascal architecture. It is powerful, compact, and energy efficient, and is suitable for smart end devices such as robots, drones, smart cameras, and portable medical devices. It is therefore well suited as the edge device for our model.
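As a concrete illustration, the MOTA definition above can be evaluated from per-frame error counts; the counts below are made-up toy values, not results from our experiments:

```python
def mota(fp, fn, ids, gt):
    """MOTA = 1 - sum_t(FP_t + FN_t + IDS_t) / sum_t(GT_t)."""
    errors = sum(f + n + s for f, n, s in zip(fp, fn, ids))
    return 1.0 - errors / sum(gt)

# Toy per-frame counts for a three-frame clip: 6 errors over 30 GT boxes.
score = mota(fp=[1, 0, 2], fn=[0, 1, 1], ids=[0, 1, 0], gt=[10, 10, 10])
print(score)  # 0.8
```

Note that because FP, FN, and IDS are summed before normalization, a single crowded frame with many errors can dominate the score.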

Implementation details

We use Fairmot's backbone (DLA34) as the backbone for our experiments. The model is first pretrained on the COCO dataset and then further pretrained on CrowdHuman and a mixture of datasets (Caltech Pedestrian, CityPersons, CUHK-SYSU, PRW, ETHZ, MOT17, and MOT16). It is pretrained on CrowdHuman for 60 epochs with a self-supervised learning approach and then trained on the MIX dataset for 30 epochs. This yields the MOT16 and MOT17 models, which are available directly from Fairmot's GitHub repository; the models for MOT15 and MOT20 are finetuned on the corresponding datasets. We use standard data augmentation techniques, including rotation, scaling, and color jittering. We train for 32 epochs with a batch size of 12. The loss curve for MOT20 is illustrated in Fig. 5.

Table 4 On a private detector, we compare results with other outstanding papers

Analysis of experimental results of ablation experiments

To demonstrate that fusing long-history trajectory information as historical trajectory nodes effectively improves tracking performance while maintaining tracking efficiency, we first conduct ablation experiments with single-frame features (off). We use Fairmot with CenterNet as the baseline to obtain detections and re-ID features. The experimental results are shown in Table 1. MOTA is improved on all datasets. On MOT20 in particular, MOTA increases by nearly 6%, and IDS decreases substantially. This result shows that the long trajectory contains much more information than the short trajectory and effectively alleviates the occlusion problem. On each dataset, the tracking speed with long trajectories is only one frame per second lower than with short trajectories, which shows that our proposed closed-form solution effectively keeps tracking fast.

Table 5 The speed of the different processes on different devices
Table 6 The performance of different models

To prove that our algorithm can indeed solve identified transfer under joint detection and re-ID, we use Fairmot as the baseline for ablation experiments on MOT15, MOT16, MOT17, and MOT20. For Fairmot, we reproduce the results using Fairmot's code and weights from GitHub. To ensure a fair ablation, we use the same features as Fairmot. In data association, Fairmot uses re-ID features and motion features to obtain separate relation matrices between trajectories and candidate detections. It first uses the motion relation matrix to complete part of the association, and then uses the re-ID relation matrix to complete the rest. In contrast, our algorithm obtains a single final relation matrix and completes data association in one pass. Table 2 shows that MOTA is improved over Fairmot on all datasets. On MOT16 and MOT17, IDS decreases significantly, while FP and FN change little compared with Fairmot. On MOT20 in particular, MOTA improves by nearly 6%. As illustrated in Figs. 6 and 7, compared with Fairmot under moving and static cameras, our algorithm completes the tracking process effectively in dense scenes. As illustrated in Figs. 10 and 11, we visualize qualitative tracking results for each sequence of the MOT17 and MOT20 test sets. These results show that our algorithm effectively solves identified transfer in joint detection and re-ID.
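The single-pass idea can be sketched generically: fuse the motion and re-ID cost matrices into one relation matrix and solve the assignment once. This is a minimal illustration, not our CRF model; the fusion weight `alpha` and gating threshold `gate` are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(motion_cost, reid_cost, alpha=0.5, gate=0.7):
    """Fuse motion and re-ID costs into one relation matrix, then solve
    the assignment in a single pass with the Hungarian algorithm."""
    cost = alpha * motion_cost + (1 - alpha) * reid_cost
    rows, cols = linear_sum_assignment(cost)
    # Discard matches whose fused cost exceeds the gating threshold.
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= gate]

# Two trajectories vs. two detections; low cost = likely the same object.
motion = np.array([[0.1, 0.9], [0.8, 0.2]])
reid = np.array([[0.2, 0.8], [0.9, 0.1]])
print(associate(motion, reid))  # [(0, 0), (1, 1)]
```

A two-stage cascade, in contrast, would run the solver twice (motion first, then re-ID on the leftovers), which is where inconsistent intermediate matches can introduce identified transfer.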

Table 7 The rated power of different devices
Table 8 We compare results with other outstanding papers on DanceTrack

For tracking efficiency, we measure the speed on a 1080 Ti GPU for the test datasets. The ablation comparison in Table 2 shows that our algorithm is faster than Fairmot on every dataset. On MOT20 in particular, the speed improves from 8.77 to 11.91 FPS. This is because our closed-form solution completes the association process in a single pass.

Fig. 10

Qualitative results of our method for MOT17 test set sequences. Bounding boxes, identities, and trajectories are marked in the images, and different color bounding boxes represent different objects

Fig. 11

Qualitative results of our method for MOT20 test set sequences. Bounding boxes, identities, and trajectories are marked in the images, and different color bounding boxes represent different objects

To show that our algorithm's performance is state-of-the-art on MOTChallenge, we compare it with other excellent tracking algorithms under private detectors in Table 4. The table shows that for the MOTA metric, the performance of our algorithm improves most on the MOT20 dataset, while the improvement on MOT16 and MOT17 is smaller. This is because our algorithm mainly solves identified transfer in high-density pedestrian scenes. IDF1 is the ratio of correctly identified detections over the average number of ground-truth and computed detections. MT is the ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their life span, and ML is the ratio of ground-truth trajectories covered for at most 20% of their life span. In this paper, the detector's threshold is raised to obtain better detections, which lowers our algorithm's performance on IDF1, MT, and ML. Nevertheless, our algorithm outperforms the other excellent algorithms and achieves the best MOTA on all datasets. Although LNS is an excellent online graph-based tracker, our performance is still higher than it on MOT15, MOT16, and MOT17. To demonstrate our algorithm's robustness, we conduct an experiment on DanceTrack [28] and compare the results with other papers in Table 8. Our results, although not the best, are better than our baseline (Fairmot) on MOTA, IDF1, HOTA, and AssA. On DanceTrack, targets are not only highly similar in appearance but also rich in motion variation, and Fairmot has difficulty distinguishing such targets using simple appearance and motion similarity; our algorithm distinguishes them effectively. On DetA, our result is worse than Fairmot's, because DetA focuses more on detector performance: to obtain more accurate detections, we raise the detector threshold, which makes our DetA inferior to Fairmot's. Compared with the other methods, we use less complex and less discriminative information, so our result remains below theirs.

To deploy our algorithm on the jetson, we use the lightweight Yolov5s as the backbone and decrease the input size from \(1088\times 608\) to \(576\times 320\). The performance is therefore lower than DLA34 at \(1088\times 608\), as illustrated in Table 3. We thus design an edge-cloud model with a three-frame cycle to improve tracking performance and efficiency using a multi-threaded approach. We use the Transmission Control Protocol to maintain the network connection between the jetson and the server. As illustrated in Fig. 8, we divide the whole process into two parts: detection and tracking. We use different backbone networks to obtain detections and features on different devices, and the CRF model completes the tracking on the jetson. We deploy Yolov5s on the jetson and DLA34 on a server with an RTX 2080 Ti GPU. According to the timing, the first and third frames are detected with Yolov5s on the jetson, and the second with DLA34 on the RTX 2080 Ti GPU.

From Table 5, detection takes 49 ms on the jetson and 41 ms on the server. In addition, the jetson needs 55 ms to acquire a picture and 15 ms to send it to the server, and tracking takes 12 ms on the jetson. As illustrated in Fig. 9, for frame 1 the detection process starts at 55 ms and ends at 104 ms (55 + 49), and the tracking process starts at 104 ms and ends at 116 ms (55 + 49 + 12). Frame 2 is obtained at 110 ms (55 + 55) and costs 15 ms to send to the server, so its detection starts at 125 ms (110 + 15) and ends at 166 ms (110 + 15 + 41); the server thus needs 56 ms (15 + 41) in total, which is longer than the jetson. Since passing the detections and features back to the jetson takes almost no time, the tracking of frame 2 starts at 166 ms and ends at 178 ms (110 + 15 + 41 + 12). Frame 3 is obtained at 165 ms (55 + 55 + 55); the jetson's detector has been idle since 104 ms, so its detection runs from 165 ms to 214 ms (165 + 49), finishing after the 178 ms at which frame 2 completes tracking. The tracking of frame 3 starts at 214 ms and ends at 226 ms (165 + 49 + 12). Frame 4 is obtained at 220 ms (55 + 55 + 55 + 55), and since detection on the jetson starts sooner than on the server for this slot, frame 4 is detected on the jetson from 220 ms to 269 ms (220 + 49). The model's cycle is therefore three frames, and one cycle of three-frame tracking costs 229 ms, i.e., 76.3 ms per frame, so the edge-cloud model runs at 13.1 FPS and achieves quasi-real-time tracking in real-life scenarios. From Fig. 9, the detection process on the server starts every 180 ms (55 + 55 + 55 + 15), except for the first time, when the server starts detection after 125 ms (55 + 55 + 15). From Table 6, the MOTA of the edge-cloud model is 60.9, nearly 4% higher than Yolov5s alone, which shows that DLA34 helps improve the performance.
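The per-frame timestamps above follow from simple arithmetic on the Table 5 constants; the sketch below recomputes the tracking-end times of the first three frames:

```python
# Timing constants in ms, taken from Table 5.
CAPTURE, SEND = 55, 15            # frame acquisition; jetson -> server transfer
DET_JETSON, DET_SERVER = 49, 41   # detection time on each device
TRACK = 12                        # CRF tracking time on the jetson

# Frame 1: detected and tracked on the jetson.
f1_det_end = 1 * CAPTURE + DET_JETSON         # 104
f1_track_end = f1_det_end + TRACK             # 116

# Frame 2: sent to the server for detection, tracked back on the jetson.
f2_det_end = 2 * CAPTURE + SEND + DET_SERVER  # 166
f2_track_end = f2_det_end + TRACK             # 178

# Frame 3: back on the jetson; its detection overlaps frame 2's tracking.
f3_det_end = 3 * CAPTURE + DET_JETSON         # 214
f3_track_end = f3_det_end + TRACK             # 226

print(f1_track_end, f2_track_end, f3_track_end)  # 116 178 226
```

The overlap is what the cycle buys: while the server handles frame 2, the jetson's detector is free for frame 3, so no frame waits on a busy device.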

From Table 7, we obtain the rated power of the different devices. From Fig. 9, during one cycle of the edge-cloud model the jetson runs the whole time, 229 ms, while the server runs only during the detection of frame 2, i.e., 41 ms. The energy consumed by a device is given by formula (17), as follows:

$$\begin{aligned} w = p\times t, \end{aligned}$$
(17)

where p represents power in watts and t represents time in seconds. Therefore, for one cycle, the edge-cloud model consumes \(9.5075\,J\) \((7.5\times 0.229+190\times 0.041)\), i.e., 3.169 J per frame on average. From Table 5, the server needs 41 ms for detection and 10 ms for tracking, so a server-only setup consumes \(9.69\,J\) \((190\times (0.041+0.01))\) per frame. The energy consumed by the edge-cloud model to process one frame is thus far lower than that of the server alone.
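The energy figures above follow directly from formula (17) with the Table 7 power ratings:

```python
def energy_joules(power_watts, time_seconds):
    """Formula (17): w = p * t."""
    return power_watts * time_seconds

# One three-frame cycle: the jetson (7.5 W) runs for the full 229 ms,
# the server (190 W) only for frame 2's 41 ms detection.
cycle = energy_joules(7.5, 0.229) + energy_joules(190, 0.041)
per_frame_edge = cycle / 3
# Server-only baseline: 41 ms detection + 10 ms tracking per frame.
per_frame_server = energy_joules(190, 0.041 + 0.010)
print(round(cycle, 4), round(per_frame_edge, 3), round(per_frame_server, 2))
# 9.5075 3.169 9.69
```

Per frame, the edge-cloud model uses roughly a third of the server-only energy, because the high-power GPU is active for only one frame in three.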

Conclusion

To deploy multi-object tracking in real life, we design an edge-cloud model that achieves quasi-real-time tracking under the joint detection and re-ID strategy. We survey papers on joint detection and re-ID and find that identified transfer easily occurs in dense scenes. To solve this problem, we propose a CRF model, and to achieve quasi-real-time tracking we split the CRF model into a combination of multiple CRF models with closed-form solutions. Finally, we conduct comparative ablation experiments on MOT15, MOT16, MOT17, and MOT20 against Fairmot [46] and show that our algorithm is indeed effective.