Introduction

Multi-Object Tracking (MOT) is a challenging and popular research topic in computer vision and forms the basis of many other vision techniques, such as video analysis [30], behavior recognition [11], and autonomous driving [15]. The main task of an MOT algorithm is to assign candidate detections to an existing trajectory or to generate a new trajectory according to established rules [9]. MOT is extremely difficult in dense crowds due to partial occlusion, similar-looking objects, camera shake, and varying object speeds [21].

There are currently two main MOT frameworks: tracking-by-detection and joint-detection-and-tracking [20]. The tracking-by-detection framework divides MOT into two phases, object detection and data association, which are performed sequentially by two separate models [11]. First, the detector model produces detections in each frame [12]; then, the association model extracts re-ID features from the detections and assigns re-ID information to each detection. The detector model and the association model therefore duplicate the task of extracting object information, resulting in low tracking efficiency in dense scenes. With the development of deep learning [38, 40, 44] and multi-task learning [3], many works [46, 47] have proposed joint-detection-and-tracking to improve the performance and efficiency of tracking algorithms by reducing the number of times object information is extracted.

Building on object detection algorithms, the current predominant strategy for implementing joint-detection-and-tracking is joint detection and re-ID [36, 46]. It obtains more accurate local re-ID features by sharing a backbone network between the detection and re-ID stages, and then uses a simple similarity metric to fuse re-ID features and perform data association. This strategy integrates the re-ID feature extraction of the association model into the detection model, which simplifies the time-consuming association model. However, partial occlusion of objects and deviation of detections cause re-ID features to fail in high-density pedestrian scenes. Therefore, joint detection and re-ID, which uses low-robustness similarity metrics to fuse low-confidence re-ID features for data association, is prone to identity transfer, as illustrated in Fig. 1.

Fig. 1

Identity transfer. Illustration of tracklets generated by FairMOT on two clips. Because of occlusion, the center of the detection changes, resulting in identity transfer. a The ID changes between frames 702 and 710. b The ID changes between frames 1048 and 1049

To address the low robustness of data association caused by the failure of a single feature, many works [10, 31] have proposed improving the robustness of the association model by fusing different types of features (e.g., appearance features, motion features, and topological structure features) using a minimum-cost probabilistic graph model. Since a probabilistic graphical model can represent different categories of information in the manifold space, the effective fusion of different information can be achieved while optimizing the energy function, compensating for the loss of any single piece of information. Existing association models based on the minimum-cost-flow probability graph treat the detections of several successive frames as nodes and stepwise match candidate detections of the current frame against historical detections, resulting in a Markov-chain trajectory along the time dimension [6, 7, 14]. The classical minimum-cost-flow graph model focuses on the relationships between pairs of nodes, and is prone to identity transfer and cumulative-error problems in dense scenarios. To resolve identity transfer while preserving the tracking efficiency of joint detection and re-ID, this paper discards the classical probabilistic graph model's idea of coupling candidate detections with historical detections, instead matches candidate detections against current trajectories, and transforms ID assignment into an energy minimization problem on a CRF.

Probabilistic graphical models usually require iterative methods to find the optimal solution, making them difficult to generalize to online MOT. In this paper, the original CRF model is approximated by a combination of multiple CRF models with closed-form solutions. These closed-form CRF models avoid iterative solvers: the optimal solution can be obtained directly by matrix multiplication. In addition, the energy and resource consumption of computers is a major concern for mobile and embedded devices. To deploy the algorithm on mobile devices, this research designs an edge-cloud model that combines a lightweight network on an edge device (the NVIDIA Jetson TX2) with a heavyweight network on a server. Quasi-real-time tracking is possible on the edge device while maintaining performance.

The main contributions of this paper are as follows:

  • Within the joint detection and re-ID strategy, a CRF association model relating candidate detections and trajectories is proposed. Unlike classical graph models, it avoids the cumulative deviation of long Markov chains in the association process and transforms the task of ID assignment into an energy minimization problem on a CRF.

  • The original CRF model is approximated by a combination of models with closed-form solutions, avoiding iterative algorithms for finding the optimal solution, thereby improving efficiency and reducing computer energy and resource consumption.

  • An edge-cloud model is designed to combine edge devices and servers for quasi-real-time tracking.

Related work

Joint-detection-and-tracking is realized through two main strategies: joint detection and motion prediction [3, 47], and joint detection and re-ID [36, 46].

Joint detection and motion prediction uses a detector to regress the coordinates of detections. Chained-Tracker [24] proposes an end-to-end model that takes two adjacent frames as network input and regresses boxes. Tracktor [3] utilizes the Faster R-CNN detection head to directly regress boxes or terminate trajectories. However, these box-regression methods assume a high overlap rate between adjacent frames and therefore perform poorly on low-frame-rate video. In contrast, CenterTrack [47] uses information from two adjacent frames to predict the offset of detection centers and then uses center distance to complete the association. Meanwhile, it generates a trajectory heat map that helps match detections anywhere, even with low overlap between object detections. These methods only associate objects in adjacent frames without reinitializing lost tracks, which complicates the handling of occlusion.

Joint detection and re-ID first uses a single network to produce detections and re-ID features in one step and then uses a simple model to complete data association. Track-RCNN [33] adds a re-ID feature extraction head to Mask R-CNN and obtains detections and re-ID features for each object. JDE [36] constructs a re-ID feature extraction head on top of YOLOv3 and completes the association based on appearance similarity, enabling near-video-rate inference. FairMOT [46] argues that anchor-based detectors do not extract re-ID features well, and suggests extracting the features with anchor-free networks such as CenterNet. However, these methods focus on efficiently obtaining high-confidence re-ID features, and they easily cause identity transfer in dense scenes. To solve this problem, this paper uses a probabilistic-graph-based association model to complete data association within the joint detection and re-ID strategy, fusing multiple types of features during energy function optimization to improve tracking performance.

Common association models fall into two main categories: end-to-end models based on deep learning and probabilistic graph models. An end-to-end model obtains the similarity matrix between detections in a data-driven manner. For example, DAN [30] constructs a Deep Affinity Network to learn the similarity matrix, and the paper [6] uses a graph neural network (GNN) to learn it. Since the inputs and outputs of an end-to-end model are fixed, it does not retain all possible candidate detections, unlike a probabilistic graph model. The classical probabilistic graphical model is mainly based on minimum cost flow. For example, the paper [45] treats detections as nodes and connects nodes of adjacent frames in sequence to form the edges of the graph, with positive weights on edges whose nodes belong to the same track and negative weights otherwise. However, this method only considers the relationship between adjacent frames and cannot handle occlusion. To address occlusion, LMP [32] introduces two kinds of edges, regular and lifted, according to the length of the time interval, and constrains the solution on regular edges using the lifted edges. The paper [17] uses more information (e.g., appearance features, motion features, and topological structure features) as constraints to reduce tracking error. Although more complex and reliable, these graph models focus on node-to-node relationships, ignore the relationship between nodes and trajectories, and are therefore prone to cumulative error. Some papers propose CRF models [16, 41]. The paper [16] treats the relationship of two trajectories as a node and assigns each node a label indicating whether the two trajectories belong to the same trajectory, taking only the values 0 and 1. It thus focuses on node-to-node relationships and does not consider the relationship between nodes and the overall model. This cannot reflect the relationship between nodes and all trajectories well, and targets with similar appearance and similar location are prone to identity transfer. To solve this, this paper adopts a computation scheme over candidate detections and current trajectories and obtains the relationship between them.

To improve tracking speed, many papers [2, 45] iteratively solve for the global optimal solution by adding complex boundary conditions, which cannot be applied to a wide range of tracking scenarios. As the number of objects grows, the number of iterations increases, which affects tracking efficiency. In this paper, to ensure both tracking performance and efficiency, instead of adding complex boundary conditions, we use a combination of multiple CRF models to obtain a local optimal solution. Due to the complexity of MOT, few papers propose deploying MOT on mobile devices. Inspired by papers [13, 27, 35] that use different devices to balance speed and performance, we distill the CNN network and design an edge-cloud model that uses different networks on different devices during the detection process to achieve quasi-real-time tracking.

Method

In this section, this paper introduces the construction of the proposed algorithm; the overall flow is illustrated in Fig. 2. To avoid accumulated errors of the minimum cost flow model, this paper abstracts trajectories into historical trajectory nodes and constructs a CRF model of the relationship between nodes and trajectories. Because the solution space of the CRF model is complex, this paper constructs a combination of multiple CRF models with closed-form solutions, which reduces the original complexity and improves tracking efficiency. First, some definitions are given as follows:

Fig. 2

The general flowchart of the proposed algorithm. FairMOT is the baseline of the proposed algorithm, which provides re-ID features \(E_t\) and detections \(P_t\). First, the coordinate information \(P_t\) of the candidate detections and the detections \(P_{tr}\) and ID information \(ID_{tr}\) of the historical trajectories are used to encode each node. Then, the model determines the relationship matrix F between the candidate detections and the historical trajectories using the indicator encodings (candidate nodes' \(Z_t\), trajectory nodes' \(Z_{tr}\)) and re-ID information (candidate nodes' \(E_{t}\), trajectory nodes' \(E_{tr}\))

Notations:

  • \(x_i\) represents the node function of node i.

  • \(y_{ij}\) represents the edge function of edge ij.

  • \(V_{tr}\) represents the historical trajectory nodes.

  • \(v_{tr_i}\) represents the \(i\text{- }th\) historical trajectory node.

  • \(V_O\) represents the candidate detection nodes.

  • \(\mathbf {F_i}\) represents the relationship between node i and all trajectories in the CRF model.

  • \(\mathbf {Z_i}\) represents the \(i\text{- }th\) node’s indicator vector in the CRF model.

  • \(e_i\) represents the re-ID features of the node i.

  • \(e_{tr_i}\) represents the re-ID features of the \(i\text{- }th\) trajectory.

  • \(w_{ij}\) represents the appearance similarity between node i and node j.

  • KM represents the Kalman Filter.

  • \(p_{t-1_i}\) represents the detection of \(i\text{- }th\) object at frame \(t\text{- }1\).

  • \(U_{ik}\) represents the overlap rate of the \(i\text{- }th\) candidate detection node’s detection and the \(k\text{- }th\) historical trajectory node’s prediction detection.

  • \(H_{t-1}\) represents the detection nodes in frame \(t\text{- }1\).

  • \(f_{k_i}\) represents the relationship between the \(i\text{- }th\) node and the \(k\text{- }th\) trajectory in the \(k\text{- }th\) trajectory’s subgraph model.

  • \(z_{k_i}\) represents the \(i\text{- }th\) node’s indicator scalar in the \(k\text{- }th\) trajectory’s subgraph model.

  • \(f_{k}\) represents the relationship between all nodes and the \(k\text{- }th\) trajectory.

  • \(F^*\) represents the approximate solution of the CRF model.

The classical minimum cost flow probability graph

The number of objects changes dynamically over time, while the input and output of a convolutional neural network are fixed. Therefore, compared to end-to-end convolutional neural networks, a model based on the classical probability graph is more suitable for dense scenes. The minimum cost flow model is a classical association model among probability graphs. It treats each detection as a node; the node function \(x_i\) is 1 if the node is not a false detection and 0 otherwise. The nodes of two adjacent frames are connected along the time dimension to form the edges of the graph model, and the edge function \(y_{ij}\) takes the value 1 if nodes i and j belong to the same track and 0 otherwise, as illustrated in Fig. 3a, b. Then, the following energy function is an instance of the minimum cost flow:

$$\begin{aligned} \begin{aligned}&\min _{x \in \{0,1\}^V,y \in \{0,1\}^E} \sum _{i\in V} x_{i} + \sum _{(ij)\in E} \frac{1}{2}w_{ij}y_{ij},\\&w_{ij}=1-cosine(e_i,e_j), \end{aligned} \end{aligned}$$
(1)

where \(e_i\) represents the node’s appearance feature. The relationships between nodes can be obtained by solving the above model. According to the edge function, nodes are linked to form Markov-chain trajectories along the time dimension.
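As a concrete illustration of the pairwise term in formula (1), the appearance weight \(w_{ij}=1-cosine(e_i,e_j)\) can be computed for all node pairs at once. A minimal numpy sketch (the function name `edge_weights` is ours, not from the paper):

```python
import numpy as np

def edge_weights(features):
    """Pairwise appearance weights w_ij = 1 - cosine(e_i, e_j), as in formula (1).

    features: (n, d) array with one re-ID feature vector per detection node.
    A small w_ij means the two nodes look alike, so linking them into the
    same Markov-chain trajectory is cheap.
    """
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    return 1.0 - norm @ norm.T  # pairwise cosine similarities, flipped to costs

# Two near-identical detections and one dissimilar one.
w = edge_weights(np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]))
```

Here the similar pair (nodes 0 and 1) receives a much cheaper edge than the dissimilar pair (nodes 0 and 2).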

The above model only considers node-to-node relationships during association and ignores the relationship between nodes and trajectories, which easily causes cumulative errors. In a dense scenario, nodes with similar appearance but very different positions are prone to identity transfer, as illustrated in Fig. 3a, and the error propagates backward along the Markov chain, producing cumulative error. As illustrated in Fig. 3b, the green trajectory contains a wrong node, and this error causes the green node not to be tracked in the green trajectory, resulting in a cumulative error. To solve these problems, the CRF model incorporates trajectories into the model and constructs the relationship between trajectories and candidate detections. The CRF model abstracts trajectories into historical trajectory nodes, such as the colored squares illustrated in Fig. 3c, d. The trajectory nodes expand the temporal field of the CRF model to avoid identity transfer, as illustrated in Fig. 3c. At the same time, the error tolerance of historical trajectory nodes is higher than that of normal nodes, as illustrated in Fig. 3d: although there are errors in the trajectory, the trajectory contains more information, so the errors do not propagate backward. The CRF model abstracts trajectories into historical trajectory nodes \(V_{tr}=\{v_{tr1},v_{tr2},...,v_{trm}\}\). The historical trajectory nodes \(V_{tr}\) and the candidate detection nodes \(V_O\) form a complete graph \(G=(V=(V_{tr},V_O),E\in V \times V)\). The function variables are no longer simple 0–1 encodings, but are related to the trajectories. The node function is defined as \(V=\{i\in V \mid x_{i}=\left\| \mathbf {F_i}-\mathbf {Z_i}\right\| ,\mathbf {F_i} \in R^{1 \times m},\mathbf {Z_i} \in R^{1 \times m}\}\). \(\mathbf {F_i}\) denotes the relationship between node i and all trajectories, and \(\mathbf {Z_i}\) is the indicator vector that guides the direction of solving the energy function. The specific construction of the indicator vector is explained in the section “The energy function for the CRF model”. Therefore, compared to minimum cost flow, the CRF model uses all candidate detection nodes in the energy function, which increases the spatial field of the model. The edge function is defined as \(E=\{ij\in E \mid y_{ij}= \left\| \mathbf {F_i}-\mathbf {F_j}\right\| , \mathbf {F_i} \in R^{1 \times m},\mathbf {F_j} \in R^{1 \times m}\}\), and represents the relationship between different nodes. In the edge function, the CRF model utilizes not only the relationship between candidate detection nodes and trajectory nodes, but also the relationships among trajectory nodes and among candidate detection nodes. Therefore, compared to minimum cost flow, the CRF model is not concerned with whether two nodes belong to the same track; rather, more emphasis is placed on the relationship of each node to the other nodes as a whole. It can effectively use the topological structure between nodes and avoid identity transfer when nodes are similar in appearance but differ greatly in location.

Fig. 3

The CRF model and the minimum cost flow model. Figures c, d represent the CRF model; figures a, b represent the minimum cost flow graph model. Nodes in the same track have the same color, and nodes connected by straight lines form a trajectory. The minimum cost flow model only utilizes node-to-node relationships; when objects are similar in appearance, it easily produces errors, as illustrated in figure (a), where a black node appears on the green trajectory. The black node is then only matched against black nodes, without considering the green nodes, producing cumulative errors. The CRF model is concerned with the relationship between nodes and trajectories. The trajectories are abstracted into nodes, such as the squares illustrated in figures (c), (d), which expand the field of the model and reduce identity transfer. At the same time, the trajectory nodes are more tolerant, as illustrated in figure (d): although there is a black node in the green trajectory, the green node can still be tracked

The energy function for the CRF model

In this paper, the association problem of MOT is transformed into the problem of solving a CRF model. The energy function of the CRF model, patterned after energy function (1), is defined as follows:

$$\begin{aligned} \begin{aligned}&\min \sum _{i\in V} \left\| \mathbf {F_i}-\mathbf {Z_i}\right\| + \sum _{ij\in E} \frac{1}{2}w_{ij}\left\| \mathbf {F_i}-\mathbf {F_j}\right\| ,\\&w_{ij}=cosine(e_i,e_j). \end{aligned} \end{aligned}$$
(2)

As illustrated in Fig. 2, this paper uses FairMOT’s re-ID features to define the re-ID features of the candidate detection nodes. Inspired by the paper [46], this paper fuses the re-ID features of the \(i\text{- }th\) historical trajectory \(e_{tr_i}\) according to the following linear formulas:

$$\begin{aligned} \begin{aligned}&e_{t-1_i} =\frac{e_{t-1_i}}{\left\| e_{t-1_i}\right\| },\\&e =(1-\alpha )e_{t-1_i} + \alpha e,\\&e_{tr_i}= e,\\&\alpha =0.9,\\&e_{tr_i} =\frac{e_{tr_i}}{\left\| e_{tr_i}\right\| }; \end{aligned} \end{aligned}$$
(3)

\(e_{t-1_i}\) represents the re-ID features of the \(i\text{- }th\) object at frame \(t\text{- }1\), and e the re-ID features of the matched candidate at frame t. This approach quickly fuses the historical re-ID features of the same object, increasing tracking speed.
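The fusion in formula (3) amounts to a normalized linear blend of the historical and current features. A small sketch under our own naming (`fuse_track_feature` is illustrative, not from the paper):

```python
import numpy as np

def fuse_track_feature(e_prev, e_new, alpha=0.9):
    """Linear re-ID fusion of formula (3): normalize the historical feature,
    blend it with the candidate feature, and renormalize the result."""
    e_prev = e_prev / np.linalg.norm(e_prev)      # normalize history
    e = (1.0 - alpha) * e_prev + alpha * e_new    # linear fusion, alpha = 0.9
    return e / np.linalg.norm(e)                  # unit-norm trajectory feature

e_tr = fuse_track_feature(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

With \(\alpha=0.9\), the fused trajectory feature leans heavily toward the newest observation while retaining a fraction of the history.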

\(\mathbf {Z_i}\) represents the indicator vector of node i, which guides the process of solving the energy function. Different nodes have different indicator vectors, following the vector encoding process in Fig. 2. To highlight the uniqueness of historical trajectory nodes, this paper uses one-hot coding for their indicator vectors, defined by the historical trajectory identification (ID) number \(ID_{i}\). The specific formula is as follows:

$$\begin{aligned} Z_i[k] = \left\{ \begin{array}{lr} 1,\,\,\, &{} ID_{i}=k\\ 0,\,\,\, &{} else.\\ \end{array} \right. \end{aligned}$$
(4)

To avoid identity transfer arising from similar appearance but very different locations, this paper uses location topology information to define the indicator vectors of candidate detection nodes. Inspired by the papers [18, 46], this paper first uses the Kalman filter (KM) to obtain the prediction detections \(p_{tr_k}\) from the historical trajectory nodes, as follows:

$$\begin{aligned} \begin{aligned}&p =KM(p_{t-1_k},p),\\&p_{tr_k}= p,\\ \end{aligned} \end{aligned}$$
(5)

\(p_{t-1_k}\) represents the detection of the \(k\text{- }th\) object at frame \(t\text{- }1\). \(U_{ik}\) represents the overlap rate between the \(i\text{- }th\) candidate detection node’s detection and the \(k\text{- }th\) historical trajectory node’s prediction detection, and is used to define the indicator vector of the \(i\text{- }th\) candidate detection, as follows:

$$\begin{aligned} Z_i[k]=U_{ik}. \end{aligned}$$
(6)
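Putting formulas (4) and (6) together, the full indicator matrix stacks one-hot rows for trajectory and detection nodes with IoU rows for candidate nodes. A sketch, assuming trajectory IDs are re-indexed to 0..m-1 and boxes are given as (x1, y1, x2, y2); both function names are ours:

```python
import numpy as np

def iou(box_a, box_b):
    """Overlap rate (IoU) of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def indicator_vectors(track_ids, cand_boxes, pred_boxes):
    """One-hot rows, formula (4), for nodes carrying IDs `track_ids`;
    IoU rows U_ik, formula (6), for candidate detection nodes."""
    z_tr = np.eye(len(track_ids))[track_ids]                         # one-hot coding
    z_cand = np.array([[iou(c, p) for p in pred_boxes] for c in cand_boxes])
    return np.vstack([z_tr, z_cand])

# Two trajectories; two candidates, each overlapping exactly one prediction.
Z = indicator_vectors([0, 1],
                      cand_boxes=[(0, 0, 2, 2), (10, 10, 12, 12)],
                      pred_boxes=[(0, 0, 2, 2), (10, 10, 12, 12)])
```

Each row of `Z` is one node's indicator vector; the candidate rows peak on the trajectory whose Kalman prediction they overlap most.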

By solving formula (2), this paper obtains the solution of the CRF model, and thus the relationship between nodes and all trajectories.

Combination of graph models based on conditional random fields with closed-form solutions

Fig. 4

The combination of multiple CRF models. Based on the historical trajectory nodes, the CRF model is split into a combination of multiple CRF models, and all subgraphs are constructed. A subgraph represents the relationship between the nodes and one trajectory. A square represents a trajectory node, a circle with a number represents a detection node, and a circle without a number represents a candidate detection node

The complexity of iteratively solving formula (2) is \(O(n^2)\). In dense scenarios, this solving approach is not conducive to online tracking. Therefore, this paper looks for approximate solutions. Inspired by the solution process of the paper [42], the formula is written in matrix form so that a closed-form solution can be found directly, reducing the complexity to O(1) because its variables are scalars. Therefore, this paper decomposes the node–trajectory similarity \(\mathbf {F_i}\): based on the number of historical trajectories, the CRF model is split into a combination of multiple CRF models. As illustrated in Fig. 4, each subgraph represents the relationship between all nodes and one trajectory. Each subgraph model contains only one historical trajectory node; the other trajectory nodes are transformed into ordinary detection nodes containing re-ID features, the position in one frame, and the trajectory ID. Because all subgraphs are constructed the same way, this paper only needs the closed-form solution of one subgraph; iterating over the other subgraphs yields the approximate solution of the CRF model. For the \(k\text{- }th\) subgraph model \(G_k = (V_k=\{v_{tr_k},H_{t-1}-v_k,V_O\},E_k \in V_k \times V_k\})\), the node function is \(V_k=\{ i\in V \mid x_{i}= (f_{k_i}-z_{k_i})^2\}\) and the edge function is \(E_k=\{ ij\in E \mid y_{ij}= (f_{k_i}-f_{k_j})^2 \}\), where \(H_{t-1}\) represents the detection nodes with IDs at frame \(t\text{- }1\), \(f_{k_i}\) represents the similarity between node i and trajectory k, and \(z_{k_i}\) represents the indicator scalar of node i. Consistent with the CRF model, the detection nodes and the historical trajectory node of the \(k\text{- }th\) subgraph model define the indicator scalar according to the nodes’ \(ID_i\):

$$\begin{aligned} z_{k_i} = \left\{ \begin{array}{lr} 1,\,\,\, &{} ID_i=k,\\ 0,\,\,\, &{} else.\\ \end{array} \right. \end{aligned}$$
(7)

Candidate detection node i uses the overlap rate between its detection and the prediction detection \(p_{tr_k}\) of the \(k\text{- }th\) trajectory node to define its indicator scalar, as follows:

$$\begin{aligned} z_{k_i} =U_{ik}. \end{aligned}$$
(8)

The energy function for the \(k\text{- }th\) subgraph model is defined as follows:

$$\begin{aligned} \begin{aligned}&\min _{\mathbf {f_k}}\sum _{i\in V_k}(f_{k_i}-z_{k_i})^2+\sum _{(i,j)\in E_k}\frac{1}{2}w_{ij}(f_{k_i}-f_{k_j})^2,\\&\quad w_{ij}=cosine(e_i,e_j),\\&\quad k=1,2...m.\\ \end{aligned} \end{aligned}$$
(9)

Inspired by the paper [42], writing formula (9) in matrix form yields the following result:

$$\begin{aligned} \begin{aligned} \min _{\mathbf {f_k}} E(\mathbf {f_k})&=\min _{\mathbf {f_k}}(\mathbf {f_k}-\mathbf {z_k)^T}(\mathbf {f_k}-\mathbf {z_k})\\&\quad +\mathbf {f^T_{k}Df_k}-\mathbf {f^T_{k}Wf_k}, \end{aligned} \end{aligned}$$
(10)

where \(\textbf{D}\) and \(\textbf{W}\) are defined as follows:

$$\begin{aligned} d_{ii}&= \sum _{j=1}^{s}w_{ij}, \end{aligned}$$
(11)
$$\begin{aligned}&\begin{aligned} \textbf{D}&=diag\{d_{11},...,d_{ss}\} = \begin{pmatrix} d_{11} &{}\quad ... &{}\quad 0 \\ ... &{}\quad ... &{}\quad ... \\ 0 &{}\quad ... &{}\quad d_{ss} \end{pmatrix} ,\textbf{D}\in R^{s \times s}, \end{aligned} \end{aligned}$$
(12)
$$\begin{aligned} \textbf{W}&= \begin{pmatrix} w_{11} &{}\quad ... &{}\quad w_{1s} \\ ... &{}\quad ... &{}\quad ... \\ w_{s1} &{}\quad ... &{}\quad w_{ss} \end{pmatrix} ,\textbf{W}\in R^{s \times s}, \end{aligned}$$
(13)

where s represents the number of nodes. Differentiating formula (10) and setting the derivative to zero yields the closed-form solution of the \(k\text{- }th\) subgraph model as follows:

$$\begin{aligned} \mathbf {f_k} = \mathbf {(I+D+W)^{-1}z_k},k=1,2,...,m, \end{aligned}$$
(14)

where \(\textbf{I}\) represents the identity matrix. Therefore, by solving formula (14), the solution to the \(k\text{- }th\) subgraph can be obtained with complexity O(1), and by iteration, the complexity of solving multiple subgraphs is O(n). Thus, the combination of CRF models indeed reduces the complexity of the original CRF model.

Analysis of formula (14) reveals that, by stacking all indicator scalars into a matrix, the solution of the combination of CRF models can be obtained with a single matrix multiplication, as follows:

$$\begin{aligned} \textbf{F}^{*}&=(I+D+W)^{-1}L, \end{aligned}$$
(15)
$$\begin{aligned} \textbf{Z}&= \begin{pmatrix} z_{11} &{}\quad ... &{}\quad z_{1s} \\ ... &{}\quad ... &{}\quad ... \\ z_{m1} &{}\quad ... &{}\quad z_{ms} \end{pmatrix}, \nonumber \\ \textbf{L}&= Z^{T}. \end{aligned}$$
(16)

Therefore, regardless of the number of trajectories, the solution of the combination of CRF models can be obtained at once using formula (15), which effectively improves tracking efficiency.
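In code, formulas (14)–(16) reduce to building \(\textbf{W}\), \(\textbf{D}\), and \(\textbf{Z}\) and performing one linear solve. A minimal numpy sketch that follows the paper's formulas verbatim (the function name is ours; `np.linalg.solve` replaces the explicit inverse for numerical stability):

```python
import numpy as np

def solve_crf(features, Z):
    """F* = (I + D + W)^{-1} Z^T, formula (15), followed verbatim.

    features: (s, d) re-ID features of all s nodes.
    Z:        (m, s) stacked indicator vectors, one row per trajectory.
    Returns an (s, m) matrix: F*[i, k] relates node i to trajectory k.
    """
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = norm @ norm.T                       # w_ij = cosine(e_i, e_j), formula (2)
    D = np.diag(W.sum(axis=1))              # degree matrix, formulas (11)-(12)
    I = np.eye(W.shape[0])
    return np.linalg.solve(I + D + W, Z.T)  # all subgraphs in one solve

# Toy instance: 3 nodes, 2 trajectories.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
Z = np.array([[1.0, 0.3, 0.0],
              [0.0, 0.1, 1.0]])
F = solve_crf(feats, Z)
```

Each column of `F` equals the per-subgraph solution \(\mathbf {f_k}\) of formula (14), so the batched solve and the subgraph-by-subgraph iteration agree.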

Fig. 5

The loss chart of MOT20. train_hm_loss is the detection loss, train_id_loss the re-ID feature loss, and train_loss the total loss. At epoch 20, the train_id_loss curve gradually flattens. The train_hm_loss still has a decreasing trend at epoch 20, but flattens out without a decreasing trend by epoch 26. The train_loss is smooth and almost converges to 0 at epoch 26, so training is stopped there

The limitations of the method

As can be seen from formula (14), the solution of this model relies on the indicator scalars: a set of indicator scalars yields the relationship between one trajectory and the nodes. It is clear from the definition in formula (8) that the indicator scalar of each candidate detection node relies on simple positional IoU information. Therefore, it is difficult to obtain distinguishable indicator scalars when the coordinates of the nodes and the prediction detections of multiple trajectories are close to each other, which may make it hard for the model to obtain a distinguishable solution. Therefore, this paper plans to use deep learning networks to learn deeper positional information and thus learn distinguishable indicator scalars for each node.

Table 1 Ablation experiment with Fairmot as the baseline to compare the effect of long trajectory information with short trajectory information
Table 2 Comparison of ours and Fairmot for ablation experiments on a private detector

Experiments

We conduct experiments on MOTChallenge and DanceTrack [28] to demonstrate the algorithm’s effectiveness. First, we provide a brief introduction to the relevant datasets, metrics, and devices, and then analyze our algorithm’s advantages.

Fig. 6

Under a moving camera. A small camera shake causes FairMOT’s tracking to fail: object 699 with the green box becomes object 711 with the purple box in the next frame. However, our algorithm tracks the object well

Fig. 7

Under a static camera. The object is heavily occluded, causing FairMOT’s tracking to fail: object 12 with the green box becomes object 15 with the purple box in the next frame. However, our algorithm tracks the object well

Datasets and evaluation criteria

Datasets: We conduct experiments on MOTChallenge and DanceTrack [28]. MOT15, MOT16, MOT17, and MOT20 belong to MOTChallenge. MOT15 is a mixed dataset of pedestrians and vehicles; MOT16, MOT17, and MOT20 are pedestrian-only datasets. MOT16 and MOT17 contain the same videos: 7 training videos and 7 test videos. However, MOT16 provides only one detector, while MOT17 provides three detectors: DPM, Faster R-CNN, and SDP. The scenes of each video vary greatly in background, lighting, camera state, and crowd density, which makes these datasets very challenging. MOT20, on the other hand, is a high-density crowd dataset with a higher crowd density than the others; its videos are also longer, which makes it more difficult. In DanceTrack, the targets’ movement patterns are complex and diverse, showing obvious non-linearity and often accompanied by a variety of body movements.

Table 3 The performance is compared between Yolov5s and DLA34 on a private detector
Fig. 8

The edge-cloud model: to use computing resources efficiently, we use different networks on different devices in the detection process. Three frames make up a cycle, in which the first and third frames are detected on the Jetson device and the second on the RTX 2080 Ti GPUs

Fig. 9

The cyclic process of the edge-cloud model. Our model uses a three-frame cycle, in which the first and third frames complete the detection process on the Jetson and the second frame is processed on the server. To ensure efficiency and performance, we do not use a two-frame cycle with one frame on the Jetson and the other on the server. Orange represents the time when the device acquires frames, and yellow the time when frames are transferred from the Jetson to the server. Green represents detection time and red tracking time

Evaluation criteria: Because only the training set labels are provided, all tracking results must be submitted to the evaluation platform for scoring. Judging tracking performance is a difficult task in which many cases must be considered, so several metrics are provided. The most important is MOTA, which represents the accuracy of the whole tracking process and is defined in terms of FP, FN, and IDS. According to [4], FP and FN are mainly influenced by the detector. IDS counts the number of identity changes during tracking and reflects the continuity of the tracks. To jointly express detection and tracking performance, [4] combines the three metrics to define \(MOTA=1-\frac{\sum _t(FP_t+FN_t+IDS_t)}{\sum _t{GT_t}}\). The criterion for tracking efficiency is FPS, i.e., the number of frames tracked per second; we measure it on a 1080 Ti GPU for the test datasets. Devices: We use 1080 Ti GPUs to evaluate our model's performance. To obtain the best performance in real scenarios, the NVIDIA Jetson TX2 (the jetson) and an RTX 2080 Ti GPU are chosen to deploy our edge-cloud model. The jetson is a single-module AI supercomputer based on the NVIDIA Pascal architecture. It is powerful, compact, and energy efficient, and is suitable for smart end devices such as robots, drones, smart cameras, and portable medical devices. It is therefore well suited as the edge device for our model.
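As a concrete illustration, the MOTA definition above can be evaluated from per-frame error counts; the counts below are made-up toy values, not results from our experiments:

```python
def mota(fp, fn, ids, gt):
    """MOTA = 1 - sum_t(FP_t + FN_t + IDS_t) / sum_t(GT_t)."""
    errors = sum(f + n + s for f, n, s in zip(fp, fn, ids))
    return 1.0 - errors / sum(gt)

# Toy per-frame counts for a three-frame clip: 6 errors over 30 GT boxes.
score = mota(fp=[1, 0, 2], fn=[0, 1, 1], ids=[0, 1, 0], gt=[10, 10, 10])
print(score)  # 0.8
```

Note that because FP, FN, and IDS are summed before normalization, a single crowded frame with many errors can dominate the score.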

Implementation details

We use Fairmot's backbone (DLA34) as the backbone for our experiments. The model is first pretrained on the COCO dataset and then further pretrained on CrowdHuman and a mixture of datasets (Caltech Pedestrian, CityPersons, CUHK-SYSU, PRW, ETHZ, MOT17, and MOT16). It is pretrained on CrowdHuman for 60 epochs with a self-supervised learning approach and then trained on the MIX dataset for 30 epochs. This yields the MOT16 and MOT17 models, which are available directly from Fairmot's GitHub repository; the models for MOT15 and MOT20 are finetuned on the corresponding datasets. We use standard data augmentation techniques, including rotation, scaling, and color jittering. We train for 32 epochs with a batch size of 12. The loss curve for MOT20 is illustrated in Fig. 5.

Table 4 On a private detector, we compare results with other outstanding papers

Analysis of experimental results of ablation experiments

To demonstrate that fusing long-history trajectory information as historical trajectory nodes effectively improves tracking performance while maintaining tracking efficiency, we first conduct ablation experiments with single-frame features (off). We use Fairmot with CenterNet as the baseline to obtain detections and re-ID features. The experimental results are shown in Table 1. MOTA is improved on all datasets. On MOT20 in particular, MOTA increases by nearly 6%, and IDS decreases substantially. This result shows that the long trajectory contains much more information than the short trajectory and effectively alleviates the occlusion problem. On each dataset, the tracking speed with long trajectories is only one frame per second lower than with short trajectories, which shows that our proposed closed-form solution effectively keeps tracking fast.

Table 5 The speed of the different processes on different devices
Table 6 The performance of different models

To prove that our algorithm can indeed solve identified transfer under joint detection and re-ID, we use Fairmot as the baseline for ablation experiments on MOT15, MOT16, MOT17, and MOT20. For Fairmot, we reproduce the results using Fairmot's code and weights from GitHub. To ensure a fair ablation, we use the same features as Fairmot. In data association, Fairmot uses re-ID features and motion features to obtain separate relation matrices between trajectories and candidate detections. It first uses the motion relation matrix to complete part of the association, and then uses the re-ID relation matrix to complete the rest. In contrast, our algorithm obtains a single final relation matrix and completes data association in one pass. Table 2 shows that MOTA is improved over Fairmot on all datasets. On MOT16 and MOT17, IDS decreases significantly, while FP and FN change little compared with Fairmot. On MOT20 in particular, MOTA improves by nearly 6%. As illustrated in Figs. 6 and 7, compared with Fairmot under moving and static cameras, our algorithm completes the tracking process effectively in dense scenes. As illustrated in Figs. 10 and 11, we visualize qualitative tracking results for each sequence of the MOT17 and MOT20 test sets. These results show that our algorithm effectively solves identified transfer in joint detection and re-ID.
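The single-pass idea can be sketched generically: fuse the motion and re-ID cost matrices into one relation matrix and solve the assignment once. This is a minimal illustration, not our CRF model; the fusion weight `alpha` and gating threshold `gate` are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(motion_cost, reid_cost, alpha=0.5, gate=0.7):
    """Fuse motion and re-ID costs into one relation matrix, then solve
    the assignment in a single pass with the Hungarian algorithm."""
    cost = alpha * motion_cost + (1 - alpha) * reid_cost
    rows, cols = linear_sum_assignment(cost)
    # Discard matches whose fused cost exceeds the gating threshold.
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= gate]

# Two trajectories vs. two detections; low cost = likely the same object.
motion = np.array([[0.1, 0.9], [0.8, 0.2]])
reid = np.array([[0.2, 0.8], [0.9, 0.1]])
print(associate(motion, reid))  # [(0, 0), (1, 1)]
```

A two-stage cascade, in contrast, would run the solver twice (motion first, then re-ID on the leftovers), which is where inconsistent intermediate matches can introduce identified transfer.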

Table 7 The rated power of different devices
Table 8 We compare results with other outstanding papers on DanceTrack

For tracking efficiency, we measure the speed on a 1080 Ti GPU for the test datasets. The ablation comparison in Table 2 shows that our algorithm is faster than Fairmot on every dataset. On MOT20 in particular, the speed improves from 8.77 to 11.91 FPS. This is because our closed-form solution completes the association process in a single pass.

Fig. 10

Qualitative results of our method for MOT17 test set sequences. Bounding boxes, identities, and trajectories are marked in the images, and different color bounding boxes represent different objects

Fig. 11

Qualitative results of our method for MOT20 test set sequences. Bounding boxes, identities, and trajectories are marked in the images, and different color bounding boxes represent different objects

To show that our algorithm's performance is state-of-the-art on MOTChallenge, we compare it with other excellent tracking algorithms under private detectors in Table 4. The table shows that for the MOTA metric, the performance of our algorithm improves most on the MOT20 dataset, while the improvement on MOT16 and MOT17 is smaller. This is because our algorithm mainly solves identified transfer in high-density pedestrian scenes. IDF1 is the ratio of correctly identified detections over the average number of ground-truth and computed detections. MT is the ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their life span, and ML is the ratio of ground-truth trajectories covered for at most 20% of their life span. In this paper, the detector's threshold is raised to obtain better detections, which lowers our algorithm's performance on IDF1, MT, and ML. Nevertheless, our algorithm outperforms the other excellent algorithms and achieves the best MOTA on all datasets. Although LNS is an excellent online graph-based tracker, our performance is still higher than it on MOT15, MOT16, and MOT17. To demonstrate our algorithm's robustness, we conduct an experiment on DanceTrack [28] and compare the results with other papers in Table 8. Our results, although not the best, are better than our baseline (Fairmot) on MOTA, IDF1, HOTA, and AssA. On DanceTrack, targets are not only highly similar in appearance but also rich in motion variation, and Fairmot has difficulty distinguishing such targets using simple appearance and motion similarity; our algorithm distinguishes them effectively. On DetA, our result is worse than Fairmot's, because DetA focuses more on detector performance: to obtain more accurate detections, we raise the detector threshold, which makes our DetA inferior to Fairmot's. Compared with the other methods, we use less complex and less discriminative information, so our result remains below theirs.

To deploy our algorithm on the jetson, we use the lightweight Yolov5s as the backbone and decrease the input size from \(1088\times 608\) to \(576\times 320\). The performance is therefore lower than DLA34 at \(1088\times 608\), as illustrated in Table 3. We thus design an edge-cloud model with a three-frame cycle to improve tracking performance and efficiency using a multi-threaded approach. We use the Transmission Control Protocol to maintain the network connection between the jetson and the server. As illustrated in Fig. 8, we divide the whole process into two parts: detection and tracking. We use different backbone networks to obtain detections and features on different devices, and the CRF model completes the tracking on the jetson. We deploy Yolov5s on the jetson and DLA34 on a server with an RTX 2080 Ti GPU. According to the timing, the first and third frames are detected with Yolov5s on the jetson, and the second with DLA34 on the RTX 2080 Ti GPU.

From Table 5, detection takes 49 ms on the jetson and 41 ms on the server. In addition, the jetson needs 55 ms to acquire a picture and 15 ms to send it to the server, and tracking takes 12 ms on the jetson. As illustrated in Fig. 9, for frame 1 the detection process starts at 55 ms and ends at 104 ms (55 + 49), and the tracking process starts at 104 ms and ends at 116 ms (55 + 49 + 12). Frame 2 is obtained at 110 ms (55 + 55) and costs 15 ms to send to the server, so its detection starts at 125 ms (110 + 15) and ends at 166 ms (110 + 15 + 41); the server thus needs 56 ms (15 + 41) in total, which is longer than the jetson. Since passing the detections and features back to the jetson takes almost no time, the tracking of frame 2 starts at 166 ms and ends at 178 ms (110 + 15 + 41 + 12). Frame 3 is obtained at 165 ms (55 + 55 + 55); the jetson's detector has been idle since 104 ms, so its detection runs from 165 ms to 214 ms (165 + 49), finishing after the 178 ms at which frame 2 completes tracking. The tracking of frame 3 starts at 214 ms and ends at 226 ms (165 + 49 + 12). Frame 4 is obtained at 220 ms (55 + 55 + 55 + 55), and since detection on the jetson starts sooner than on the server for this slot, frame 4 is detected on the jetson from 220 ms to 269 ms (220 + 49). The model's cycle is therefore three frames, and one cycle of three-frame tracking costs 229 ms, i.e., 76.3 ms per frame, so the edge-cloud model runs at 13.1 FPS and achieves quasi-real-time tracking in real-life scenarios. From Fig. 9, the detection process on the server starts every 180 ms (55 + 55 + 55 + 15), except for the first time, when the server starts detection after 125 ms (55 + 55 + 15). From Table 6, the MOTA of the edge-cloud model is 60.9, nearly 4% higher than Yolov5s alone, which shows that DLA34 helps improve the performance.
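The per-frame timestamps above follow from simple arithmetic on the Table 5 constants; the sketch below recomputes the tracking-end times of the first three frames:

```python
# Timing constants in ms, taken from Table 5.
CAPTURE, SEND = 55, 15            # frame acquisition; jetson -> server transfer
DET_JETSON, DET_SERVER = 49, 41   # detection time on each device
TRACK = 12                        # CRF tracking time on the jetson

# Frame 1: detected and tracked on the jetson.
f1_det_end = 1 * CAPTURE + DET_JETSON         # 104
f1_track_end = f1_det_end + TRACK             # 116

# Frame 2: sent to the server for detection, tracked back on the jetson.
f2_det_end = 2 * CAPTURE + SEND + DET_SERVER  # 166
f2_track_end = f2_det_end + TRACK             # 178

# Frame 3: back on the jetson; its detection overlaps frame 2's tracking.
f3_det_end = 3 * CAPTURE + DET_JETSON         # 214
f3_track_end = f3_det_end + TRACK             # 226

print(f1_track_end, f2_track_end, f3_track_end)  # 116 178 226
```

The overlap is what the cycle buys: while the server handles frame 2, the jetson's detector is free for frame 3, so no frame waits on a busy device.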

From Table 7, we obtain the rated power of the different devices. From Fig. 9, during one cycle of the edge-cloud model the jetson runs the whole time, 229 ms, while the server runs only during the detection of frame 2, i.e., 41 ms. The energy consumed by a device is given by formula (17), as follows:

$$\begin{aligned} w = p\times t, \end{aligned}$$
(17)

where p represents power in watts and t represents time in seconds. Therefore, for one cycle, the edge-cloud model consumes \(9.5075\,J\) \((7.5\times 0.229+190\times 0.041)\), i.e., 3.169 J per frame on average. From Table 5, the server needs 41 ms for detection and 10 ms for tracking, so a server-only setup consumes \(9.69\,J\) \((190\times (0.041+0.01))\) per frame. The energy consumed by the edge-cloud model to process one frame is thus far lower than that of the server alone.
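The energy figures above follow directly from formula (17) with the Table 7 power ratings:

```python
def energy_joules(power_watts, time_seconds):
    """Formula (17): w = p * t."""
    return power_watts * time_seconds

# One three-frame cycle: the jetson (7.5 W) runs for the full 229 ms,
# the server (190 W) only for frame 2's 41 ms detection.
cycle = energy_joules(7.5, 0.229) + energy_joules(190, 0.041)
per_frame_edge = cycle / 3
# Server-only baseline: 41 ms detection + 10 ms tracking per frame.
per_frame_server = energy_joules(190, 0.041 + 0.010)
print(round(cycle, 4), round(per_frame_edge, 3), round(per_frame_server, 2))
# 9.5075 3.169 9.69
```

Per frame, the edge-cloud model uses roughly a third of the server-only energy, because the high-power GPU is active for only one frame in three.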

Conclusion

To deploy multi-object tracking in real life, we design an edge-cloud model that achieves quasi-real-time tracking under the joint detection and re-ID strategy. We survey papers on joint detection and re-ID and find that identified transfer easily occurs in dense scenes. To solve this problem, we propose a CRF model, and to achieve quasi-real-time tracking we split the CRF model into a combination of multiple CRF models with closed-form solutions. Finally, we conduct comparative ablation experiments on MOT15, MOT16, MOT17, and MOT20 against Fairmot [46] and show that our algorithm is indeed effective.