1 Introduction

Multi-object tracking (MOT) is an important computer vision task with wide applications in surveillance, robotics, and human-computer interaction. With the recent development of object detectors, MOT is commonly formulated within the tracking-by-detection framework. Most multi-object tracking benchmarks, such as MOT16 [16], provide video sequences together with detection results from public detectors. The key issue for a multi-object tracker is to associate tracklets with the corresponding detection responses to form long trajectories. Here, tracklets denote the set of trajectories established up to the current frame.

Recent tracking-by-detection methods can be categorized into batch and online methods. Batch methods process video sequences in batch mode and take frames from future time steps into consideration. These methods usually solve the association problem with optimization methods. For example, [17] formulates the MOT problem as the minimization of a continuous energy. [5] models the MOT problem as a min-cost network flow and finds the optimal solution with convex relaxation. Such systems may obtain a nearly globally optimal solution but, because they rely on future frames, are not suitable for practical applications that need results frame by frame. Online MOT methods only consider the observations up to the current frame and associate tracklets and detection responses frame by frame. These online trackers typically build different models to measure the affinities between tracklets and detection responses, and then apply an online association algorithm to obtain the global optimum. Motion, appearance and interaction models are most frequently adopted to build the affinity matrix. In [13], integral channel features are adopted to build a robust appearance model. [6] proposes a nonlinear motion model to obtain reliable motion affinity. [20] establishes an LSTM interaction model to explore group behavior and compute the matching likelihoods.

In complex and crowded scenarios, many objects have similar appearance and may occlude each other. Mismatches often occur in such scenarios, so the tracker cannot associate objects consistently. However, the consistency of trajectories plays an important role in follow-up tasks such as trajectory prediction and analysis. Spatial constraints and motion models alone cannot handle such problems; a robust appearance model must be established. An appearance model can improve the tracker’s ability to associate objects consistently and reduce the mismatch rate. Some online trackers [12] adopt raw pixels or histograms as the appearance model. These trackers may run fast but cannot distinguish objects with similar appearance. Recent developments in convolutional neural networks have driven researchers to train deep networks to extract deep appearance features. [1, 26] measure appearance similarity with a person re-identification network. However, all these trackers need to crop the objects from the images first and then feed them into the network in batch mode. The pre-processing procedure and the frequent forward propagations make these trackers time consuming.

The MOTA [2] metric is the most widely accepted metric for multi-object tracking evaluation, but it is not capable of evaluating the consistency of trajectories; the reasons are explained in Sect. 3.1. In this paper, we adopt the ID switch rate and the \(IDF_{1}\) score, which was initially proposed for evaluating ID consistency in cross-camera multi-object tracking, to evaluate the consistency of trajectories.

In this paper, we propose a part-based deep network combined with a confidence-based association metric to address the above problems. The main contributions are summarized as follows: (i) we propose a part-based deep network which employs the ROI pooling method [10] to extract part-based deep appearance features for all objects with just one forward propagation. The network is trained with a siamese architecture [7], which enables our tracker to associate correctly even when objects are partly occluded; (ii) we propose an occlusion detector which predicts the occlusion degree and guides the part-based similarity fusion and the appearance model update; (iii) we call for more attention to the consistency of trajectories and conduct extensive experiments on the MOT benchmark with multiple evaluation metrics introduced in [19] and [2]. The results demonstrate that our tracker associates objects consistently and gains a significant improvement in tracking accuracy.

2 MOT Framework

The baseline of our tracker is a confidence-based association metric. Appearance, motion and shape models are established to measure the affinities between tracklets and object detections. Section 2.1 describes the structure of the fast part-based deep network in detail, Sect. 2.2 introduces the network training procedure, and Sect. 2.3 describes the confidence-based association metric.

2.1 Fast Part-Based Deep Network

Fig. 1. The feature extraction pipeline of a traditional deep network and our part-based deep network.

Network Structure. Traditional deep appearance networks in the MOT field usually take as input the object regions cropped from the original image, in batch mode. This is time consuming and requires pre-processing: the more objects a frame contains, the more forward propagations are needed.

The main structure of our part-based deep network is shown in Fig. 1. The network takes as input the entire image and a set of detection responses. The whole image is first processed by several convolutional and max pooling layers to generate a shared feature map. The ROI pooling method is then adopted to generate five feature maps for each detection: the left body (LB), right body (RB), upper body (UB), down body (DB) and full body (FB). The five types of features are fed into fully-connected layers separately, and the subsequent normalization layers normalize the outputs to obtain the final feature vectors. In this way, our network extracts deep features for all objects with just one forward propagation. In addition, an occlusion detector operating on the shared feature map estimates the occlusion degree of each detection response and guides the part-based similarity fusion and the appearance model update.

ROI pooling proceeds as follows. First, the ROI pooling layer maps the position and scale of the object from the original image to the shared feature map and obtains the corresponding ROI window. It then divides the \(h\times w\) ROI window into an \(H\times W\) grid of sub-windows of approximate size \(h/H\times w/W\) and max-pools the values in each sub-window into the corresponding output grid cell [10]. Because the convolutional computation is shared by all detections in a frame, feature extraction is faster than in other trackers based on deep appearance models.
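The sub-window division described above can be illustrated with a minimal pure-Python sketch on a single-channel feature map. The function names, the inclusive ROI convention and the floor/ceil binning rule are assumptions made for illustration; they are not taken from the paper or from [10].

```python
import math


def _bins(start, length, n_bins):
    """Split [start, start + length) into n_bins index ranges.

    Assumed rule: bin k spans floor(k * length / n_bins) up to
    ceil((k + 1) * length / n_bins), so every bin is non-empty and
    neighbouring bins may overlap by one index.
    """
    return [
        (start + math.floor(k * length / n_bins),
         start + math.ceil((k + 1) * length / n_bins))
        for k in range(n_bins)
    ]


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an ROI of a 2D feature map into an out_h x out_w grid.

    roi = (x0, y0, x1, y1) in feature-map coordinates, both ends inclusive.
    """
    x0, y0, x1, y1 = roi
    row_bins = _bins(y0, y1 - y0 + 1, out_h)
    col_bins = _bins(x0, x1 - x0 + 1, out_w)
    return [
        [max(feature_map[r][c]
             for r in range(r0, r1)
             for c in range(c0, c1))
         for (c0, c1) in col_bins]
        for (r0, r1) in row_bins
    ]


fm = [[0, 1, 2, 3],
      [4, 5, 6, 7],
      [8, 9, 10, 11],
      [12, 13, 14, 15]]
print(roi_max_pool(fm, (0, 0, 3, 3), 2, 2))  # [[5, 7], [13, 15]]
```

Whatever the size of the ROI window, the output grid has a fixed size, which is what allows detections of different scales to share the same fully-connected layers.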

Part-Based Model. For the MOT task, occlusion remains a challenging open problem. It can easily cause fragmented trajectories and ID switches, especially for online trackers, and mismatches severely damage the consistency of trajectories. We adopt a part-based appearance network combined with a simple occlusion detector to address this problem, which is easy to implement on top of the ROI pooling method. Persons captured by high-mounted cameras tend to be occluded in the vertical direction, whereas persons captured by low-mounted cameras are more likely to be occluded in the horizontal direction. For the sake of high feature extraction speed, we do not design an elaborate part detector and rely instead on the representative ability of deep features. The detected persons are simply divided into UB, DB, LB and RB to handle occlusion from multiple views. During forward propagation, the ROI pooling layer extracts features for FB and the four divided parts, and a slice layer is added to separate the features generated from different parts. Thus, when an object is partly occluded, the features of its visible parts are still reliable for appearance similarity computation. Because the part features are extracted from the shared convolutional feature map, the added part module causes almost no speed loss.
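As a rough illustration of the simple part division, the sketch below derives the five ROIs from one detection box. The paper does not state where the parts are split, so cutting each box exactly at its horizontal and vertical midlines is an assumption, as is the helper name `part_boxes`.

```python
def part_boxes(x, y, w, h):
    """Return the five part ROIs of a detection as (x, y, w, h) tuples.

    (x, y) is the top-left corner. Splitting at the midlines is an
    assumption made for illustration only.
    """
    return {
        "FB": (x, y, w, h),                # full body
        "UB": (x, y, w, h / 2),            # upper half
        "DB": (x, y + h / 2, w, h / 2),    # lower ("down") half
        "LB": (x, y, w / 2, h),            # left half
        "RB": (x + w / 2, y, w / 2, h),    # right half
    }


for name, box in part_boxes(10, 20, 40, 80).items():
    print(name, box)
```

Each of these boxes would then be passed to the ROI pooling layer, so all five features come from the same shared feature map.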

Occlusion Detector. We propose a novel occlusion detector to detect whether occlusion exists in the current detection and to guide the part-based similarity fusion and the appearance model update. First, the width and height of the detected bounding box are enlarged to 1.2 times the original to capture more context information. The ROI pooling layer is then employed to extract the corresponding features from the shared feature map. A subsequent classifier, composed of three fully-connected layers followed by one softmax layer, takes these features as input and outputs the occlusion label. The occlusion detector classifies detections into three types: severe-occluded, part-occluded and non-occluded. For severe-occluded detections, the appearance similarity is no longer reliable and is not used in the final similarity computation. For part-occluded detections, the part-based appearance features are still reliable and are used to measure appearance similarity. For non-occluded detections, the FB feature vectors are employed.
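The box enlargement step can be sketched as follows. The paper only gives the factor 1.2; keeping the box center fixed while scaling, and the helper name `enlarge_box`, are assumptions.

```python
def enlarge_box(x, y, w, h, scale=1.2):
    """Scale a box (top-left x, y, width, height) about its center.

    Keeping the center fixed is an assumption; the text only states that
    width and height are enlarged to 1.2 times the original.
    """
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * scale, h * scale
    return (cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h)


print(enlarge_box(10, 10, 20, 40))
```

In practice the enlarged box would also have to be clipped to the image boundary before ROI pooling; that step is omitted here.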

2.2 Network Training

The training procedure is divided into two stages. First, the part-based deep network is trained with a siamese architecture; then the occlusion detector is trained on top of the pretrained base network.

Siamese Architecture Training. To enable the deep network to distinguish different persons, we select the subset of ALOV300++ sequences [22] in which the tracked object is a person, together with the MOT training sequences [16], as the base training dataset. We then generate positive and negative pairs by randomly sampling the same and different identities from the video sequences. The part-based deep appearance network is trained with a siamese architecture to learn a dissimilarity metric between pairs of identities. As shown in Fig. 2, we design a siamese network composed of two branches that share the same structure and filter weights. Each branch has the same architecture as the part-based deep network. The two branches are connected by five loss layers for network training, one for each type of feature. We employ the margin contrastive loss, calculated as follows:

Fig. 2. The structure of siamese training.

$$\begin{aligned} L\left( x_{i},x_{j},y_{ij} \right) =\frac{1}{2}y_{ij}D+\frac{1}{2}(1-y_{ij})\max (0,\varepsilon -D) \end{aligned}$$
(1)

where \(D=\Vert x_{i}-x_{j}\Vert ^{2}\) is the squared Euclidean distance between the two normalized feature vectors \(x_{i}\) and \(x_{j}\), \(y_{ij}\) indicates whether the pair of objects has the same identity (1 for the same identity, 0 otherwise), and \(\varepsilon \) is the minimum distance margin that pairs of different objects should satisfy. We set \(\varepsilon \) to 1 in our experiments. The final training loss is the sum of the five losses. After training the siamese network with the margin contrastive loss, the part-based deep network generates feature representations that are close together for positive pairs and separated by at least the margin for negative pairs, so a simple cosine distance metric can measure appearance similarity.
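Equation 1 can be written out directly; the following is a minimal sketch for a single pair and a single feature type, with hypothetical function names. For unit-length vectors the squared Euclidean distance equals \(2-2\cos \theta \), where \(\theta \) is the angle between the vectors, which is why the cosine measure can be used at test time.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def contrastive_loss(x_i, x_j, y_ij, margin=1.0):
    """Margin contrastive loss of Eq. 1 for one pair.

    y_ij is 1 for a same-identity pair and 0 otherwise; margin plays the
    role of epsilon (set to 1 in the paper).
    """
    d = sum((a - b) ** 2 for a, b in zip(x_i, x_j))  # squared distance D
    return 0.5 * y_ij * d + 0.5 * (1 - y_ij) * max(0.0, margin - d)


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
print(contrastive_loss(a, b, 1))  # small: the pair is already close
print(contrastive_loss(a, b, 0))  # positive: a negative pair inside the margin
```

The total training loss in the paper sums this term over the five feature types (FB, UB, DB, LB, RB).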

Occlusion Detector Training. The MOT16 dataset provides the visibility ratio for each annotated bounding box, and we divide these bounding boxes into three types. Bounding boxes with a visibility ratio lower than 0.9 and higher than 0.4 are regarded as part-occluded detections; those with a higher visibility ratio are regarded as non-occluded, and those with a lower visibility ratio as severe-occluded.
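The labelling rule can be sketched as a small function. The paper does not say how boxes with a visibility ratio of exactly 0.4 or 0.9 are treated, so assigning them to the severe-occluded and non-occluded classes respectively is an assumption, as is the function name.

```python
def occlusion_label(visibility):
    """Map a MOT16 visibility ratio in [0, 1] to a training label.

    Boundary handling at exactly 0.4 and 0.9 is an assumption.
    """
    if visibility >= 0.9:
        return "non-occluded"
    if visibility > 0.4:
        return "part-occluded"
    return "severe-occluded"


for v in (1.0, 0.65, 0.2):
    print(v, occlusion_label(v))
```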

After the part-based network has been trained with the siamese architecture, the weights of the base network are frozen, and the occlusion detector is added after the base network and trained with a softmax loss. To improve the generalization ability of the occlusion detector, data augmentation is adopted during training: we flip and crop the objects and change the brightness, contrast, sharpness and saturation of the images with a certain probability. Finally, the two components are integrated to obtain the final model.

2.3 Association Procedure

The association between tracklets and object detections can be formulated as an assignment problem. We adopt a modified version of the confidence-based association metric [1] to solve this problem.

Affinity Computation. The representations of tracklet \(T_{i}^{t}\) and detection \(D_{j}^{t}\) at frame t are defined as follows:

$$\begin{aligned} T_{i}^{t}=&\{P_{i}^{t-d:t}(x,y,w,h),A_{i}^{q}(FB,UB,DB,LB,RB),conf_{i},K_{i}(m,p)\}\end{aligned}$$
(2)
$$\begin{aligned} D_{j}^{t}=&\{x,y,w,h,F_{j}(FB,UB,DB,LB,RB),Olabel\} \end{aligned}$$
(3)

where \(P_{i}^{t-d:t}(x,y,w,h)\) denotes the positions and shapes of the object from frame \(t-d\) to frame t. \(K_{i}(m,p)\) is a Kalman motion model, where m and p denote the mean and covariance matrix respectively. At frame \(t+1\), \(K_{i}(m,p)\) predicts the object’s position \(P_{i}^{t+1}(x,y,w,h)\), and the motion and shape affinities are calculated as in Eqs. 4 and 5, where \(D_{j}^{t+1}\) is the j-th detection in frame \(t+1\). Once the tracklet is associated with a new detection, the detected bounding box is used to update \(K_{i}(m,p)\). In addition, \(K_{i}(m,p)\) is used to estimate positions for missed objects.

$$\begin{aligned} sim_{mot}\left( T_{i}^{t+1},D_{j}^{t+1} \right) =e^{-w_{1}((\frac{P_{i}^{t+1}(x)-D_{j}^{t+1}(x)}{D_{j}^{t+1}(w)})^{2}+(\frac{P_{i}^{t+1}(y)-D_{j}^{t+1}(y)}{D_{j}^{t+1}(h)})^{2})} \end{aligned}$$
(4)
$$\begin{aligned} sim_{shp}\left( T_{i}^{t+1},D_{j}^{t+1} \right) =e^{-w_{2}( \frac{\left| P_{i}^{t+1}(h)-D_{j}^{t+1}(h)\right| }{P_{i}^{t+1}(h)+D_{j}^{t+1}(h)}+ \frac{\left| P_{i}^{t+1}(w)-D_{j}^{t+1}(w)\right| }{P_{i}^{t+1}(w)+D_{j}^{t+1}(w)})} \end{aligned}$$
(5)
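Equations 4 and 5 translate into a few lines of code. The sketch below takes the predicted box and the detection as `(x, y, w, h)` tuples; the function names and the default weights `w1 = w2 = 1.0` are placeholders, since the paper does not report the values of \(w_{1}\) and \(w_{2}\).

```python
import math


def motion_affinity(pred, det, w1=1.0):
    """Eq. 4: position offset normalized by the detection size."""
    px, py, _, _ = pred
    dx, dy, dw, dh = det
    return math.exp(-w1 * (((px - dx) / dw) ** 2 + ((py - dy) / dh) ** 2))


def shape_affinity(pred, det, w2=1.0):
    """Eq. 5: relative difference in height and width."""
    _, _, pw, ph = pred
    _, _, dw, dh = det
    return math.exp(-w2 * (abs(ph - dh) / (ph + dh)
                           + abs(pw - dw) / (pw + dw)))


pred = (100.0, 50.0, 40.0, 80.0)   # Kalman prediction for frame t+1
det = (104.0, 52.0, 42.0, 78.0)    # a candidate detection in frame t+1
print(motion_affinity(pred, det), shape_affinity(pred, det))
```

Both affinities lie in (0, 1] and equal 1 only when the prediction and the detection coincide in position or in size, respectively.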

\(A_{i}^{q}(FB,UB,DB,LB,RB)\) is a queue which stores the part-based deep appearance feature vectors of the last q frames. \(F_{j}(FB,UB,DB,LB,RB)\) denotes the appearance feature vectors of detection \(D_{j}\), and Olabel is its occlusion label. The largest cosine similarity between corresponding feature vectors in \(F_{j}\) and the \(A_{i}^{q}\) queue is regarded as the appearance similarity. When \(D_{j}\) is non-occluded, the FB feature vector is employed for similarity computation and \(A_{i}^{q}\) is updated with all five types of feature vectors. When \(D_{j}\) is part-occluded, the maximum similarity over the four divided parts is employed, and only the feature vector that yields this similarity is used to update \(A_{i}^{q}\). When \(D_{j}\) is severe-occluded, the appearance similarity is not used and \(A_{i}^{q}\) is not updated. In our experiments, the parameters q and d are set to 6, as most occlusions in the MOT dataset last for fewer than 6 frames. Two linear SVMs are trained to fuse the affinities: one fuses the two remaining types (motion and shape) in the severe-occlusion case, and the other fuses all three types otherwise. Each yields a final affinity in the range [0,1].
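The occlusion-guided selection and update rules can be sketched as below. The data layout (a `deque` of dictionaries mapping part names to feature vectors) and all function names are assumptions made for illustration; the SVM fusion step is not shown.

```python
import math
from collections import deque

PARTS = ("UB", "DB", "LB", "RB")


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))


def appearance_similarity(queue, feats, label):
    """Return (similarity, parts_to_store) for one tracklet/detection pair.

    queue: stored entries of the tracklet, each a dict part -> vector.
    feats: dict part -> vector for the detection.
    Returns (None, ()) when appearance must be ignored.
    """
    if label == "severe-occluded":
        return None, ()
    candidates = ("FB",) if label == "non-occluded" else PARTS
    best_part, best_sim = None, None
    for stored in queue:
        for part in candidates:
            if part in stored:
                s = cosine(stored[part], feats[part])
                if best_sim is None or s > best_sim:
                    best_part, best_sim = part, s
    if best_part is None:
        return None, ()
    # Non-occluded: store all five vectors; part-occluded: only the part used.
    store = ("FB",) + PARTS if label == "non-occluded" else (best_part,)
    return best_sim, store


def update_queue(queue, feats, store):
    """Append the selected feature vectors after a successful association."""
    if store:
        queue.append({p: feats[p] for p in store})


history = deque(maxlen=6)  # q = 6 as in the paper
history.append({"FB": [1, 0], "UB": [1, 0], "DB": [0, 1],
                "LB": [1, 0], "RB": [0, 1]})
detection = {"FB": [0, 1], "UB": [0, 1], "DB": [0, 1],
             "LB": [0, 1], "RB": [1, 0]}
print(appearance_similarity(history, detection, "part-occluded"))
```

With a bounded `deque`, the oldest entry is dropped automatically once more than q entries have been stored.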

Association Procedure. The Hungarian algorithm is employed to obtain the global optimum based on the affinity matrix. An affinity threshold \(\tau _{1}\) is set to filter out unreliable associations whose affinity score is lower than the threshold. During association, tracklets with long length and high association affinities in previous frames should be more reliable and are associated first. Each tracklet is therefore modeled with a confidence score \(conf_{i}\), calculated as in Eq. 6, where \(sim_{k}\) denotes the association scores obtained in previous steps. A confidence threshold \(\tau _{2}\) is set to divide the tracklets into high-confidence and low-confidence tracklets. The association procedure is performed on them hierarchically and is summarized in Algorithm 1.

$$\begin{aligned} conf_{i}=\frac{\sum _{k=2}^{length(T_{i})}sim_{k}}{length(T_{i})-1}(1-e^{-w_{3}*length(T_{i})}) \end{aligned}$$
(6)
Algorithm 1. Association procedure.
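The confidence score of Eq. 6 and the split by \(\tau _{2}\) can be sketched as follows. The value of \(w_{3}\) is not given in the paper, so the default below is a placeholder; returning 0 for a tracklet of length 1 (where Eq. 6 is undefined) and treating a confidence equal to \(\tau _{2}\) as high are also assumptions. The matching steps of Algorithm 1 themselves are not reproduced.

```python
import math


def tracklet_confidence(assoc_scores, w3=0.1):
    """Eq. 6 for one tracklet.

    assoc_scores holds sim_2 ... sim_L, one score per association, so the
    tracklet length L is len(assoc_scores) + 1.
    """
    length = len(assoc_scores) + 1
    if length < 2:
        return 0.0  # assumption: a brand-new tracklet has no confidence yet
    mean_sim = sum(assoc_scores) / (length - 1)
    return mean_sim * (1.0 - math.exp(-w3 * length))


def split_by_confidence(confidences, tau2=0.3):
    """Return the indices of high- and low-confidence tracklets."""
    high = [i for i, c in enumerate(confidences) if c >= tau2]
    low = [i for i, c in enumerate(confidences) if c < tau2]
    return high, low


confs = [tracklet_confidence([0.9] * 30), tracklet_confidence([0.9, 0.8])]
print(confs, split_by_confidence(confs))
```

The second factor of Eq. 6 grows with the tracklet length, so a long tracklet with the same average association score receives a higher confidence than a short one.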

3 Experiment

3.1 Evaluation Metrics

A good tracker should find the correct number of objects and associate them with the correct tracklets when a new frame arrives. At the same time, it should track each object consistently and avoid mismatches. Based on these criteria, most works adopt MOTA as the main metric to evaluate tracker performance, which is calculated as follows:

$$\begin{aligned} MOTA=1-\frac{\sum _{t}(FN_{t}+FP_{t}+IDSW_{t})}{\sum _{t}GT_{t}} \end{aligned}$$
(7)

In the above formula, FN denotes the number of missed objects, FP the number of false positives, and IDSW the number of mismatches. However, in most cases the number of FN is one order of magnitude higher than that of FP and two orders of magnitude higher than that of IDSW. This means that reducing IDSW contributes little to MOTA. In addition, a mismatch should not be treated as equal to an FP. With the recent improvement in detector precision, the numbers of FP and FN have dropped considerably, so we call for more attention to the consistency of trajectories. MOTA is a good indicator of tracking accuracy but is not capable of evaluating consistency, so we adopt the ID switch rate, ID precision, ID recall and IDF\(_{1}\) introduced in [19] to evaluate the consistency of trajectories. IDF\(_{1}\) is calculated by matching trajectories to the ground truth so as to minimize the sum of discrepancies between corresponding pairs. Unlike MOTA, it penalizes an ID switch over the whole trajectory fragment that carries the wrong ID, and can evaluate how well computed identities conform to true identities [19].
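A short sketch of Eq. 7 makes the point about IDSW concrete. The per-frame counts below are made-up numbers chosen only to reflect the stated orders of magnitude; they are not results from the paper.

```python
def mota(fn, fp, idsw, gt):
    """Eq. 7: fn, fp, idsw and gt are per-frame count sequences."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)


# Hypothetical totals over a sequence: FN >> FP >> IDSW.
gt = [10000]
base = mota([4000], [400], [40], gt)
halved_idsw = mota([4000], [400], [20], gt)
print(base, halved_idsw)  # halving IDSW moves MOTA by only 0.002
```

Under these assumed counts, removing half of all identity switches changes MOTA by 0.2 percentage points, which is why a separate consistency metric is needed.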

Besides the above evaluation metrics, the following common metrics are also adopted to evaluate our tracker comprehensively:

  • MT: Mostly tracked targets [2]. The ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their respective life span.

  • ML: Mostly lost targets [2]. The ratio of ground-truth trajectories that are covered by a track hypothesis for at most 20% of their respective life span.

  • MOTP: Multiple Object Tracking Precision [2]. The misalignment between the annotated and the predicted bounding boxes.
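The MT and ML definitions above reduce to simple thresholding once the covered fraction of each ground-truth trajectory is known. The sketch below assumes that these fractions have already been computed; the function name is hypothetical.

```python
def mt_ml(coverage):
    """Return (MT, ML) ratios.

    coverage: one value in [0, 1] per ground-truth trajectory, giving the
    fraction of its life span covered by a track hypothesis.
    """
    n = len(coverage)
    mt = sum(1 for c in coverage if c >= 0.8) / n   # mostly tracked
    ml = sum(1 for c in coverage if c <= 0.2) / n   # mostly lost
    return mt, ml


print(mt_ml([0.9, 0.8, 0.5, 0.1]))  # (0.5, 0.25)
```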

3.2 Thresholds Selection

To obtain a robust affinity threshold \(\tau _{1}\) and confidence threshold \(\tau _{2}\), we run a grid search with our tracker on the MOT16 training set. The relationship between MOTA and the two thresholds is shown in Fig. 3. We set \(\tau _{1}\) to 0.4 and \(\tau _{2}\) to 0.3 for the remaining experiments. Fig. 3 also demonstrates that adopting the confidence-based association metric improves tracking accuracy.
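A grid search of this kind can be sketched generically. The `evaluate` callable stands for running the tracker on the training sequences and returning MOTA; here it is replaced by a toy function, and the grid values are illustrative rather than those used in the paper.

```python
from itertools import product


def grid_search(evaluate, tau1_values, tau2_values):
    """Return (best_score, tau1, tau2) over all threshold pairs."""
    best = None
    for tau1, tau2 in product(tau1_values, tau2_values):
        score = evaluate(tau1, tau2)
        if best is None or score > best[0]:
            best = (score, tau1, tau2)
    return best


def toy_mota(tau1, tau2):
    """Stand-in for a real tracker evaluation (hypothetical)."""
    return -((tau1 - 0.4) ** 2 + (tau2 - 0.3) ** 2)


grid = [0.2, 0.3, 0.4, 0.5]
print(grid_search(toy_mota, grid, grid))
```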

Fig. 3. Thresholds selection on the MOT16 training set.

3.3 Runtime

To investigate the feature extraction speed of our part-based deep appearance network comprehensively, we test our network and other trackers that adopt a deep appearance model and take image patches as input on the same platform. The feature extraction speed is tested on a Quadro M4000 GPU and an Intel E5V3 CPU and is shown in Table 1. Dan and Pdan denote our full-part and part-based deep appearance networks respectively. Compared with other trackers, our deep model runs faster with a smaller batch size, and the added part model causes only a minor speed loss. The confidence-based association itself is not very fast, at about 5.16 fps, which is mostly owing to the large number of objects, but our part-based deep appearance network can be conveniently transplanted to other association methods.

Table 1. The speed and consumption for feature extraction

3.4 Experiment Result

Table 2 shows the tracking results on the MOT16 test set. Hist denotes the histogram appearance model, and Dan-OD denotes the full-part deep network without the guidance of the occlusion detector for appearance model update. Trackers marked with * adopt the same detections supplied in [26]. The results show that adopting the part-based deep appearance network and the occlusion detector clearly improves tracking accuracy and consistency. Compared with the histogram appearance model, the number of ID switches drops from 1014 to 762, and both ID precision and ID recall improve. The reduction of mismatches also increases the MT rate, which means our tracker is more capable of producing long and consistent trajectories.

Table 2. Tracking results on MOT16 test Dataset with private detector
Table 3. Overall performance on MOT17 test dataset with public detections
Table 4. Tracking results on MOT17 test dataset based on different public detections

Tables 3 and 4 show the overall performance and the separate results for the different detectors on the MOT17 benchmark respectively. The MOT17 benchmark provides three sets of detection results, from the DPM [9], FasterRCNN [18] and SDP [25] detectors. As most trackers in the MOT ranking list are anonymous submissions, we select trackers with an explicit source for comparison. As shown in Table 3, our tracker achieves competitive performance compared with other online trackers, and both consistency and accuracy gain a significant improvement. Compared with the FWT-17 [11] tracker, our tracker yields a higher IDF\(_{1}\) score and a lower ID switch rate, which demonstrates that our trajectories are more consistent. The overall accuracy of our tracker is lower than that of FWT-17. This is mostly due to our poor performance on the weak DPM detections, and it reflects the inherent disadvantage of online association relative to batch association, since batch methods take frames from future time steps into consideration. Some sampled trajectories are shown in Fig. 4, where the numbers following ‘#’ denote the frame numbers.

Fig. 4. Sampled trajectories in the MOT17 benchmark.

4 Conclusion

In this paper, we propose a part-based deep network which employs the ROI pooling method to extract part-based appearance features and thereby overcome the part-occlusion problem. An occlusion detector is proposed to predict the occlusion degree and guide the similarity fusion and the appearance model update. Extensive experiments show that our tracker is more capable of producing long and consistent trajectories. Both its consistency and its accuracy are competitive on the MOT benchmark.