1 Introduction

Multi-object tracking (MOT) is an emerging technology employed in many real-world applications such as video security, gesture recognition, robot vision, and human-robot interaction [1–15]. The main challenge is the drift of tracking points due to appearance variations caused by noise, illumination, pose, cluttered backgrounds, interactions, occlusion, and camera movement. Most MOT methods suffer from a varying number of objects, which leads to performance degradation and impaired tracking accuracy in cluttered backgrounds. Moreover, most of them focus on only a limited set of categories, usually people or vehicles. MOT with an unlimited number of object classes has rarely been studied because of its complexity and high computational requirements.

The Bayesian filter consists of a motion dynamics model and an observation model, from which it estimates posterior likelihoods. One Bayesian-filter-based object tracking method is the Markov chain Monte Carlo (MCMC)-based method [25], which can handle various object moves and interactions among multiple objects. Most MCMC-based methods assume that the number of objects does not change over time, which is not acceptable in real-world applications. Reversible jump MCMC (RJMCMC) was proposed in [2, 4] to handle a variable number of objects with different motion changes, such as update, swap, birth, and death moves. These methods start a new track by initializing a new object, or terminate a currently tracked object by eliminating it.

Even though MCMC-based MOT approaches have been successful to some extent, their computational overhead is very high because of the high-dimensional state space. Variations in appearance, interactions and occlusions, and a changing number of moving objects are all challenging and add to the computational overhead. Saka et al. [1] propose MCMC sampling with low computational overhead by separating the motion dynamics into birth and death moves, and an iteration loop of the Markov chain for the update and swap moves. If the birth and death moves are determined inside the MCMC chain, the sampling must handle dimension changes, as in [2, 3]. Because separating the birth and death moves leaves the Markov chain with no dimension variation inside the iteration loop, the chain can reach a stationary state with less computational overhead [1, 6]. However, such a simple approach to separating birth and death dynamics cannot deal with the complex situations that occur in MOT, and many of these methods suffer from track drift due to appearance variations.

In this paper, we propose a robust multi-class multi-object tracking (MCMOT) method that handles an unlimited number of object classes by combining detection responses with a changing point detection (CPD) algorithm. With advances in deep-learning-based object detection technology such as Faster R-CNN [28] and ResNet [29], it has become feasible to adopt a detector ensemble covering an unlimited number of object classes. The detector ensemble combines a model-based detector implemented with Faster R-CNN [28] and a motion detector based on the Kanade-Lucas-Tomasi (KLT) tracker [26]. The method separates the motion dynamic model of the Bayesian filter into entity transitions and motion moves. The entity transitions are modeled as birth and death events. The observation likelihood is calculated by a more sophisticated, deep-learning-based, data-driven algorithm. The drift problem, one of the most cumbersome problems in object tracking, is addressed by a CPD algorithm, similarly to [24]. Assuming smooth motion dynamics, abrupt changes in the observations, which are associated with illumination, cluttered backgrounds, pose, and scale, are handled by the CPD algorithm. The main contributions of the paper are as follows:

  • MCMOT can track a varying number of objects of unlimited classes; tracking is formulated as estimating the likelihood of foreground regions with optimal smoothness. Departing from likelihood estimation restricted to limited object types, such as pedestrians or vehicles, an efficient convolutional neural network (CNN) based multi-class object detector is employed to compute the likelihoods of multiple object classes.

  • Changing point detection is proposed for assessing tracking failure by exploiting static observations as well as dynamic ones. Drifts in MCMOT are identified by detecting abrupt change points between the stationary time series that represent track segments.

This paper is organized as follows. We review related work in Sect. 2. In Sect. 3, the outline of MCMOT is discussed. Section 4 introduces our proposed tracking method. Section 5 describes the experiments, and concluding remarks and future directions are discussed in Sect. 6.

2 Related Work

2.1 Multi Object Tracking

Recent research in MOT has focused on the tracking-by-detection principle, which performs data association by linking object detections through a video sequence. The majority of batch methods formulate MOT using information from future frames to obtain better data association via hierarchical track association [13], network flows [12], and global trajectory optimization [11]. However, batch methods have a relatively high computational cost. Online methods, in contrast, consider only information from past and current frames to solve the data association problem. Online methods are more suitable for real-time applications, but they are prone to drift because objects in a video show significant variations in appearance due to noise, illumination, pose, viewing angle, occlusion, and shadows; objects also enter or leave the scene, and sometimes make sharp turns or abrupt stops. A dynamically varying number of objects is difficult to handle, especially when tracking crowded or high-traffic scenes as in [9, 10, 14]. Most MOT methods that rely on observing different features are prone to drift. Against this nonstationarity and nonlinearity, stochastic tracking [22–24] appears superior to deterministic tracking such as the Kalman filter [33] or particle filter [2].

2.2 Convolutional Neural Network

In the last few years, CNNs have brought considerable improvements to computer vision tasks. One particularly remarkable study is R-CNN [34], which transferred CNN-based image classification to CNN-based object detection using a region-based approach. SPPnet [35] and Fast R-CNN [36] extend R-CNN by pooling convolutional features from a shared convolutional feature map. More recently, the region proposal network (RPN) [28] was proposed to generate region proposals within the R-CNN framework. These region-based CNN pipelines outperform all previous work by a significant margin. Despite the great success of CNNs, only a few MOT algorithms using CNN representations have been proposed [20–22]. In [20, 21], a CNN-based framework with a simple object tracking algorithm was proposed for the MOT task in ImageNet VID. In [22], a CNN-based object detector was used for the MOT Challenge [32]. Our experiments adopt this region-based CNN paradigm to build the observation model.

Fig. 1. The MCMOT framework has four major steps: (a) likelihood calculation based on observation models, (b) track segment creation, (c) changing point detection, and (d) trajectory combination. Drifts in segments are effectively controlled by the changing point detection algorithm with forward-backward validation.

3 The Outline of MCMOT

We propose an efficient multi-class multi-object tracker, called MCMOT, that can deal with object birth, death, occlusion, interaction, and drift. MCMOT may fail due to miscalculation of the observation likelihood, interaction model, entry model, or motion model. The objective of MCMOT is to stop tracking as quickly as possible if a drift occurs, recover from wrong decisions, and continue tracking. Figure 1 illustrates the main concept of our framework.

In MCMOT, objects are denoted by bounding boxes, which are tracked by a tracking algorithm. If a possible interaction or occlusion is detected, the trajectory is split into several parts, called track segments. The combination of track segments is controlled by CPD. Because the tracker's decisions are fallible, CPD monitors for drift caused by abnormal events or abruptly changing environments by comparing the localized bounding boxes with the observations within the segment. The motion-based tracking component uses KLT [26] adaptively to predict the region of the next tracking point. The model-based component consists of a global object detector and an adaptive local detector. We use a deep-feature-based multi-class object detector [28] as both the global and the local object detector. Note that the number of object categories can be readily extended depending on the capability of the object detector.

4 Multi-class Multi-object Tracking

MCMOT employs a data-driven approach that examines object-level events, i.e., object birth and death; inter-object-level events, i.e., interaction and occlusion between objects; and tracking-level events, e.g., track birth, update, and death. Possible drifts due to observation failures are handled by an abnormality detection method based on changing point detection.

We define track segments using birth and death detection. Only visible objects are tracked, and the holistic trajectory is divided into several track segments if an occlusion happens, as in [16]. If the object becomes ambiguous due to occlusion or noise, the track segment is terminated (associated object death). If the same object reappears, the tracker restarts tracking (associated object birth) near the terminated tracking point; the existing track segment is then continued if required, or a new track segment is started and merged later.

4.1 Observation Model

We define the observation model (observation likelihood) \(P\mathrm{(}\mathbf{z}_{t} |\mathbf{x}_{t} \mathrm{)}\) in this section. The observation likelihood for tracked objects needs to estimate both the object class and an accurate location. MCMOT combines object detectors with different characteristics into an ensemble to calculate the observation likelihood accurately. Since the dimensionality of the scene state is allowed to vary, the measure is defined as the ratio of the likelihoods of existence and non-existence. As the likelihood of the non-existence set cannot be measured, we adopt a soft max \(f\mathrm{(}\cdot \mathrm{)}\) of the likelihood model, as in [18].

$$\begin{aligned} \frac{P(\tilde{\mathbf{o}}_{t} |\mathbf{o}_{id,t} )}{P(\tilde{\mathbf{o}}_{t} |\not \mathbf{o}_{id,t} )} =\exp \left( \sum _{e}f\left( \lambda _{e} \log P_{e} (\tilde{\mathbf{o}}_{t} |\mathbf{o}_{id,t} )\right) \right) \end{aligned}$$
(1)

where \(\not \mathbf{o}_{id,t} \) indicates the non-existence of object id, f is the soft max function, and \(\lambda _{e} \) is the weight of object detector e. The approach is expected to be robust to sporadic noise, since each detector has its own strengths and weaknesses. We employ an ensemble of object detectors: a deep-feature-based global object detector (GT), a deep-feature-based local object detector (LT), a color detector (CT), and a motion detector (MT):

  • Global object detector (GT): A deep-feature-based object detector [28] in the form of the hierarchical data model (HDM) [44] is used.

  • Local object detector (LT): By fine-tuning the deep-feature-based object detector on confident track segments, issues due to false negatives can be minimized. The deep-feature-based object detector [28] is used as the local object detector.

  • Color detector (CT): A similarity score between the observed appearance model and the reference target is calculated with the Bhattacharyya distance [17] using the RGB color histogram of the bounding box.

  • Motion detector (MT): The presence of an object is checked using a KLT-based motion detector [26], which detects the presence of motion in a scene.
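The ensemble likelihood ratio of Eq. (1) can be illustrated with a minimal sketch. It reads Eq. (1) as the exponential of a sum, over detectors, of a soft max applied to the weighted log-likelihood of each detector. The exact form of f is not given in the text, so the sketch assumes a smooth maximum between its argument and a floor value; `soft_max`, `floor`, `eps`, and the example scores and weights are hypothetical. The color score is taken as the Bhattacharyya coefficient of two histograms, which is one common way to turn the Bhattacharyya distance into a similarity.

```python
import math


def soft_max(x, floor=-5.0):
    """Smooth maximum of x and a floor value (an assumed form of f, not from the paper)."""
    return math.log(math.exp(x) + math.exp(floor))


def bhattacharyya_coefficient(hist_a, hist_b):
    """Similarity in [0, 1] between two histograms; each is normalised to sum to 1."""
    sum_a, sum_b = sum(hist_a), sum(hist_b)
    return sum(math.sqrt((a / sum_a) * (b / sum_b)) for a, b in zip(hist_a, hist_b))


def likelihood_ratio(detector_likelihoods, weights, eps=1e-12):
    """exp(sum_e f(lambda_e * log P_e)) for a dict of per-detector likelihoods."""
    total = 0.0
    for name, p in detector_likelihoods.items():
        # eps guards against log(0) when a detector reports zero likelihood.
        total += soft_max(weights[name] * math.log(max(p, eps)))
    return math.exp(total)


# Hypothetical scores for the four detectors named above.
reference_hist = [10, 20, 30, 40]
observed_hist = [12, 18, 33, 37]
scores = {
    "GT": 0.9,
    "LT": 0.8,
    "CT": bhattacharyya_coefficient(reference_hist, observed_hist),
    "MT": 0.7,
}
weights = {"GT": 1.0, "LT": 1.0, "CT": 0.5, "MT": 0.5}
ratio = likelihood_ratio(scores, weights)
```

Because the assumed f is bounded below by the floor, a single detector that reports a near-zero likelihood cannot drive the whole ratio to zero, which matches the stated goal of robustness to sporadic noise.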

4.2 Track Segment Creation

MCMOT models the tracking problem as determining the optimal scene particles in a given video sequence. MCMOT can be thought of as repeatedly reallocating objects from the current scene state to the next scene state. First, birth and death allocations are performed in the entity status transition step. Second, intermediate track segments are built in the data-driven MCMC sampling step, under the assumption that the appearances and positions of track segments change smoothly. In the final step, track drift is detected by a changing point detection algorithm to prevent possible drifts. A change point denotes a time step at which the data attributes change abruptly [24], and it is expected to be the starting point of a drift with high probability. We discuss the details of the data-driven MCMC sampling and the entity status transition in the following.

Data-Driven MCMC Sampling. In MCMC-based sampling, the efficiency of the proposal density function is important, since it strongly affects the construction of a Markov chain with a stationary distribution and thus the tracking performance in practice. The proposal density function should be measurable and should allow efficient sampling from the proposal distribution [2], which is proportional to the desired target distribution. We employ a “one object at a time” strategy, in which one object state is modified at a time, as in [2, 7]. Given a particle \(\mathbf{x}_{t} \) at time t, the current proposal density function \(\pi \mathrm{(}\mathbf{x'};\mathbf{x}_{t} \mathrm{)}\) is used to suggest the next particle. In MCMOT, we assume that the proposal density follows the pure motion model for the MCMC sampling, i.e., \(\pi \mathrm{(}\mathbf{x'};\mathbf{x}_{t} \mathrm{)}\approx P\mathrm{(}\mathbf{x}_{t+1} |\mathbf{x}_{t} \mathrm{)}\), as in [2]. Given a scene particle, i.e., a set of object states \(\mathbf{x}_{t} \), a candidate scene particle \(\mathbf{x}'_{t} \) is suggested by randomly selecting an object \(\mathbf{o}_{id,t} \) with uniform probability and then determining the proposed state \(\mathbf{x}'_{t} \) from that object. In this paper, a data-driven proposal density strategy [3] is employed to give the Markov chain a better acceptance rate. MCMOT proposes a new state \(\mathbf{o}'_{id,t} \) according to the informed proposal density, with a mixture of state moves to ensure motion smoothness, as in [6]:

$$\begin{aligned} \pi (\mathbf{o}'_{id,t};\mathbf{x}_{t} )=\left[ \lambda _{1} \frac{1}{N} \sum _{s}p(\mathbf{o}'_{id,t} |\mathbf{o}_{id,t-1}^{(s)} )+\lambda _{2} \, p(\mathbf{o}'_{id,t} |\mathbf{D}_{id,t} )\right] \end{aligned}$$
(2)

where \(\lambda _{1} +\lambda _{2} =1\). The first term comes from the motion model, and the second term comes from the detector ensemble, using the detection closest to object id among all detections.
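Drawing a candidate from the two-component mixture in Eq. (2) can be sketched as follows. This is only an illustration under stated assumptions: the object state is reduced to a 2-D position, both conditional densities are taken to be isotropic Gaussians with a hypothetical standard deviation `sigma`, and the "closest" detection is measured from the mean of the previous samples. The function name `propose_state` and all parameter values are hypothetical.

```python
import math
import random


def propose_state(prev_samples, detections, lam1=0.5, sigma=2.0, rng=random):
    """Draw a candidate position for one object from a two-component mixture.

    prev_samples: (x, y) positions of the object in the N particles at time t-1.
    detections:   (x, y) detection centres at time t (may be empty).
    lam1:         weight of the motion term; the detector term gets 1 - lam1.
    """
    if detections and rng.random() >= lam1:
        # Detector term: centre the proposal on the detection closest to the
        # mean previous position (assumed notion of "closest").
        mean_x = sum(p[0] for p in prev_samples) / len(prev_samples)
        mean_y = sum(p[1] for p in prev_samples) / len(prev_samples)
        cx, cy = min(detections, key=lambda d: math.hypot(d[0] - mean_x, d[1] - mean_y))
    else:
        # Motion term: pick one of the N previous samples uniformly (the 1/N sum).
        cx, cy = rng.choice(prev_samples)
    # Assumed Gaussian spread around the chosen centre.
    return (rng.gauss(cx, sigma), rng.gauss(cy, sigma))
```

With `lam1` close to 1 the proposal behaves like the pure motion model, while smaller values pull candidates toward detector responses.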

Recall that the posterior probability at time step t−1 is assumed to be represented by a set of N samples (scene particles). Given the observations from the initial time to the current time t, the current posterior is calculated by MCMC sampling using N samples. We use B samples as burn-in samples [6]: the first B samples are generated and then discarded, so that the retained samples are drawn after the chain has approached its stationary distribution. More details and other practical considerations about MCMC can be found in [42].
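The role of the N retained samples and the B burn-in samples can be sketched with a generic Metropolis-Hastings loop that changes one object at a time. This is a simplified stand-in rather than the sampler used in MCMOT: it assumes a symmetric Gaussian random-walk proposal instead of the informed proposal of Eq. (2), a scene state stored as a dictionary from object id to position, and a user-supplied `log_likelihood` function; all names are hypothetical.

```python
import math
import random


def mcmc_posterior_samples(init_state, log_likelihood, n_samples, burn_in,
                           step=1.0, rng=random):
    """Return n_samples scene states after discarding burn_in initial states.

    init_state:     dict mapping object id -> (x, y).
    log_likelihood: callable scoring a full scene state (higher is better).
    """
    state = dict(init_state)
    current = log_likelihood(state)
    kept = []
    for i in range(burn_in + n_samples):
        # "One object at a time": pick a single object uniformly and perturb it.
        obj_id = rng.choice(sorted(state))
        x, y = state[obj_id]
        candidate = dict(state)
        candidate[obj_id] = (rng.gauss(x, step), rng.gauss(y, step))
        cand = log_likelihood(candidate)
        # Symmetric proposal, so the acceptance ratio is the likelihood ratio.
        if rng.random() < math.exp(min(0.0, cand - current)):
            state, current = candidate, cand
        if i >= burn_in:
            kept.append(dict(state))
    return kept
```

Only the states visited after the first `burn_in` iterations are returned, so the early part of the chain, which still depends on the initial state, does not bias the posterior estimate.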

Estimation of Entity Status Transition. The entity status is estimated by two binomial probabilities, the birth status and the death status, according to the entry model at time steps t and t−1. Let \(ES_{id,t}^{b} \mathrm{(}x,y\mathrm{)}=\nu \; \mathrm{(}\nu \in \mathrm{\{ 1,\; 0\} )}\) denote the birth status of an object id at position \(\mathrm{(}x,y\mathrm{)}\), with \(\nu \) = 1 indicating true and \(\nu \) = 0 false. Similarly, \(ES_{id,t}^{d} \mathrm{(}x,y\mathrm{)}=\nu \; \) denotes the death status. The posterior probability of the entry status at time t is defined as follows:

(3)

If a new object id that did not exist (was not detected) at time t−1 is observed by the observation likelihood model at time t in position (x,y), the birth status of object id, \(ES_{id,t}^{b} \mathrm{(}x,y\mathrm{)}\), is set to 1; otherwise, it is set to 0. If an object id that existed at time t−1 is not observed by the detector ensemble at time t in position (x,y), the death status of object id, i.e., \(ES_{id,t}^{d} \mathrm{(}x,y\mathrm{)}\), is set to 1; otherwise, it is set to 0.
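The birth and death rule above amounts to comparing the sets of object ids observed at consecutive time steps. The following sketch assumes that detections have already been associated with object ids and ignores the position argument (x, y) for brevity; the function name `entity_status` is hypothetical.

```python
def entity_status(prev_ids, curr_ids):
    """Return (birth, death) dictionaries mapping object id -> 0 or 1.

    prev_ids: ids observed at time t-1.
    curr_ids: ids observed at time t.
    """
    prev_ids, curr_ids = set(prev_ids), set(curr_ids)
    all_ids = prev_ids | curr_ids
    # Birth: observed now but not at the previous time step.
    birth = {i: int(i in curr_ids and i not in prev_ids) for i in all_ids}
    # Death: observed at the previous time step but not now.
    death = {i: int(i in prev_ids and i not in curr_ids) for i in all_ids}
    return birth, death
```

For example, with ids {1, 2} at time t−1 and {2, 3} at time t, object 3 is born, object 1 dies, and object 2 has both statuses equal to 0.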

Fig. 2. Illustration of CPD. A change point score is calculated by the changing point detection algorithm. If a high change point score is detected, the forward-backward (FB) error is checked from the detected change point to determine whether the segment has drifted. A possible track drift is thus determined by the change point detection method with forward-backward validation.

4.3 Changing Point Detection

MCMOT may fail to track an object if it is occluded or confused with a cluttered background. MCMOT must therefore determine whether a track is terminated or tracking continues. Drifts in MCMOT are identified by detecting abrupt change points between the stationary time series that represent track segments. A higher response indicates higher uncertainty and a high possibility that a drift has occurred [25]. A two-stage time-series learning algorithm is used as in [24], in which a possible track drift is determined by a change point detection method [24] as follows. The second-level time series is built from the scanned average responses to reduce outliers in the time series. The procedure to prevent drift is illustrated in Fig. 2.
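The two-stage structure can be illustrated with a deliberately simplified sketch. It is not the algorithm of [24]: the learned time-series model is replaced by a plain running mean, and the outlier score by an absolute deviation from that mean. Only the overall shape is kept, namely score, smooth, score again, smooth again, then threshold. The window size is hypothetical, and the default threshold of 0.3 is the value reported in the implementation details; since the scores of this sketch are not normalised in the same way, reusing that value here is an assumption.

```python
def moving_average(values, window):
    """Average of each value with up to window - 1 preceding values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def deviation_scores(series, window):
    """Absolute deviation of each value from the mean of the preceding window."""
    scores = [0.0]
    for i in range(1, len(series)):
        past = series[max(0, i - window): i]
        scores.append(abs(series[i] - sum(past) / len(past)))
    return scores


def change_point_scores(likelihoods, window=3):
    """Two-stage scoring: first-level scores are smoothed, then scored again."""
    first_level = moving_average(deviation_scores(likelihoods, window), window)
    second_level = moving_average(deviation_scores(first_level, window), window)
    return second_level


def detect_change_points(likelihoods, threshold=0.3, window=3):
    """Indices whose second-level score exceeds the threshold."""
    scores = change_point_scores(likelihoods, window)
    return [t for t, s in enumerate(scores) if s > threshold]
```

Smoothing the first-level scores before the second pass suppresses isolated outliers, so only sustained changes in the observation likelihood of a track segment produce a high final score. The smoothing also delays the response by a few frames relative to the true change.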

If a high CPD response is detected on a track segment, forward-backward error (FB error) validation [7] is used to estimate the confidence of the track segment by tracking it in reverse order. Given a video, the confidence of track segment \(\tau _{t} \) is to be estimated. Let \(\tau _{t}^{r} \) denote the reverse sequential states, i.e., \(\mathbf{o}_{id,t:1} =\mathrm{\{ }\hat{\mathbf{o}}_{id,t} ,\ldots ,\hat{\mathbf{o}}_{id,1} \mathrm{\} }\). The backward track is a random trajectory that is expected to be similar to the correct forward track. The confidence of a track segment is defined as the distance between these two track segments: \(\mathrm{Conf}(\tau _{t} |\tau _{t}^{r} )=\mathrm{distance}(\tau _{t} ,\tau _{t}^{r} )\). We use the Euclidean distance between the initial point and the end point of the validation trajectory, \(\mathrm{distance}(\tau _{t} ,\tau _{t}^{r} )=||\mathbf{o}_{id,1:t} -\mathbf{o}_{id,t:1} ||\).
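The FB error described above can be sketched as the distance between where the forward track started and where the backward track ends. The sketch assumes that both tracks are lists of 2-D box centres and that a segment is flagged as drifted when the error exceeds a hypothetical threshold `max_error`, which is not specified in the text.

```python
import math


def forward_backward_error(forward_track, backward_track):
    """Euclidean distance between the forward start and the backward end point.

    forward_track:  [(x, y), ...] ordered from time 1 to t.
    backward_track: [(x, y), ...] ordered from time t back to time 1.
    """
    x_start, y_start = forward_track[0]
    x_end, y_end = backward_track[-1]
    return math.hypot(x_start - x_end, y_start - y_end)


def is_drifted(forward_track, backward_track, max_error=10.0):
    """Flag the segment when the FB error exceeds an assumed pixel threshold."""
    return forward_backward_error(forward_track, backward_track) > max_error
```

If the tracker is consistent, tracking backward from the last state returns close to the original starting point and the error is near zero; a large error suggests that the forward track drifted somewhere inside the segment.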

The MCMOT algorithm is summarized as follows:

[Figure a: MCMOT algorithm listing]

5 Experiment Results

We describe the details of the MCMOT experimental setting and demonstrate the performance of MCMOT compared with state-of-the-art methods on challenging video sequences.

5.1 Implementation Details

To build the global and local object detectors, we use the publicly available sixteen-layer VGG-Net [19] and ResNet [29], which are pre-trained on an ImageNet classification dataset. We fine-tune an initial model on the ImageNet Challenge Detection dataset (ImageNet DET) for 280 K iterations at a learning rate of 0.001. After 280 K iterations, the learning rate is decreased by a factor of 10 and fine-tuning continues for 70 K iterations. For region proposal generation, RPN [28] is employed because it is fast and provides accurate region proposals in an end-to-end manner by sharing convolutional features. After building the initial model, we perform domain adaptation for each dataset by fine-tuning with steps similar to those described above. For changing point detection, we use a two-stage time-series learning algorithm [24], which is computationally efficient and achieves high detection accuracy. A time step is considered a change point when its change point score is greater than the change point threshold, which is empirically set to 0.3.
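The fine-tuning schedule described above is a single-step learning rate decay, which can be written as a small function. The function name and parameter names are hypothetical; only the values (0.001 for the first 280 K iterations, then one tenth of that for 70 K more) come from the text.

```python
def learning_rate(iteration, base_lr=0.001, step_at=280_000,
                  total=350_000, gamma=0.1):
    """Learning rate at a given fine-tuning iteration (0-based)."""
    if not 0 <= iteration < total:
        raise ValueError("iteration outside the fine-tuning schedule")
    # One step: the rate is multiplied by gamma after step_at iterations.
    return base_lr if iteration < step_at else base_lr * gamma
```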

5.2 Dataset

There are few benchmark datasets available for multi-class multi-object tracking [43], and they deal with only two or three classes. We therefore use the benchmark datasets ImageNet VID [31] and MOT 2016 [32]; the former has 30 object classes, and the latter is an up-to-date multiple object tracking benchmark. We compare the performance of MCMOT with the state of the art on ImageNet VID and MOT Benchmark 2016.

ImageNet VID. We demonstrate our proposed algorithm on the ImageNet object detection from video (VID) task dataset [31]. The ImageNet VID task was originally designed to evaluate the performance of object detection from video. Nevertheless, the dataset can be used to evaluate MCMOT, because it consists of challenging video sequences recorded with a moving camera in real-world scenes, with 30 object categories and a number of targets that changes over time. Object categories in these scenes appear from different viewpoints and are subject to various degrees of occlusion. To ease comparison with other state-of-the-art methods, the performance of MCMOT on this dataset is primarily measured by mean average precision (mAP), which is used in the ImageNet VID Challenge [31]. We use the initial release of the ImageNet VID dataset, which consists of three splits: train, validation, and test.

MOT Benchmark 2016. We evaluate our tracking framework on the MOT Benchmark [32], an up-to-date multiple object tracking benchmark. The MOT Benchmark collects new challenging sequences together with video sequences widely used in the MOT community. MOT 2016 consists of a total of 14 sequences in unconstrained environments, filmed with both static and moving cameras. All the sequences contain only pedestrians. These challenging sequences cover various configurations, such as different viewpoints and different weather conditions. Therefore, tracking algorithms tuned for a specific scenario or scene may not perform well. We adopt the CLEAR MOT tracking metrics [23] and use the MOT Benchmark Development Kit [32] for the evaluation.

Fig. 3. Change points obtained from a segment in the MOT16-02 and MOT16-09 sequences. A higher change point response indicates higher uncertainty and a high possibility that a drift has occurred. Notice that our method can effectively detect drifts in challenging situations.

5.3 MCMOT CPD Analysis

To investigate the proposed MCMOT changing point detection component, we select two sequences, MOT16-02 and MOT16-09, from the MOT 2016 training set. A change point is assigned when the change point score is larger than 0.3. Figure 3 illustrates the observation likelihood and the detected change points of a segment. A low likelihood or a rapid change in likelihood is an important cue for detecting a potential change point. In the tracking result for the MOT16-02 sequence in Fig. 3, an unstable likelihood is observed until frame 438, where a motion-blurred, half-body person moves. Tracking drifts at frame 440 because an occluded person appears at a position similar to the previously tracked point. After several frames, the target is swapped to another person at frame 444. In this case, the bounding boxes within the drift area are unstable, which shows up as strong fluctuation of the likelihood. The changing point detection algorithm produces a high change point score at frame 440 by detecting this fluctuation. The tracking result for the MOT16-09 sequence in Fig. 3 illustrates a similar situation. As we can see, a possible track drift is implicitly handled by the change point detection method.

Table 1. Effect of different components on the ImageNet VID validation set
Table 2. Tracking performance comparison on the ImageNet VID validation set
Fig. 4. MCMOT tracking results on the validation sequences in the ImageNet VID dataset. Each bounding box is labeled with the identity, the predicted class, and the confidence score of the segment. Viewing digitally with zoom is recommended.

5.4 ImageNet VID Evaluation

Since the official ImageNet Challenge test server is primarily used for the annual competition and allows only a limited number of submissions, we evaluate the proposed method on the validation set instead of the test set, following common practice [20] for the ImageNet VID task. For the ImageNet VID train/validation experiment, all training and testing images are scaled so that the shortest side of the image is 600 pixels. This value was selected so that VGG16 or ResNet fits in GPU memory during fine-tuning [28].
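The rescaling rule can be made concrete with a short sketch that computes the target image size from the original one. The function name `scaled_size` is hypothetical, and rounding to the nearest pixel is an assumption; the text specifies only the target length of the shortest side.

```python
def scaled_size(width, height, shortest_side=600):
    """Return (new_width, new_height) so that the shortest side has the target length."""
    scale = shortest_side / min(width, height)
    # The aspect ratio is preserved; both sides use the same scale factor.
    return round(width * scale), round(height * scale)
```

For a 1280 × 720 frame this gives 1067 × 600, so the longer side grows or shrinks in proportion to the shorter one.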

Table 1 shows the effect of the different components of MCMOT. The variants are MCMOT with the CPD algorithm (MCMOT CPD) and MCMOT using CPD with forward-backward validation (MCMOT CPD FB). In the following evaluations, we filter out segments whose average observation score is lower than 0.3. As shown in Table 1, adopting MCMOT CPD yields a significant improvement of 9.8 % over the detection baseline, reaching 71.1 %. After adding FB validation, an overall mAP of 74.5 % is achieved on the ImageNet VID validation set. Table 2 summarizes the accuracy of MCMOT and compares it with other state-of-the-art algorithms on all 281 validation video sequences. MCMOT achieves an overall mAP of 74.5 % on the ImageNet VID validation set, which is higher than state-of-the-art methods such as T-CNN [20]. This result is mainly due to the MCMOT approach of constructing highly accurate segments using CPD. As shown in Fig. 4, objects from an unlimited number of classes are successfully tracked with high localization accuracy using MCMOT.

Table 3. Tracking performance comparison on the MOT Benchmark 2016 (results on 7/14/2016). The symbol \(\mathrm {\uparrow }\) denotes that higher scores indicate better performance. The symbol \(\mathrm {\downarrow }\) denotes that lower scores indicate better performance.
Fig. 5. MCMOT tracking results on the test sequences in the MOT Benchmark 2016. Frames are sampled every 100 frames (these are not curated). The color of the boxes represents the identity of the targets. The figure is best viewed in color. (Color figure online)

5.5 MOT Benchmark 2016 Evaluation

We evaluate MCMOT on the MOT Challenge 2016 benchmark to compare our approach with other state-of-the-art algorithms. For the MOT 2016 experiment, all training and testing images are scaled so that the shortest side of the image is 800 pixels. This larger value is selected because pedestrian bounding boxes are smaller than those in ImageNet VID. In MCMOT, we also implement the hierarchical data model (HDM) [44], which is a CNN-based object detector. The reported timing excludes detection time.

Table 3 summarizes the evaluation metrics of MCMOT and other state-of-the-art methods on the test video sequences. Figure 5 visualizes examples of MCMOT tracking results on the test sequences. As shown in Table 3, MCMOT outperforms previously published state-of-the-art methods on the overall performance metric, multi-object tracking accuracy (MOTA). We also achieve a much smaller number of mostly lost targets (ML), by a significant margin. In addition to outperforming other methods on most metrics, MCMOT has a higher tracker speed in frames per second (HZ) than other tracking methods. This is thanks to the simple MCMC tracking structure with entity status transition and the selective FB error validation with CPD, which boost tracking speed on the multi-object tracking task. However, high identity switch (IDS) and fragmentation (FRAG) counts are observed because of the lack of identity mapping between track segments. More importantly, because MCMOT achieves state-of-the-art performance on two different datasets, we demonstrate that the approach is applicable to general multi-class multi-object tracking with an unlimited number of classes.
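For reference, MOTA in the CLEAR MOT metrics combines false negatives, false positives, and identity switches, summed over all frames, relative to the total number of ground-truth objects. A minimal sketch with hypothetical argument names follows; the benchmark development kit computes the official values, and the counts in the usage note are illustrative only.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """MOTA = 1 - (FN + FP + IDS) / GT, with all counts summed over all frames."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth
```

With 10 misses, 5 false positives, and 5 identity switches over 100 ground-truth objects, the function returns 0.8. Since identity switches enter the error count directly, a high IDS count lowers MOTA even when detection quality is good.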

6 Conclusion

This paper presented a novel multi-class multi-object tracking framework. The framework surpassed state-of-the-art results on ImageNet VID and the MOT Benchmark 2016. MCMOT performed object class association for an unlimited number of classes based on detection responses. The CPD model was used to detect abrupt or abnormal changes due to drift. An ensemble of a KLT-based motion detector and a CNN-based object detector was employed to compute the likelihoods. A future research direction is to address the optimization of the MCMOT structure and the identity mapping problem between track segments.