Robust tracking-by-detection using a selection and completion mechanism

It is challenging to track a target continuously in videos with long-term occlusion, or objects which leave then re-enter a scene. Existing tracking algorithms combined with online-trained object detectors perform unreliably in complex conditions, and can only provide discontinuous trajectories with jumps in position when the object is occluded. This paper proposes a novel framework of tracking-by-detection using selection and completion to solve the above-mentioned problems. It has two components, tracking and trajectory completion. An offline-trained object detector can localize objects in the same category as the object being tracked. The object detector is based on a highly accurate deep learning model. The object selector determines which object should be used to re-initialize a traditional tracker. As the object selector is trained online, it allows the framework to be adaptable. During completion, a predictive non-linear autoregressive neural network completes any discontinuous trajectory. The tracking component is an online real-time algorithm, and the completion part is an after-the-event mechanism. Quantitative experiments show a significant improvement in robustness over prior state-of-the-art methods.


Introduction
Object tracking aims to acquire the moving trajectories of objects of interest in video, and is a fundamental problem in computer vision. It plays a key role in applications like surveillance analysis [1,2] and traffic monitoring [3,4]. Decades of research have led to tremendous progress in this field. However, there is still a long way to go to achieve satisfactory results in many challenging videos with, e.g., violent shaking, long-term occlusion, or objects which leave then re-enter the scene. Traditional tracking methods achieve high accuracy in experimental tests, but perform poorly on practical problems. In most methods, object features are extracted in each frame and used to search for the object in the subsequent frame [5,6]. Errors can accumulate in this process. If occlusion or frame skipping occurs, tracking will fail because of the rapid change of appearance features in local windows.
Combining detection with tracking is a feasible solution to these problems [7]. In the tracking process, errors can accumulate, but a detector can be used to localize the object being tracked and re-initialize the tracker. It is evident that detection accuracy is essential, so a high decision threshold is set for the detector. This means that detection results are accurate but frequently unavailable. In recent years, deep learning has made significant strides in object detection. However, adaptive online training is still an open problem. The computational requirements of training and lack of training data make it hard to recognize a specific target amongst other objects which belong to the same category in a scene. Furthermore, most tracking frameworks can only provide a discontinuous trajectory with jumps in position when an object is occluded. However, in application scenarios such as safety monitoring, for reliable analysis results, we must infer the missing parts of occluded trajectories.
For these reasons, this work has designed a novel framework that decomposes the tracking task into two parts: tracking and trajectory completion. During the tracking stage, three modules are invoked for every frame: a simple tracker, a detection module, and a selection module. This allows the object to be tracked throughout the video. The tracker attempts to follow the object from one frame to another. In this process, tracking errors may accumulate, and if the object is occluded for a long time, the tracker will fail to follow the object. Thus we use the object detector and the object selector to determine the accurate location of the object to re-initialize the tracker. The object detector's job is to localize objects of the same kind as the object being tracked. The task of the object selector is to discriminate between them and determine which object should be used to re-initialize the tracker. For accuracy, we set a high decision threshold for both object detector and object selector, so the recall rate is low. Thus, the location of the object cannot be obtained in every frame by the detector and the selector. However, once the object is localized, the tracker will be re-initialized.
During the completion phase, we use a predictive neural network to complete the discontinuous trajectory. While the missing parts of the trajectory could be interpolated by a simple curve, e.g., a Hermite cubic spline, this is not a good approach as the missing trajectory may not be smooth or regular. We instead use a neural network, which is capable of learning the more complex behaviour of a real trajectory.
Our experimental results show that our method outperforms previous methods in cases in which the target objects are occasionally occluded, and can generate reliable trajectories for such objects.

Object tracking
Object tracking is the task of estimating the trajectory of a moving target in video. Traditional tracking algorithms start from object initialization in which the target is manually specified using a bounding box or ellipse. Motion estimation is the key phase in tracking. After the object has been modeled, particle filters [8] can be used to estimate object motion. There are two kinds of object modeling approaches: global object representations and local object representations. A variety of global visual representation methods are used for object tracking. Santner et al. [9] adapted an optical-flow-based representation and built a tracker using a single template model. Hedayati et al. [10] combined optical flow with mean shift of color signature to track multiple objects. Optical flow can provide spatiotemporal features of an object, but it cannot be applied to scenes with rapid changes in illumination. Zhao et al. [11] represented objects by color distribution. A differential earth mover's distance algorithm was used to calculate the distance between two distributions. Sun et al. [12] used fragment-based features, and handled occlusion by solving a two-stage optimization problem. Hu et al. [13] proposed an active contour-based visual tracking framework. Colors, shapes, and motions are combined to evolve the contour. Jepson et al. [14] used object representations based on filter responses from a steerable pyramid. Beyond traditional methods, neural networks can be used to perform object tracking without depending on hand-crafted features. Wang et al. [15] proposed an online training network to transfer pretrained deep features for tracking.
In contrast to global visual representations, local visual representations based on local appearance structures can be more robust to object deformation and illumination changes. Wang et al. [16] segmented superpixel regions surrounding the target, and then represented each superpixel by a feature vector. An appearance model based on superpixels was used to distinguish the object from its background. The scale-invariant feature transform (SIFT) [17] is a widely used local feature extraction algorithm; some approaches [18-20] use it to match regions of interest between frames in a tracking framework. Static and motion saliency features [21,22] and corner features [23] have also been commonly used in object tracking. However, local representations rely on rich texture, and they are unstable for low resolution images. Simple motion estimation tracking suffers from error accumulation and cannot deal with object occlusion or re-entry. Thus, combining tracking with detection is meaningful.

Tracking with detection
Some work has applied object detection to tracking systems, and these approaches are most similar to the approach we take. In Ref. [24], the identity of the tracked object was verified by a validator. If verification failed, an offline-trained object detector searched the entire image exhaustively. Li et al. [25] used a probabilistic model combining conventional tracking and object detection to track objects in low frame rate (LFR) video; a cascade particle filter was used. Okuma et al. [26] focused on tracking multiple objects which can leave and enter the scene, using a combination of mixture particle filters and Adaboost. However, there is no discrimination between the objects tracked. Pedestrian detectors can also be used to improve robustness in multi-object tracking [27]. All of the detectors used in the above papers were trained offline. Although offline-trained classifiers may perform better than real-time detectors due to ample training samples and sufficient training time, they cannot distinguish between objects of the same category. For example, a detection mechanism can localize all pedestrians in a frame, but it is unable to distinguish a specific person. Thus, it is hard for the detectors in the above papers to rectify a tracker that is following a specific object.
Grabner and Bischof [28] used a real-time Adaboost feature selection model for object detection. This work reduced the computational complexity of Adaboost significantly, but because of the limitations of the number of weak classifiers, the accuracy of the detector was low. Babenko et al. [29] trained an online object classifier which was updated by the output of the tracker. A multiple-instance-learning approach was used to reduce ambiguities in the detection phase. Tang et al. [30] treated tracking as a foreground-background classification problem; online support vector machines were built to recognise different features using a co-training framework. Online detectors are more adaptable, and are able to track a specific target amongst many objects from the same class. However, these classifiers perform worse than offline detectors in terms of accuracy, and training data extracted from real-time video have limited reliability. In this paper, we integrate a pre-trained model with a classifier which is updated in real time, to overcome this problem.

Tracking by detection and selection
Our framework has two phases: tracking and trajectory completion. The former can track the target even in the presence of frequent and long-term occlusion, or object absences, while the latter can complete incomplete trajectories having missing segments. The tracking part of our tracking-by-detection using selection and completion (TDSC) framework can track a specific target in videos with long-term occlusion. The user should label the object to be tracked in the first frame, then our tracking algorithm produces the location of this target in every frame. If the target is occluded or goes out of sight, this algorithm outputs the location where the target last appeared. After the target reappears, this tracking algorithm can find the target and output its correct location. The TDSC framework is able to distinguish a specific object amongst others of the same kind, for example, a specific pedestrian among many people. So, even though TDSC is designed to track a single object, we can also use it to deal with the multiple object tracking problem by running multiple simultaneous instances.
A block diagram for our framework is shown in Fig. 1. In this section, we consider the tracking phase, including the object detector, object selector, and tracker. The following section will consider trajectory completion.
Both detector and tracker receive video frames as input data. The object detector can localize objects in the same object category as the target being tracked. However, as a classifier, the detector has two main shortcomings: it is inevitable that (i) it will at times return false positives, and (ii) it will fail to discriminate objects of the same kind. Thus, any objects detected are next filtered by the object selector to remove false positives and objects other than the specific desired target.
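The per-frame interaction between tracker, detector, and selector described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables `tracker_step`, `detect`, and `select`, and the `Box` layout, are hypothetical interfaces introduced only for this example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), assumed layout


def track_sequence(
    frames: Sequence[object],
    init_box: Box,
    tracker_step: Callable[[object, Box], Box],
    detect: Callable[[object], List[Box]],
    select: Callable[[object, List[Box], Box], Optional[Box]],
) -> List[Box]:
    """Run tracker, detector and selector on every frame (hypothetical interfaces)."""
    box = init_box
    trajectory = [init_box]
    for frame in frames[1:]:
        # 1. The simple tracker follows the object from the previous frame.
        box = tracker_step(frame, box)
        # 2. The detector proposes every object of the target's category.
        proposals = detect(frame)
        # 3. The selector keeps at most one proposal: the specific target.
        #    It also receives the previous output location.
        chosen = select(frame, proposals, trajectory[-1])
        if chosen is not None:
            # Re-initialize the tracker at the confirmed location.
            box = chosen
        trajectory.append(box)
    return trajectory
```

Because the selector returns `None` whenever no proposal passes its high decision threshold, the tracker simply continues on its own in those frames and is corrected only when a confident selection is available.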
Our work aims to build a robust framework for tracking objects with long-term occlusions. This paper does not focus on the design of the tracker. We thus simply use compressive tracking [6] in our implementation; it is often employed as a benchmark in comparative experiments because of its effectiveness and efficiency.
A detector can produce coordinates and categories of objects. Object detection is a fundamental problem in computer vision. To obtain high accuracy, our framework uses an offline-trained detector which has been exposed to abundant training samples, without restriction on training time. In recent years, convolutional neural networks (CNNs) have become widely used in this field, as they have higher detection performance compared to methods based on low-level features such as histograms of oriented gradients (HoG) [31] or SIFT features [32]. In this paper, we employ faster-RCNN [33] as our object detector. Region proposal computation is a bottleneck for fast-RCNN [34]. Faster-RCNN overcomes this problem by using a region proposal network which shares convolutional features with the detection network. It achieves near real-time detection rates and detects multiple objects in specific classes with high accuracy.
The approach used by our object selector is to extract feature vectors from objects found by the detector, which we call object proposals, and use a categorization model to find positive proposals. If an object proposal is recognized as of the same category as that of the object being tracked, we call it a positive proposal. The feature vector is based on the color and shape of the object. A color histogram represents the distribution of colors in an object. We use HoG features to represent shape and contextual information. The first step in calculating the HoG descriptor is to compute image gradient values. The region of interest containing an object proposal is divided into 10 × 10 cells. Each pixel within a cell casts a weighted vote for an orientation-based histogram channel based on the magnitude and orientation of the gradient vector. To counter any changes in illumination over space, cells are grouped into blocks in which we locally normalize the gradient strengths. The HoG descriptor is then extracted by concatenating the normalized cell histograms from all blocks. For every object proposal, the color histogram and the HoG descriptor are combined into a single feature vector x. This is used by a classifier to assign a label y to each object, which is either +1 or −1, indicating that it is the target object or some other object, respectively.
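As a rough illustration of how such a feature vector can be assembled, the sketch below concatenates a joint RGB histogram with a simplified HoG descriptor. Several choices are assumptions made for the example and are not taken from the paper: 4 histogram levels per color channel, 9 unsigned orientation bins, blocks of 2 × 2 cells with a stride of one cell, L2 block normalization, and reading "10 × 10 cells" as a 10 × 10 grid of cells over the proposal.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit RGB


def color_histogram(patch: Sequence[Sequence[Pixel]], bins: int = 4) -> List[float]:
    """Normalized joint RGB histogram with `bins` levels per channel."""
    hist = [0.0] * (bins ** 3)
    count = 0
    for row in patch:
        for r, g, b in row:
            idx = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
            hist[idx] += 1.0
            count += 1
    return [h / count for h in hist] if count else hist


def cell_orientation_histograms(gray, cells: int, orientations: int):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude."""
    h, w = len(gray), len(gray[0])
    hists = [[0.0] * orientations for _ in range(cells * cells)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]  # central differences
            gy = gray[y + 1][x] - gray[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation
            o = min(int(angle / math.pi * orientations), orientations - 1)
            cell = (y * cells // h) * cells + (x * cells // w)
            hists[cell][o] += mag  # vote weighted by gradient magnitude
    return hists


def hog_descriptor(gray, cells: int = 10, orientations: int = 9, eps: float = 1e-6):
    """Concatenate L2-normalized 2x2-cell blocks (assumed block layout)."""
    hists = cell_orientation_histograms(gray, cells, orientations)
    descriptor: List[float] = []
    for by in range(cells - 1):
        for bx in range(cells - 1):
            block: List[float] = []
            for dy in (0, 1):
                for dx in (0, 1):
                    block.extend(hists[(by + dy) * cells + (bx + dx)])
            norm = math.sqrt(sum(v * v for v in block) + eps)
            descriptor.extend(v / norm for v in block)
    return descriptor


def feature_vector(patch: Sequence[Sequence[Pixel]], cells: int = 10) -> List[float]:
    """Color histogram followed by the HoG descriptor of the grayscale patch."""
    gray = [[(r + g + b) / 3.0 for r, g, b in row] for row in patch]
    return color_histogram(patch) + hog_descriptor(gray, cells)
```

With the default grid, the vector has 64 color bins plus 81 blocks of 36 values each. A production system would more likely use an optimized library routine for HoG than a pure-Python loop.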
For simplicity and speed, we use a linear support vector machine (SVM) as the classifier. An SVM provides a method to calculate the hyperplane that optimally separates two high-dimensional classes of objects. The hyperplane is given by
$$\omega \cdot x - b = 0 \tag{1}$$
where $\omega$ is the normal vector to the hyperplane and $b$ is the hyperplane offset from the origin. Finding this hyperplane is a convex optimization problem. To be able to handle data which are not linearly separable, we introduce a soft margin, whereupon the objective function to be minimized is
$$\frac{1}{n}\sum_{i=1}^{n}\max\bigl(0,\; 1 - y_i(\omega \cdot x_i - b)\bigr) + \lambda\lVert\omega\rVert^2 \tag{2}$$
where $n$ is the number of training samples and $\lambda$ balances margin size against classification error. Doing so gives a hyperplane which can be used for classifying the feature vectors. However, this does not yet take into account temporal coherence.
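To make the soft-margin idea concrete, here is a small sketch that minimizes the commonly used hinge-loss objective, (1/n) Σ max(0, 1 − y_i(ω·x_i − b)) + λ‖ω‖², by full-batch subgradient descent. The objective form, the sign convention ω·x − b, the optimizer, and all hyperparameter values are assumptions for illustration; the paper does not state how its SVM is solved.

```python
from typing import List, Sequence, Tuple


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed distance-like score w.x - b (assumed sign convention)."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def hinge_objective(w, b, xs, ys, lam: float) -> float:
    """Mean hinge loss plus lam * ||w||^2."""
    loss = sum(max(0.0, 1.0 - y * score(w, b, x)) for x, y in zip(xs, ys))
    return loss / len(xs) + lam * sum(wi * wi for wi in w)


def train_linear_svm(xs, ys, lam: float = 0.01, lr: float = 0.1,
                     epochs: int = 200) -> Tuple[List[float], float]:
    """Full-batch subgradient descent on the hinge objective."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [2.0 * lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(xs, ys):
            if y * score(w, b, x) < 1.0:  # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b += y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(w, b, x) -> int:
    """Label +1 (target) or -1 (other object) from the sign of the score."""
    return 1 if score(w, b, x) >= 0 else -1
```

For example, training on a handful of two-dimensional points with labels +1 and −1 and then calling `classify(w, b, x)` returns the predicted label for a new feature vector `x`.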
In the case of continuous successful detection, the current target location should be close to the previous one. If an object proposal is far from the correct target in the previous frame, this object is unlikely to be the correct proposal: distance should also be considered in our object selector. However, if the object has been absent for a while, the object is likely to be further away.
In a standard SVM, a new feature vector $x$ is classified by computing
$$y = \operatorname{sgn}(\omega \cdot x - b) \tag{3}$$
Taking distance as a penalty factor, given an absence of detection output for $T$ frames, the classification formula is modified to give Eq. (4), which penalizes the distance between the current object proposal location $(x_c, y_c)$ and the previous target location $(x_p, y_p)$; the constant $\mu$ is a distance threshold set empirically. Eq. (4) uses a quadratic form of distance as the penalty factor. If the current object proposal is far from the previous object, it is highly unlikely for this proposal to be the correct location. As the distance increases, the penalty factor should be dominant, so the penalty factor is made to be a quadratic function of distance.
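The effect of a quadratic distance penalty can be illustrated with the sketch below. The specific penalty, (d / (µT))² with Euclidean distance d, is an assumption chosen only to show the behaviour described in the text (quadratic in distance, more tolerant after a longer absence T); it should not be read as the paper's Eq. (4).

```python
import math
from typing import Tuple


def penalized_label(
    svm_score: float,
    proposal_xy: Tuple[float, float],
    previous_xy: Tuple[float, float],
    frames_missing: int,
    mu: float,
) -> int:
    """Return +1 or -1 after subtracting a quadratic distance penalty.

    ASSUMPTION: the penalty (d / (mu * T))**2 and the Euclidean distance
    are illustrative choices; the exact form used in the paper is not
    reproduced here.
    """
    d = math.hypot(proposal_xy[0] - previous_xy[0], proposal_xy[1] - previous_xy[1])
    t = max(1, frames_missing)  # treat continuous detection as T = 1
    penalty = (d / (mu * t)) ** 2
    return 1 if svm_score - penalty >= 0 else -1
```

Under this assumed form, a proposal 100 pixels away with `mu = 20` is rejected when the target was seen in the previous frame, but the same proposal is accepted after ten frames without detection, since the allowed displacement grows with the length of the absence.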
In reality, the appearance of the object can change during the tracking process. For example, a pedestrian may slowly turn around. Although only the initial bounding box that the user draws is completely reliable, we cannot train our selector using only the initial data because of this appearance change. To provide greater adaptability, the object selector is trained online. In the initial stage, the user draws one rectangle to specify the object to track, and another rectangle as a negative sample, to initialize the SVM. Once the initial SVM model has been established, positive and negative samples are extracted in the process of selection. Afterwards, online training is carried out continuously in order to adapt to the changes in appearance of the target.

Trajectory completion
When the object being tracked is occluded, the tracker cannot produce correct coordinates. A sudden, significant change in object coordinates is inevitable when the object becomes visible again after occlusion. We can therefore determine that the object has been occluded by detecting an abrupt change in position. The framework does not rely on the tracker itself to detect occlusion, because the tracker should not declare the object occluded merely because it is indistinct and hard to recognize.
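Detecting occlusion from an abrupt position change can be as simple as thresholding the frame-to-frame displacement. The sketch below assumes a fixed pixel threshold `max_step`, a hypothetical parameter that is not specified in the paper.

```python
import math
from typing import List, Sequence, Tuple


def find_jumps(points: Sequence[Tuple[float, float]], max_step: float) -> List[int]:
    """Indices i where the move from points[i-1] to points[i] exceeds max_step."""
    jumps = []
    for i in range(1, len(points)):
        dx = points[i][0] - points[i - 1][0]
        dy = points[i][1] - points[i - 1][1]
        if math.hypot(dx, dy) > max_step:
            jumps.append(i)  # trajectory is discontinuous just before frame i
    return jumps
```

Each returned index marks the frame at which the target was re-acquired, so the frames immediately before it are candidates for trajectory completion.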
When occlusion has been detected, a trajectory completion mechanism is used to correct the discontinuous trajectories. Trajectory completion is a temporal extrapolation problem. Artificial neural networks are among the most accurate and widely used forecasting models, and are capable of identifying complex nonlinear relationships between input and output data. Non-linear autoregressive (NAR) neural networks have proved useful for forecasting data with complicated patterns [35]. Completing the tracked object's coordinates is also a data forecasting problem. We thus employ a three-layer NAR neural network for trajectory completion. Compared with using an interpolation method such as spline fitting, an NAR neural network can produce a predicted trajectory without making any assumption about the type of movement the object is undergoing, and need not assume a smooth trajectory.
An NAR network has linear activation functions for the output layer and non-linear logistic activation functions for the hidden layer. Thus our network performs a non-linear functional mapping from the past object coordinates to future locations. Let $x_t$ be the horizontal coordinate of the target at time $t$. The mapping performed by the network is
$$x_t = f(x_{t-1}, x_{t-2}, \ldots, x_{t-l};\, w) \tag{5}$$
where $w$ are the connection weights of this network and $l$ is the maximum time delay for the input data.
We can see that this network is a nonlinear autoregressive model. The structure of this autoregressive neural network is shown in Fig. 2.
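A three-layer NAR network of this kind is small enough to write out directly. The sketch below maps the last `lags` coordinates to the next one through a logistic hidden layer and a linear output, and is trained by per-sample gradient descent on squared error. The hidden-layer size, learning rate, weight initialization, and training procedure are assumptions for illustration only; the paper does not specify them.

```python
import math
import random
from typing import List, Sequence, Tuple


def make_lagged(series: Sequence[float], lags: int) -> List[Tuple[List[float], float]]:
    """Pairs ([x_{t-1}, ..., x_{t-l}], x_t) for every t >= lags."""
    return [(list(reversed(series[t - lags:t])), series[t])
            for t in range(lags, len(series))]


class NARNet:
    """lags inputs -> logistic hidden layer -> single linear output."""

    def __init__(self, lags: int, hidden: int, seed: int = 0):
        rng = random.Random(seed)
        self.lags = lags
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(lags)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def _hidden(self, x: Sequence[float]) -> List[float]:
        return [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x: Sequence[float]) -> float:
        h = self._hidden(x)
        return sum(w * hi for w, hi in zip(self.w2, h)) + self.b2

    def mse(self, pairs) -> float:
        return sum((self.predict(x) - y) ** 2 for x, y in pairs) / len(pairs)

    def fit(self, series: Sequence[float], lr: float = 0.05, epochs: int = 500) -> None:
        pairs = make_lagged(series, self.lags)
        for _ in range(epochs):
            for x, y in pairs:
                h = self._hidden(x)
                err = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2 - y
                for j, hj in enumerate(h):
                    # Backpropagate through the logistic unit before updating w2[j].
                    delta = err * self.w2[j] * hj * (1.0 - hj)
                    self.w2[j] -= lr * err * hj
                    self.b1[j] -= lr * delta
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * delta * xi
                self.b2 -= lr * err

    def forecast(self, history: Sequence[float], steps: int) -> List[float]:
        """Predict `steps` future values, feeding each prediction back as input."""
        window = list(history[-self.lags:])  # oldest to newest
        out = []
        for _ in range(steps):
            nxt = self.predict(list(reversed(window)))
            out.append(nxt)
            window = window[1:] + [nxt]
        return out
```

In practice the coordinates would be scaled to a small range (for example, divided by the image width) before training, because logistic units saturate for large inputs.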
In the majority of cases, trajectories of objects are continuous and smooth, such as when tracking cars and pedestrians. In order to make full use of known information and improve the continuity and smoothness of the trajectories, the forecast is carried out in two directions. The coordinates before occlusion are input to the NAR neural network as training data in time sequence, and the prediction output is denoted by $\{x_t^+\}$, while the coordinates after occlusion are also input to the neural network as training data in the reverse direction, and the prediction output is denoted by $\{x_t^-\}$. The final forecast result $\{x_t\}$ is then calculated from $\{x_t^+\}$ and $\{x_t^-\}$.

Experimental results

Dataset preparation
Several datasets exist for benchmarking visual tracking, such as VOT (http://www.votchallenge.net/vot2016/dataset.html), VTB (http://cvlab.hanyang.ac.kr/tracker_benchmark/index.html), and MOT (http://motchallenge.net/data/MOT16/). However, most sequences in these datasets have no object occlusion. Even in the video samples in which objects are blocked, the occlusion spans are too short. In order to evaluate tracking performance in sophisticated circumstances such as long and frequent occlusion, we introduce a more challenging dataset. Two sequences in this dataset are selected from the VTB benchmark, and we have captured four more difficult samples with significant occlusion. For quantitative evaluation, we use the following protocol, similar to that used in the MOT benchmark. Tracking starts from an initial bounding box in the first frame. Both the ground truth and the tracking results are a sequence of bounding boxes. If the overlap between a tracking bounding box and the ground truth bounding box is larger than an overlap threshold, the tracking result in this frame is deemed successful. For every tracking framework, we plot a success rate curve against the overlap threshold. The overall performance of a tracker can be measured by the AUC (area under the curve) criterion. In order to measure the ability to handle occlusion, re-initialization is not performed in the tracking process.

Experiments on the tracking phase
We compare our proposed TDSC framework with six trackers: CT [6], KCF [36], CSK [37], IMT [38], SORT [39], and LCT [40]. All trackers were run with the same parameters using our dataset. Figure 3 shows tracking results for some test samples. From top to bottom these are: David, Jogging, Occlusion 1, Occlusion 2, Occlusion 3, and Frameskip. The first two are relatively simple scenes with short-term occlusion, and come from the VTB dataset. Occlusion 1 is a multi-object occlusion scene. Occlusion 2 includes long-term occlusion and shaking. Occlusion 3 is a long-term and multi-object occlusion scene. The last sample has missing frames in a sequence. The results reveal that while state-of-the-art tracking methods are able to handle short-term occlusion, the targets are lost with long-term occlusion. However, our TDSC framework can continue tracking through re-initialization. Figure 4 presents the performance curves for these seven trackers on our dataset. The results indicate that the proposed tracker has better performance than the other trackers in our experiments. To provide a quantitative analysis, we give the AUC values for these seven trackers using our test dataset, for three specific conditions, in Table 1. The proposed TDSC framework shows a significant improvement over the prior state-of-the-art methods.
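The success-rate curve and AUC used in this evaluation can be computed as in the sketch below. It assumes that overlap is measured as intersection over union (IoU) and that the area is obtained by the trapezoidal rule; both are common conventions in tracking benchmarks but are assumptions here, as is the `(x, y, width, height)` box layout.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def success_curve(pred: Sequence[Box], truth: Sequence[Box],
                  thresholds: Sequence[float]) -> List[float]:
    """Fraction of frames whose overlap exceeds each threshold."""
    overlaps = [iou(p, g) for p, g in zip(pred, truth)]
    return [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]


def auc(thresholds: Sequence[float], rates: Sequence[float]) -> float:
    """Trapezoidal area under the success-rate curve."""
    return sum(
        (thresholds[i + 1] - thresholds[i]) * (rates[i] + rates[i + 1]) / 2.0
        for i in range(len(thresholds) - 1)
    )
```

A finer threshold grid, such as 0 to 1 in steps of 0.05, gives a smoother curve and a more stable AUC than the coarse grids used for quick checks.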
The proposed framework achieves real-time processing. For a resolution of 576 × 432, traditional tracking and selection only take 3.9 ms and less than 1 ms respectively using an Intel Core i7 CPU.

Experiments on the completion phase
Building datasets is the first difficulty in trajectory completion experiments. For a video sample in which an object is physically occluded, we are unable to acquire a complete trajectory, so it is hard to annotate ground truth. Therefore, we capture some video samples without occlusion, annotate the object trajectories as ground truth, and then draw synthetic obstacles to occlude the moving objects.
We conducted experiments on two kinds of cases: straight trajectories and curved trajectories. Results are illustrated in Fig. 5. Blue lines are trajectories extracted by our tracking-detection-selection mechanism. Because of occlusion, these trajectories are discontinuous. Yellow lines are trajectories predicted by the completion mechanism. Red crosses indicate trajectory ground truth. We use the average distance between points in predicted trajectories and ground truth trajectories for quantitative evaluation. The average distance in straight-trajectory cases is 5 pixels (3% of the target height) and in curved-trajectory cases, 10 pixels (14% of the target height). Our experiments furthermore show that our TDSC framework can output continuous trajectories in cases with occlusion.
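The average-distance measure can be written in a few lines. The sketch assumes that predicted and ground-truth points are paired frame by frame and that distance is Euclidean, neither of which is stated explicitly in the text.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def mean_point_distance(predicted: Sequence[Point], truth: Sequence[Point]) -> float:
    """Average Euclidean distance between frame-aligned trajectory points."""
    if len(predicted) != len(truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.hypot(px - gx, py - gy)
                for (px, py), (gx, gy) in zip(predicted, truth))
    return total / len(predicted)
```

Dividing the result by the target height in pixels gives the relative error quoted alongside the pixel values.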

Conclusions
In this paper, we designed a novel framework to solve the problem of object tracking where long-term occlusion interferes with the tracking process. Continuous tracking is necessary for some realistic difficult problems, especially for safety monitoring. Our framework decomposes the task into two parts: tracking and trajectory completion. The object detector is a deep neural network model which localizes objects in the same category as the object being tracked. The object selector is based on an online-trained SVM model, and discriminates between the outputs of the object detector to determine which object should be used to re-initialize the tracker. Offline-trained and online-trained classifiers are combined for accuracy and flexibility. To obtain a continuous trajectory, we utilize a non-linear autoregressive neural network to complete the missing parts of trajectories extracted by the tracking component of TDSC. Quantitative experiments show our proposed framework improves upon prior state-of-the-art tracking methods and is able to output continuous trajectories.