Long-term real time object tracking based on multi-scale local correlation filtering and global re-detection

This paper investigates long-term visual object tracking, which is a complex problem in the computer vision community and in big data analysis because of variations in the target and the surrounding environment. A novel tracking algorithm based on local correlation filtering and global keypoint matching is proposed to solve problems that occur during long-term tracking, such as occlusion and target loss. The algorithm consists of two major components: (1) a local object tracking module, and (2) a global loss re-detection module. The local tracking module improves the conventional correlation filtering algorithm. Firstly, the Color Name feature is applied to increase color sensitivity. Secondly, a scale traversal is employed to accommodate target scale changes. In the global loss re-detection module, the target loss judgment and global re-detection are realized by keypoint feature models of the foreground and background. The proposed tracker achieves 1st place on the VTB50 test set with 81.3% precision and 61.3% success rate, outperforming other existing state-of-the-art trackers by over 10%. It achieves 2nd place on our Chasing-Car test set with higher real-time performance at 43.2 fps. The experimental results show that the proposed tracker has higher accuracy and robustness when dealing with situations such as object deformation, occlusion and target loss.


Introduction
Visual object tracking, the problem of locating objects in a video sequence, is one of the central research topics in the field of computer vision [1] and Big Data Analytics (BDA). More and more modern applications, such as precision guidance [2], intelligent surveillance [3], and automatic control systems [4], require highly accurate object tracking results. According to how the target changes during tracking, object tracking can be divided into long-term tracking and short-term tracking. Most existing trackers focus on short-term tracking and have achieved excellent performance. However, long-term tracking is still not a fully studied problem [5]. The following problems are long-standing. First, model update drift, which is caused by changes in the appearance or scale of the target. The target may have different shapes and colors, and videos may also suffer from noise. Second, target occlusion and loss, which refers to the target being blocked by other objects and is a common situation in real video tests. Finally, the complexity of the algorithm. The tracker must be simple and flexible enough, as most applications of target tracking have very high real-time requirements.
In this paper, all these problems are considered simultaneously with a tracking algorithm based on correlation filtering and global keypoint matching. By combining the speed advantage of correlation filtering in local tracking with the scale and rotation invariance of keypoints in global detection, an accurate and robust long-term tracker is realized. Correlation filtering mainly plays the role of local tracking. It makes use of the properties of circulant matrices to reduce the computational complexity and ensure real-time performance. As its target model and update mechanism are not sensitive enough to detail, target occlusion and loss cannot be judged well. To solve this problem, global keypoint features are introduced. By building a foreground/background keypoint model, target occlusion judgment and global target detection are realized. The tracker combines the results of both the local and global algorithms to update the target model and to ensure robustness in long-term tracking.
Experiments on VTB50 [6] and the collected Chasing-Car videos show that the proposed tracker: (i) has higher accuracy and robustness when dealing with situations like object deformation or object occlusion; (ii) outperforms the state-of-the-art trackers, especially on the long-term video test; (iii) can achieve excellent real-time performance.
The rest of this paper is organized as follows. Section 2 reviews recent long-term trackers. Section 3 introduces the proposed tracker in detail. Section 4 presents experimental results on VTB50 and the collected Chasing-Car videos. Section 5 revisits the major contributions of this paper.

Related work
Object tracking can be divided into short-term and long-term tracking according to the complexity of the tracking process. Short-term target tracking mainly studies the object representation model and the discriminant method. Long-term target tracking focuses on the problems of large-scale appearance change, occlusion and target loss, and also on the updating mechanism of the corresponding target model. The TLD [7] algorithm combines a tracker with a detector. In each frame, the tracker and the detector judge the candidate target separately, and both are used for the result. It combines the different characteristics of the detector and the tracker, but it has a large number of parameters that need to be set for each specific target. Another method is the multi-expert [8] system. A tracker snapshot is generated as an expert at regular intervals during the tracking process. The target is tracked by multiple experts simultaneously to avoid model drift. This method improves the robustness of the tracker, but the computational complexity of the algorithm increases exponentially.
Correlation filters have been widely used in object detection and recognition. Since the operator is readily transferred into the Fourier domain as element-wise multiplication, correlation filters have recently attracted considerable attention in visual tracking due to their computational efficiency. In LCT [9], a simple detector based on random forests is adopted to alleviate the drift problem. MUSTer [10] develops an additional long-term tracker based on keypoints to re-locate the target object. DLT [11] introduces tracking resumption in correlation trackers using a detector mechanism that re-initializes the tracker upon a target loss identified using an adaptive threshold. PDCT [12] utilizes superpixel optical flow to construct a predictor that estimates the target motion and internal scale variation.
In recent years, deep learning has made breakthroughs in the field of computer vision [13]. However, it faces some difficulties in target tracking. First, object tracking is a model-free problem without prior knowledge, so offline training is not feasible. Second, target tracking has high real-time requirements and usually runs on low-power computing platforms, such as handheld devices, which limits the complexity of the deep learning algorithm. Although there are some fast deep learning trackers, such as GOTURN [14], the robustness of these algorithms is unsatisfactory.

Method overview
The framework of the tracker is shown in Fig. 1. First, the initial frame of the video is used to initialize the local and global models of the target. Then the target position in the next frame is calculated by the correlation filter based on local features, and global features are used to judge whether the target is lost. If the judgment is 'Lost', the target is re-detected by the global keypoint detector. The model of the target is then updated with the current frame.
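The control flow described above can be sketched as follows. This is a minimal sketch rather than the authors' implementation: the function `track_sequence` and the four callables `local_track`, `is_lost`, `redetect` and `update` are hypothetical placeholders for the modules detailed in the following sections, and skipping the model update on frames where re-detection fails is an assumption.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def track_sequence(
    frames: Iterable[object],
    init_box: Box,
    local_track: Callable[[object, Box], Box],
    is_lost: Callable[[object, Box], bool],
    redetect: Callable[[object], Optional[Box]],
    update: Callable[[object, Box], None],
) -> List[Optional[Box]]:
    """Run the local-tracking / loss-judgment / re-detection loop.

    All callables are hypothetical stand-ins for the tracker's modules.
    Returns one box per frame, or None for frames where the target is lost.
    """
    results: List[Optional[Box]] = []
    box = init_box
    for index, frame in enumerate(frames):
        if index == 0:
            # The initial frame only initializes the local and global models.
            update(frame, box)
            results.append(box)
            continue
        # Local correlation-filter tracking around the last known box.
        candidate = local_track(frame, box)
        if is_lost(frame, candidate):
            # Global keypoint re-detection over the whole frame.
            found = redetect(frame)
            if found is None:
                # Assumption: mark the frame as lost and leave the models untouched.
                results.append(None)
                continue
            candidate = found
        box = candidate
        update(frame, box)
        results.append(box)
    return results
```

Passing the modules in as callables keeps the loop independent of any particular feature or matcher, so each stage can be exercised with simple stubs.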

Local tracking based on optimized correlation filtering
The principle of the correlation filtering algorithm is the optimization of a ridge regression. In the initial frame, different positions in the search area are marked with different probabilities, which represent how likely each position is to be the target. Here x represents the image features in the bounding box, and f(x) represents its probability. The optimization problem is formulated as a ridge regression in matrix form, and its optimum solution is expressed in terms of X, which represents all of the samples in the search area. Because the candidate regions overlap each other, the solution can be computed efficiently by using the properties of circulant matrices: the objective function is converted to the frequency domain by the Fourier transform. To avoid the high-frequency noise caused by the edge effect in the Fourier transform, a two-dimensional Hamming window is multiplied with the features to smooth the results. During tracking, the image feature x of the search region is transformed to the frequency domain by the Fourier transform, and the frequency-domain response y is calculated as the element-wise product of x and w.
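For reference, the standard linear ridge-regression formulation used by correlation filter trackers can be written as below. This is a sketch of the textbook form and not necessarily the exact notation of this paper: the regularization weight $\lambda$, the hat for the discrete Fourier transform, $\odot$ for the element-wise product and $^{*}$ for complex conjugation are assumptions, and $y$ here collects the target probabilities of all samples.

```latex
\begin{aligned}
f(x) &= w^{T} x \\
\min_{w} \; & \lVert X w - y \rVert^{2} + \lambda \lVert w \rVert^{2} \\
w &= \left( X^{T} X + \lambda I \right)^{-1} X^{T} y \\
\hat{w} &= \frac{\hat{x}^{*} \odot \hat{y}}{\hat{x}^{*} \odot \hat{x} + \lambda}
\end{aligned}
```

The last line holds when the rows of $X$ are the cyclic shifts of one base sample $x$: a circulant matrix is diagonalized by the discrete Fourier transform, so the matrix inverse reduces to an element-wise division.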
The spatial response y is obtained through the inverse Fourier transform, as shown in Fig. 2. The position with the largest response value is the position of the target. Intensity-based image features such as HOG and FHOG [8,15,16] are commonly used in correlation filter trackers, but these features cannot express the color characteristics of images. In our algorithm, the Color Name feature proposed by van de Weijer et al. [17] is employed as a supplement to achieve color sensitivity. The Color Name feature differs from common color spaces such as HSV and LAB in that it contains a total of 11 basic colors which are learned from real-world images, as shown in Fig. 3.
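The training and detection steps of the correlation filter can be illustrated on a one-dimensional signal. The sketch below is a simplified stand-in, not the tracker's implementation: it uses a naive DFT from the standard library instead of a 2-D FFT over multi-channel features, omits the Hamming window, and the names `dft`, `gaussian_target`, `train_filter`, `detect` and the regularization value `lam` are assumptions.

```python
import cmath
import math
from typing import List, Sequence


def dft(signal: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive O(N^2) discrete Fourier transform using only the standard library."""
    n = len(signal)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t, value in enumerate(signal):
            acc += value * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gaussian_target(n: int, sigma: float = 1.0) -> List[float]:
    """Desired response: a Gaussian peak at index 0, wrapped circularly."""
    return [math.exp(-min(i, n - i) ** 2 / (2.0 * sigma ** 2)) for i in range(n)]


def train_filter(x: Sequence[float], y: Sequence[float], lam: float = 1e-2) -> List[complex]:
    """Closed-form ridge regression solved element-wise in the frequency domain."""
    x_hat, y_hat = dft(x), dft(y)
    return [
        xc.conjugate() * yc / (xc.conjugate() * xc + lam)
        for xc, yc in zip(x_hat, y_hat)
    ]


def detect(w_hat: Sequence[complex], z: Sequence[float]) -> int:
    """Return the index of the largest spatial response for the search signal z."""
    z_hat = dft(z)
    response = dft([zc * wc for zc, wc in zip(z_hat, w_hat)], inverse=True)
    real = [r.real for r in response]
    return real.index(max(real))
```

Because the desired response peaks at index 0, the index returned by `detect` is the circular displacement of the target relative to the training sample.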
Besides the expression of color characteristics, another problem is that the scale of the target changes throughout the video. In our algorithm, a multi-scale traversal algorithm is used to find the changed target scale. First, a larger scale change coefficient α_l is chosen to traverse the scale space around S_0, the initial scale in the model, and roughly determine the target scale. Among the probability response maps y_j computed at the different scales, the scale with the largest probability is taken as the current scale S_l of the target.
Then, to obtain a more accurate scale, the scale change coefficient α_l is replaced by a smaller one and the traversal is repeated, with S_l taking the place of S_0. With this method, the tracker adapts well to changes in the target scale. Meanwhile, the amount of calculation is much less than in the traditional method with a constant α, which guarantees real-time performance.
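The two-pass scale search can be sketched as below. This is a hedged illustration under stated assumptions: the candidate grid `base * (1 + j * step)` is assumed, `score` is a hypothetical callable standing in for the peak of the correlation response at a given scale, and the default steps 0.2 and 0.1 are the rough and accurate strides reported in the experimental setup.

```python
from typing import Callable


def best_scale(
    score: Callable[[float], float], base: float, step: float, levels: int = 1
) -> float:
    """Evaluate candidate scales around `base` and return the best-scoring one."""
    # Assumed grid: base * (1 + j * step) for j in -levels..levels.
    candidates = [base * (1.0 + j * step) for j in range(-levels, levels + 1)]
    return max(candidates, key=score)


def coarse_to_fine_scale(
    score: Callable[[float], float],
    s0: float,
    coarse_step: float = 0.2,
    fine_step: float = 0.1,
) -> float:
    """Rough pass with a large step, then a refinement pass around its winner."""
    s_l = best_scale(score, s0, coarse_step)
    return best_scale(score, s_l, fine_step)
```

For example, with a score that peaks at scale 1.3 and `s0 = 1.0`, the rough pass selects 1.2 and the refinement pass selects about 1.32, using six response evaluations in total.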

Global detection based on keypoint matching
Local tracking based on the optimized correlation filter cannot deal with occlusion or target loss, which are common situations in video target tracking.
Instead of using a detection-based method, our algorithm constructs the foreground and background models from keypoint features and replaces the classifier with keypoint matching. The advantage of this method is that the keypoints represent the features in the local neighborhood of each image point, and the feature extraction is finished in a single traversal of the full image, which avoids the high computational cost of the detection-based method. In addition, occlusion discrimination and global detection are both performed by keypoint matching without training any classifier model, which also reduces the computational complexity of the algorithm. In our method, the ORB feature [18] is used for better real-time performance.
To judge occlusion or loss of the target during long-term tracking, the tracker extracts the global keypoints of the image and establishes the foreground and background models respectively. Then, both forward and reverse keypoint matching are performed between the models and the current frame. According to the matching results, each point is marked as foreground-matched or background-matched. The tracker then determines whether the target in the current result is lost or occluded. If the number of points matched with the background model exceeds the threshold, the tracker judges the target to be occluded or lost in this frame:

$$\mathrm{Number}(M_B) > \sigma$$

where $M$ represents the matching results, $\mathrm{Number}(M_B)$ refers to the number of points matched with the background model, and $\sigma$ is the threshold.
When the target is lost, the global re-detection is started. The position of the target is located by the weighted center of the corresponding keypoints between the target keypoint model $T$ and the current keypoint set $C_t$. The scale and rotation variation of the target are estimated first from corresponding keypoint pairs. In the instance shown in Fig. 4, $k$ is a keypoint in $T$ and $k'$ is the corresponding keypoint in $C_t$. The scale and rotation variation are estimated from the scales $s_k$, $s_{k'}$ and the orientations $o_k$, $o_{k'}$ of $k$ and $k'$, taken over all $N_k$ keypoints.
To reduce the effect of matching errors in the re-detection, different weights are assigned to different feature points when calculating the center of the target. It is assumed that mismatched keypoints are usually far away from the target center of the previous frame, and that the change between the target centers of adjacent frames stays within a certain range. The weights are set according to the distance from each current keypoint to the target center of the previous frame:

$$w_{k'} = \exp\left(-\frac{\lVert p_{t,k'} - p_{t-1} \rVert^{2}}{2\sigma^{2}}\right) \quad (10)$$

where $p_{t-1}$ is the target position in the previous frame and $p_{t,k'}$ is the position of keypoint $k'$. In this method, mismatched keypoints are assigned smaller weights because of their long distances to the target center. In contrast, correctly matched keypoints are assigned larger weights. Then, the target center is estimated as the weighted center of the keypoint positions, where $w_{k_i}$ is the normalized weight.
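The loss judgment and the weighted re-detection of the target center can be sketched as follows. This is an illustrative sketch under several assumptions, not the authors' code: the function names are hypothetical, the scale change is assumed to be the mean ratio of keypoint scales, the rotation is assumed to be the circular mean of orientation differences, and the weights use a Gaussian of the distance to the previous target center.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
KeypointPose = Tuple[float, float]  # (scale, orientation in radians)


def is_occluded_or_lost(num_background_matches: int, threshold: float) -> bool:
    """Judge loss when matches against the background model exceed the threshold."""
    return num_background_matches > threshold


def estimate_scale_rotation(
    model: Sequence[KeypointPose], current: Sequence[KeypointPose]
) -> Tuple[float, float]:
    """Assumed estimator: mean scale ratio and circular mean of orientation change."""
    n = len(model)
    scale = sum(c[0] / m[0] for m, c in zip(model, current)) / n
    sin_sum = sum(math.sin(c[1] - m[1]) for m, c in zip(model, current))
    cos_sum = sum(math.cos(c[1] - m[1]) for m, c in zip(model, current))
    return scale, math.atan2(sin_sum, cos_sum)


def weighted_center(
    points: Sequence[Point], prev_center: Point, sigma: float
) -> Point:
    """Average keypoint positions, down-weighting points far from the old center."""
    weights = [
        math.exp(
            -((x - prev_center[0]) ** 2 + (y - prev_center[1]) ** 2)
            / (2.0 * sigma ** 2)
        )
        for x, y in points
    ]
    total = sum(weights)
    if total == 0.0:
        # All keypoints are too far away to contribute; keep the old center.
        return prev_center
    cx = sum(w * x for w, (x, _) in zip(weights, points)) / total
    cy = sum(w * y for w, (_, y) in zip(weights, points)) / total
    return cx, cy
```

With two keypoints near the previous center and one distant mismatch, the mismatch receives a negligible weight and the estimated center stays between the two consistent keypoints.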

Model updating mechanism
After the local tracking and global detection, the tracker fuses both results into the final result. Meanwhile, the target model and the discrimination model are updated according to the result of the occlusion judgment, as shown in Fig. 5. The tracker updates the local correlation filter using the incremental updating method.
If not, turn to step 5.
5. Global re-detection: If the judgment is lost or complete occlusion, the current frame is marked as lost and the next frame is used to locate the target. If the judgment is partial occlusion, the position, scale and rotation of the target are calculated from keypoints to initialize the correlation filter tracker.
6. Model updating: If the target is well matched, update the correlation filter tracker and the keypoint model. The parameters of the correlation filter are updated incrementally, while the keypoint model is replaced by the current frame model.
7. End of tracking: If it is not the last frame, repeat steps 3-6.
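The model updating step can be sketched as below. This is a minimal sketch under assumptions: the function `update_models` and its list-based model representation are hypothetical, and linear interpolation is assumed as the incremental update, with the learning rate of 0.05 taken from the experimental setup.

```python
from typing import List, Sequence, Tuple


def update_models(
    filter_model: Sequence[float],
    new_filter: Sequence[float],
    keypoint_model: Sequence[object],
    new_keypoints: Sequence[object],
    well_matched: bool,
    learning_rate: float = 0.05,
) -> Tuple[List[float], List[object]]:
    """Update both models only when the target is judged well matched."""
    if not well_matched:
        # Occluded or lost: keep both models unchanged to avoid drift.
        return list(filter_model), list(keypoint_model)
    # Assumed incremental update: linear interpolation of filter parameters.
    blended = [
        (1.0 - learning_rate) * old + learning_rate * new
        for old, new in zip(filter_model, new_filter)
    ]
    # The keypoint model is replaced by the model of the current frame.
    return blended, list(new_keypoints)
```

A small learning rate lets the filter adapt to gradual appearance change while limiting the influence of any single frame.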

Evaluation criterion
The performance evaluation of an object tracking algorithm mainly covers three aspects: accuracy, robustness and real-time performance. The most common evaluation criteria are Location Error and Overlap Rate.

(a) Location error
The location error measures the distance between the tracking result and the target's actual position. Tracking is marked as successful if the distance is shorter than the distance threshold $\sigma_d$. The score is calculated from the distance $\lVert C_{alg} - C_{GT} \rVert_2$, where $C_{alg}$ and $C_{GT}$ represent the coordinates of the tracking result and the actual target position.
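As a small illustration, the location error and the resulting precision at one threshold can be computed as follows. The function names are hypothetical, and the Euclidean distance between centers is assumed as the distance measure.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def location_error(c_alg: Point, c_gt: Point) -> float:
    """Euclidean distance between the tracked center and the ground-truth center."""
    return math.hypot(c_alg[0] - c_gt[0], c_alg[1] - c_gt[1])


def precision(errors: Sequence[float], distance_threshold: float) -> float:
    """Fraction of frames whose location error is below the distance threshold."""
    if not errors:
        return 0.0
    hits = sum(1 for e in errors if e < distance_threshold)
    return hits / len(errors)
```

Evaluating `precision` over a range of thresholds yields the points of a precision plot.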

(b) Overlap rate
The overlap rate is defined by:

$$\frac{\lvert B_{alg} \cap B_{GT} \rvert}{\lvert B_{alg} \cup B_{GT} \rvert}$$

where $B_{alg}$ is the bounding box (top-left corner, width and height) of the tracking result and $B_{GT}$ represents the bounding box of the actual target. $\cap$ and $\cup$ represent the intersection and union of the two areas. If the ratio is larger than the overlap threshold $\sigma_r$, the tracking is marked as successful. In our evaluation, the Precision Plot and the Success Plot are both used. The Precision Plot shows the location error results under different distance thresholds $\sigma_d$, and the Success Plot shows the overlap rate results under different overlap thresholds $\sigma_r$.
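The overlap rate and the success rate at one threshold can be sketched as below, assuming axis-aligned boxes given as (x, y, width, height) with (x, y) the top-left corner. The function names are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def overlap_rate(b_alg: Box, b_gt: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    ax, ay, aw, ah = b_alg
    gx, gy, gw, gh = b_gt
    # Width and height of the intersection rectangle, clamped at zero.
    inter_w = max(0.0, min(ax + aw, gx + gw) - max(ax, gx))
    inter_h = max(0.0, min(ay + ah, gy + gh) - max(ay, gy))
    inter = inter_w * inter_h
    union = aw * ah + gw * gh - inter
    return inter / union if union > 0.0 else 0.0


def success_rate(overlaps: Sequence[float], overlap_threshold: float) -> float:
    """Fraction of frames whose overlap rate exceeds the overlap threshold."""
    if not overlaps:
        return 0.0
    hits = sum(1 for o in overlaps if o > overlap_threshold)
    return hits / len(overlaps)
```

Sweeping `overlap_threshold` from 0 to 1 yields the points of a success plot.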

Experiments
This section introduces the experimental setup and the experimental results. To test the accuracy, robustness and real-time performance of our method, the experiments use the public test set VTB50 and a collected Chasing-Car test set, which includes five Chasing-Car videos taken from a helicopter. These two test sets cover various situations that may occur in target tracking, such as illumination change, scale change, target deformation, fast motion, target occlusion and background clutter. We use the test criteria presented in the Visual Tracking Benchmark [6] for algorithmic comparison. Detailed test criteria are described in [6]. In the experiments, the padding of the search area in correlation filtering was set to 2.5, while the learning rate was set to 0.05. The stride of the rough scale search was 0.2, and it was set to 0.1 in the accurate search. The threshold in the keypoint matching was set to 0.4, and the target is marked as lost if fewer than 5 points are matched.
The proposed tracker is implemented in C++ with the OpenCV library and tested on a PC with an Intel i5-4460 3.20 GHz CPU and 8 GB RAM.

Results on VTB50
The first test dataset is VTB50, shown in Fig. 6, which is the most commonly used test dataset in video object tracking. This dataset contains 50 videos with marked targets, and the tests include illumination change, scale change, occlusion, deformation, blur, rapid motion, rotation, out-of-view and low resolution. The targets include cars, pedestrians, faces and toys. Figure 7 shows the tracking results of the proposed method and other state-of-the-art methods in three different videos of VTB50. Our algorithm performs well in all of the situations, while the other methods show varying degrees of loss and drift. Figure 8 shows the precision and success rate tested on the VTB50 videos, including the whole test set and the occlusion and out-of-view situations. The legend in each plot lists the top 10 methods of each test. The details of the precision and success rate comparisons are shown in Table 1. It can be seen that the proposed tracker achieves the best results in all of the tests. In the success plot on the overall test, the proposed tracker outperforms KCF, in second place, by over 10%. In the occlusion and out-of-view tests, the success rates of the other methods decrease considerably compared to their overall success rates, while our algorithm still maintains a high success rate. It can therefore be concluded that the proposed tracker performs excellently in dealing with occlusion and target loss compared with other state-of-the-art trackers.

Results on Chasing-Car videos
To test the tracker's performance in long-term tracking, we collected an additional test set, which contains 5 long Chasing-Car videos, as shown in Fig. 9. All of those videos are taken from a helicopter, so there is a lot of target occlusion, loss, and large changes in target scale and rotation. These conditions make the test a great challenge to the accuracy and robustness of the tracker. The resolution of the videos is 480 × 320 and each video has 920 frames. We compare the proposed tracker with 4 state-of-the-art trackers [7,15,16,19].
Figure 10 shows the tracking results of the different methods. It can be seen that there is a lot of rapid motion, scale change and target occlusion. Figure 11 shows the accuracy and success rate on this test dataset. Although the MDNet tracker is superior to our tracker in accuracy, it is computationally complex and slow. Table 2 gives a detailed algorithm comparison. Our tracker achieves a center location error score of 0.809 and an overlap score of 0.610, which is behind the MDNet tracker but outperforms the other three trackers by over 0.1. Our tracker is about 60 times faster than the MDNet algorithm, since this deep learning method needs online training and updating with high computational complexity. Therefore, the proposed tracker balances performance and computational complexity, which makes it more feasible than other state-of-the-art trackers.

Fig. 8 Experiment results on VTB50. The first row is the precision plot and the second row is the success plot. From left to right are results for 3 attributes: the whole test set, occlusion and out-of-view. Our trackers provide consistent improvements compared to state-of-the-art trackers

Conclusion
This paper presents a novel target tracker based on correlation filtering and keypoint matching. The correlation filter has been improved by the combined feature and a multi-scale detection algorithm. Meanwhile, a global keypoint matching algorithm is applied for the judgment of occlusion and for target re-detection when the target is lost. We perform comprehensive experiments on two data sets: VTB50 and our Chasing-Car data set. The proposed tracker achieves 1st place on the VTB50 test set with 81.3% precision and 61.3% success rate, outperforming other existing state-of-the-art trackers by over 10%. It achieves 2nd place on our Chasing-Car test set with higher real-time performance at 43.2 fps, which is 60 times faster than the first-place tracker MDNet. The experimental results show that the proposed tracker is much more robust in dif-

Fig. 2 The spatial response map of the target

Fig. 4 Left: keypoints of the target model. Right: corresponding keypoints of the current frame

Fig. 5 Flowchart of the model updating mechanism

Fig. 9 Chasing-Car test set. This test set contains 5 different Chasing-Car videos. In each video, the target car undergoes different situations including occlusion, target loss, and significant scale and rotation change

Table 2 Algorithm comparison on Chasing-Car test set