Multi-experts Joint Decision with Adaptive Model Updater for Robust Visual Tracking

In recent years, correlation filter-based trackers have shown advantages in both accuracy and speed. However, variations of target appearance caused by heavy occlusion, rotation, background clutter and target deformation remain the major challenges for tracking. To address these problems, many works focus on exploiting more powerful target representations, such as high-level convolutional features. Nonetheless, these methods trade speed for performance. Meanwhile, there has been little research on improving the model updater or on ensemble methods. In this paper, a multi-experts joint decision strategy based on kernelized correlation filters is proposed to achieve robust and accurate visual tracking; two trackers, one with handcrafted features and one with deep convolutional neural network features, are integrated in this framework. We also investigate the mechanism of tracking failure caused by occlusion and background clutter, and propose a novel criterion to evaluate the reliability of samples. In addition, our work extends the kernelized correlation filter-based tracker with the capability of handling scale changes. The proposed tracker is extensively evaluated on the OTB-2013, OTB-2015 and VOT2015 benchmark datasets. Compared with state-of-the-art trackers, the experimental results demonstrate the effectiveness of the proposed framework.


Introduction
In recent years, visual tracking has been widely used in intelligent surveillance, autonomous driving, robotics and many other applications, and has become one of the most popular fields in computer vision. The research in this paper focuses on single-target, model-free tracking, where the object is given only at the start. The tracker is initialized with the location and size of the target in the first frame, and no explicit appearance or prior model can be used. With only the first frame available as a training sample, it is challenging to estimate the target trajectory throughout the sequence. At the same time, the target also suffers from occlusion, scale variation, motion blur and illumination changes during the tracking process.
To tackle the lack of training samples, most existing trackers adopt either generative [1][2][3] or discriminative [4][5][6] methods to learn the appearance model. Generative algorithms search candidate regions for the minimum reconstruction error to find the best matching position. Discriminative approaches locate the target by designing and training a classifier that distinguishes the background from the foreground. Among the discriminative tracking algorithms, correlation filter-based methods [4,6,7] have gained much attention for their high accuracy as well as high efficiency. Deep convolutional neural network (CNN) features have achieved great success in many computer vision tasks [8][9][10]. Correlation filters struggle to adapt to severe deformation and fast motion, so there is a rising trend of introducing CNN features into tracking frameworks to improve performance with the help of their rich feature representations. MDNet [11] adopts CNN models trained on tracking datasets including [12,13] and achieves better performance. HCF [14] integrates CNN features into the correlation filter framework for robust tracking.
However, the original correlation filtering algorithms [4,6,7] use fixed-size templates, which leads to a problem: when the size of the target changes drastically, the template either contains extra background or covers only part of the target. This may cause tracking failure when scale variation occurs together with other complicating factors such as background clutters, occlusion and motion blur. Meanwhile, many practical application scenarios require accurate target size information in the image. Extensive research [15][16][17][18][19] has been conducted on establishing a robust scale estimation strategy. Among them, the scale adaptive kernel correlation filter tracker with feature integration (SAMF) [18] and the discriminative scale space tracker (DSST) [19] are the most widely used approaches. SAMF is a straightforward method that estimates the scale by applying the standard learned two-dimensional filter to samples of multiple resolutions around the target; this exhaustive scale search strategy is computationally demanding. DSST [19] tackles the scale estimation problem by learning two separate correlation filters for explicit translation and scale estimation. First, the conventional discriminative correlation filter is employed to find the maximum response indicating the target position. Next, a separate scale detection model is trained to search for the optimal scale in a multi-scale spatial pyramid. In this way, the use of two independent filters avoids mutual interference. Although DSST addresses the scale estimation problem to some degree, the conventional correlation filter used for translation estimation still suffers from relatively low discriminative power.
Even with scale adaptation, correlation filter trackers are still not robust enough. Ensemble approaches [18,[20][21][22][23][24] have been developed as another way to improve performance by combining multiple trackers for visual tracking. For example, the ensemble method [22] under the boosting framework [25] incrementally trains the weak trackers of each component to classify the training samples that were previously misclassified. As one of the representative works, multiple experts using entropy minimization (MEEM) [20] demonstrates the potential of ensembles: it addresses the problem with a multi-expert restoration scheme to predict the target, where an entropy-based loss function determines the confidence of the current tracker. HDT [23] estimates the position of the target by fusing, in a coarse-to-fine scheme, the response maps of correlation filters trained on hierarchical convolutional features of various resolutions, each acting as a weak classifier; the final prediction is weighted by an adaptive hedging of the weak classifiers. In MCCT [24], Wang et al. introduce the concept of a feature pool comprising seven features, learn correlation filter tracking experts from the different features of the target, and finally select the most reliable one as the tracking result in each frame.
Although the impact of the model updater on performance is significant [26], very little research focuses on this component. The model updater determines the frequency and strategy of updating the model. Since only the samples of the first frame are guaranteed to be reliable, the tracker must maintain a tradeoff between collecting new samples during tracking and preventing drift towards the background. Most trackers update the model every frame. In [6], the criterion used to obtain the target position is the naive maximum response value, and the model is updated every frame with a moderate learning rate. Entropy minimization is adopted in [20] to identify reliable model updates and discard incorrect ones. Bolme et al. propose a simple measure of peak strength called the peak-to-sidelobe ratio (PSR) [7]. Wang et al. argue that the robustness of the maximum response value is heavily degraded in the presence of challenging factors such as motion blur and partial or full occlusion. Thus, in [27], instead of relying on the naive maximum response value alone, Wang et al. propose a criterion named average peak-to-correlation energy (APCE); the tracking model is updated only when both the maximum of the response map and the APCE are sufficiently large. The above methods either update every frame or directly discard unreliable samples. A reasonable update strategy should adjust the learning rate adaptively according to the confidence level of the sample, so that it neither contaminates the model nor loses information that may be useful for tracking.
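The two confidence indicators mentioned above can be sketched in a few lines of numpy. This is a minimal sketch following the common definitions of PSR [7] and APCE [27]; the sidelobe window size is an assumption, not a value from this paper.

```python
import numpy as np

def psr(response, peak_excl=5):
    """Peak-to-sidelobe ratio (MOSSE): peak height relative to the
    mean/std of the sidelobe, i.e. the map with a small window
    around the peak excluded. Window half-size is an assumption."""
    r = np.asarray(response, dtype=float)
    py, px = np.unravel_index(np.argmax(r), r.shape)
    peak = r[py, px]
    mask = np.ones_like(r, dtype=bool)
    mask[max(0, py - peak_excl):py + peak_excl + 1,
         max(0, px - peak_excl):px + peak_excl + 1] = False
    sidelobe = r[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-12)

def apce(response):
    """Average peak-to-correlation energy: squared peak-to-minimum
    range over the mean squared deviation from the minimum."""
    r = np.asarray(response, dtype=float)
    return (r.max() - r.min()) ** 2 / np.mean((r - r.min()) ** 2)
```

Both indicators grow when the response map has one sharp, isolated peak and shrink when the map fluctuates, which is the behaviour exploited by the update strategies above.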
To address the problems mentioned above, a multi-experts joint decision framework based on kernelized correlation filters is proposed to carry out robust visual tracking. The main contributions are summarized as follows. 1. First, our trackers are extended with the capability of estimating scale, so the size and position of the target can be obtained simultaneously. 2. Then, a multi-experts joint decision strategy based on kernelized correlation filters is presented. Handcrafted features (HOG [28], CN [29]) and CNN features are exploited to build a correlation filter bucket containing seven experts. By evaluating the total robustness score of each expert, the most reliable one is selected as the tracking result for each frame. 3. Next, a novel criterion, the peaks correlation of response map (PCRM), is proposed to evaluate the reliability of a sample. The PCRM values of the first three response maps are computed and weighted to obtain a confidence index for the current sample. 4. Finally, an adaptive model update strategy is proposed to alleviate sample contamination by considering both the PCRM of the sample and the divergence of the experts.
Extensive and comprehensive experiments are conducted on the widely used benchmarks OTB-2013 [30], VOT2015 [13] and OTB-2015 [12]. The results validate the improvement in success and precision rates of the proposed tracker.

Related Works
During the past few decades, substantial progress has been made in the field of visual object tracking. In this section, the works closely related to our method are summarized from three perspectives: tracking by correlation filters, tracker ensembles and model update strategies.

Tracking by Correlation Filters
Due to their high efficiency and accuracy, correlation filter-based trackers remain mainstream in practical applications. Bolme et al. [7] utilize the minimum output sum of squared error (MOSSE) to learn the correlation filters; by using circular correlation, the resulting filter can be computed efficiently with point-wise operations in the Fourier domain. Subsequently, in Henriques et al. [4], dense sampling is performed by efficiently exploiting the structure of the circulant matrix; while maintaining high speed, the discriminative ability of CSK is enhanced by the augmented negative samples. The above methods are based on grayscale features. The work is further extended to multi-channel HOG features in kernel space [6]. Staple [31] makes full use of the complementarity of color and gradient information while running faster than real-time. Danelljan et al. [32] introduce a spatially regularized component into the learning to penalize CF coefficients depending on their spatial locations and alleviate boundary effects. With the rising trend of introducing CNN features into the object tracking field, several trackers [14,33] use deep models pretrained for object classification as feature representations, and performance has been further improved. More recently, C-COT [34] achieves outstanding performance on several benchmarks by adopting a continuous convolution operator to fuse deep feature maps. After that, ECO [35] adds several strategies for combining deep and hand-crafted features to speed up the C-COT framework.
Further extensions, such as scale estimation [17,19] and long-term tracking [36], have also been added to the correlation filter framework.

Tracker Ensemble
According to the literature [26], ensemble approaches can improve performance substantially. In MEEM [20], entropy minimization is used to exploit the relationship between multiple experts and their historical snapshots. Then, in [37], Li et al. extend it by using a unified discrete graph algorithm to model the multiple experts. Qi et al. [23] propose an improved hedge algorithm that combines weak CNN-based trackers built on various convolutional layers into a single stronger tracker. Wang et al. [24] propose the multi-cue correlation filters framework, which constructs parallel experts from different features and selects the expert with the highest robustness score as the tracking result in each frame.

Model Update Strategy
Although the implementation of the model updater is often treated as a trick, its impact on performance is usually significant. Unfortunately, few works focus on this component [26]. Santner et al. propose parallel robust online simple tracking (PROST) [38], combining a simple template model as a non-adaptive element, a novel optical-flow-based mean-shift tracker as a highly adaptive element, and an online random forest as a moderately adaptive appearance-based learner. In MOSSE [7], a criterion called PSR is used to quantify the reliability of the tracked sample; Bolme et al. report that PSR values between 20.0 and 60.0 indicate very strong peaks. The MEEM [20] tracker is designed to identify reliable model updates and discard incorrect ones. In KCF [6], the model is updated every frame with a moderate learning rate. Wang et al. [27] propose to employ the maximum response value and the APCE as criteria to provide a high-confidence update strategy for robustness.

Methods
A multi-experts joint decision strategy with an adaptive model updater, based on the kernelized correlation filter, is proposed in this work for robust tracking. Firstly, the baseline of our trackers [6] adopts a fixed target size given in the first frame; therefore, a robust scale estimation approach [17] is employed to handle target scale changes. Secondly, handcrafted or deep features are extracted, and seven experts are obtained by splitting and combining these features; after the joint decision of the experts, the most reliable one is selected as the tracking result. Thirdly, a novel criterion called peaks correlation of response map (PCRM) is proposed. By evaluating the correlation between the maximum value and the other peaks of the response map, PCRM measures the confidence level of the sample. Finally, by considering the PCRM and the historical divergence of the experts, the presented model update strategy updates the model with an appropriate learning rate. The flowchart in Fig. 1 depicts the main framework of the proposed algorithm. Sect. 3.1 presents the formulation of our baseline tracker, the multi-channel kernelized correlation filter. The scale estimation approach used in our trackers is introduced in Sect. 3.2.

The Kernelized Correlation Filter Tracker
Since the discriminative ability of the KCF tracker is enhanced by the augmentation of negative samples while high speed is maintained by exploiting the structure of the circulant matrix, KCF has become the baseline of many trackers [39,40].
For notational simplicity, a one-dimensional signal is considered; more details can be found in [6]. Given one-dimensional data $\mathbf{x} = [x_1, x_2, \ldots, x_n]$, the training goal is to find $f(\mathbf{z}) = \mathbf{w}^{T}\mathbf{z}$ which minimizes the squared error over training samples $x_i$ and their regression targets $y_i$:

$$\min_{\mathbf{w}} \sum_{i}\left(f(x_i) - y_i\right)^2 + \lambda_1 \lVert \mathbf{w} \rVert^2 \tag{1}$$

The scalar $\lambda_1$ is a regularization parameter that controls overfitting. To allow a more powerful classifier with nonlinear regression functions $f(\mathbf{z})$, the solution is expressed as a combination of the samples:

$$\mathbf{w} = \sum_{i} \alpha_i\, \varphi(x_i) \tag{2}$$

where the $\alpha_i$ are the variables under optimization in the dual space and $\varphi(\cdot)$ represents a non-linear feature-space mapping; the optimized variables are therefore $\boldsymbol{\alpha}$ instead of $\mathbf{w}$. According to the literature [41], this alternative representation $\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ is said to be in the dual space, as opposed to the primal space $\mathbf{w}$.
The solution to the kernelized version of ridge regression can be obtained as

$$\boldsymbol{\alpha} = (K + \lambda_1 I)^{-1}\mathbf{y} \tag{3}$$

where $K$ is the kernel matrix containing elements $K_{ij} = \kappa(x_i, x_j)$, computed using the kernel function $\kappa$.
For the most commonly used kernels (e.g., Gaussian, linear and polynomial), the circulant matrix trick can be used to make Eq. (3) diagonal:

$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{xx} + \lambda_1} \tag{4}$$

where $\mathbf{k}^{xx}$ is the kernel correlation of $\mathbf{x}$ with itself, and the hat $\hat{\ }$ denotes the Discrete Fourier Transform (DFT) of a vector, $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$. The multiplications and divisions in Eq. (4) are performed element-wise. In our trackers, the Gaussian kernel is adopted for its high accuracy:

$$\mathbf{k}^{xx'} = \exp\left(-\frac{1}{\sigma^2}\left(\lVert \mathbf{x}\rVert^2 + \lVert \mathbf{x}'\rVert^2 - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right) \tag{5}$$

so the kernel function $\kappa(x_i, x_j)$ can be evaluated for all cyclic shifts at once. In the detection process, a patch $\mathbf{z}$ with the same size as $\mathbf{x}$ is extracted at the position provided by the previous frame, and the response map is calculated as

$$f(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{xz} \odot \hat{\boldsymbol{\alpha}}\right) \tag{6}$$

where $\hat{\mathbf{k}}^{xz}$ is the kernelized correlation between $\mathbf{x}$ and $\mathbf{z}$ as defined in Eq. (5), and $\hat{\boldsymbol{\alpha}}$ is obtained in the previous frame by Eq. (4). Then, the position of the object in the current frame is located by finding the translation with the maximum value in the response map $f(\mathbf{z})$.
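As a concrete illustration of the training and detection steps above, here is a minimal single-channel numpy sketch of the Fourier-domain computation (Eqs. (4)–(6)). The 2-D single-channel setting and the parameter values are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def gaussian_correlation(x, z, sigma=0.5):
    """Kernel correlation k^{xz} of Eq. (5), evaluated for all cyclic
    shifts at once via the FFT (single-channel, 2-D case)."""
    n = x.size
    # circular cross-correlation of x and z
    c = np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)).real
    d = (np.sum(x**2) + np.sum(z**2) - 2.0 * c) / n
    return np.exp(-np.maximum(d, 0) / sigma**2)

def train(x, y, sigma=0.5, lam=1e-4):
    """Eq. (4): alpha_hat = y_hat / (k^{xx}_hat + lambda)."""
    k = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alpha_hat, x, z, sigma=0.5):
    """Eq. (6): response map f(z) = F^{-1}(k^{xz}_hat * alpha_hat)."""
    k = gaussian_correlation(x, z, sigma)
    return np.fft.ifft2(np.fft.fft2(k) * alpha_hat).real
```

Training a filter on a patch with a Gaussian label and running detection on the same patch reproduces (approximately) the label, so the argmax of the response recovers the label's peak location.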
To avoid model corruption, KCF uses linear interpolation to update the model every frame:

$$\hat{\boldsymbol{\alpha}}_t = (1-\eta)\,\hat{\boldsymbol{\alpha}}_{t-1} + \eta\,\hat{\boldsymbol{\alpha}}, \qquad \hat{\mathbf{x}}_t = (1-\eta)\,\hat{\mathbf{x}}_{t-1} + \eta\,\hat{\mathbf{x}} \tag{7}$$

where $\eta$ is the learning rate and $t$ denotes the frame index of the image sequence; how to determine the value of $\eta$ will be discussed in Sect. 3.5. This puts more weight on recent frames and lets the effect of previous frames decay exponentially over time.
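The interpolation update is a single line of code; writing it out makes the exponential decay explicit: a frame seen $n$ updates ago contributes with weight proportional to $(1-\eta)^n$.

```python
def update(model_prev, model_new, eta=0.025):
    """Eq. (7): linear interpolation of the model. Recent frames
    dominate; the influence of a frame decays as (1 - eta)^age."""
    return (1.0 - eta) * model_prev + eta * model_new
```

Applied repeatedly with no new evidence (model_new = 0), the old model shrinks geometrically, which is exactly the exponential forgetting described above.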

Discriminative Scale Space Tracking
Our scale search scheme follows the DSST [17] tracker. Unlike SAMF [18], which uses one filter to determine translation and scale simultaneously, DSST applies two kinds of correlation filters: a two-dimensional translation filter for target localization and a one-dimensional scale filter for scale estimation, which are independent of each other.
In our trackers, KCF is employed to locate the target, and a separate one-dimensional filter is learned to estimate the scale. $S$ patches $I_n$ of size $a^{n}P \times a^{n}R$ are extracted centered around the target to construct the training sample $f_{t,scale}$, where $P \times R$ denotes the target size in the current frame, $S$ is the size of the scale filter, $n \in \left\{-\left\lfloor \frac{S-1}{2}\right\rfloor, \ldots, \left\lfloor \frac{S-1}{2}\right\rfloor\right\}$, and $a$ represents the scale factor between feature layers. The aim is to train a scale correlation filter $h_{scale}$ consisting of one filter $h^{l}$ per feature channel; this can be obtained by minimizing the $L_2$ error with respect to the desired output $g$, for which a one-dimensional Gaussian is adopted:

$$\varepsilon = \left\lVert g - \sum_{l=1}^{d} h^{l} \star f^{l} \right\rVert^{2} + \lambda_2 \sum_{l=1}^{d} \lVert h^{l} \rVert^{2} \tag{8}$$

Here, $\star$ denotes circular correlation and the second term is a regularization with weight parameter $\lambda_2$.
The value $f^{n}_{t,scale}$ of the training sample at scale level $n$ is set to the $d$-dimensional feature descriptor of $I_n$. The solution to the problem above is:

$$H^{l} = \frac{\bar{G}\,F^{l}}{\sum_{k=1}^{d} \bar{F}^{k} F^{k} + \lambda_2} \tag{9}$$

where the fraction denotes pointwise division. Similar to Eq. (7), the new sample $f_{t,scale}$ is used to update the numerator $A^{l}_{t}$ and denominator $B_{t}$ of the scale filter:

$$A^{l}_{t} = (1-\eta_s)\,A^{l}_{t-1} + \eta_s\,\bar{G}_t F^{l}_{t}, \qquad B_{t} = (1-\eta_s)\,B_{t-1} + \eta_s \sum_{k=1}^{d} \bar{F}^{k}_{t} F^{k}_{t}$$
Here, $\eta_s$ is a learning rate parameter. Numerical experiments show that $\eta_s = 0.01$ lets the filter adapt quickly to scale variation while remaining robust.
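The one-dimensional scale filter, its running-average update and its detection step can be sketched in numpy. This is a minimal sketch of the DSST-style scheme around Eq. (9): the conjugation convention (chosen so that the response to the training sample reproduces the desired output $g$) and the parameter values are assumptions.

```python
import numpy as np

def train_scale_filter(F, g, lam=1e-2):
    """F: d x S matrix (one row per feature channel, one column per
    scale level); g: desired 1-D Gaussian output. Returns the Fourier-
    domain numerator A (d x S) and denominator B (S,) of Eq. (9)."""
    Gf = np.fft.fft(g)
    Ff = np.fft.fft(F, axis=1)
    A = Gf[None, :] * np.conj(Ff)                 # per-channel numerator
    B = np.sum(Ff * np.conj(Ff), axis=0).real + lam
    return A, B

def update_scale_filter(A, B, F_new, g, eta=0.01, lam=1e-2):
    """Running average of numerator and denominator, as in Eq. (7)."""
    A_new, B_new = train_scale_filter(F_new, g, lam)
    return (1 - eta) * A + eta * A_new, (1 - eta) * B + eta * B_new

def detect_scale(A, B, Z):
    """Response over the S scale levels; argmax is the scale estimate."""
    Zf = np.fft.fft(Z, axis=1)
    resp = np.fft.ifft(np.sum(A * Zf, axis=0) / B).real
    return int(np.argmax(resp))
```

Running detection on the training sample itself returns the index of the peak of $g$, i.e. the unchanged-scale level, which is the sanity check used below.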
To apply the filter in a new frame $t$, a test sample $z_{t,scale}$ is extracted at the location determined by KCF, using the same procedure as for the training sample $f_{t,scale}$; the scale level with the maximum filter response is then taken as the current target scale.

Multi-experts Construction
HOG [28], gray and ColorNames [29] (CN) are the most popular handcrafted features in the tracking field because of their high extraction efficiency. HOG features (histograms of oriented gradients) are constructed by calculating and accumulating gradient direction histograms over local regions of the image, and reflect the edge and shape information of a region block. CN features have rich expressiveness and high distinctiveness; they are obtained by transforming RGB space into CN space, which reflects 11-dimensional thematic color information of the region [29]. Gray features are simple features that contain only brightness information. Different from handcrafted features, CNN features contain rich high-level semantic information and are strong at distinguishing objects of different categories. In our handcrafted-features tracker, only two low-level features (HOG and CN) are adopted to build the experts. Since diversity is crucial in ensemble methods [26], to create more experts the 32-dimensional HOG is divided into two 16-dimensional features, called HOG 1 and HOG 2. Through permutation and combination of these features, the experts in the feature bucket are obtained. In our deep-features tracker, HOG, conv4-4 and conv5-4 are extracted as the low-, middle- and high-level features, respectively. Details about the experts are shown in Table 1. After these seven experts are generated, each provides a result (bounding box) from its own perspective; how to choose a reliable one is discussed in Sect. 3.4.
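Table 1 gives the exact expert definitions. As an illustrative sketch only: three base features (HOG 1, HOG 2 and CN in the handcrafted version) admit exactly $2^3 - 1 = 7$ non-empty combinations, which matches the number of experts; whether these are the precise combinations used is an assumption here.

```python
from itertools import combinations

# Handcrafted version; the deep version uses {HOG, conv4-4, conv5-4}.
# The specific pairings are assumptions -- see Table 1 for the real ones.
FEATURES = ["HOG_1", "HOG_2", "CN"]

def build_expert_bucket(features):
    """Enumerate all non-empty feature combinations: 2^3 - 1 = 7."""
    bucket = []
    for r in range(1, len(features) + 1):
        for combo in combinations(features, r):
            bucket.append(combo)
    return bucket

experts = build_expert_bucket(FEATURES)
assert len(experts) == 7
```

Each expert then trains its own kernelized correlation filter on the concatenated channels of its feature combination.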

Ensemble Tracking
In each frame, the seven experts track the target and generate cues (bounding boxes) simultaneously. Inspired by MCCT [24], pair-evaluation and self-evaluation [24] are adopted to evaluate their degree of robustness, and the expert with the highest score is selected as the tracking result of the current frame. The procedure of the ensemble tracking framework is shown in Fig. 2.

Pair-Evaluation
$E_1$ to $E_7$ denote Expert 1 to Expert 7, respectively. Every expert is treated as a black box, and the bounding box of Expert $i$ in the $t$-th frame is written as $B_t^{E_i}$, a four-dimensional vector (position and size). The overlap ratio between Expert $i$ and Expert $j$ in the $t$-th frame is defined as

$$O_t^{(E_i,E_j)} = \frac{\left|B_t^{E_i} \cap B_t^{E_j}\right|}{\left|B_t^{E_i} \cup B_t^{E_j}\right|}$$

The fluctuation extent of the overlap ratios over a short period $\Delta t$ (e.g., 5 frames in [24]) reveals the stability of the overlap evaluation between $E_i$ and the other experts. However, $O_t^{(E_i,E_j)}$ only represents the overlap ratio between two experts in the present $t$-th frame, so previous overlap ratios should also be taken into consideration. The mean $M_t(E_i)$ of the overlap ratios of Expert $i$ over the $K = 7$ experts reflects the consistency between Expert $i$ and the others, and their variance reflects its volatility. Within a short period, the closer a frame is to the current one, the more its score should count; thus an increasing sequence $\boldsymbol{\rho} = \{\rho^0, \rho^1, \ldots, \rho^{\Delta t - 1}\}$ ($\rho > 1$) puts more confidence on recent scores. The weighted mean $M_t^{w}(E_i)$ and weighted standard deviation $V_t^{w}(E_i)$ are computed over the last $\Delta t$ frames with weights $W_\tau$, where $W_\tau$ denotes the $(\tau - t + \Delta t)$-th element of the sequence $\boldsymbol{\rho}$ and $N = \sum_\tau W_\tau$ is the normalization factor. The pair-evaluation score of Expert $i$ in the $t$-th frame is then computed as

$$R_t^{pair}(E_i) = \frac{M_t^{w}(E_i)}{V_t^{w}(E_i) + \xi} \tag{15}$$

where the small constant $\xi$ avoids an infinite pair-wise robustness score when the denominator is zero. Equation (15) indicates that a higher value of $R_t^{pair}(E_i)$ means greater consistency and less volatility among the experts.
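The pair-evaluation logic can be sketched in numpy. Because parts of the formulas are garbled in the source, the exact weighting details below are assumptions; only the overall shape (IoU of expert boxes, recency-weighted mean over volatility) follows the text, with $\rho = 1.1$ from Table 2.

```python
import numpy as np

def iou(b1, b2):
    """Overlap ratio between two (x, y, w, h) bounding boxes."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def pair_score(overlap_hist, rho=1.1, eps=1e-6):
    """overlap_hist: mean overlap of one expert vs. the others over the
    last dt frames, oldest first. Recent frames get weights rho^k;
    score = weighted mean / (weighted std + eps), so high and stable
    agreement with the other experts scores best."""
    o = np.asarray(overlap_hist, dtype=float)
    w = rho ** np.arange(len(o))
    w /= w.sum()
    mean = np.sum(w * o)
    std = np.sqrt(np.sum(w * (o - mean) ** 2))
    return mean / (std + eps)
```

An expert that agrees consistently with its peers therefore outscores one whose agreement oscillates, even at a similar average overlap.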

Self-Evaluation
The Euclidean distance between the bounding box $B_{t-1}^{E_i}$ in the $(t-1)$-th frame and $B_t^{E_i}$ in the $t$-th frame reflects the reliability of each expert's output. The trajectory smoothness degree of Expert $i$ is given as

$$S_t(E_i) = \exp\left(-\frac{\left\lVert c_t^{E_i} - c_{t-1}^{E_i} \right\rVert^2}{2\,\sigma_i^2}\right)$$

where $c_t^{E_i}$ is the center of $B_t^{E_i}$ and $\sigma_i$ is set according to the width and height of Expert $i$. As mentioned before, to damp performance fluctuations of the experts, scores over the short term should be considered; thus the self-wise expert trajectory smoothness score $R_t^{self}(E_i)$ is the weighted average of $S_\tau(E_i)$ over the last $\Delta t$ frames. A higher self-evaluation score means better reliability of the tracking trajectory.
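A numpy sketch of the self-evaluation: center shift between consecutive frames, squashed by a Gaussian and averaged with the same recency weighting as the pair-evaluation. The Gaussian kernel width (tied here to the box diagonal) is an assumption; the source only states that it depends on the expert's width and height.

```python
import numpy as np

def smoothness(prev_box, cur_box, sigma_scale=2.0):
    """Per-frame trajectory smoothness for (x, y, w, h) boxes: a small
    center shift relative to the box size maps to a score near 1.
    The kernel width (diagonal / sigma_scale) is an assumption."""
    dx = (cur_box[0] + cur_box[2] / 2) - (prev_box[0] + prev_box[2] / 2)
    dy = (cur_box[1] + cur_box[3] / 2) - (prev_box[1] + prev_box[3] / 2)
    dist = np.hypot(dx, dy)
    scale = np.hypot(cur_box[2], cur_box[3]) / sigma_scale
    return float(np.exp(-0.5 * (dist / scale) ** 2))

def self_score(box_hist, rho=1.1):
    """Recency-weighted average of per-frame smoothness over the last
    dt frames (weights rho^k, recent frames weighted more)."""
    s = [smoothness(a, b) for a, b in zip(box_hist[:-1], box_hist[1:])]
    w = rho ** np.arange(len(s))
    return float(np.sum(w * np.asarray(s)) / w.sum())
```

A smoothly drifting trajectory thus scores close to 1, while a trajectory that jumps across the frame scores near 0.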

Joint Decision
The final robustness score $R_t(E_i)$ of Expert $i$ in frame $t$ weights the self-evaluation score $R_t^{self}(E_i)$ and the pair-evaluation score $R_t^{pair}(E_i)$ by a coefficient $\mu$:

$$R_t(E_i) = \mu\, R_t^{self}(E_i) + (1-\mu)\, R_t^{pair}(E_i)$$

Finally, the expert with the highest final robustness score is selected as the output result in each frame.
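The joint decision itself is a weighted blend plus an argmax. Which score receives $\mu$ versus $1-\mu$ is not recoverable from the source, so the convention below is an assumption ($\mu = 0.1$ is the weighting factor from Table 2).

```python
def joint_decision(pair_scores, self_scores, mu=0.1):
    """Blend self- and pair-evaluation per expert (weighting convention
    assumed) and return the index of the winning expert."""
    final = [mu * s + (1 - mu) * p
             for p, s in zip(pair_scores, self_scores)]
    return max(range(len(final)), key=lambda i: final[i])
```

The winning expert's bounding box becomes the frame's output and is shared back with the other experts, as described below.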
The main advantage of this ensemble method is that only two feature extractions (the heaviest computational burden in the tracking process) are needed per frame, one for training by Eq. (4) and the other for detection by Eq. (6), instead of fourteen ($K = 7$, $7 \times 2 = 14$). This approach considers both diversity and effectiveness, so our trackers maintain real-time performance while achieving high accuracy. Furthermore, by sharing the rectified target position and model update, the drift and tracking failure of weak experts are effectively alleviated.

A Novel Model Updater
The model updater determines both the strategy and frequency of model updates. Most existing trackers adopt one of two methods. (1) Schemes like [4,6,17,31] update the tracking model every frame with a constant learning rate, without considering whether the sample is credible. This may cause tracking failure due to model corruption when the target is detected inaccurately, severely occluded or totally missing in the current frame. (2) Approaches like [7,27] use indicators (PSR and APCE, respectively) to assess the fluctuation of the response map and update the model only when the indicator meets certain conditions. This alleviates tracking failure caused by model corruption; however, the learning rate is still constant and cannot fully adapt to particular scenes. In addition, samples that are discarded for not meeting the conditions can still be valuable. In our trackers, we establish an adaptive update strategy that utilizes the feedback of the tracking results. Many experiments demonstrate that the number and values of the peaks of the response map reflect the confidence of the tracking result. The ideal response map has a single sharp peak, with the remaining area relatively flat; the sharper the peak, the better the tracking accuracy. On the contrary, when the response map contains several peaks and fluctuates severely, its pattern differs significantly from the ideal one, and if the updater still adopts the same learning rate, model corruption will lead to tracking failure. Therefore, we propose a feedback-driven adaptive updating mechanism built on a criterion called peaks correlation of response map (PCRM), defined as

$$\mathrm{PCRM} = \frac{\left| R_{\max} - R_{\min} \right|^{2}}{\frac{1}{d}\sum_{i=1}^{d}\left( R_{peaks}^{i} - R_{\min} \right)^{2}}$$

where $R_{\max}$ and $R_{\min}$ denote the maximum and minimum of the response map $f(\mathbf{z})$ in Eq. (6), respectively, and $R_{peaks}^{i} \in \{R_{peaks}^{1}, \ldots, R_{peaks}^{d}\}$ are the peak values of the response map, with $d$ the number of peaks. PCRM reflects the fluctuation degree of the response map and the confidence level of the detected target. When the target appears completely and clearly in the detection area, the response map resembles a cone with one sharp peak descending smoothly to a relatively flat area, and the PCRM becomes large. Otherwise, the PCRM decreases significantly if the object is occluded or missing.
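The criterion can be sketched in numpy: detect the local maxima of the response map, then relate the global peak to all peak values. Because the PCRM formula is partially garbled in the source, the APCE-like form used here (squared peak-to-minimum range over the mean squared deviation of the $d$ peak values from the minimum) is a reconstruction, not a verbatim reproduction; the peak detector is likewise an assumption.

```python
import numpy as np

def find_peaks_2d(r):
    """Strict local maxima of a 2-D response map (interior points)."""
    c = r[1:-1, 1:-1]
    is_peak = ((c > r[:-2, 1:-1]) & (c > r[2:, 1:-1]) &
               (c > r[1:-1, :-2]) & (c > r[1:-1, 2:]) &
               (c > r[:-2, :-2]) & (c > r[:-2, 2:]) &
               (c > r[2:, :-2]) & (c > r[2:, 2:]))
    return c[is_peak]

def pcrm(response):
    """Peaks correlation of response map (reconstructed form): large
    when one dominant peak towers over small sidelobe peaks, small
    when several comparable peaks compete."""
    r = np.asarray(response, dtype=float)
    peaks = find_peaks_2d(r)
    if peaks.size == 0:
        peaks = np.array([r.max()])
    return (r.max() - r.min()) ** 2 / np.mean((peaks - r.min()) ** 2)
```

With one dominant peak and only weak sidelobe peaks the denominator stays small and PCRM is high; under occlusion, several comparable peaks push PCRM toward 1, matching the qualitative behaviour described above.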
Seven response maps are generated by the seven experts; since they are largely redundant, only the first three experts are used to compute a weighted PCRM over the different features to evaluate the $t$-th tracking result:

$$\mathrm{PCRM}_t = \sum_{i=1}^{3} w_i\, \mathrm{PCRM}(E_i)$$

where $\mathrm{PCRM}(E_i)$ denotes the PCRM value of the response map of Expert $i$ and the $w_i$ are the weights. Compared with the plain average, the weighted PCRM reflects the overall fluctuation more comprehensively. When occlusion or severe deformation occurs, the PCRM drops rapidly, and our experiments show that the average robustness score of the experts, $R_t^{mean} = \frac{1}{K}\sum_{i=1}^{K} R_t(E_i)$, decreases significantly as well, indicating that the experts diverge when they encounter unreliable samples. By integrating the weighted PCRM and the average robustness score, a comprehensive criterion $SC_t = \mathrm{PCRM}_t \cdot R_t^{mean}$, called the sample confidence score, is obtained. Considering that KCF learns both the target and the background of the sample through dense sampling, even unreliable samples have value, so it is unreasonable to discard them outright. When the current sample confidence score $SC_t$ falls far below its historical mean $SC_{mean}^{1:t} = \frac{1}{t}\sum_{i=1}^{t} SC_i$, the learning rate $\eta$ in Eq. (7) is determined as follows:

$$\eta = \begin{cases} lr, & SC_t \geq \beta\, SC_{mean}^{1:t} \\[4pt] lr \cdot \left( \dfrac{SC_t}{\beta\, SC_{mean}^{1:t}} \right)^{\gamma}, & \text{otherwise} \end{cases} \tag{20}$$

where $lr$ is the constant learning rate of the original KCF, and $\beta$ and $\gamma$ are the confidence threshold and the power exponent of the power function, respectively. This update system effectively prevents tracking failure by penalizing samples with low sample confidence scores. Figure 3 illustrates the mechanism of the proposed update strategy. As shown in Fig. 3, at the beginning the response map exhibits one ideal sharp peak, as the target is not occluded, surrounded only by low-energy regional peaks; the value of PCRM is therefore relatively large, and the model is updated with a medium learning rate.
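The adaptive learning-rate rule can be sketched as a small function. Since Eq. (20) is elided in the source, the piecewise power-function form below is a plausible reconstruction from the surrounding description ($lr$, a confidence threshold $\beta$, and a power exponent $\gamma$, with $\beta = 0.6$, $\gamma = 3$ from Table 2), not the authors' verbatim formula.

```python
def adaptive_lr(sc_t, sc_hist_mean, lr=0.025, beta=0.6, gamma=3):
    """Adaptive learning rate (reconstructed Eq. (20)): when the sample
    confidence score SC_t falls below beta times its historical mean,
    the rate is shrunk by a power function instead of discarding the
    sample outright."""
    if sc_t >= beta * sc_hist_mean:
        return lr
    return lr * (sc_t / (beta * sc_hist_mean)) ** gamma
```

Low-confidence samples thus still contribute, but with a rate orders of magnitude smaller than $lr$, which keeps the model from being corrupted while preserving potentially useful information.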
When the target is severely occluded, the response map fluctuates fiercely (second row), so the PCRM drops to 5.84 and the learning rate is adaptively computed as $\eta = 3.23 \times 10^{-4}$. It should be noted that under this circumstance the unreliable samples, which may contain information valuable for later tracking, are not simply discarded. By combining the PCRM and the historical robustness score, the model is updated with a low learning rate in this frame under the proposed strategy; the tracking model is therefore not corrupted, and the target can be tracked successfully in subsequent frames. Figure 4 intuitively shows the PCRM and learning-rate distribution on the basketball sequence. The athlete is fully and partially occluded in the 17-th and 54-th frames, respectively, and the corresponding PCRM and learning-rate values drop to low points. The subsequent low points indicate that the proposed model update strategy also reacts in time to rotation, deformation, illumination variation and background clutter. To validate the effectiveness of our model updater, further experiments are conducted in the following section.
An overview of our trackers is summarized in Algorithm 1.

Experiments
In this section, comprehensive experiments are conducted to evaluate our method. Firstly, the implementation details of our trackers are described. Secondly, the effectiveness of the model updater in our trackers is validated by comparison with alternative update schemes. We first conduct experiments on two benchmark datasets, OTB-2013 [30] and OTB-2015 [12]. The former has 51 video sequences, and the latter extends the set to 100. All these sequences are annotated with 11 attributes covering various challenging factors: illumination variation (IV), motion blur (MB), deformation (DEF), fast motion (FM), out-of-plane rotation (OPR), scale variation (SV), occlusion (OCC), background clutters (BC), out-of-view (OV), in-plane rotation (IPR) and low resolution (LR).
Two indicators are used: the success plot and the precision plot. The success plot reports the percentage of frames in which the overlap rate between the tracked bounding box and the ground truth exceeds a given threshold. The precision plot reports the percentage of frames in which the distance (in pixels) between the centers of the output bounding box and the ground truth is less than a given threshold. To rank the trackers, two metrics are used: the representative precision score at the threshold of 20 pixels for the distance precision plot (DP), and the area under the curve (AUC) for the success plot.
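Given per-frame overlaps and center errors, both metrics reduce to a few lines of numpy. This is a minimal sketch of the standard OTB-style scoring; the number of success thresholds sampled is an assumption.

```python
import numpy as np

def success_auc(ious, n_thresholds=21):
    """Success plot: fraction of frames whose overlap exceeds each
    threshold in [0, 1]; the AUC score is the mean over thresholds."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    curve = [(np.asarray(ious) > t).mean() for t in thresholds]
    return float(np.mean(curve))

def precision_at(center_errors, thresh=20.0):
    """Precision plot value at the representative 20-pixel threshold:
    fraction of frames whose center error is within `thresh` pixels."""
    return float((np.asarray(center_errors) <= thresh).mean())
```

A tracker with uniformly higher overlap gets a strictly higher AUC, and DP is just the precision curve sampled at 20 pixels.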
For fair evaluation, the third dataset VOT2015 [13] is also used, which contains 60 annotated sequences.

Implementation Details
The regularization parameters in Eqs. (4) and (9) are set to $\lambda_1 = 0.0001$ and $\lambda_2 = 0.01$, respectively. The learning rate in Eq. (10) is set to $lr = 0.025$. The number of scales $S$ is 33 and the scale factor $a$ is 1.02. For ensemble tracking, the parameter $\rho$ of the weight sequence is set to 1.1 and the weighting factor $\mu$ is set to 0.1. In the model updater, $\beta$ and $\gamma$ in Eq. (20) are set to 0.6 and 3, respectively. All experts adopt the same parameters. All mentioned parameters are listed in Table 2.
Our experiments are implemented in MATLAB 2019a on a computer with an Intel I5-3450 3.1 GHz CPU and 16 GB RAM. The MatConvNet toolbox [42] is used for extracting the deep features from VGG-19 [8]. Our deep-features tracker runs at about 1.5 FPS on the OTB basketball sequence. The speed of our handcrafted-features tracker is about 25 FPS on the same sequence, which is sufficient for real-time applications.

Analyses of Our Trackers
To evaluate the effectiveness of each component in our framework, we compare our trackers with different versions of themselves on OTB-2013 and OTB-2015. Our trackers are denoted as Ours and Ours_deep. We first compare them with Expert 7, denoted Expert7. Then, to demonstrate the effect of the update mechanism, popular update methods are embedded into our trackers, namely PSR from [7], APCE from [27] and the interpolation of the original KCF [6], denoted Ours_with_PSR, Ours_with_APCE and Ours_with_Interpolation, respectively. Among all compared trackers, only Ours_deep adopts CNN features.
As shown in Fig. 5, our trackers Ours and Ours_deep achieve the best tracking accuracy and robustness on both OTB-2013 and OTB-2015. Ours clearly outperforms Expert7; it is worth mentioning that Expert7 alone already performs quite well, and our ensemble method still improves on it by about 5%. Besides, Ours_with_Interpolation updates the model every frame with a constant learning rate via Eq. (7), while Ours_with_APCE and Ours_with_PSR simply discard unreliable samples, which may nevertheless be valuable to the tracker. Because of these limitations, all three obtain inferior precision and success scores. In contrast, our novel strategy, which considers both the fluctuation degree of the response map and the divergence among the seven experts, boosts performance further.
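For reference, the two confidence criteria compared above can be sketched as follows: PSR as introduced in [7] and APCE as in [27]. This is our own Python illustration; the sidelobe window size is chosen arbitrarily:

```python
import numpy as np

def psr(response, exclude=5):
    """Peak-to-Sidelobe Ratio [7]: peak height relative to the mean
    and std of the sidelobe (the response with a small window around
    the peak excluded)."""
    peak = response.max()
    r, c = np.unravel_index(response.argmax(), response.shape)
    mask = np.ones_like(response, dtype=bool)
    mask[max(0, r - exclude):r + exclude + 1,
         max(0, c - exclude):c + exclude + 1] = False
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-12)

def apce(response):
    """Average Peak-to-Correlation Energy [27]: how sharply the
    response peaks above its overall fluctuation."""
    fmax, fmin = response.max(), response.min()
    return (fmax - fmin) ** 2 / (np.mean((response - fmin) ** 2) + 1e-12)
```

Both scores are high for a single sharp peak (a confident detection) and low for a flat or multi-peaked map; thresholding them, as Ours_with_PSR and Ours_with_APCE do, simply skips the update when the score is low.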

OTB-2013 Dataset
In general, according to the evaluation metrics of OTB-2013, the one-pass evaluation (OPE) scores in the precision and success plots are shown in Fig. 6; "(deep)" in the legend indicates that a tracker is based on deep learning. As shown in the plots, our approach achieves promising results compared with many advanced trackers. Ours reaches a 64.3% success rate and an 84.4% precision rate. With the help of CNN features, Ours_deep achieves a 67.5% success rate and an 89.0% precision rate, ranking fourth and fifth, respectively, among all compared trackers. As the baseline of our trackers, KCF obtains a 51.4% success rate and a 74.0% precision rate as reported, while DSST, from which our method borrows the scale estimation, obtains 56.5% and 75.4%, respectively. These observations indicate that the proposed framework works better than both of its building blocks. In particular, MEEM, which like our approach is based on an ensemble of historical trackers, is exceeded significantly by our ensemble mechanism: by 7.7% in AUC score and 1.4% in DP score. The proposed trackers also show comparable performance with the state-of-the-art trackers MDNet [11], ECO [35], C-COT [34] and ADNet [43] in both precision and success rate.

OTB-2015 Dataset
To further validate the effectiveness of our trackers, we conduct experiments on the larger OTB-2015 dataset, which contains 100 annotated sequences and is thus more comprehensive than its predecessor. As shown in Fig. 7, the top fifteen trackers are colored in the plots, and DP scores for precision and AUC scores for success are reported in the legends. The proposed tracker Ours_deep achieves a DP score of 88.3% and an AUC score of 66.3%, ranking fourth in both criteria. The scores of Ours are lower only than those of the deep-feature-based trackers ECO, MDNet, C-COT and ADNet, and Ours still ranks higher than MEEM in both plots. It is worth mentioning that in this more comprehensive evaluation, our handcrafted-features method provides gains of 19.4% and 20.1% in DP score, and 28.9% and 18.7% in AUC score, over KCF and DSST, respectively. This again demonstrates the effectiveness and validity of our framework. In general, the proposed trackers are competitive on the OTB benchmark.

Attribute-Based Comparison
We further use the image sequences annotated with eleven attributes to comprehensively evaluate tracker performance in different scenarios. Figure 8 shows the AUC scores for the eleven attributes; we use AUC because it measures tracker performance more accurately than the DP score, which relies on a single threshold. For clarity, the results of the top fifteen trackers are reported in the legend. As illustrated in the plots, the proposed trackers achieve excellent results on most attributes. On sequences annotated with the scale variation attribute, our handcrafted-features approach outperforms DSST, owing to the joint decision strategy built on our highly discriminative kernelized correlation filters. Moreover, our trackers are at the forefront on the three attributes of occlusion, out of view and background clutters, which shows that the proposed model updater boosts performance considerably in these distractive scenarios. In addition, targets in sequences annotated with out-of-plane rotation and in-plane rotation exhibit multiple views, so the strength and frequency of model updates are particularly critical.
Thanks to the proposed model updater, our trackers handle both attributes well. Our approach also provides favorable results on the attributes of deformation, illumination variation, fast motion and motion blur.

Qualitative Evaluation
Qualitative comparisons of our approach with other trackers on twelve image sequences are shown in Fig. 9. In Dog1 and Singer1, both ACFN and LMCF suffer from significant scale drift in the presence of fast scale change and illumination variation, while our approach performs well. Although Staple adapts to the scale variation and in-plane rotation in Dog1 and Singer1, it does not perform well under the occlusion, background clutters and fast motion in Jogging2 and DragonBaby. In Girl2, when the adult completely blocks the girl, DSST and most other trackers drift away because of the occlusion, whereas our proposed model updater avoids model corruption: after the girl reappears, our trackers correct the drift and continue to track the real target. A similar phenomenon can be observed in Jogging2 and Skating2. This demonstrates that the superior performance of our trackers stems not only from ensemble tracking but also from the model update scheme. Diving, Bird1, Skiing, MotorRolling and Biker are among the most challenging sequences in OTB; with the boost of CNN features, our deep version tracker can track these targets, and even performs better than ECO in Bird1 and MotorRolling.

VOT2015 Dataset
Fig. 8 The success plots for attribute-based evaluation of trackers on OTB-2015. The AUC scores for the top fifteen trackers are reported in the legend; the number of videos associated with each attribute is in parentheses above each plot.

For completeness, we also present evaluation results on the VOT2015 dataset [13], which contains 60 sequences. The results are illustrated in the accuracy-robustness ranking plot in Fig. 10, and the accuracy and failure counts, as well as the expected overlap, for dozens of competitive trackers are listed in Table 3. From the plot, it can be observed that our deep version tracker resides in the top right corner, meaning that only MDNet (the VOT2015 winner) ranks higher than Ours_deep. It is worth noting that our handcrafted-features tracker outperforms most of the compared trackers. Because they rely on online iterative optimization, MDNet and DeepSRDCF run at below 1 FPS, far from real-time requirements. In contrast, Ours_deep runs at about 1.5 FPS, and our handcrafted-features tracker reaches 25 FPS, much faster than the trackers mentioned above. In addition, the proposed method ranks higher than KCF, MEEM and DSST, which again demonstrates the effectiveness of the proposed framework.

Conclusion
In this paper, a multi-experts joint decision framework for visual tracking with an embedded adaptive model updater is proposed, which fully exploits the strength of multiple features not only at the feature level but also at the decision level, using the high discriminative power of kernelized correlation filters. Moreover, our trackers are extended with an effective scale estimation approach to address the problem of a fixed template size. Furthermore, a novel criterion called peaks correlation of response map (PCRM) is proposed to assess the confidence of a sample through its response map, and an adaptive model update strategy is established by considering both PCRM and the historical robustness scores of the experts to alleviate the model corruption problem. Extensive experiments are conducted on three widely used datasets. We compare our approach with state-of-the-art trackers on OTB-2013 and OTB-2015, and the results show the effectiveness and validity of the components in our trackers. The proposed trackers rank near the top in most kinds of evaluation. Our approach achieves outstanding results on VOT2015 as well. The conducted experiments demonstrate that the proposed trackers perform competitively against state-of-the-art approaches. It is worth emphasizing that the proposed approach not only performs well, but also runs at high speed on average machines, meeting the demands of real-time application scenarios.

Fig. 9 Comparison of the proposed method with the state-of-the-art trackers ECO [35], ADNet [43], LMCF [27], MEEM [20], Staple [31], KCF [6] and DSST [17] on OTB-2013 and OTB-2015.

Fig. 10 The AR ranking plot for the baseline. The accuracy and robustness rankings are plotted along the vertical and horizontal axes, respectively. Our trackers are denoted by the red circle and yellow cross. A tracker is better if it is closer to the top right corner of the plot.

Table 3 Accuracy, average number of failures and expected average overlap of state-of-the-art trackers on VOT2015 [13].