1 Introduction

In recent years, visual tracking has been widely used in intelligent surveillance, autonomous driving, robotics and many other applications, and has become one of the most popular fields in computer vision. This work focuses on single-target, model-free tracking: the tracker is initialized with the location and size of the target in the first frame, and no explicit appearance or prior model can be used. With only the first frame available as a training sample, it is challenging to estimate the target trajectory throughout the sequence. At the same time, the target suffers from occlusions, scale variations, motion blur and illumination changes during the tracking process.

To tackle the lack of training samples, most existing trackers adopt either generative [1,2,3] or discriminative [4,5,6] methods to learn an appearance model. Generative algorithms search candidate regions for the best-matching position by minimizing the reconstruction error. Discriminative approaches locate the target by designing and training a classifier to distinguish the foreground from the background. Among discriminative tracking algorithms, correlation filter-based methods [4, 6, 7] have gained much attention for their high accuracy as well as high efficiency. Deep convolutional neural network (CNN) features have achieved great success in many computer vision tasks [8,9,10]. Correlation filters struggle to adapt to severe deformation and fast motion, so there is a rising trend of introducing CNN features into tracking frameworks to improve performance with the help of their rich feature representations. MDNet [11] adopts CNN models trained on tracking datasets including [12, 13] and achieves better performance. HCF [14] integrates CNN features into the correlation filter framework for robust tracking.

However, the original correlation filtering algorithms [4, 6, 7] use fixed-size templates, which leads to a problem: when the size of the target changes drastically, the template either contains extra background or covers only a part of the target. This may cause tracking failure when scale variation occurs together with other complicating factors such as background clutter, occlusion and motion blur. Meanwhile, many practical application scenarios require accurate target size information in the image. Extensive research [15,16,17,18,19] has been conducted on how to establish a robust scale estimation strategy. Among these, the scale adaptive kernel correlation filter tracker with feature integration (SAMF) [18] and the discriminative scale space tracker (DSST) [19] are the most widely used approaches. SAMF estimates the scale in a straightforward way, by applying the learned standard two-dimensional filter to samples at multiple resolutions around the target; this exhaustive scale search strategy is computationally demanding. DSST [17] tackles the scale estimation problem by learning two separate correlation filters for explicit translation and scale estimation. First, the conventional discriminative correlation filter is employed to find the maximum response, which indicates the target position. Next, a separate scale detection model is trained to search for the optimal scale in a multi-scale spatial pyramid. In this way, employing two independent filters avoids mutual interference. Although DSST addresses the scale estimation problem to some degree, the conventional correlation filter used for translation estimation still suffers from relatively low discriminative power.

Correlation filter trackers with scale estimation are still not strong enough on their own. Ensemble approaches [18, 20,21,22,23,24] have been developed as another way to improve performance by combining multiple trackers for visual tracking. For example, the ensemble method [22] under the boosting framework [25] incrementally trains weak component trackers to classify the training samples that were previously misclassified. As one of the representative works, multiple experts using entropy minimization (MEEM) [20] demonstrates the potential of ensembles: it predicts the target with a multi-expert restoration scheme, where an entropy-based loss function determines the confidence of the current tracker. HDT [23] estimates the position of the target by fusing, in a coarse-to-fine scheme, the response maps of correlation filters trained on hierarchical convolutional features of various resolutions, each acting as a weak classifier. The final prediction is a weighted combination obtained by adaptively hedging the weak classifiers. In MCCT [24], Wang et al. introduce the concept of a feature pool containing seven features, learn a correlation filter tracking expert for each, and select the most reliable one as the tracking result in each frame.

Although the impact of the model updater on performance is significant [26], few studies focus on this component. The model updater determines the frequency and strategy of model updates. Since only the samples from the first frame are fully reliable, the tracker must maintain a tradeoff between collecting new samples during tracking and preventing the tracker from drifting to the background. Most trackers update the model every frame. In [6], the criterion used to obtain the target position is the naive maximum response value, and the model is updated every frame with a moderate learning rate. Entropy minimization is adopted in [20] to identify reliable model updates and discard incorrect ones. Bolme et al. propose a simple measurement of peak strength called the peak-to-sidelobe ratio (PSR) [7]. Wang et al. argue that the robustness of the maximum response value degrades heavily in the presence of challenging factors such as motion blur and partial or full occlusion. Thus, instead of relying on the naive maximum response value for translation estimation, Wang et al. [27] propose a criterion named average peak-to-correlation energy (APCE): the tracking model is updated only when both the maximum of the response map and the APCE are large enough. The above methods either update every frame or directly discard unreliable samples. A reasonable update strategy should adjust the learning rate adaptively according to the confidence level of the sample, so that the sample neither contaminates the model nor loses information that may be useful for tracking.

To address the problems mentioned above, a multi-experts joint decision framework based on kernelized correlation filters is proposed to carry out robust visual tracking. The main contributions are summarized as follows:

  1. First, our trackers are extended with the capability of estimating scale, so that the size and position of the target can be obtained simultaneously.

  2. Then, a multi-experts joint decision strategy based on kernelized correlation filters is presented. Handcrafted features (HOG [28], CN [29]) and CNN features are exploited to build a correlation filter bucket that contains seven experts. By evaluating the total robustness score of each expert, the most reliable one is selected as the tracking result for each frame.

  3. Next, a novel criterion, the peaks correlation of response map (PCRM), is proposed to evaluate the reliability of a sample. The PCRM values of the first three response maps are computed and weighted to obtain a confidence index for the current sample.

  4. Finally, an adaptive model update strategy is proposed to alleviate sample contamination by considering both the PCRM of the sample and the divergence of the experts.

Extensive and comprehensive experiments are conducted on the widely used benchmarks OTB-2013 [30], VOT2015 [13] and OTB-2015 [12]. The results validate the improvement in success and precision rates of the proposed trackers.

2 Related Works

During the past few decades, substantial progress has been made in the field of visual object tracking. In this section, the works closely related to our method are summarized from three perspectives: tracking by correlation filters, tracker ensembles and model update strategies.

2.1 Tracking by Correlation Filters

Due to their high efficiency and accuracy, correlation filter-based trackers remain mainstream in practical applications. Bolme et al. [7] learn correlation filters by minimizing the output sum of squared error (MOSSE); by using circular correlation, the resulting filter can be computed efficiently with point-wise operations in the Fourier domain. Subsequently, Henriques et al. [4] perform dense sampling by efficiently exploiting the structure of the circulant matrix; the discriminative ability of CSK is enhanced by the augmented negative samples while high speed is maintained. The above methods are based on grayscale features. The work was further extended to multichannel HOG features in kernel space [6]. Staple [31] makes full use of the complementarity of color and gradient information while running faster than real-time. Danelljan et al. [32] introduce a spatially regularized component into the learning to penalize CF coefficients depending on their spatial locations and alleviate boundary effects. With the rising trend of introducing CNN features into object tracking, several trackers [14, 33] use deep models pretrained on the object classification task for feature representation, and performance has been further improved. More recently, C-COT [34] achieves outstanding performance on several benchmarks by adopting a continuous convolution operator to fuse deep feature maps. After that, ECO [35] adds several strategies for combining deep and hand-crafted features to speed up the C-COT framework. Further extensions, such as scale estimation [17, 19] and long-term tracking [36], have also been added to the correlation filter framework.

Fig. 1 Flowchart of our proposed algorithm. Different features are extracted for each expert from the search area. Then every expert estimates the target position using the KCF framework. Next, through self- and pair-evaluation, the most credible expert is selected as the result for the current frame. Finally, by combining the fluctuation of the response maps of the first three experts with the historical robustness score, the model is updated adaptively

2.2 Tracker Ensemble

According to the literature [26], ensemble approaches can improve performance substantially. In MEEM [20], entropy minimization is used to exploit the relationship between multiple experts and their historical trackers. Li et al. [37] extend it by using a unified discrete graph algorithm to model the multiple experts. Qi et al. [23] propose an improved hedge algorithm that combines weak CNN-based trackers built on various convolutional layers into a single stronger tracker. Wang et al. [24] propose the multi-cue correlation filter framework, which constructs parallel experts from different features and selects the expert with the highest robustness score as the tracking result in each frame.

2.3 Model Update Strategy

Although the implementation of the model updater is often treated as a trick, its impact on performance is usually very significant. Unfortunately, few works focus on this component [26]. Santner et al. propose parallel robust online simple tracking (PROST) [38], using a simple template model as the non-adaptive element, a novel optical-flow-based mean-shift tracker as the highly adaptive element and an online random forest as the moderately adaptive appearance-based learner. In MOSSE [7], a criterion called PSR is used to quantify the reliability of the tracked sample, and Bolme et al. report that PSR values between 20.0 and 60.0 indicate very strong peaks. The MEEM [20] tracker is designed to identify reliable model updates and discard incorrect ones. In KCF [6], the model is updated every frame with a moderate learning rate. Wang et al. [27] employ the maximum response value and the APCE as criteria to provide a high-confidence update strategy for robustness.

3 Methods

A multi-experts joint decision strategy with an adaptive model updater, based on kernelized correlation filters, is proposed in this work for robust tracking. Firstly, the baseline of our trackers [6] adopts a fixed target size, the one given in the first frame; therefore, a robust scale estimation approach [17] is employed to handle target scale changes. Secondly, handcrafted or deep features are extracted, and seven experts are obtained by splitting and combining these features; after a joint decision among the experts, the most reliable one is selected as the tracking result. Thirdly, a novel criterion called peaks correlation of response map (PCRM) is proposed. By evaluating the correlation between the maximum value and the other peaks of the response map, PCRM yields the confidence level of the sample. Finally, by considering PCRM and the historical divergence of the experts, the presented model update strategy can update the model with an appropriate learning rate. The flowchart in Fig. 1 depicts the main framework of our proposed algorithm.

In Sect. 3.1, the formulation of our baseline tracker, the multi-channel kernelized correlation filter, is reviewed. The scale estimation approach in our trackers is introduced in Sect. 3.2, and the multi-feature expert construction and the ensemble tracking strategy are described in Sects. 3.3 and 3.4, respectively. In Sect. 3.5, the PCRM and the model update strategy are proposed.

3.1 The Kernelized Correlation Filter Tracker

Since the KCF tracker enhances discriminative ability through augmented negative samples while maintaining high speed by efficiently exploiting the structure of the circulant matrix, it has become the baseline of many trackers [39, 40].

For notational simplicity, a one-dimensional signal is considered; more details can be found in [6]. Given one-dimensional data \({\mathbf{x}}=\left[ {\mathbf{x}}_{1}, {\mathbf{x}}_{2}, \ldots , {\mathbf{x}}_{n}\right]\), the training goal is to find \(f({\mathbf{z}})={\mathbf{w}}^{\mathrm{T}} {\mathbf{z}}\) which minimizes the squared error over training samples \({\mathbf{x}}_{i}\) and their regression targets \({\mathbf{y}}_{i}\),

$$\begin{aligned} \text{min}_{{\mathbf{w}}} \sum _{i=1}^{n}\left( f\left( {\mathbf{x}}_{i}\right) -{\mathbf{y}}_{i}\right) ^{2} +\lambda _{1}\Vert {\mathbf{w}}\Vert ^{2}. \end{aligned}$$
(1)

The scalar \(\lambda _{1}\) is a regularization parameter that controls overfitting. To allow for a more powerful classifier with nonlinear regression functions \(f({\mathbf{z}})\), the solution \({\mathbf{w}}\) is expressed as a combination of the samples:

$$\begin{aligned} {\mathbf{w}}=\sum _{i} \alpha _{i} \varphi \left( {\mathbf{x}}_{i}\right) , \end{aligned}$$
(2)

where \(\alpha _{i}\) are the variables under optimization in the dual space and \(\varphi ({\mathbf{x}})\) represents a mapping to a non-linear feature space; the optimized variables are therefore \({\varvec{\alpha}}\) instead of \({\mathbf{w}}\). Following [41], this alternative representation \({\varvec{\alpha}}=\left[ \alpha _{1}, \alpha _{2}, \ldots , \alpha _{n}\right]\) is said to be in the dual space, as opposed to the primal space \({\mathbf{w}}\).

The solution to the kernelized version of ridge regression can be obtained as follows

$$\begin{aligned} {\varvec{\alpha}}=(K+\lambda _{1} I)^{-1} {\mathbf{y}}, \end{aligned}$$
(3)

where K is the kernel matrix containing elements \(K_{i j}=\kappa \left( {\mathbf{x}}_{i}, {\mathbf{x}}_{j}\right)\), which are computed using the kernel function \(\kappa\).

For the most commonly used kernels (e.g., Gaussian, linear and polynomial), the circulant matrix trick can also be used to make Eq. (3) diagonal:

$$\begin{aligned} {\hat{{\varvec{\alpha}}}}=\frac{{\hat{{\mathbf{y}}}}}{{\hat{{\mathbf{k}}}}^{\mathrm{xx}} +\lambda _{1}}, \end{aligned}$$
(4)

where \({\mathbf{k}}^{\mathrm{xx}}\) is the kernel correlation and the hat denotes the Discrete Fourier Transform (DFT) of a vector, \({\hat{{\mathbf{y}}}}={\mathcal {F}}({\mathbf{y}})\). The division in Eq. (4) is performed element-wise. In our trackers, the Gaussian kernel is adopted for its high accuracy, given as

$$\begin{aligned} {\mathbf{k}}^{{\mathrm{xx}}^{\prime }}=\text{exp} \left( -\frac{1}{\sigma ^{2}}\left( \Vert {\mathbf{x}}\Vert ^{2}+\left\| {\mathbf{x}}^{\prime } \right\| ^{2}-2 {\mathcal {F}}^{-1}\left( {\hat{{\mathbf{x}}}}^{*} {\hat{{\mathbf{x}}}}^{\prime }\right) \right) \right) , \end{aligned}$$
(5)

where the kernel function \(\kappa \left( {\mathbf{x}}_{i}, {\mathbf{x}}_{j}\right)\) is denoted compactly as \({\mathbf{k}}^{{\mathrm{xx}}^{\prime }}\), and \({\hat{{\mathbf{x}}}}^{*}\) is the complex conjugate of \({\hat{{\mathbf{x}}}}\).

In the detection process, a patch \({\mathbf{z}}\) with the same size as \({\mathbf{x}}\) is extracted at the position estimated in the previous frame, and the response map is calculated as

$$\begin{aligned} {\hat{{\mathbf{f}}}}_{\mathrm{tran}}({\mathbf{z}})={\hat{{\mathbf{k}}}}^{\mathbf{x z}} {\hat{{\varvec{\alpha}}}}, \end{aligned}$$
(6)

where \({\hat{{\mathbf{k}}}}^{\mathbf{x z}}\) is the kernelized correlation between \({\mathbf{z}}\) and \({\mathbf{x}}\) as defined in Eq. (5), and \({\hat{{\varvec{\alpha}}}}\) was obtained in the previous frame by Eq. (4). Then, the position of the object in the current frame is located by finding the translation with the maximum value in the response map \({\mathbf{f}}_{\mathrm{tran}}({\mathbf{z}})\).

To avoid model corruption, KCF uses interpolation to update the model every frame:

$$\begin{aligned}{\hat{ {\varvec{\alpha}} }}_ { t } = ( 1 - \eta ) {\hat{ {\varvec{\alpha}}}}_ { t - 1 } + \eta {\hat{ {\varvec{\alpha}}}}_ { t },\end{aligned}$$
(7a)
$$\begin{aligned}{\hat{{\mathbf{x}}}} _ { t } = ( 1 - \eta ) {\hat{{\mathbf{x}}}} _ { t - 1 } + \eta {\hat{ {\mathbf{x}}}}_ { t }, \end{aligned}$$
(7b)

where \(\eta\) is the learning rate and t denotes the frame index of the image sequence; how to determine the value of \(\eta\) is discussed in Sect. 3.5. This puts more weight on recent frames and lets the effect of previous frames decay exponentially over time.
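To make the training, detection and update loop above concrete, the following is a minimal single-channel sketch in Python/NumPy (the paper's implementation is in MATLAB). The function names are ours, and the normalization of the squared distance by the patch size in the Gaussian kernel follows common KCF implementations rather than Eq. (5) literally.

```python
import numpy as np

def gaussian_correlation(x, xp, sigma=0.5):
    # Kernel correlation k^{xx'} of Eq. (5) for 2-D single-channel patches;
    # dividing by x.size is a common normalization, not part of Eq. (5) itself.
    cross = np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(xp)).real
    dist2 = np.sum(x ** 2) + np.sum(xp ** 2) - 2.0 * cross
    return np.exp(-np.maximum(dist2, 0.0) / (sigma ** 2 * x.size))

def train(x, y, lam1=1e-4):
    # Eq. (4): dual coefficients in the Fourier domain (element-wise division);
    # y is the desired Gaussian-shaped label map.
    k_hat = np.fft.fft2(gaussian_correlation(x, x))
    return np.fft.fft2(y) / (k_hat + lam1)

def detect(alpha_hat, x, z):
    # Eq. (6): response map; its argmax gives the target translation.
    k_hat = np.fft.fft2(gaussian_correlation(x, z))
    return np.fft.ifft2(k_hat * alpha_hat).real

def interpolate(old, new, eta):
    # Eq. (7): exponentially decaying update, applied to alpha_hat and x alike.
    return (1.0 - eta) * old + eta * new
```

In a tracking loop, `detect` runs first to locate the target, after which `train` and `interpolate` incorporate the newly extracted patch into the model.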

3.2 Discriminative Scale Space Tracking

Our scale search scheme follows the DSST [17] tracker. Unlike SAMF [18], which uses one filter to determine translation and scale simultaneously, DSST applies two kinds of correlation filters, independent of each other: a two-dimensional translation filter for target localization, and a one-dimensional scale filter for scale evaluation.

In our trackers, KCF is employed to locate the target, and a separate one-dimensional filter is learned to estimate the scale. Tens of patches \(I_{n}\) of size \(a^{n} P \times a^{n} R\) are extracted centered around the target to construct the training sample \({\mathbf{x}}_{t,{\mathrm{scale}}}\), where \(P \times R\) denotes the target size in the current frame, S is the size of the scale filter, \(n \in \left\{ \left\lfloor -\frac{S-1}{2}\right\rfloor , \ldots ,\left\lfloor \frac{S-1}{2}\right\rfloor \right\}\), and a represents the scale factor between feature layers. The aim is to train a scale correlation filter \({\mathbf{h}}_{\mathrm{scale}}\) consisting of one filter \({\mathbf{h}}_{\mathrm{scale}}^{n}\) per level. This can be solved by minimizing the \(L^{2}\) error with respect to the desired output g, for which a one-dimensional Gaussian is adopted,

$$\begin{aligned} \varepsilon =\left\| g-\sum _{n=1}^{S} {\mathbf{h}}^{n}_{\mathrm{scale}} \star {\mathbf{x}}^{n}_{\mathrm{scale}}\right\| ^{2}+\lambda _{2} \sum _{n=1}^{S}\left\| {\mathbf{h}}^{n}_{\mathrm{scale}}\right\| ^{2}, \end{aligned}$$
(8)

here, the \(\star\) denotes circular correlation, and the second term is a regularization with a weight parameter \(\lambda _{2}\).

The value \({\mathbf{x}}^{n}_{\mathrm{scale}}\) of the training sample \({\mathbf{x}}_{\mathrm{scale}}\) at scale level n is set to the d-dimensional feature descriptor of \(I_{n}\). The solution to the problem above is as follows:

$$\begin{aligned} {\hat{{\mathbf{h}}}}^{n}_{\mathrm{scale}}=\frac{{\hat{g}}^{*} {\hat{{\mathbf{x}}}}^{n}_{\mathrm{scale}}}{\sum _{k=1}^{S} ({\hat{{\mathbf{x}}}}^{k}_{\mathrm{scale}})^{*} {\hat{{\mathbf{x}}}}^{k}_{\mathrm{scale}}+\lambda _{2}}, \quad n=1, \ldots , S, \end{aligned}$$
(9)

where the fraction denotes point-wise division. Similar to Eq. (7), the new sample \({\mathbf{x}}_{t,{\mathrm{scale}}}\) is used to update the numerator \(A^{n}_{t}\) and denominator \(B_{t}\) of the scale filter \({\mathbf{h}}_{t,{\mathrm{scale}}}\).

$$\begin{aligned}A_{t}^{n}=(1-\theta ) A_{t-1}^{n}+\theta {\hat{g}}^{*} {\hat{{\mathbf{x}}}}_{t,{\mathrm{scale}}}^{n}, \quad n=1, \ldots , S,\end{aligned}$$
(10a)
$$\begin{aligned}B_{t}=(1-\theta ) B_{t-1}+\theta \sum _{k=1}^{S} ({\hat{{\mathbf{x}}}}^{k}_{t,{\mathrm{scale}}})^{*} {\hat{{\mathbf{x}}}}_{t,{\mathrm{scale}}}^{k}. \end{aligned}$$
(10b)

Here, \(\theta\) is a learning rate parameter. Many numerical experiments show that \(\theta = 0.01\) makes the filter adapt quickly to scale variation while remaining robust.

To apply the filter in a new frame t, a test sample \({\mathbf{z}}_{t,{\mathrm{scale}}}\) is extracted from the location determined by KCF using the same procedure as for the training sample \({\mathbf{x}}_{t,{\mathrm{scale}}}\). In Fourier domain, the correlation scores \({\hat{{\mathbf{f}}}}_{t,{\mathrm{scale}}}\) are computed by

$$\begin{aligned} {\hat{{\mathbf{f}}}}_{t,{\mathrm{scale}}}({\mathbf{z}}_{t,{\mathrm{scale}}})=\frac{\sum _{n=1}^{S} (A_{t-1}^{n})^{*} {\hat{{\mathbf{z}}}}_{t,{\mathrm{scale}}}^{n}}{B_{t-1}+\lambda _{2}}, \end{aligned}$$
(11)

where \(A_{t-1}^{n}\) and \(B_{t-1}\) are the numerator and denominator of the scale filter updated in the \((t-1)\)-th frame. By maximizing the scale correlation score \({\mathbf{f}}_{t,{\mathrm{scale}}}({\mathbf{z}}_{t,{\mathrm{scale}}}) ={\mathcal {F}}^{-1}\left\{ {\hat{{\mathbf{f}}}}_{t,{\mathrm{scale}}}({\mathbf{z}}_{t,{\mathrm{scale}}})\right\}\), the relative scale change with respect to the previous frame is obtained.
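As an illustration of the scale branch, here is a hedged NumPy sketch of Eqs. (9)-(11). The feature extraction producing the \((d, S)\) scale sample is stubbed out; the sketch follows the standard DSST convention of summing over the d feature dimensions, and parameter values come from Sect. 4.1 where available, otherwise they are assumptions.

```python
import numpy as np

S, lam2, theta = 33, 1e-2, 0.025            # number of scales, lambda_2, theta
g = np.exp(-0.5 * ((np.arange(S) - S // 2) / 1.5) ** 2)  # 1-D Gaussian output g
g_hat = np.fft.fft(g)

def scale_model(x):
    # x: (d, S) feature descriptor over S scale levels. Returns the
    # numerator A (d, S) and shared denominator B (S,) of Eqs. (9)-(10).
    x_hat = np.fft.fft(x, axis=1)
    A = np.conj(g_hat)[None, :] * x_hat
    B = np.sum(np.conj(x_hat) * x_hat, axis=0).real
    return A, B

def update_scale(A, B, x_new):
    # Eq. (10): linear interpolation of numerator and denominator.
    A_new, B_new = scale_model(x_new)
    return (1 - theta) * A + theta * A_new, (1 - theta) * B + theta * B_new

def detect_scale(A, B, z):
    # Eq. (11): scale response; its peak indexes the best scale bin
    # (index-to-scale-factor mapping and fftshift details omitted).
    z_hat = np.fft.fft(z, axis=1)
    f_hat = np.sum(np.conj(A) * z_hat, axis=0) / (B + lam2)
    return int(np.argmax(np.fft.ifft(f_hat).real))
```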

3.3 Multi-experts Construction

HOG [28], grayscale and ColorNames (CN) [29] are the most popular handcrafted features in the tracking field because of their high extraction efficiency. HOG features, the histogram of oriented gradients, are constructed by computing and binning gradient orientations in local regions of the image, and reflect the edge and shape information of a region block. CN features are expressive and highly discriminative; they are obtained by mapping the RGB space to the CN space, which reflects the 11-dimensional thematic color information of the region [29]. Grayscale features are simple features that contain only brightness information. Different from handcrafted features, CNN features contain rich high-level semantic information and are strong at distinguishing objects of different categories.

In our handcrafted features version tracker, only two low-level features (HOG and CN) are adopted to build the experts. Diversity is crucial in ensemble methods [26]; to create more experts, the 32-dimension HOG feature is divided into two 16-dimension features, called \({\mathrm{HOG}}_{1}\) and \({\mathrm{HOG}}_{2}\). Through permutation and combination of these features, \(C_{3}^{1}+C_{3}^{2}+C_{3}^{3}=7\) experts are obtained in the feature bucket, as the sketch after Table 1 illustrates. In our deep features version tracker, HOG, conv4-4 and conv5-4 are extracted as low-, middle- and high-level features, respectively. Details about the experts are shown in Table 1. After these seven experts are generated, each provides a result (bounding box) from its own perspective; how to choose a reliable one is discussed in Sect. 3.4.

Table 1 Both of our trackers consist of 7 experts; the separation and combination of features increase the diversity of the experts, and each expert tracks from its own perspective
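The combinatorial construction of the bucket is compact enough to write down directly; the following sketch enumerates the handcrafted version (the deep version would substitute HOG, conv4-4 and conv5-4 as the three base features). The feature names here are labels only, assumed to match Sect. 3.3.

```python
from itertools import combinations

base_features = ["HOG_1", "HOG_2", "CN"]      # handcrafted version
experts = [list(c) for r in (1, 2, 3) for c in combinations(base_features, r)]
assert len(experts) == 7                      # C(3,1) + C(3,2) + C(3,3) = 7
```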

3.4 Ensemble Tracking

In each frame, the seven experts track the target and generate cues (bounding boxes) simultaneously. Inspired by MCCT [24], pair-evaluation and self-evaluation [24] are adopted to evaluate their degree of robustness. Then, the expert with the highest score is selected as the tracking result of the current frame. The procedure of the ensemble tracking framework is shown in Fig. 2.

3.4.1 Pair-Evaluation

\(E_{1}\) to \(E_{7}\) denote Expert 1 to Expert 7, respectively. Every expert is treated as a black box; the bounding box of Expert i in the t-th frame is written as \(B_{E_{i}}^{t}\), a four-dimensional vector containing the position and target size. The overlap ratio \(O_{\left( E_{i}, E_{j}\right) }^{t}\) between any two experts \(E_{i}\) and \(E_{j}\) in frame t is computed as,

$$\begin{aligned} O_{\left( E_{i}, E_{j}\right) }^{t}=\frac{\text{Area}\left( B_{E_{i}}^{t} \cap B_{E_{j}}^{t}\right) }{\text{Area}\left( B_{E_{i}}^{t} \cup B_{E_{j}}^{t}\right) }, \end{aligned}$$
(12)

where \(\text{Area}\) represents the area of the intersection or union of the bounding boxes. To reduce the gap between high and low values of \(O_{\left( E_{i}, E_{j}\right) }^{t}\), it is mapped nonlinearly with an exponential function

$$\begin{aligned} O_{\left( E_{i}, E_{j}\right) }^{\prime t}= \exp \left( -\left( 1-O_{\left( E_{i}, E_{j}\right) }^{t}\right) ^{2}\right) . \end{aligned}$$
(13)

The fluctuation extent of the overlap ratios within a short period \(\Delta t\) (e.g., 5 frames in [24]) reveals the stability of the overlap evaluation between \(E_{i}\) and the other experts. However, \(O_{\left( E_{i}, E_{j}\right) }^{\prime t}\) only represents the overlap ratio between two experts in the present t-th frame, so previous overlap ratios should also be taken into consideration; the variance over the \(K=7\) experts is given as follows,

$$\begin{aligned} V_{E_{i}}^{t}=\sqrt{\frac{1}{K} \sum _{j=1}^{K}\left( O_{\left( E_{i}, E_{j}\right) }^{\prime t}-\overline{O_{\left( E_{i}, E_{j}\right) }^{\prime t-\Delta t+1: t}}\right) ^{2}}. \end{aligned}$$
(14)

Here, \(\overline{O_{\left( E_{i}, E_{j}\right) }^{\prime t-\Delta t+1: t}} =\frac{1}{\Delta t} \sum _{\tau } O_{\left( E_{i}, E_{j}\right) }^{\prime \tau }\) and \(\tau \in [t-\Delta t+1, t]\).

The mean overlap ratio \(M_{E_{i}}^{t}=\frac{1}{K} \sum _{j=1}^{K} O_{\left( E_{i}, E_{j}\right) }^{\prime t}\) reflects the consistency between Expert i and the other experts. Within a short period, scores closer to the current frame are more relevant to it, so an increasing sequence \({\mathbf{W}}=\left\{ \rho ^{0}, \rho ^{1}, \ldots , \rho ^{\Delta t-1}\right\} ,(\rho >1)\) puts more confidence on recent scores. The weighted mean and standard variance are then computed as \(M_{E_{i}}^{\prime t} = \frac{1}{N} \sum _{\tau } W_{\tau } M^{\tau }_{E_{i}}\) and \(V_{E_{i}}^{\prime t} = \frac{1}{N} \sum _{\tau } W_{\tau } V^{\tau }_{E_{i}}\), respectively, where \(W_{\tau }\) denotes the \((\tau - t + \Delta t)\)-th element of the sequence \({\mathbf{W}}\) and \(N = \sum _{\tau } W_{\tau }\).

The pair-evaluation score of Expert i in t-th frame is computed as below,

$$\begin{aligned} R_{\mathrm{pair}}^{t}\left( E_{i}\right) =\frac{M_{E_{i}}^{\prime t}}{V_{E_{i}}^{\prime t}+\xi }, \end{aligned}$$
(15)

where the small constant \(\xi\) prevents the pair-wise robustness score from becoming infinite when the denominator is zero. Equation (15) indicates that a higher value of \(R_{\mathrm{pair}}^{t}\left( E_{i}\right)\) means greater consistency and less volatility among the experts.
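A hedged NumPy sketch of Eqs. (12)-(15) is given below. The array shapes and names are ours, and for brevity \(V^{\tau }\) is computed against the window mean for every \(\tau\), a slight simplification of Eq. (14).

```python
import numpy as np

def nonlinear_overlap(iou):
    # Eq. (13): exponential mapping of the raw overlap ratio of Eq. (12).
    return np.exp(-(1.0 - iou) ** 2)

def pair_scores(history, rho=1.1, xi=1e-6):
    # history: (dt, K, K) array of O' values over the last dt frames.
    # Returns the R_pair score of Eq. (15) for each of the K experts.
    dt, K, _ = history.shape
    W = rho ** np.arange(dt)                    # increasing weight sequence
    win_mean = history.mean(axis=0)             # per-pair temporal mean
    M = history.mean(axis=2)                    # (dt, K): consistency M^tau
    V = np.sqrt(((history - win_mean) ** 2).mean(axis=2))  # (dt, K), cf. Eq. (14)
    M_prime = (W[:, None] * M).sum(axis=0) / W.sum()
    V_prime = (W[:, None] * V).sum(axis=0) / W.sum()
    return M_prime / (V_prime + xi)             # Eq. (15)
```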

Fig. 2 Schematic diagram of the ensemble tracking framework. Nodes in the hexagon represent the outputs of the experts (bounding boxes). Within a frame, the pair-evaluation between two experts is marked with a light gray line; the pair-evaluation of the same pair of experts across frames is marked with a colored line

3.4.2 Self-Evaluation

The Euclidean distance between the bounding box \(B_{E_{i}}^{t-1}\) in the \((t-1)\)-th frame and \(B_{E_{i}}^{t}\) in the t-th frame reflects the reliability of the tracking output of each expert. It is defined by \(D_{E_{i}}^{t} = \left\| c\left( B_{E_{i}}^{t-1}\right) -c\left( B_{E_{i}}^{t}\right) \right\|\), where \(c\left( B_{E_{i}}^{t}\right)\) is the central coordinate of \(B_{E_{i}}^{t}\). The trajectory smoothness degree of Expert i is given as follows,

$$\begin{aligned} S_{E_{i}}^{t}= \exp \left( -\frac{1}{2 \sigma _{E_{i}}^{2}}\left( D_{E_{i}}^{t}\right) ^{2}\right) , \end{aligned}$$
(16)

where \(\sigma _{E_{i}}^{2} = \frac{1}{2}\left[ W\left( B_{E i}^{t}\right) +H\left( B_{E i}^{t}\right) \right]\), and \(W\left( B_{E i}^{t}\right)\) and \(H\left( B_{E i}^{t}\right)\) denote the width and height of the bounding box of Expert i.

As mentioned before, to avoid performance fluctuations of the experts, short-term scores should be considered. Thus, the self-wise expert trajectory smoothness score is given by \(R^{t}_{\mathrm{self}}(E_{i}) = \frac{1}{N} \sum _{\tau } W_{\tau } S_{E_{i}}^{\tau }\), again with \(N = \sum _{\tau } W_{\tau }\). A higher self-evaluation score means a more reliable tracking trajectory.
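The corresponding sketch for Eq. (16) and its temporal smoothing follows, under the same assumptions as above; the \((x, y, w, h)\) box format is ours.

```python
import numpy as np

def trajectory_smoothness(prev_box, cur_box):
    # Eq. (16); note that sigma^2 itself equals (W + H) / 2 of the current box.
    pc = np.array([prev_box[0] + prev_box[2] / 2, prev_box[1] + prev_box[3] / 2])
    cc = np.array([cur_box[0] + cur_box[2] / 2, cur_box[1] + cur_box[3] / 2])
    d2 = float(np.sum((pc - cc) ** 2))
    sigma_sq = 0.5 * (cur_box[2] + cur_box[3])
    return float(np.exp(-d2 / (2.0 * sigma_sq)))

def self_scores(smoothness_window, rho=1.1):
    # R_self: weighted average of S^tau over the last dt frames, per expert.
    # smoothness_window: (dt, K) NumPy array of trajectory_smoothness values.
    W = rho ** np.arange(smoothness_window.shape[0])
    return (W[:, None] * smoothness_window).sum(axis=0) / W.sum()
```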

3.4.3 Joint Decision

The final robustness score \(R_{E_{i}}^{t}\) of Expert i in frame t combines the self-evaluation score \(R_{\mathrm{self}}^{t}(E_{i})\) and the pair-evaluation score \(R_{\mathrm{pair}}^{t}\left( E_{i}\right)\), weighted by the coefficient \(\mu\):

$$\begin{aligned} R^{t}\left( E_{i}\right) =\mu \cdot R_{\mathrm{pair}}^{t}\left( E_{i}\right) +(1-\mu ) \cdot R_{\mathrm{self}}^{t}\left( E_{i}\right) , \end{aligned}$$
(17)

finally, the expert with the highest robustness score is selected as the output result in each frame.
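Continuing the two sketches above, the joint decision of Eq. (17) is a single weighted combination followed by an argmax; \(\mu = 0.1\) follows Table 2.

```python
import numpy as np

def joint_decision(R_pair, R_self, mu=0.1):
    # Eq. (17): combine pair- and self-evaluation, pick the best expert.
    R = mu * np.asarray(R_pair) + (1.0 - mu) * np.asarray(R_self)
    return int(np.argmax(R)), R
```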

The main advantage of this ensemble method is that only two feature extractions (the heaviest computational burden in the tracking process) are needed per frame, one for training by Eq. (4) and the other for detection by Eq. (6), instead of fourteen (\(7 \times 2 = 14\) for \(K = 7\) experts). This approach considers both diversity and efficiency, so our trackers can maintain real-time performance while achieving high accuracy. Furthermore, by sharing the rectified target position and model update, the drift and tracking failure of weak experts are effectively alleviated.

Fig. 3 The left column shows snapshots of the image sequence Basketball, where the red bounding box indicates the tracking result of our trackers and the yellow one indicates the search area shared by all experts. Seven response maps are generated for the seven experts in each frame; for simplicity, only the response maps of Expert 2 in our handcrafted version tracker are shown in the right column, and the target and a part of the regional peaks of the response map are pointed out in the left column

Fig. 4 Analysis of PCRM and learning rate on the Basketball sequence. Blue and orange curves represent the values of PCRM and the learning rate, respectively. The text in the plot indicates the status and frame of the target; see the snapshots in the second column for the specific status

3.5 A Novel Model Updater

The model updater determines both the strategy and frequency of model updates. Most existing trackers adopt one of two methods: (1) Schemes like [4, 6, 17, 31] update the tracking model every frame with a constant learning rate, without considering whether the sample is credible. This may cause tracking failure due to model corruption when the target is detected inaccurately, severely occluded or totally missing in the current frame. (2) Approaches like [7, 27] use indicators (PSR and APCE, respectively) to assess the fluctuation of the response map and update the model only when the indicator meets certain conditions. This alleviates tracking failure caused by model corruption; however, the learning rate is still constant and cannot fully adapt to the needs of particular scenes. In addition, the samples discarded for failing the conditions can still be valuable. In our trackers, we establish an adaptive update strategy that utilizes the feedback of tracking results.

Many experiments demonstrate that the number and values of the peaks of the response map reflect the confidence of the tracking result. The ideal response map has only one sharp peak, with the remaining area relatively flat, and a sharper peak yields better tracking accuracy. On the contrary, when the response map contains multiple peaks and fluctuates severely, its pattern differs significantly from the ideal one; if the updater still adopts the same learning rate, model corruption will lead to tracking failure. Therefore, we propose a feedback-based adaptive update mechanism with a criterion called peaks correlation of response map (PCRM), defined as follows,

$$\begin{aligned} \text{PCRM} = \frac{\left( {\mathbf{f}}_{\mathrm{max}}-{\mathbf{f}}_{\mathrm{min}} \right) ^{2}}{{\text{mean}}\left( \sum _{i=1}^{d} \left( {\mathbf{f}}_{\mathrm{peaks}}^{i}-{\mathbf{f}}_{\mathrm{min}}\right) ^{2}\right) }, \end{aligned}$$
(18)

where \({\mathbf{f}}_{\mathrm{max}}\) and \({\mathbf{f}}_{\mathrm{min}}\) denote the maximum and minimum of the response map \({\mathbf{f}}_{\mathrm{tran}}({\mathbf{z}})\) in Eq. (6), respectively, \({\mathbf{f}}_{\mathrm{peaks}}^{i} \in \left\{ {\mathbf{f}}_{\mathrm{peaks}}^{1}, \ldots ,{\mathbf{f}}_{\mathrm{peaks}}^{d}\right\}\) denotes the i-th peak value of the response map, and d is the number of peaks. PCRM reflects the degree of fluctuation of the response map and the confidence level of the detected target. When the target appears completely and clearly in the detection area, the response map resembles a cone with a sharp peak descending smoothly to a relatively flat area, and the PCRM becomes large. Otherwise, PCRM decreases significantly if the object is occluded or missing.
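A hedged sketch of Eq. (18) is shown below. The text does not specify how the d peaks are detected, so a 3x3 local-maximum filter from SciPy is used here as one reasonable choice; the small epsilon guarding the division is also ours.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def pcrm(resp):
    # Eq. (18): ratio of the squared main-peak height to the mean squared
    # height of all peaks, both measured from the map minimum.
    f_max, f_min = float(resp.max()), float(resp.min())
    peaks = resp[(resp == maximum_filter(resp, size=3)) & (resp > f_min)]
    if peaks.size == 0:
        return 0.0                     # degenerate flat map
    return (f_max - f_min) ** 2 / (np.mean((peaks - f_min) ** 2) + 1e-12)
```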

Seven response maps are generated by the seven experts; since they are largely repetitive, only those of the first three experts are taken to compute the weighted PCRM over different features, which evaluates the t-th tracking result:

$$\begin{aligned} \text{PCRM}^{t}= {} \sum _{i} \upsilon _{i} \text{PCRM}_{E_{i}}^{t},\end{aligned}$$
(19a)
$$\begin{aligned} \upsilon _{i}= {} \frac{\text{PCRM}_{E_{i}}}{\text{PCRM}_{E_{1}} + \text{PCRM}_{E_{2}} + \text{PCRM}_{E_{3}}}. \end{aligned}$$
(19b)

Here, \(i = 1, 2, 3\), and \(\text{PCRM}_{E_{i}}\) denotes the PCRM value of the response map of Expert i. Compared with a plain average, the weighted PCRM reflects the overall fluctuation more comprehensively.

When occlusion or severe deformation occurs, as mentioned above, the PCRM drops rapidly; at the same time, our experiments show that the average robustness score of the experts, \(R_{\mathrm{mean}}^{t} = \frac{1}{K} \sum _{i= 1}^{K} R^{t}(E_{i})\), decreases significantly as well, which indicates that the experts diverge when they encounter unreliable samples. By integrating the weighted PCRM and the average robustness score, a comprehensive criterion \(\text{SC}^{t} = \text{PCRM}^{t} \cdot R_{\mathrm{mean}}^{t}\), called the sample confidence score, is presented. Considering that KCF learns both the target and the background of the sample through dense sampling, even unreliable samples have value, and it is unreasonable to discard them directly. When the current sample confidence score \(\text{SC}^{t}\) is far below its past mean value \(\text{SC}_{\mathrm{mean}}^{1: t}=\frac{1}{t} \sum _{i=1}^{t} \text{SC}^{i}\), the learning rate \(\eta\) in Eq. (7) is determined as follows:

$$\begin{aligned} \eta =\left\{ \begin{array}{ll}\text{lr} &{}\quad {\text{if SC}^{t}>\alpha \cdot \text{SC}_{\mathrm{mean}}^{1: t}} , \\ {\text{lr} \cdot \left[ \text{SC}^{t} /\left( \alpha \cdot \text{SC}_{\mathrm{mean}}^{1: t}\right) \right] ^{\beta }} &{}\quad {\text{otherwise}},\end{array}\right. \end{aligned}$$
(20)

where \(\text{lr}\) is the constant learning rate of the original KCF, and \(\alpha\) and \(\beta\) are the confidence threshold and the exponent of the power function, respectively. This update scheme effectively prevents tracking failure by penalizing, rather than discarding, samples with a low sample confidence score.
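A minimal sketch of Eqs. (19)-(20) follows; \(\alpha = 0.6\) and \(\beta = 3\) come from Table 2, while the KCF learning rate lr = 0.02 is a placeholder and the function signature is our assumption.

```python
import numpy as np

def adaptive_learning_rate(pcrm_e, r_mean, sc_history, lr=0.02, alpha=0.6, beta=3):
    # pcrm_e: PCRM values of the first three experts; r_mean: average
    # robustness score over the K experts; sc_history: list of past SC values.
    v = np.asarray(pcrm_e, dtype=float)
    weights = v / v.sum()                      # Eq. (19b)
    pcrm_t = float(np.dot(weights, v))         # Eq. (19a): weighted PCRM
    sc_t = pcrm_t * r_mean                     # sample confidence score SC^t
    sc_history.append(sc_t)
    sc_mean = float(np.mean(sc_history))       # historical mean SC^{1:t}
    if sc_t > alpha * sc_mean:
        return lr                              # confident sample: full rate
    return lr * (sc_t / (alpha * sc_mean)) ** beta   # Eq. (20): penalized rate
```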

Figure 3 illustrates the mechanism of the proposed update strategy. As shown in Fig. 3, in the beginning the response map shows one ideal sharp peak while the target is unoccluded, with only many low-energy regional peaks around it, so the value of PCRM is relatively large and the model is updated with a medium learning rate \(\eta\). When the target is severely occluded, the response map fluctuates fiercely (second row), so the PCRM drops to 5.84 and the learning rate is adaptively computed as \(\eta = 3.23 \times 10^{-4}\). It should be noted that under this circumstance the unreliable samples, which may contain valuable information for later tracking, are not simply discarded. By combining PCRM and the historical robustness score, the model is updated with a low learning rate in this frame under the proposed strategy; the tracking model is therefore not corrupted and the target is tracked successfully in the subsequent frames. Figure 4 intuitively shows the PCRM and learning rate distributions on the Basketball sequence. The athlete is fully and partially occluded in the 17-th and 54-th frames respectively, where the corresponding PCRM and learning rate values drop to low points. The subsequent low points indicate that the proposed model update strategy also reacts in time to rotation, deformation, illumination variation and background clutter. To validate the effectiveness of our model updater, more experiments are conducted in the following section.

An overview of our trackers is summarized in Algorithm 1.

Fig. 5 Performance comparison of different versions of our trackers on the OTB-2013 and OTB-2015 datasets. The evaluations on OTB-2013 are in the upper row and those on OTB-2015 in the lower row. In the legend, the DP at a threshold of 20 pixels and the AUC are reported in the left and right figures, respectively

4 Experiments

In this section, comprehensive experiments are conducted to evaluate our method. Firstly, the implementation details of our trackers are described. Secondly, the effectiveness of the model updater in our trackers is validated by comparison with other approaches. Finally, our trackers are compared with state-of-the-art trackers.

We first conduct experiments on two benchmark datasets, OTB-2013 [30] and OTB-2015 [12]. The former has 51 video sequences, and the latter extends to 100. All these sequences are annotated with 11 attributes which cover various challenging factors, including illumination variation (IV), motion blur (MB), deformation (DEF), fast motion (FM), out-of plane rotation (OPR), scale variation (SV), occlusion (OCC), background clutters (BC), out-of-view (OV), in-plane rotation (IPR), low resolution (LR).

Two indicators are used: the success plot and the precision plot. The success plot gives the percentage of successful frames, i.e., those whose overlap rate between the tracked bounding box and the ground truth exceeds a given threshold. The precision plot is defined as the percentage of frames in which the distance (in pixels) between the center of the output bounding box and that of the ground truth is less than a given threshold. To rank the trackers, two ranking metrics are used: the representative precision score at a threshold of 20 pixels for the distance precision plot (DP), and the area under the curve (AUC) metric for the success plot.
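For concreteness, both ranking metrics can be computed from per-frame overlaps and center errors as in the following sketch; the array names and the threshold grid are our assumptions.

```python
import numpy as np

def auc_success(ious, n_thresholds=101):
    # Success plot: fraction of frames whose overlap exceeds each threshold;
    # the AUC metric averages this curve over a uniform grid on [0, 1].
    ths = np.linspace(0.0, 1.0, n_thresholds)
    curve = np.array([(np.asarray(ious) > t).mean() for t in ths])
    return float(curve.mean())

def distance_precision(center_errors, threshold=20.0):
    # Precision plot value at the representative 20-pixel threshold (DP).
    return float((np.asarray(center_errors) <= threshold).mean())
```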

For fair evaluation, the third dataset VOT2015 [13] is also used, which contains 60 annotated sequences.

Fig. 6 The success and precision plots on the OTB-2013 dataset; 50 image sequences with 51 targets are quantitatively analyzed using one-pass evaluation (OPE). Only trackers with scores in the top fifteen are shown; the others are plotted as light gray curves. The legend reports the area under the curve (AUC) for the success plot and the precision at the 20-pixel threshold (DP) for the precision plot

4.1 Implementation Details

The regularization parameters in Eqs. (4) and (9) are set to \(\lambda _{1} = 0.0001\) and \(\lambda _{2} = 0.01\), respectively. The learning rate in Eq. (10) is set to \(\theta = 0.025\). The number of scales S is 33 and the scale factor a is 1.02. For ensemble tracking, the parameter \(\rho\) in the weight sequence \({\mathbf{W}}\) is set to 1.1 and the weighting factor \(\mu\) to 0.1. In the model updater, \(\alpha\) and \(\beta\) in Eq. (20) are set to 0.6 and 3, respectively. All experts adopt the same parameters. All mentioned parameters are listed in Table 2.

Table 2 Parameters in our trackers

Our experiments are implemented in MATLAB 2019a on a computer with an Intel i5-3450 3.1 GHz CPU and 16 GB RAM. The MatConvNet toolbox [42] is used for extracting the deep features from VGG-19 [8]. Our deep features version tracker runs at about 1.5 FPS on the OTB Basketball sequence. The speed of our handcrafted features version tracker is about 25 FPS on the same sequence, which is sufficient for real-time applications.

4.2 Analyses of Our Trackers

To evaluate the effectiveness of each component in our framework, we compare our trackers with different versions of themselves on OTB-2013 and OTB-2015. Our trackers are denoted as Ours and Ours_deep. We first compare our trackers with Expert 7, denoted as Expert7. Then, to demonstrate the effect of the update mechanism, some popular methods are embedded into our trackers, namely PSR from [7], APCE from [27] and interpolation from the original KCF [6], denoted as Ours_with_PSR, Ours_with_APCE and Ours_with_Interpolation, respectively. Among all compared trackers, only Ours_deep adopts CNN features.

As shown in Fig. 5, our trackers Ours and Ours_deep show the best tracking accuracy and robustness on both the OTB-2013 and OTB-2015 datasets. Ours clearly outperforms Expert7; it is worth mentioning that Expert7 already achieves quite good performance, and our ensemble method still improves on it by about 5%. Besides, Ours_with_Interpolation adopts a constant learning rate to update the model every frame by Eq. (7), while Ours_with_APCE and Ours_with_PSR simply discard unreliable samples, which may still be valuable to the tracker; due to these limitations, all three obtain inferior precision and success scores. In contrast, our strategy, which considers both the fluctuation degree of the response map and the divergence among the seven experts, boosts the performance further.

4.3 Comparison with the State-of-the-Art

Besides the tracking results provided by the benchmark [30], we also compare our trackers with state-of-the-art trackers, including MDNet [11], ECO [35], C-COT [34], ADNet [43], SRDCFdecon [44], ACFN [45], CNN-SVM [46], DLSSVM [47], SiamFC-tri [48], MEEM [20], Staple [31], KCF [6], DSST and fDSST [17], LCT [21], SAMF [18], DCFNet [49] and SRDCF [32]. It should be noted that MDNet [11], ECO [35], C-COT [34], ADNet [43], ACFN [45], CNN-SVM [46], DLSSVM [47] and SiamFC-tri [48] are based on deep learning.

Fig. 7 The success and precision plots on the OTB-2015 dataset with 100 image sequences; the DP and AUC scores are illustrated in the left and right plots, respectively. Trackers ranked below fifteenth are drawn in light gray. In both metrics, our approach achieves the best results among the compared trackers

4.3.1 OTB-2013 Dataset

In general, according to the OTB-2013 evaluation metrics, the one-pass evaluation (OPE) scores in the precision and success plots are shown in Fig. 6. The "deep" tag in the brackets of the legend indicates that a tracker is based on deep learning. As shown in the plots, our approach achieves promising results compared to many advanced trackers. Ours reaches a 64.3\(\%\) success rate and an 84.4\(\%\) precision rate. With the help of CNN features, Ours_deep achieves a 67.5\(\%\) success rate and an 89\(\%\) precision rate, ranking fourth and fifth respectively among all compared trackers. As the baseline of our trackers, KCF obtains a 51.4\(\%\) success rate and a 74.0\(\%\) precision rate as reported; meanwhile, DSST, from which our method takes its scale estimation, gets a 56.5\(\%\) success rate and a 75.4\(\%\) precision rate. These observations indicate that the proposed framework works better than both original trackers. In particular, MEEM, which like our approach builds on a historical tracker ensemble, is exceeded significantly by our ensemble mechanism, by 7.7\(\%\) in AUC score and 1.4\(\%\) in DP score. The proposed trackers also show performance comparable to the state-of-the-art trackers MDNet [11], ECO [35], C-COT [34] and ADNet [43] in both precision and success rate.

Fig. 8 The success plots for the attribute-based evaluation of trackers on OTB-2015; the AUC scores of the top fifteen trackers are reported in the legend. The number of videos related to each attribute is given in parentheses above each plot

Fig. 9 Comparison of the proposed method with the state-of-the-art trackers ECO [35], ADNet [43], LMCF [27], MEEM [20], Staple [31], KCF [6] and DSST [17] on OTB-2015 over twelve typical sequences. From top-left to bottom-right, the sequences are: Dog1, DragonBaby, Girl2, Jogging2, KiteSurf, Singer1, Skating2, Diving, Bird1, Skiing, MotorRolling, Biker

4.3.2 OTB-2015 Dataset

To further validate the effectiveness of our trackers, we conduct experiments on the relatively large OTB-2015 dataset containing 100 annotated targets, which makes it more comprehensive than its predecessor. As shown in Fig. 7, the top fifteen trackers are colored in the plots, and DP scores for precision and AUC scores for success are reported in the legends. The proposed tracker Ours_deep achieves a DP score of 88.3\(\%\) and an AUC score of 66.3\(\%\), ranking fourth in both criteria. The scores of Ours are lower only than those of the deep-feature-based trackers ECO, MDNet, C-COT and ADNet, and Ours still ranks higher than MEEM in both plots. It is worth mentioning that in this more comprehensive evaluation, our handcrafted features version method provides gains of 19.4 and 20.1\(\%\) in DP score and 28.9 and 18.7\(\%\) in AUC score compared to KCF and DSST, respectively. This demonstrates the effectiveness and validity of our framework again. In general, the proposed trackers demonstrate strong competitiveness on the OTB benchmark.

4.3.3 Attribute-Based Comparison

We further use the image sequences annotated with the eleven attributes to comprehensively evaluate the performance of the trackers in different scenarios. Figure 8 shows the AUC scores for the eleven attributes, since the AUC score measures tracker performance more accurately than the single-threshold DP. For clarity, the results of the top fifteen trackers are reported in the legend. As illustrated in the plots, the proposed trackers achieve excellent results on most attributes. On sequences annotated with the scale variation attribute, our handcrafted features approach outperforms DSST, thanks to the joint decision strategy built on our highly discriminative kernelized correlation filters. Moreover, our trackers are at the forefront in the three attributes of occlusion, out of view and background clutters, which shows that the proposed model updater mechanism boosts performance considerably in these three distractive scenarios. In addition, targets in sequences annotated with out-of-plane rotation and in-plane rotation exhibit multiple views; therefore, the strength and frequency of model updates are particularly critical, and thanks to the proposed model updater our trackers handle both attributes well. Our approach provides favorable results in the attributes of deformation, illumination variation, fast motion and motion blur as well.

4.3.4 Qualitative Evaluation

Here, qualitative comparisons of our approach with other trackers on twelve image sequences are shown in Fig. 9. Among the nine trackers, only KCF is incapable of estimating scale variation. Among the test sequences, Dog1 and Singer1 have significant scale variation; Girl2, Jogging2 and Skating2 undergo partial or full occlusion; the targets in MotorRolling and Biker have the attributes of motion blur and fast motion; the target in DragonBaby suffers from frequent appearance variations; and Diving, Bird1, Skiing, Surfer and KiteSurf contain severe deformation.

In Dog1 and Singer1, both ACFN and LMCF suffer from significant scale drift in the presence of fast scale change and illumination variation, while our approach performs well. Although Staple can adapt to the scale variation and in-plane rotation in Dog1 and Singer1, it does not perform well in the presence of the occlusion, background clutter and fast motion in Jogging2 and DragonBaby. In Girl2, when the adult completely blocks the girl, DSST and most other trackers drift due to the occlusion, while our proposed model updater avoids model corruption; after the girl reappears, our trackers correct the drift and continue to track the real target. A similar phenomenon can also be observed in Jogging2 and Skating2. This demonstrates that the superior performance of our trackers is due not only to ensemble tracking but also to the model update scheme. Diving, Bird1, Skiing, MotorRolling and Biker are among the most challenging sequences in OTB; with the boost of CNN features, our deep version tracker can track these targets and even performs better than ECO in Bird1 and MotorRolling.

Fig. 10 The AR ranking plot for the baseline experiment. The accuracy and robustness rankings are plotted along the vertical and horizontal axes, respectively. Our trackers are denoted by the red circle and yellow cross. A tracker is better if it is closer to the top right corner of the plot

Table 3 Accuracy, average number of failures and expected average overlap of state-of-the-art trackers on VOT2015 [13]

4.4 VOT2015 Dataset

For completeness, we also present the evaluation results on the VOT2015 dataset [13], which contains 60 sequences with substantial variations. Unlike the unsupervised OTB evaluations, in the VOT2015 methodology the tracker is re-initialized whenever the region overlap rate falls below a threshold. The evaluation reports accuracy and robustness, corresponding to the bounding box overlap rate and the number of failures, respectively. Please refer to [13, 50] for more information.

For clarity, we compare our algorithm with a subset of the trackers provided with the dataset. The overall experimental results are illustrated in the accuracy-robustness ranking plot shown in Fig. 10. The accuracy, failures and expected overlap of dozens of competitive trackers are listed in Table 3. From the plot, it is observed that our deep version tracker resides in the top right corner, which means only MDNet (the VOT2015 winner) ranks higher than Ours_deep. It is worth noting that our handcrafted features based tracker outperforms most of the compared trackers. Because they rely on iterative online optimization, the speeds of MDNet and DeepSRDCF are below 1 FPS, far from real-time requirements. In contrast, the speed of Ours_deep is about 1.5 FPS, and our handcrafted features version tracker reaches 25 FPS, much faster than the trackers mentioned above. In addition, the proposed method ranks higher than KCF, MEEM and DSST, which demonstrates the effectiveness of the proposed framework again.

5 Conclusion

In this paper, a multi-experts joint decision framework for visual tracking with an embedded adaptive model updater is proposed, which fully explores the strength of multiple features not only at the feature level but also at the decision level by exploiting the high discriminative power of kernelized correlation filters. Moreover, our trackers are extended with an effective scale estimation approach to address the problem of a fixed template size. Furthermore, a novel criterion called peaks correlation of response map (PCRM) is proposed to assess the confidence of a sample through its response map, and an adaptive model update strategy is established by considering both PCRM and the historical robustness scores of the experts to alleviate the model corruption problem. Extensive experiments are conducted on three widely used datasets. We compare our approach with state-of-the-art trackers on OTB-2013 and OTB-2015, and the results show the effectiveness and validity of the components of our trackers. The proposed trackers rank near the top in most kinds of evaluation, and our approach obtains outstanding results on VOT2015 as well. The conducted experiments demonstrate that the proposed trackers perform competitively against state-of-the-art approaches. It is worth emphasizing that the proposed approach not only performs well, but also runs at high speed on average machines, meeting real-time application requirements.