In recent years, target tracking has become a research hotspot in computer vision because of its practical applications in many fields, including video surveillance, human-computer interaction, driverless vehicles, and medical image analysis [1,2,3]. Given the size and position of the target in the initial frame, a tracker must predict the size and position of the target in the subsequent frames of the video sequence. Despite the remarkable progress made in target tracking over the past few decades, several challenges remain unresolved, including scale variation, background clutter, and motion blur. Bolme et al. [2] first applied correlation filtering to tracking and proposed the minimum output sum of squared error (MOSSE) filter [4], which locates the target at the largest filter response. The Exploiting the Circulant Structure of Tracking-by-detection with Kernels (CSK) algorithm [5] adds dense sampling and a kernel mechanism to MOSSE to increase the tracking frame rate from 20 FPS to 400 FPS. Joao et al. [4] proposed the Kernelized Correlation Filter (KCF) algorithm, which improved CSK by extending it to the multi-channel HOG gradient feature. Danelljan et al. [6] designed the color names (CN) feature and added multi-channel color features to the CSK algorithm. Poria et al. [7] identified important features using rough set theory to obtain higher accuracy in retrieval results. Reza et al. [8] proposed an edge calculation method built on the concepts of the fuzzy similarity relation and the homogeneity region. These methods achieved good results.

Despite the high speed of correlation filtering algorithms, there is still room for improvement. The first issue is that target deformation during tracking leads to unstable tracking. The traditional KCF and DCF [4] algorithms use the HOG feature [9] as the sample feature, which is stable under motion blur and illumination change. However, these models rely heavily on the contour structure of the tracked target. Consequently, the algorithms are extremely sensitive to object deformation, which leads to unstable tracking results. The second issue is the boundary effect caused by the circular shift of the center image block. In the training phase, the dense samples are obtained by circularly shifting the center image block, so only the center sample is accurate; the others contain displaced boundaries, and as a result the trained classifier cannot accurately track a rapidly moving object. The third issue is the loss of tracking accuracy when the tracking scale does not adapt to the target size. During tracking, a reduction of the target scale causes drift because the selected image block includes a large amount of background information, whereas an expansion causes drift because the block contains only part of the target. The fourth issue is target occlusion. An occluded target causes drift in the tracking results, which in turn contaminates the trained target model; with a longer occlusion, tracking fails. This paper provides solutions to these four limitations of correlation filtering algorithms.
In summary, the main contributions of this paper are as follows: (1) A feature fused from HOG, CN, and HSV is proposed to enhance the discrimination of feature responses and improve tracking stability when the target deforms or the lighting changes.

Fig. 1 Algorithm flowchart. The algorithm structure is roughly divided into three main parts: (1) feature extraction and fusion, (2) template and response calculation, and (3) template update

(2) A spatial regularization weight is set according to the location information of the training samples and the target space. A spatial weight function is proposed to penalize the magnitude of the filter coefficients, and the filter is solved with ADMM [10] to reduce the number of iterations over the filter coefficients, weakening the boundary effect while preserving tracking efficiency.

(3) An adaptive scale filter with a 7-scale pool is designed, which makes the algorithm adaptable to the scale variations.

(4) The correlation peak mean difference ratio (CPMDR) is applied to estimate the occlusion state, which enables adaptive updating of the tracking model and improves stability when the target is occluded.

Related work

Although correlation filtering based target tracking has achieved remarkable progress, several limitations remain unresolved, including target deformation, boundary effects, scale variation, and target occlusion. Researchers have put considerable effort into solving these issues.

For target deformation, Poria et al. [7] identified important features using rough set theory to obtain higher accuracy in retrieval results. Gupta et al. [11] proposed RE-SiamNets to circumvent the adverse effect of rotation; these networks allow the change in object orientation to be estimated in an unsupervised manner. Joao et al. [3] proposed an algorithm (CN) based on color space to limit the scope of the problem. Because color features respond only to color changes and are insensitive to contour changes, they are strongly robust to target deformation. This algorithm extends the RGB color space to the CN space, which has eleven channels (black, blue, brown, gray, green, orange, pink, purple, red, white, and yellow). Bertinetto et al. [12] improved this tracking algorithm through feature fusion: a correlation filter trained on the HOG feature produces a tracking score, a color histogram produces a statistical score, and the two are fused to generate the final response image and estimate the target position. This feature fusion improves the accuracy of the tracking algorithm but also makes the computation slightly more complex. Ma et al. [13] introduced deep features into correlation filtering and designed a tracker (HCFT) based on multi-layer convolutional features, since deep features are more robust than common features. VGG16 [14] was used to extract the output features of the conv3-4, conv4-4, and conv5-4 layers and to train the respective correlation filters. During target tracking, the features of the three layers in the search area are fed to the corresponding correlation filters, the response image is generated as a weighted sum of the three responses, and the target is located at the position of the maximum response.

To resolve the boundary effect, most algorithms, such as KCF, apply a cosine window to the image to weaken the influence of the image boundary on the result. The influence of the boundary effect remains weak as long as the center part of the shifted image is reasonable. However, as the number of reasonable samples increases, this method cannot guarantee the validity of all training samples. In addition, the cosine window makes the tracker block out background information and accept only part of the valid information, thereby reducing the discriminating ability of the classifier. Danelljan et al. proposed the spatially regularized discriminative correlation filter (SRDCF) [15], in which the filter coefficients are concentrated mainly in the central region by adding weight constraints. Lukežič et al. [16] proposed CSR-DCF, a correlation filter with channel and spatial reliability. In CSR-DCF, a binary mask obtained from spatial reliability adaptively selects the target region that is easier to track and thus reduces the boundary effect. Yan et al. [17] proposed a novel, flexible, and accurate refinement module called Alpha-Refine, which exploits a precise pixel-wise correlation layer together with a spatial-aware non-local layer to fuse features and can predict three complementary outputs: bounding box, corners, and mask. CF+CA [18] points out that the negative samples used for correlation filter training are obtained only by cyclic displacement of positive samples, which limits the background discrimination ability of the trained classifier. Therefore, negative samples collected around the positive samples are introduced in training to improve the tracking accuracy.

Fig. 2 The fused feature has 45 dimensions: a 31-dimensional HOG feature, an 11-dimensional CN feature, and a 3-dimensional HSV feature

Fig. 3 Fusion response graph. The response graph of a single feature is affected by a large amount of surrounding noise, which hinders accurate distinction of the target, while the feature response after fusion shows stronger discrimination between the target and other regions

To address the impact of scale variation on tracking performance, the scale adaptive multiple feature tracker (SAMF) [19] and the discriminative scale space tracker (DSST) [20] introduced scale estimation into KCF. SAMF [19] applies the translation filter to image blocks at 7 coarse scales and selects the translation position and target scale corresponding to the largest response value. DSST [20] trains a translation filter and a scale filter separately, using 35 fine scales for the latter. The translation filter and the scale filter are used for position estimation and scale estimation, respectively. Most popular algorithms use one of these two scale schemes to estimate position and scale.

For target occlusion, Zhang et al. [21] used the kernel gray histogram as the description feature to track each component of the target. This not only increases robustness to occlusion but also handles non-rigid deformation of the target. Liu et al. [22] proposed a modeling method for unknown parts using hidden variables. By extending the online Pegasos algorithm to the structured prediction of the hidden variables of the various parts, this method provides a better tracking effect than the best contemporary linear and nonlinear kernel trackers. Harley et al. [23] proposed an unsupervised method for detecting and tracking moving objects in 3D in unlabeled RGB-D videos. The constraint of ensemble agreement helps combat contamination of the generated pseudo-labels, and data augmentation helps the modules generalize to yet-unlabeled data.

Proposed method

The proposed method is illustrated in Fig. 1 and is composed of three components. (1) Feature extraction. The HOG, CN, and HSV features of the target prediction area and the candidate area are extracted and then fused to obtain the feature template. (2) Template and response calculation. The response value of the template is calculated, and the credibility of the template is evaluated by the correlation peak mean difference ratio. (3) Model update. If the credibility of the template is high, the template is updated. If it is low, the template of the previous frame is retained.

Feature fusion

A feature fusion method based on HOG, CN, and HSV is used to enhance the discrimination of the feature responses and improve the stability of target tracking. The HOG feature, which is stable under illumination changes, consists of 18 direction-sensitive channels, 9 direction-insensitive channels, 4 texture channels, and 1 zero channel [24]. The CN feature is a low-dimensional adaptive extension of color attributes, which are the linguistic labels commonly used to describe color [25]. HSV contains hue, saturation, and value (intensity) information. Because it is closer to human visual characteristics, the HSV color space performs better than the RGB color space in visual tracking. The proposed method fuses the HOG feature to represent gradient changes, the CN color space to represent color information, and the HSV space to obtain more detailed information. The HOG feature is 31-dimensional (excluding the zero channel), the CN feature is 11-dimensional (RGB colors are mapped to eleven basic colors, namely black, brown, gray, green, orange, pink, purple, red, blue, white, and yellow), and the HSV feature is 3-dimensional. The fused feature is therefore a 45-dimensional integrated feature, as shown in Fig. 2. Figure 3 presents the response graphs of the single features and of the fused feature, showing that the fused feature discriminates more strongly between the target and other regions.
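The channel-wise fusion described above can be sketched as follows. This is a minimal illustration in Python with NumPy, assuming that the three feature maps have already been extracted and resampled to a common spatial grid. The function name `fuse_features` and the random placeholder maps are hypothetical and are not part of the original implementation.

```python
import numpy as np


def fuse_features(hog, cn, hsv):
    """Stack HOG, CN and HSV maps along the channel axis.

    Each input has shape (H, W, C) on the same H x W grid; the expected
    channel counts (31, 11, 3) follow the dimensions given in the text.
    """
    expected = {"hog": 31, "cn": 11, "hsv": 3}
    maps = {"hog": hog, "cn": cn, "hsv": hsv}
    for name, fmap in maps.items():
        if fmap.shape[:2] != hog.shape[:2]:
            raise ValueError(f"{name} map is not on the same spatial grid")
        if fmap.shape[2] != expected[name]:
            raise ValueError(f"{name} map should have {expected[name]} channels")
    return np.concatenate([hog, cn, hsv], axis=2)


# Placeholder maps standing in for real extracted features (assumption).
rng = np.random.default_rng(0)
hog_map = rng.random((16, 16, 31))
cn_map = rng.random((16, 16, 11))
hsv_map = rng.random((16, 16, 3))
fused = fuse_features(hog_map, cn_map, hsv_map)
print(fused.shape)  # (16, 16, 45)
```

In this sketch the channel order is HOG first, then CN, then HSV; any fixed order would work as long as training and detection use the same one.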

Spatial regularization based on ADMM

In the KCF correlation filtering algorithm, to obtain the optimal classifier under the minimum square error [26], the circular shift sample is used to train the classifier, and Eq. 1 defines the training process loss function.

$$\begin{aligned} {\psi _t}(\omega ) = \sum \limits _{i = 1}^t {\frac{1}{2}{{\left\| {f({x_i}) - {y_i}} \right\| }^2}} + \frac{\lambda }{2}\sum \limits _{j = 1}^d {{{\left\| {{\omega ^j}} \right\| }^2}} \ \end{aligned}$$

where \({\psi _t}\) is the training error of the classifier over the first t frames, t is the current frame number, i is the index of a history frame, \({x_i}\) is the input sample of the i-th frame, \(f({x_i})\) is the response score of the input sample of the i-th frame, \({y_i}\) is the expected response of the sample in the i-th frame, \(\omega \) is the filter coefficient to be trained, j is the channel index of the filter, \({a_i}\) is the frame weighting factor of classifier learning (taken as 1 in Eq. 1), d is the number of classifier dimensions, and \(\lambda \) is a constant regularization factor that prevents over-fitting.

Note that the regularization factor \(\lambda \) is constant during training, so samples in the background area are treated the same as samples in the target area. However, in practical tracking, the target area is much more important than the background region. Thus, the regularization weight of samples in the target area should be smaller than that of the background. This paper introduces the spatial regularization weighting factor \(\theta \) and builds a spatially regularized correlation filter, which weakens the interference of the background region and improves the classification ability of the classifier in a cluttered background. At the same time, this property allows the search area to be expanded, which addresses target loss caused by rapid movement.

The original formula, after the introduction of the weight factor \(\theta \), is represented in Eq. 2

$$\begin{aligned} {\psi _t}(\omega ) = \frac{1}{2}\sum \limits _{i = 1}^t {{{\left\| {f({x_i}) - {y_i}} \right\| }^2}} + \frac{1}{2}\sum \limits _{j = 1}^d {{{\left\| {\theta \odot {\omega ^j}} \right\| }^2}} \end{aligned}$$

Here \(\odot \) denotes element-wise multiplication, Eq. 1 is recovered when \(\theta = \sqrt{\lambda }\) everywhere, and the remaining parameters are the same as in Eq. 1. The regularized weight is defined by Eq. 3.

$$\begin{aligned} \theta (m,n) = {\theta _{base}} + {\theta _{shift}}(m,n) \end{aligned}$$

where m and n represent the offsets of the cyclic sample, \({\theta _{base}}\) represents the constant base weight of spatial regularization, and \({\theta _{shift}}\) represents the regularized weight offset of the training sample, defined in Eq. 4.

$$\begin{aligned} {\theta _{shift}}(m,n) = {\theta _{width}}{(\frac{m}{{{\rho _{width}}}})^2} + {\theta _{height}}{(\frac{n}{{{\rho _{height}}}})^2} \end{aligned}$$

\({\rho _{width}}\) and \({\rho _{height}}\) represent the width and height of the search image, respectively. \( {\theta _{width}}\) and \({\theta _{height}}\) represent the weighting factors in the horizontal and vertical directions, respectively. Equation 4 shows that \({\theta _{shift}}\) grows with the squared distance between the training sample and the target center, i.e., the regularization weight is larger in the background region and smaller in the target region.
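The weight map of Eq. 3 and Eq. 4 can be evaluated on a grid as in the sketch below. It assumes that the offsets m and n are measured from the center of the search window, and it uses the values reported in the experimental configuration as defaults (base weight 0.1, directional weights 3). The function name `spatial_weights` is hypothetical.

```python
import numpy as np


def spatial_weights(height, width, theta_base=0.1, theta_width=3.0, theta_height=3.0):
    """Evaluate Eq. 3 and Eq. 4 on a height x width search window."""
    # Offsets m (horizontal) and n (vertical), measured from the window
    # center (assumption about the coordinate origin).
    m = np.arange(width) - (width - 1) / 2.0
    n = np.arange(height) - (height - 1) / 2.0
    shift = (theta_width * (m[None, :] / width) ** 2
             + theta_height * (n[:, None] / height) ** 2)
    return theta_base + shift


theta = spatial_weights(5, 5)
print(theta[2, 2])  # weight at the window center
print(theta[0, 0])  # weight at a corner
```

For the 5 x 5 window the center weight equals the base weight, and the corner weight is 0.1 + 3(2/5)^2 + 3(2/5)^2 = 1.06, so coefficients far from the target are penalized roughly ten times more strongly.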

Solving for the filter coefficient \(\omega \) is a key issue in correlation filtering algorithms. Recent correlation filter trackers, including CFLB [27] and BACF [26], introduce spatial constraints when training the filter to handle the boundary effect. Although these algorithms address the boundary effect, they make the filter model more complex and slow down the computation, which diminishes the speed advantage of correlation filtering.

The alternating direction method of multipliers (ADMM) is used in this paper to solve the correlation filter. ADMM divides a large optimization problem into multiple sub-problems. An approximate solution of the filter can be obtained quickly by iterating over the sub-problems.

In general, the ADMM algorithm is used to solve minimization problems of the following form (Eq. 5):

$$\begin{aligned}&\arg \mathop {\min }\limits _{x,y} f(x) + g(y) \nonumber \\&s.t. Ax + By = c \end{aligned}$$

The augmented Lagrange function of this problem is defined as Eq. 6.

$$\begin{aligned} L(x,y,\varsigma )= & {} f(x) + g(y) + {\varsigma ^T}(Ax + By - c)\nonumber \\&+ \frac{\mu }{2}\left\| {Ax + By - c} \right\| _2^2 \end{aligned}$$

The augmented Lagrangian adds a squared penalty term to the ordinary Lagrangian. The main purpose of the augmented term is to guarantee convergence as long as f is a convex function. L is then solved by the dual ascent method (Eq. 7):

$$\begin{aligned}&({x^{k + 1}},{y^{k + 1}}): = \arg \mathop {\min }\limits _{x,y} {L_\mu }(x,y,{\varsigma ^k}) \nonumber \\&{\varsigma ^{k + 1}}: = {\varsigma ^k} + \mu (A{x^{k + 1}} + B{y^{k + 1}} - c) \end{aligned}$$

The classic ADMM framework is as follows. Initialize \({y^0}\), \({\varsigma ^0}\), and \(\mu > 0\). The alternating direction in ADMM modifies the dual ascent step above, in which x and y are updated jointly, so that x and y are updated alternately. The iterative steps are given in Eq. 8:

$$\begin{aligned} \begin{array}{l} {x^{k + 1}}: = \arg \mathop {\min }\limits _x {L_\mu }(x,{y^k},{\varsigma ^k})\\ {y^{k + 1}}: = \arg \mathop {\min }\limits _y {L_\mu }({x^{k + 1}},y,{\varsigma ^k})\\ {\varsigma ^{k + 1}}: = {\varsigma ^k} + \mu (A{x^{k + 1}} + B{y^{k + 1}} - c) \end{array} \end{aligned}$$
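The alternating updates of Eq. 8 can be illustrated on a small problem with the same structure as the tracker objective, namely a least-squares data term plus an element-wise weighted penalty coupled by the constraint x = y. The sketch below is a toy example in the spatial domain under these assumptions, not the Fourier-domain solver of this paper. The function name `admm_weighted_ridge` and the random test data are hypothetical.

```python
import numpy as np


def admm_weighted_ridge(A, b, theta, mu=1.0, iterations=200):
    """Minimize 0.5*||A x - b||^2 + 0.5*||theta * y||^2 subject to x = y."""
    n = A.shape[1]
    y = np.zeros(n)
    dual = np.zeros(n)            # Lagrange multiplier (varsigma in Eq. 8)
    Atb = A.T @ b
    lhs = A.T @ A + mu * np.eye(n)  # constant matrix of the x sub-problem
    for _ in range(iterations):
        # x-update: minimize the augmented Lagrangian over x, y and dual fixed.
        x = np.linalg.solve(lhs, Atb - dual + mu * y)
        # y-update: element-wise closed form using the new x.
        y = (dual + mu * x) / (theta ** 2 + mu)
        # Dual update: ascent step on the constraint residual x - y.
        dual = dual + mu * (x - y)
    return x, y


rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)
theta = np.linspace(0.1, 1.0, 5)
x_admm, y_admm = admm_weighted_ridge(A, b, theta)
# Reference: the same objective solved directly without splitting.
x_direct = np.linalg.solve(A.T @ A + np.diag(theta ** 2), A.T @ b)
print(np.max(np.abs(x_admm - x_direct)))
```

Both sub-problems have closed-form solutions here, which mirrors the reason for splitting the filter into two variables: each update stays cheap even though the joint problem couples the data term and the weighted penalty.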

If the termination condition is satisfied, the iteration stops and the result is output; otherwise, the iteration continues. To apply ADMM, Eq. 2 is converted to the augmented Lagrangian form. Because the ADMM iteration requires two variables, an auxiliary variable \(\beta \) is constructed and constrained by \(\beta = \omega \), so that Eq. 2 can be rewritten as Eq. 9,

$$\begin{aligned} \begin{array}{l} \arg \mathop {\min }\limits _{\omega ,\beta } \frac{1}{2}\sum \limits _{i = 1}^t {{{\left\| {\sum \limits _{j = 1}^d {x_i^j * {\beta ^j}} - {y_i}} \right\| }^2}} + \frac{1}{2}\sum \limits _{j = 1}^d {{{\left\| {\theta \odot {\omega ^j}} \right\| }^2}} \\ s.t. \beta = \omega \end{array} \end{aligned}$$

Converting the above equation to the frequency domain (Eq. 10),

$$\begin{aligned} \begin{array}{l} \arg \mathop {\min }\limits _{\omega ,{\hat{\beta }} } \frac{1}{2}\left\| {{\hat{y}} - {\hat{X}}{\hat{\beta }} } \right\| _2^2 + \frac{1}{2}\left\| {\theta \omega } \right\| _2^2\\ s.t.\ {\hat{\beta }} = \sqrt{t} F\omega \end{array} \end{aligned}$$

where \(\wedge \) denotes the Fourier transform of a variable; for example, the discrete Fourier transform of a one-dimensional signal a is \({\hat{a}} = \sqrt{t} Fa\). F is the orthogonal Fourier transform matrix of size \(t \times t\), \({\hat{y}} = [{\hat{y}}(1),{\hat{y}}(2),...,{\hat{y}}(t)]\), \({\hat{X}} = [diag{({{\hat{x}}_1})^T},...,diag{({{\hat{x}}_d})^T}]\) has size \(t \times dt\), \(\hat{\beta }= [{\hat{\beta }} _1^T,...,{\hat{\beta }} _d^T]\), and \(h = [h_1^T,...,h_d^T]\) is a matrix composed of multi-channel cyclic samples with the size \(dt \times 1\). Thus, the augmented Lagrangian is given by Eq. 11:

$$\begin{aligned} \begin{array}{lll} L(\omega ,{\hat{\beta }},{\hat{\varsigma }} ) &{}= \frac{1}{2}\left\| {{\hat{y}} - {\hat{X}}{\hat{\beta }} } \right\| _2^2 + \frac{1}{2}\left\| {\theta \omega } \right\| _2^2 + {{{\hat{\varsigma }} }^T}({\hat{\beta }} - \sqrt{t} F\omega )\\ &{}\quad + \frac{\mu }{2}\left\| {\hat{\beta }- \sqrt{t} F\omega } \right\| _2^2 \end{array} \end{aligned}$$

Here \(\mu \) is the penalty factor and \({\hat{\varsigma }} = {[\hat{\varsigma }_1^T,...,{\hat{\varsigma }} _d^T]^T}\) is the Lagrange vector of size \(dt \times 1\) in the Fourier domain. ADMM solves the above equation iteratively according to Eq. 8, and each of the sub-problems for \(\omega \) and \({\hat{\beta }}\) has a closed-form solution.

For sub-problem \(\omega \) the solution formula is Eq. 12

$$\begin{aligned} \begin{array}{llll} \omega &{}= \arg \mathop {\min }\limits _\omega \{ \frac{1}{2}\left\| {\theta \omega } \right\| _2^2 + {{{\hat{\varsigma }} }^T}({\hat{\beta }} - \sqrt{t} F\omega ) \\ &{}\quad + \frac{\mu }{2}\left\| {{\hat{\beta }} - \sqrt{t} F\omega } \right\| _2^2\} = \frac{{\varsigma + \mu \beta }}{{\theta \odot \theta + \mu }} \end{array} \end{aligned}$$

Here \(\varsigma = \frac{1}{{\sqrt{t} }}{F^{ - 1}}{\hat{\varsigma }} \) and \(\beta = \frac{1}{{\sqrt{t} }}{F^{ - 1}}{\hat{\beta }}\). Due to the linear nature of the discrete Fourier transform, each channel in the arrays \(\{ {\varsigma _1},...,{\varsigma _d}\}\) and \(\{ {\beta _1},...,{\beta _d}\}\) can be solved separately in the Fourier domain, and the computational complexity of Eq. 12 is \( O(dt\log (t))\).

For sub-problem \({\hat{\beta }}\) the solution formula is Eq. 13:

$$\begin{aligned} \begin{array}{llll} {\hat{\beta }} &{} = \arg \mathop {\min }\limits _{{\hat{\beta }} } \frac{1}{2}\left\| {{\hat{y}} - {\hat{X}}{\hat{\beta }} } \right\| _2^2 + {{{\hat{\varsigma }} }^T}({\hat{\beta }} - \sqrt{t} F\omega ) \\ &{}\quad + \frac{\mu }{2}\left\| {{\hat{\beta }} - \sqrt{t} F\omega } \right\| _2^2 \end{array} \end{aligned}$$

The complexity of solving this equation directly is \(O({t^3}{d^3})\); because every ADMM iteration must solve for \({\hat{\beta }}\), this significantly affects the real-time performance of the algorithm. However, \({\hat{X}}\) is a sparse banded matrix. Accordingly, each element \({\hat{y}}(s)\), \(s = 1,...,t\), of \({\hat{y}}\) depends only on the d values \({\hat{x}}(s) = {[{{\hat{x}}_1}(s),...,{{\hat{x}}_d}(s)]^T}\) and \({\hat{\beta }} (s) = {[conj({{\hat{\beta }} _1}(s)),...,conj({{\hat{\beta }} _d}(s))]^T}\), where conj denotes the complex conjugate. Therefore, the above equation can be decomposed into t independent smaller problems, one for each \({\hat{\beta }} (s)\), \(s = 1,...,t\) (Eq. 14):

$$\begin{aligned} \begin{array}{llll} {\hat{\beta }} (s) &{}= \arg \mathop {\min }\limits _{{\hat{\beta }} (s)} \{ \frac{1}{2}\left\| {{\hat{y}}(s) - {\hat{x}}{{(s)}^T}{\hat{\beta }} (s)} \right\| _2^2 + {\hat{\varsigma }} {(s)^T}({\hat{\beta }} (s) - {\hat{\omega }} (s)) \\ &{}\quad + \frac{\mu }{2}\left\| {{\hat{\beta }} (s) - {\hat{\omega }} (s)} \right\| _2^2\} \end{array} \end{aligned}$$

Here, \( {\hat{\omega }} (s) = [{{\hat{\omega }} _1}(s),...,{{\hat{\omega }} _d}(s)]\) and \({{\hat{\omega }} _j} = \sqrt{t} F{\omega _j}\). The solution of each problem is Eq. 15:

$$\begin{aligned} {\hat{\beta }} (s) = \frac{{{\hat{y}}(s){\hat{x}}(s) - t{\hat{\varsigma }} (s) + \mu t{\hat{\omega }} (s)}}{{{\hat{x}}(s){\hat{x}}{{(s)}^T} + \mu t{I_d}}} \end{aligned}$$

The computational complexity of Eq. 15 is \(O(t{d^3})\), because t independent \(d \times d\) linear systems must be solved. Since the denominator is a rank-one update of a scaled identity matrix, the Sherman-Morrison formula \( {(u{v^T} + A)^{ - 1}} = {A^{ - 1}} - {(1 + {v^T}{A^{ - 1}}u)^{ - 1}}{A^{ - 1}}u{v^T}{A^{ - 1}}\) can be used for acceleration, with \(A = \mu t{I_d}\) and \(u = v = {\hat{x}}(s)\). Thus, the original formula can be simplified as Eq. 16,

$$\begin{aligned} \begin{array}{lllll} {\hat{\beta }} (s) &{}= \frac{1}{\mu }(t{\hat{y}}(s){\hat{x}}(s) - {\hat{\varsigma }} (s) + \mu {\hat{\omega }} (s))\\ &{}\quad - \frac{{{\hat{x}}(s)}}{{\mu b}}\left( {t{\hat{y}}(s){{{\hat{S}}}_x}(s) - {{{\hat{S}}}_\varsigma }(s) + \mu {{{\hat{S}}}_\omega }(s)} \right) \end{array} \end{aligned}$$

Here, \({{\hat{S}}_x}(s) = {\hat{x}}{(s)^T}{\hat{x}}(s)\), \({{\hat{S}}_\varsigma }(s) = {\hat{x}}{(s)^T}{\hat{\varsigma }}(s)\), \({{\hat{S}}_\omega }(s) = {\hat{x}}{(s)^T}{\hat{\omega }}(s)\), and \( b = {{\hat{S}}_x}(s) + \mu t\). Therefore, the computational complexity of the formula is reduced to \(O(td)\).
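The saving that Eq. 16 obtains from the rank-one structure can be checked numerically. The sketch below solves a system of the form (x x^H + c I) beta = r with the Sherman-Morrison formula in O(d) operations and compares it with a direct O(d^3) solve. It assumes a Hermitian outer product for complex vectors and uses a scalar c in place of the product of the penalty factor and t; the function name `solve_rank_one` and the random data are hypothetical.

```python
import numpy as np


def solve_rank_one(x, rhs, c):
    """Solve (x x^H + c I) beta = rhs in O(d) via Sherman-Morrison."""
    s_x = np.vdot(x, x).real      # x^H x, a non-negative scalar
    s_r = np.vdot(x, rhs)         # x^H rhs
    # (c I + x x^H)^{-1} = (1 / c) * (I - x x^H / (c + x^H x))
    return (rhs - x * (s_r / (c + s_x))) / c


rng = np.random.default_rng(1)
d = 45                            # number of fused feature channels
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
rhs = rng.standard_normal(d) + 1j * rng.standard_normal(d)
c = 2.0                           # stands in for mu * t (assumption)
fast = solve_rank_one(x, rhs, c)
direct = np.linalg.solve(np.outer(x, x.conj()) + c * np.eye(d), rhs)
print(np.allclose(fast, direct))
```

Only two inner products and a few vector operations are needed per frequency, which is what reduces the per-iteration cost from cubic to linear in the number of channels.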

The Lagrange multiplier is updated iteratively by Eq. 17:

$$\begin{aligned} {{\hat{\varsigma }} ^{k + 1}}: = {{\hat{\varsigma }} ^k} + \mu ({{\hat{\beta }} ^{k + 1}} - {{\hat{\omega }} ^{k + 1}}) \end{aligned}$$

where \({{\hat{\beta }} ^{k + 1}}\) and \({\omega ^{k + 1}}\) are the current solutions to the above sub-problems at iteration \(k + 1\) of ADMM, \({{\hat{\omega }} ^{k + 1}} = \sqrt{t} F{\omega ^{k + 1}}\), and the penalty factor is updated as \({\mu ^{k + 1}} = \min ({\mu _{\max }},\alpha {\mu ^k})\). The filter parameter \({\hat{\beta }} _t^j\) is obtained through the ADMM iterative optimization, and the change in the position of the tracking target is estimated from the target response graph of the standard correlation filter used in tracking. Thus, the target output response in the time domain is given by Eq. 18:

$$\begin{aligned} f(z) = {F^{ - 1}}\left( \sum \limits _{j = 1}^d {{{{\hat{z}}}^j} \odot \hat{\beta }_t^j} \right) \end{aligned}$$
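Eq. 18 can be sketched with a fast Fourier transform as shown below. The code follows the equation as written, without any complex conjugation, and takes the real part of the inverse transform; whether a conjugate is applied depends on the filter convention and is left as an assumption. The function names `response_map` and `peak_location` and the toy data are hypothetical.

```python
import numpy as np


def response_map(z, beta_hat):
    """Eq. 18: inverse FFT of the channel-wise sum of z_hat * beta_hat.

    z has shape (d, H, W) in the spatial domain; beta_hat has the same
    shape in the Fourier domain.
    """
    z_hat = np.fft.fft2(z, axes=(-2, -1))
    return np.real(np.fft.ifft2(np.sum(z_hat * beta_hat, axis=0)))


def peak_location(response):
    """Row and column of the largest response value."""
    return np.unravel_index(np.argmax(response), response.shape)


# Toy check: a filter equal to a shifted impulse circularly shifts the input.
rng = np.random.default_rng(2)
z = rng.random((1, 8, 8))
impulse = np.zeros((1, 8, 8))
impulse[0, 2, 3] = 1.0
beta_hat = np.fft.fft2(impulse, axes=(-2, -1))
resp = response_map(z, beta_hat)
print(np.allclose(resp, np.roll(z[0], shift=(2, 3), axis=(0, 1))))
print(peak_location(resp))
```

The position of the peak relative to the window center gives the estimated displacement of the target between frames.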

Scale adaptive scheme

The size of the target template remains fixed in most tracking methods. To deal with scale variation, an extension of the scale space from the countable integer space to the uncountable floating-point space is proposed. Assuming that the size of the original image in the template is \({s_k}\), a scale pool \(S = \{ {d_1}{s_k},{d_2}{s_k},...,{d_d}{s_k}\}\) of d different scales is defined for tracking. In each new frame, d image blocks of different sizes are taken according to S, and each image block is then resized by bilinear interpolation to the same dimensions as the initial frame template \({s_k}\). Figure 4 depicts the specific process.

Fig. 4 Sampling and adjustment process. (1) In the new frame, sample the image by sliding window according to the d different scales in S, and calculate the sample response of each scale to determine the candidate regions of different scales. (2) Resize these candidate image blocks to the same dimensions as the initial frame template by bilinear interpolation. (3) Perform feature extraction on the candidate image blocks

A dedicated scale filter is trained in the algorithm to estimate the scale of the target. Sliding-window sampling is used to sample candidates at the different scales in the scale pool, and the response value of each candidate is calculated separately. The scale of the target in the new frame is updated to the scale with the largest response in the scale pool, which improves adaptability to scale changes of different targets and thereby achieves adaptive scale updating. The target candidate box is calculated by Eq. 19,

$$\begin{aligned} box = \arg \max {F^{ - 1}}{\hat{f}}({z^{{d_i}}}) \end{aligned}$$

Here \({z^{{d_i}}}\) is the image block of size \({d_i}{s_k}(i = 1,..,d)\) detected in a new frame. The response graph gives the displacement of the target in the resized block, and the corresponding real displacement is obtained by multiplying this displacement by the selected scale d.
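The scale search can be sketched as follows. For each scale in the pool, a block of the scaled size is sampled around the target center and resampled by bilinear interpolation to the template size, and the scale with the largest peak response is kept. This is a minimal sketch on a single-channel frame; the response function is passed in as a callable because the trained filter is not reproduced here. The names `sample_patch` and `select_scale` are hypothetical, and the scale factors are those listed in the experimental configuration.

```python
import numpy as np

SCALE_POOL = [0.97, 0.98, 0.99, 1.00, 1.01, 1.02, 1.03]


def sample_patch(frame, center, size, out_shape):
    """Crop a (possibly fractional) window and resize it bilinearly."""
    cy, cx = center
    h, w = size
    out_h, out_w = out_shape
    # Floating-point sampling coordinates, clamped to the image bounds.
    ys = np.clip(np.linspace(cy - h / 2.0, cy + h / 2.0, out_h), 0, frame.shape[0] - 1)
    xs = np.clip(np.linspace(cx - w / 2.0, cx + w / 2.0, out_w), 0, frame.shape[1] - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, frame.shape[0] - 1)
    x1 = np.minimum(x0 + 1, frame.shape[1] - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = frame[np.ix_(y0, x0)] * (1 - wx) + frame[np.ix_(y0, x1)] * wx
    bottom = frame[np.ix_(y1, x0)] * (1 - wx) + frame[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bottom * wy


def select_scale(frame, center, base_size, response_fn, out_shape, scales=SCALE_POOL):
    """Return the scale whose resized block yields the largest peak response."""
    best_scale, best_peak = None, -np.inf
    for scale in scales:
        size = (base_size[0] * scale, base_size[1] * scale)
        patch = sample_patch(frame, center, size, out_shape)
        peak = float(np.max(response_fn(patch)))
        if peak > best_peak:
            best_scale, best_peak = scale, peak
    return best_scale, best_peak


# Toy usage: on a linear ramp image with an identity "response", the
# largest window reaches the largest values, so the largest scale wins.
frame = np.add.outer(np.arange(100.0), np.arange(100.0))
best, peak = select_scale(frame, (50.0, 50.0), (40.0, 40.0), lambda p: p, (32, 32))
print(best)
```

Sampling at fractional coordinates combines the cropping and resizing steps, so the window size does not have to be rounded to whole pixels for scale factors as close together as 0.99 and 1.01.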

Model update strategy based on high confidence

Current target tracking algorithms update the model in almost every frame, regardless of the accuracy of the target detection. When a new tracking result is inaccurate, it is still used to update the model and pollutes it, which leads to tracking drift. In this algorithm, the HSV, HOG, and CN features are combined for target tracking. Because the final feature dimension is very high, a large number of parameters must be updated at every model update, which is time-consuming. Updating the model in every frame therefore slows tracking.

Therefore, the model update strategy based on high confidence solves the pollution problem of the model, improves the robustness of the tracking algorithm to occlusion and other issues, improves the tracking speed, and prevents over-fitting.

In practice, when the target is occluded, the KCF tracking result drifts, and a longer occlusion causes tracking to fail. KCF [4] updates the model in every frame without considering occlusion, so when the target is occluded the tracking model is polluted, causing target loss. It follows that the model should be updated only when the content of the target box in the current frame has high confidence (the target is neither occluded nor blurred). Therefore, how to judge the sample confidence is the research problem of this section. Wang [28] concluded, through multiple KCF experiments, that the response profile of KCF has only one distinct peak and that its overall distribution is approximately a two-dimensional Gaussian. When a complicated situation occurs in the tracking process (especially occlusion, loss, or blur), the response graph oscillates violently.

The peaks and fluctuations in the response graph reflect the confidence of the tracking result to some extent. When the detected target matches the correct target perfectly, the ideal response graph has only one peak and is smooth elsewhere. The higher the correlation peak, the better the positioning accuracy. With inaccurate positioning, the response graph oscillates violently, and its shape differs significantly from that of a correct match. Thus, this paper proposes the judgment formula CPMDR (Eq. 20):

$$\begin{aligned} CPMDR = {\left| {{f_{\max }} - {f_{\min }}} \right| ^2}\frac{{MN}}{{\sum \limits _{m = 1}^M {\sum \limits _{n = 1}^N {{{({f_{m,n}} - {f_{\min }})}^2}} } }} \end{aligned}$$

where \({f_{\max }}\) is the maximum value of the \(M \times N\) response graph, \({f_{\min }}\) is the minimum value of the response graph, and \({f_{m,n}}\) is the value of the response graph at (m, n).

The Correlation Peak Mean Difference Ratio (CPMDR) reflects the fluctuation of the response graph. When CPMDR is below a certain threshold, the target is judged to be lost, occluded, or out of sight.
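Eq. 20 can be computed directly from a response map, as in the sketch below. It treats the double sum divided by MN as the mean of the squared offsets from the minimum, and it returns 0 for a perfectly flat map, which is an assumed convention because the equation is undefined in that case. The function name `cpmdr` and the two toy maps are hypothetical.

```python
import numpy as np


def cpmdr(response):
    """Correlation peak mean difference ratio of a 2-D response map (Eq. 20)."""
    f = np.asarray(response, dtype=float)
    f_min = f.min()
    energy = np.mean((f - f_min) ** 2)   # (1 / MN) * sum of squared offsets
    if energy == 0.0:
        return 0.0                       # flat map: no peak (assumed convention)
    return float((f.max() - f_min) ** 2 / energy)


# A single sharp peak gives a large ratio.
sharp = np.zeros((10, 10))
sharp[5, 5] = 1.0
# A map where half of the values sit at the peak level gives a small ratio.
cluttered = np.zeros((10, 10))
cluttered[:, :5] = 1.0
print(cpmdr(sharp), cpmdr(cluttered))
```

For the single-peak map the ratio is about 100, while the cluttered map gives 2, which illustrates why a low value signals an unreliable detection.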

In traditional KCF tracking, the simple model update method is Eq. 21:

$$\begin{aligned} {\hat{x}}_{\mathrm{model}}^{(f)} = (1 - \eta ){\hat{x}}_{\mathrm{model}}^{(f - 1)} + \eta {{\hat{x}}^{(f)}} \end{aligned}$$

Here, \(\eta \) is the update rate of the model, \({\hat{x}}_{\mathrm{model}}^{(f)}\) is the model after frame f, and \({{\hat{x}}^{(f)}}\) is the sample extracted from the current frame. With this method, the classifier is updated in every frame, so once tracking fails it cannot recover. The proposed solution is a high-confidence model update strategy with an adaptive learning rate. To prevent the model from being contaminated, the target model must not be updated while the target area is occluded; the model is updated only when the CPMDR value exceeds a certain threshold. The model update rate is set to be positively correlated with the CPMDR value, \(\eta = {\eta _1}(1 - \frac{1}{{CPMDR}})\). With \({\eta _1}\) set to 0.02, the adaptive model update is Eq. 22:

$$\begin{aligned} {\hat{x}}_{\mathrm{model}}^{(f)} = \left\{ \begin{array}{ll} (1 - \eta ){\hat{x}}_{\mathrm{model}}^{(f - 1)} + \eta {{\hat{x}}^{(f)}}, &{} \eta > \mathrm{threshold}\\ {\hat{x}}_{\mathrm{model}}^{(f - 1)}, &{} \mathrm{else} \end{array} \right. \end{aligned}$$

The updated model is then used to calculate \({\hat{\beta }} (s)\), \({{\hat{S}}_x}(s)\), \({{\hat{S}}_\varsigma }(s)\), and \({{\hat{S}}_\omega }(s)\). In the experiments, tracking was identified as accurate when the CPMDR value was greater than 50, so the threshold on \(\eta \) is set to \(0.02 \times (1 - 1/50) = 0.0196\). Figures 5 and 6 compare the basic KCF with the improved method.
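The high-confidence update rule can be sketched as follows, assuming that the second term on the right-hand side of Eq. 22 is the feature template extracted from the current frame. The base rate 0.02 and the threshold 0.0196 are the values given in the text; the function names `adaptive_rate` and `update_model` and the toy arrays are hypothetical.

```python
import numpy as np

ETA_1 = 0.02        # base learning rate from the text
THRESHOLD = 0.0196  # learning-rate threshold from the text


def adaptive_rate(cpmdr_value, eta_1=ETA_1):
    """Learning rate eta = eta_1 * (1 - 1 / CPMDR); 0 for non-positive CPMDR."""
    if cpmdr_value <= 0:
        return 0.0
    return eta_1 * (1.0 - 1.0 / cpmdr_value)


def update_model(model_prev, sample, cpmdr_value, threshold=THRESHOLD):
    """Eq. 22: blend in the new sample only when eta exceeds the threshold."""
    eta = adaptive_rate(cpmdr_value)
    if eta > threshold:
        return (1.0 - eta) * model_prev + eta * sample
    return model_prev


model = np.zeros(4)
sample = np.ones(4)
confident = update_model(model, sample, cpmdr_value=100.0)  # eta = 0.0198, update
occluded = update_model(model, sample, cpmdr_value=10.0)    # eta = 0.018, keep old model
print(confident, occluded)
```

Because the rate approaches 0.02 only as CPMDR grows, frames with a CPMDR just above 50 still contribute less to the model than frames with a very sharp response peak.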

Fig. 5
figure 5

a The result of Basic KCF algorithm tracking. b The result of KCF algorithm tracking with high confidence model update strategy added

The comparison of the two sets of pictures reveals that the KCF algorithm with a high-confidence model update strategy is better than the basic KCF algorithm. As the improved KCF algorithm does not update the model when it is occluded, the model is not contaminated. Besides, after the target reappeared, the algorithm tracked the target again.


The experimental configuration

The proposed algorithm is implemented in MATLAB R2014a and runs at a tracking speed of 12 frames per second. The experimental platform is configured as follows: the operating system is 64-bit Windows 7, the memory is 16 GB, the CPU is an Intel i7-8700K (6 cores, 3.7 GHz), and the graphics card is an NVIDIA GeForce GTX 1060.

The basic parameters of the experiment are as follows. The HOG feature uses a \(4 \times 4\) pixel cell size, the scale pool size is 7, and the scale factor is \(S = [0.97,0.98,0.99,1.00,1.01,1.02,1.03]\). The search area is \({4^2}\) times the target area, the regularized base weight \({\theta _{base}}\) is 0.1, and the weight factors \({\theta _{width}}\) and \({\theta _{height}}\) are 3. For ADMM optimization, the number of iterations is 2 and the penalty factor \(\mu \) is initialized to 1. In iteration \(k + 1\), the penalty factor is updated by \({\mu ^{k + 1}} = \min ({\mu _{\max }},\alpha {\mu ^k})\), where \(\alpha = 10\) and \({\mu _{\max }} = {10^3}\). The threshold of the target template learning rate is 0.0196.
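To make the parameter settings above concrete, the short Python sketch below lists the candidate target sizes produced by the scale pool, the search window, and the penalty factor used in successive ADMM iterations. The function names, the example target size of 100 by 50 pixels, the reading of "\({4^2}\) times the target area" as 4 times along each side, and the assumption that the first iteration uses \(\mu = 1\) are illustrative choices, not details taken from the implementation.

```python
SCALE_POOL = [0.97, 0.98, 0.99, 1.00, 1.01, 1.02, 1.03]


def candidate_sizes(width, height, scales=SCALE_POOL):
    """Target sizes evaluated by the scale filter, one per scale factor."""
    return [(width * s, height * s) for s in scales]


def search_window(width, height, side_factor=4):
    """Search area of 4^2 times the target area (assumed 4x along each side)."""
    return (width * side_factor, height * side_factor)


def penalty_schedule(iterations, mu=1.0, alpha=10.0, mu_max=1e3):
    """Penalty factor per ADMM iteration: mu_next = min(mu_max, alpha * mu)."""
    schedule = []
    for _ in range(iterations):
        schedule.append(mu)
        mu = min(mu_max, alpha * mu)
    return schedule


print(candidate_sizes(100, 50)[0], candidate_sizes(100, 50)[-1])
print(search_window(100, 50))
print(penalty_schedule(2))  # the two iterations used in the experiments
print(penalty_schedule(5))  # longer run, showing the cap at mu_max
```

With only two iterations the cap \({\mu _{\max }}\) is never reached; it only matters if the iteration count is increased.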

The proposed algorithm is tested on the OTB50 standard target tracking test set [28], which contains 50 video sequences. To demonstrate its tracking performance fully, it is compared with nine relevant algorithms on the same dataset: ECO [29], SRDCF [30], STAPLE-CA [12], SAMF [18], DCF-CA [31], KCF [4], STRUCK [32], TLD [10], and CT [33]. Among them, KCF [4], STRUCK [32], TLD [10], and CT [33] are the best classical algorithms from the OTB benchmark test. ECO [29], SRDCF [30], STAPLE-CA [12], SAMF [18], and DCF-CA [31] are the best tracking algorithms based on correlation filtering, and ECO is also a classical algorithm combining correlation filtering and deep learning.

Quantitative comparisons

Overall performance

The tracking results are evaluated comprehensively in the following two ways. (1) The success rate for distance error: if, in a given frame, the distance error between the result of the tracking algorithm and the manually annotated result is less than a certain threshold, the frame is regarded as successful. (2) The success rate for coincidence degree: if, in a given frame, the coincidence degree between the result of the tracking algorithm and the manually annotated result is larger than a certain threshold, the frame is regarded as successful.

Figure 7 shows the success rates of tracking on the OTB50 test videos; (a) is the precision plot, and (b) is the success plot. In (a), the horizontal axis represents the threshold of the distance error, and the vertical axis represents the ratio of the number of frames with a distance error less than the threshold to the total number of frames. The number after the title indicates the number of test videos that have the given attribute. The number after each algorithm name indicates the area under the curve (AUC), and OPE (One-Pass Evaluation) means that each video is tracked once from start to end. The distance error success rate reflects the accuracy of the tracking position. (b) shows the success rate for coincidence degree, where the horizontal axis represents the threshold of the coincidence degree, and the vertical axis represents the ratio of the number of frames with a coincidence degree greater than the threshold to the total number of frames. The success rate for coincidence degree reflects the overall tracking accuracy of the algorithm.
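The two evaluation measures described above can be sketched in Python as follows. This is an illustrative sketch, not the benchmark toolkit: boxes are assumed to be `(x, y, w, h)` tuples, the coincidence degree is assumed to be the intersection-over-union of the two boxes, the AUC is approximated by averaging the success rate over evenly spaced thresholds, and the thresholds of 20 pixels and 0.5 in the usage lines are example values that the text does not fix. All function names are hypothetical.

```python
import math


def center_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def overlap(box_a, box_b):
    """Coincidence degree: intersection area divided by union area."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def precision(tracked, annotated, threshold):
    """Fraction of frames whose center error is below the threshold."""
    hits = sum(center_error(t, a) < threshold
               for t, a in zip(tracked, annotated))
    return hits / len(annotated)


def success(tracked, annotated, threshold):
    """Fraction of frames whose overlap is above the threshold."""
    hits = sum(overlap(t, a) > threshold
               for t, a in zip(tracked, annotated))
    return hits / len(annotated)


def success_auc(tracked, annotated, steps=20):
    """Approximate area under the success plot over thresholds in [0, 1]."""
    thresholds = [i / steps for i in range(steps + 1)]
    total = sum(success(tracked, annotated, t) for t in thresholds)
    return total / len(thresholds)


annotated = [(0, 0, 10, 10)] * 3
tracked = [(0, 0, 10, 10), (5, 5, 10, 10), (100, 100, 10, 10)]
print(precision(tracked, annotated, threshold=20.0))
print(success(tracked, annotated, threshold=0.5))
print(success_auc(tracked, annotated))
```

In this three-frame toy example, two frames pass the distance test at 20 pixels but only one passes the overlap test at 0.5, which shows why the two plots can rank trackers differently.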

In addition, the computation times of the competing algorithms and the proposed algorithm are listed in Table 1, which shows that the proposed method achieves the best tracking accuracy with a short computation time.

Table 1 Comparison of computation time between the competing algorithms and the proposed algorithm
Fig. 6

Comparison of success rates on the OTB-50 test videos (the red line is ours)

Figure 7 shows that the accuracy and success rate scores of the proposed algorithm are 0.853 and 0.821, respectively, the best among the ten tracking algorithms compared. Compared to the classic KCF [4] algorithm, the accuracy and success rate increase by 11.3 and 19.8 percentage points, respectively. Compared to the second-ranked ECO [29] algorithm, they increase by 0.5 and 1.8 percentage points, respectively. Among the top five algorithms, SAMF [18] and STAPLE-CA [12] are improved versions of KCF [4]: SAMF [18] adds an adaptive scale transformation to KCF [4], and STAPLE-CA [12] adds feature fusion, combining CN and HOG, to KCF [4]. ECO [29] and SRDCF [30] are improved versions of the DCF [31] tracker, which is context-aware and takes background information into account in its appearance model. SRDCF adds spatial regularization to DCF [31]. ECO [29] integrates CNN [34] features into SRDCF [30] and accelerates the algorithm. The proposed algorithm is based on SRDCF, with the addition of feature fusion and a confidence-based model update. In addition, the ADMM algorithm is introduced to accelerate the iterative computation, which reduces the computational complexity and improves both the tracking accuracy and the calculation speed. The experimental results show that the proposed algorithm has higher tracking accuracy and robustness.

Performance analysis based on video attributes

To better analyze the strengths and limitations of the proposed algorithm, Figures 8 and 9 show the accuracy scores and success rate scores of the 10 algorithms for 11 video attributes. These 11 attributes are (a) fast motion, (b) background clutter, (c) motion blur, (d) deformation, (e) illumination variation, (f) in-plane rotation, (g) low resolution, (h) occlusion, (i) out-of-plane rotation, (j) out of view, and (k) scale variation. In the accuracy score, the proposed algorithm ranks among the top four algorithms, with six of the eleven attributes ranked in the top two. In the success rate score, the proposed algorithm performs well on all eleven attributes, with eight scores in the top two and five ranked first.

Fig. 7

Accuracy score curve of 11 algorithms on an OTB-50 dataset

Figure 9 shows the success plots of the ten algorithms for the eleven video attributes. The eleven attributes are illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out of view, background clutter, and low resolution. In the success rate score, eight of the eleven attributes of the proposed algorithm are in the top two, and five are ranked first.

Fig. 8

The success rate score curve of 11 algorithms in the OTB-50 dataset

Fig. 9

Comparison of the tracking results of the ten algorithms under occlusion on three different video sequences

Fig. 10

Tracking effect of multiple trackers under fast-moving and cluttered background

For the two attributes of fast motion and target occlusion, the proposed algorithm ranks first in both tracking accuracy and success rate. In the case of fast motion, the algorithm improves the accuracy by 17.5 percentage points and the success rate by 20.4 percentage points compared with the traditional KCF [4] algorithm. This is because traditional algorithms update the target model in every frame, which can easily lead to template pollution and tracking failure, and because they treat the background and the target equally, which may cause the target to be lost when it moves fast. As the proposed algorithm introduces spatial regularization to penalize the sample boundary, it reduces both the influence of the background on the target model and the boundary effect, which allows the target to be searched for over a broader range and effectively prevents the target from being lost.

In the case of target occlusion, the tracking accuracy and success rate of the proposed algorithm are 0.864 and 0.827, respectively. The tracking accuracy is 2 percentage points higher than that of the second-place SRDCF (0.844) [30], and the tracking success rate is 3.1 percentage points higher than that of the second-place ECO (0.796) [29]. When the target is occluded, the tracking result drifts. Updating the model at this time would pollute the tracking model and affect the subsequent tracking accuracy. For this reason, the paper proposes the correlation peak mean difference ratio, which determines whether the target is occluded and therefore whether to update the model or to suspend the update so that the model is not polluted by the resulting drift. The above experimental results also show that this method is effective.

Illumination changes occur in 25 sequences of the OTB-50 dataset. The accuracy and success rate of our algorithm under this attribute are 0.777 and 0.740, ranking second and first, respectively. Deformation occurs in nineteen sequences, and the accuracy and success rate of the algorithm under this attribute are 0.823 and 0.808, ranking third and second, respectively. This is due to the fusion of the three features HOG, CN, and HSV. The HOG feature mainly focuses on the contour gradient changes of the target and is not sensitive to changes in color, so it is very stable under changes in lighting. The CN and HSV features mainly focus on the color of the target and are not sensitive to changes in the target shape, so they are very stable under deformation. The experimental results reveal that other algorithms using feature fusion, such as STAPLE-CA [12], also achieve better results. Therefore, the introduction of feature fusion significantly improves the success rate of the algorithm under both illumination change and deformation. In addition, the adaptive scale update method with a pool of seven scales updates the scale of the target in real time. The proposed algorithm also performs well under the scale variation attribute, with tracking accuracy and success rate scores of 0.803 and 0.763, which rank second and first, respectively.


Performance analysis under occlusion

To verify the performance under occlusion, we conduct evaluations on video sequences in which the target is occluded over a large area, as shown in Fig. 10.

In the Jogging sequence, the tracked target is the girl on the left. At frame 75, the girl is completely occluded by the telephone pole. After the occlusion, the TLD [10], STRUCK [32], STAPLE-CA [12], and CT [33] algorithms produce large center errors and fail to track the target, whereas the other algorithms track more stably. Among them, the proposed algorithm, SRDCF [30], and ECO [29] relocate the target as soon as it reappears after the occlusion. In the David3 sequence, the tracked target is the walking person. In the 28th frame, the road sign occludes the target, and in the 82nd frame, the target is occluded by the tree trunk, causing the TLD [10], CT [33], and STRUCK [32] algorithms to fail. The proposed algorithm tracks stably throughout. In the Subway sequence, the target is blocked by passing pedestrians at frames 46 and 94, and the center error of the proposed algorithm remains small. Traditional algorithms update the target model in every frame, which can easily lead to template pollution and tracking failure. Moreover, traditional algorithms treat the target area and the background area equally, so it is difficult to find the target again once it is lost due to occlusion. The proposed algorithm has an adaptive template update strategy: if the target is occluded, the output response has multiple peaks, which makes the correlation peak mean difference ratio fall below the threshold, so the model update is suspended and the model is not contaminated. The introduction of spatial regularization suppresses the boundary effect and the influence of background information, which allows a broader search area so that the target is located accurately and promptly when it reappears.

Performance analysis under fast-moving and cluttered background

To investigate the effectiveness under fast motion and cluttered backgrounds, we conduct evaluations on video sequences with fast motion and cluttered backgrounds, as shown in Fig. 11. Among them, Deer, Liquor, and Jumping are all fast-motion cases, and Deer and Liquor are also cluttered-background cases.

Fig. 11

Comparison of tracking effects of multiple trackers under deformation and lighting changes

In the sequence Bolt, the tracked target is a sprinter, a non-rigid object. In this video, the deformation and moving speed of the target are relatively large, causing STRUCK [32], SAMF [18], TLD [10], and CT [33] to fail from the 13th frame. Since traditional algorithms generally use a single feature for feature extraction, they have certain limitations: in specific scenes, such as target deformation or illumination change, their performance is relatively poor. The proposed algorithm lost the target at the beginning, even though it is relatively robust owing to the fusion of HOG, CN, and HSV features. Because the color-related features are affected more by color changes but are more stable under target deformation, the algorithm can maintain accurate tracking. In the sequence Singer, the tracked target is a singer. In the 94th frame, the lighting of the video changes significantly compared to the previous frame, which causes a certain drift in CT [33]. Because the HOG features used by the proposed algorithm are not sensitive to changes in color, the algorithm is also very stable under changes in lighting. In the sequence Woman, the light changes from bright to dark to bright, and the tracked target is also deformed due to occlusion by the car, which poses a great challenge for tracking. Here, the TLD [10] and CT [33] algorithms lose the target in the 215th frame. The above experimental results indicate that the feature fusion method is strongly robust to deformation and illumination changes. In addition, the target scale changes in the 331st frame of the sequence Singer and the 566th frame of the sequence Woman. As the proposed algorithm includes an adaptive scale filter, it adapts well to the scale change.


Correlation-filtering-based target tracking methods have impressive tracking performance and computational efficiency. However, several factors limit the tracking accuracy, including object deformation, boundary effects, scale variations, and target occlusion. This paper proposes a robust target tracking algorithm based on spatial regularization and an adaptive model update to address these issues. First, a feature fusion method based on HOG, color-naming, and HSV features is used to enhance the discrimination of feature responses between the target and other regions. Second, a spatial weight function is introduced to penalize the magnitude of the filter coefficients; the spatial regularization weight is set according to the location information of the training samples and the target space, so that a larger detection area can be selected. An ADMM algorithm is then employed to reduce the iterations over the filter coefficients required by the larger detection area and to weaken the boundary effect while maintaining tracking efficiency. Third, an adaptive scale filter with a scale pool of seven scales is designed, which makes the algorithm adaptable to scale variations. Finally, the correlation peak mean difference ratio is applied to estimate the occlusion state, which realizes the adaptive updating of the model and improves the stability of the algorithm. Experiments conducted on the OTB50 dataset demonstrate that the proposed algorithm improves tracking results compared to state-of-the-art correlation-filtering-based target tracking methods.

This paper proposed an improved target tracking algorithm based on correlation filtering, aimed at the tracking failures to which the KCF [4] algorithm is prone in the cases of object deformation, boundary effects, scale variations, and target occlusion. To improve tracking stability under deformation and illumination variation, the fusion of HOG, color-naming, and HSV features is proposed to enhance the discrimination of feature responses between the target and other regions. To overcome the boundary effect inherent in correlation filtering, a spatial weight function is introduced that penalizes the magnitude of the filter coefficients, with the spatial regularization weight set according to the location information of the training samples and the target space. In addition, a larger detection area is adopted, and the ADMM algorithm is used to reduce iteration complexity, weaken the boundary effect, and improve the operating efficiency of the algorithm. To enhance adaptability to scale variations, an adaptive scale filter with a scale pool containing seven scales is added to the algorithm. To overcome the model pollution caused by target occlusion, the correlation peak mean difference ratio is proposed to detect the occlusion state, realize the adaptive updating of the model, and improve the stability of the algorithm. The proposed algorithm was tested on the OTB-50 dataset, and the overall precision rate and success rate were 0.853 and 0.821, respectively. The experimental results showed that the tracking algorithm presented in this paper was relatively stable under various conditions, which provides a theoretical and experimental basis for the design of a high-precision and fast target tracking algorithm with great potential for application. The proposed method aims at an adaptive and efficient tracking algorithm, so its efficiency was not compared with that of deep-learning-based methods. In subsequent work, the proposed method will be compared with tracking algorithms based on deep learning and combined with deep learning and other advanced methods such as fuzzy systems [7, 8] to further improve the efficiency of tracking.