Introduction

Rapid developments in artificial intelligence and computer vision have become visible in many fields. Computer vision uses cameras and computers in place of human eyes to visually recognize, track, and measure targets. First, image processing is performed so that the processed image is better suited to human observation or instrument detection. Then, visual object tracking estimates the target state in subsequent video frames given the initial position and size of the target. Currently, object tracking is widely used in transportation hub monitoring, medical imaging, human–computer interaction, and other related fields, and scholars have done much work in these areas. Specific applications, among several others, are:

  • Application of object tracking in unmanned aerial vehicles (UAVs). UAVs are commonly known as drones. Compared with the human eye, a drone offers stable and accurate capture, so drones using object tracking technology can achieve more robust tracking results [1]. Object tracking technology is also used to identify and track specific targets over a wide area to avoid accidents, for example in online grazing of grassland flocks and in forest fire monitoring and early warning.

  • Application of object tracking in industrial automation. All industries are developing intelligently, and object tracking is widely used in industrial production [2]. Computer vision technology can identify and track defective products. However, in actual automated production, products move quickly and camera equipment is limited, so motion blur easily occurs and increases the tracking difficulty. The realization of intelligent industrial production therefore still requires continued study of object tracking technology.

  • Application of object tracking in intelligent transportation. Today, an efficient transportation selection system is vital for the public. In intelligent transportation, real-time monitoring and tracking of vehicles can be achieved using object detection and tracking technology [3], which can be used to measure current lane congestion and then develop or select an optimized travel plan.

In addition to drones, industry, and intelligent transportation, artificial intelligence technologies such as object tracking can also be used in mobile healthcare and in military and space applications. For example, Chen [4] proposed the FGM–ACO–FWA method to address the unsustainability of smart technologies in mobile medicine. Considering the varied applications of aircraft in the military and space industries, Pazooki [5] modeled and simulated a special type of UAV so that it reaches a specified location at an appropriate time. Autonomous and intelligent systems have also made progress in urban traffic management: Wuthishuwong et al. [6] modeled the transportation network with a multi-agent concept to achieve balanced and stable traffic volume at each intersection. Object tracking can likewise be used in live broadcasts of sports events [7]. Bai et al. [8] proposed a correlation filter with heterogeneous feature adaptation to improve tracking ability, and Liu et al. [9] conducted extensive research on template matching strategies that improve tracking performance, a useful strategy for most newly developed filters used for various inspection purposes.

To make single object tracking technology and algorithms easy to understand comprehensively, this article focuses on the working principle and development of correlation filter algorithms. We provide a comprehensive introduction to existing datasets and summarize current correlation filter-based object tracking algorithms to compare models in the domain of correlation filter tracking. In the following sections, we first introduce the mainstream datasets for evaluating tracking algorithms. Then, the key technologies of the correlation filter algorithms and their results are summarized. Moreover, we propose a template update strategy for object tracking. Finally, experiments show that this method improves the tracking effect without using complex mathematical operations to expand the model.

Dataset introduction

In single object tracking, there are many types of datasets. Among them, the most authoritative and widely used are the VOT and OTB datasets. Figure 1 compares the datasets by showing the number of sequences and the average sequence length as a histogram.

Fig. 1

Comparison of the overall situation of each data set

In Fig. 1, the average sequence lengths of OTB2013, OTB2015, and TC128 are more than 500 frames, while VOT2014 has the fewest sequences. Among them, the OTB and VOT datasets are the most frequently used. This section introduces each dataset in detail.

VOT dataset

The VOT dataset, composed of high-resolution color sequences, has been updated every year since 2013. The latest VOT-2019 [10] introduces the VOT-RGBT and VOT-RGBD challenges: VOT-RGBT evaluates trackers that use four-channel (RGB + IR) input, and VOT-RGBD evaluates trackers that use four-channel (RGB + depth) input. VOT uses success rate and robustness as evaluation indices, where the success rate is the overlap rate between the tracker's bounding box and the ground-truth on a single test sequence, and robustness is the number of tracking failures on a single test sequence.

To better evaluate the performance of the tracker, VOT is divided into five visual attributes: occlusion (OCC), illumination change (IC), motion change (MC), size change (SC), and camera motion (CM). When a frame does not belong to any of these five attributes, it is represented as a non-degraded (ND) attribute. These attributes allow the tracker to be compared on a subset of frames corresponding to the same attribute.

OTB dataset

The OTB2013 dataset contains 51 videos, a quarter of which are grayscale sequences. The extended version OTB2015 [11] contains 100 video sequences, extended from OTB2013. The ground-truth of the dataset is manually labeled. The OTB evaluation uses both precision and success rate to evaluate tracker performance. Precision is based on the average Euclidean distance between the center of the algorithm's bounding box and the ground-truth center; the ratio of frames whose center error falls within a threshold to the total number of frames in the sequence is then calculated. This is the average pixel error (APE), and the threshold is typically set to 20 pixels. The success rate is based on the area overlap ratio between the ground-truth bounding box and the tracking result, and is the area under the curve of frames whose overlap rate exceeds a threshold. This is the average overlap rate (AOR), and the threshold is generally set to 0.5. The robustness evaluation of OTB includes one-pass evaluation (OPE), temporal robustness evaluation (TRE), and spatial robustness evaluation (SRE).
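As a concrete illustration, the two OTB metrics can be computed as follows. This is a minimal sketch, assuming bounding boxes are given as (x, y, w, h) rows of numpy arrays; it is not code from the official OTB toolkit.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are (x, y, w, h) rows."""
    return np.linalg.norm((pred[:, :2] + pred[:, 2:] / 2)
                          - (gt[:, :2] + gt[:, 2:] / 2), axis=1)

def iou(pred, gt):
    """Per-frame intersection-over-union of (x, y, w, h) boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision(pred, gt, threshold=20.0):
    """APE-style precision: fraction of frames with center error <= threshold px."""
    return float(np.mean(center_error(pred, gt) <= threshold))

def success_auc(pred, gt, steps=21):
    """Success plot AUC: mean success rate over overlap thresholds in [0, 1]."""
    overlaps = iou(pred, gt)
    return float(np.mean([np.mean(overlaps > t)
                          for t in np.linspace(0, 1, steps)]))
```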

In OPE, the tracker is given the initial bounding box of the first frame and runs to the end of the sequence. TRE divides the sequence into segments, evaluates the tracker on each segment, and aggregates all the information. SRE perturbs the initial bounding box of the first frame by shifting or scaling the ground-truth: it uses four center shifts, four corner shifts, and four scale changes. The shift amount is 10% of the target size, and the scale factor varies among 0.8, 0.9, 1.1, and 1.2. Therefore, SRE evaluates each tracker 12 times. Eleven sub-attributes are defined in the OTB, namely: illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC), and low resolution (LR). Figure 2 below shows some of the sub-attributes.

Fig. 2

Some of the challenging sub-attributes in the OTB. The left part shows illumination variation and the right part shows scale variation

TColor-128 dataset

Most modern trackers rely purely on the grayscale version of the input image, ignoring rich color information. TColor-128 [12] shows that color information is very helpful for visual tracking, and the improvement it brings is common across different algorithms. TColor-128 studies this systematically from both the algorithm and the benchmark perspectives. On the algorithm side, 16 state-of-the-art visual trackers are carefully selected and fully encoded with 10 color models. On the benchmark side, 128 color sequences are annotated with ground-truth and various challenge factors. The dataset thus systematically combines various color models with state-of-the-art grayscale trackers and studies their performance, forming a large, annotated color tracking benchmark.

In addition, TColor-128 performs color tracking evaluation on combinations of different color models and visual trackers. The success rate and precision are used as evaluation criteria: the success rate uses the area under the curve (AUC), precision uses the center location error (CLE), and the precision threshold is also set to 20 pixels. The results of the 160 color-encoded trackers and recently proposed color trackers on TColor-128 show that some color models (such as HSV and LAB) are generally more effective at improving tracking performance, and that color information helps most when the target deforms or rotates.

Like OTB, TColor-128 also contains 11 sub-attributes. In particular, fast motion in TColor-128 refers to target motion greater than 20 pixels, and low resolution refers to a ground-truth bounding box of fewer than 400 pixels. Figure 3 shows a subsequence of the TColor-128 dataset.

Fig. 3

TColor-128 partial subsequences and the challenge factors involved. The red sequence is a new sequence, and the blue sequence is the original sequence

UAV123 dataset

The UAV123 [13] dataset is a set of video sequences for drone tracking proposed in 2016. It contains 123 annotated high-resolution aerial video sequences totaling more than 110K frames, divided into three subsets. Set 1 contains 103 sequences captured with off-the-shelf professional-grade drones following different objects at altitudes of 5–25 m. The frame rate of the raw video is 30–96 FPS with resolutions between 720p and 4K; all sequences are made available at 720p and 30 FPS with upright bounding box annotations. Annotation is done manually at 10 FPS and then linearly interpolated to 30 FPS.

Set 2 contains 12 sequences. Due to limited transmission bandwidth, these sequences have lower resolution and contain a reasonable amount of noise; they are annotated in the same way. Set 3 contains eight synthetic sequences captured by a UAV simulator proposed with the dataset. From the perspective of a flying drone, the target moves along a predetermined trajectory in different worlds rendered by the Unreal 4 game engine. Annotations are done automatically at 30 FPS, and full object masks are also available. UAV123 can evaluate current advanced trackers using multiple metrics, and by annotating various attributes of the video sequences, a tracker's errors can also be evaluated for specific situations. Trackers are further evaluated with a high-fidelity visual tracking simulator. The combination of the simulator and the extensive benchmark provides a more comprehensive assessment toolbox for modern trackers and opens new avenues for experimentation and analysis. Figure 4 below shows a partial subsequence of the UAV123 dataset.

Fig. 4

The first frame of selected sequences from UAV123 dataset. The red bounding box indicates the ground-truth annotation

These sequences contain common visual tracking challenges, including aspect ratio change, background clutter, camera motion, fast motion, full occlusion, illumination variation, low resolution, out-of-view, partial occlusion, similar object, scale variation and viewpoint change.

LASOT dataset

With the rapid development of object tracking technology, the size and attributes of datasets have gradually increased. A typical dataset contains complete ground-truth and attribute annotations as well as different evaluation criteria, and different fields have corresponding datasets. In addition to the basic datasets above, the latest LASOT [14] dataset consists of 1400 sequences with more than 3.5 million frames, and the average sequence length is more than 2500 frames. LASOT is so far the largest tracking benchmark with high-quality annotations; it is designed for training deep trackers and evaluating long-term tracking performance. The development of these datasets makes further progress in object tracking possible.

Object tracking

Object tracking is an important component of computer vision and can be divided into single object tracking and multi-object tracking. Single object tracking locates a calibrated target in subsequent frames based on the information given in the first frame. Among non-deep methods there are two families: generative methods and discriminative methods. The commonly used discriminative method consists of five components: motion model, feature extraction, observation model, model update, and integration method. Figure 5 shows the general flow of object tracking.

Fig. 5

The flowchart of object tracking. First, the image to be processed is input; then the motion model is established and feature extraction is performed; finally, the result is output according to the prediction

First, the image to be processed is input and a motion model is selected. The motion model generates candidate samples that may contain the target, and the speed and quality of sample generation affect the speed and effectiveness of the entire tracker. Motion models typically include particle filtering and sliding windows. Particle filtering represents probabilities with particle sets and can be used with various state space models: it approximates the probability density function by a set of random samples propagating in the state space and replaces the integral operation with the sample mean. The sliding window approach generates a series of candidate samples around the calibrated target box, for example through the circulant matrix method.

Next, feature extraction is performed. Feature extraction finds features in the candidate region that uniquely identify the target. In object tracking, feature quality has the most direct impact on tracking results, and scholars have long studied the effects of different features, and of feature fusion, on tracking. Features generally include hand-crafted features and deep features. Hand-crafted features are obtained from information such as image shape, geometric attributes, and statistical histograms, and include Haar features, HOG features, LBP features, and color features. Deep features [15] are learned from large training sets and are generally more robust than hand-crafted features.

The observation model determines whether the current sample matches the tracked object. Generative and discriminative models are distinguished by their observation models. A generative model tracks by searching the current area for the candidate most similar to the target; this is a template matching process, in which sparse representation is the most commonly used technique. A discriminative model, in contrast, learns a discriminator and uses it to judge whether the current sample is the target.

The template update is designed to prevent the model from drifting when the appearance of the target changes. Template update can be considered in terms of strategy and frequency. In terms of frequency, the template may be updated every frame or every several frames: updating every frame may introduce unwanted noise and degrade tracking, while updating at intervals may speed up tracking. There are a variety of template strategies, and an update strategy combining different methods can be adopted.

The integration method selects the final object state. The final tracking result can directly take the candidate with the highest confidence, or combine multiple predictions. Feature extraction is critical during tracking: a sufficiently robust feature can handle most tracking challenges, which is why most scholars concentrate on features. How to choose an effective combination of tracking components is also a key issue in visual tracking. A schematic loop combining these components is sketched below.
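To make the pipeline concrete, the sketch below wires the five components into a minimal tracker: a sliding-window motion model, raw grayscale patches as stand-in features, normalized cross-correlation as the observation model, argmax integration, and a linear-interpolation template update. All of it is schematic under these assumptions; it is not an implementation of any specific tracker.

```python
import numpy as np

def extract_features(frame, box):
    """Feature extraction: here just the raw grayscale patch; hand-crafted
    features such as HOG would be substituted in a real tracker."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w].astype(float)

def sample_candidates(box, frame_shape, radius=8, step=4):
    """Motion model: sliding-window boxes shifted around the last position."""
    x, y, w, h = box
    H, W = frame_shape
    return [(min(max(x + dx, 0), W - w), min(max(y + dy, 0), H - h), w, h)
            for dx in range(-radius, radius + 1, step)
            for dy in range(-radius, radius + 1, step)]

def track(frames, init_box, eta=0.1):
    """Schematic loop: motion model -> features -> observation model (NCC)
    -> integration (argmax) -> linear-interpolation template update."""
    template = extract_features(frames[0], init_box)
    box, results = init_box, [init_box]
    for frame in frames[1:]:
        candidates = sample_candidates(box, frame.shape[:2])
        scores = []
        for c in candidates:
            patch = extract_features(frame, c)
            # observation model: normalized cross-correlation with the template
            a, b = patch - patch.mean(), template - template.mean()
            scores.append(float((a * b).sum())
                          / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        box = candidates[int(np.argmax(scores))]   # integration: best confidence
        template = (1 - eta) * template + eta * extract_features(frame, box)
        results.append(box)
    return results
```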

The correlation filter tracking algorithm

In 2010, the correlation filter method was applied to object tracking for the first time. After nearly a decade of development, correlation filter tracking algorithms have now matured. In this section, we introduce the development of correlation filter algorithms. The specific development process is as follows.

By learning from gray images, the minimum output sum of squared error (MOSSE) [16] filter applied the correlation filter to the tracking field for the first time. This filter is cheap to compute and can track objects quickly, but it does not guarantee accurate tracking when the object's appearance changes. Henriques et al. [17] then proposed circulant structure tracking with kernels (CSK) in 2012. The kernelized correlation filter (KCF) [18] subsequently extended single-channel features to multi-channel features, and color name (CN) features, introduced by Danelljan et al. in 2014, improved the filter's discriminative ability. However, the adaptability of these filters to rotation, out-of-view motion, and fast motion still needed improvement. Subsequently, Danelljan et al. [19] proposed the discriminative scale space tracker (DSST), which uses a feature pyramid to solve the multi-scale variation problem, and also proposed the improved fDSST algorithm [20]. With the rapid rise of deep learning, the C-COT algorithm [21] combined correlation filtering with CNNs, effectively representing spatial position information with CNN features, and won the VOT2016 competition. The CSR-DCF algorithm [22] introduced channel and spatial reliability into the correlation filter, which improved robustness.

MOSSE algorithm

The MOSSE algorithm [16] introduced correlation filter technology into the visual tracking field. It adapts to occlusion and rotation and achieves an impressive tracking speed of 669 fps, about 26 times faster than the advanced MIL algorithm. The MOSSE filter is trained on the first frame and is robust to illumination, scale, and pose variation. When the target is occluded, the algorithm judges the tracking status from the PSR value and updates the filter parameters accordingly; when the object reappears, it can be tracked again.

In the MOSSE algorithm, to create a fast tracker, the fast Fourier transform (FFT) is used to compute the correlation in the Fourier domain. First, the 2D Fourier transforms of the input image \((F=F\left(f\right))\) and the filter \((H=F\left(h\right))\) are calculated. The convolution theorem states that correlation becomes element-wise multiplication in the Fourier domain. With the symbol ⊙ denoting element-wise multiplication and * the complex conjugate, correlation is written as:

$$ g = f \otimes h, $$
(1)

where \(g\), \(f\) and \(h\) represent the response output, the input image and the filter template, respectively. It can be seen that we only need to determine the filter template \(h\) to get the response output. Applying the FFT to Eq. (1) turns the convolution into an element-wise multiplication, which greatly reduces the amount of calculation. That is, the above formula becomes:

$$ F\left( g \right) = F\left( {f \otimes h} \right) = F\left( f \right) \cdot F\left( h \right)^{*} . $$
(2)

The above formula is abbreviated as \(G=F\odot {H}^{\mathrm{*}}\), so the tracking task reduces to finding the filter template \({H}^{\mathrm{*}}\): \({H}^{\mathrm{*}}=\frac{G}{F}\).

In actual tracking, we need to consider factors such as changes in the object's appearance. Meanwhile, using \(m\) images of the object as references significantly improves the robustness of the filter template. The MOSSE model is formulated as:

$${\min }_{{H}^{\mathrm{*}}}\sum_{i=1}^{m}{\left|{H}^{\mathrm{*}}\odot {F}_{i}-{G}_{i}\right|}^{2}, $$
(3)

after a series of transformations, a closed-form solution is obtained:

$${H}^{\mathrm{*}}=\frac{\sum_{i}{G}_{i}\bullet {F}_{i}^{\mathrm{*}}}{\sum_{i}{F}_{i}\bullet {F}_{i}^{\mathrm{*}}} .$$
(4)

The algorithm tracks the object by correlating the filter with the search window in the next frame; the new position of the object corresponds to the maximum of the correlation output. An online update is then performed at the new location using the following formulas:

$${H}_{i}^{\mathrm{*}}=\frac{{A}_{i}}{{B}_{i}}.$$
(5)
$${A}_{i}=\eta {G}_{i}\odot {F}_{i}^{\mathrm{*}}+\left(1-\eta \right){A}_{i-1}.$$
(6a)
$${B}_{i}=\eta {F}_{i}\odot {F}_{i}^{\mathrm{*}}+\left(1-\eta \right){B}_{i-1}.$$
(6b)

Finally, the PSR value is used for failure detection:

$$\mathrm{P}\mathrm{S}\mathrm{R}=\frac{\mathrm{p}\mathrm{e}\mathrm{a}\mathrm{k}-\mu }{\sigma }.$$
(7)

In the experiments, a PSR value between 20 and 60 indicates good tracking. When the PSR drops below 7, tracking is judged to have failed and the template is not updated.
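A compact numpy sketch of the MOSSE pipeline described by Eqs. (4)–(7) follows, assuming single-channel, pre-windowed patches; the log preprocessing and affine training perturbations of the original paper are omitted.

```python
import numpy as np

def train_mosse(patches, target):
    """Eq. (4): accumulate A = sum G.conj(F) and B = sum F.conj(F); H* = A/B."""
    G = np.fft.fft2(target)  # target is a Gaussian-shaped desired response
    A = np.zeros_like(G)
    B = np.zeros_like(G)
    for p in patches:
        F = np.fft.fft2(p)
        A += G * np.conj(F)
        B += F * np.conj(F)
    return A, B

def update_mosse(A, B, patch, target, eta=0.125):
    """Eqs. (6a)-(6b): running average of the numerator and denominator."""
    F, G = np.fft.fft2(patch), np.fft.fft2(target)
    return (eta * G * np.conj(F) + (1 - eta) * A,
            eta * F * np.conj(F) + (1 - eta) * B)

def respond(A, B, patch):
    """Correlate H* = A/B with a new patch and return the spatial response."""
    return np.real(np.fft.ifft2((A / (B + 1e-8)) * np.fft.fft2(patch)))

def psr(response, win=11):
    """Eq. (7): (peak - sidelobe mean) / sidelobe std; the sidelobe excludes
    a win x win window around the peak."""
    py, px = np.unravel_index(response.argmax(), response.shape)
    mask = np.ones(response.shape, dtype=bool)
    mask[max(0, py - win // 2):py + win // 2 + 1,
         max(0, px - win // 2):px + win // 2 + 1] = False
    side = response[mask]
    return (response.max() - side.mean()) / (side.std() + 1e-8)
```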

Overall, the MOSSE algorithm can adapt to small scale variation but not to large scale variation. In addition, its grayscale features are not sufficiently powerful or expressive, and its sparse sampling limits the training effect.

CSK algorithm

Unlike MOSSE, which uses sparse sampling, the CSK algorithm [17] uses dense sampling. Dense sampling creates a computational burden, so CSK exploits the nature of the circulant matrix and the fast Fourier transform to speed up the algorithm. In addition, CSK introduces kernel techniques on top of MOSSE to improve accuracy; a Gaussian kernel is used to compute the correlation between two adjacent frames. Specifically, the CSK linear classifier solves the correlation filter expression:

$$\underset{w,b}{\mathrm{min}}\sum_{i}^{n}L\left({y}_{i},f\left({x}_{i}\right)\right)+\lambda {\Vert w\Vert }^{2},$$
(8)

where \(n\) is the number of samples after dense sampling and \(w\) corresponds to the correlation filter \(H\) in MOSSE. The problem is solved by ridge regression, where \(L\) is the least-squares loss \(L\left({y}_{i},f\left({x}_{i}\right)\right)={({y}_{i}-f\left({x}_{i}\right))}^{2}\) and \(f\left({x}_{i}\right)=<w,{x}_{i}>+b\) approximates the ideal Gaussian response. \(f\left({x}_{i}\right)\) is the dot product of the image \({x}_{i}\) and the filter \(w\) in the frequency domain, where \(<,>\) denotes the dot product, equivalent to ⨀. Therefore, \(L\left({y}_{i},f\left({x}_{i}\right)\right)\) corresponds to \(\left| {H^{*} \odot F_{i} - G_{i} } \right|^{2}\) in MOSSE; the CSK formulation simply adds the regularization term \(\lambda {\Vert w\Vert }^{2}\) to MOSSE to prevent overfitting.

In addition, to classify samples efficiently in the high-dimensional feature space, a kernel function is used in CSK. Let \({\varnothing }(x)\) denote the feature map and \(K\left(x,{x}^{\prime}\right)=<\varnothing \left(x\right),\varnothing ({x}^{\prime})>\) its kernel function; by the representer theorem of ridge regression, \(w=\sum_{j}^{n}{\alpha }_{j}\varphi ({x}_{j})\). After a series of derivations, we get \(\alpha \):

$$\alpha ={(K+\lambda I)}^{-1}y.$$
(9)

However, the target size in this algorithm is fixed, so its robustness to scale variation is poor. Next, the nature of the circulant matrix is introduced.

KCF algorithm

The KCF [18] is a classic traditional discriminative method. This family of algorithms learns a filter from a series of training samples. Like CSK, KCF generates samples by cyclic shifts. Writing one-dimensional data as \(x=[{x}_{1},{x}_{2},\dots ,{x}_{n}]\), one cyclic shift of \(x\) is denoted \({P}x=[{x}_{n},{x}_{1},\dots ,{x}_{n-1}]\). All cyclically shifted samples form the circulant matrix:

$$X=C\left(x\right)=\left[\begin{array}{cccc}{x}_{1}& {x}_{2}& \cdots & {x}_{n}\\ {x}_{n}& {x}_{1}& \cdots & {x}_{n-1}\\ \vdots & \vdots & \ddots & \vdots \\ {x}_{2}& {x}_{3}& \cdots & {x}_{1}\end{array}\right].$$
(10)

That is, an \((M\times N)\) image block \(x\) is used to train a filter \(f\left(x\right)=\langle \omega ,{\phi }_{x}\rangle \), with training samples generated by cyclic shifts of \(x\). The training samples include all cyclic shifts \({P}_{i}\), where \(i\in \{0,\dots ,M-1\}\times \{0,\dots ,N-1\}\). Each \({P}_{i}\) is assigned a score \({y}_{i}\in [0,1]\) generated by a Gaussian function of the shift distance. Minimizing the regression error, the classifier is trained as:

$$w=\mathrm{a}\mathrm{r}\mathrm{g}\underset{w}{\mathrm{min}}\sum_{i}{(\langle w,\phi \left({x}_{i}\right)\rangle -{y}_{i})}^{2}+\lambda {\Vert w\Vert }^{2}.$$
(11)

Here, \(\phi (x)\) is the mapping to Fourier space and \(\lambda \ge 0\) is the regularization parameter controlling model complexity. The periodic hypothesis enables efficient training and detection via the fast Fourier transform. Using the translation invariance of the kernel function and the special structure of the circulant matrix, \(\alpha \) is obtained quickly as \(\widehat{\alpha }=\frac{\widehat{y}}{{\widehat{k}}^{xx}+\lambda }\). In the detection step, an \(m\times n\) candidate image block \(z\) from the search space is evaluated by:

$$ f\left( z \right) = { }{\mathcal{F}}^{ - 1} \left( {\hat{k}^{xz} \odot \hat{\alpha }} \right), $$
(12)

where \(f\left(z\right)\) is the filter response over all cyclic shifts of z, and the highest response locates the object in the current frame. By exploiting the circulant matrix on the candidate window, KCF generates candidate samples far faster than traditional window sampling, and the Fourier transform converts the time-domain ridge regression into a cross-correlation in the frequency domain. KCF also replaces the single-channel grayscale feature with multi-channel HOG features. Due to the cyclic shifts, however, KCF suffers from the boundary effect, and its fixed search area is easily exceeded under fast motion. Figure 6 illustrates the effect of the cyclic shift. Later algorithms such as DSST improved on KCF's fixed scale.
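The training and detection steps can be sketched for the single-channel Gaussian-kernel case as follows; regression targets, cosine windowing, and the multi-channel HOG features of the full KCF are omitted, so this is an illustrative reduction, not the reference implementation.

```python
import numpy as np

def gaussian_correlation(x, z, sigma=0.5):
    """Kernel correlation k^{xz} of two equal-size patches via the FFT."""
    c = np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(z))))
    d = (x ** 2).sum() + (z ** 2).sum() - 2 * c  # squared distance per shift
    return np.exp(-np.clip(d, 0, None) / (sigma ** 2 * x.size))

def train(x, y, sigma=0.5, lam=1e-4):
    """alpha_hat = y_hat / (k_hat^{xx} + lambda); y is the Gaussian target."""
    k = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alpha_hat, x, z, sigma=0.5):
    """Eq. (12): response = ifft2(k_hat^{xz} . alpha_hat); argmax is the shift."""
    k = gaussian_correlation(z, x, sigma)
    response = np.real(np.fft.ifft2(np.fft.fft2(k) * alpha_hat))
    return np.unravel_index(response.argmax(), response.shape)
```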

Fig. 6

The principle and effect diagram of cyclic shift. The upper left corner is the effect of the original sample moving to the left and up

DSST and fDSST algorithms

Robust scale estimation is a challenging issue in visual tracking, and most existing methods cannot handle scale variation in complex image sequences. The DSST algorithm [19] therefore proposed scale search and object estimation based on a one-dimensional independent correlation filter. Specifically, in a new frame, a two-dimensional position correlation filter first determines the new candidate position of the target; a one-dimensional scale correlation filter then evaluates candidate patches of different scales centered on that position to find the best-matching scale. The scale filter is learned from a scale pyramid representation, and this scale estimation method can be added to any tracking algorithm that lacks scale handling. The loss function of DSST is:

$$\varepsilon ={\left\Vert \sum_{l=1}^{d}{h}^{l}\mathrm{*}{f}_{j}-g\right\Vert }^{2}+\lambda \sum_{l=1}^{d}{\Vert {h}^{l}\Vert }^{2}.$$
(13)

Its solution in the frequency domain is:

$${H}_{t}=\frac{\sum_{j=1}^{t}{\bar{G}}_{j}{F}_{j}}{\sum_{j=1}^{t}{\bar{F}}_{j}{F}_{j}}.$$
(14)

Expanding per feature dimension yields the solution:

$${H}^{l}=\frac{{\bar{G}}{F}^{l}}{\sum_{k=1}^{d}{\overline{{F}^{K}}}{F}^{K}+\lambda }, \quad l=1,\dots d. $$
(15)

Furthermore, the principle of scale selection is:

$${a}^{n}p\times {a}^{n}R, n\in \left\{\left[-\frac{S-1}{2}\right],\dots ,\left[\frac{S-1}{2}\right]\right\},$$
(16)

where \(P\) and \(R\) are the width and height of the object in the previous frame, \(a\) is the scale factor, and \(S\) is the number of scales. DSST uses a scale factor \(a\) of 1.02 and \(S=33\) scales, and scale detection proceeds gradually from fine to coarse.
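Generating the scale set of Eq. (16) is straightforward; a minimal sketch, assuming the paper's values a = 1.02 and S = 33:

```python
import numpy as np

def scale_factors(S=33, a=1.02):
    """Eq. (16): a^n for n in {-(S-1)/2, ..., (S-1)/2}."""
    return a ** (np.arange(S) - (S - 1) / 2)

def scale_sample_sizes(P, R, S=33, a=1.02):
    """Candidate patch sizes a^n P x a^n R around the previous target size."""
    f = scale_factors(S, a)
    return np.stack([np.round(f * P), np.round(f * R)], axis=1).astype(int)
```

For a previous target size of 100 × 100 pixels, the candidate patch sizes then range from roughly 73 to 137 pixels on a side; each candidate is resampled to a common size before the scale filter is applied.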

Finally, the DSST algorithm uses the compressed training samples \({\tilde{F}}_{t}=\fancyscript{f}\left\{{P}_{t}{f}_{t}\right\}\) and the compressed object template \({\tilde{U}}_{t}=\fancyscript{f}\left\{{P}_{t}{u}_{t}\right\}\) to update the filter, resulting in:

$${A}_{t}^{l}=\left(1-\eta \right){A}_{t-1}^{l}+\upeta {\bar{G}}_{t}{F}_{t}^{l}$$
(17a)
$${B}_{t}=\left(1-\eta \right){B}_{t-1}+\upeta \sum_{k=1}^{d}{\overline{{F}_{t}^{k}}}{F}_{t}^{k}.$$
(17b)

Here, \(\eta \) is a learning rate parameter. The correlation score \({Y}_{t}\) over a rectangular area \(z\) of the feature map is calculated with the following formula, and the new object state is found by maximizing the score \(y\).

$${Y}_{t}=\frac{{\sum }_{l=1}^{d}{\overline{{A}_{t-1}^{l}}}{Z}_{t}^{l}}{{B}_{t-1}+\lambda }$$
(18)

The correlation score at each location is then calculated by the inverse DFT \({y}_{t}={\fancyscript{f}}^{-1}\left\{{Y}_{t}\right\}\), and the current target state is estimated by finding the maximum correlation score.

DSST's 33 scale estimates increase the computational burden, so the fDSST algorithm [20] accelerates DSST using feature dimension reduction and interpolation. The search box of fDSST is also larger, improving tracking accuracy. PCA reduces the feature dimensionality of the position filter, and for the scale filter a QR decomposition reduces the 1000 × 17 problem to 17 × 17 almost losslessly. The resulting 17 scale responses are then supplemented to 33 by triangular interpolation. This acceleration strategy greatly increases the speed of the algorithm, and the time saved is spent expanding the search domain to improve robustness. Unlike DSST, the response value of fDSST is calculated as

$$\mathrm{y}={\fancyscript{f}}^{-1}\left\{\frac{{\sum }_{l=1}^{d}{\overline{{A}^{l}}}{Z}^{l}}{B+\lambda }\right\} .$$
(19)

To train the filter, the feature map \(f\) of the patch is extracted. The position in the new frame is then estimated by extracting the feature map z at the predicted target position, after which the correlation score is calculated and the model updated. The proposed adaptive scale method also lets the algorithm adapt to scale variation. Figure 7 shows the principle of the DSST algorithm under scale variation and the tracking effect after the scale changes.

Fig. 7

DSST algorithm tracking schematic with scale variation

SRDCF algorithm

To overcome the boundary effect in correlation filtering, the SRDCF algorithm [23] adds a spatial regularization penalty term. SRDCF samples several scales to handle scale variation, and solves the correlation filter online with the iterative Gauss–Seidel method.

In the original DCF, the online training method is as follows:

$${\varepsilon }_{t}\left(f\right)={\sum_{k=1}^{t}{\alpha }_{k}\Vert {S}_{f}\left({x}_{k}\right)-{y}_{k}\Vert ^{2}}+\lambda \sum_{l=1}^{d}{\Vert {f}^{l}\Vert }^{2}.$$
(20)

SRDCF instead adds a regularization term, i.e., a spatial penalty term w:

$$\varepsilon \left(f\right)={\sum_{k=1}^{t}{\alpha }_{k}\Vert {S}_{f}\left({x}_{k}\right)-{y}_{k}\Vert ^{2}}+\sum_{l=1}^{d}{\Vert w\bullet {f}^{l}\Vert }^{2}, $$
(21)

where \(f\) is the filter template, \(l\) indexes the channel, and \(w\) is the regularization coefficient matrix. Background information is thus suppressed, and the filter can pay more attention to the object. Transforming to the frequency domain gives:

$$ \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\smile}$}}{\varepsilon } \left( {\hat{f}} \right) = \mathop \sum \limits_{k = 1}^{t} \alpha_{k} \left\Vert\mathop \sum \limits_{l = 1}^{d} \hat{x}_{k}^{l} \cdot \hat{f}^{l} - \hat{y}_{k} \right\Vert^{2}+ \mathop \sum \limits_{l = 1}^{d} \left\Vert \frac{{\hat{w}}}{MN}*\hat{f}^{l} \right\Vert ^{2}. $$
(22)

For this equation, the smoothed response map is calculated using the FFT and the properties of the circulant matrix. Parseval's theorem transforms the objective function into the frequency domain, and the parameters are vectorized. To simplify the solution, it is rewritten as:

$$ \overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\smile}$}}{\varepsilon } \left( {\hat{f}} \right) = \mathop \sum \limits_{k = 1}^{t} \alpha_{k} \left\Vert \mathop \sum \limits_{l = 1}^{d} D\left( {\hat{x}_{k}^{l} } \right)\hat{f}^{l} - \hat{y}_{k} \right\Vert ^{2} + \mathop \sum \limits_{l = 1}^{d} \left\Vert \frac{{C\left( {\hat{w}} \right)}}{MN}\hat{f}^{l} \right\Vert ^{2} , $$
(23)

where \(D(\cdot )\) is the diagonalization operator, \(C(\cdot )\) the circulant operator, and \(k\) indexes the samples. The circulant operation removes the convolution symbol, and the problem ultimately reduces to solving the linear equations:

$${A}_{t}{\tilde{f}}={\tilde{b}}_{t}.$$
(24)

Among them:

$${A}_{t}=\sum_{k=1}^{t}{\alpha }_{k}{D}_{k}^{T}{D}_{k}+{W}^{T}W.$$
(25a)
$${\tilde{b}}_{t}=\sum_{k=1}^{t}{\alpha }_{k}{D}_{k}^{T}{\tilde{y}}_{k}. $$
(25b)

The Gauss–Seidel method is used to solve the system efficiently. During tracking, starting from the ground-truth of the first frame, training proceeds iteratively:

$${A}_{t}=(1-\gamma ){{A}_{t-1}+\gamma ({D}_{t}^{T}{D}_{t}+W}^{T}W).$$
(26a)
$${\tilde{b}}_{t}=\left(1-\gamma \right){\tilde{b}}_{t-1}+\gamma {D}_{t}^{T}{\tilde{y}}_{t}.$$
(26b)

Updating A and b in this way reduces the amount of calculation. Scale detection uses the SAMF pyramid method, and downsampling speeds up the computation. Finally, the obtained responses are interpolated to find the best scale, and the Newton iteration method locates the maximum response point. After adding the regularization coefficient matrix, the response at the background is clearly suppressed, which makes it possible to expand the search domain. Figure 8 below shows the effect of the SRDCF algorithm after adding the regularization constraint.

Fig. 8

The effect of the SRDCF algorithm
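The spatial penalty w of Eq. (21) can be illustrated with a quadratic weight that is small over the target and grows toward the borders; the parametrization below (w_min, alpha) is an assumption for illustration and may differ from the exact function used in [23].

```python
import numpy as np

def spatial_weight(rows, cols, target_h, target_w, w_min=0.1, alpha=3.0):
    """Quadratic penalty map w: small over the target region, large toward
    the borders, so filter energy on the background is penalized (Eq. (21))."""
    y = np.arange(rows) - (rows - 1) / 2
    x = np.arange(cols) - (cols - 1) / 2
    yy, xx = np.meshgrid(y, x, indexing="ij")
    return w_min + alpha * ((xx / (target_w / 2)) ** 2
                            + (yy / (target_h / 2)) ** 2)
```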

SRDCF learns its model over multiple training images, which limits efficiency. STRCF introduces temporal regularization into single-sample SRDCF and uses the alternating direction method of multipliers (ADMM) so that each STRCF sub-problem has a closed-form solution; with hand-crafted features it achieves a 5× speedup while further alleviating the boundary effect. The SRDCFDecon algorithm [24] improves the samples and learning rate of SRDCF: to overcome sample drift caused by contaminated training samples, it saves historical samples and adds sample weight parameters and regularization terms to the optimization objective.

STAPLE algorithm

Bertinetto et al. [25] found that earlier models rely on the spatial structure of the tracked object, which is not robust to deformation, whereas color features track well under deformation and motion blur but degrade under illumination change; HOG features, conversely, handle illumination variation well. The STAPLE algorithm therefore fuses HOG and color features and reaches a very good speed of about 80 fps, with a tracking effect better than most existing trackers. The STAPLE score is a linear combination of the template score and the color histogram score:

$$f\left(x\right)={\gamma }_{\mathrm{t}\mathrm{e}\mathrm{m}\mathrm{p}1}{f}_{\mathrm{t}\mathrm{e}\mathrm{m}\mathrm{p}1}\left(x\right)+{\gamma }_{\mathrm{h}\mathrm{i}\mathrm{s}\mathrm{t}}{f}_{\mathrm{h}\mathrm{i}\mathrm{s}\mathrm{t}}\left(x\right).$$
(27)

The template score is a linear function of the K-channel feature image\(\phi_{x} :\tau \to R^{K}\), obtained from x and defined on the finite grid \(\tau \subset {Z}^{2}\):

$${f}_{\mathrm{t}\mathrm{e}\mathrm{m}\mathrm{p}1}\left(x;h\right)=\sum_{u\subset \tau }{h\left[u\right]}^{\mathrm{T}}{\phi }_{x}\left[u\right].$$
(28)

Among them, the template \(h\) is another K-channel image. The histogram score is calculated from the M channel feature image \({\psi }_{x}:\mathcal{H}\to {R}^{M}\), and x is obtained and defined in the finite mesh \(\mathcal{H}\subset {\mathrm{Z}}^{2}\):

$${f}_{\mathrm{h}\mathrm{i}\mathrm{s}\mathrm{t}}\left(x;\beta \right)=g\left({\psi }_{x};\beta \right).$$
(29)
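The fusion of Eq. (27) is a per-pixel linear combination of the two response maps; a minimal sketch, assuming both maps are computed on the same grid and using illustrative merge weights, not necessarily those of [25]:

```python
import numpy as np

def staple_response(f_tmpl, f_hist, gamma_tmpl=0.7, gamma_hist=0.3):
    """Eq. (27): f = gamma_tmpl * f_tmpl + gamma_hist * f_hist."""
    return gamma_tmpl * f_tmpl + gamma_hist * f_hist

def locate(f_tmpl, f_hist):
    """The new target position is the argmax of the merged response map."""
    merged = staple_response(f_tmpl, f_hist)
    return np.unravel_index(merged.argmax(), merged.shape)
```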

LMCF algorithm

The LMCF algorithm [26] was published at CVPR 2017. Since the structured SVM is more discriminative than the traditional SVM, the authors combine the structured SVM with the correlation filter. During tracking, when similar distractors surround the target, the response map usually has multiple peaks, and the highest peak may correspond to a distractor, causing misjudgment. LMCF therefore uses multi-peak detection to overcome interference from similar objects.

In addition, LMCF improved the KCF algorithm from the perspective of model update for the first time: the model update is gated by the current tracking quality, improving accuracy. In the traditional structured SVM (Struck) algorithm, the score is output directly by the online SVM, but sparse sampling makes the algorithm inefficient. LMCF therefore replaces sparse sampling with the circulant matrix to speed up the structured SVM with CF. The average peak-to-correlation energy (APCE) is introduced for multi-peak judgment:

$$\mathrm{A}\mathrm{P}\mathrm{C}\mathrm{E}=\frac{{\left|{F}_{\mathrm{m}\mathrm{a}\mathrm{x}}-{F}_{\mathrm{m}\mathrm{i}\mathrm{n}}\right|}^{2}}{\mathrm{m}\mathrm{e}\mathrm{a}\mathrm{n}\left({\sum_{w,h}\left({F}_{w,h}-{F}_{\mathrm{m}\mathrm{i}\mathrm{n}}\right)}^{2}\right)},$$
(30)

where \({F}_{\mathrm{m}\mathrm{a}\mathrm{x}}\), \({F}_{\mathrm{m}\mathrm{i}\mathrm{n}}\) and \({F}_{w,h}\) denote the highest response, the lowest response, and the response at position \((w,h)\), respectively. The APCE reflects the degree of oscillation of the response map; the target motion state can be judged from its value, which in turn determines whether the template is updated. If the APCE suddenly decreases, the target is likely occluded or lost, and the model is not updated, avoiding model drift. When both the APCE and \({F}_{\mathrm{m}\mathrm{a}\mathrm{x}}\) exceed their historical means by a certain ratio, the model is updated. This reduces model drift and the number of updates, and also speeds up the algorithm.
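Eq. (30) and the gated update can be sketched as follows; the ratio threshold is an illustrative assumption, not the value used in [26].

```python
import numpy as np

def apce(response):
    """Eq. (30): (F_max - F_min)^2 / mean((F - F_min)^2)."""
    f_min = response.min()
    return (response.max() - f_min) ** 2 / np.mean((response - f_min) ** 2)

def should_update(response, apce_history, fmax_history, ratio=0.6):
    """Update only when both APCE and F_max exceed a fraction of their
    historical means; a sudden APCE drop (e.g. occlusion) blocks the update."""
    return (apce(response) > ratio * np.mean(apce_history)
            and response.max() > ratio * np.mean(fmax_history))
```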

BACF algorithm

In traditional correlation filtering, the training samples generated by the circulant matrix cause the boundary effect. The BACF algorithm [27] first enlarges the object search area and then improves the quality of the generated samples. The traditional objective can be expressed as

$$E\left(h\right)=\frac{1}{2}\sum_{j=1}^{T}\left\Vert y\left(j\right)-\sum_{k=1}^{K}{h}_{k}^{T}{x}_{k}\left[\Delta {\mathcal{T}}_{j}\right]\right\Vert_{2}^{2}+\frac{\lambda }{2}\sum_{k=1}^{K}{\Vert {h}_{k}\Vert }_{2}^{2} .$$
(31)

The BACF algorithm instead adds the cropping matrix P to the objective:

$$E\left(h\right)=\frac{1}{2}\sum_{j=1}^{T}\left\Vert y\left(j\right)-\sum_{k=1}^{K}{h}_{k}^{T}P{x}_{k}\left[\Delta {\mathcal{T}}_{j}\right]\right\Vert_{2}^{2}+\frac{\lambda }{2}\sum_{k=1}^{K}{\Vert {h}_{k}\Vert }_{2}^{2} .$$
(32)

The matrix P performs a secondary processing of the cyclic samples: the original, valid parts of each sample are preserved by P, reducing the influence of synthetic samples on tracking. The problem is then reformulated as:

$$E\left(h,\widehat{g}\right)=\frac{1}{2}{\Vert \widehat{y}-\widehat{X}\widehat{g}\Vert }_{2}^{2}+\frac{\lambda }{2}{\Vert h\Vert }_{2}^{2} \quad \mathrm{s}.\mathrm{t}. \, \, \widehat{g}=\sqrt{T}\left(F{P}^{T}\otimes {I}_{K}\right) h.$$
(33)

After transforming to the frequency domain with the FFT, the augmented Lagrangian method (ALM) is applied: the auxiliary variable g is introduced and subjected to the cropping operation, giving:

$$ \begin{aligned} {\mathcal{L}}\left( {\hat{g},h,\hat{\varsigma }} \right) &= \frac{1}{2}\left\Vert\hat{y} - \hat{X}\hat{g}\right\Vert_{2}^{2} + \frac{\lambda }{2}\left\Vert h\right\Vert_{2}^{2} \\ &\quad+ \hat{\varsigma }^{T} \left( {\hat{g} - \sqrt T \left( {FP^{T} \otimes I_{K} } \right)h} \right) \\ &\quad+ \frac{\mu }{2}\left\Vert\hat{g} - \sqrt T \left( {FP^{T} \otimes I_{K} } \right)h\right\Vert_{2}^{2} . \end{aligned} $$
(34)

Then, the ADMM optimization algorithm is used to transform the original problem into two sub-problems that solve the filter h:

$$ h^{*} = \arg \mathop {\min }\limits_{h} \left\{ {\frac{\lambda }{2}\left\Vert h \right\Vert_{2}^{2} + \hat{\varsigma }^{T} \left( {\hat{g} - \sqrt T \left( {FP^{T} \otimes I_{K} } \right)h} \right) + \frac{\mu }{2}\left\Vert \hat{g} - \sqrt T \left( {FP^{T} \otimes I_{K} } \right)h\right\Vert_{2}^{2} } \right\} = \left( {\mu + \frac{\lambda }{\sqrt T }} \right)^{ - 1} \left( {\mu g + \varsigma } \right){, } $$
(35)

and the auxiliary variable \(g\):

$$ \hat{g}^{*} = \arg \mathop {\min }\limits_{{\hat{g}}} \left\{ {\frac{1}{2}\left\Vert\hat{y} - \hat{X}\hat{g}\right\Vert_{2}^{2} + \hat{\varsigma }^{T} \left( {\hat{g} - \sqrt T \left( {FP^{T} \otimes I_{K} } \right)h} \right) + \frac{\mu }{2}\left\Vert\hat{g} - \sqrt T \left( {FP^{T} \otimes I_{K} } \right)h\right\Vert_{2}^{2} } \right\}. $$
(36)

When solving the sub-problem for g, some simplification is required to reach real-time performance, since the direct computation is too expensive. The problem is therefore split into T independent objective functions:

$${\widehat{g}(t)}^{\mathrm{*}}=\mathrm{arg}\underset{\widehat{g\left(t\right)}}{\mathrm{min}}\left\{\begin{array}{c}\frac{1}{2}{\Vert \widehat{y}\left(t\right)-\widehat{X}\left(t\right)\widehat{g}\left(t\right)\Vert }_{2}^{2}+{\widehat{\varsigma }\left(t\right)}^{T}\left(\widehat{g}\left(t\right)-h\left(t\right)\right)\\ +\frac{\mu }{2}{\Vert \widehat{g}\left(t\right)-h\left(t\right)\Vert }_{2}^{2}\end{array}\right\} .$$
(37)
$${\widehat{g}(t)}^{\mathrm{*}}={\left(\widehat{X}\left(t\right){\widehat{X}\left(t\right)}^{T}+T\mu {I}_{K}\right)}^{-1}\left(\widehat{y}\left(t\right)\widehat{X}\left(t\right)-T\widehat{\varsigma }\left(t\right)+T\mu \widehat{h}\left(t\right)\right). $$
(38)

This reduces the complexity of computing \({\widehat{g}}^{\mathrm{*}}\) from \(O({K}^{3}{T}^{3})\) to \(O({K}^{3}T)\). Finally, the Sherman–Morrison formula simplifies the matrix inversion:

$${\widehat{g}(t)}^{\mathrm{*}}= \, \frac{1}{\mu }\left(T\widehat{y}\left(t\right)\widehat{X}\left(t\right)-\widehat{\varsigma }\left(t\right)+\mu \widehat{h}(t)\right)-\frac{\widehat{X}\left(t\right)}{\mu b}\left(T\widehat{y}\left(t\right){\widehat{s}}_{x}\left(t\right)-{\widehat{s}}_{\varsigma }\left(t\right)+\mu {\widehat{s}}_{h}\left(t\right)\right).$$
(39)

Eventually, the complexity is reduced to \(O(KT)\), where T is the dimension of the vectorized image and K is the number of feature channels. The model update uses the traditional CF linear interpolation:

$${\widehat{X}}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}^{(f)}=\left(1-\eta \right){\widehat{X}}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}^{(f-1)}+\eta {\widehat{X}}^{f} .$$
(40)

Figure 9 below compares the sample generation of the DCF algorithm with the BACF processing that improves sample quality. The CS operation denotes the cyclic shift, and crop denotes cutting with the P matrix.

Fig. 9

Sample processing of the BACF algorithm
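The effect of the crop matrix P can be shown directly: cyclic shifts of a large search region are masked so that only the real, target-sized support of each sample remains. This is a schematic of the idea, not the paper's implementation.

```python
import numpy as np

def crop_mask(region_h, region_w, target_h, target_w):
    """Binary mask playing the role of P: ones over the target-sized center."""
    mask = np.zeros((region_h, region_w))
    top = (region_h - target_h) // 2
    left = (region_w - target_w) // 2
    mask[top:top + target_h, left:left + target_w] = 1.0
    return mask

def cropped_shift_sample(region, dy, dx, mask):
    """One BACF training sample: cyclically shift the large region (CS),
    then crop with P so the sample contains only real pixels."""
    return np.roll(np.roll(region, dy, axis=0), dx, axis=1) * mask
```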

DRT algorithm

Existing CF methods usually focus on the discriminative power of the filter, while paying less attention to reliability. The trained filter may then be dominated by unexpectedly highlighted areas of the feature map, degrading the model. To solve this problem, Sun et al. [28] proposed a new CF-based optimization problem that jointly models discrimination and reliability. First, the filter is factored into the element-wise product of a base filter and a reliability term: the base filter learns to discriminate the target from the background, while the reliability term encourages the final filter to focus on more reliable areas. Second, a local response consistency constraint is introduced to enforce equal contributions from different regions and prevent the tracker from being dominated by unreliable regions. The optimization problem can be solved with an alternating direction method and accelerated in the Fourier domain. In the model construction, the DRT algorithm splits the tracking template w into the element-wise product of the reliability weight map \({V}_{d}\) and the base filter \({h}_{d}\):

$$ W_{d} = h_{d} \odot V_{d} . $$
(41)

The reliability weight map takes values inside the target box and is zero elsewhere. It is further expressed as a weighted sum over nine sub-areas, each sub-weight map focusing on one part of the target area with weight 1 in that part:

$${V}_{d}=\sum\limits_{m=1}^{M}{\beta }_{m}{P}_{d}^{m}, $$
(42)

where \({P}_{d}^{m}\in {R}^{K\times 1}\) is a binary mask, and the weights have upper and lower limits to ensure the stability of the tracker. Thus, a new tracking template is formed.
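Eq. (42) builds the reliability map as a weighted sum of binary sub-region masks; a minimal sketch, assuming a 3 × 3 partition of the target area (M = 9) and caller-supplied weights β:

```python
import numpy as np

def reliability_map(target_h, target_w, betas):
    """Eq. (42): V = sum_m beta_m * P_m, where each binary mask P_m covers
    one cell of a 3x3 partition of the target region (M = 9)."""
    assert len(betas) == 9
    V = np.zeros((target_h, target_w))
    row_blocks = np.array_split(np.arange(target_h), 3)
    col_blocks = np.array_split(np.arange(target_w), 3)
    for m, (rows, cols) in enumerate((r, c) for r in row_blocks
                                            for c in col_blocks):
        V[np.ix_(rows, cols)] += betas[m]
    return V
```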

Subsequently, the basic filter \(h={\left[{h}_{1}^{T},\dots ,{h}_{D}^{T}\right]}^{T}\) and the reliability template are optimized:

$$\left[\dot{h},\dot{\beta }\right]=\mathrm{arg}\, \underset{h,\beta }{\mathrm{min}} \, f\left(h,\beta ; \, X\right)$$
$$\mathrm{s}.\mathrm{t}. \, {\theta }_{\mathrm{m}\mathrm{i}\mathrm{n}}\le {\beta }_{m}\le {\theta }_{\mathrm{m}\mathrm{a}\mathrm{x}}, \quad \forall m,$$
(43)

where the objective function is defined as

$$f\left(h,\beta ; \,X\right)={f}_{1}\left(h,\beta ; \,X\right)+\eta {f}_{2}\left(h; \,X\right)+\gamma {\Vert h\Vert }_{2}^{2}.$$
(44)

Finally, \(\dot{h}\) and \(\dot{\beta }\) are solved by the alternating direction method: \(h\) is updated with conjugate gradient descent, and \(\beta \) by solving a quadratic programming problem. The SAMF method is used for scale detection, with features combining traditional and deep features. The ROI regions at different scales, centered on the estimated position of the last frame, are extracted to obtain the multi-channel feature maps \({X}_{d}^{s}\). Finally, the response at scale \(s\) is calculated:

$${r}_{s}=\sum_{d=1}^{D}{\mathcal{F}}^{-1}\left({\mathcal{F}\left({w}_{d}\right)\odot \left(\mathcal{F}\left({x}_{d}^{s}\right)\right)}^{H}\right) .$$
(45)

The target position and scale are then jointly determined by finding the maximum over the S response maps. This joint estimation strategy performs better than first estimating the target location and then re-estimating the scale at that location. The process comparison between the DRT algorithm and the baseline is shown in Fig. 10.

Fig. 10

Comparison of the tracking process and results of the DRT algorithm and the underlying algorithm. The confidence of blue in the confidence map is the lowest and the confidence of red is the highest

ASRCF algorithm

The SRDCF and BACF algorithms impose additional spatial constraints on the filter coefficients, mitigating boundary effects to some extent. However, these constraints are usually fixed for all objects and cannot fully exploit the diversity of the target. Moreover, object localization and scale estimation are usually performed in the same feature space, requiring multi-scale feature maps during tracking; when the tracker uses powerful, complex features, this strategy significantly increases the computational load and slows tracking. Dai et al. [29] therefore proposed an adaptive spatial regularization correlation filter (ASRCF) model, which effectively estimates an object-aware spatial regularization and obtains more reliable filter coefficients during tracking. ASRCF is a generic CF model, optimized efficiently with ADMM so that each sub-problem has an analytical solution. The method estimates position and scale with two CF models: one uses shallow and deep features for precise localization; the other uses shallow features for fast scale estimation. The objective function is:

$$E\left(H,w\right)=\frac{1}{2}{\left\Vert y-\sum_{k=1}^{K}{x}_{k}\mathrm{*}({P}^{T}{h}_{k})\right\Vert }_{2}^{2}+\frac{{\lambda }_{1}}{2}\sum_{k=1}^{K}{\Vert w\odot {h}_{k}\Vert }_{2}^{2}+\frac{{\lambda }_{2}}{2}{\Vert w-{w}^{r}\Vert }_{2}^{2} ,$$
(46)

where the first term is a ridge regression term that convolves the training data \(X=\left[{X}_{1},{X}_{2},\dots ,{X}_{K}\right]\) with the filter \(H=\left[{h}_{1},{h}_{2},\dots ,{h}_{K}\right]\) to fit the Gaussian-shaped ground-truth response \(y\). The second term introduces adaptive spatial regularization on the filter \(H\), where the spatial weight \(w\) is optimized. The third term makes the adaptive spatial weight \(w\) stay close to a reference weight \({w}^{r}\); this constraint introduces prior information about \(w\) and avoids model degradation. \({\lambda }_{1}\) and \({\lambda }_{2}\) are the regularization parameters of the second and third terms, respectively. Inspired by SRDCF and BACF, the objective function is transformed to the frequency domain and optimized with ADMM, yielding the augmented Lagrangian form:

$$L\left(H,\widehat{G},w,\widehat{V}\right)= E \left(H,\widehat{G},w\right)+\sum_{k=1}^{K}{\widehat{V}}_{k}^{T}({\widehat{g}}_{k}-\sqrt{T}F{P}^{T}{h}_{k})+\frac{\mu }{2}{\sum_{k=1}^{K}\left\Vert {\widehat{g}}_{k}-\sqrt{T}F{P}^{T}{h}_{k}\right\Vert }_{2}^{2} .$$
(47)

The ADMM optimization then splits the original problem into two sub-problems, solving the filter \(h\) and the auxiliary variable \(g\). The sub-problem for h is:

$$ h_{k}^{*} = \arg \mathop {\min }\limits_{{h_{k} }} \left\{ {\frac{{\lambda_{1} }}{2}\left\Vert w \odot h_{k}\right\Vert_{2}^{2} + \frac{\mu }{2}\left\Vert \hat{g}_{k} - \sqrt T FP^{T} h_{k} + \hat{s}_{k}\right\Vert_{2}^{2} } \right\} = \frac{{\mu Tp \odot \left( {s_{k} + g_{k} } \right)}}{{\lambda_{1} \left( {w \odot w} \right) + \mu Tp}}, $$
(48)

and the sub-problem for g is:

$${\widehat{G}}^{\mathrm{*}}=\mathrm{arg} \, \underset{\widehat{G}}{\mathrm{min}}\left\{\frac{1}{2}{\left\Vert \widehat{y}-\sum_{k=1}^{K}{\widehat{X}}_{k}\odot {\widehat{g}}_{k}\right\Vert }_{2}^{2}+\frac{\mu }{2}{\sum_{k=1}^{K}\left\Vert {\widehat{g}}_{k}-\sqrt{T}F{P}^{T}{h}_{k}+{\widehat{s}}_{k}\right\Vert }_{2}^{2}\right\} .$$
(49)

Given \(H\), \(\widehat{G}\) and \(\widehat{S}\), \(w\) has a closed-form solution:

$${w}^{\mathrm{*}}=\mathrm{arg} \, \underset{w}{\mathrm{min}}\left\{\frac{{\lambda }_{1}}{2}\sum_{k=1}^{K}{\Vert {N}_{k}w\Vert }_{2}^{2}+\frac{{\lambda }_{2}}{2}{\Vert w-{w}^{\mathrm{r}}\Vert }_{2}^{2}\right\}=\frac{{\lambda }_{2}{w}^{\mathrm{r}}}{{\lambda }_{1}\sum_{k=1}^{K}{h}_{k}\odot {h}_{k}+{\lambda }_{2}I}.$$
(50)

Finally, the template update method is as follows:

$${\widehat{X}}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}^{\mathrm{n}\mathrm{e}\mathrm{w}}=\left(1-\eta \right){\widehat{X}}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}^{\mathrm{o}\mathrm{l}\mathrm{d}}+\eta {\widehat{X}}^{\mathrm{*}}, $$
(51)

where \({\widehat{X}}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}^{\mathrm{n}\mathrm{e}\mathrm{w}}\) is the updated template, \({\widehat{X}}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}^{\mathrm{o}\mathrm{l}\mathrm{d}}\) is the old template, \({\widehat{X}}^{\mathrm{*}}\) is the current observation, and \(\eta \) is the online learning rate.

In the tracking process, like BACF, ASRCF applies the scale CF model to search areas at five scales and obtains their response maps; the optimal scale is determined from the maximum scores of the five response maps. In each frame, the position is first estimated using the position CF model with complex features, and the scale is then refined by applying the scale CF model on the five-dimensional HOG feature map. Figure 11 below shows the overall framework of the ASRCF algorithm.

Fig. 11

Comparison of the overall process framework of the ASRCF algorithm
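The two-model strategy can be sketched as follows: locate once with the position CF, then score five scaled regions with the scale CF. The scale set and the response callables are illustrative stand-ins for the two trained models, not ASRCF's actual parameters.

```python
import numpy as np

SCALES = (0.96, 0.98, 1.0, 1.02, 1.04)  # illustrative five-scale set

def estimate(frame, pos, size, pos_response, scale_response):
    """Two-model estimation: locate with the (deep-feature) position CF,
    then rescale with the lightweight (HOG) scale CF. `pos_response` and
    `scale_response` are callables returning response maps; they stand in
    for the two trained CF models."""
    r = pos_response(frame, pos, size)
    dy, dx = np.unravel_index(r.argmax(), r.shape)
    pos = (pos[0] + dy - r.shape[0] // 2, pos[1] + dx - r.shape[1] // 2)
    peaks = [scale_response(frame, pos, size, s).max() for s in SCALES]
    s = SCALES[int(np.argmax(peaks))]
    return pos, (size[0] * s, size[1] * s)
```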

Summary of correlation filtering algorithms

The correlation filter algorithm is fast and precise, but it faces the challenges of the boundary effect and scale variation. Each algorithm therefore improves on the template update strategy, the features, or the area detection. For example, Li et al. [30] adopted an efficient phase correlation scheme to handle scale and rotation changes simultaneously in log-polar coordinates, achieving robust estimation of similarity transformations with large displacements. Even with deep learning methods prevalent, correlation filtering algorithms still hold a place in object tracking that cannot be ignored. The overall performance of each algorithm is comprehensively analyzed and compared below: we show the accuracy and success rate of eight correlation filter algorithms, including BACF and SRDCF, on OTB2013.

From Fig. 12, we can intuitively see that the overall accuracy and success rate improve gradually as the basic algorithm is refined. In addition, Fig. 13 shows the tracking results of these eight algorithms on the OTB2013 dataset.

Fig. 12

Accuracy and success rate of the correlation filtering algorithm on OTB2013

Fig. 13

The tracking effect of the correlation filtering algorithms on video sequences. The video sequences from left to right are Jogging 1, Skating 1, Coke, and Singer 1

From Fig. 13, we can see intuitively that the more recent BACF and SRDCFDecon are robust in complex scenarios such as occlusion. Furthermore, most algorithms adapt to scale variation better than the original CSK and KCF algorithms. In addition, many scholars have proposed correlation filter algorithms using deep features. For example, the C-COT algorithm [21] applies deep features to correlation filtering and achieves good tracking results. Building on the SRDCF algorithm, the STRCF algorithm further mitigates the boundary effect. It combines correlation filtering with deep convolutional neural networks, and it is also the first work to unify flow extraction and tracking tasks in a single network. This proves that the correlation filter algorithm still has strong vitality, and further research is needed.

Correlation filtering-based template update judgment

In object tracking, a number of samples are first generated based on the given prior information. These samples then undergo feature extraction, so that the filter can be trained online. Finally, the trained filter is used to track the object. During tracking, the template is generally updated with a single template update strategy. To deal effectively with the various problems and challenges in object tracking, our team proposes solutions that improve the tracking effect through template update strategies. In this section, we detail the three tracking algorithms proposed by our team.

Object tracking strategy with visual attention features structure

Zheng [31] enhanced the ability of the correlation filter algorithm to predict object position and scale by combining visual motion features with spatiotemporal continuity. The merged KCF_VAF algorithm reduces the probability of tracking failures caused by occlusion, scale variation, and motion blur. This method mainly updates the template according to the motion characteristics, namely the scale change rate, velocity, and acceleration. The calculation of these three motion characteristics is introduced below.

(1) Scale change rate: it measures the change of the moving target's own scale between successive frames and is calculated by formulas (52):

$${S}_{w}=\frac{{W}_{o}}{{W}_{\mathrm{p}}}. $$
(52a)
$${S}_{h}=\frac{{H}_{o}}{{H}_{\mathrm{p}}} . $$
(52b)

The rate of scale change is measured by the coefficients of variation of the scale \({S}_{w}\) and \({S}_{h}\). \({S}_{w}\) represents the ratio of the target width \({W}_{o}\) to the frame width \({W}_{p}\), and \({S}_{h}\) represents the ratio of the target height \({H}_{o}\) to the frame height \({H}_{p}\). When the scale of the target changes, its width and height will change, while the width and height of the frame will remain unchanged. Therefore, the ratio of the two can effectively measure the scale change of the target.

(2) Speed: the speed of the moving target can be expressed by the displacement between two adjacent frames. Considering that the scale change of the target also affects the speed, the target speed is calculated using formulas (53)–(55):

$${V}_{x}=\frac{{P}_{x}^{(i)}-{P}_{x}^{(i-1)}}{{S}_{w}} , $$
(53a)
$${V}_{y}=\frac{{P}_{y}^{(i)}-{P}_{y}^{(i-1)}}{{S}_{h}}, $$
(53b)
$$\Vert \overrightarrow{v}\Vert =\sqrt{{V}_{x}^{2}+{V}_{y}^{2}}, $$
(54)
$$\mathrm{tan}\theta =\frac{{V}_{y}}{{V}_{x}},$$
(55)

where \({P}^{(i)}\) and \({P}^{(i-1)}\) represent the central position of the target at times \(i\) and \(i-1\), and \({S}_{w}\) and \({S}_{h}\) are the scale change coefficients. \({V}_{x}\) and \({V}_{y}\) are the object speeds in the horizontal and vertical directions, \(\Vert \overrightarrow{v}\Vert \) is the magnitude of the speed, and the direction of the velocity is expressed by \(\mathrm{tan}\theta \), where \(\theta \) is the angle between the velocity direction and the horizontal direction. The units of velocity in Eqs. (53)–(55) are pixels per frame.

(3) Acceleration: the acceleration of the moving target is the first-order difference of the speed. Its calculation is similar to that of the speed: it can be expressed by the speed difference between two adjacent frames. The target acceleration is calculated by formulas (56)–(58):

$${a}_{x}={V}_{x}^{(i)}- {V}_{x}^{\left(i-1\right)},$$
(56a)
$${a}_{y}={V}_{y}^{(i)}- {V}_{y}^{\left(i-1\right)},$$
(56b)
$$\Vert \overrightarrow{a}\Vert =\sqrt{{a}_{x}^{2}+{a}_{y}^{2}}, $$
(57)
$$\mathrm{tan}\alpha =\frac{{a}_{y}}{{a}_{x}} ,$$
(58)

where \({a}_{x}\) and \({a}_{y}\) are the accelerations in the horizontal and vertical directions, \(\Vert \overrightarrow{a}\Vert \) is the magnitude of the acceleration, the direction of the acceleration is represented by \(\mathrm{tan}\alpha \), and \(\alpha \) is the angle between the acceleration and the horizontal direction.
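The three motion characteristics are straightforward to compute from consecutive bounding boxes. Below is a minimal Python sketch of Eqs. (52)–(58); the box representation and function names are illustrative, not taken from the original implementation.

```python
import numpy as np

def scale_change_rate(box_cur, box_prev):
    """Eq. (52): ratio of current to previous width and height."""
    _, _, w_o, h_o = box_cur     # (cx, cy, w, h) of the current frame
    _, _, w_p, h_p = box_prev    # (cx, cy, w, h) of the previous frame
    return w_o / w_p, h_o / h_p

def velocity(p_cur, p_prev, s_w, s_h):
    """Eqs. (53)-(55): scale-normalized displacement in pixels per frame."""
    vx = (p_cur[0] - p_prev[0]) / s_w
    vy = (p_cur[1] - p_prev[1]) / s_h
    speed = np.hypot(vx, vy)          # magnitude of the velocity
    theta = np.arctan2(vy, vx)        # direction of the velocity
    return vx, vy, speed, theta

def acceleration(v_cur, v_prev):
    """Eqs. (56)-(58): first-order difference of the velocity."""
    ax = v_cur[0] - v_prev[0]
    ay = v_cur[1] - v_prev[1]
    mag = np.hypot(ax, ay)            # magnitude of the acceleration
    alpha = np.arctan2(ay, ax)        # direction of the acceleration
    return ax, ay, mag, alpha
```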

This work first proposes using the scale change rate to address the scale problem of KCF, with velocity and acceleration serving as additional motion features. In the \(k\)th frame, when the candidate region is generated, the scale change rates \({S}_{w}\) and \({S}_{h}\), the velocity \(v\), and the acceleration \(a\) are calculated according to formulas (52)–(58). Based on these, the possible location of the object \(\mathrm{pos}_{k}\) is estimated. The candidate regions \(\{{C}^{1},{C}^{2},\dots ,{C}^{N}\}\) near \(\mathrm{pos}_{k}\) are then assigned weights \(\{{w}^{1},{w}^{2},\dots ,{w}^{N}\}\) according to a weighted Gaussian distribution \(N(\mu ,{\sigma }^{2})\) [32], and the weighted results give the final predicted position \(\widehat{p}\). If the confidence of all candidate regions falls below the threshold \({T}_{C}\), the target is considered possibly lost due to occlusion, illumination changes, and the like. In that case, the position of the target is predicted from the motion characteristics, the target position \(\widehat{p}\) is given with reference to the target category, and the original template is not updated. The tracking process is shown in Fig. 14.
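For illustration, the Gaussian-weighted candidate fusion can be sketched as follows; the candidate sampling, \(\sigma \), and \({T}_{C}\) values are assumptions made only for this sketch.

```python
import numpy as np

def predict_position(pos_k, candidates, responses, sigma=5.0, t_c=0.2):
    """Fuse candidate centers around pos_k with Gaussian distance weights."""
    candidates = np.asarray(candidates, dtype=float)  # shape (N, 2)
    responses = np.asarray(responses)                 # filter confidences
    if responses.max() < t_c:
        # All candidates unreliable (possible occlusion): fall back to the
        # motion-based prediction and skip the template update.
        return np.asarray(pos_k, dtype=float), False
    d2 = np.sum((candidates - pos_k) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))              # Gaussian weights
    w /= w.sum()
    p_hat = np.sum(w[:, None] * candidates, axis=0)   # weighted fusion
    return p_hat, True
```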

Fig. 14

Fusion visual motion feature tracking process diagram

Figure 15 shows the experimental results after integrating the proposed strategy into the KCF and PF + HoG algorithms.

Fig. 15

The accuracy and success rate comparison results on OTB2013 after the strategy is integrated into the tracking algorithm

Object tracking strategy based on visual memory mechanism

Liu et al. [33, 34] proposed the concept of a visual memory mechanism and constructed a corresponding model, which is suitable for tracking targets in moving scenes. In human vision, an object remains in memory as soon as it appears; when the target reappears, the human can quickly and accurately match it according to this memory [32, 35, 36]. The role of the alternate template is to give the tracking algorithm a similar ability to re-acquire a reappearing target. The proposed method first determines the environment where the target is located. When the target is in a complex environment, the exact template information from frame t − 1 is saved as the standby template \({R}_{a}\). Then, using the optical flow method, a new target position \(P\) is calculated by formula (59):

$${P}_{x}^{t}={P}_{x}^{t-1}+{v}_{x},$$
(59a)
$${P}_{y}^{t}={P}_{y}^{t-1}+{v}_{y}.$$
(59b)

Here, \({v}_{x}\) and \({v}_{y}\) are the speeds of the target in the horizontal and vertical directions, respectively, obtained by the optical flow method. They are calculated as in formulas (60):

$${v}_{x}= \sum\limits_{i=1}^{m + n} {\frac{{v}_{x}^{i}}{m + n}},$$
(60a)
$${v}_{y}= \sum\limits_{i=1}^{m + n} {\frac{{v}_{y}^{i}}{m + n}}.$$
(60b)

In the current frame, when the confidence value of the selected target is lower than the threshold \({T}_{a}\), the template (context regression model) trained in the previous frame is saved as the alternate template \({R}_{a}\). According to the speed of the object, multi-position detection is performed in the current frame to train another template (context regression model) \({R}_{c}\). In the next K frames, the filters trained in each frame are used to update the template, while the template \({R}_{a}\) is reserved and not updated. After K frames, the alternate template \({R}_{a}\) is used to detect the object, and the template is updated according to formulas (61) and (62):

$${\widehat{x}}^{t}=\left(1-\theta \right){\widehat{x}}^{t-1}+\theta {\widehat{x}}^{t}$$
(61)
$${\widehat{A}}^{t}=\left(1-\theta \right){\widehat{A}}^{t-1}+\theta {\widehat{A}}^{t}$$
(62)

Hence, the improvement process of the fused template update method proposed in this paper is as follows:

  • Initialize the first frame of the video to determine the tracking target. Then, the corresponding feature information is extracted and the classifier is trained.

  • Using the classifier, the correlation of the candidate regions is calculated and the response map is obtained.

  • The maximum response value \({F}_{\mathrm{max}}\) is calculated from the response map.

  • The value of \({F}_{\mathrm{max}}\) is compared with the set threshold. If it is smaller than the threshold, the motion features are used to predict the position of the object, and the target information of the previous frame is saved as a backup template through feature storage.

  • After several frames, the target returns from the complex scene to the simple scene. When the value of \({F}_{\mathrm{max}}\) rises above the set threshold, the alternate template \({R}_{a}\) is used to match the target and the corresponding model is updated (a minimal sketch of this logic follows the list).
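The decision logic in these steps can be sketched as follows. The base-tracker interface (detect, get_template, set_template, update_template) and the threshold values are assumed names for this sketch, not the original implementation.

```python
class MemoryTracker:
    """Wraps a base CF tracker with the alternate-template mechanism above."""

    def __init__(self, tracker, f_threshold=0.25, k_hold=5):
        self.tracker = tracker          # base CF tracker (e.g., FDSST)
        self.f_threshold = f_threshold  # confidence threshold (assumed value)
        self.k_hold = k_hold            # frames to hold the alternate template
        self.r_a = None                 # alternate (backup) template R_a
        self.hold = 0                   # frames remaining in "complex" mode

    def step(self, frame):
        response = self.tracker.detect(frame)
        f_max = response.max()
        if f_max < self.f_threshold and self.r_a is None:
            # Complex scene: back up the last reliable template and predict
            # the position from motion features instead of the response map.
            self.r_a = self.tracker.get_template()
            self.hold = self.k_hold
        elif self.r_a is not None:
            self.hold -= 1
            if self.hold <= 0 and f_max >= self.f_threshold:
                # Back in a simple scene: restore the backup and resume updates.
                self.tracker.set_template(self.r_a)
                self.r_a = None
        if self.r_a is None:
            self.tracker.update_template(frame)  # normal online update
        return response
```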

Figure 16 below shows the principle of the proposed improved object tracking strategy that fuses the visual memory mechanism.

Fig. 16

Principle of the object tracking strategy of the fusion visual memory mechanism

After incorporating this strategy into the FDSST and PF + HoG algorithms, the experimental results are shown in Fig. 17.

Fig. 17

Comparison of accuracy and success rate on OTB2015 after the strategy is integrated into the tracking algorithm

Proposed reliability-based visual tracking method

In the process of object tracking, the continuous changes of the object and its surrounding information mean that the best matching position (the position of the maximum response value) is not always the actual object position. Therefore, the credibility of the response value is studied. First, the reliability of the information around the target location is calculated, and the target location is then re-selected based on the predicted reliability. Nine features are extracted from the information around the target to obtain the reliability of the target location; these features digitize the differences and characteristics of the surrounding information. Because selecting the best matching position as the target position can lead to inaccurate tracking in the current frame, negative samples are chosen in the first step: negative samples around the maximum response value are a good indicator of positioning error. Therefore, negative samples are defined by formulas (63).

$$ \sqrt {(x_{n} - gt\left( {x_{n} } \right))^{2} + (y_{n} - gt\left( {y_{n} } \right))^{2} } > \sqrt {\left( {ts\left( w \right)/\varepsilon } \right)^{2} + (ts\left( h \right)/\varepsilon )^{2} } $$
(63a)
$$ \sqrt {(x_{n} - x_{n - 1} )^{2} + (y_{n} - y_{n - 1} )^{2} } > \sqrt {\left( {ts\left( w \right)/\gamma } \right)^{2} + (ts\left( h \right)/\gamma )^{2} } $$
(63b)

In formulas (63), \({x}_{n}\), \({y}_{n}\) represent the position output by the tracking algorithm in the current frame, and \(gt({x}_{n})\), \(gt({y}_{n})\) represent the true location of the target. \(ts\left(w\right)\) and \(ts(h)\) are the width and height of the target frame, and \({x}_{n-1}\), \({y}_{n-1}\) represent the target position output in the previous frame. The Euclidean distance is used to measure the difference between the current position and the real position, as well as the difference from the position in the previous frame. When formulas (63) are satisfied, the sample is used as a negative sample in this experiment. Then, a nine-tuple of features is extracted from the surrounding information of multiple sequences to enable reliability-assisted decision-making. The feature tuple comprises the number of maxima, the larger-response-value ratio, the negative-value ratio, the difference value, the gradients in the X and Y directions (two features), the maximum response value, and the distance from its position to the center (two features). These nine features describe and digitize the distribution of the response matrix from various aspects. The details are as follows:

(1) Peak number: when looking for the same target as in the previous frame within a given area, each candidate frame in that area is compared with the target template determined in the previous frame, thereby obtaining the response value of each candidate frame. These response values are combined to form a response matrix R.

The centralized matrix \(rc\) is obtained by exchanging the four blocks of R, as in formulas (64)–(66).

$$R=\left[\begin{array}{cc}{A}_{1}& \quad {A}_{2}\\ {A}_{3}&\quad {A}_{4}\end{array}\right],$$
(64)
$$rc=\left[\begin{array}{cc}{A}_{4}& \quad {A}_{3}\\ {A}_{2}& \quad {A}_{1}\end{array}\right],$$
(65)

where the calculation methods of \({A}_{1},{ A}_{2},{ A}_{3}\) and \({A}_{4}\) are as follows:

$$ A_{1} = \left[ {\begin{array}{*{20}c} {a_{1,1} } & \cdots & {a_{1,\lfloor m/2\rfloor} } \\ \vdots & \ddots & \vdots \\ {a_{\lfloor n/2\rfloor,1} } & \cdots & {a_{\lfloor n/2\rfloor ,\lfloor m/2 \rfloor} } \\ \end{array} } \right]. $$
(66a)
$$ A_{2} = \left[ {\begin{array}{*{20}c} {a_{{1,\lfloor\frac{m}{2} + 1\rfloor}} } & \cdots & {a_{1,m} } \\ \vdots & \ddots & \vdots \\ {a_{{\lfloor n/2 \rfloor,\lfloor\frac{m}{2} + 1\rfloor}} } & \cdots & {a_{\lfloor n/2\rfloor,m} } \\ \end{array} } \right]. $$
(66b)
$$ A_{3} = \left[ {\begin{array}{*{20}c} {a_{{\lfloor\frac{n}{2}+ 1\rfloor ,1}} } & \cdots & {a_{{\lfloor\frac{n}{2} + 1\rfloor ,\lfloor m/2\rfloor }} } \\ \vdots & \ddots & \vdots \\ {a_{{{\text{n}},1}} } & \cdots & {a_{n,\lfloor m/2\rfloor } } \\ \end{array} } \right]. $$
(66c)
$$ A_{4} = \left[ {\begin{array}{*{20}c} {a_{{\lfloor \frac{n}{2} + 1\rfloor ,\lfloor \frac{m}{2} + 1\rfloor }} } & \cdots & {a_{{\lfloor \frac{n}{2} + 1\rfloor ,m}} } \\ \vdots & \ddots & \vdots \\ {a_{{n,\lfloor \frac{m}{2} + 1\rfloor }} } & \cdots & {a_{n,m} } \\ \end{array} } \right]. $$
(66d)

After the above centralization, the matrix \(rc\) has multiple extreme points. Since the target is likely to appear at one of the many maxima, the maxima need to be processed. To obtain the maximum value matrix, the centered response matrix \(rc\) undergoes regional maximization. Formula (67) defines a function that assigns the value 1 to the local maxima in the matrix \(rc\) and 0 elsewhere; that is, the 0–1 matrix B is obtained.

$$B=f\left(rc\right)=\left\{\begin{array}{ll}1, & rc\left(i,j\right)>\text{the adjacent eight response values}\\ 0, & \text{otherwise}\end{array}\right.$$
(67)

The matrix B is the desired maximum matrix. As shown in formula (68), the centered matrix \(rc\) is multiplied elementwise by the 0–1 matrix B to obtain the matrix R1.

$$R1=rc\odot B. $$
(68)

R1 is the maximum-value response matrix: the response values are stored at the local maxima, and all other positions are zero. To avoid interference from small maxima in the prediction, a threshold \({\mu }_{1}\) is set to exclude them; that is, if the response value at a maximum is too small, it is ignored, and only the maxima greater than the threshold are retained. It is then only necessary to count the number \(nf\) of entries of R1 that satisfy the threshold condition.
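For illustration, the peak-number feature can be sketched with standard array operations; here the FFT shift stands in for the quadrant exchange of Eqs. (64)–(65), and the \(\mu_1\) value is an assumption.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def peak_number(response, mu1=0.1):
    """Count local maxima of the centralized response above mu1 (Eqs. 64-68)."""
    rc = np.fft.fftshift(response)             # centralize the response matrix
    # B: 1 where a value equals its local 3x3 maximum (a local-maximum test).
    local_max = maximum_filter(rc, size=3, mode="nearest")
    b = (rc == local_max).astype(rc.dtype)
    r1 = rc * b                                # Eq. (68): keep maxima only
    return int(np.count_nonzero(r1 > mu1))     # nf: maxima above threshold mu1
```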

(2) Large response ratio: here, response values larger than the threshold \({\mu }_{2}\) are considered. Analyzing the information around targets in many sequences, the number of values exceeding the threshold is counted, and the ratio of this count to the number of all samples is obtained. During tracking, a response value of 1 is almost never reached; therefore, each response value greater than \({\mu }_{2}\) is counted as 1, and the feature \(Pb\) is the ratio of this count to the total size of the matrix. Equation (69) is its formulaic expression:

$$Pb=\frac{{\text{number}}_{{a}_{i,j}>{\mu }_{2}}}{n\times m}.$$
(69)

(3) Negative value ratio: the negative value ratio is similar to the large response ratio of Eq. (69). Analyzing the response values of each frame shows that most entries of the response matrix lie between zero and one, with different distributions, while negative values are more unusual. Therefore, the number of response values below zero in each frame, divided by the total number of response values, is used as a feature for the reliability judgment in the reliability network, as in Eq. (70):

$$z=\frac{{\text{number}}_{{a}_{i,j}<0}}{n\times m}. $$
(70)

(4) Difference value: it measures the difference between the maxima under a threshold condition, as in formula (71). The difference is measured by the sum of the squared differences between each maximum response value greater than \({\mu }_{3}\) and the largest response value of the matrix. In this way, the difference value describes the spread of the response values and serves as another feature of the reliability network:

$$v=\sum_{i=1}^{nf}{\left({R}_{\mathrm{max}}-R\left(i\right)\right)}^{2},$$
(71)
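The ratio and difference features reduce to simple counts over the response matrix. The following is a minimal sketch; the threshold values \(\mu_2\) and \(\mu_3\) are illustrative assumptions.

```python
import numpy as np

def large_response_ratio(response, mu2=0.3):
    """Eq. (69): fraction of response values above mu2."""
    return np.count_nonzero(response > mu2) / response.size

def negative_ratio(response):
    """Eq. (70): fraction of negative response values."""
    return np.count_nonzero(response < 0) / response.size

def difference_value(maxima, mu3=0.2):
    """Eq. (71): squared deviations of retained maxima from the peak."""
    maxima = np.asarray(maxima, dtype=float)  # the nf local maxima kept above
    kept = maxima[maxima > mu3]
    return float(np.sum((maxima.max() - kept) ** 2))
```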

(5) Gradient values in the X and Y directions: as shown in formula (72), the position of the maximum response value is shifted by a unit length in the X and Y directions to calculate the gradient. The steepness of the maximum response can be described by the gradients in both directions: the larger the gradient value, the steeper the peak at the position of the maximum response value. Because the boundary must be considered, there are two cases in the X direction:

$$\frac{\partial \mathrm{response}}{\partial x}=\left\{\begin{array}{ll}\frac{{a}_{x,y}-{a}_{x-1,y}}{\Delta x}, & x=n\\ \frac{{a}_{x,y}-{a}_{x+1,y}}{\Delta x}, & x\ne n\end{array}\right.$$
(72)

Here, x, y denotes the position of the maximum response value, and the response matrix is not centered. This gradient represents the difference between the position of the maximum response value and its surroundings, so the two gradients are extracted as reliability features. The same holds in the Y direction, as in formula (73):

$$\frac{\partial \mathrm{response}}{\partial y}=\left\{\begin{array}{ll}\frac{{a}_{x,y}-{a}_{x,y-1}}{\Delta y}, & y=m\\ \frac{{a}_{x,y}-{a}_{x,y+1}}{\Delta y}, & y\ne m\end{array}\right.$$
(73)

(6) The maximum response value and the center-distance ratios of its position: the magnitude of the maximum response is very important, as it reflects the similarity to the target of the previous frame, and it is used as one feature. In addition, the ratios of the x and y coordinates of the maximum response position to the center of the target frame are used as the last two features of the reliability-assisted decision.
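The gradient and peak-location features can be computed directly from the response matrix. Below is a minimal sketch under one plausible reading of Eqs. (72)–(73) and item (6); the 0-based index convention and the center-ratio definition are assumptions.

```python
import numpy as np

def peak_features(response):
    """Gradients at the peak (Eqs. 72-73) plus peak value and center ratios."""
    n, m = response.shape
    x, y = np.unravel_index(np.argmax(response), response.shape)
    # Eq. (72): backward difference at the boundary, forward otherwise.
    if x == n - 1:                       # 0-based equivalent of x = n
        gx = response[x, y] - response[x - 1, y]
    else:
        gx = response[x, y] - response[x + 1, y]
    # Eq. (73): same rule in the Y direction.
    if y == m - 1:
        gy = response[x, y] - response[x, y - 1]
    else:
        gy = response[x, y] - response[x, y + 1]
    f_max = response[x, y]
    # Item (6), one plausible reading: peak coordinates relative to the center.
    cx, cy = (n - 1) / 2.0, (m - 1) / 2.0
    rx = x / cx if cx else 0.0
    ry = y / cy if cy else 0.0
    return gx, gy, f_max, rx, ry
```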

Then, the above nine features are fed into the reliability-assisted decision-making model for training and testing. The reliability network model is a single-hidden-layer network consisting of nine feature inputs, three hidden neurons, and two predictive outputs. During tracking, the matching information of the current frame is summarized by the nine features, which are fed to the input layer; the hidden-layer weights saved from training are used to compute the reliability of the best matching position. Finally, the reliability at the output layer determines whether the best matching position is reliable. The overall execution steps can be expressed as:

  • In each frame, a response matrix describing the target matching information is obtained from the iterations of the kernel correlation filter tracker.

  • Perform reliability feature extraction on the response matrix to obtain feature data.

  • Put the reliability feature data into the reliability-assisted decision-making model to obtain a reasonable prediction.

  • Set reasonable thresholds based on the predictions and multiple experimental runs.

  • According to the obtained reliability, adopt the secondary positioning strategy to achieve high-precision prediction. A minimal sketch of the reliability network follows this list.
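As a concrete illustration, the following is a minimal NumPy sketch of the single-hidden-layer reliability network described above (nine inputs, three hidden neurons, two outputs). The sigmoid activation and the weight shapes are assumptions, since the original work does not specify them here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reliability(features, w1, b1, w2, b2):
    """Forward pass: 9 features -> 3 hidden neurons -> 2 outputs."""
    h = sigmoid(w1 @ features + b1)   # w1: (3, 9), b1: (3,)
    return sigmoid(w2 @ h + b2)       # w2: (2, 3), b2: (2,)

# Shape illustration with random, untrained weights.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(3, 9)), np.zeros(3)
w2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
print(reliability(rng.normal(size=9), w1, b1, w2, b2))
```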

Applying the reliability-assisted strategy in the tracking algorithm makes it possible to judge the reliability of the current target estimate, which leads to more accurate target positioning. The experimental results for fast motion after incorporating this strategy into the KCF algorithm are shown in Fig. 18 below.

Fig. 18

Quantitative comparison of success rate and accuracy for fast motion in OTB2015

The results show that this method achieves better results than the basic algorithm, an improvement that may be a very important advantage for system security in specific areas. In the proposed gradient approach over the distribution of the response matrix, only the position of the maximum response value is used to shift the window in the X and Y directions. The proposed tracking process is also boosted by the matching model: we choose negative samples and measure the distance to the target, which helps reduce similarity to other objects and improves matching to the target object. The proposed method is efficient, and the results show a positive effect in increasing precision and success rates.

Conclusions and future research

This article mainly introduces the correlation filter algorithms and the datasets used to evaluate them, as well as the contributions of many research scholars and our team in object tracking. Our team mainly uses template update methods based on correlation filter algorithms. The tracking accuracy and success rates are effectively improved after using these methods, while the running speed decreases slightly. When encountering various problems, scholars' innovative solutions can provide new ideas and inspiration for the development of object tracking algorithms.

The biggest advantage of the traditional correlation filter algorithm is that it runs fast and performs well, enabling real-time tracking. Currently, deep learning algorithms achieve the highest tracking accuracy; however, their calculations are time-consuming and run slowly. Therefore, in the next stage, we can focus on three issues:

  1. For traditional filter algorithms, we can continue to explore different template update strategies, feature fusion strategies, or the application of deep features to object tracking. For example, applying partial deep learning results within non-deep traditional algorithms allows deep features to be combined with the speed advantages of traditional algorithms.

  2. For deep learning algorithms, we can try to combine new strategies with classic feature processing methods. In this way, future object tracking algorithms can develop toward real-time speed and high-precision tracking results.

  3. We can summarize object tracking algorithms based on deep learning, as well as tracking algorithms that combine deep learning with correlation filters. In the future, we can also consider applying techniques from single-object tracking to multiple-object tracking.

In this way, the object tracking technology can achieve more efficient and faster real-world applications.