Spatio-temporal joint aberrance suppressed correlation filter for visual tracking

The discriminative correlation filter (DCF)-based tracking methods have achieved remarkable performance in visual tracking. However, the existing DCF paradigm still suffers from dilemmas such as boundary effect, filter degradation, and aberrance. To address these problems, we propose a spatio-temporal joint aberrance suppressed regularization (STAR) correlation filter tracker under a unified framework of response map. Specifically, a dynamic spatio-temporal regularizer is introduced into the DCF to alleviate the boundary effect and filter degradation, simultaneously. Meanwhile, an aberrance suppressed regularizer is exploited to reduce the interference of background clutter. The proposed STAR model is effectively optimized using the alternating direction method of multipliers (ADMM). Finally, comprehensive experiments on TC128, OTB2013, OTB2015 and UAV123 benchmarks demonstrate that the STAR tracker achieves compelling performance compared with the state-of-the-art (SOTA) trackers.


Introduction
Visual tracking aims to estimate the state of a target in an image sequence, given its initial state. It plays a crucial role in computer vision-based applications, e.g., vehicle navigation, video surveillance and robotic perception [2,16,26,31]. In recent years, DCF-based methods have attracted extensive attention due to their high efficiency. However, DCF-based tracking remains a challenging problem due to many intricate issues, such as the boundary effect, filter degradation, and aberrance.
Boundary effect. The efficiency of DCF-based methods relies on the periodic assumption at the training and detection stages. However, this assumption forces the filters to be trained and applied on partially unreal samples, which results in the unexpected boundary effect. The boundary effect mainly impedes the performance of the DCF in two aspects [13]. (i) The inaccurate negative training samples reduce the discriminative power of the learned filters. (ii) The detection scores are reliable only around the center of the region, while the remaining scores are heavily influenced by the periodic repetitions of the detection samples. To address this issue, several competitive DCF-based trackers utilize a constant spatial regularizer to penalize the filter coefficients outside the bounding box [13,18,25]. However, these constant spatial constraints are fixed during tracking, so diverse information (e.g., the appearance variation of the target and the confidence of the tracking results) is not fully utilized. In this paper, we therefore propose a dynamic spatial regularizer based on the response variation rate, which enables the filter to learn more reliable coefficients.
Filter degradation. Generally, DCF-based methods adopt a model update mechanism with a fixed rate, which ignores the variation between different frames [45]. Once the appearance of the target varies dramatically, the filter learned from the previous frame cannot adjust to the appearance change, resulting in filter degradation. To cope with filter degradation, several DCF-based trackers incorporate a temporal regularizer into filter training [25,28,45]. Nevertheless, the temporal regularizer is based on the assumption that filters between consecutive frames should be coherent. Filter training may be interfered with by severe occlusion, background clutter, etc., resulting in a corrupted filter and breaking this assumption. To solve this issue, in this paper, we propose a dynamic temporal regularizer based on the average peak-to-correlation energy (APCE) [39] to suppress filter degradation.
Aberrance. Due to the spatial regularization, the correlation filter can be learned on larger image regions [13]. Nevertheless, as the learning region expands, more background clutter is introduced, leading to aberrance at the detection stage, which manifests as abrupt variation in the response maps. To reduce the effect of aberrance, Wang et al. [39] proposed the Large Margin Object Tracking (LMCF) method, in which the quality of the response maps is verified during filter learning and used to carry out the model update with high confidence. Choi et al. [7] proposed the Attentional Correlation Filter Network (ACFN) tracker, which integrates multiple correlation filters into a network; verification scores generated from the response maps are utilized to select the suitable filter. However, these trackers deal with the aberrance at the detection stage, and thus the tracking performance inevitably decreases. Unlike these trackers, in this paper, we integrate an aberrance suppressed regularizer into the DCF scheme to suppress the aberrance at the filter training stage.
In this work, we address the above issues simultaneously under a unified framework of response map by learning a spatio-temporal joint aberrance suppressed regularization correlation filter. The main contributions are summarized as follows. The rest of this paper is organized as follows. In "Related work", we present an overview of the prior work most relevant to the proposed method. In "Proposed method", the proposed STAR model is introduced, and the ADMM algorithm is developed to solve the STAR efficiently. In "Experimental results", quantitative and qualitative evaluations of the proposed tracker with the SOTA trackers are presented. Conclusions are presented in "Conclusion".

Related work
Visual tracking methods can be classified into generative and discriminative tracking methods [31,40]. Among the discriminative trackers, DCF-based methods have promoted visual tracking to a new level.

Generative tracking
The generative tracking attempts to build models to represent the appearance of the target and search for the most similar candidate region with minimal reconstruction error. Comaniciu et al. [8] proposed the mean-shift tracking method with iterative histogram matching for visual tracking. Adam et al. [1] proposed the fragments-based tracker, which utilizes multiple image fragments to represent the object. Subsequently, Ross et al. [35] proposed the subspace-based tracking method to learn and update the low-dimensional subspace representation of the target. Although generative tracking has achieved considerable success in constrained scenarios, these methods are vulnerable to complicated appearance variations of the target. Therefore, more attention has shifted to discriminative tracking, as it is less susceptible to background clutter during the tracking process.

Discriminative tracking
The discriminative tracking trains a classifier to discriminate the target from the background. Grabner et al. [19] proposed an online boosting tracker by fusing multiple weak classifiers. Kalal et al. [24] proposed the Tracking-Learning-Detection (TLD) tracker that decomposes long-term tracking into three sub-tasks, namely tracking, learning, and detection. More recently, many deep neural network (DNN)-based trackers under the frameworks of "end-to-end learning" and "offline-learning and online-tracking" have been proposed. For example, Bertinetto et al. [4] proposed the Fully Convolutional Siamese Networks (SiamFC) tracker that trains a fully convolutional siamese network by cross-correlating two inputs of the bilinear layer. Valmadre et al. [37] put forward the CFNet tracker that considers the correlation filter as a differentiable layer of the deep neural network. In general, discriminative tracking is relatively more effective than generative tracking in preventing the negative effects of complex background clutter or target appearance variations [40].

DCF-based tracking
Recently, DCF has received considerable attention due to its efficiency and scalability. Bolme et al. [5] first proposed the correlation filter tracker, termed minimum output sum of squared error (MOSSE), to learn a filter between multiple training image patches and a template of user-specified ideal correlation response. Henriques et al. [21] proposed the circulant structure of Tracking-by-Detection with Kernels (CSK) tracker, which exploits the circulant structure of the local image patch to learn a kernel regularized least squares classifier.
To further improve the tracking performance, the follow-up improvements have mainly been carried out along two lines, namely feature representation and scale estimation. In feature representation, Danelljan et al. [11] proposed the color attributes tracker by investigating the color names (CN) [38] feature in the tracking-by-detection framework. Henriques et al. [22] proposed the kernelized correlation filters (KCF) method by utilizing the histogram of oriented gradient (HOG) [9] feature. In addition, Bertinetto et al. [3] proposed the Sum of Template And Pixel-wise LEarners (STAPLE) tracker using the HOG and colour features to improve the tracking credibility. Moreover, Convolutional Neural Network (CNN) features have been used to further improve the feature representation [12,14,25,45]. In scale estimation, Danelljan et al. [10] proposed the Discriminative Scale Space Tracking (DSST) method, which learns a separate scale filter to address the scale variation. Li et al. [27] proposed the Scale Adaptive with Multiple Features (SAMF) tracker by employing a bilinear interpolation to generate image representations in multiple scales.

Revisit the standard DCF
In the standard DCF [22], x ∈ R^{M×N×C} denotes the training sample with spatial size M × N and C feature channels, and y ∈ R^{M×N} is the corresponding Gaussian-shaped label (desired output). The filter f ∈ R^{M×N×C} is trained by regressing the samples to the label,

$$\arg\min_{f}\ \frac{1}{2}\Big\|\sum_{c=1}^{C} x^{c} * f^{c} - y\Big\|_{2}^{2} + \frac{\alpha}{2}\sum_{c=1}^{C}\big\|f^{c}\big\|_{2}^{2},$$

where * stands for the circular convolution operator, and α is the regularization parameter that prevents overfitting.
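As a minimal numerical sketch of this formulation, the single-channel case admits a closed-form solution in the Fourier domain, f̂ = conj(x̂)·ŷ / (|x̂|² + α); `train_dcf` and `detect` below are illustrative names, not the paper's implementation.

```python
import numpy as np

def train_dcf(x, y, alpha=1e-2):
    """Closed-form single-channel DCF training in the Fourier domain.

    Solves argmin_f ||x * f - y||^2 + alpha ||f||^2, with * circular
    convolution; elementwise in frequency the ridge solution is
        f_hat = conj(x_hat) * y_hat / (conj(x_hat) * x_hat + alpha).
    """
    x_hat = np.fft.fft2(x)
    y_hat = np.fft.fft2(y)
    f_hat = np.conj(x_hat) * y_hat / (np.conj(x_hat) * x_hat + alpha)
    return np.real(np.fft.ifft2(f_hat))

def detect(x, f):
    """Response of filter f on sample x via the convolution theorem."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(f)))
```

On the training sample itself, the response of the learned filter peaks at the center of the Gaussian label, which is what the regression objective asks for.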
In the standard DCF model, several problems need to be further addressed. (i) It suffers from periodic repetitions at boundary positions caused by the circulant shifted training samples. (ii) It does not tackle the problem of filter degradation, since the model is updated at a fixed rate. (iii) There is no mechanism to cope with the aberrance, so the target is easily lost when aberrance occurs.

The proposed model STAR
To address the problems mentioned above, we propose a novel spatio-temporal joint aberrance suppressed regularization (STAR) correlation filter for robust visual tracking. The tracking framework of the proposed STAR model is shown in Fig. 1. The spatial regularizer, temporal regularizer and aberrance suppressed regularizer are incorporated into the standard DCF to tackle the boundary effect, filter degradation and aberrance, simultaneously.
We assume that the learning of the correlation filter f is conducted at the t-th frame. The filter is learned by minimizing the following objective function,

$$\arg\min_{f_t}\ \mathcal{L}(f_t) + \lambda R_s + \mu R_t + \eta R_a,$$

where L(f_t) denotes the regression loss parameterized by f_t. R_s, R_t and R_a refer to the spatial, temporal and aberrance suppressed regularizers, respectively. The parameters λ, μ and η are the corresponding regularization coefficients.

Dynamic spatial regularizer
The constant spatial regularizer in the SOTA trackers (e.g., SRDCF [13], BACF [18] and STRCF [25]) does not fully exploit the diverse information of the target. When the target suffers from interference, e.g., severe occlusion or background clutter, the filter coefficients become unreliable, leading to tracking failures. To solve this problem, we design a dynamic spatial regularizer based on the response variation rate.
The response variation rate Π = [Π_1, Π_2, . . . , Π_{MN}] measures, element by element, the variation between the current response R_t and the shifted previous response R_{t−1}[ψ], where [ψ] is the shift operator. It enables the peaks of the responses R_t and R_{t−1} to coincide with each other so as to eliminate the motion influence [23]. Considering that the response variation rate reveals the confidence level of each pixel in the search area, we introduce Π into the spatial weight w, where δ is a hyperparameter adjusting the contribution of Π, and w̄ is the matrix initializing the spatial regularization weight w. The dynamic spatial regularizer of the STAR model is defined as

$$R_s = \frac{1}{2}\big\|w \odot f_t\big\|_{2}^{2},$$

where ⊙ is the Hadamard product. The visualization of the dynamic variation of the spatial regularization is shown in Fig. 2. It shows that the dynamic spatial regularizer imposes different penalties on spatial positions according to the value of the response variation rate: a higher penalty where the variation rate is large and a lower penalty where it is small. Thus, it achieves more reliable filter coefficients at the detection stage.
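A rough sketch of how such a response variation rate and dynamic spatial weight could be computed. Since the paper's exact expressions for Π and w are not reproduced here, the elementwise relative-change form of Π and the modulation w = w̄ ⊙ (1 + δΠ) are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def shift_to_align(R_prev, R_curr):
    """The [psi] operator: circularly shift R_prev so its peak coincides
    with the peak of R_curr, removing the motion-induced offset."""
    p_prev = np.unravel_index(np.argmax(R_prev), R_prev.shape)
    p_curr = np.unravel_index(np.argmax(R_curr), R_curr.shape)
    return np.roll(R_prev, (p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]),
                   axis=(0, 1))

def response_variation_rate(R_prev, R_curr, eps=1e-8):
    """Elementwise relative variation between the aligned responses
    (an illustrative form of Pi; the paper's expression may differ)."""
    R_shift = shift_to_align(R_prev, R_curr)
    return np.abs(R_curr - R_shift) / (np.abs(R_shift) + eps)

def dynamic_spatial_weight(w_bar, Pi, delta=0.1):
    """Modulate the initial weight map w_bar by Pi, so positions with a
    larger variation rate receive a higher penalty (assumed form)."""
    return w_bar * (1.0 + delta * Pi)
```

When the previous response is merely a translated copy of the current one, the alignment cancels the motion and Π vanishes, leaving the spatial weight at its initial value.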

Dynamic temporal regularizer
The temporal regularizer ‖f_t − f_{t−1}‖²_F is constructed from the previous filter f_{t−1} (e.g., STRCF [25], LADCF [45] and AutoTrack [28]), so the filter learned at frame t is affected to a large extent by f_{t−1}. However, f_{t−1} may be corrupted by occlusion or background clutter, which breaks the assumption that the filters between consecutive frames should be coherent. To tackle this issue, we propose a dynamic temporal regularizer based on the APCE measure. The APCE measure is defined as

$$\mathrm{APCE} = \frac{\left|R_{\max} - R_{\min}\right|^{2}}{\operatorname{mean}\!\left(\sum_{w,h}\left(R_{w,h} - R_{\min}\right)^{2}\right)},$$

where R_max, R_min and R_{w,h} denote the maximum, the minimum and the w-th row, h-th column element of the response R, respectively. The visualization of the value of APCE with its corresponding threshold in a typical tracking sample is shown in Fig. 3. At the training stage, the filter may be corrupted by occlusion, background clutter, etc.; the response map with interference is then generated by the convolution of the corrupted filter and the feature map. As a consequence, the value of APCE obtained by Eq. (6) drops significantly. This property of APCE can be adopted to judge whether the filter is corrupted. Subsequently, the uncorrupted filter f_s is selected for the temporal regularizer instead of f_{t−1}, as follows,

$$f_s = \begin{cases} f_{t-1}, & \mathrm{APCE} \ge \zeta \cdot \mathrm{APCE}_{hm}, \\ f_{t-i}, & \text{otherwise}, \end{cases}$$

where f_{t−1} and f_{t−i} denote the filter at the (t−1)-th frame and the most recent uncorrupted filter at the (t−i)-th frame, respectively, ζ is a hyperparameter, and APCE_hm stands for the historical mean value of APCE.
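The APCE computation and the reference-filter selection described above can be sketched as follows; `select_reference_filter` is a hypothetical helper name, and the threshold rule follows the description in the text.

```python
import numpy as np

def apce(R):
    """Average peak-to-correlation energy of a response map R:
        APCE = |R_max - R_min|^2 / mean((R_wh - R_min)^2).
    A sharp single peak yields a high APCE; a noisy, flat map a low one."""
    r_max, r_min = R.max(), R.min()
    return (r_max - r_min)**2 / np.mean((R - r_min)**2)

def select_reference_filter(f_prev, f_last_good, apce_t, apce_hist_mean,
                            zeta=0.7):
    """Pick the reference filter f_s for the temporal regularizer: keep
    f_{t-1} while APCE stays above zeta times its historical mean,
    otherwise fall back to the most recent uncorrupted filter f_{t-i}."""
    return f_prev if apce_t >= zeta * apce_hist_mean else f_last_good
```

The selection rule thus gates the temporal constraint on response quality: a corrupted frame no longer drags the current filter toward a corrupted reference.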
The uncorrupted filter f_s is selected to construct the dynamic temporal regularizer of the STAR model,

$$R_t = \frac{1}{2}\big\|f_t - f_s\big\|_{F}^{2}.$$

Compared with the existing temporal regularization methods [25,28,45], the STAR model takes full advantage of the continuity of the video by exploiting ‖f_t − f_s‖²_F to penalize the difference between the current filter f_t and the uncorrupted filter f_s. Thus, the proposed STAR gains a more robust appearance model and alleviates filter degradation effectively.

Aberrance suppressed regularizer
The response map reveals the confidence of the tracking result to a large extent [39]. The aberrance caused by background clutter occurs at the detection stage and results in an abrupt variation in the response maps, so the aberrance can be effectively repressed by restricting the response variation. As a result, an aberrance suppressed regularizer is introduced to handle the aberrance at the training stage. The aberrance suppressed regularizer is formulated as

$$R_a = \frac{1}{2}\big\|R_{t-1}[\psi] - R_t\big\|_{F}^{2},$$

where all the variables have been explained in Eq. (3).
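A small sketch of the quantity being penalized, assuming the peak-alignment form of the shift operator [ψ] described earlier; `aberrance_penalty` is an illustrative name, not the paper's code.

```python
import numpy as np

def aberrance_penalty(R_prev, R_curr):
    """Aberrance suppressed term: penalize abrupt variation between the
    current response and the peak-aligned previous response,
        R_a = 0.5 * ||R_{t-1}[psi] - R_t||_F^2."""
    p_prev = np.unravel_index(np.argmax(R_prev), R_prev.shape)
    p_curr = np.unravel_index(np.argmax(R_curr), R_curr.shape)
    aligned = np.roll(R_prev, (p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]),
                      axis=(0, 1))
    return 0.5 * np.sum((aligned - R_curr)**2)
```

A pure translation of the response incurs no penalty (the alignment absorbs the motion), while a distorted response map is penalized, which is exactly the behavior the regularizer is meant to encourage during training.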

Optimization of STAR
With all the regularizers defined, the optimization of Eq. (2) is the key to solving the tracking problem. Benefiting from its convexity, Eq. (2) can be minimized to the optimal solution using ADMM [6]. Specifically, we introduce the auxiliary variable g = f and the step size parameter γ to construct the augmented Lagrange function in Eq. (10), where r = R_{t−1}[ψ] and s refers to the Lagrange multiplier. By introducing h = (1/γ)s, Eq. (10) can be reformulated as Eq. (12). Then, the following subproblems are alternately optimized via the ADMM formulation.
Subproblem f: The first subproblem of Eq. (12) can be transformed into the frequency domain using Parseval's theorem as Eq. (13), where ˆ denotes the discrete Fourier transform (DFT). The j-th element of the label y relies on the j-th elements of the sample x_t and the filter f_t across all C channels, and V_j(f̂) ∈ C^C is the vector consisting of the j-th elements of f̂ along the channels. Equation (13) can thus be decomposed into M × N subproblems, where each subproblem is given by Eq. (14), with the superscript T on a complex vector or matrix indicating the conjugate transpose. Setting the derivative of Eq. (14) to zero yields the closed-form solution of V_j(f̂*) in Eq. (15), which can be rewritten via the Sherman-Morrison formula [32] as Eq. (16). Note that Eq. (16) only contains vector multiply-add operations, so it can be computed efficiently. The spatial filter f is then obtained by the IDFT of f̂.
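The per-pixel Sherman-Morrison step can be sketched as follows: solving (γI + x xᴴ)v = b for the C-dimensional vector at one pixel with only vector multiply-adds, no C×C inverse. The variable names are illustrative, and the single scalar γ stands in for the combined regularization terms of Eq. (14).

```python
import numpy as np

def sherman_morrison_solve(x, b, gamma):
    """Solve (gamma*I + x x^H) v = b across the C channels of one pixel.

    Sherman-Morrison gives (gamma*I + x x^H)^{-1}
        = I/gamma - x x^H / (gamma * (gamma + ||x||^2)),
    so the solve reduces to two inner products and a scaled subtraction.
    """
    xHb = np.vdot(x, b)          # x^H b (np.vdot conjugates the first arg)
    xHx = np.vdot(x, x).real     # ||x||^2
    return b / gamma - x * (xHb / (gamma * (gamma + xHx)))
```

Because each pixel's system is rank-one plus a scaled identity, the full filter update costs O(MNC) per ADMM iteration instead of O(MNC³).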

Subproblem g:
For the second subproblem of Eq. (12), each element of g can be computed independently in closed form.

Lagrangian multiplier update: The Lagrange multiplier is updated as

$$s^{i+1} = s^{i} + \gamma\left(f^{*} - g^{*}\right),$$

where the superscript i represents the i-th iteration, and f* and g* are the solutions of subproblems f and g, respectively.
By solving the aforementioned subproblems iteratively, the optimal filter f* of the t-th frame is obtained and then used for tracking at the (t + 1)-th frame.
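To illustrate the update order (f-step, g-step, multiplier step) without the tracker-specific terms, here is a minimal scaled-form ADMM on a toy consensus problem whose optimum is known in closed form. It is a sketch of the iteration structure only, not the STAR solver.

```python
import numpy as np

def toy_admm(a, b, gamma=1.0, iters=50):
    """Scaled-form ADMM on
        min_{f,g} 0.5*||f - a||^2 + 0.5*||g - b||^2   s.t. f = g,
    whose optimum is f = g = (a + b) / 2.

    Each iteration mirrors the paper's scheme: minimize over f, minimize
    over g, then take a gradient-ascent step on the scaled multiplier h.
    """
    f = np.zeros_like(a); g = np.zeros_like(a); h = np.zeros_like(a)
    for _ in range(iters):
        f = (a + gamma * (g - h)) / (1 + gamma)   # f-subproblem
        g = (b + gamma * (f + h)) / (1 + gamma)   # g-subproblem
        h = h + f - g                             # scaled multiplier update
    return f, g
```

On this strongly convex problem the iterates contract geometrically, so a few dozen iterations suffice; the STAR tracker uses the same alternation with only N = 3 iterations per frame.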

Target localization
The response map R_t at the t-th frame is calculated in the Fourier domain as

$$\hat{R}_t = \sum_{c=1}^{C} \hat{x}_t^{c} \odot \hat{f}^{c}.$$

After computing the IDFT of R̂_t to obtain the response map R_t, the target location is predicted from the maximum value of the response map. The overall tracking algorithm of the STAR model is summarized in Algorithm 1.
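A minimal sketch of this localization step: sum the per-channel spectral products, take the IDFT, and read off the argmax. `localize` is an illustrative name, and sub-pixel refinement and scale handling are omitted.

```python
import numpy as np

def localize(x, f):
    """Target localization for multi-channel sample x and filter f,
    both of shape (M, N, C): the response is the IDFT of the sum over
    channels of the elementwise spectral products; the predicted
    position is the argmax of the response map."""
    x_hat = np.fft.fft2(x, axes=(0, 1))
    f_hat = np.fft.fft2(f, axes=(0, 1))
    R = np.real(np.fft.ifft2(np.sum(x_hat * f_hat, axis=2), axes=(0, 1)))
    return np.unravel_index(np.argmax(R), R.shape), R
```

With a delta-like filter, the response peak lands exactly on the bright pixel of the sample, confirming the circular-convolution convention.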

Evaluation metrics
Quantitative and qualitative experiments are conducted on four tracking benchmarks, i.e., TC128 [29], OTB2013 [43], OTB2015 [44] and UAV123 [33]. For these benchmarks, success rate and precision are evaluated under the one pass evaluation (OPE) protocol [43,44]. The area under curve (AUC) of the success plot and the distance precision (DP) at a threshold of 20 pixels are adopted as the evaluation metrics to measure tracking accuracy. Meanwhile, the speed is measured in frames per second (FPS). For the sake of a fair comparison, the compared trackers are based on publicly available code or on results reported in the original papers.
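Under the OPE protocol described above, the two metrics can be sketched as follows, assuming (x, y, w, h) bounding boxes and a conventional 21-point grid of overlap thresholds; the function names are illustrative.

```python
import numpy as np

def center_errors(pred_boxes, gt_boxes):
    """Per-frame center location error in pixels for (x, y, w, h) boxes."""
    pc = pred_boxes[:, :2] + pred_boxes[:, 2:] / 2
    gc = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2
    return np.linalg.norm(pc - gc, axis=1)

def iou(pred_boxes, gt_boxes):
    """Per-frame intersection-over-union of (x, y, w, h) boxes."""
    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 0] + pred_boxes[:, 2],
                    gt_boxes[:, 0] + gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 1] + pred_boxes[:, 3],
                    gt_boxes[:, 1] + gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = (pred_boxes[:, 2] * pred_boxes[:, 3]
             + gt_boxes[:, 2] * gt_boxes[:, 3] - inter)
    return inter / union

def auc_success(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """AUC of the success plot: mean fraction of frames whose overlap
    exceeds each threshold on the grid."""
    o = iou(pred_boxes, gt_boxes)
    return np.mean([(o > t).mean() for t in thresholds])

def dp_at_20(pred_boxes, gt_boxes):
    """Distance precision: fraction of frames with center error <= 20 px."""
    return (center_errors(pred_boxes, gt_boxes) <= 20).mean()
```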

Experimental setup
The experiments are conducted on a PC equipped with an i7-9700K CPU and an NVIDIA GTX 1080Ti GPU, using MATLAB R2017a and the MatConvNet toolbox. We combine the output of the Conv-3 layer of the VGG-M network [36] with HOG+CN features for target representation. The coefficients of the spatial, temporal and aberrance suppressed regularizers are set as λ = 1, μ = 10 and η = 0.1, respectively. The step size parameter γ is initialized to 1 and updated by γ^{i+1} = min(γ_max, ργ^i), where ρ = 10 and γ_max = 1000. The other hyperparameters are set to δ = 0.1 and ζ = 0.7, and the number of ADMM iterations is set to N = 3. To make a fair comparison, the parameters of the STAR tracker are fixed throughout the experiments.
On the OTB2013 benchmark, the proposed STAR achieves the best AUC (0.688) and the second-best DP (0.892). Compared with the feature selection-based tracker LADCF-HC, STAR improves the AUC and DP by 1.6% and 2.8%, respectively. Compared with UDT, which is trained in an unsupervised manner, STAR improves the AUC and DP by 6.1% and 6.6%, respectively.
On the OTB2015 benchmark, the proposed STAR achieves scores of 0.672 and 0.875 in AUC and DP, respectively. Compared with the BACF tracker, which uses a constant spatial regularizer, STAR improves the AUC by 5.7% and the DP by 5.9%. This gain mainly benefits from the dynamic spatial regularizer, which imposes different penalties on spatial positions according to the response variation rate and produces more reliable filter coefficients at the tracking stage.

Ablation studies
Ablation studies on OTB2013 [43] are conducted to demonstrate the effectiveness of the key components of the proposed STAR tracker. The key components include the dynamic spatial regularizer (DSR), the dynamic temporal regularizer (DTR) and the aberrance suppressed regularizer (AR). We compare five configurations, i.e., "Baseline" (the standard DCF tracker in "Revisit the standard DCF", which adopts the same feature representation as STAR), "Baseline+DSR", "Baseline+DTR", "Baseline+AR" and "Baseline+DSR+DTR+AR" (i.e., the final STAR tracker). The ablation results are reported in Table 2. The baseline tracker achieves scores of 0.642 and 0.841 in AUC and DP. Introducing the "DTR", "AR" and "DSR" components into the "Baseline" improves the tracking performance step by step. Finally, the proposed STAR, which integrates all the key components, surpasses the "Baseline" by 4.6% and 5.1% in AUC and DP, respectively.

Qualitative evaluations
To intuitively exhibit the superiority of the STAR tracker, six sets of screenshots of the tracking results on OTB2015, i.e., biker, bird2, box, football, human4 and soccer (from top to bottom), are shown in Fig. 9. The targets in these sequences undergo challenging attributes such as rotation, scale variation, occlusion, motion blur, and fast motion. The compared trackers include AutoTrack [28], ARCF [23], CFWCR [20], ECO [14], LADCF-HC [45], STRCF [25] and TB-BiCF [30]. The proposed STAR (in the red box) achieves much better tracking precision than the other SOTA trackers. Specifically, in the "biker" sequence, in which the target suffers from fast motion and motion blur, most of the compared trackers fail at frame 70. The attributes of the "soccer" sequence include occlusion and background clutter, causing most compared trackers to fail at frame 365. In contrast, the proposed STAR achieves satisfying performance in these sequences.

Conclusion
In this paper, we propose a novel spatio-temporal joint aberrance suppressed regularization (STAR) correlation filter for robust visual tracking. The STAR tracker takes full advantage of spatio-temporal information and employs an aberrance suppression strategy. The dynamic spatio-temporal regularizer effectively alleviates the boundary effect and filter degradation, while the aberrance suppression strategy reduces the interference caused by background clutter. Besides, the STAR tracker is efficiently optimized based on the ADMM formulation. Comprehensive experiments on four tracking benchmarks demonstrate the superiority of the proposed method over the SOTA trackers.