1 Introduction

Visual object tracking is one of the most fundamental topics in pattern recognition and computer vision, playing a crucial role in a wide range of visual intelligent systems, e.g., medical image analysis, human-computer interaction, intelligent transportation, and robotics. Consistently and accurately tracking an arbitrary object in unconstrained scenarios is very challenging due to deformable shapes, changing aspect ratios, and textural variations of the target. Among existing advanced tracking algorithms, discriminative correlation filter (DCF) based trackers [1] have exhibited promising performance on various benchmarks [2–4] and in competitions such as the Visual Object Tracking (VOT) challenges [5, 6] and VisDrone [7, 8]. In general, the advantages of DCF include its spatial appearance model exploiting the circulant matrix structure [9] and efficient optimization in the frequency domain [10]. More recent innovations focus on scale detection [11], joint regularization [12], continuous domain mapping [13], multi-response fusion [14], etc.

The success of current advanced DCF trackers can be attributed to two main factors: spatial regularization and temporal fusion. Regarding spatial regularization, since images and videos are 2D projections of the scene from the view of a camera, spatial regularization directly improves tracking performance by endowing the learned classifiers with a specific attention mechanism, enhancing the model's discrimination by suppressing ambiguous and background regions [15–17]. Regarding temporal fusion, advanced DCF trackers highlight online appearance clues by gathering more historical target information or by imposing temporally consistent constraints on the discriminative learning stage [18–20]. These spatio-temporal modeling methodologies have therefore received continuous attention in the visual tracking community, especially with the powerful deep learning representations developed in recent years [21–23].

However, from a geometric viewpoint, the current DCF paradigm extracts discriminative information from independent training templates (points), without jointly unifying the spatio-temporal appearance. Specifically, the multi-channel feature representations of different frames obtained by a pre-trained convolutional neural network (CNN) are simply input to the DCF learning stage via a moving average. As a result, the model capacity against appearance variation can only be guaranteed within a limited \(\ell _{2}\)-norm ball around the training templates, impeding the generalization of the learned filters, as illustrated in Fig. 1. Motivated by this observation, we argue that the appearance model should be extended from independent points to a spatio-temporal affine subspace. The relationships among multiple historical frames can be jointly considered in the affine subspace, effectively extending the model capacity. In our design, during the online tracking process, all the previous frames are collected to construct the affine subspace, consisting of an origin and a linear subspace. To mitigate the computational complexity of obtaining the linear subspace when a large number of frames are involved, we employ an incremental learning technique to update the origin and the linear subspace online, resulting in efficient affine subspace learning and updating.

Figure 1. Illustration of the proposed appearance learning space supported by an affine subspace. Left: the appearance of a single template point is used to train the filter at each frame node. Right: the appearance generated by the affine subspace is used to train one main filter (from the origin μ) and K auxiliary filters (from the basis \(\boldsymbol{u}_{1},\boldsymbol{u}_{2},\ldots ,\boldsymbol{u}_{K}\)), providing improved generalization (\(K=3\))

In addition to constructing the affine subspace to capture the spatio-temporal appearance, we also propose to endow the DCF model with parsimony and consistency constraints. With the development of robust visual features, e.g., Haar descriptors, the histogram of oriented gradients (HOG), and the convolutional blocks of deep architectures (AlexNet, VGGNet, ResNet) [22–24], the dimensionality of feature representations has grown continuously. These high-dimensional feature maps provide richer discriminative information for distinguishing the target from its surroundings, improving tracking performance. However, they inevitably contain redundancy and noise. Therefore, to highlight the relevance between deep feature representations and the discriminative learning task, we propose to regularize the learned filters to be sparse. In addition, temporal smoothness is emphasized in the DCF learning objective to achieve consistency in filter training, improving the stability of the tracking model.

To integrate the DCF learning paradigm with the constructed affine subspace, we use the origin and the basis of the subspace to train one main filter and multiple auxiliary filters. The filter learning process corresponding to the origin is similar to that of the standard DCF paradigm, where a moving-average template is employed to train the classifier in the current frame. The novelty of our affine subspace DCF (ASDCF) learning approach lies in the design of the auxiliary filters corresponding to the basis of the subspace. Specifically, after obtaining the basis of the current K-dimensional subspace, we train K separate auxiliary filters, one for each basis vector. In this way, each auxiliary filter is associated with a specific mode of appearance variation, improving the capacity of the proposed learning model. The proposed ASDCF injects the spatio-temporal information represented by the deep features into the online updated affine subspace, unifying the spatial visual features and the changing temporal variations, with improved discrimination and interpretability compared with the standard DCF framework.

Siamese-based trackers have recently achieved remarkable performance by learning, through an end-to-end network, to map the target template and the search instance into a learned feature embedding space. However, Siamese-based trackers rely on a fixed template, and the appearance capacity is not modeled, so their performance depends heavily on the invariance of the extracted features. In contrast, the affine subspace constructed in this work enhances the ability to model target appearance, increasing the tolerance of the learned model to spatio-temporal appearance variations of an object. Therefore, by combining the proposed affine subspace construction and updating with DCF learning, more accurate and stable tracking results can be achieved.

The main contributions of the proposed ASDCF tracking approach include the following:

1) A new affine subspace construction technique in online visual tracking to unify the spatial and temporal discriminative information, with an efficient incremental learning method to update the affine subspace during tracking.

2) An effective DCF learning objective imposing sparsity and temporal smoothness regularization for the filters.

3) A comprehensive evaluation of ASDCF on several well-known publicly available benchmarking datasets, including OTB2013 [2], OTB2015 [3], UAV123 [4], and VOT2018 [6]. The results support the advantages of the proposed ASDCF, demonstrating superior tracking performance compared with state-of-the-art trackers.

The rest of this paper is organized as follows. In Sect. 2, we briefly review relevant tracking approaches for constructing spatio-temporal appearance models, especially the development of the DCF framework. The proposed affine subspace construction is presented in Sect. 3. The details of the proposed ASDCF method are introduced in Sect. 4, accompanied by an efficient optimization scheme. The implementation details and experimental results are reported in Sect. 5, with ablation studies and comparative analysis. Conclusions are presented in Sect. 6.

2 Related work

Existing visual object tracking approaches include generative and discriminative learning, e.g., image matching [25], statistical theory [26], the particle filtering framework [27], subspace learning methodology [28], discriminative correlation filters [1], and deep neural networks [29]. In this section, we focus on the development of the above-mentioned tracking approaches that are pertinent to our ASDCF. Continuously improving tracking performance has been evidenced by recent tracking benchmarking datasets and competitions such as VOT [6]. Readers are referred to recent surveys [3, 30–32] for detailed and comprehensive reviews of visual tracking approaches.

2.1 Generative learning framework

Generative learning frameworks aim at learning the intrinsic target state distribution to represent the target appearance, based on which a similarity metric or reconstruction error can be employed to score the candidates in the next frame. Typical generative learning models in the early stage of visual tracking research include optical flow [25] and mean-shift [33]. The basic assumptions behind these two methodologies are brightness constancy and limited appearance variation. Although both methods provide complete mathematical derivations to model the visual tracking task, their rigid assumptions do not hold in real-world scenarios, resulting in poor tracking performance on challenging videos. To enhance tracking robustness, the particle filtering system was applied to visual tracking [27, 34] to estimate the posterior distribution of the target via Bayes' theorem and sampling techniques. Specifically, the conditional distribution is approximated via the similarity between the current samples and the model distribution, providing nonlinear state inference for tracking. It should be noted that performance improves as the number of particles increases, at the cost of efficiency. Since the particle filtering system serves as an external prediction framework, it has been widely studied and combined with other generative methods, e.g., sparse subspace representations and low-rank representations [35–37]. In particular, the subspace-based tracking paradigm has received wide attention since the proposal of the incremental subspace learning scheme [28], which assumes that the target can be linearly represented by its corresponding eigenvectors. Sparse trackers assume that the target can be sparsely represented by an over-complete dictionary; the representation coefficients and reconstruction errors are then used to gauge the quality of candidates. Furthermore, low-rank constraints have been proposed to increase the relevance of particles by suppressing spurious information [37].

The advantage of the generative learning framework lies in its ability to enlarge the tracking model capacity via carefully designed appearance representations and inference systems. However, generative tracking methods neglect the background appearance, resulting in limited discriminative power.

2.2 Discriminative learning framework

In addition to generative learning methods, various classification methods, such as support vector machines [26], multiple instance boosting [38], and linear regression [10], have been employed to construct learning models in a discriminative manner, exploiting the discriminative information between the target region and its surroundings. Discriminative learning approaches formulate tracking as a classification or regression problem, aiming to directly infer the output of a sampling candidate by estimating the conditional distribution of labels for the given inputs. The sampling candidate with the maximal response is then selected as the final tracking result. However, a common limitation of the above discriminative trackers is that the learning model is initialized in the first frame with insufficient appearance information, without guaranteed tracking robustness for the following frames. More recently, Siamese networks [29, 39–41] have been successfully applied to visual tracking. Taking advantage of large annotated tracking datasets, deep architectures, and powerful graphics processing units, Siamese networks achieve efficient visual tracking by performing template matching in a learned feature embedding space.

Compared with basic generative learning approaches, discriminative methods provide a more robust modeling paradigm that extracts and analyzes appearance from both foreground and background, achieving better tracking performance.

2.3 Discriminative correlation filter

DCF belongs to the discriminative learning paradigm, and we review its development in detail in this subsection as it is the baseline of our proposed ASDCF. The seminal work of the DCF framework is the minimum output sum of squared error (MOSSE) filter [42], which formulates the tracking task as discriminative filter learning [43] rather than template matching [44], achieving improved tracking efficiency. Based on this modeling technique, the concept of the circulant matrix [9] was introduced to DCF by CSK [10] with an enlarged search window, enabling the generation of more negative training samples in the discriminative filter learning stage. To further explore the potential of the DCF framework, spatio-temporal context information [45] and kernel modeling techniques [1] have been leveraged to improve the learning formulation by involving local appearance and nonlinear metrics, respectively. In recent years, the DCF paradigm has been further extended by exploiting scale detection [46, 47], structural patch analysis [48, 49], multi-clue fusion [14, 50, 51], sparse representation [36, 52, 53], support vector machines [54, 55], enhanced sampling mechanisms [56, 57], and end-to-end deep neural networks [29, 40, 58].

Despite the outstanding performance of the DCF framework in visual object tracking, it is still very challenging to achieve high-performance tracking of a spatio-temporally changing arbitrary object, especially in unconstrained scenarios. The main obstacles are the spatial boundary effect and temporal inconsistency. To alleviate the boundary effect caused by the circulant structure, SRDCF [15] introduces spatial regularization into the DCF formulation, allocating more filter energy to the central region and less to the surroundings via a pre-defined spatially smooth weighting function. A similar idea has been pursued by pruning the training samples or learned filters with pre-defined binary masks [16, 59–62]. To achieve adaptive spatial regularization, LADCF [63] embeds dynamic spatial feature selection in the filter learning stage, activating supportive spatial regions not only in the foreground but also in the background. Similarly, A3DCF [64] proposes an adaptive attribute-aware mechanism that learns channel-wise masks to enhance the discriminative elements of the feature maps while suppressing irrelevant ones. ADTrack [65] adopts image pre-treatment to generate masks for discriminative filter learning. The above spatial regularization approaches reduce the ambiguity emanating from the background and enable a relatively enlarged search window for DCF tracking. However, they only address information redundancy and imbalance along the spatial dimension. On the other hand, to mitigate temporal filter inconsistency, historical appearance information is rearranged in SRDCFdecon [18] and C-COT [13] by gathering multiple previous frames in the filter learning stage, enhancing robustness and temporal stability. In addition, to alleviate the computational burden caused by involving a large number of historical samples, ECO [20] reduces the inherent computational complexity by clustering historical frames into a generative sample space and employing a projection matrix to reduce the number of channels of the feature representations.

To advance the DCF modeling space, we introduce the affine subspace to enlarge, from a geometric viewpoint, the representational power for potential appearance variations. Performing discriminative modeling in the affine subspace therefore unifies the spatial and temporal discriminative information, enhancing the DCF capacity on challenging video sequences.

3 Affine subspace generation

To accommodate appearance variations of spatio-temporally changing objects, we propose to employ an affine subspace to represent both static and dynamic information. An affine subspace can be formulated as:

$$ \mathcal{A}= \bigl\{ \boldsymbol{x}\in \mathbb{R}^{D}:\boldsymbol{x}= \boldsymbol{\mu}+\boldsymbol{Uz} \bigr\} , $$
(1)

where \(\boldsymbol{\mu}\in \mathbb{R}^{D}\) denotes the origin of the affine subspace, and U (\(\boldsymbol{U}= [\boldsymbol{u}_{1},\boldsymbol{u}_{2}\ldots ]\)) represents the corresponding basis of the subspace, as depicted in Fig. 1. Based on the formulation in Eq. (1), we can gather the historical appearance with an updated affine subspace, realizing extended representation capacity compared with a single template. Specifically, the origin μ reflects the weighted average static appearance from all the previous frames, while the basis U constructs the detailed variations during the tracking process. Here, the basis is obtained by calculating the dominant K eigenvectors of the subspace based on singular value decomposition (SVD).
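As a concrete illustration of this construction, the following minimal numpy sketch (our own, not the authors' released implementation, which is in MATLAB) builds an affine subspace from a batch of vectorized frame templates: the origin is the sample mean and the basis holds the dominant K left singular vectors of the centred data.

```python
import numpy as np

def build_affine_subspace(X, K=3):
    """X: (D, n) matrix whose columns are vectorized frame templates.

    Returns the origin mu (D,) and basis U (D, K) of Eq. (1).
    """
    mu = X.mean(axis=1)                    # origin: average static appearance
    Xc = X - mu[:, None]                   # centre the templates
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mu, U[:, :K]                    # keep the dominant K directions
```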

It should be noted that the affine subspace has to be updated once a new frame becomes available, resulting in an increasing burden for the SVD calculation. The computational cost would explode if hundreds of high-dimensional representations were involved in the affine subspace construction. To mitigate this issue, we introduce an incremental learning technique to update the origin μ and the basis U efficiently. Consider a data matrix \(\boldsymbol{A}= [\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots ,\boldsymbol{x}_{n} ]\in \mathbb{R}^{D\times n}\), where each column \(\boldsymbol{x}_{i}\) denotes the appearance gathered from the i-th frame.

Update the origin: Suppose the mean vector of A, \(\boldsymbol{\mu}_{A}\), has already been obtained. When the appearance representations from m new frames become available, denoted as \(\boldsymbol{B}= [\boldsymbol{x}_{n+1},\boldsymbol{x}_{n+2},\ldots ,\boldsymbol{x}_{n+m} ] \in \mathbb{R}^{D\times m}\), the aim is to incrementally calculate the mean vector of the augmented data matrix \([\boldsymbol{A}\ \boldsymbol{B} ]\). Denoting the mean vector of B as \(\boldsymbol{\mu}_{B}\), the updated origin of the affine subspace spanned by \([\boldsymbol{A}\ \boldsymbol{B} ]\) can be calculated as:

$$ \boldsymbol{\mu}=\frac{n}{m+n}\boldsymbol{\mu}_{A}+ \frac{m}{m+n}\boldsymbol{\mu}_{B}. $$
(2)

Update the basis: Suppose the SVD of A, \(\boldsymbol{A}=\boldsymbol{U\Sigma V}^{\top}\), has already been obtained. When the appearance representations from the m new frames become available, denoted as \(\boldsymbol{B}= [\boldsymbol{x}_{n+1},\boldsymbol{x}_{n+2},\ldots ,\boldsymbol{x}_{n+m} ] \in \mathbb{R}^{D\times m}\), the aim is to incrementally compute the SVD of the augmented data matrix as \([\boldsymbol{A}\ \boldsymbol{B} ]=\boldsymbol{U}^{\prime}\boldsymbol{\Sigma}^{\prime } \boldsymbol{V}^{\prime \top}\). Denoting the component of B that is orthogonal to U as \(\tilde{\boldsymbol{B}}\), the SVD of \([\boldsymbol{A}\ \boldsymbol{B} ]\) can be partitioned as follows:

$$ [\boldsymbol{A}\ \boldsymbol{B} ] = [\boldsymbol{U}\ \tilde{\boldsymbol{B}} ]\,\boldsymbol{R} \begin{bmatrix} \boldsymbol{V} & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{I} \end{bmatrix}^{\top}, $$
(3)

where \(\boldsymbol{R}= \begin{bmatrix} \boldsymbol{\Sigma} & \boldsymbol{U}^{\top}\boldsymbol{B} \\ \boldsymbol{0} & \tilde{\boldsymbol{B}}^{\top}\boldsymbol{B} \end{bmatrix}\). To balance tracking efficiency and effectiveness, we retain only the first K eigenvectors in U. Since the size of R depends on the number of new frames m rather than on the number of frames in A, its SVD, \(\boldsymbol{R}=\tilde{\boldsymbol{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\boldsymbol{V}}^{\top}\), can be computed in constant time. Therefore, Eq. (3) can be reformulated as:

$$ [\boldsymbol{A}\ \boldsymbol{B} ] = [\boldsymbol{U}\ \tilde{\boldsymbol{B}} ]\,\tilde{\boldsymbol{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\boldsymbol{V}}^{\top} \begin{bmatrix} \boldsymbol{V} & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{I} \end{bmatrix}^{\top}. $$
(4)

Based on Eq. (4), the final eigenvectors \(\boldsymbol{U}^{\prime}\) can be obtained as \(\boldsymbol{U}^{\prime}= [\boldsymbol{U}\ \tilde{\boldsymbol{B}} ]\tilde{\boldsymbol{U}}\), and the corresponding eigenvalues as \(\boldsymbol{\Sigma}^{\prime}=\tilde{\boldsymbol{\Sigma}}\). After obtaining \(\boldsymbol{U}^{\prime}\), we retain its first K eigenvectors to represent the basis of the updated affine subspace.
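A compact numpy sketch of the incremental update described by Eqs. (2)–(4) is given below; it is an illustrative re-implementation under the stated formulas (variable names are ours), not the authors' code. Given the current origin, basis, singular values, and a block B of new templates, it returns the updated quantities without re-factorizing the full history.

```python
import numpy as np

def update_affine_subspace(mu_A, n, U, S, B, K=3):
    """Incremental update of the affine subspace, following Eqs. (2)-(4).

    mu_A: (D,) current origin; n: number of frames summarized so far;
    U: (D, K) current basis; S: (K,) current singular values;
    B: (D, m) block of m new vectorized templates.
    """
    m = B.shape[1]
    # Eq. (2): weighted combination of the old origin and the mean of B.
    mu = (n * mu_A + m * B.mean(axis=1)) / (n + m)

    # Eq. (3): split B into its projection onto span(U) and the orthogonal residual.
    UtB = U.T @ B                                  # U^T B
    B_tilde, _ = np.linalg.qr(B - U @ UtB)         # orthonormal residual basis (B~)
    R = np.block([[np.diag(S), UtB],
                  [np.zeros((B_tilde.shape[1], S.size)), B_tilde.T @ B]])

    # SVD of the small matrix R: cost depends on m and K, not on n.
    U_r, S_new, _ = np.linalg.svd(R, full_matrices=False)

    # Eq. (4): rotate the enlarged basis [U  B~] and keep the dominant K directions.
    U_new = np.hstack([U, B_tilde]) @ U_r
    return mu, U_new[:, :K], S_new[:K], n + m
```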

4 Approach

4.1 Basic discriminative correlation filter

Given the location and scale of the target in frame t, visual object tracking aims at predicting the location of the target in the next frame. In the learning stage, we train a discriminative filter that produces high responses around the target center and low responses over the background, i.e., DCF learns a filter that distinguishes the target from the nearby background. In general, a padded search window centered at the target location in frame t is extracted, with the corresponding feature representation \(\boldsymbol{x} = [x_{1},x_{2},\ldots ,x_{D} ]^{\top}\in \mathbb{R}^{D}\). The circulant matrix can be generated as [9]:

$$ \boldsymbol{X}= \begin{bmatrix} x_{1} & x_{2} & x_{3} & \cdots & x_{D} \\ x_{D} & x_{1} & x_{2} & \cdots & x_{D-1} \\ x_{D-1} & x_{D} & x_{1} & \cdots & x_{D-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{2} & x_{3} & x_{4} & \cdots & x_{1} \end{bmatrix}, $$
(5)

where each row in X can be considered an augmented sample, and therefore the DCF formulation employs X as the training data matrix [10]. Given labeled training sample pairs \(\{\boldsymbol{X},\boldsymbol{y} \}\), the learning stage of DCF is formulated as a ridge regression problem:

$$ \begin{aligned} \boldsymbol{w} & = \arg \min_{\boldsymbol{w}} \Vert \boldsymbol{X}\boldsymbol{w}-\boldsymbol{y} \Vert ^{2}+ \lambda \Vert \boldsymbol{w} \Vert ^{2} \\ & = \arg \min_{\boldsymbol{w}} \Vert \boldsymbol{x}\ast \boldsymbol{w}- \boldsymbol{y} \Vert ^{2}+ \lambda \Vert \boldsymbol{w} \Vert ^{2}, \end{aligned} $$
(6)

where λ is the balancing parameter for the regularization term, and ∗ denotes the cross correlation operator. According to the time-frequency convolution theorem, a closed-form solution in the frequency domain can be obtained as:

$$ \hat{\boldsymbol{w}}= \frac{\hat{\boldsymbol{x}}\odot \hat{\boldsymbol{y}}^{\ast}}{\hat{\boldsymbol{x}}\odot \hat{\boldsymbol{x}}^{\ast}+\lambda \boldsymbol{1}}, $$
(7)

where ⊙ denotes element-wise multiplication, 1 is an all-ones vector of the same size as \(\hat{\boldsymbol{x}}\), \(\hat{\cdot}\) denotes the discrete Fourier transform (DFT) representation, and \(\cdot ^{\ast}\) represents the complex conjugate.
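For reference, a minimal single-channel numpy sketch of the closed-form training step in Eq. (7) and the corresponding detection response is shown below; it uses a 1-D signal for clarity (the tracker operates on 2-D feature maps) and follows the element-wise response form later used in Eq. (17). This is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def train_dcf(x, y, lam=1e-2):
    """Closed-form single-channel DCF of Eq. (7); x, y: (D,) real vectors."""
    xf, yf = np.fft.fft(x), np.fft.fft(y)
    wf = (xf * np.conj(yf)) / (xf * np.conj(xf) + lam)   # element-wise division
    return wf

def detect(wf, z):
    """Response of a new search sample z, via element-wise products in the DFT domain."""
    return np.real(np.fft.ifft(np.fft.fft(z) * wf))
```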

4.2 Sparse discriminative correlation filter

Though promising tracking results have been achieved by the basic DCF formulations, the impact of the redundancy and noise in the high dimensional feature representations is not well addressed, especially in the feature maps extracted from deep CNN architectures, e.g., AlexNet, VGGNet, and ResNet. To this end, with the aim of highlighting the relevance between deep feature representations and the discriminative learning task, we propose to regularize the learned filters to be sparse. In addition, temporal smoothness is also emphasized in the proposed DCF learning objective to achieve consistency in filter training, improving the stability of the tracking model. In principle, the filter learning objective is formulated as follows:

$$ \begin{aligned} \boldsymbol{w} &= \arg \min _{\boldsymbol{w}} \Vert \boldsymbol{x}\ast \boldsymbol{w} - \boldsymbol{y} \Vert ^{2} + \lambda _{1} \Vert \boldsymbol{w} \Vert _{1} \\ &\quad {}+\lambda _{2} \Vert \boldsymbol{w}-\boldsymbol{w}_{t-1} \Vert ^{2}, \end{aligned} $$
(8)

where \(\|\cdot \|_{1}\) denotes the \(\ell _{1}\)-norm, \(\lambda _{1}\) and \(\lambda _{2}\) are the corresponding balancing parameters for the sparse regularization and temporal smoothness terms, respectively. Based on the formulation in Eq. (8), the model parsimony can be achieved for high-dimensional feature representations with temporally enforced stability.

4.3 Affine subspace discriminative correlation filters

In the implementation, multi-channel feature maps from a CNN are used to enhance the representation power, and we transform the objective in Eq. (8) from the single-channel to the multi-channel formulation as follows:

$$ \begin{aligned} \boldsymbol{w} &= \arg \min _{\boldsymbol{w}} \Biggl\Vert \sum_{c=1}^{C} \boldsymbol{x}^{c}\ast \boldsymbol{w}^{c}-\boldsymbol{y} \Biggr\Vert ^{2}+ \lambda _{1}\sum _{c=1}^{C} \bigl\Vert \boldsymbol{w}^{c} \bigr\Vert _{1} \\ &\quad{}+\lambda _{2}\sum_{c=1}^{C} \bigl\Vert \boldsymbol{w}^{c}-\boldsymbol{w}^{c}_{t-1} \bigr\Vert ^{2}. \end{aligned} $$
(9)

In general, the origin contains the global static appearance of the target, while each eigenvector in the basis captures a specific variation over the past tracking frames. To perform DCF learning in the affine subspace, we learn discriminative filters for the origin and the basis separately. Specifically, in frame t, the current affine subspace \(\mathcal{A}_{t}\) is represented by the origin μ and the K eigenvectors in U, \(\{\boldsymbol{u}_{1},\boldsymbol{u}_{2},\ldots ,\boldsymbol{u}_{K} \}\). We consider these \(K+1\) vectors, \(\{\boldsymbol{\mu}_{t}, \boldsymbol{u}_{1},\boldsymbol{u}_{2},\ldots ,\boldsymbol{u}_{K} \}\), as training data to train one main filter \(\boldsymbol{w}_{\boldsymbol{\mu}}\) and K auxiliary filters \(\{\boldsymbol{w}_{\boldsymbol{u1}},\boldsymbol{w}_{\boldsymbol{u2}},\ldots ,\boldsymbol{w}_{\boldsymbol{uK}} \}\) based on Eq. (9), as sketched below. As presented in Sect. 3, the proposed affine subspace represents both static and dynamic information of the target and is updated by incremental learning, so the spatio-temporal appearance variations encountered during tracking can be well modeled. Fed through the sparse DCF learning framework, this enhanced appearance representation yields better tracking performance, especially in challenging situations.
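The per-filter training just described can be summarized as follows. This is a hypothetical sketch: `sparse_dcf_admm` stands for any solver of Eq. (9)/(10), such as the one sketched after Sect. 4.4.3 below, and the function signature and the assumption that each template is already reshaped into a (D, C) feature map are ours.

```python
import numpy as np

def train_asdcf_filters(mu, U_basis, y, prev_filters, lam1, lam2, solver):
    """Train one main filter (from the origin) and K auxiliary filters (from the basis).

    mu: (D, C) origin template; U_basis: list of K basis templates, each (D, C);
    y: (D,) desired Gaussian response; prev_filters: K+1 filters from the previous
    update (temporal smoothness term); solver: e.g. sparse_dcf_admm (Sect. 4.4).
    """
    templates = [mu] + list(U_basis)               # the K+1 training vectors of Sect. 4.3
    filters = [solver(tpl, y, w_prev, lam1, lam2)
               for tpl, w_prev in zip(templates, prev_filters)]
    return filters[0], filters[1:]                 # main filter, auxiliary filters
```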

4.4 Optimization

Given the convexity of the proposed formulation in Eq. (9), we employ the augmented Lagrangian method to optimize the problem. We introduce a slack variable \(\boldsymbol{w}^{\prime}\) subject to the constraint \(\boldsymbol{w}^{\prime}=\boldsymbol{w}\). The Lagrangian can be expressed as follows:

$$ \begin{aligned} \mathcal{L} & = \Biggl\Vert \sum _{c=1}^{C}\boldsymbol{x}^{c}\ast \boldsymbol{w}^{c}- \boldsymbol{y} \Biggr\Vert ^{2} + \lambda _{1}\sum_{c=1}^{C} \bigl\Vert \boldsymbol{w}^{\prime c} \bigr\Vert _{1} \\ &\quad{}+\lambda _{2}\sum_{c=1}^{C} \bigl\Vert \boldsymbol{w}^{c}-\boldsymbol{w}^{c}_{t-1} \bigr\Vert ^{2} \\ &\quad{}+\frac{\nu}{2}\sum_{c=1}^{C} \biggl\Vert \boldsymbol{w}^{c}-\boldsymbol{w}^{ \prime c}+ \frac{\boldsymbol{\gamma}^{c}}{\nu} \biggr\Vert ^{2}, \end{aligned} $$
(10)

where γ is the Lagrange multiplier, with the same size as x, and ν is the penalty parameter associated with the slack variable \(\boldsymbol{w}^{\prime}\). We exploit the alternating direction method of multipliers (ADMM) [66] to iteratively solve the following sub-problems:

$$ \textstyle\begin{cases} \boldsymbol{w}=\arg \min_{\boldsymbol{w}} \mathcal{L} (\boldsymbol{w},\boldsymbol{w}^{ \prime},\boldsymbol{\gamma},\nu ), \\ \boldsymbol{w}^{\prime}=\arg \min_{\boldsymbol{w}^{\prime}} \mathcal{L} (\boldsymbol{w},\boldsymbol{w}^{\prime},\boldsymbol{\gamma},\nu ), \\ \boldsymbol{\gamma}=\arg \min_{\boldsymbol{\gamma}} \mathcal{L} ( \boldsymbol{w},\boldsymbol{w}^{\prime},\boldsymbol{\gamma},\nu ). \end{cases} $$
(11)

4.4.1 Optimizing w

To optimize w, we exploit the circulant structure [1] and Parseval's theorem to transfer this sub-problem from the spatial domain to the frequency domain:

$$ \begin{aligned} &\min \Biggl\Vert \sum _{c=1}^{C}\hat{\boldsymbol{x}}^{c}\odot \hat{\boldsymbol{w}}^{c}-\hat{\boldsymbol{y}} \Biggr\Vert ^{2} +\lambda _{2}\sum_{c=1}^{C} \bigl\Vert \hat{\boldsymbol{w}}^{c}-\hat{\boldsymbol{w}}_{t-1}^{c} \bigr\Vert ^{2} \\ &\quad{}+\frac{\nu}{2}\sum_{c=1}^{C} \biggl\Vert \hat{\boldsymbol{w}}^{c}- \hat{\boldsymbol{w}}^{\prime c}+ \frac{\hat{\boldsymbol{\gamma}}^{c}}{\nu} \biggr\Vert ^{2}. \end{aligned} $$
(12)

A closed-form solution for the above sub-problem can be obtained as [67]:

$$ \hat{\boldsymbol{w}}_{i}= \biggl(\boldsymbol{I}- \frac{\hat{\boldsymbol{x}}_{i}\hat{\boldsymbol{x}}_{i}^{\top}}{\lambda _{2}+\nu /2+\hat{\boldsymbol{x}}_{i}^{\top}\hat{\boldsymbol{x}}_{i}} \biggr)\boldsymbol{g}, $$
(13)

where \(\boldsymbol{g}= (\hat{\boldsymbol{x}}_{i}\hat{y}_{i}+\nu \hat{\boldsymbol{w}}^{\prime}_{i}- \nu \hat{\boldsymbol{\gamma}}_{i}+\lambda _{2}\hat{\boldsymbol{w}}_{t-1\ i} )/ (\lambda _{2}+\nu )\), the vectors \(\hat{\boldsymbol{w}}_{i}\) (\(\hat{\boldsymbol{w}}_{i}= [\hat{w}_{i}^{1},\hat{w}_{i}^{2},\ldots , \hat{w}_{i}^{C} ]\in \mathbb{C}^{C}\)), \(\hat{\boldsymbol{x}}_{i}\), and \(\hat{\boldsymbol{w}}_{t-1\ i}\) denote the i-th spatial units of \(\hat{\boldsymbol{w}}\), \(\hat{\boldsymbol{x}}\) and \(\hat{\boldsymbol{w}}_{t-1}\), respectively, across all C channels, and \(i\in \{1,2,\ldots ,D\}\).

4.4.2 Optimizing \(\boldsymbol{w}^{\prime}\)

To optimize \(\boldsymbol{w}^{\prime}\), we need to minimize the following sub-problem:

$$ \min \lambda _{1}\sum_{c=1}^{C} \bigl\Vert \boldsymbol{w}^{\prime c} \bigr\Vert _{1}+ \frac{\nu}{2}\sum_{c=1}^{C} \biggl\Vert \boldsymbol{w}^{c}- \boldsymbol{w}^{\prime c}+ \frac{\boldsymbol{\gamma}^{c}}{\nu} \biggr\Vert ^{2}. $$
(14)

The soft-threshold shrinkage operator is used here to form a closed-form solution for each element \(w^{\prime c}_{i}\) in the vector \(\boldsymbol{w}^{\prime}\) separately:

$$ w^{\prime c}_{i}=\operatorname{sign} (p )\max \biggl(0, \vert p \vert -\frac{\lambda _{1}}{\nu} \biggr), $$
(15)

where \(p=w^{c}_{i}+\frac{\gamma ^{c}_{i}}{\nu}\), with \(w^{c}_{i}\) and \(\gamma ^{c}_{i}\) being the values corresponding to the elements at the i-th spatial unit and c-th channel in w and γ, respectively.

4.4.3 Optimizing multiplier γ and penalty ν

The multiplier γ and the penalty ν are updated at the end of each iteration as:

$$ \textstyle\begin{cases} \boldsymbol{\gamma} = \boldsymbol{\gamma} + \nu (\boldsymbol{w}-\boldsymbol{w}^{\prime} ), \\ \nu = \min (\rho \nu ,\nu _{\max} ), \end{cases} $$
(16)

where ρ is the parameter that controls the strictness of the penalty and \(\nu _{\max}\) is the corresponding upper threshold.
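The three updates can be composed into a compact iterative solver. The numpy sketch below follows Eqs. (12)–(16) as written (including the conjugation conventions of Eq. (13) as printed) and uses 1-D FFTs for clarity, whereas the actual tracker operates on 2-D feature maps; it is an illustrative re-implementation with default constants of our own choosing, not the authors' code.

```python
import numpy as np

def sparse_dcf_admm(x, y, w_prev, lam1, lam2, nu=1.0, rho=1.5, nu_max=1e3, iters=4):
    """ADMM sketch for the objective in Eq. (9)/(10).

    x: (D, C) real feature template; y: (D,) Gaussian label;
    w_prev: (D, C) filter from the previous update (temporal smoothness term).
    """
    D, C = x.shape
    xf, yf = np.fft.fft(x, axis=0), np.fft.fft(y)
    w = np.zeros((D, C)); w_slack = np.zeros((D, C)); gamma = np.zeros((D, C))
    for _ in range(iters):
        # --- w sub-problem: closed form in the frequency domain, Eq. (13) ---
        wpf = np.fft.fft(w_slack, axis=0)
        gf = np.fft.fft(gamma, axis=0)
        wprevf = np.fft.fft(w_prev, axis=0)
        g = (xf * yf[:, None] + nu * wpf - nu * gf + lam2 * wprevf) / (lam2 + nu)
        xtx = np.sum(xf * xf, axis=1, keepdims=True)   # per-pixel x^T x, as in Eq. (13)
        xtg = np.sum(xf * g, axis=1, keepdims=True)
        wf = g - xf * xtg / (lam2 + nu / 2.0 + xtx)
        w = np.real(np.fft.ifft(wf, axis=0))
        # --- w' sub-problem: soft-threshold shrinkage, Eq. (15) ---
        p = w + gamma / nu
        w_slack = np.sign(p) * np.maximum(0.0, np.abs(p) - lam1 / nu)
        # --- multiplier and penalty updates, Eq. (16) ---
        gamma = gamma + nu * (w - w_slack)
        nu = min(rho * nu, nu_max)
    return w
```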

4.5 ASDCF algorithm

We summarize our ASDCF in detail in two stages, i.e., tracking and learning.

4.5.1 Tracking stage

As shown in Fig. 2, given a new image in frame t and the target state predicted in frame \(t-1\) (target center \(p_{t-1}\), target width \(w_{t-1}\), and height \(h_{t-1}\)), we extract a search window \(\{\boldsymbol{I} \}\) centered at \(p_{t-1}\). The search window patch is of \(n^{\prime}\times n^{\prime}\) pixels and is resized to the basic search window size \(n\times n\). \(n^{\prime}\) is determined by the target size \(w_{t-1}\times h_{t-1}\) and the padding parameter ϱ as \(n^{\prime}= (1+\varrho )\sqrt{w_{t-1}\times h_{t-1}}\). We then extract the multi-channel features of the search window, \(\boldsymbol{x}\in \mathbb{R}^{D\times C}\). Given the filter model obtained from the previous update, i.e., one main filter \(\boldsymbol{w}_{\boldsymbol{\mu}}\) and K auxiliary filters \(\{\boldsymbol{w}_{\boldsymbol{u1}},\boldsymbol{w}_{\boldsymbol{u2}},\ldots ,\boldsymbol{w}_{\boldsymbol{uK}} \}\), the response map y can be efficiently calculated in the frequency domain as:

$$ \hat{\boldsymbol{y}}=\sum_{c=1}^{C} \hat{\boldsymbol{x}}^{c}\odot \hat{\boldsymbol{w}}^{c}_{\boldsymbol{\mu}}+ \lambda _{3}\sum_{k=1}^{K}\sum _{c=1}^{C} \bigl(\hat{\boldsymbol{x}}^{c}- \hat{\boldsymbol{\mu}}^{c} \bigr) \odot \hat{\boldsymbol{w}}^{c}_{\boldsymbol{uk}}, $$
(17)

where \(\lambda _{3}\) is a balancing parameter. The new target position corresponds to the maximal value of the response map y.
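A minimal sketch of the response computation in Eq. (17) is given below, assuming the search-window features, the subspace origin, and the learned filters are available as (D, C) arrays (1-D FFTs for clarity; the actual tracker uses 2-D feature maps). The function name and interface are ours.

```python
import numpy as np

def asdcf_response(x, mu, w_mu, w_aux, lam3=0.3):
    """Response map of Eq. (17).

    x, mu, w_mu: (D, C) arrays; w_aux: list of K auxiliary filters, each (D, C).
    """
    xf = np.fft.fft(x, axis=0)
    yf = np.sum(xf * np.fft.fft(w_mu, axis=0), axis=1)       # main-filter term
    diff_f = xf - np.fft.fft(mu, axis=0)                     # (x - mu) term of Eq. (17)
    for w_k in w_aux:                                        # auxiliary-filter terms
        yf += lam3 * np.sum(diff_f * np.fft.fft(w_k, axis=0), axis=1)
    y = np.real(np.fft.ifft(yf))
    return y, int(np.argmax(y))                              # response map and its peak
```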

Figure 2. Overview of the proposed ASDCF. Affine subspace generation and online updating are introduced in Sect. 3. Details of the affine sparse DCF formulation and optimization are presented in Sect. 4.3 and Sect. 4.4. The online tracking stage of our ASDCF is discussed in Sect. 4.5 (\(K=3\) in the illustration)

4.5.2 Learning stage

To balance accuracy and efficiency, our tracker performs filter training every 5 frames. In the filter learning stage, we first extract the 5 feature representations \(\{\boldsymbol{x}_{t-4},\boldsymbol{x}_{t-3},\ldots ,\boldsymbol{x}_{t} \}\) of the target appearance from frame \(t-4\) to frame t based on the tracking results. The affine subspace \(\mathcal{A}\) is then updated as described in Sect. 3. After obtaining \(\mathcal{A}\), the main filter \(\boldsymbol{w}_{\boldsymbol{\mu}}\) and the K auxiliary filters \(\{\boldsymbol{w}_{\boldsymbol{u1}},\boldsymbol{w}_{\boldsymbol{u2}},\ldots ,\boldsymbol{w}_{\boldsymbol{uK}} \}\) are trained according to Eqs. (10)–(16).

5 Evaluation

5.1 Implementation

To evaluate the performance of the proposed ASDCF, we implement the tracking algorithm on the MATLAB platform with an Intel i7 2.20 GHz CPU and an Nvidia GTX 1050Ti GPU. The detailed settings for the parameters used in Sect. 4.5 are as follows. The number of auxiliary filters is \(K=3\), corresponding to the number of eigenvectors used to represent the subspace. We set the basic window size \(n\times n = 240\times 240\) pixels and the padding parameter \(\varrho =4\). We equip the proposed ASDCF with both hand-crafted and deep CNN features. The hand-crafted set includes HOG and color names (CN) features with a 4-pixel cell size, using \(\lambda _{1}=10^{-5}\), \(\lambda _{2}=30\), and \(\lambda _{3}=0.3\). Specifically, the HOG (31 channels) and CN (10 channels) features are concatenated along the channel dimension to obtain the final hand-crafted feature representation \(\boldsymbol{x}\in \mathbb{R}^{3600\times 41}\). We use ResNet-50 (the output of layer 3) to extract deep feature representations with the MatConvNet toolbox [68], using regularization parameters \(\lambda _{1}=10^{-6}\), \(\lambda _{2}=5\), and \(\lambda _{3}=0.2\). The dimensionality of the ResNet-50 feature representation is \(\boldsymbol{x}\in \mathbb{R}^{225\times 1024}\).
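For convenience, the hyperparameters stated above can be collected as follows; the values are copied from the text, while the grouping and dictionary layout are our own.

```python
# Hyperparameters as reported in Sect. 5.1 (grouping is ours, values from the text).
ASDCF_PARAMS = {
    "K": 3,                      # number of auxiliary filters / basis vectors
    "window": (240, 240),        # basic search window size n x n (pixels)
    "padding": 4,                # padding parameter (varrho in the text)
    "handcrafted": {             # HOG (31 ch) + CN (10 ch), 4-pixel cells
        "lambda1": 1e-5, "lambda2": 30, "lambda3": 0.3,
        "feature_shape": (3600, 41),
    },
    "deep": {                    # ResNet-50, output of layer 3 (MatConvNet)
        "lambda1": 1e-6, "lambda2": 5, "lambda3": 0.2,
        "feature_shape": (225, 1024),
    },
}
```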

5.2 Evaluation metrics

We perform an experimental evaluation on 4 challenging benchmarks: OTB2013 [2], OTB2015 [3], UAV123 [4], and VOT2018 [6]. For OTB2013, OTB2015, and UAV123, we employ precision plots and success plots to measure the tracking performance [2]. The precision plot indicates the proportion of frames with the distance between the tracking results and the ground truth less than a certain number of pixels. The distance precision (DP) is defined by the corresponding value when the precision threshold is 20 pixels. Center location error (CLE) measures the mean distance between the centers of the tracking results and the ground truth values. The success plot describes the percentage of successful frames with a threshold ranging from 0 to 1. The target in a frame is considered successfully tracked if the overlap of the two bounding boxes exceeds a given threshold. The overlap precision (OP) is defined by the corresponding value when the overlap threshold is 0.5. The area under the curve (AUC) of the success plot quantifies the result in terms of overlap evaluation. For VOT2018, we use the expected average overlap (EAO), accuracy and robustness metrics for performance evaluation [69].
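As a reference for the metrics above, the short numpy sketch below computes DP, OP, CLE, and the AUC of the success plot from per-frame center errors and overlaps; it is our own utility for illustration, not the benchmark toolkit.

```python
import numpy as np

def tracking_metrics(center_err, overlaps):
    """center_err: per-frame center location error (pixels);
    overlaps: per-frame IoU between predicted and ground-truth boxes."""
    center_err, overlaps = np.asarray(center_err), np.asarray(overlaps)
    dp = np.mean(center_err <= 20)                 # distance precision at 20 pixels
    op = np.mean(overlaps > 0.5)                   # overlap precision at IoU 0.5
    cle = center_err.mean()                        # mean center location error
    thresholds = np.linspace(0, 1, 101)            # success plot thresholds
    success = [(overlaps > t).mean() for t in thresholds]
    auc = np.trapz(success, thresholds)            # area under the success plot
    return dp, op, cle, auc
```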

We compare our method against recent state-of-the-art tracking approaches, including A3DCF [64], KYS [70], ASRCF [71], VITAL [72], STRCF [19], ECO [20], C-COT [13], MCPF [56], MetaTracker [73], CREST [74], BACF [59], CACF [57], ACFN [75], CSRDCF [16], Staple [14], SiamFC [76], CFNet [40], SRDCF [15], DSST [47] and KCF [1]. For VOT2018, we compare our ASDCF with the top trackers in VOT2018, i.e., ECO, CFCF [77], UPDT [78], SiamRPN [58], LADCF [63], ULAST [79] and FCOS_MAML [80].

5.3 Ablation studies

The proposed ASDCF aims at improving discrimination by explicitly modeling the spatio-temporal appearance in an online updated affine subspace. In addition, spatial sparsity and temporal smoothness are also fused in the DCF formulation, decreasing the redundancy and noise from the high dimensional feature representations. Therefore, the ablation studies are conducted to verify the effectiveness of performing DCF learning in the affine subspace.

The corresponding results are reported in Table 1. According to Table 1, introducing the affine subspace (\(K>0\)) into the DCF framework improves tracking performance compared with single template learning (\(K=0\)). Performance improves continuously as the number of auxiliary filters increases up to \(K=3\); a slight degradation is then observed at \(K=4\) and \(K=5\). These results indicate that the model capacity in the affine subspace can be enhanced before saturation, reflecting the effectiveness of modeling appearance variation in the affine subspace. The best performance is achieved with 3 auxiliary filters, improving DP from 90.8% to 92.7% and AUC from 67.3% to 69.7%. The ablation studies demonstrate the merits of performing DCF learning in the updated affine subspace, as well as the necessity of explicitly modeling appearance variation during online tracking.

Table 1 Ablation performance on OTB2015 with/without affine subspace and the impact of using different numbers of auxiliary filters

5.4 Comparison with state-of-the-art methods

5.4.1 Quantitative performance

First, we report the precision plots and success plots on OTB2013 and OTB2015 in Fig. 3, with the numerical DP and AUC scores given in the corresponding legends. Based on the result curves, ASDCF outperforms the state-of-the-art trackers in both cases. On OTB2013, ASDCF achieves promising tracking results with 95.6% in DP, surpassing ECO and LADCF, which are among the strongest DCF-based trackers. On OTB2015, our ASDCF shows a consistent advantage over the state-of-the-art methods, achieving 92.7% in terms of DP and 69.7% in terms of AUC. In addition, the OP, CLE, and AUC metrics on these two datasets are reported in Table 2. Our ASDCF achieves the best OP and AUC scores on both OTB2013 and OTB2015. On OTB2015, ASDCF obtains accurate and robust tracking results with the best OP/CLE of \(87.9\%/9.5\) pixels. We credit this improvement to the effective affine subspace construction, which retains more discriminative information in the filter learning stage.

Figure 3. The experimental results on OTB2013 and OTB2015. Precision plots (with the DP score reported in the legend) and success plots (with the AUC score reported in the legend) are presented. Only the top ten trackers are shown for each metric

Table 2 Performance comparison of our ASDCF method with the state-of-the-art trackers, evaluated on OTB2013 and OTB2015 in terms of OP and CLE. The best three results are highlighted in red, blue and brown

We also report the precision plots and success plots on UAV123 in Fig. 4. As shown in the figure, the proposed ASDCF produces the best results in terms of both DP and AUC, outperforming the advanced DCF trackers ECO (by 2.0% and 0.6%), C-COT (by 5.0% and 3.1%), and LADCF (by 5.1% and 1.6%) in terms of DP and AUC, respectively. By explicitly modeling the appearance variation during spatio-temporal changes, ASDCF exhibits adaptive context awareness with outstanding generalization.

Figure 4. The experimental results on UAV123. Precision plots (with the DP score reported in the legend) and success plots (with the AUC score reported in the legend) are presented

In addition, Table 3 reports the tracking performance on VOT2018. The VOT sequences contain diverse challenging factors with more severe appearance variations. Our ASDCF performs best in the EAO metric, achieving a relative gain of 1.2% over the DCF approach LADCF. Compared to the deep learning based method FCOS_MAML, trained offline with large-scale data, the proposed ASDCF reports a gain of 0.9% in terms of EAO. In terms of robustness, ASDCF also produces comparable results, ranking within the top 3 trackers. Overall, the proposed ASDCF realizes favorable tracking performance compared with other DCF approaches, i.e., ECO, CFWCR, UPDT, and LADCF, demonstrating the advantage of performing filter learning on the appearance representation provided by the affine subspace.

Table 3 The tracking results on VOT2018. The best three results are highlighted by red, blue and brown

Compared to state-of-the-art DCF-based trackers that extract representations from independent templates, the proposed affine subspace strengthens the representation capacity for latent appearance variations. With this more powerful representation, ASDCF learns more discriminative and robust filters, leading to precise and stable tracking even under severe appearance variations caused by various factors. Consequently, on these challenging benchmark datasets, the proposed ASDCF outperforms state-of-the-art DCF-based methods and some deep learning-based trackers.

5.4.2 Qualitative performance

Qualitative comparisons are presented in Fig. 5, which shows the tracking results of the state-of-the-art approaches, i.e., BACF, C-COT, CACF, ECO, VITAL, and the proposed ASDCF, on several challenging video sequences. The difficulties arise from rapid changes in the appearance of both the targets and their surroundings. Our ASDCF exhibits competitive performance on these challenges as it successfully identifies the pertinent spatio-temporal target patterns. Sequences with deformation (MotorRolling, Matrix) and out-of-view targets (Biker, Bird1) are successfully tracked by our method without any failures. Videos with rapid motion (Biker, Matrix) also benefit from our strategy of exploiting relevant deep channels to enhance discrimination. In particular, ASDCF handles in-plane and out-of-plane rotations well (Biker, MotorRolling), because the proposed affine subspace enables adaptive appearance updating with improved model capacity compared with other DCF approaches.

Figure 5. A qualitative comparison of our ASDCF method with state-of-the-art trackers, including BACF [59], CACF [57], C-COT [13], ECO [20], and VITAL [72], on challenging video sequences from OTB2015 [3] (left column, top to bottom: Biker, MotorRolling, and Soccer; right column, top to bottom: Bird1, Matrix, and Shaking)

6 Conclusion

In this paper, we proposed an effective appearance model with outstanding performance by learning discriminative correlation filters in an adaptively updated affine subspace. The affine subspace enables effective spatio-temporal appearance representation, providing more discriminative clues than single template learning. A spatio-temporally regularized DCF formulation, accompanied by efficient optimization, also contributes to accurate and robust performance in the affine subspace. The quantitative and qualitative experimental results on tracking benchmarks demonstrate the consistent effectiveness of our method compared with state-of-the-art trackers. The merits of introducing the affine subspace into the DCF learning framework support the potential of exploring more effective representation spaces with spatio-temporal capacity in online visual object tracking.