Abstract
Discriminative correlation filters (DCF) with powerful feature descriptors have proven to be very effective for advanced visual object tracking approaches. However, due to the fixed capacity in achieving discriminative learning, existing DCF trackers perform the filter training on a single template extracted by convolutional neural networks (CNN) or hand-crafted descriptors. Such single template learning cannot provide powerful discriminative filters with guaranteed validity under appearance variation. To pinpoint the structural relevance of spatio-temporal appearance to the filtering system, we propose a new tracking algorithm that incorporates the construction of the Grassmannian manifold learning in the DCF formulation. Our method constructs the model appearance within an online updated affine subspace. It enables joint discriminative learning in the origin and basis of the subspace, achieving enhanced discrimination and interpretability of the learned filters. In addition, to improve tracking efficiency, we adaptively integrate online incremental learning to update the obtained manifold. To this end, specific spatio-temporal appearance patterns are dynamically learned during tracking, highlighting relevant variations and alleviating the performance degrading impact of less discriminative representations from a single template. The experimental results obtained on several well-known datasets, i.e., OTB2013, OTB2015, UAV123, and VOT2018, demonstrate the merits of the proposed method and its superiority over the state-of-the-art trackers.
1 Introduction
Visual object tracking is one of the most fundamental topics in pattern recognition and computer vision, and it plays a crucial role in a wide range of visual intelligent systems, e.g., medical image analysis, human-computer interaction, transportation intelligence, and robotics. Consistently and accurately tracking an arbitrary object in unconstrained scenarios is very challenging due to the deformable shape, changing aspect, and textural variations of the target. Among existing advanced tracking algorithms, discriminative correlation filter (DCF) based trackers [1] have exhibited promising performance in various benchmarks [2–4] and competitions such as the Visual Object Tracking (VOT) challenges [5, 6] and VisDrone [7, 8]. In general, the advantages of DCF include its spatial appearance model exploiting the circulant matrix structure [9] and efficient optimization in the frequency domain [10]. More recent innovations focus on scale detection [11], joint regularization [12], continuous domain mapping [13], multi-response fusion [14], etc.
The success of current advanced DCF trackers can be attributed to two main factors: spatial regularization and temporal fusion. Regarding the spatial regularization, as images and videos rectify the 2D planes from the view of a camera, the proposal of spatial regularization enables a direct improvement of the tracking performance by potentially endowing the learned classifiers with a specific attention mechanism, enhancing the model’s discrimination by focusing on less ambiguous and background regions [15–17]. Considering the temporal fusion techniques, advanced DCF trackers highlight the online appearance clues by gathering more historical target information or constructing temporally consistent constraints on the discriminative learning stage [18–20]. To this end, the above-mentioned spatio-temporal model methodologies have received continuous attention in the visual tracking community, especially for the powerful deep learning representations developed in recent years [21–23].
However, from the geometry viewpoint, the current DCF paradigm extracts the discriminative information from independent training templates (points), without unifying the spatio-temporal appearance jointly. Specifically, the multi-channel feature representations from different frames obtained by a pre-trained convolutional neural network (CNN) are simply inputs to the DCF learning stage with a moving average. Therefore, the model capacity against appearance variation can only be guaranteed within a limited \(\ell _{2}\)-norm ball around the training templates, impeding the generalization of the learned filters, as illustrated in Fig. 1. Motivated by this observation, we argue the necessity of constructing the appearance from independent points to spatio-temporal affine subspace. The relationships among multiple historical frames can be jointly considered in the affine subspace, effectively extending the model capacity. In our design, during the online tracking process, all the previous frames are collected to construct the affine subspace, consisting of an origin and a linear subspace. To mitigate the increased calculation complexity in obtaining the linear subspace when a large number of frames are involved, we employ the incremental learning technique to update the origin and the linear subspace online, resulting in efficient affine subspace learning and updating.
In addition to constructing the affine subspace to reflect the spatio-temporal appearance, we also propose to endow the DCF model with parsimony and consistency constraints. In principle, with the development of robust visual features, e.g., the Haar descriptor, histogram of oriented gradient (HOG), and convolutional blocks of deep architectures (AlexNet, VGGNet, ResNet) [22–24], the volume of the feature representations has witnessed a continuous swell. Accordingly, these high dimensional feature maps provide improved discriminative information to achieve better tracking performance in distinguishing the target from its corresponding surroundings. However, there exists inevitable redundancy and noise in these feature maps. Therefore, to highlight the relevance between deep feature representations and the discriminative learning task, we propose to regularize the learned filters to be sparse. In addition, temporal smoothness is also emphasized in the DCF learning objective to achieve consistency in filter training, improving the stability of the tracking model.
To combine the DCF learning paradigm in the constructed affine subspace, we use the origin and the basis of the subspace to train one main filter and multiple auxiliary filters. In principle, the filter learning process corresponding to the origin is similar to that in the standard DCF paradigm, where a moving average template is employed to train the classifier in the current frame. The novelty of our affine subspace DCF (ASDCF) learning approach emphasizes the design of learning auxiliary filters corresponding to the basis of the subspace. Specifically, after obtaining the basis of the current K dimensional subspace, we propose to train K separate auxiliary filters corresponding to the K basis representation. To this end, each auxiliary filter is associated with specific appearance variation, improving the capacity of the proposed learning model. The proposed ASDCF injects the spatio-temporal information represented by the deep features to the online updated affine subspace, unifying the spatial visual features and the changing temporal variations, with improved discrimination and interpretability compared with the standard DCF framework.
Siamese-based trackers have recently achieved remarkable performance by learning, through an end-to-end network, to map the target template and the search instance into a feature space that is robust to appearance variation. However, Siamese-based trackers rely on a fixed template and do not model the appearance capacity, so their performance depends heavily on the invariance of the extracted features. In contrast, the affine subspace constructed in this work enhances the ability to model target appearance, increasing the tolerance of the learned model to spatio-temporal appearance variation of an object. Therefore, by combining the proposed affine subspace construction and updating with DCF learning, more accurate and stable target tracking results can be realized.
The main contributions of the proposed ASDCF tracking approach include the following:
1) A new affine subspace construction technique in online visual tracking to unify the spatial and temporal discriminative information, with an efficient incremental learning method to update the affine subspace during tracking.
2) An effective DCF learning objective imposing sparsity and temporal smoothness regularization for the filters.
3) A comprehensive evaluation of ASDCF on several well-known publicly available benchmarking datasets, including OTB2013 [2], OTB2015 [3], UAV123 [4], and VOT2018 [6]. The results support the advantage of the proposed ASDCF, with superior tracking performance compared with the state-of-the-art trackers.
The rest of this paper is organized as follows. In Sect. 2, we briefly review relevant tracking approaches for constructing spatio-temporal appearance models, especially the development of the DCF framework. The proposed affine subspace construction is presented in Sect. 3. The details of the proposed ASDCF method are introduced in Sect. 4, accompanied by an efficient optimization scheme. The implementation details and experimental results are reported in Sect. 5, with ablation studies and comparative analysis. Conclusions are presented in Sect. 6.
2 Related work
Existing visual object tracking approaches include generative learning and discriminative learning, e.g., image matching [25], statistical theory [26], particle filtering framework [27], subspace learning methodology [28], discriminative correlation filters [1], and deep neural networks [29]. In this section, we focus on introducing the development of the above-mentioned tracking approaches that are pertinent to our ASDCF. Continuously improved tracking performance has been evidenced by recent tracking benchmarking datasets and competitions such as VOT [6]. Readers are recommended to refer to recent surveys [3, 30–32] for detailed and comprehensive reviews of the visual tracking approaches.
2.1 Generative learning framework
Generative learning frameworks aim at learning the intrinsic target state distribution to represent the target appearance, based on which similarity metric or reconstruction error can be employed to calculate the final probabilities for the candidates in the next frame. Typical generative learning models in the early visual tracking research stage include optical flow [25] and mean-shift [33]. The basic assumptions behind these two methodologies are consistent brightness and limited appearance variations. Although these two methods provide complete mathematical derivations to model the visual tracking task, their rigid constraints cannot satisfy the real-world scenarios, resulting in poor tracking performance when processing challenging videos. To enhance the tracking robustness, the particle filtering system is applied to visual tracking [27, 34] to estimate the posterior distribution of the target via Bayes’s theorem and sampling techniques. Specifically, the conditional distribution is approximated via the similarity between the current samples and the model distribution, providing nonlinear inference for the tracking scope. It should be noted that improved performance can be achieved with the increasing number of involved particles while sacrificing the model efficiency. Due to the convenience that the particle filtering system is an external predicting framework, it has been widely studied and extended to fuse with other generative methods, e.g., sparse subspace representations and low-rank representations [35–37]. In principle, the subspace-based tracking paradigm has received wide attention since the proposal of the incremental subspace learning scheme [28], which assumes that the target can be linearly represented by its corresponding eigenvectors. Sparse trackers assume the target to be sparsely represented by an over-complete dictionary. Accordingly, the representation coefficients and reconstruction errors are used to gauge the quality of candidates. 
Furthermore, low-rank constraints have been proposed to increase the relevance of particles by suppressing spurious information [37].
The advantage of the generative learning framework lies in enlarging the tracking model capacity via carefully designed appearance representation and inference systems. However, generative tracking methods suffer from the limitation of neglecting the background appearance, resulting in less discriminative performance.
2.2 Discriminative learning framework
In addition to generative learning methods, various classification methods, such as support vector machine [26], multiple instance boosting [38], and linear regression [10], have been employed in constructing learning models in a discriminative manner, exploring the discriminative information between the target region and its surroundings. Discriminative learning approaches formulate the tracking task as a classification or regression problem, aiming at directly inferring the output of a sampling candidate by estimating the conditional distribution of labels for the given inputs. Therefore, the optimal sampling candidate with the maximal response is selected as the final tracking result. However, a common limitation of the above discriminative trackers is that the initialization of the learning model is performed in the initial frame with insufficient appearance information, without guaranteed tracking robustness for the following frames. More recently, Siamese networks [29, 39–41] have been successfully applied in visual tracking. Taking advantage of large annotated tracking datasets, deep architectures, and powerful graphics processing units, Siamese networks achieve efficient visual tracking by performing template matching in the learned feature embedding space.
Compared with basic generative learning approaches, discriminative methods developed a comparatively more robust modeling paradigm that extracts and analyzes appearance from both foreground and background, achieving better tracking performance.
2.3 Discriminative correlation filter
DCF belongs to the discriminative learning paradigm, and we provide a detailed account of its development in this subsection as it is the baseline of our proposed ASDCF. The seminal work of the DCF framework is minimum output sum of squared error (MOSSE) [42], which formulates the tracking task as discriminative filter learning [43] rather than template matching [44], achieving improved tracking efficiency. Based on this modeling technique, the concept of circulant matrix [9] is introduced to DCF by CSK [10] with an enlarged search window, enabling the generation of more negative training samples in the discriminative filter learning stage. To further explore the potential of the DCF framework, spatial-temporal context information [45] and kernel modeling technique [1] are leveraged to improve the learning formulation by involving local appearance and nonlinear metrics, respectively. In recent years, the DCF paradigm has further been extended by exploiting scale detection [46, 47], structural patch analysis [48, 49], multi-clue fusion [14, 50, 51], sparse representation [36, 52, 53], support vector machine [54, 55], enhanced sampling mechanisms [56, 57] and end-to-end deep neural networks [29, 40, 58].
Despite the outstanding performance of the DCF framework in visual object tracking, it is still a very challenging task to achieve high-performance tracking for a spatio-temporal changing arbitrary object, especially in unconstrained scenarios. The main obstacles include the spatial boundary effect and temporal inconsistency. To alleviate the boundary effect problem caused by the circulant structure, SRDCF [15] proposes introducing spatial regularization in the DCF formulation, which allocates more filter energy for the central region and less energy for the surroundings using a pre-defined spatial smooth weighting function. A similar technique has been pursued by pruning the training samples or learned filters with a pre-defined binary mask [16, 59–62]. To achieve adaptive spatial regularization, LADCF [63] embeds dynamic spatial feature selection in the filter learning stage, activating the supportive spatial regions not only from the foreground but also from the background. Similarly, A3DCF [64] proposes an adaptive attribute-aware mechanism to learn channel-wise masks to enhance discriminative elements of feature maps while suppressing irrelevant features. ADTrack [65] adopts image pre-treatment to achieve mask generation for discriminative filter learning. The above spatial regularization approaches decrease the ambiguity emanating from the background and enable a relatively enlarged search window for DCF tracking. However, these approaches only consider information redundancy and imbalance along the spatial dimension. On the other hand, to mitigate temporal filter inconsistency, historical appearance information is rearranged in SRDCFdecon [18] and C-COT [13], with enhanced robustness and temporal stability, by gathering multiple previous frames in the filter learning stage.
In addition, to alleviate the computational burden caused by involving a large number of historical samples, ECO [20] decreases the inherent computational complexity by clustering historical frames in a generative sample space and employing projection matrix to reduce the channel numbers for the feature representations.
To advance the DCF modeling space, we introduce the affine subspace to enlarge the representative power for the potential appearance variations from a geometric viewpoint. Therefore, performing discriminative modeling in the affine subspace can unify the spatial and temporal discriminative information, enhancing the DCF capacity for challenging video sequences.
3 Affine subspace generation
To accommodate appearance variations for spatio-temporal changing objects, we propose to employ affine subspace to represent both static and dynamic information. An affine subspace can be formulated as:
where \(\boldsymbol{\mu}\in \mathbb{R}^{D}\) denotes the origin of the affine subspace, and U (\(\boldsymbol{U}= [\boldsymbol{u}_{1},\boldsymbol{u}_{2}\ldots ]\)) represents the corresponding basis of the subspace, as depicted in Fig. 1. Based on the formulation in Eq. (1), we can gather the historical appearance with an updated affine subspace, realizing extended representation capacity compared with a single template. Specifically, the origin μ reflects the weighted average static appearance from all the previous frames, while the basis U constructs the detailed variations during the tracking process. Here, the basis is obtained by calculating the dominant K eigenvectors of the subspace based on singular value decomposition (SVD).
It should be noted that the affine subspace has to be updated once a new frame is available, resulting in an increasing burden for the SVD calculation. Therefore, the computational burden would explode if hundreds of high-dimensional representations were involved in the affine subspace construction. To mitigate this issue, we propose to introduce an incremental learning technique to achieve efficient updates for the origin μ and the basis U. Consider a data matrix \(\boldsymbol{A}= [\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots ,\boldsymbol{x}_{n} ]\in \mathbb{R}^{D\times n}\), where each column \(\boldsymbol{x}_{i}\) denotes the appearance gathered from the i-th frame.
Update the origin: Suppose we have already obtained the mean vector of A as \(\boldsymbol{\mu}_{A}\). When the appearance representations from new m frames are available, denoted as \(\boldsymbol{B}= [\boldsymbol{x}_{n+1},\boldsymbol{x}_{n+2},\ldots ,\boldsymbol{x}_{n+m} ] \in \mathbb{R}^{D\times m}\), the aim is to incrementally calculate the mean vector of the new data matrix \([\boldsymbol{A}\ \boldsymbol{B} ]\). Denoting the mean vector of B as \(\boldsymbol{\mu}_{B}\), the updated origin for the affine subspace expanded by \([\boldsymbol{A}\ \boldsymbol{B} ]\) can be calculated as:
Update the basis: Suppose we have already obtained the SVD of A as \(\boldsymbol{A}=\boldsymbol{U\Sigma V}^{\top}\). When the appearance representations from the new m frames are available, denoted as \(\boldsymbol{B}= [\boldsymbol{x}_{n+1},\boldsymbol{x}_{n+2},\ldots ,\boldsymbol{x}_{n+m} ] \in \mathbb{R}^{D\times m}\), the aim is to incrementally calculate the SVD results of the new data matrix \([\boldsymbol{A}\ \boldsymbol{B} ]\) as \([\boldsymbol{A}\ \boldsymbol{B} ]=\boldsymbol{U}^{\prime}\boldsymbol{\Sigma}^{\prime } \boldsymbol{V}^{\prime \top}\). Denoting the component of B that is orthogonal to U as \(\tilde{\boldsymbol{B}}\), such that the SVD of \([\boldsymbol{A}\ \boldsymbol{B} ]\) can be partitioned as follows:
where \(\boldsymbol{R}\) denotes the middle factor of the partition in Eq. (3). To balance the tracking efficiency and effectiveness, we retain the first K eigenvectors in U. Because the size of \(\boldsymbol{R}\) depends only on K and the number of new frames, m, its SVD, \(\boldsymbol{R}=\tilde{\boldsymbol{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\boldsymbol{V}}^{\top}\), can be calculated in constant time regardless of the number of frames in A. Therefore, Eq. (3) can be formulated as:
Based on Eq. (4), the final eigenvectors \(\boldsymbol{U}^{\prime}\) can be obtained as \(\boldsymbol{U}^{\prime}= [\boldsymbol{U}\ \tilde{\boldsymbol{B}} ]\tilde{\boldsymbol{U}}\), and the corresponding eigenvalues are given by \(\boldsymbol{\Sigma}^{\prime}=\tilde{\boldsymbol{\Sigma}}\). After obtaining \(\boldsymbol{U}^{\prime}\), we retain the first K eigenvectors in \(\boldsymbol{U}^{\prime}\) to represent the basis of the updated affine subspace.
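The two update rules of this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function names `update_origin` and `update_basis` are hypothetical, \(\tilde{\boldsymbol{B}}\) is taken to be an orthonormal basis (obtained by QR) of the part of B orthogonal to U, and \(\boldsymbol{R}\) is assembled in the block form commonly used in incremental SVD schemes. Following the description above, the new block is appended without any mean-correction column.

```python
import numpy as np

def update_origin(mu_a, n, block_b):
    """Mean of [A B] from the mean of A (over n columns) and the new block B."""
    m = block_b.shape[1]
    mu_b = block_b.mean(axis=1)
    return (n * mu_a + m * mu_b) / (n + m)

def update_basis(U, S, block_b, K):
    """Left singular vectors/values of [A B] given the thin SVD factors U, S of A.

    U: (D, k) orthonormal columns, S: (k,) singular values, block_b: (D, m).
    Assumes the residual of B has full column rank (true for generic data).
    """
    k, m = S.size, block_b.shape[1]
    proj = U.T @ block_b                 # coordinates of B inside span(U)
    residual = block_b - U @ proj        # part of B orthogonal to U
    B_tilde, _ = np.linalg.qr(residual)  # orthonormal basis of that part, (D, m)

    # Small (k+m) x (k+m) matrix whose SVD replaces the SVD of [A B].
    R = np.zeros((k + m, k + m))
    R[:k, :k] = np.diag(S)
    R[:k, k:] = proj
    R[k:, k:] = B_tilde.T @ block_b

    U_r, S_r, _ = np.linalg.svd(R)
    U_new = np.hstack([U, B_tilde]) @ U_r
    return U_new[:, :K], S_r[:K]         # keep the K dominant directions
```

The cost of the SVD in `update_basis` depends on k + m only, not on the number of frames already absorbed into A, which is the point of the incremental scheme.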
4 Approach
4.1 Basic discriminative correlation filter
Given the location and scale of a target at frame t, visual object tracking aims at predicting the location of the target in the next frame. In the learning stage, we aim to train a discriminative filter that obtains high-value responses around the target center and low-value responses for the background. DCF is formulated to learn a filter that distinguishes the target from the near background. In general, a padded search window centered around the target location from frame t is extracted with corresponding feature representation \(\boldsymbol{x} = [x_{1},x_{2},\ldots ,x_{n} ]^{\top}\in \mathbb{R}^{D}\). The circulant matrix can be generated as [9]:
where each row in X can be considered an augmented sample, and therefore the DCF formulation employs X as the training data matrix [10]. Given labeled training sample pairs \(\{\boldsymbol{X},\boldsymbol{y} \}\), the learning stage of DCF is formulated as a ridge regression problem:
where λ is the balancing parameter for the regularization term, and ∗ denotes the cross correlation operator. According to the time-frequency convolution theorem, a closed-form solution in the frequency domain can be obtained as:
where ⊙ denotes the element-wise multiplication, \(\boldsymbol{1}\) is an all-ones vector sharing the same size with \(\hat{\boldsymbol{x}}\), \(\hat{\cdot}\) denotes the discrete Fourier transform (DFT) representation, and \(\cdot^{*}\) represents the complex conjugate.
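The equivalence between the ridge regression over the circulant data matrix and the element-wise solution in the frequency domain can be checked numerically. The sketch below is a 1-D illustration under assumptions: the function names are hypothetical, and row i of X is taken to be x cyclically shifted by i positions. With this shift convention the numerator is \(\hat{\boldsymbol{x}}\odot \hat{\boldsymbol{y}}\); under the opposite shift direction the conjugate moves to the numerator, so the placement of the conjugate here need not match Eq. (7) symbol for symbol.

```python
import numpy as np

def train_dcf_spatial(x, y, lam):
    """Ridge regression on the explicit circulant matrix (O(D^3))."""
    D = x.size
    X = np.stack([np.roll(x, i) for i in range(D)])  # row i = x shifted by i
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def train_dcf_frequency(x, y, lam):
    """Same filter from element-wise operations on DFTs (O(D log D))."""
    x_hat = np.fft.fft(x)
    y_hat = np.fft.fft(y)
    w_hat = x_hat * y_hat / (np.conj(x_hat) * x_hat + lam)
    return np.real(np.fft.ifft(w_hat))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)                            # toy 1-D feature vector
y = np.exp(-0.5 * (np.arange(16) - 8) ** 2 / 2.0**2)   # Gaussian-shaped label
w_spatial = train_dcf_spatial(x, y, lam=0.1)
w_freq = train_dcf_frequency(x, y, lam=0.1)
```

Both routes give the same filter up to floating-point error; only the frequency-domain route scales to realistic window sizes.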
4.2 Sparse discriminative correlation filter
Though promising tracking results have been achieved by the basic DCF formulations, the impact of the redundancy and noise in the high dimensional feature representations is not well addressed, especially in the feature maps extracted from deep CNN architectures, e.g., AlexNet, VGGNet, and ResNet. To this end, with the aim of highlighting the relevance between deep feature representations and the discriminative learning task, we propose to regularize the learned filters to be sparse. In addition, temporal smoothness is also emphasized in the proposed DCF learning objective to achieve consistency in filter training, improving the stability of the tracking model. In principle, the filter learning objective is formulated as follows:
where \(\|\cdot \|_{1}\) denotes the \(\ell _{1}\)-norm, \(\lambda _{1}\) and \(\lambda _{2}\) are the corresponding balancing parameters for the sparse regularization and temporal smoothness terms, respectively. Based on the formulation in Eq. (8), the model parsimony can be achieved for high-dimensional feature representations with temporally enforced stability.
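For orientation, one single-channel objective that is consistent with this description, and with the term \(\lambda _{2}\hat{\boldsymbol{w}}_{t-1}\) appearing in the optimization of Sect. 4.4.1, can be written as below. It is a sketch under assumptions: \(\boldsymbol{w}_{t-1}\) denotes the filter from the previous learning step, and the exact normalization of each term may differ from Eq. (8).

\[
\min_{\boldsymbol{w}}\ \|\boldsymbol{w}\ast \boldsymbol{x}-\boldsymbol{y} \|_{2}^{2}+\lambda _{1} \|\boldsymbol{w} \|_{1}+\lambda _{2} \|\boldsymbol{w}-\boldsymbol{w}_{t-1} \|_{2}^{2}
\]

The first term is the ridge-regression data term of Sect. 4.1, the second promotes sparse filters, and the third penalizes abrupt changes of the filter between consecutive learning steps.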
4.3 Affine subspace discriminative correlation filters
In the implementation, multi-channel feature maps from CNN are used to enhance the representation power, and we transform the objective in Eq. (8) from the single-channel to multi-channel formulation as follows:
In general, the origin contains the global static appearance of the target, while each eigenvector in the basis focuses on specific variation during the past tracking frames. To perform DCF learning in the affine subspace, we propose to learn the discriminative filters for the origin and the basis separately. Specifically, in frame t, the current affine subspace \(\mathcal{A}_{t}\) can be represented by the origin μ and the K eigenvectors in U, \(\{\boldsymbol{u}_{1},\boldsymbol{u}_{2},\ldots ,\boldsymbol{u}_{K} \}\). We consider the \(K+1\) vectors, \(\{\boldsymbol{\mu}_{t}, \boldsymbol{u}_{1},\boldsymbol{u}_{2},\ldots ,\boldsymbol{u}_{K} \}\), as our training data to train one main filter \(\boldsymbol{w}_{\boldsymbol{\mu}}\) and K auxiliary filters \(\{\boldsymbol{w}_{\boldsymbol{u1}},\boldsymbol{w}_{\boldsymbol{u2}},\ldots ,\boldsymbol{w}_{\boldsymbol{uK}} \}\) based on Eq. (9). As presented in Sect. 3, the proposed affine subspace can represent both static and dynamic information of the target and is updated by incremental learning. In this way, the spatio-temporal information of target appearance variations in tracking can be well modeled. Through the sparse DCF learning framework, this enhanced representation of target appearance is then used to learn the tracking model, which leads to better tracking performance, especially in challenging situations.
4.4 Optimization
Given the convexity of the proposed formulation in Eq. (9), we employ the augmented Lagrangian method to optimize the problem. Here, we introduce a slack variable \(\boldsymbol{w}^{\prime}=\boldsymbol{w}\) for the estimate. The Lagrange function can be expressed as follows:
where γ is the Lagrange multiplier with the same size as x, and ν is the corresponding penalty parameter for the slack variable \(\boldsymbol{w}^{\prime}\). We exploit the alternating direction method of multipliers [66] approach to iteratively optimize the following sub-problems:
4.4.1 Optimizing w
To optimize w, we exploit the circulant structure [1] and Parseval’s theorem to transfer the sub-problems from the original spatial domain to the frequency domain:
A closed-form solution for the above sub-problem can be obtained as [67]:
where \(\boldsymbol{g}= (\hat{\boldsymbol{x}}_{i}\hat{y}_{i}+\nu \hat{\boldsymbol{w}}^{\prime}_{i}- \nu \hat{\boldsymbol{\gamma}}_{i}+\lambda _{2}\hat{\boldsymbol{w}}_{t-1\ i} )/ (\lambda _{2}+\nu )\), the vectors \(\hat{\boldsymbol{w}}_{i}\) (\(\hat{\boldsymbol{w}}_{i}= [\hat{w}_{i}^{1},\hat{w}_{i}^{2},\ldots , \hat{w}_{i}^{C} ]\in \mathbb{C}^{C}\)), \(\hat{\boldsymbol{x}}_{i}\), and \(\hat{\boldsymbol{w}}_{t-1\ i}\) denote the i-th units of \(\hat{\boldsymbol{w}}\), \(\hat{\boldsymbol{x}}\) and \(\hat{\boldsymbol{w}_{t-1}}\), respectively, across all C channels, and \(i\in \{1,2,\ldots ,D\}\).
4.4.2 Optimizing \(\boldsymbol{w}^{\prime}\)
To optimize \(\boldsymbol{w}^{\prime}\), we need to minimize the following sub-problem:
The soft-threshold shrinkage operator is used here to form a closed-form solution for each element \(w^{\prime c}_{i}\) in the vector \(\boldsymbol{w}^{\prime}\) separately:
where \(p=w^{c}_{i}+\frac{\gamma ^{c}_{i}}{\nu}\), with \(w^{c}_{i}\) and \(\gamma ^{c}_{i}\) being the values corresponding to the elements at the i-th spatial unit and c-th channel in w and γ, respectively.
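The element-wise shrinkage step can be sketched as follows. This is an illustration under assumptions rather than a transcription of Eq. (15): the names `soft_threshold` and `update_slack` are hypothetical, and the threshold \(\lambda _{1}/\nu \) follows from assuming that the sub-problem has the common form \(\lambda _{1}\|\boldsymbol{w}^{\prime}\|_{1}+\frac{\nu}{2}\|\boldsymbol{w}-\boldsymbol{w}^{\prime}+\boldsymbol{\gamma}/\nu \|_{2}^{2}\); a different scaling of the quadratic term would change the threshold accordingly.

```python
import numpy as np

def soft_threshold(p, tau):
    """Shrink every entry of p towards zero by tau; entries within [-tau, tau] become 0."""
    return np.sign(p) * np.maximum(np.abs(p) - tau, 0.0)

def update_slack(w, gamma, nu, lam1):
    """Slack update w' = S_{lam1/nu}(w + gamma/nu), applied per spatial unit and channel."""
    p = w + gamma / nu
    return soft_threshold(p, lam1 / nu)
```

Because the operator acts independently on every element, the same call handles a filter of shape (D, C) without any loop, and entries whose magnitude falls below the threshold are set exactly to zero, which is what produces the sparse filters.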
4.4.3 Optimizing multiplier γ and penalty ν
The multiplier γ and the penalty ν are updated at the end of each iteration as:
where ρ is the parameter that controls the strictness of the penalty and \(\nu _{\max}\) is the corresponding upper threshold.
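A sketch of this update in the standard ADMM form is given below. It assumes an unscaled multiplier, which matches the shift \(\gamma ^{c}_{i}/\nu \) used in the shrinkage step, and a geometric growth of the penalty capped at \(\nu _{\max}\); the function name is hypothetical and the exact form of Eq. (16) may differ.

```python
import numpy as np

def update_multiplier_and_penalty(gamma, w, w_slack, nu, rho, nu_max):
    """One dual ascent step on gamma followed by a capped increase of nu."""
    gamma_new = gamma + nu * (w - w_slack)  # accumulate the constraint violation w - w'
    nu_new = min(rho * nu, nu_max)          # tighten the penalty, never beyond nu_max
    return gamma_new, nu_new
```

With \(\rho >1\), the penalty grows at each iteration, so the constraint \(\boldsymbol{w}^{\prime}=\boldsymbol{w}\) is enforced more strictly as the iterations proceed.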
4.5 ASDCF algorithm
We summarize our ASDCF in detail in two stages, i.e., tracking and learning.
4.5.1 Tracking stage
As shown in Fig. 2, given a new image in frame t and the predicted target state of frame \(t-1\) (target center \(p_{t-1}\), the target width, \(w_{t-1}\), and height \(h_{t-1}\)), we extract a search window \(\{\boldsymbol{I} \}\) centered around \(p_{t-1}\). The search window patch is of \(n^{\prime}\times n^{\prime}\) pixels. We re-size the patch to the \(n\times n\) basic search window size. \(n^{\prime}\) is determined by the target size \(w_{t-1}\times h_{t-1}\) and the padding parameter, ϱ as: \(n^{\prime}= (1+\varrho )\sqrt{w_{t-1}\times h_{t-1}}\). Then we extract multi-channel features of the search window as \(\boldsymbol{x}\in \mathbb{R}^{D\times C}\). Given the filter model obtained from the previous frame, one main filter \(\boldsymbol{w}_{\boldsymbol{\mu}}\) and K auxiliary filters \(\{\boldsymbol{w}_{\boldsymbol{u1}},\boldsymbol{w}_{\boldsymbol{u2}},\ldots ,\boldsymbol{w}_{\boldsymbol{uK}} \}\), the response map y can efficiently be calculated in the frequency domain as:
where \(\lambda _{3}\) is a balancing parameter. The new position corresponds to the maximal value in the response maps y.
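The tracking stage can be sketched as follows. The window-size formula is taken directly from the text; the fusion rule is an assumption, namely that the main response is added to \(\lambda _{3}\) times the sum of the auxiliary responses, with each response computed as a channel-summed correlation in the frequency domain. All function names are hypothetical, and the precise form of Eq. (17) may differ.

```python
import numpy as np

def search_window_side(w_prev, h_prev, padding):
    """Side length n' of the square search window, n' = (1 + padding) * sqrt(w * h)."""
    return (1.0 + padding) * np.sqrt(w_prev * h_prev)

def response_map(x_hat, w):
    """Correlation response of an (n, n, C) filter w, summed over channels."""
    w_hat = np.fft.fft2(w, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(np.conj(x_hat) * w_hat, axis=2)))

def fused_response(x, w_main, w_aux, lam3):
    """Assumed fusion: main response plus lam3 times each auxiliary response."""
    x_hat = np.fft.fft2(x, axes=(0, 1))
    y = response_map(x_hat, w_main)
    for w in w_aux:
        y = y + lam3 * response_map(x_hat, w)
    return y

def locate_peak(y):
    """Row and column index of the maximal response."""
    return np.unravel_index(np.argmax(y), y.shape)
```

For example, a target of \(50\times 32\) pixels with \(\varrho =4\) gives a search window of \(5\times 40=200\) pixels per side, which is then resized to the basic \(n\times n\) window before feature extraction.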
4.5.2 Learning stage
To balance the accuracy and efficiency, our tracker performs filter training every 5 frames. In the filter learning stage, we first extract the 5 feature representations, \(\{\boldsymbol{x}_{t-4},\boldsymbol{x}_{t-3},\ldots ,\boldsymbol{x}_{t} \}\) of the target appearance from frame \(t-4\) to frame t based on the tracking results. Then the affine subspace \(\mathcal{A}\) is updated according to Sect. 3. After obtaining \(\mathcal{A}\), the main filter \(\boldsymbol{w}_{\boldsymbol{\mu}}\) and K auxiliary filters \(\{\boldsymbol{w}_{\boldsymbol{u1}},\boldsymbol{w}_{\boldsymbol{u2}},\ldots ,\boldsymbol{w}_{\boldsymbol{uK}} \}\) are trained according to Eq. (10)-Eq. (16).
5 Evaluation
5.1 Implementation
To evaluate the performance of the proposed ASDCF, we implement the tracking algorithm in MATLAB on an Intel i7 2.20 GHz CPU with an Nvidia GTX 1050Ti GPU. The detailed settings for the parameters used in Sect. 4.5 are as follows. The number of auxiliary filters is set to \(K=3\), corresponding to the number of eigenvectors we use to represent the subspace. We set the basic window size to \(n\times n = 240\times 240\) pixels and the padding parameter to \(\varrho =4\). We equip the proposed ASDCF with both hand-crafted and deep CNN features. The hand-crafted set includes HOG and color names (CN) features, with a cell size of 4 pixels, \(\lambda _{1}=10^{-5}\), \(\lambda _{2}=30\), and \(\lambda _{3}=0.3\). Specifically, the HOG (31 channels) and CN (10 channels) features are concatenated along the channel dimension to obtain the final hand-crafted feature representation \(\boldsymbol{x}\in \mathbb{R}^{3600\times 41}\). We use ResNet-50 (the output of layer 3) to extract deep feature representations using the MatConvNet toolbox [68]. For the deep features, the regularization parameters are set to \(\lambda _{1}=10^{-6}\), \(\lambda _{2}=5\), and \(\lambda _{3}=0.2\). The dimensionality of the ResNet-50 feature representation is \(\boldsymbol{x}\in \mathbb{R}^{225\times 1024}\).
5.2 Evaluation metrics
We perform an experimental evaluation on 4 challenging benchmarks: OTB2013 [2], OTB2015 [3], UAV123 [4], and VOT2018 [6]. For OTB2013, OTB2015, and UAV123, we employ precision plots and success plots to measure the tracking performance [2]. The precision plot indicates the proportion of frames with the distance between the tracking results and the ground truth less than a certain number of pixels. The distance precision (DP) is defined by the corresponding value when the precision threshold is 20 pixels. Center location error (CLE) measures the mean distance between the centers of the tracking results and the ground truth values. The success plot describes the percentage of successful frames with a threshold ranging from 0 to 1. The target in a frame is considered successfully tracked if the overlap of the two bounding boxes exceeds a given threshold. The overlap precision (OP) is defined by the corresponding value when the overlap threshold is 0.5. The area under the curve (AUC) of the success plot quantifies the result in terms of overlap evaluation. For VOT2018, we use the expected average overlap (EAO), accuracy and robustness metrics for performance evaluation [69].
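The OTB-style metrics described above can be computed as in the following sketch. It is illustrative only: the function names are hypothetical, boxes are assumed to be given as rows of (x, y, width, height) with (x, y) the top-left corner, and the AUC is approximated by averaging the success rate over 21 evenly spaced overlap thresholds, which is a common convention but an assumption here.

```python
import numpy as np

def center_errors(pred, gt):
    """Euclidean distance between predicted and ground-truth box centres."""
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_c - gt_c, axis=1)

def overlaps(pred, gt):
    """Intersection over union of each predicted box with its ground truth."""
    left = np.maximum(pred[:, 0], gt[:, 0])
    top = np.maximum(pred[:, 1], gt[:, 1])
    right = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    bottom = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(right - left, 0, None) * np.clip(bottom - top, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def distance_precision(pred, gt, pixels=20.0):
    """DP: fraction of frames whose centre error is below the pixel threshold."""
    return float(np.mean(center_errors(pred, gt) < pixels))

def overlap_precision(pred, gt, threshold=0.5):
    """OP: fraction of frames whose overlap exceeds the threshold."""
    return float(np.mean(overlaps(pred, gt) > threshold))

def success_auc(pred, gt, thresholds=np.linspace(0.0, 1.0, 21)):
    """AUC of the success plot, approximated by the mean success rate."""
    return float(np.mean([overlap_precision(pred, gt, t) for t in thresholds]))
```

The mean of `center_errors` over a sequence gives the CLE. DP and OP are single points on the precision and success plots, whereas the AUC summarizes the whole success curve and is therefore less sensitive to the choice of a single threshold.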
We compare our method against recent state-of-the-art tracking approaches, including A3DCF [64], KYS [70], ASRCF [71], VITAL [72], STRCF [19], ECO [20], C-COT [13], MCPF [56], MetaTracker [73], CREST [74], BACF [59], CACF [57], ACFN [75], CSRDCF [16], Staple [14], SiamFC [76], CFNet [40], SRDCF [15], DSST [47] and KCF [1]. For VOT2018, we compare our ASDCF with the top trackers in VOT2018, i.e., ECO, CFCF [77], UPDT [78], SiamRPN [58], LADCF [63], ULAST [79] and FCOS_MAML [80].
5.3 Ablation studies
The proposed ASDCF aims at improving discrimination by explicitly modeling the spatio-temporal appearance in an online updated affine subspace. In addition, spatial sparsity and temporal smoothness are also fused in the DCF formulation, decreasing the redundancy and noise from the high dimensional feature representations. Therefore, the ablation studies are conducted to verify the effectiveness of performing DCF learning in the affine subspace.
The corresponding results are reported in Table 1. According to Table 1, introducing the affine subspace (\(K>0\)) in the DCF framework improves the tracking performance compared with single template learning (\(K=0\)). The performance improves steadily as the number of auxiliary filters increases up to \(K=3\), after which a slight degradation is observed at \(K=4\) and \(K=5\). These results indicate that the model capacity provided by the affine subspace increases until it saturates, reflecting the ability of the subspace to capture appearance variation. The best performance is achieved with 3 auxiliary filters, which improves DP from 90.8% to 92.7% and AUC from 67.3% to 69.7%. The ablation studies thus demonstrate the merits of performing DCF learning in the updated affine subspace, as well as the necessity of explicitly modeling appearance variation during online tracking.
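As a rough illustration of what the number of auxiliary filters \(K\) controls, the following Python sketch reconstructs a feature vector from an affine subspace given by an origin and \(K\) orthonormal basis vectors. It is not the authors' implementation: the function names, the plain-list vectors, and the toy data are assumptions made for illustration, and the DCF learning itself is omitted.

```python
def project_onto_affine_subspace(x, origin, basis):
    """Reconstruct x as origin + sum_k <x - origin, b_k> b_k.

    basis is a list of K orthonormal vectors (assumed, not checked).
    With K = 0 the reconstruction is the origin alone.
    """
    centered = [xi - oi for xi, oi in zip(x, origin)]
    recon = list(origin)
    for b in basis:
        coeff = sum(ci * bi for ci, bi in zip(centered, b))
        recon = [ri + coeff * bi for ri, bi in zip(recon, b)]
    return recon


def residual_norm(x, recon):
    """Euclidean norm of the reconstruction error."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, recon)) ** 0.5


# Toy example: a 3-dimensional sample and a growing basis.
x = [3.0, 4.0, 5.0]
origin = [1.0, 1.0, 1.0]
for basis in ([], [[1.0, 0.0, 0.0]], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]):
    recon = project_onto_affine_subspace(x, origin, basis)
    print(len(basis), recon, residual_norm(x, recon))
```

With \(K=0\) the reconstruction collapses to the origin, which mirrors single template learning; each added basis vector can only reduce the residual, which gives one intuition for why a few auxiliary filters help before the benefit saturates.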
5.4 Comparison with state-of-the-art methods
5.4.1 Quantitative performance
First, we report the precision plots and success plots on OTB2013 and OTB2015 in Fig. 3, with the numerical DP and AUC scores given in the corresponding legends. Based on the result curves, ASDCF exhibits superior performance against the state-of-the-art trackers in both cases. On OTB2013, ASDCF achieves 95.6% in DP and outperforms ECO and LADCF, which are among the strongest DCF-based trackers. On OTB2015, ASDCF maintains a consistent advantage over the state-of-the-art methods, achieving 92.7% in terms of DP and 69.7% in terms of AUC. In addition, the OP, CLE and AUC metrics on these two datasets are reported in Table 2. Our ASDCF achieves the best OP score and AUC on both OTB2013 and OTB2015. On OTB2015, ASDCF obtains accurate and robust tracking results, with the best OP/CLE of \(87.9\%/9.5\) pixels. We credit the performance improvement to the effective affine subspace construction, which retains more discriminative information in the filter learning stage.
We also report the precision plots and success plots on UAV123 in Fig. 4. As shown in the figure, the proposed ASDCF produces the best results in terms of both DP and AUC. ASDCF outperforms the advanced DCF trackers, i.e., ECO (by 2.0% and 0.6%), C-COT (by 5.0% and 3.1%), and LADCF (by 5.1% and 1.6%), respectively, in terms of DP and AUC. Therefore, by explicitly modeling the appearance variation during spatio-temporal changes, ASDCF exhibits adaptive context awareness with an outstanding generalization.
In addition, we report the tracking performance obtained on VOT2018 in Table 3. VOT sequences contain diverse challenging factors, with more severe appearance variations. Our ASDCF approach performs best in the EAO metric, achieving a relative gain of 1.2% compared to the DCF approach LADCF. Compared to the deep learning based method FCOS_MAML, which is trained offline with large-scale data, the proposed ASDCF reports a gain of 0.9% in terms of EAO. For robustness, ASDCF also produces comparable results and ranks within the top 3 trackers. Overall, the proposed ASDCF achieves favorable tracking performance compared with other DCF approaches, i.e., ECO, CFWCR, UPDT, and LADCF, demonstrating the advantage of performing filter learning based on the appearance representation provided by the affine subspace.
Compared to these state-of-the-art DCF-based trackers that extract representations from independent templates, the proposed affine subspace strengthens the representation capacity for latent appearance variations. With this more powerful representation, ASDCF can learn more discriminative and robust filters, leading to precise and stable tracking even in the presence of severe appearance variations caused by various factors. As a result, on these challenging benchmark datasets, the proposed ASDCF outperforms the state-of-the-art DCF-based methods and some deep learning-based trackers.
5.4.2 Qualitative performance
Qualitative comparisons on challenging video sequences are presented in Fig. 5, which shows the tracking results of several state-of-the-art approaches, i.e., BACF, C-COT, CACF, ECO, VITAL, and the proposed ASDCF. The difficulties arise from rapid changes in the appearance of both the targets and their surroundings. Our ASDCF exhibits competitive performance on these challenges as it successfully identifies the pertinent spatio-temporal target patterns. Sequences with deformations (MotorRolling, Matrix) and out-of-view targets (Biker, Bird1) are tracked by our method without failure. Videos with rapid motions (Biker, Matrix) also benefit from our strategy of exploring relevant deep channels to enhance discrimination. In particular, ASDCF handles in-plane and out-of-plane rotations (Biker, MotorRolling) well, because the proposed affine subspace enables adaptive appearance updating with improved model capacity compared with other DCF approaches.
6 Conclusion
In this paper, we proposed an effective appearance model with an outstanding performance by learning discriminative correlation filters in the adaptively updated affine subspace. The affine subspace enables effective spatio-temporal appearance representation, providing more discriminative clues than single template learning. A spatio-temporal regularized DCF formulation accompanied by efficient optimization also contributes to achieving accurate and robust performance in the affine subspace. The quantitative and qualitative experimental results on tracking benchmarking datasets demonstrate the consistent effectiveness of our method, compared with state-of-the-art trackers. The merits of introducing affine subspace to the DCF learning framework support the potential of exploring more effective representation spaces with spatio-temporal capacity in online visual object tracking.
Availability of data and materials
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Abbreviations
- ASDCF: affine subspace DCF
- AUC: area under the curve
- CLE: center location error
- CNN: convolutional neural network
- DCF: discriminative correlation filter
- DP: distance precision
- EAO: expected average overlap
- HOG: histogram of oriented gradient
- MOSSE: minimum output sum of squared error
- OP: overlap precision
- SVD: singular value decomposition
- VOT: visual object tracking
References
Henriques, J. F., Caseiro, R., Martins, P., & Batista, J. (2015). High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3), 583–596.
Wu, Y., Lim, J., & Yang, M. H. (2013). Online object tracking: a benchmark. In IEEE conference on computer vision and pattern recognition (pp. 2411–2418). Los Alamitos: IEEE.
Wu, Y., Lim, J., & Yang, M.-H. (2015). Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1834–1848.
Mueller, M., Smith, N., & Ghanem, B. (2016). A benchmark and simulator for uav tracking. In European conference on computer vision (pp. 445–461). Berlin: Springer.
Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Zajc, L. C., Vojir, T., Hager, G., Lukezic, A., Eldesokey, A., & Fernandez, G. (2017). The visual object tracking VOT2017 challenge results. In 2017 IEEE international conference on computer vision workshops (pp. 1949–1972). Los Alamitos: IEEE. https://doi.org/10.1109/ICCVW.2017.230.
Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Zajc, L. C., Vojir, T., Bhat, G., Lukezic, A., Eldesokey, A., Fernandez, G., et al. (2018). The sixth visual object tracking VOT2018 challenge results. In ECCV workshops 2018 (pp. 3–53). Berlin: Springer.
Du, D., Zhu, P., Wen, L., Bian, X., Ling, H., Hu, Q., et al. (2019). VisDrone-SOT2019: the vision meets drone single object tracking challenge results. In Proceedings of the IEEE international conference on computer vision workshops (pp. 199–212). Los Alamitos: IEEE.
Fan, H., Wen, L., Du, D., Zhu, P., Hu, Q., Ling, H., et al. (2020). VisDrone-SOT2020: the vision meets drone single object tracking challenge results. In European conference on computer vision (pp. 728–749). Berlin: Springer.
Gray, R. M. (2006). Toeplitz and circulant matrices: a review. Foundations and Trends in Communications and Information Theory, 2(3), 155–239.
Henriques, J. F., Caseiro, R., Martins, P., & Batista, J. (2012). Exploiting the circulant structure of tracking-by-detection with kernels. In European conference on computer vision (pp. 702–715). Berlin: Springer.
Danelljan, M., Khan, F. S., Felsberg, M., & Van De Weijer, J. (2014). Adaptive color attributes for real-time visual tracking. In IEEE conference on computer vision and pattern recognition (pp. 1090–1097). Los Alamitos: IEEE.
Xu, T., Feng, Z.-H., Wu, X.-J., & Kittler, J. (2019). Joint group feature selection and discriminative filter learning for robust visual object tracking. In Proceedings of the IEEE international conference on computer vision (pp. 7950–7960). Los Alamitos: IEEE.
Danelljan, M., Robinson, A., Khan, F. S., & Felsberg, M. (2016). Beyond correlation filters: learning continuous convolution operators for visual tracking. In European conference on computer vision (pp. 472–488). Berlin: Springer.
Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., & Torr, P. H. S. (2016). Staple: complementary learners for real-time tracking. In IEEE conference on computer vision and pattern recognition (Vol. 38, pp. 1401–1409). Los Alamitos: IEEE.
Danelljan, M., Hager, G., Khan, F. S., & Felsberg, M. (2015). Learning spatially regularized correlation filters for visual tracking. In IEEE international conference on computer vision (pp. 4310–4318). Los Alamitos: IEEE.
Lukezic, A., Vojir, T., Zajc, L. C., Matas, J., & Kristan, M. (2017). Discriminative correlation filter with channel and spatial reliability. In IEEE conference on computer vision and pattern recognition (pp. 4847–4856). Los Alamitos: IEEE.
Zhang, M., Wang, Q., Xing, J., Gao, J., Peng, P., Hu, W., & Maybank, S. (2018). Visual tracking via spatially aligned correlation filters network. In Proceedings of the European conference on computer vision (ECCV) (pp. 469–485). Berlin: Springer.
Danelljan, M., Hager, G., Khan, F. S., & Felsberg, M. (2016). Adaptive decontamination of the training set: a unified formulation for discriminative visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1430–1438). Los Alamitos: IEEE.
Li, F., Tian, C., Zuo, W., Zhang, L., & Yang, M.-H. (2018). Learning spatial-temporal regularized correlation filters for visual tracking. arXiv preprint. arXiv:1803.08679.
Danelljan, M., Bhat, G., Khan, F. S., & Felsberg, M. (2017). ECO: efficient convolution operators for tracking. In IEEE conference on computer vision and pattern recognition (pp. 6931–6939). Los Alamitos: IEEE.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems 25 (NIPS 2012) (pp. 1097–1105). Red Hook: Curran Associates.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9). Los Alamitos: IEEE.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). Los Alamitos: IEEE.
Liu, D., Cui, W., Jin, K., Guo, Y., & Qu, H. (2018). Deeptracker: visualizing the training process of convolutional neural networks. ACM Transactions on Intelligent Systems and Technology, 10(1), 1–25.
Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th international joint conference on artificial intelligence (IJCAI’81) (pp. 674–679). Los Altos: William Kaufmann.
Avidan, S. (2004). Support vector tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8), 1064–1072.
Arulampalam, M. S., Maskell, S., Gordon, N., & Clapp, T. (2002). A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2), 174–188.
Ross, D. A., Lim, J., Lin, R.-S., & Yang, M.-H. (2008). Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1–3), 125–141.
Li, B., Yan, J., Wu, W., Zhu, Z., & Hu, X. (2018). High performance visual tracking with Siamese region proposal network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8971–8980). Los Alamitos: IEEE.
Li, X., Hu, W., Shen, C., Zhang, Z., Dick, A., & Van Den Hengel, A. (2013). A survey of appearance models in visual object tracking. ACM Transactions on Intelligent Systems and Technology, 4(4), 1–48.
Li, A., Lin, M., Wu, Y., Yang, M. H., & Yan, S. (2016). Nus-pro: a new visual tracking challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 335–349.
Yao, R., Lin, G., Xia, S., Zhao, J., & Zhou, Y. (2020). Video object segmentation and tracking: a survey. ACM Transactions on Intelligent Systems and Technology, 11(4), 1–47.
Comaniciu, D., Ramesh, V., & Meer, P. (2000). Real-time tracking of non-rigid objects using mean shift. In IEEE conference on computer vision and pattern recognition (pp. 142–149). Los Alamitos: IEEE.
Hardegger, M., Roggen, D., Calatroni, A., & Tröster, G. (2016). S-smart: a unified Bayesian framework for simultaneous semantic mapping, activity recognition, and tracking. ACM Transactions on Intelligent Systems and Technology, 7(3), 1–28.
Zhang, S., Yao, H., Sun, X., & Liu, S. (2012). Robust visual tracking using an effective appearance model based on sparse coding. ACM Transactions on Intelligent Systems and Technology, 3(3), 1–18.
Zhang, T., Bibi, A., & Ghanem, B. (2016). In defense of sparse tracking: circulant sparse tracker. In 2016 IEEE conference on computer vision and pattern recognition (pp. 3880–3888). Los Alamitos: IEEE.
Zhang, T., Liu, S., Ahuja, N., Yang, M.-H., & Ghanem, B. (2015). Robust visual tracking via consistent low-rank sparse learning. International Journal of Computer Vision, 111(2), 171–190.
Babenko, B., Yang, M. H., & Belongie, S. (2011). Robust object tracking with online multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1619–1632.
Tao, R., Gavves, E., & Smeulders, A. W. M. (2016). Siamese instance search for tracking. In IEEE conference on computer vision and pattern recognition (pp. 1420–1429). Los Alamitos: IEEE.
Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., & Torr, P. H. S. (2017). End-to-end representation learning for correlation filter based tracking. In IEEE conference on computer vision and pattern recognition (pp. 5000–5008). Los Alamitos: IEEE.
Xu, T., Feng, Z.-H., Wu, X.-J., & Kittler, J. (2020). Afat: adaptive failure-aware tracker for robust visual object tracking. arXiv preprint. arXiv:2005.13708.
Bolme, D. S., Beveridge, J. R., Draper, B. A., & Lui, Y. M. (2010). Visual object tracking using adaptive correlation filters. In 2010 IEEE conference on computer vision and pattern recognition (pp. 2544–2550). Los Alamitos: IEEE.
Bolme, D. S., Draper, B. A., & Beveridge, J. R. (2009). Average of synthetic exact filters. In 2009 IEEE conference on computer vision and pattern recognition (pp. 2105–2112). Los Alamitos: IEEE.
Briechle, K., & Hanebeck, U. D. (2001). Template matching using fast normalized cross correlation. In Optical pattern recognition XII (Vol. 4387, pp. 95–103). Bellingham: International Society for Optics and Photonics.
Zhang, K., Zhang, L., Liu, Q., Zhang, D., & Yang, M. H. (2014). Fast visual tracking via dense spatio-temporal context learning. In European conference on computer vision (pp. 127–141). Berlin: Springer.
Li, Y., & Zhu, J. (2014). A scale adaptive kernel correlation filter tracker with feature integration. In European conference on computer vision workshops (pp. 254–265). Berlin: Springer.
Danelljan, M., Häger, G., Khan, F. S., & Felsberg, M. (2017). Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8), 1561–1575.
Li, Y., Zhu, J., & Hoi, S. C. H. (2015). Reliable patch trackers: robust visual tracking by exploiting reliable patches. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 353–361). Los Alamitos: IEEE.
Liu, S., Zhang, T., Cao, X., & Xu, C. (2016). Structural correlation filter for robust visual tracking. In IEEE conference on computer vision and pattern recognition (pp. 4312–4320). Los Alamitos: IEEE.
Tang, M., & Feng, J. (2015). Multi-kernel correlation filter for visual tracking. In IEEE international conference on computer vision (pp. 3038–3046). Los Alamitos: IEEE.
Xu, T., Feng, Z.-H., Wu, X.-J., & Kittler, J. (2020). Learning low-rank and sparse discriminative correlation filters for coarse-to-fine visual object tracking. IEEE Transactions on Circuits and Systems for Video Technology, 30(10), 3727–3739.
Zhang, T., Xu, C., & Yang, M.-H. (2018). Robust structural sparse tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 473–486.
Xu, T., Wu, X.-J., & Kittler, J. (2018). Non-negative subspace representation learning scheme for correlation filter based tracking. In 2018 24th international conference on pattern recognition (ICPR) (pp. 1888–1893). Los Alamitos: IEEE.
Wang, M., Liu, Y., & Huang, Z. (2017). Large margin object tracking with circulant feature maps. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 21–26). Los Alamitos: IEEE.
Zuo, W., Wu, X., Lin, L., Zhang, L., & Yang, M.-H. (2018). Learning support correlation filters for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(5), 1158–1172.
Zhang, T., Xu, C., & Yang, M.-H. (2017). Multi-task correlation particle filter for robust object tracking. In IEEE conference on computer vision and pattern recognition (Vol. 1, p. 3). Los Alamitos: IEEE.
Mueller, M., Smith, N., & Ghanem, B. (2017). Context-aware correlation filter tracking. In IEEE conference on computer vision and pattern recognition (pp. 1396–1404). Los Alamitos: IEEE.
Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., & Hu, W. (2018). Distractor-aware Siamese networks for visual object tracking. In European conference on computer vision (pp. 103–119). Berlin: Springer.
Galoogahi, H. K., Fagg, A., & Lucey, S. (2017). Learning background-aware correlation filters for visual tracking. In IEEE international conference on computer vision (pp. 1144–1152). Los Alamitos: IEEE.
Xu, T., Feng, Z.-H., Wu, X.-J., & Kittler, J. (2020). An accelerated correlation filter tracker. Pattern Recognition, 102, 107172.
Xu, L., Kim, P., Wang, M., Pan, J., Yang, X., & Gao, M. (2022). Spatio-temporal joint aberrance suppressed correlation filter for visual tracking. Complex & Intelligent Systems, 8(5), 3765–3777.
Xu, T., Feng, Z., Wu, X.-J., & Kittler, J. (2021). Adaptive channel selection for robust visual object tracking with discriminative correlation filters. International Journal of Computer Vision, 129(5), 1359–1375.
Xu, T., Feng, Z.-H., Wu, X.-J., & Kittler, J. (2019). Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking. IEEE Transactions on Image Processing, 28(11), 5596–5609.
Zhu, X.-F., Wu, X.-J., Xu, T., Feng, Z.-H., & Kittler, J. (2021). Robust visual object tracking via adaptive attribute-aware discriminative correlation filters. IEEE Transactions on Multimedia, 24, 301–312.
Li, B., Fu, C., Ding, F., Ye, J., & Lin, F. (2021). ADTrack: target-aware dual filter learning for real-time anti-dark UAV tracking. In 2021 IEEE international conference on robotics and automation (ICRA) (pp. 496–502). Los Alamitos: IEEE.
Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Petersen, K. B., Pedersen, M. S., et al. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.
Vedaldi, A., & Lenc, K. (2015). Matconvnet: convolutional neural networks for Matlab. In Proceedings of the 23rd ACM international conference on multimedia (pp. 689–692). New York: ACM.
Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Zajc, L. Č., et al. (2016). The visual object tracking VOT2016 challenge results. In ECCV 2016 workshops (pp. 777–823). Berlin: Springer.
Bhat, G., Danelljan, M., Van Gool, L., & Timofte, R. (2020). Know your surroundings: exploiting scene information for object tracking. In European conference on computer vision (pp. 205–221). Berlin: Springer.
Dai, K., Wang, D., Lu, H., Sun, C., & Li, J. (2019). Visual tracking via adaptive spatially-regularized correlation filters. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4670–4679). Los Alamitos: IEEE.
Song, Y., Ma, C., Wu, X., Gong, L., Bao, L., Zuo, W., Shen, C., Lau, R., & Yang, M.-H. (2018). Vital: visual tracking via adversarial learning. arXiv preprint. arXiv:1804.04273.
Park, E., & Berg, A. C. (2018). Meta-tracker: fast and robust online adaptation for visual object trackers. arXiv preprint. arXiv:1801.03049.
Song, Y., Ma, C., Gong, L., Zhang, J., Lau, R., & Yang, M.-H. (2017). Crest: convolutional residual learning for visual tracking. In IEEE international conference on computer vision (pp. 2555–2564). Los Alamitos: IEEE.
Choi, J., Chang, H. J., Yun, S., Fischer, T., Demiris, Y., & Choi, J. Y. (2017). Attentional correlation filter network for adaptive visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4807–4816). Los Alamitos: IEEE.
Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., & Torr, P. H. S. (2016). Fully-convolutional Siamese networks for object tracking. In European conference on computer vision (pp. 850–865). Berlin: Springer.
Gundogdu, E., & Alatan, A. A. (2018). Good features to correlate for visual tracking. IEEE Transactions on Image Processing, 27(5), 2526–2540.
Bhat, G., Johnander, J., Danelljan, M., Khan, F. S., & Felsberg, M. (2018). Unveiling the power of deep tracking. arXiv preprint. arXiv:1804.06833.
Shen, Q., Qiao, L., Guo, J., Li, P., Li, X., Li, B., Feng, W., Gan, W., Wu, W., & Ouyang, W. (2022). Unsupervised learning of accurate Siamese tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8101–8110). Los Alamitos: IEEE.
Wang, G., Luo, C., Sun, X., Xiong, Z., & Zeng, W. (2020). Tracking by instance detection: a meta-learning approach. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6288–6297). Los Alamitos: IEEE.
Funding
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. U1836218, 62106089).
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by TX, X-FZ and X-JW. The first draft of the manuscript was written by TX and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Xu, T., Zhu, XF. & Wu, XJ. Learning spatio-temporal discriminative model for affine subspace based visual object tracking. Vis. Intell. 1, 4 (2023). https://doi.org/10.1007/s44267-023-00002-1
DOI: https://doi.org/10.1007/s44267-023-00002-1