Adaptive Channel Selection for Robust Visual Object Tracking with Discriminative Correlation Filters

Discriminative Correlation Filters (DCF) have been shown to achieve impressive performance in visual object tracking. However, existing DCF-based trackers rely heavily on learning regularised appearance models from invariant image feature representations. To further improve the accuracy of DCF and provide a parsimonious model from the attribute perspective, we propose to gauge the relevance of multi-channel features for the purpose of channel selection. This is achieved by assessing the information conveyed by the features of each channel as a group, using an adaptive group elastic net inducing independent sparsity and temporal smoothness on the DCF solution. The robustness and stability of the learned appearance model are significantly enhanced by the proposed method, as the process of channel selection performs implicit spatial regularisation. We use the augmented Lagrangian method to optimise the discriminative filters efficiently. The experimental results obtained on a number of well-known benchmarking datasets demonstrate the effectiveness and stability of the proposed method. A superior performance over the state-of-the-art trackers is achieved using less than 10% of the deep feature channels.

Visual object tracking has a wide spectrum of practical applications in robotics, medical image analysis, intelligent transportation and human-computer interaction. Given the initial state of a target in the first frame of a video sequence, a tracker aims to automatically locate the target in the subsequent frames. Typical visual object tracking algorithms include particle filters (Sanjeev 2002), support vector machines (Avidan 2004), subspace representations with sparse and low-rank constraints (Zhang et al. 2015), and deep neural networks. They are invariably equipped with powerful image features such as histograms (Dalal and Triggs 2005), colour attributes (Weijer et al. 2009) and deep Convolutional Neural Network (CNN) features (Danelljan et al. 2017a). Despite the significant progress made in the tracking methodology and the ever-improving results, the fast-growing volume of video data with practical challenges, e.g. occlusion, non-rigid deformation, blur and background clutter, imposes increasingly strict requirements on the accuracy, speed and robustness of visual object tracking algorithms.
In order to mitigate the tension between the effectiveness and efficiency of traditional visual object tracking methods, the Discriminative Correlation Filters (DCF) tracking paradigm has been proposed and extensively studied (Henriques et al. 2012). The efficiency of its learning and localisation stages, involving all circularly augmented samples, is guaranteed by the property of the circulant matrix (Gray 2006). The learning of a correlation operator, formulated as a ridge regression problem, is accelerated by the Discrete Fourier Transform (DFT) with closed-form solutions in the frequency domain (Henriques et al. 2015). Exploiting the advantage of this framework, the recent improvements focusing on spatial regularisation (Danelljan et al. 2015; Kiani Galoogahi et al. 2017) and deep neural networks (Danelljan et al. 2016, 2017a; Valmadre et al. 2017) have achieved superior performance on benchmarking datasets (Wu et al. 2013; Liang et al. 2015; Mueller et al. 2016) and in competitions (Kristan et al. 2015, 2016, 2018). In addition, it has been demonstrated that feature representation plays the most important role in boosting the performance of visual object tracking (Wang et al. 2015). Compared with other image descriptors, deep Convolutional Neural Network (CNN) features are more intuitive and effective. However, hundreds to thousands of channels of deep features, some of which may be redundant, are directly fused in the DCF paradigm. The relationships among multiple deep channels have not been explored. Motivated by this observation, we investigate the relevance of high-dimensional multi-channel features in the learning framework to identify the group relationships between deep image features with the aim of adaptive channel selection.
To reflect the group character of multi-channel features, we impose an adaptive group elastic net regularisation on the DCF solution so as to simultaneously select relevant channels and enforce appearance model continuity across successive frames. A standard elastic net is the combination of $\ell_1$-norm and $\ell_2$-norm regularisation. It shrinks the variables towards the origin with a trade-off between bias and variance. In our proposal, we construct an adaptive elastic net by combining the $\ell_1$-norm regularisation, which induces group-independent sparsity, with an $\ell_2$-norm temporal smoothness constraint. It will be shown that this creates a combination of a group elastic net and an adaptive term which controls the shape of the net depending on the temporal smoothness of the estimated DCF. It will also be argued that our adaptive channel selection performs implicit spatial regularisation, in compliance with the principle of spatially regularised DCFs (Danelljan et al. 2015; Kiani Galoogahi et al. 2017).
In this paper, we propose a new DCF-based tracking algorithm equipped with an adaptive channel selection mechanism (ACS-DCF). An overview of our ACS-DCF algorithm is depicted in Fig. 1. To the best of our knowledge, this is the first study introducing adaptive channel selection in the formulation of DCF-based visual object tracking. It reduces the number of feature channels by structured regularisation using an adaptive group elastic net, which tends to induce sparsity across channels and smoothness across frames. The proposed channel selection strategy suppresses the perturbations injected by non-informative channels, as well as reducing the number of filters. In the learning stage, the augmented Lagrangian method is used to achieve fast optimisation. The experimental results obtained on OTB100, OTB2013 (Wu et al. 2013) and VOT2017/VOT2018 demonstrate the effectiveness and robustness of the proposed adaptive channel selection framework, delivering superior performance over the state-of-the-art trackers. In addition, the stability of the proposed method is confirmed by experiments in which random noise is added to the learned filter model. The impact of varying the regularisation parameters is analysed in the paper. A notable improvement of the tracking performance is observed, especially for deep CNN features, in experiments covering a wide range of regularisation parameter settings.
The main contributions of the proposed ACS-DCF method are:
- A new appearance model construction technique endowed with adaptive channel selection. Relevant channels are adaptively selected in the learning stage to reduce filter dimensionality as well as to enhance discrimination. Improved performance is achieved by our ACS-DCF even when only 10% of the deep feature channels are used for the DCF design.
- A spatio-temporal group variable selection method using a novel adaptive group elastic net regularisation. Independent sparsity and temporal smoothness are combined in our tracking framework to realise a robust channel selection mechanism. Thanks to the convexity of the proposed formulation, we employ an iterative optimisation technique for efficient filter learning.
- A deep analysis of the impact of each regularisation term as well as the channel selection ratio. The experimental results confirm the merits and effectiveness of the proposed adaptive channel selection strategy.
The rest of this paper is organised as follows. In Sect. 2, we briefly review related tracking techniques for constructing appearance models and extracting multi-channel features. The classical regularised DCF formulation is presented in Sect. 3. The proposed ACS-DCF method is introduced in Sect. 4 where an efficient optimisation scheme is developed. In Sect. 5, we present the details of the proposed tracking algorithm. The implementation details and experimental results are reported in Sect. 6, which also presents a component and stability analysis.

Related Work
In this section, we briefly review existing visual object tracking methods, focusing on the learning models and feature representation approaches. For a detailed account and comprehensive understanding of the visual object tracking literature, the reader is referred to recent surveys (Kristan et al. 2016). The context of the DCF paradigm with its advanced improvements is also presented. We discuss high-dimensional multi-channel features and their common implementations in DCF-based trackers as a prerequisite to analysing their properties, which provides supporting evidence and motivation for the approach proposed in this paper.

Learning Models
Learning models describe the mathematical framework underpinning the visual object tracking task. The most well-known learning concepts in the pioneering stages of visual object tracking include optical flow (Lucas and Kanade 1981) and mean-shift (Comaniciu et al. 2000). The key assumptions behind these approaches are brightness constancy and negligible appearance variations. As these are rarely satisfied, the methods invariably fail when processing challenging videos. To improve the tracking robustness, a particle filter was applied to visual object tracking (Sanjeev 2002) as a means of estimating the target posterior distribution. It is well known that more particles can achieve a better estimate, but only at the expense of growing computational complexity. As the particle filter paradigm is an external modelling framework, it has been successfully fused with other generative methods, e.g. sparse and low-rank subspace representations (Bao et al. 2012; Zhang et al. 2013, 2016, 2012, 2015). A change of paradigm was introduced by formulating visual object tracking as a target recognition problem. Various classification methods, such as support vector machines (Avidan 2004), multiple instance boosting (Babenko et al. 2011), and linear regression (Henriques et al. 2012), have been employed in constructing learning models, exploiting the discriminatory information between the target region and its surroundings. However, a common weakness of the above trackers is their limited robustness, as they initialise the learning model in the first frame, when the information about the target is limited. More recently, deep Siamese networks (Tao et al. 2016; Valmadre et al. 2017; Song et al. 2017; Wang et al. 2018; Li et al. 2018a; Xu et al. 2020b; Wang et al. 2019a, b) have been successfully applied in visual object tracking.
Taking advantage of large visual datasets, deep structures and powerful Graphics Processing Units, Siamese networks achieve efficient visual object tracking by performing template matching in a feature space of high-level abstraction. The DCF paradigm is the closest learning model and defines the baseline for the research presented in this paper. To achieve efficiency and adaptability in visual object tracking, the DCF paradigm has been intensively studied in recent years (Kristan et al. 2015, 2016). Almost all the top performing trackers in the recent VOT challenges are based on the DCF framework. The origins of the approach can be traced back to Bolme et al., who proposed the minimum output sum of squared error (MOSSE) filter (Bolme et al. 2010) to realise adaptive correlation filtering in the frequency domain. This fundamental work was then extended to kernel methods (Henriques et al. 2015) and theoretically interpreted in terms of the circulant structure (Henriques et al. 2012). Exploiting the basic formulation, Danelljan et al. realised effective tracking by learning spatially regularised discriminative correlation filters (SRDCF) (Danelljan et al. 2015). To emphasise colour information, Sum of Template and Pixel-wise Learners (Staple) was proposed to combine DCF with colour histograms. In addition, context-aware (Mueller et al. 2017) and background-aware (Kiani Galoogahi et al. 2017) correlation filters were proposed to explore relevant target surroundings and so enhance the discriminative capability of a DCF tracker. To further improve the tracking accuracy, Danelljan et al. proposed sub-grid tracking by learning continuous convolution operators (C-COT) (Danelljan et al. 2016). Besides performing spatial regularisation with a predefined energy distribution, Xu et al. proposed to learn adaptive discriminative correlation filters (LADCF) using spatial feature selection (Xu et al. 2019).
In contrast, the proposed ACS-DCF method focuses on channel selection, improving the tracking performance by assigning discriminative attributes (feature channels) in each frame. In this paper, we improve the existing DCF framework by incorporating an adaptive channel selection mechanism to identify the most effective multi-channel features, including both hand-crafted features and advanced deep CNN features.

Feature Representation
Another important component of appearance modelling is target representation, which has been demonstrated to play the most essential role in high-performance visual object tracking (Wang et al. 2015;Gundogdu and Alatan 2018;Xu et al. 2020c). We roughly divide existing target representation approaches into three categories: Region of Interest (ROI-) based features, histogram-based features and multi-channel features.
Typical examples of ROI-based features include Scale-Invariant Feature Transform (SIFT) (Lowe 1999) and Speeded Up Robust Features (SURF) (Bay et al. 2006). Both of them are local representations that convey information about the visual appearance of the target within the search region. As such, ROI-based features are designed to preserve the local pattern. They are suitable for videos with stable content. However, their use is frequently invalidated by non-rigid deformations, 3D object motion and motion blur.
To extract deformation-invariant image features, histogram-based feature extraction methods have been proposed to capture the distribution of the characteristics adopted for modelling visual appearance inside an image patch. Specific examples include colour histograms, which have been shown to exhibit impressive performance in visual object tracking (Comaniciu et al. 2000; Sanjeev 2002; Lukezic et al. 2017). However, histogram-based features depend only on the intensity values in the image patch, ignoring shape and texture.
To simultaneously acquire local invariance and incorporate spatial context in visual object tracking, multi-channel features have been studied extensively. Histogram of Oriented Gradients (HOG) (Dalal and Triggs 2005), which has also been widely used in many other computer vision and pattern recognition applications, is the seminal representation for visual object tracking that re-arranges the gradient information into orientation bins (Zhu et al. 2006; Feng et al. 2015, 2017). The Colour Names feature extraction method (CN) (Weijer et al. 2009) maps the original 3-channel RGB patch to a 10-channel image feature representation, enhancing the discrimination among specific colour attributes. Recent deep learning architectures, e.g., AlexNet (Krizhevsky et al. 2012), VGG (Simonyan and Zisserman 2014), GoogLeNet (Szegedy et al. 2015) and ResNet (He et al. 2016), generate even more powerful multi-channel features for high-performance visual object tracking. These deep multi-channel features, learned from large image datasets, contain hundreds to thousands of channels, offering outstanding discrimination. However, for a specific target, many of these channels are irrelevant and they also retain a lot of redundancy. This deficiency has not been addressed in visual object tracking research.
Although promising performance has been achieved by combining the regularised DCF paradigm and powerful features (Danelljan et al. 2016; Xu et al. 2020a; Sun et al. 2019), existing studies usually consider feature channels as being equally important. In order to mitigate this shortcoming, Lukezic et al. (2017) proposed the concept of channel reliability, where each channel is weighted by analysing the ratio of the first and second major mode in the response map. Sun et al. (2018a), on the other hand, acknowledged reliability in terms of spatial masks. However, such channel-wise weighting strategies ignore the relevance of a channel in the context of other channels. Moreover, the diversity and redundancy of multi-channel features are not considered. In contrast, in our approach, we perform an adaptive channel selection of multi-channel features in the learning stage of DCF. The adaptive spatio-temporal group variable selection is achieved by imposing channel sparsity and temporal smoothness of the learned filters in successive video frames.
Given the multi-channel training features and the corresponding labels in the form of a response map, the aim of visual object tracking is to distinguish the target from its background in the next frame. We follow the DCF paradigm (Henriques et al. 2015) to formulate our objective as a regularised least squares problem that learns the multi-channel discriminative filters $\mathcal{W} \in \mathbb{R}^{N \times N \times C}$ (frame index $t$ is omitted for simplification):
$$\mathcal{W} = \arg\min_{\mathcal{W}} \left\| \sum_{j=1}^{C} W^j \circledast X^j - Y \right\|_F^2 + \lambda \mathcal{R}(\mathcal{W}), \tag{1}$$
where $\circledast$ is the circular convolution operator (Henriques et al. 2012), and $X^j \in \mathbb{R}^{N \times N}$ and $W^j \in \mathbb{R}^{N \times N}$ are the $j$-th channel feature map and the corresponding discriminative filter slice. $Y$ is the predefined desired response map in the form of a 2D Gaussian (Henriques et al. 2012) that highlights the target's centre. $\mathcal{R}(\mathcal{W})$ is the regularisation term, weighted by $\lambda$, corresponding to the prior assumptions constraining the filters.
The conventional DCF paradigm employs the $\ell_2$-norm (Frobenius norm for a matrix, $\sum_{j=1}^{C} \|W^j\|_F^2$) constraint to formulate the main objective as a ridge regression problem. A closed-form solution can be directly obtained in the frequency domain (Henriques et al. 2015). Though the $\ell_2$-norm ($F$-norm) regularisation achieves computational efficiency, it sacrifices the parsimony of the discriminative filter. To achieve enhanced discrimination and parsimony in the regularised DCF formulation, we propose a group variable selection mechanism to realise adaptive spatio-temporal channel selection.
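As a point of reference for the ridge regression baseline, the closed-form frequency-domain solution can be sketched for the single-channel case. This is a minimal illustration, not the authors' implementation: the function names, the Gaussian width and the value of `lam` are assumptions chosen for the example, and the single-channel simplification avoids the per-pixel matrix inverse needed for multiple channels.

```python
import numpy as np

def gaussian_response(n, sigma=2.0):
    """Desired response Y: 2D Gaussian peaked at index (0, 0), wrapped circularly."""
    k = np.arange(n)
    d = np.minimum(k, n - k)                  # circular distance to index 0
    g = np.exp(-0.5 * (d / sigma) ** 2)
    return np.outer(g, g)

def train_ridge_dcf(x, y, lam=1e-2):
    """Single-channel filter minimising ||w (*) x - y||^2 + lam * ||w||^2.

    The circular convolution becomes an element-wise product after the DFT,
    so each frequency is solved independently.
    """
    x_hat, y_hat = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(x_hat) * y_hat / (np.abs(x_hat) ** 2 + lam)

def respond(w_hat, z):
    """Response map of the learned filter on a new patch z."""
    return np.real(np.fft.ifft2(w_hat * np.fft.fft2(z)))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))             # toy training patch (assumption)
w_hat = train_ridge_dcf(x, gaussian_response(32))

# A circularly shifted copy of the patch should move the response peak accordingly.
shifted = np.roll(x, (3, 5), axis=(0, 1))
peak = np.unravel_index(np.argmax(respond(w_hat, shifted)), (32, 32))
```

Because every frequency is regularised identically, no channel or spatial unit is ever driven exactly to zero, which is the lack of parsimony noted above.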

Adaptive Channel Selection
In this section, we introduce our proposed learning framework that achieves adaptive channel selection for a multi-channel feature representation. To be more specific, an adaptive group elastic net is proposed to implement the group variable selection. A unique characteristic of the proposed learning framework is that the relevance of spatial feature channels and their temporal smoothness are simultaneously enforced. To gain an intuitive understanding of the proposed approach, we elaborate the potential properties of the selected channels from the perspectives of attribute selection and spatial unit selection. A qualitative evaluation is also carried out to provide supporting evidence for the asserted merit and effectiveness of the proposed adaptive channel selection method.

Spatio-temporal Adaptive Elastic Net
Motivated by the seminal work on variable selection (Yuan and Lin 2006; Zou and Hastie 2005; Nie et al. 2010), in the proposed approach, we embed adaptive channel selection in the learning framework by means of an adaptive group elastic net. First, the elements in $\mathcal{W}$ are arranged into groups according to their third dimension (channel). Such a grouping operator forms natural clusters of variables for the high-dimensional filtering system. To fuse the information conveyed by each group, we need a balancing function $\sigma(\cdot)$ to gauge its reliability. Here, we define the balancing function, $\sigma(\cdot)$, as the Frobenius norm of each $N \times N$ matrix $W^j$, i.e., $\sigma_j = \sigma(W^j) = \|W^j\|_F$, and form a balancing reliability vector $\boldsymbol{\sigma} = [\sigma_1, \sigma_2, \ldots, \sigma_C]$ for use in the adaptive group elastic net to impose spatio-temporal regularisation.
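The balancing reliability vector can be computed in one line. The sketch below assumes the filter tensor is stored as an array of shape (N, N, C); the function name is hypothetical.

```python
import numpy as np

def channel_reliability(W):
    """sigma_j = ||W^j||_F for a filter tensor W of shape (N, N, C)."""
    return np.sqrt(np.sum(np.abs(W) ** 2, axis=(0, 1)))

W = np.zeros((4, 4, 3))
W[..., 0] = 1.0        # 16 unit entries, Frobenius norm 4
W[0, 0, 2] = -3.0      # single entry, Frobenius norm 3
sigma = channel_reliability(W)   # channel 1 is empty, so its reliability is 0
```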
To amplify spatial discrimination and impose temporal continuity for the selected channels, the filters are assumed to be sparse across channels and smooth across frames:
$$\mathcal{R}(\boldsymbol{\sigma}) = \alpha \|\boldsymbol{\sigma}\|_1 + (1-\alpha) \|\boldsymbol{\sigma} - \boldsymbol{\sigma}_{t-1}\|_2^2. \tag{2}$$
The first term in the above equation imposes independent sparsity on the balancing group reliability vector $\boldsymbol{\sigma}$, focusing on the current training data pair to regularise the estimate centred at the origin. The second term in Eq. (2) promotes temporal smoothness, forcing the newly learned filters $\mathcal{W}$ to be close to the filters learned from the previous frame, $\mathcal{W}_{t-1}$, such that the estimate is robust to target appearance variations, with the selected channels being similar to those of the preceding frame. Expanding the second term on the right-hand side of Eq. (2) and rearranging, we can write
$$\begin{aligned} \mathcal{R}(\boldsymbol{\sigma}) = {} & \alpha \|\boldsymbol{\sigma}\|_1 + (1-\alpha) \|\boldsymbol{\sigma}\|_2^2 \\ & + (1-\alpha) \left( \|\boldsymbol{\sigma}_{t-1}\|_2^2 - 2 \boldsymbol{\sigma}^{\top} \boldsymbol{\sigma}_{t-1} \right). \end{aligned} \tag{3}$$
Note that the two terms on the first line on the right of Eq. (3) define the standard elastic net regularisation. The impact on regularisation of the term on the second line on the right is adaptive. It will change depending on the relationship between $\boldsymbol{\sigma}$ and $\boldsymbol{\sigma}_{t-1}$. When $\boldsymbol{\sigma}_{t-1} = \mathbf{0}$, as for instance at the beginning of tracking, the regularisation will revert to the standard group elastic net. When the balancing reliability vector in the current frame is equivalent to its previous frame counterpart, i.e. $\boldsymbol{\sigma} = \boldsymbol{\sigma}_{t-1}$, the regularisation collapses to the $\ell_1$-norm term $\alpha \|\boldsymbol{\sigma}\|_1$.
[Figure: the adaptive elastic net regularisation of Eq. (2) with $\boldsymbol{\sigma} = (\sigma_1, \sigma_2)$, $\mathcal{R}(\boldsymbol{\sigma}) = 1$ and $\alpha = 0.5$. The state of $\boldsymbol{\sigma}_{t-1}$ controls the structured regularisation of the estimate. Compared with the original elastic net ($\boldsymbol{\sigma}_{t-1} = (0, 0)$), the proposed adaptive elastic net guides the process of variable selection. As $\boldsymbol{\sigma}_{t-1}$ moves away from the origin along the $\sigma_1$-axis, the model tends to select variable $\sigma_1$.]
The proposed regularisation performs group variable selection in $\mathcal{W}$, improving the stability by selecting highly correlated variables. The difference between the proposed adaptive elastic net and the $\ell_1$-norm in inducing sparsity can be viewed as lying in the assignment of different priors: the Laplacian sparse distribution prior for the latter, and a combination of the Laplacian sparse distribution and Gaussian dense distribution priors, as well as an additional adaptive correction term, for the former. Note that the Gaussian dense distribution prior term, $\|\boldsymbol{\sigma}\|_2^2$ in Eq. (3), relaxes the $\ell_1$-norm with strong convexity. Therefore, the adaptive group elastic net regularisation reduces the influence of noisy channels on the estimate, so that only the relevant information is used for discrimination. Note that $\alpha$ is the trade-off parameter for the proposed adaptive group elastic net.
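The two limiting cases discussed above can be checked numerically. The sketch below assumes the regulariser takes the form of a weighted sum of an l1 sparsity term and a squared l2 temporal difference, with `alpha` as the trade-off; the function names and the toy vectors are illustrative only.

```python
import numpy as np

def adaptive_elastic_net(sigma, sigma_prev, alpha=0.5):
    """alpha * ||sigma||_1 + (1 - alpha) * ||sigma - sigma_prev||_2^2 (assumed form of Eq. 2)."""
    sigma = np.asarray(sigma, dtype=float)
    sigma_prev = np.asarray(sigma_prev, dtype=float)
    sparsity = alpha * np.abs(sigma).sum()
    smoothness = (1.0 - alpha) * np.sum((sigma - sigma_prev) ** 2)
    return sparsity + smoothness

def expanded_form(sigma, sigma_prev, alpha=0.5):
    """Standard elastic net plus the adaptive correction term (expansion of the above)."""
    sigma = np.asarray(sigma, dtype=float)
    sigma_prev = np.asarray(sigma_prev, dtype=float)
    elastic_net = alpha * np.abs(sigma).sum() + (1.0 - alpha) * (sigma @ sigma)
    adaptive = (1.0 - alpha) * (sigma_prev @ sigma_prev - 2.0 * (sigma @ sigma_prev))
    return elastic_net + adaptive

sigma = np.array([0.8, 0.0, 0.3])
prev = np.array([0.5, 0.1, 0.0])

same_value = np.isclose(adaptive_elastic_net(sigma, prev), expanded_form(sigma, prev))
at_start = adaptive_elastic_net(sigma, np.zeros(3))   # plain elastic net
unchanged = adaptive_elastic_net(sigma, sigma)        # only the l1 term survives
```

With a zero previous state the value is the plain elastic net, and with an unchanged state only the sparsity term remains, which matches the behaviour described in the text.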

Explicit Attribute Selection and Implicit Spatial Unit Selection
The circular convolution operator in Eq. (1) can be considered as being constituted by multiple inner-product operators. The channel selection assisted by the adaptive structured regularisation will identify non-informative channels, and suppress them by setting the associated DCF weights to zero. By virtue of the optimisation process, the informative, or in other words, the relevant channels will be assigned non-zero weights, that is, larger values of balancing reliability. As each informative channel is tuned to respond to different image properties, it can be considered as selecting and representing specific attributes of the tracked target. The target appearance representation is then constituted by a set of the attributes identified by the selected channels. These attributes play an instrumental role in data fitting with parsimony and discrimination. It should be noted that within a channel, the target attribute will relate to specific spatial units, which exhibit the image characteristic picked up by the channel (e.g. high-frequency content). The spatial locations which do not contain the channel-specific content will fail to respond, and the associated DCF weights will be close to zero. It follows that even within informative channels with a relatively large value of $\sigma$, the contributions of individual spatial features to the channel reliability value will be diverse and spatially dependent. This shows that the explicit attribute selection achieved by channel weight regularisation will implicitly perform simultaneous spatial unit selection. This notion of explicit attribute selection and implicit spatial unit regularisation is depicted in Fig. 2. The figure shows the result of implicit spatial regularisation effected by our adaptive channel selection for both hand-crafted and deep features. By feeding back the selected channels to the original input image patch, we can identify the activation pattern of pixels producing the channel output.
Clearly, channel selection can be considered as a method of selecting specific spatial pixel configurations. Fig. 2 also shows that deep features are better than hand-crafted features at configuring the relevant spatial information, resulting in more robust tracking performance, as confirmed in recent benchmarking competitions and challenges. The proposed adaptive group elastic net configures different channels for each frame of the video to compose a joint spatial appearance representation of image content that maximises discrimination.

Formulation
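To incorporate the proposed adaptive elastic net into the DCF formulation (Eq. (1)), we embed the balancing function, $\sigma_j = \|W^j\|_F$, into Eq. (2) and, using the reverse triangle inequality, obtain the following regularisation term:
$$\mathcal{R}(\mathcal{W}) = \alpha \sum_{j=1}^{C} \|W^j\|_F + (1-\alpha) \sum_{j=1}^{C} \|W^j - W^j_{t-1}\|_F^2. \tag{4}$$
Therefore, we aim to solve the following objective:
$$\mathcal{W} = \arg\min_{\mathcal{W}} \left\| \sum_{j=1}^{C} W^j \circledast X^j - Y \right\|_F^2 + \lambda_1 \sum_{j=1}^{C} \|W^j\|_F + \lambda_2 \sum_{j=1}^{C} \|W^j - W^j_{t-1}\|_F^2, \tag{5}$$
where $\lambda_1 = \alpha\lambda$ and $\lambda_2 = (1-\alpha)\lambda$. It should be noted that the internal balancing function $\sigma$ maps each group in $\mathcal{W}$ to a non-negative value, avoiding the problem of $\ell_1$-norm discontinuity at the origin.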
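The role of the reverse triangle inequality can be made explicit with a short worked step. This is a sketch under the assumption that the temporal term is the squared $\ell_2$ distance between successive reliability vectors; it shows that the filter-level smoothness term upper-bounds the reliability-level one, so penalising the former also controls the latter.

```latex
\begin{align*}
\|\boldsymbol{\sigma} - \boldsymbol{\sigma}_{t-1}\|_2^2
  &= \sum_{j=1}^{C} \bigl( \|W^j\|_F - \|W^j_{t-1}\|_F \bigr)^2 \\
  &\le \sum_{j=1}^{C} \|W^j - W^j_{t-1}\|_F^2 ,
\end{align*}
% since \bigl| \|A\|_F - \|B\|_F \bigr| \le \|A - B\|_F for any matrices A, B of equal size.
```

The sparsity term needs no bound, because $\|\boldsymbol{\sigma}\|_1 = \sum_j \|W^j\|_F$ holds exactly for non-negative $\sigma_j$.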

Optimisation
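We employ the augmented Lagrangian method (Lin et al. 2010) to optimise Eq. (1). Note that $\mathcal{R}(\mathcal{W})$ is channel-wise separable. We introduce the slack variable, $\mathcal{W}' = \mathcal{W}$, and reformulate the objective to minimise the following Lagrangian function:
$$\mathcal{L} = \left\| \sum_{j=1}^{C} W^j \circledast X^j - Y \right\|_F^2 + \lambda_1 \sum_{j=1}^{C} \|W'^j\|_F + \lambda_2 \sum_{j=1}^{C} \|W^j - W^j_{t-1}\|_F^2 + \frac{\mu}{2} \sum_{j=1}^{C} \left\| W^j - W'^j + \frac{\Gamma^j}{\mu} \right\|_F^2,$$
where $\Gamma$ is the Lagrangian multiplier sharing the same size as $\mathcal{W}$, and $\mu$ is the corresponding penalty.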

Updating W
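To optimise $\mathcal{W}$, we solve the following sub-problem in the frequency domain by employing the circulant structure (Henriques et al. 2015):
$$\hat{\mathcal{W}} = \arg\min_{\hat{\mathcal{W}}} \left\| \sum_{j=1}^{C} \hat{W}^j \odot \hat{X}^j - \hat{Y} \right\|_F^2 + \lambda_2 \sum_{j=1}^{C} \|\hat{W}^j - \hat{W}^j_{t-1}\|_F^2 + \frac{\mu}{2} \sum_{j=1}^{C} \left\| \hat{W}^j - \hat{W}'^j + \frac{\hat{\Gamma}^j}{\mu} \right\|_F^2.$$
Note that the symbol $\hat{\ }$ stands for Fourier representations in the frequency domain (Henriques et al. 2015). The closed-form solution of the above sub-problem can be obtained as (Petersen and Pedersen 2008): where vector $\hat{\mathbf{w}}[m,n] = [\hat{w}^1_{m,n}, \hat{w}^2_{m,n}, \ldots, \hat{w}^C_{m,n}] \in \mathbb{C}^C$ denotes the units in the $m$-th row and $n$-th column of $\hat{\mathcal{W}}$ across all $C$ channels, and $\mathbf{g} = \hat{\mathbf{x}}[m,n]$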
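One way to solve this sub-problem is to treat each spatial frequency as an independent C-dimensional linear system whose matrix is a scaled identity plus a rank-one term, which the Sherman-Morrison identity inverts without forming a matrix. The sketch below is a derivation under stated assumptions, not the expression printed in the original: it assumes the penalty enters as (mu/2) times the squared residual with the multiplier scaled by 1/mu, and it assumes the data term at one frequency is the plain (unconjugated) product of feature and filter spectra. All function and variable names are hypothetical.

```python
import numpy as np

def solve_frequency(x, y, w_prev, w_slack, gamma, lam2, mu):
    """Minimise |x^T w - y|^2 + lam2*||w - w_prev||^2 + (mu/2)*||w - w_slack + gamma/mu||^2.

    x, w_prev, w_slack, gamma: complex vectors of length C at one frequency [m, n];
    y: complex scalar of the desired response at the same frequency.
    """
    a = lam2 + mu / 2.0
    g = (np.conj(x) * y + lam2 * w_prev + 0.5 * mu * w_slack - 0.5 * gamma) / a
    # Sherman-Morrison: (a*I + conj(x) x^T)^{-1} = (I - conj(x) x^T / (a + x^H x)) / a
    return g - np.conj(x) * (x @ g) / (a + np.vdot(x, x).real)

rng = np.random.default_rng(1)
C = 6
cplx = lambda: rng.standard_normal(C) + 1j * rng.standard_normal(C)
x, w_prev, w_slack, gamma = cplx(), cplx(), cplx(), cplx()
y, lam2, mu = 0.7 - 0.2j, 5.0, 2.0

w_fast = solve_frequency(x, y, w_prev, w_slack, gamma, lam2, mu)

# Reference: solve the same normal equations with a dense C x C system.
A = (lam2 + mu / 2.0) * np.eye(C) + np.outer(np.conj(x), x)
b = np.conj(x) * y + lam2 * w_prev + 0.5 * mu * w_slack - 0.5 * gamma
w_ref = np.linalg.solve(A, b)
```

The rank-one structure reduces the cost per frequency from cubic to linear in the number of channels, which matters when C reaches the hundreds or thousands quoted for deep features.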

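Updating W'
To optimise $\mathcal{W}'$, we need to minimise the following sub-problem, similar to group lasso (Yuan and Lin 2006):
$$\mathcal{W}' = \arg\min_{\mathcal{W}'} \lambda_1 \sum_{j=1}^{C} \|W'^j\|_F + \frac{\mu}{2} \sum_{j=1}^{C} \left\| W^j - W'^j + \frac{\Gamma^j}{\mu} \right\|_F^2.$$
A closed-form solution with the shrinkage operator can be derived as:
$$W'^j = \max\left( 0, 1 - \frac{\lambda_1}{\mu \|H^j\|_F} \right) H^j,$$
where $H^j = W^j + \Gamma^j / \mu$.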
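The shrinkage step is what switches whole channels off. A minimal sketch follows, assuming the group-lasso proximal form with threshold `lam1 / mu`; the function name, the toy tensor and the parameter values are illustrative, and the small constant guarding the division is an implementation detail added here.

```python
import numpy as np

def group_shrink(H, lam1, mu):
    """Channel-wise shrinkage of H with shape (N, N, C).

    Solves argmin_V lam1 * sum_j ||V^j||_F + (mu / 2) * ||V - H||_F^2:
    each channel is scaled towards zero and removed entirely when its
    Frobenius norm does not exceed lam1 / mu.
    """
    norms = np.sqrt(np.sum(np.abs(H) ** 2, axis=(0, 1), keepdims=True))
    scale = np.maximum(0.0, 1.0 - lam1 / (mu * np.maximum(norms, 1e-12)))
    return scale * H

H = np.zeros((4, 4, 2))
H[..., 0] = 1.0          # strong channel, norm 4
H[0, 0, 1] = 0.5         # weak channel, norm 0.5
W_slack = group_shrink(H, lam1=1.0, mu=1.0)

kept = np.flatnonzero(np.sum(np.abs(W_slack), axis=(0, 1)))   # indices of surviving channels
```

In this toy case the strong channel is scaled by 0.75 and the weak channel is zeroed, so only channel 0 survives.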

Updating other variables
In each iteration, the penalty $\mu$ and the multiplier $\Gamma$ are updated as:
$$\Gamma = \Gamma + \mu \left( \mathcal{W} - \mathcal{W}' \right), \qquad \mu = \min\left( \rho\mu, \mu_{\max} \right),$$
where $\rho$ controls the strictness of the penalty in each iteration and $\mu_{\max}$ is the maximal penalty value. A parameter $K$ is used to control the maximum number of iterations. As each sub-problem is convex, the convergence of our optimisation is guaranteed (Boyd et al. 2010).
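The bookkeeping for the multiplier and the penalty can be sketched as follows. The update order and the values of `rho`, `mu_max` and `K` are assumptions for illustration; the source does not state them.

```python
import numpy as np

def update_dual(Gamma, W, W_slack, mu, rho, mu_max):
    """One multiplier and penalty update of the augmented Lagrangian scheme."""
    Gamma = Gamma + mu * (W - W_slack)    # accumulate the constraint violation W - W'
    mu = min(rho * mu, mu_max)            # tighten the penalty, capped at mu_max
    return Gamma, mu

# Toy run: a constant residual of 0.1 between filter and slack variable.
W = np.full((2, 2, 3), 0.1)
W_slack = np.zeros((2, 2, 3))
Gamma, mu = np.zeros_like(W), 1.0
schedule = []
for _ in range(5):                        # K = 5 iterations (hypothetical)
    Gamma, mu = update_dual(Gamma, W, W_slack, mu, rho=10.0, mu_max=1e3)
    schedule.append(mu)
```

The geometric growth of the penalty means the constraint between the filter and its slack copy is enforced loosely in early iterations and strictly in later ones.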

Learning and Tracking Details
We summarise the proposed ACS-DCF tracking method in Algorithm 1.

Tracking
In the tracking stage, we obtain the position and scale of a target simultaneously, as proposed in fDSST (Danelljan et al. 2017b). To be more specific, given a new image in frame $t$ and the predicted target state of frame $t-1$ (target centre $p_{t-1}$, target width $w_{t-1}$ and height $h_{t-1}$), we extract a search window set $\{I_s\}$ centred around $p_{t-1}$ with multiple scales, $s = 1, 2, \ldots, S$, where $S$ is the number of search windows. For each scale $s$, the search window patch is centred around $p_{t-1}$ with a size of $a^{N} n \times a^{N} n$ pixels, where $a$ is the scale factor and the exponent is $N = \frac{2s - S - 1}{2}$. We resize each patch to the $n \times n$ basic search window size. $n$ is determined by the target size $w_{t-1} \times h_{t-1}$ and the padding parameter $\varrho$ as $n = (1 + \varrho)\sqrt{w_{t-1} \times h_{t-1}}$. Then we extract multi-channel features of each search window at scale $s$ as $\mathcal{X}_s \in \mathbb{R}^{N \times N \times C}$. Given the filter model obtained from the previous frame, $\mathcal{W}_{t-1}$, the response map $R_s$ can efficiently be calculated in the frequency domain as:
$$\hat{R}_s = \sum_{j=1}^{C} \hat{X}_s^j \odot \hat{W}^j_{t-1}.$$
Suppose the maximal value in the multi-scale response maps $\{R_s\}$ corresponds to position $p_t^*$ and scale $s^*$. Then the final target centre $p_t$ and scale $w_t \times h_t$ of the target in the $t$-th frame are obtained as:
$$p_t = p_t^*, \qquad w_t = a^{N^*} w_{t-1}, \qquad h_t = a^{N^*} h_{t-1}, \qquad N^* = \frac{2s^* - S - 1}{2}.$$
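The scale pyramid and the frequency-domain response can be sketched as follows. The scale factor and scale count follow the settings reported later in the paper (a = 1.01, S = 7); the function names, the array layout (N, N, C) and the toy filter are assumptions made for the example.

```python
import numpy as np

def scale_factors(S=7, a=1.01):
    """Relative window sizes a**((2s - S - 1) / 2) for s = 1..S, centred on 1."""
    s = np.arange(1, S + 1)
    return a ** ((2 * s - S - 1) / 2.0)

def base_window(w, h, padding=4.0):
    """Side length n = (1 + padding) * sqrt(w * h) of the basic search window."""
    return (1.0 + padding) * np.sqrt(w * h)

def response_map(X, W_hat):
    """Inverse DFT of the channel-wise sum of element-wise spectral products."""
    X_hat = np.fft.fft2(X, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(X_hat * W_hat, axis=2)))

scales = scale_factors()
n = base_window(40.0, 90.0)

# Toy check: a filter that is a unit impulse in channel 0 returns that channel unchanged.
rng = np.random.default_rng(2)
X = rng.standard_normal((8, 8, 3))
W = np.zeros((8, 8, 3))
W[0, 0, 0] = 1.0
R = response_map(X, np.fft.fft2(W, axes=(0, 1)))
```

With an odd number of scales the middle window has a relative size of exactly 1, so an unchanged target size is always among the candidates.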

Learning
In the learning stage, we first extract the feature representation X of the tracked target in frame t. Then the filter W is optimised based on the detailed steps in Sect. 5.2.

Update
After the learning stage, the same updating strategy as in (Henriques et al. 2015) is adopted:
$$\mathcal{W}_t = (1 - \beta)\, \mathcal{W}_{t-1} + \beta\, \mathcal{W},$$
where $\beta$ is the updating rate.
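The practical effect of the updating rate is easiest to see through the weight that each past frame retains in the model. The sketch below assumes the linear interpolation update between the previous model and the newly learned filter; the function names are hypothetical.

```python
import numpy as np

def update_model(W_prev, W_new, beta):
    """Linear interpolation between the previous model and the newly learned filter."""
    return (1.0 - beta) * W_prev + beta * W_new

def frame_weights(beta, k):
    """Weight of the filter learned j frames ago (j = 0..k-1) after repeated updates."""
    return beta * (1.0 - beta) ** np.arange(k)

fast = frame_weights(0.6, 3)    # updating rate reported for hand-crafted features
slow = frame_weights(0.06, 3)   # updating rate reported for deep features
W_t = update_model(np.zeros((2, 2, 1)), np.ones((2, 2, 1)), beta=0.06)
```

A rate of 0.6 lets the latest frame dominate the model, whereas 0.06 spreads the weight over many frames, giving a much longer memory.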

Implementation
The implementation is available at GitHub¹. The detailed settings for the parameters used in Sect. 5.3 are as follows. We set the basic window size to $n \times n = 240 \times 240$ pixels, the padding parameter to $\varrho = 4$, and the scale factor and scale number to $a = 1.01$ and $S = 7$. To verify the generalisation capability of the proposed adaptive channel selection method, we equip ACS-DCF with three different feature configurations, i.e. hand-crafted features (ACS-DCF_HC), deep CNN features (ACS-DCF_Deep), and compound features using both feature types (ACS-DCF*). The hand-crafted set includes HOG and Colour Names (CN) features, with a cell size of 4 pixels, $\lambda_1 = 5$, $\lambda_2 = 30$ and learning rate $\beta = 0.6$. Specifically, the HOG (31 channels) and CN (10 channels) features are concatenated along the channel dimension to obtain the final hand-crafted feature representation $\mathcal{X}_{HC} \in \mathbb{R}^{60 \times 60 \times 41}$. We use ResNet-50 (the output of layer 3, with a stride of 16 pixels) to extract deep feature representations using the MatConvNet toolbox² (Vedaldi and Lenc 2015). The learning rate for deep features is set to $\beta = 0.06$, with $\lambda_1 = 5$ and $\lambda_2 = 5$. The dimensionality of the ResNet-50 feature representation tensor is $\mathcal{X}_{Deep} \in \mathbb{R}^{15 \times 15 \times 1024}$. For ACS-DCF*, the filters for the hand-crafted and deep features are independently trained based on $\mathcal{X}_{HC}$ and $\mathcal{X}_{Deep}$, according to Algorithm 1. In the tracking stage, for each search scale $s$, the final response map $R_s$ is constructed by adding the response maps obtained by the deep features, $R_{s,Deep}$, and the hand-crafted features, $R_{s,HC}$. Note that the response map produced by the deep features, $R_{s,Deep}$, is resized to the same spatial resolution as $R_{s,HC}$ for the additive operation.

Evaluation Metrics
We perform evaluation on three challenging benchmarks: OTB2013 (Wu et al. 2013), OTB100 and VOT2017/VOT2018 (Kristan et al. 2018). For the first two datasets, we employ the precision plot and the success plot to evaluate the tracking performance (Wu et al. 2013). The precision plot measures the proportion of frames in which the distance between the tracking result and the ground truth is less than a certain number of pixels. The distance precision (DP) is defined by the corresponding value when the precision threshold is 20 pixels. Centre location error (CLE) measures the mean distance between the centres of the tracking results and the ground truth values. The success plot describes the percentage of successful frames with the threshold ranging from 0 to 1. The target in a frame is considered successfully tracked if the overlap of the two bounding boxes exceeds a given threshold. The overlap precision (OP) is defined by the corresponding value when the overlap threshold is 0.5. The area under curve (AUC) of the success plot quantifies the result in terms of overlap evaluation. For VOT2017/VOT2018, we use the expected average overlap (EAO), accuracy and robustness metrics for performance evaluation (Kristan et al. 2016).
¹ https://github.com/XU-TIANYANG/ACSDCF
² http://www.vlfeat.org/matconvnet/
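The OTB-style metrics described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the benchmark toolkit: boxes are taken as (x, y, width, height) rows, the comparisons are strict as worded in the text, and the AUC is approximated by averaging the success rate over 21 evenly spaced thresholds, which is an assumed sampling.

```python
import numpy as np

def centre_errors(pred, gt):
    """Per-frame Euclidean distance between box centres; boxes are (x, y, w, h)."""
    pc = pred[:, :2] + pred[:, 2:] / 2.0
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pc - gc, axis=1)

def overlaps(pred, gt):
    """Per-frame intersection over union of axis-aligned boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def distance_precision(pred, gt, thr=20.0):
    return float(np.mean(centre_errors(pred, gt) < thr))

def overlap_precision(pred, gt, thr=0.5):
    return float(np.mean(overlaps(pred, gt) > thr))

def success_auc(pred, gt, steps=21):
    iou = overlaps(pred, gt)
    thresholds = np.linspace(0.0, 1.0, steps)
    return float(np.mean([np.mean(iou > t) for t in thresholds]))

# Two toy frames: a perfect hit and a complete miss 30 pixels away.
gt = np.array([[0, 0, 10, 10], [0, 0, 10, 10]], dtype=float)
pred = np.array([[0, 0, 10, 10], [30, 0, 10, 10]], dtype=float)

cle = float(np.mean(centre_errors(pred, gt)))
dp, op, auc = distance_precision(pred, gt), overlap_precision(pred, gt), success_auc(pred, gt)
```

DP and CLE depend only on centre positions, whereas OP and AUC also penalise errors in the estimated target size, which is why both families of metrics are reported.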
We compare our method against recent state-of-the-art approaches, including VITAL (Song et al. 2018), STRCF (Li et al. 2018b

Ablation Studies
The purpose of the proposed adaptive channel selection method is to improve discrimination by enhancing the relevance of the filters and reducing information redundancy. As illustrated in Fig. 2, hand-crafted and deep features present different selection patterns. For each feature category, we first analyse the effect of each component of the proposed adaptive elastic net for ACS-DCF_HC and ACS-DCF_Deep. The baseline method is the standard spatially regularised DCF tracker (Danelljan et al. 2015). The results are reported in Table 1. In general, the proposed adaptive elastic net and each of its components, i.e., temporal smoothness and channel sparsity, improve on the baseline tracker. Compared with the baseline method (Hand-crafted/Deep), temporal smoothness significantly improves the AUC score, by 1.4% and 2.5%, respectively. Intuitively, connecting successive frames in the learning stage makes the learned filters more invariant to appearance variations. Channel sparsity also improves the tracking performance, from 62.3%/52.1% to 64.1%/59.7%, compared with the baseline method. Finally, combining both components in the proposed adaptive elastic net achieves the best performance (65.5%/59.7%) among all the configurations.
The impact of using different deep network feature layers is reported in Table 2. Compared with the shallower layers (layer 1 and layer 2), layer 3 achieves a better AUC score, whereas the deepest layer, layer 4, exhibits a drastic drop in tracking performance. In general, deep CNN features are more powerful than hand-crafted features for detecting the target, but their low spatial resolution limits their ability to localise it. Note that the resolution of the salient feature maps extracted from layer 3 of VGG or ResNet is only 1/16 of the original input, resulting in an intrinsic centre location error (e.g., 8 pixels). Therefore, all the state-of-the-art DCF trackers use shallow and deep features jointly to achieve better accuracy.
To examine the sensitivity of the proposed regularisation model, we perform experiments with ACS-DCF * by varying λ 1 from 0.1 to 100, and λ 2 from 0.5 to 500. The results are shown in Fig. 4. Deep features benefit from the proposed group variable selection scheme when the channel group sparsity parameter λ 1 ranges from 1 to 50, whereas hand-crafted features achieve a similar AUC score for different values of λ 1 . These results suggest that deep features are highly redundant and exhibit undesirable interference. As such, they offer scope for dimensionality reduction by the proposed adaptive channel selection, leading to improved performance. Hand-crafted features, by contrast, are extracted in a fixed manner, with different channels capturing different attributes. For these features, the proposed channel sparsity therefore only alleviates redundancy, without increasing discrimination. In addition, hand-crafted features present stable and smooth performance for λ 2 ranging from 1 to 100, while deep features achieve an improvement in the AUC score for λ 2 ranging from 1 to 10. The above results demonstrate the effectiveness of the proposed adaptive elastic net in formulating a spatio-temporal appearance model that is robust to varying regularisation parameters.

Quantitative Performance
Precision plots and success plots on OTB100 and OTB2013 are presented in Fig. 5, with the DP and AUC scores reported in the figure legends. The best three results are highlighted in red, blue and brown, respectively. The performance achieved by ACS-DCF * is superior to that of the state-of-the-art trackers in both criteria. On OTB100, the advantage of our ACS-DCF * is clear, with improvements of 2.0% in DP and 0.8% in AUC over the second best trackers, VITAL * and ECO * , respectively. On OTB2013, ACS-DCF * achieves accurate tracking with 96.1% in DP. Our performance is also better than that of ECO * , which can be considered the best of the class of DCF-based trackers. The OP, CLE and AUC results are presented in Tables 3 and 4 (the best three results are highlighted in red, blue and brown). Compared with the other trackers using hand-crafted features, our ACS-DCF_HC achieves the best OP score and the second best CLE. In addition, with hybrid features, ACS-DCF * obtains accurate and robust tracking results on OTB2013 and OTB100, with the best OP/CLE of 93.6%/6.6 pixels and 88.4%/7.8 pixels, respectively. We attribute this improvement to the adaptive integration of temporal smoothness and channel selection. The above performance is achieved using only about 7% of the available deep channels. By focusing on relevance and reducing redundancy in the multi-channel deep feature representation, ACS-DCF * exhibits adaptive context awareness and outstanding generalisation.
In Table 5, we report the results obtained on VOT2017/VOT2018. As VOT consists of diverse challenging factors, all the top-performing trackers use deep CNN features. The proposed ACS-DCF method performs best under the EAO metric, achieving a relative gain of 1.4% compared to the second best, LADCF. For accuracy and robustness, ACS-DCF achieves comparable performance to the top-performing trackers.
Besides, in Table 6, we report the results of top DCF trackers, i.e., ECO, C-COT, LADCF, and the proposed ACS-DCF, using the same deep features extracted by the VGG network. The proposed ACS-DCF achieves favourable tracking performance compared with other spatial regularisation approaches, demonstrating the merit of the proposed adaptive channel selection strategy in the filter learning stage.

Qualitative Performance
Qualitative comparisons are presented in Fig. 6, which shows the tracking results of the state-of-the-art methods, i.e., BACF, STAPLE_CA, CFNet * , C-COT * , ECO * , CREST * , MCPF * , VITAL * , MetaTracker * and ACS-DCF * , on some challenging video sequences. The difficulties are posed by rapid changes in the appearance of the targets and of their surroundings. Our ACS-DCF * performs well on these challenges as it successfully identifies the pertinent spatial salience configurations. Sequences with deformations (MotorRolling, Dragonbaby) and with the target moving out of view (Biker, Bird1) are tracked by our method without any failures. Videos with rapid motions (Biker, Matrix, Skiing, Ironman) also benefit from our strategy of exploring relevant deep channels to enhance discrimination. In particular, ACS-DCF * is adept at handling in-plane and out-of-plane rotations (Biker, MotorRolling, Skiing), because the proposed adaptive channel selection approach fuses the appearance information from the central region and the surroundings through implicit spatial regularisation.

The Performance on Challenging Attributes
The tracking performance evaluated on OTB100 for 7 challenging attributes (Wu et al. 2013), i.e., scale variation, occlusion, motion blur, in-plane rotation, out of view, background clutter and low resolution, is summarised in Fig. 7. For clarity of presentation, only the trackers within the top ten in terms of overall performance on OTB100 are included. The results demonstrate that our ACS-DCF * outperforms the state-of-the-art trackers in out of view, in-plane rotation, motion blur and scale variation. Due to the implicit spatial regularisation performed by our adaptive channel selection for deep features, the spatio-temporal salience of the target incorporates the surroundings. Compared to C-COT * and ECO * , the learning scheme of our ACS-DCF * depends only on the filter model W_{t-1} and the current appearance representation X, without gathering a historical appearance pool. Overall, the proposed ACS-DCF * deals with challenging video sequences in a superior manner.

Stability Analysis of ACS-DCF
Table 5 The tracking results on VOT2017/VOT2018. The best three results are highlighted in red, blue and brown.
Fig. 7 The experimental performance based on attributes on OTB100. The plots are ranked based on AUC (left) and DP (right) respectively. The scales of the challenging attribute axes are displayed below the attribute labels.
To evaluate the robustness of the proposed ACS-DCF method, we investigate its stability using OTB100. Rather than contaminating the input, in our design random Gaussian noise is added to the filter model W. We run the experiment 10 times for each noise level. Table 7 gives the tracking performance in terms of the mean value and standard deviation of the AUC on OTB100. We can see that the introduction of the spatio-temporal appearance regularisations achieves adaptive channel selection with high stability. More specifically, ACS-DCF_Deep only drops 1.4% in the mean value of AUC in the presence of level 100 noise, while ACS-DCF_HC loses 15% of its performance. Intuitively, deep features are more robust to noise than hand-crafted features after channel selection. This can be explained by the fact that deep features achieve more decisive discrimination than hand-crafted features, as the relevant discriminatory information is enhanced by eliminating redundancy. In addition, ACS-DCF_HC, ACS-DCF_Deep and the hybrid ACS-DCF * all perform well under the first four noise levels, with a performance loss of less than 0.6%. In summary, the proposed ACS-DCF method provides a robust appearance model, thereby leading to superior and stable performance in visual object tracking.
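The stability protocol described above (perturb the filter, repeat 10 times per noise level, report mean and standard deviation) can be sketched as follows. Several parts are assumptions made for illustration: the mapping from a noise level to the standard deviation of the Gaussian is not recoverable from the text, so the standard deviation is passed in directly; the filter is flattened to a list of floats; and `toy_score` is a hypothetical stand-in for running the tracker and measuring AUC.

```python
import random
import statistics


def perturb_filter(w, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each weight."""
    return [v + rng.gauss(0.0, sigma) for v in w]


def stability_table(score_fn, w, sigmas, runs=10, seed=0):
    """Return {sigma: (mean score, standard deviation)} over repeated runs."""
    rng = random.Random(seed)
    table = {}
    for sigma in sigmas:
        scores = [score_fn(perturb_filter(w, sigma, rng)) for _ in range(runs)]
        table[sigma] = (statistics.mean(scores), statistics.stdev(scores))
    return table


def toy_score(w, reference=(1.0, -0.5, 0.25)):
    """Hypothetical score: 1 for the clean filter, lower as it is corrupted."""
    err = sum((a - b) ** 2 for a, b in zip(w, reference))
    return 1.0 / (1.0 + err)


table = stability_table(toy_score, [1.0, -0.5, 0.25], sigmas=[0.0, 0.1, 1.0])
for sigma, (mean, std) in table.items():
    print(f"sigma={sigma}: mean={mean:.3f}, std={std:.3f}")
```

Replacing `toy_score` with a function that runs a tracker on a benchmark and returns its AUC would give a table of the same form as the one reported for each noise level.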

Channel Selection or Spatial Regularisation?
As shown in Fig. 4 and Table 7, the proposed channel selection strategy significantly improves the tracking performance with deep CNN features in terms of both accuracy and robustness. In contrast, hand-crafted features do not benefit much from this strategy beyond a limited reduction in redundancy. Interestingly, the dimensionality reduction methods with spatial regularisation investigated in recent tracking methods, i.e., SRDCF (Danelljan et al. 2015), BACF (Kiani Galoogahi et al. 2017), ECO (Danelljan et al. 2017a) and LADCF (Xu et al. 2019), achieve notable improvements for hand-crafted features, but do not seem to work as well with deep features, which are very compact and convey information accumulated over an extensive set of pixels. Therefore, we explore the possibility of improving the performance further by adopting an appropriate formulation for hand-crafted features and combining the result with channel-based selection for deep CNN features.
We employ the formulation in LADCF for hand-crafted features and construct a fused tracker, coined ACS-DCF_Deep + LADCF_HC. The experimental results on OTB100 are shown in Table 8. Compared with ACS-DCF * , the tracking performance is improved from 69.9%/93.8% to 70.8%/94.5% in terms of AUC/DP, an additional gain of 0.9%/0.7%, respectively. The results demonstrate that different feature categories should be treated with different strategies, i.e., spatial regularisation for hand-crafted features and channel selection for deep features.

Conclusion
In this paper, we developed a novel tracking method featuring adaptive channel selection. The proposed ACS-DCF effectively handles target variations by adaptively selecting relevant discriminative deep channels. This is achieved by employing grouped elastic net regularisation to simultaneously identify spatial relevance and impose temporal smoothness on the DCF solution. Furthermore, the proposed ACS-DCF method realises implicit spatial regularisation, which confirms earlier findings reported in the tracking community about the importance of such regularisation. Qualitative and quantitative evaluations on several well-known benchmarking datasets demonstrate the effectiveness and robustness of our adaptive channel selection method in comparison with the state-of-the-art trackers.