1 Introduction

Visual tracking plays an important role in computer vision and has received rapidly growing attention in recent years due to its wide range of practical applications, such as pedestrian detection, vehicle navigation, security surveillance, and wireless communication [1,2,3,4]. In general, visual tracking aims to follow a target of interest, typically indicated by a bounding box in the first frame, throughout a video stream. The main challenge of visual tracking lies in the numerous appearance changes the target may undergo, such as occlusion, abrupt motion, illumination variation, in-plane rotation, out-of-plane rotation, deformation, and scale variation.

To overcome the above challenges, many effective trackers have been proposed in recent years [5, 6]. Generally, tracking algorithms can be classified into three categories: discriminative, generative, and hybrid generative-discriminative. Discriminative trackers formulate tracking as a binary classification problem that searches for the target location and separates the target from the background. Their main limitation is that they cannot estimate the precise target state because of the limited number of candidates. Generative trackers adopt an appearance model to represent the target and estimate the target state by finding the candidate with the highest likelihood; the model is usually updated online to cope with appearance changes. Their main limitation is that the appearance model is often too restricted to represent the target effectively [7,8,9]. Hybrid generative-discriminative trackers fuse the advantages of both families, and many effective hybrid trackers have been proposed recently [10,11,12,13]. A hybrid model can exploit the global characteristics of the object while also using useful information from the background. However, the complexity of a hybrid model is relatively high, which leads to a high computational cost and often makes it impractical.

Recently, sparse representation models, a family of generative tracking algorithms, have achieved outstanding performance [14]. Mei et al. [15] first propose the L1 tracker, which casts tracking as finding a sparse combination of a target template set and a trivial template set to approximate the target object; the sparsity is obtained by solving an ℓ1-regularized least squares problem. Ji et al. [16] improve tracking accuracy by adding an ℓ2-norm regularization on the trivial coefficients and use an accelerated proximal gradient approach for the minimization, gaining both accuracy and computational efficiency. Zhang et al. [17] exploit the intrinsic relationship among candidates through their joint sparsity by casting tracking as a multi-task problem. Jia et al. [18] exploit both partial and spatial information of the target via a novel alignment-pooling method and employ a template update strategy that combines incremental subspace learning and sparse representation. Wang et al. [19] introduce ℓ1 regularization into the principal component analysis (PCA) reconstruction and propose an online tracking algorithm that approximates the target by linearly combining the PCA basis and a sparse set of trivial templates. Liu et al. [20] use a local sparse representation of the target and exploit a sparse coding histogram to represent the dynamic dictionary basis distribution of the target model. Guo et al. [21] propose a novel multi-view structural local subspace method that jointly exploits the advantages of three sub-models and uses an alignment-weighted averaging method to obtain the optimal state of the target. Wang et al. [22] replace trivial templates with squared templates to handle partial occlusion and propose a probabilistic collaborative representation framework that reduces the complexity of traditional sparse model-based methods. Kim et al. [23] propose a novel structure-preserving sparse learning method that preserves both local geometrical and discriminative structures within a multi-task feature selection framework.

However, most of these methods aim mainly at improving tracking accuracy or efficiency and usually use raw image intensities to construct the template set. Intensity is less effective at expressing the structural information of the target and thus cannot cover severe appearance changes of the target object.

To solve this problem, many hand-crafted features have been used for visual tracking, such as Haar-like features, histogram of oriented gradients (HOG), local binary patterns (LBP), and the scale-invariant feature transform (SIFT). However, these hand-crafted features are not robust enough for generic object tracking. Convolutional neural network (CNN) models, which learn hierarchical features from raw images on large-scale datasets, have therefore been widely used to represent the appearance of the target. Ma et al. [24] exploit features from hierarchical CNN layers within a correlation filter-based framework, learning a linear correlation filter on each layer and adopting a coarse-to-fine scheme to estimate the target location. Wang et al. [25] analyze CNN features from different layers and propose a tracking method that jointly exploits two convolutional layers to mitigate the drift problem. Danelljan et al. [26] show that activations from the first convolutional layer achieve favorable tracking performance compared with deeper layers within a discriminative correlation filter-based framework. In contrast to traditional feature descriptors, CNN features contain more structural information, which is crucial for localizing the target in an unseen frame.

Motivated by the above observations, we present a novel L1 tracker with CNN features. The proposed approach uses a novel sparse representation model with convolutional features, which not only exploits CNN features to describe the object appearance more robustly but also uses trivial templates to model the reconstruction errors of both the sparse representation and the eigen-subspace representation. Besides, to alleviate the redundancy of high-dimensional convolutional features, a feature selection method is adopted, which reduces computational complexity and improves tracking accuracy. This strategy lets the model jointly exploit the structural information carried by CNN features and the complementary strengths of sparse representation and incremental subspace learning. In addition, a customized accelerated proximal gradient (APG) method is developed to solve the optimization problem efficiently, and a robust observation likelihood metric is proposed.

The rest of this paper is organized as follows. Section 2 introduces the CNN features and the proposed sparse model in detail. Section 3 presents the optimization of the objective function and the overall tracking algorithm. Section 4 reports quantitative and qualitative experiments comparing our method with state-of-the-art trackers. Section 5 concludes the paper.

2 Proposed model

2.1 CNN features

Most traditional L1 trackers use raw image intensities to construct the template set. However, intensity-based trackers can hardly handle the complicated situations that arise in practical visual tracking because intensity carries little of the target's structural information. To this end, our algorithm introduces CNN features to describe the target template set.

Convolutional neural networks (CNNs) have been successfully applied in many computer vision fields, especially in complicated tasks such as object detection, image classification, and object recognition [27]. In traditional CNN pipelines, only the information from the last layer is used to represent the target, which is effective for classification problems. However, directly adopting a CNN for generic visual tracking is inadequate because of the lack of training samples and the computational complexity.

To overcome this problem, pre-trained CNN feature extraction methods have been proposed in recent years. CNN features extracted from different layers have different characteristics in describing the object [24]. Features from deeper layers contain more high-level semantic information, which can be viewed as structural information; they have stronger distinguishing capabilities and are therefore effective when intra-class appearance variation occurs. However, deep-layer features have very low spatial resolution and thus poorly fit the core task of generic visual tracking, namely indicating the location of the target. On the other hand, CNN features from earlier layers contain more fine-grained spatial information and are therefore more effective at locating the target precisely, but with less semantic information they are more sensitive to intra-class appearance variations.

Based on these observations, and unlike the common strategy of using CNN features extracted only from the last layer, we exploit CNN features from hierarchical layers so as to make full use of the high-level structural information while preserving the spatial information of the target.

2.2 Feature selection

In this paper, we employ CNN features extracted from the VGG Net [28], which is trained on the large-scale ImageNet dataset; other CNN models, such as AlexNet [29] and R-CNN [30], could be used as alternatives.

VGG-19 (with 16 convolutional layers and 3 fully connected layers) has a deeper structure than many other CNN models and can therefore provide more semantic information. Given an input image frame, the spatial resolution shrinks from layer to layer because of pooling: the input is 224 × 224, while pool5 is only 7 × 7. The target is hard to localize in such small feature maps, so each selected layer must be resized to a fixed size in order to locate the target accurately.

In this paper, we resize the feature maps of different layers to a constant size of 224 × 224 using bilinear interpolation [31],

$$ f_k=\sum \limits_i{\omega}_{ki}{F}_i $$
(1)

where the weight ω_{ki} depends on the positions of the output location k and its neighboring input locations i, and F denotes the feature map.

As discussed above, in order to utilize CNN features from multiple layers, we specifically choose the conv2-2, conv3-4, and conv5-4 layers as feature representations.
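For concreteness, the following PyTorch sketch shows one way to extract and resize these hierarchical features. The torchvision layer indices for conv2-2, conv3-4, and conv5-4 (8, 17, and 35, the corresponding ReLU outputs) and the weights identifier are our assumptions about the toolchain, not part of the original implementation (which is in MATLAB).

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Assumed indices of the ReLU outputs of conv2-2, conv3-4, and conv5-4
# inside torchvision's VGG-19 `features` module.
LAYERS = {8: "conv2_2", 17: "conv3_4", 35: "conv5_4"}

vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()

def hierarchical_features(image, size=(224, 224)):
    """image: (1, 3, 224, 224) tensor, ImageNet-normalized.
    Returns the three feature maps, each resized to `size` by
    bilinear interpolation as in Eq. (1)."""
    feats = {}
    x = image
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in LAYERS:
                feats[LAYERS[i]] = F.interpolate(
                    x, size=size, mode="bilinear", align_corners=False)
            if i == max(LAYERS):   # no need to run past conv5-4
                break
    return feats
```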

However, these CNN features are pre-trained mainly for classification tasks, so many neurons are devoted to describing generic objects, which results in a large number of wasted features. By wasted features, we mean features that are redundant for discriminating the target from the background, especially when target deformation occurs. Furthermore, deeper CNN features are high-dimensional (e.g., 512 channels for conv5-4), leading to extremely high computational complexity.

To alleviate the influence of these wasted features, an appropriate selection mechanism is essential. Experimentally, we observe that most redundant features take zero values when representing the target, so we adopt a sparse method similar to [25] to remove the redundancy and keep the features with the largest coefficients as the template set.
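As a rough illustration of such a selection scheme (the exact procedure of [25] differs in detail), one could regress a vectorized target appearance onto the individual channels with an ℓ1 penalty and keep only the channels with the largest coefficients. In the sketch below, `target_vec`, the channel budget `k`, and `alpha` are illustrative choices, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_channels(feat_map, target_vec, k=64, alpha=0.01):
    """feat_map: (C, H, W) CNN feature map; target_vec: (H*W,) vectorized
    target appearance. Returns the k channels whose sparse regression
    coefficients are largest in magnitude; channels with zero coefficients
    (the redundant ones) are discarded."""
    C, H, W = feat_map.shape
    X = feat_map.reshape(C, H * W).T              # one regressor per channel
    coef = Lasso(alpha=alpha, max_iter=5000).fit(X, target_vec).coef_
    keep = np.argsort(-np.abs(coef))[:k]          # largest-coefficient channels
    return feat_map[keep], keep
```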

2.3 Sparse representation model with CNN features and incremental subspace constraint

Motivated by the above discussions, we propose a novel sparse model with CNN features (Fig. 1). Similar to [32], we assume that the target observation z ∈ ℝ^D can be sparsely represented by the target template set M = [m_1, m_2, …, m_N] ∈ ℝ^{D × N} and the trivial template set I ∈ ℝ^{D × D}, where D is the dimension of the observation vector, N is the number of target templates, and I is an identity matrix. Traditional sparse representation-based trackers approximate the target object by linearly combining M and I under sparsity constraints,

$$ \underset{\mathbf{a}}{\mathrm{argmin}}\ \frac{1}{2}{\left\Vert \mathbf{z}-\mathbf{A}\mathbf{a}\right\Vert}_2^2+\lambda {\left\Vert \mathbf{a}\right\Vert}_1,\quad \mathrm{s.t.}\ {\mathbf{a}}_M\ge 0, $$
(2)

where A = [M, I], a = [a_M, a_I] ∈ ℝ^{D + N} denotes the corresponding sparse coefficients, and λ controls the amount of regularization. The optimal state is the one with the smallest reconstruction error.
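For illustration, here is a minimal numpy sketch of one way to solve Eq. (2), using a projected ISTA loop (soft-thresholding followed by projection of the template coefficients onto the nonnegative orthant). The iteration count is illustrative, and the paper itself solves this subproblem with the LASSO method (see Section 3.1).

```python
import numpy as np

def solve_sparse(z, M, lam=0.01, iters=100):
    """Projected ISTA sketch for Eq. (2):
    min_a 0.5*||z - A a||_2^2 + lam*||a||_1, s.t. a_M >= 0,
    with A = [M, I] (I holds the trivial templates)."""
    D, N = M.shape
    A = np.hstack([M, np.eye(D)])
    a = np.zeros(N + D)
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(iters):
        g = a - A.T @ (A @ a - z) / L      # gradient step
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-threshold
        a[:N] = np.maximum(a[:N], 0.0)     # keep template coefficients >= 0
    return a[:N], a[N:]                    # (a_M, a_I)
```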

Fig. 1

Overall sparse model with CNN features

However, traditional sparse representation-based trackers have several drawbacks. First, their computational complexity is relatively high, which limits real-time application. Second, they use only image intensities to construct the target template set, which can hardly handle drastic appearance changes in practical visual tracking due to the weak feature description. Third, the target templates are obtained only from a few recent time instants, which cannot effectively capture the underlying properties needed to model the target appearance.

To solve the second problem, we use CNN features to describe the object. CNN features from different convolutional layers have different characteristics: high-level features are more distinguishing but have low spatial resolution, while low-level features are better at precise localization but are sensitive to appearance changes. We therefore construct a target template set from hierarchical CNN features as a more complete feature descriptor. Furthermore, CNN features have a much higher dimension than image intensities, which incurs an extreme computational burden, and most of them contribute little to determining the exact location of the target; we thus adopt the feature selection method of Section 2.2 to alleviate the redundancy.

To solve the third problem, an eigen template model is introduced for its ability to learn the temporal correlation of target appearances from past observations through an incremental update procedure, compactly capturing both rich and redundant image properties [33]. The incremental visual tracking (IVT) algorithm [34] efficiently learns and updates a low-dimensional PCA subspace representation of the target and updates the sample mean, making full use of previously observed target appearances. Experimental results have demonstrated that incremental learning of a PCA subspace representation can efficiently handle appearance changes caused by rotation, illumination variation, deformation, and scale change. However, the performance of the IVT tracker declines under partial occlusion: the underlying assumption of PCA is that the error of each pixel is Gaussian distributed with small variance, and this assumption no longer holds when partial occlusion occurs. Furthermore, the IVT tracker may also fail when the target overlaps with a similar object.

In [19], each patch can be linearly represented by its corresponding eigenvectors, with the coefficients of almost all other eigenvectors being zero; hence, ℓ1 regularization is introduced into the PCA reconstruction and the error term e is modeled as arbitrary but sparse noise,

$$ \underset{\mathbf{q},\mathbf{e}}{\mathrm{argmin}}\ \frac{1}{2}{\left\Vert \mathbf{y}-\mathbf{U}\mathbf{q}-\mathbf{e}\right\Vert}_2^2+\tau {\left\Vert \mathbf{e}\right\Vert}_1 $$
(3)

where U ∈ ℝ^{D × P} is the PCA eigenbasis matrix, P is the number of eigenbasis vectors, q ∈ ℝ^P contains the coefficients of U, and τ controls the amount of regularization.

Motivated by [19], we model the reconstruction errors of both the sparse representation and the eigen-subspace representation by solving

$$ \underset{\mathbf{c}}{\mathrm{argmin}}\ \frac{1}{2}{\left\Vert \mathbf{y}-\mathbf{B}\mathbf{c}\right\Vert}_2^2+\frac{\sigma}{2}{\left\Vert \left(\mathbf{T}{\mathbf{c}}_T-\overline{\mathbf{t}}\right)-\mathbf{U}{\mathbf{U}}^{\mathrm{T}}\left(\mathbf{T}{\mathbf{c}}_T-\overline{\mathbf{t}}\right)\right\Vert}_2^2+\rho {\left\Vert \mathbf{c}\right\Vert}_1,\quad \mathrm{s.t.}\ {\mathbf{c}}_T\ge 0 $$
(4)

where B = [T, I], T is the target template set built from CNN features, c = [c_T, c_I] ∈ ℝ^{3D + N} denotes the corresponding sparse coefficients, \( \overline{\mathbf{t}} \) is the sample mean of the target observations, and σ balances the contribution of the two terms.

This strategy constrains the sparse reconstruction to have a minimal reconstruction error in the PCA eigenbasis representation. In other words, our model with the incremental subspace constraint accounts for the reconstruction errors of both the sparse representation and the eigen-subspace representation, and it reconstructs the reliable part of the target using a small number of PCA basis vectors.

By integrating the subspace-constrained sparse representation model with the CNN features extracted and selected from hierarchical layers, we obtain

$$ \underset{\mathbf{a},\mathbf{c}}{\mathrm{argmin}}\ \frac{1}{2}{\left\Vert \mathbf{z}-\mathbf{A}\mathbf{a}\right\Vert}_2^2+\lambda {\left\Vert \mathbf{a}\right\Vert}_1+\frac{1}{2}{\left\Vert \mathbf{y}-\mathbf{B}\mathbf{c}\right\Vert}_2^2+\frac{\sigma}{2}{\left\Vert \left(\mathbf{T}{\mathbf{c}}_T-\overline{\mathbf{t}}\right)-\mathbf{U}{\mathbf{U}}^{\mathrm{T}}\left(\mathbf{T}{\mathbf{c}}_T-\overline{\mathbf{t}}\right)\right\Vert}_2^2+\rho {\left\Vert \mathbf{c}\right\Vert}_1,\quad \mathrm{s.t.}\ {\mathbf{a}}_M\ge 0,\ {\mathbf{c}}_T\ge 0 $$
(5)

The overall model combines the capability of hierarchical CNN features to describe the target with the strengths of the subspace-constrained sparse representation.

3 Optimization and the tracking algorithm

3.1 Optimization

Problem (5) can be decomposed into two sub-problems:

$$ \underset{\mathbf{a}}{\mathrm{argmin}}\ \frac{1}{2}{\left\Vert \mathbf{z}-\mathbf{A}\mathbf{a}\right\Vert}_2^2+\lambda {\left\Vert \mathbf{a}\right\Vert}_1,\quad \mathrm{s.t.}\ {\mathbf{a}}_M\ge 0 $$
(6-1)
$$ \underset{\mathbf{c}}{\mathrm{argmin}}\ \frac{1}{2}{\left\Vert \mathbf{y}-\mathbf{B}\mathbf{c}\right\Vert}_2^2+\frac{\sigma}{2}{\left\Vert \left(\mathbf{T}{\mathbf{c}}_T-\overline{\mathbf{t}}\right)-\mathbf{U}{\mathbf{U}}^{\mathrm{T}}\left(\mathbf{T}{\mathbf{c}}_T-\overline{\mathbf{t}}\right)\right\Vert}_2^2+\rho {\left\Vert \mathbf{c}\right\Vert}_1,\quad \mathrm{s.t.}\ {\mathbf{c}}_T\ge 0 $$
(6-2)

Problem (6-1) can be solved by the LASSO method [35], and problem (6-2) can be solved by the accelerated proximal gradient (APG) method [16]. The APG method is an effective approach to the following unconstrained minimization problem,

$$ \underset{\mathbf{c}}{\min}\ F\left(\mathbf{c}\right)+G\left(\mathbf{c}\right) $$
(7)

where F(c) is a differentiable convex function with Lipschitz continuous gradient and G(c) is a non-smooth convex function. We describe the solution in detail below.

Let R = T − UU^T T and \( \mathbf{S}=\overline{\mathbf{t}}-\mathbf{U}{\mathbf{U}}^{\mathrm{T}}\overline{\mathbf{t}} \). Problem (6-2) can then be rewritten as

$$ \underset{\mathbf{c}}{\mathrm{argmin}}\ \frac{1}{2}{\left\Vert \mathbf{y}-\mathbf{B}\mathbf{c}\right\Vert}_2^2+\frac{\sigma}{2}{\left\Vert \mathbf{S}-\mathbf{R}{\mathbf{c}}_T\right\Vert}_2^2+\rho {\left\Vert \mathbf{c}\right\Vert}_1,\quad \mathrm{s.t.}\ {\mathbf{c}}_T\ge 0 $$
(8)

However, the APG method cannot be applied directly to our model, since the original APG method solves unconstrained minimization problems; we therefore need to convert our model into an unconstrained one.

Let 1_T ∈ ℝ^N denote the column vector whose entries are all 1, and let ψ(c) denote the indicator function defined by

$$ \psi \left(\mathbf{c}\right)=\begin{cases}0 & \mathbf{c}\ge 0\\ +\infty & \mathrm{otherwise}\end{cases}, $$
(9)

Problem (8) can then be equivalently reformulated as the following unconstrained problem:

$$ \underset{\mathbf{c}}{\mathrm{argmin}}\ \frac{1}{2}{\left\Vert \mathbf{y}-\mathbf{B}\mathbf{c}\right\Vert}_2^2+\frac{\sigma}{2}{\left\Vert \mathbf{S}-\mathbf{R}{\mathbf{c}}_T\right\Vert}_2^2+\rho {\mathbf{1}}_T^{\mathrm{T}}{\mathbf{c}}_T+\rho {\left\Vert {\mathbf{c}}_I\right\Vert}_1+\psi \left({\mathbf{c}}_T\right) $$
(10)

Then, we can use the APG approach to solve this minimization problem with

$$ F\left(\mathbf{c}\right)=\frac{1}{2}{\left\Vert \mathbf{y}-\mathbf{B}\mathbf{c}\right\Vert}_2^2+\frac{\sigma}{2}{\left\Vert \mathbf{S}-\mathbf{R}{\mathbf{c}}_T\right\Vert}_2^2+\rho {\mathbf{1}}_T^{\mathrm{T}}{\mathbf{c}}_T, $$
$$ G\left(\mathbf{c}\right)=\rho {\left\Vert {\mathbf{c}}_I\right\Vert}_1+\psi \left({\mathbf{c}}_T\right), $$
(11)

In the above formulation, each iteration requires solving the following optimization problem:

$$ {\mathbf{c}}_{k+1}=\underset{\mathbf{c}}{\mathrm{argmin}}\ \frac{L}{2}{\left\Vert \mathbf{c}-{\boldsymbol{\beta}}_{k+1}+\nabla F\left({\boldsymbol{\beta}}_{k+1}\right)/L\right\Vert}_2^2+G\left(\mathbf{c}\right), $$
(12)

where k denotes the current iteration index, L is the Lipschitz constant, and β_{k+1} is defined in Algorithm 1. We define g_{k+1} = β_{k+1} − ∇F(β_{k+1})/L and the soft-thresholding operator \( {\mathfrak{T}}_{\rho }(x)=\operatorname{sign}(x)\max \left(\left|x\right|-\rho, 0\right) \). The fast numerical algorithm for solving problem (6-2) is then given in Algorithm 1.

Algorithm 1 Customized APG method for solving problem (6-2)
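Since Algorithm 1 appears only as a figure in the published version, the numpy sketch below reconstructs the customized APG iteration as we read it from Eqs. (10)–(12), with a FISTA-style momentum schedule; the initialization and momentum details are our assumptions.

```python
import numpy as np

def apg_solver(y, T, U, t_bar, sigma=0.1, rho=0.01, L=8.0, iters=5):
    """APG sketch for problem (10). B = [T, I]; c = [c_T, c_I];
    F(c) = 0.5*||y - B c||^2 + 0.5*sigma*||S - R c_T||^2 + rho*1^T c_T,
    G(c) = rho*||c_I||_1 + indicator(c_T >= 0)."""
    D, N = T.shape
    B = np.hstack([T, np.eye(D)])
    R = T - U @ (U.T @ T)                    # R = T - U U^T T
    S = t_bar - U @ (U.T @ t_bar)            # S = t_bar - U U^T t_bar
    c = np.zeros(N + D)
    c_prev = c.copy()
    t = 1.0
    for _ in range(iters):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = c + ((t - 1.0) / t_next) * (c - c_prev)   # momentum point
        grad = B.T @ (B @ beta - y)                      # gradient of data term
        grad[:N] += sigma * (R.T @ (R @ beta[:N] - S)) + rho
        g = beta - grad / L                              # g_{k+1}
        c_prev = c
        c = np.concatenate([
            np.maximum(g[:N], 0.0),                      # prox of indicator on c_T
            np.sign(g[N:]) * np.maximum(np.abs(g[N:]) - rho / L, 0.0),
        ])                                               # soft-threshold on c_I
        t = t_next
    return c[:N], c[N:]                                  # (c_T, c_I)
```

Problem (6-1) has the same structure with σ = 0 and no subspace term, so the same machinery applies to it as well.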

3.2 Particle filter tracking framework

Similar to [19], our method is built on a Bayesian filtering framework with a Markov assumption. In a particle filter framework, given the set of observed image vectors Z_{1 : t − 1} = [z_1, z_2, …, z_{t − 1}], the predicted state distribution can be recursively computed as

$$ p\left({\mathbf{x}}_t\left|{\mathbf{Z}}_{1:t-1}\right.\right)=\int p\left({\mathbf{x}}_t\left|{\mathbf{x}}_{t-1}\right.\right)p\left({\mathbf{x}}_{t-1}\left|{\mathbf{Z}}_{1:t-1}\right.\right)d{\mathbf{x}}_{t-1}, $$
(13)

where p(x_t | x_{t − 1}) is the dynamic model and x_t denotes the state vector.

At time t, by Bayes' rule, we obtain

$$ p\left({\mathbf{x}}_t\left|{\mathbf{Z}}_{1:t}\right.\right)=\frac{p\left({\mathbf{z}}_t\left|{\mathbf{x}}_t\right.\right)\int p\left({\mathbf{x}}_t\left|{\mathbf{x}}_{t-1}\right.\right)p\left({\mathbf{x}}_{t-1}\left|{\mathbf{Z}}_{1:t-1}\right.\right)d{\mathbf{x}}_{t-1}}{p\left({\mathbf{z}}_t\left|{\mathbf{Z}}_{1:t-1}\right.\right)}, $$
(14)

where p(z_t | x_t) denotes the observation likelihood of z_t at state x_t. The state variable consists of six affine parameters, x_t = [t_x, t_y, θ_t, s_t, δ_t, φ_t]^T, where t_x, t_y, θ_t, s_t, δ_t, and φ_t denote the x and y translations, rotation angle, scale, aspect ratio, and skew, respectively.

The dynamic model is taken to be Gaussian,

$$ p\left({\mathbf{x}}_t\left|{\mathbf{x}}_{t-1}\right.\right)=N\left({\mathbf{x}}_t;{\mathbf{x}}_{t-1},\boldsymbol{\Sigma} \right), $$
(15)

where Σ is a diagonal covariance matrix.

Using this dynamic model, we generate the candidate state set \( {\mathbf{X}}_t=\left\{{\mathbf{x}}_t^1,{\mathbf{x}}_t^2,\dots, {\mathbf{x}}_t^n\right\} \), where n is the number of candidates sampled at each frame. For each particle \( {\mathbf{x}}_t^i \), we crop out the corresponding image region to obtain \( {\mathbf{z}}_t^i \), yielding a candidate set \( {\mathbf{Z}}_t=\left[{\mathbf{z}}_t^1,{\mathbf{z}}_t^2,\dots, {\mathbf{z}}_t^n\right]\in {\mathbb{R}}^{D\times n} \).
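A minimal sketch of this sampling step follows; the per-parameter standard deviations below are placeholders rather than the values used in our experiments.

```python
import numpy as np

def sample_particles(x_prev, n=600,
                     stds=(4.0, 4.0, 0.01, 0.01, 0.002, 0.001)):
    """Draw n candidate states from p(x_t | x_{t-1}) = N(x_t; x_{t-1}, Sigma),
    Eq. (15), with diagonal Sigma. State layout: [tx, ty, theta, s, delta, phi].
    The per-parameter standard deviations are placeholders, not the paper's."""
    cov = np.diag(np.square(stds))
    return np.random.multivariate_normal(x_prev, cov, size=n)   # shape (n, 6)
```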

For each candidate, we solve the optimization problem using Algorithm 1. The observation likelihood of state \( {\mathbf{x}}_t^i \) is then given as

$$ p\left({\mathbf{z}}_t^i,{\boldsymbol{y}}_t^i\left|{\mathbf{x}}_t^i\right.\right)=\frac{1}{\varphi}\exp \left(-{\delta}_1{\left\Vert {\mathbf{z}}_t^i-\mathbf{M}{\mathbf{a}}_M^i\right\Vert}_2^2\right)\times \exp \left(-{\delta}_2{\left\Vert {\mathbf{y}}_t^i-\mathbf{T}{\mathbf{c}}_T^i\right\Vert}_2^2\right)\times \exp \left(-{\delta}_3{\left\Vert \mathbf{U}{\mathbf{U}}^{\mathrm{T}}\left(\mathbf{T}{\mathbf{c}}_T^i-\overline{\mathbf{t}}\right)-\left(\mathbf{T}{\mathbf{c}}_T^i-\overline{\mathbf{t}}\right)\right\Vert}_2^2\right), $$
(16)

where φ is a normalization factor and δ_1, δ_2, and δ_3 balance the contributions of the three terms. The first term is the reconstruction error of the original image patch; the second is the reconstruction error of the sparse representation by the CNN-feature target templates; the third reflects the consistency between the reconstructed target and the CNN-feature PCA basis. The optimal state \( {\mathbf{x}}_t^{\ast } \) at frame t is obtained by

$$ {\mathbf{x}}_t^{\ast }=\underset{{\boldsymbol{x}}_t^i\in {\mathbf{X}}_{\boldsymbol{t}}}{\mathrm{argmax}}p\left({\boldsymbol{z}}_t^i,{\boldsymbol{y}}_t^i\left|{\boldsymbol{x}}_t^i\right.\right) $$
(17)
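For illustration, the likelihood of Eq. (16) can be evaluated as follows, up to the normalization factor φ, which cancels in the argmax of Eq. (17); the function signature is ours.

```python
import numpy as np

def observation_likelihood(z, y, a_M, c_T, M, T, U, t_bar,
                           d1=10.0, d2=10.0, d3=1.0):
    """Unnormalized likelihood of Eq. (16) for one candidate; phi cancels
    in the argmax of Eq. (17). d1, d2, d3 are delta_1, delta_2, delta_3."""
    proj = T @ c_T - t_bar
    r1 = np.sum((z - M @ a_M) ** 2)              # intensity reconstruction error
    r2 = np.sum((y - T @ c_T) ** 2)              # CNN-template reconstruction error
    r3 = np.sum((U @ (U.T @ proj) - proj) ** 2)  # subspace consistency term
    return float(np.exp(-d1 * r1 - d2 * r2 - d3 * r3))

# Eq. (17): pick the candidate with the highest likelihood, e.g.
# best = int(np.argmax([observation_likelihood(...) for each candidate i]))
```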

3.3 Template update

To cope with object appearance changes, the target template set and the PCA basis dictionary need to be updated dynamically.

First, we use the method proposed in [27] to update the target templates. This updating strategy effectively alleviates the influence of noise and occlusion. However, the target templates come only from a few recent time instants, so on their own they cannot handle long-term appearance variations.

Then, we update the PCA basis dictionary using the method proposed in [34] and, once the optimal state is estimated, replace the oldest target template with the PCA reconstruction of the optimal candidate. The PCA eigen template model effectively learns the temporal correlation of object appearances through an incremental SVD update, so it can cover a relatively long period of appearance changes.
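A rough sketch of this co-update is given below; scikit-learn's IncrementalPCA stands in for the incremental SVD update of [34] (which also tracks the sample mean and a forgetting factor), and the buffer size is illustrative.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

class TemplateUpdater:
    """Sketch of the co-update: IncrementalPCA stands in for the incremental
    SVD of [34], and the oldest non-fixed target template is replaced by the
    PCA reconstruction of the optimal candidate."""

    def __init__(self, templates, n_basis=10, batch=10):
        self.T = templates                    # (D, N); column 0 stays fixed
        self.ipca = IncrementalPCA(n_components=n_basis)
        self.buf, self.batch = [], batch      # mini-batch for partial_fit

    def update(self, y_opt):
        """y_opt: (D,) feature vector of the estimated optimal candidate."""
        self.buf.append(y_opt)
        if len(self.buf) < self.batch:        # wait until a full mini-batch
            return
        self.ipca.partial_fit(np.stack(self.buf))
        self.buf.clear()
        recon = self.ipca.inverse_transform(
            self.ipca.transform(y_opt[None]))[0]
        # drop the oldest non-fixed template (column 1), append the reconstruction
        self.T = np.column_stack([self.T[:, :1], self.T[:, 2:], recon])
```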

By co-updating the target templates and the PCA basis, the model captures both the current target appearance and long-term changes, which improves the performance of our tracker.

4 Experiments

To illustrate the performance of our tracker, we test its robustness on 12 challenging video sequences against 9 state-of-the-art trackers: L1APG [16], ASLA [18], MTT [36], LSK [37], SST [38], IVT [34], FRAG [39], KMS [40], and Struck [41].

The proposed algorithm is implemented in MATLAB 2014a on a PC with an Intel i7-4790 CPU (3.6 GHz) and 16 GB of RAM. Some parameters are adopted from the cited published works and then tuned. For example, the iteration number in the optimization is set to 5; the tracking performance improves little when it is set to 10 or larger, while the computational cost grows with the iteration count. The remaining parameter values were chosen empirically, as follows. Each sample is resized to 24 × 24 pixels. The number of target templates is set to 11, with one fixed template extracted from the first frame. The number of candidates n in each frame is 600. These values mainly affect the speed of the tracker and were chosen to balance speed and tracking performance. The number of PCA basis vectors is set to 10. The regularization factors are λ = 0.01, σ = 0.1, and ρ = 0.01. The balance factors δ_1, δ_2, and δ_3 are set to 10, 10, and 1, respectively. The Lipschitz constant L is set to 8.

4.1 Qualitative evaluation

To evaluate our tracker qualitatively against the other state-of-the-art methods, we choose 12 video sequences that pose many challenging problems; Table 1 lists their characteristics.

Table 1 Tracking sequences used in this paper

Compared with traditional sparse representation-based trackers (e.g., L1APG, MTT, ASLA, LSK, and SST), our tracker performs better across a wide range of challenging scenarios, especially when occlusion, rotation, or deformation occurs. This is mainly because our tracker exploits both sparse representation and incremental subspace learning and uses CNN features to represent the target. Incremental learning of a PCA subspace representation handles appearance changes caused by rotation, deformation, and scale variation, but it is sensitive to occlusion; in our algorithm, the occluded pixels of the target can be absorbed by the trivial templates, so our tracker is more robust than the IVT tracker under partial occlusion. Traditional sparse representation-based trackers are sensitive to rotation and deformation because they rely only on image intensities, whereas our method adopts hierarchical CNN features to exploit high-level structural information while preserving spatial information, making it more robust than these traditional L1 trackers.

For example, in Fig. 2c, k, the target suffers from partial or total occlusion. In these scenarios, the IVT tracker performs poorly, while our tracker copes with them effectively. In Fig. 2a, i, the targets undergo drastic appearance changes; our tracker still handles these situations because the CNN features supply more structural information, while the other L1 trackers fail in most of them.

Fig. 2

a–l Qualitative comparisons. Tracking results of our algorithm and the other 9 state-of-the-art trackers on representative frames of 12 sequences (deer, david2, crossing, boy, david3, faceOcc1, football, walking, sylvester, football1, subway, and mhyang, from left to right and top to bottom). The result of our method is marked with a red rectangle

In conclusion, our tracker performs well on all 12 sequences, while the other 9 state-of-the-art trackers fail on some of them.

4.2 Quantitative evaluation

We provide quantitative comparisons of our tracker with the other state-of-the-art methods in terms of center location error (CLE) and overlap rate (VOR). The CLE is the Euclidean distance between the estimated and ground-truth target center locations. The VOR is defined as \( \frac{\mathrm{area}\left({B}_T\cap {B}_{GT}\right)}{\mathrm{area}\left({B}_T\cup {B}_{GT}\right)} \), where B_T is the estimated target bounding box and B_GT is the ground-truth bounding box.
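Both metrics are straightforward to compute; a minimal sketch, assuming boxes in (x, y, w, h) format:

```python
import numpy as np

def center_location_error(c_est, c_gt):
    """CLE: Euclidean distance between estimated and ground-truth centers."""
    return float(np.linalg.norm(np.asarray(c_est) - np.asarray(c_gt)))

def overlap_rate(b_est, b_gt):
    """VOR: area(B_T ∩ B_GT) / area(B_T ∪ B_GT) for boxes (x, y, w, h)."""
    x1 = max(b_est[0], b_gt[0])
    y1 = max(b_est[1], b_gt[1])
    x2 = min(b_est[0] + b_est[2], b_gt[0] + b_gt[2])
    y2 = min(b_est[1] + b_est[3], b_gt[1] + b_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b_est[2] * b_est[3] + b_gt[2] * b_gt[3] - inter
    return inter / union if union > 0 else 0.0
```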

Figures 3 and 4 show the center error plot and the overlap rate plot of different trackers for each video sequence.

Fig. 3

Center error comparisons on 12 sequences with 9 state-of-the-art trackers

Fig. 4

Overlap rate comparisons on 12 sequences with 9 state-of-the-art trackers

In addition, we adopt precision and success rate to evaluate tracking performance. Precision is the percentage of frames whose estimated location is within a given distance threshold of the ground truth, and success rate is the ratio of successful frames at a given overlap threshold ranging from 0 to 1.
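These curves follow directly from the per-frame CLE and VOR values; a minimal sketch:

```python
import numpy as np

def precision_at(cles, threshold=20.0):
    """Fraction of frames whose center location error is within `threshold`."""
    return float(np.mean(np.asarray(cles) <= threshold))

def success_curve(vors, thresholds=None):
    """Success rate (fraction of frames with VOR above t) for t in [0, 1]."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    vors = np.asarray(vors)
    return np.array([np.mean(vors > t) for t in thresholds])
```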

Figure 5 shows the precision and success plots, with a distance precision threshold of 20 pixels and an overlap success threshold of 0.5. Both plots show that our tracker is more robust than the other state-of-the-art trackers over the 12 video sequences.

Fig. 5

Precision and success plots on 12 sequences with 9 state-of-the-art trackers

Tables 2 and 3 report the average center error and average overlap rate of each tracking method on each sequence. The best three results are marked in red, blue, and green.

Table 2 Average center error for each sequence with 9 state-of-the-art trackers
Table 3 Average overlap rate for each sequence with 9 state-of-the-art trackers. The last row compares computational load in frames per second (fps)

Note that the proposed tracker achieves the best average center error on 6 of the 12 sequences (deer, david2, football1, mhyang, sylvester, and walking) and the best overlap rate on 7 of the 12 (deer, crossing, david3, football1, mhyang, sylvester, and walking). On the remaining sequences, it achieves the second- or third-best scores. It also achieves the best average center error and average overlap rate over all 12 sequences, indicating that it significantly outperforms the other state-of-the-art trackers in many challenging situations.

5 Conclusions

In this paper, we propose a robust L1 tracker with CNN features. Unlike traditional sparse representation-based tracking algorithms, our model not only exploits convolutional features to describe the object appearance more robustly but also uses trivial templates to model the reconstruction errors of both the sparse representation and the eigen-subspace representation. A customized APG method is developed to solve the optimization problem efficiently. Both qualitative and quantitative evaluations demonstrate that our tracker outperforms other state-of-the-art trackers in many challenging situations.