4.1 Problem Statement
We describe the target region in each frame by using a motion state variable defined as
$$\begin{aligned} {\mathbf {z}}=\left\{ x,y,s\right\} , \end{aligned}$$
(4)
where x and y denote the 2D position of the target, and s denotes the scale coefficient. Given the motion state variable, we crop the corresponding region from the frame, resize it to a predefined size, and stack it into a column vector, which we refer to as the appearance observation.
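For concreteness, the following is a minimal sketch of how such an appearance observation could be extracted. It assumes a grayscale frame stored as a NumPy array and a 32x32 template size; both choices, and all function names, are ours rather than part of the original specification.

```python
import numpy as np

def appearance_observation(frame, z, base_size=(32, 32)):
    """Crop the region described by the motion state z = (x, y, s) from a
    grayscale frame, resize it to base_size, and stack it into a column
    vector (the appearance observation).  Boundary clipping is omitted
    for brevity."""
    x, y, s = z
    h = max(int(round(base_size[0] * s)), 1)
    w = max(int(round(base_size[1] * s)), 1)
    top = int(round(y - h / 2.0))
    left = int(round(x - w / 2.0))
    patch = frame[top:top + h, left:left + w].astype(np.float64)
    # Nearest-neighbour resize, kept NumPy-only so the sketch is self-contained.
    rows = np.linspace(0, patch.shape[0] - 1, base_size[0]).round().astype(int)
    cols = np.linspace(0, patch.shape[1] - 1, base_size[1]).round().astype(int)
    resized = patch[np.ix_(rows, cols)]
    return resized.reshape(-1, 1)  # column vector, length 32 * 32
```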
Our goal is to construct an estimator that can reliably predict the expected target appearance in each frame. Specifically, given the appearance observations \({\mathbf {y}}_1,{\mathbf {y}}_2,\dots ,{\mathbf {y}}_{k-1}\) of the previously obtained targets, we estimate the target appearance in the k-th frame as
$$\begin{aligned} \hat{{\mathbf {y}}}_k=\varphi \left( {\mathbf {y}}_1,{\mathbf {y}}_2,\dots ,{\mathbf {y}}_{k-1}\right) \end{aligned}$$
(5)
by using the estimator \(\varphi \left( \cdot \right) \). To make the estimator as accurate as possible, incorporating prior knowledge about the target appearance is desirable. Thus, the estimated appearance of the current target is reformulated as
$$\begin{aligned} \hat{{\mathbf {y}}}_k=\varphi \left( {\mathbf {y}}_1,{\mathbf {y}}_2,\dots ,{\mathbf {y}}_{k-1}|\varPhi \right) \end{aligned}$$
(6)
by incorporating the prior information \(\varPhi \). Then, the region whose corresponding appearance is most similar to \(\hat{{\mathbf {y}}}_k\) is taken as the target region in the current frame. Mathematically, given the set \(\mathcal {C}\) of appearance observations of all target candidates, the current target is located by
$$\begin{aligned} {\mathbf {y}}_k=\arg \min _{{\mathbf {c}}\in \mathcal {C}}\left\| \hat{{\mathbf {y}}}_k-{\mathbf {c}}\right\| . \end{aligned}$$
(7)
From a global perspective, the previously obtained targets are considered to reside in a low-dimensional subspace due to their high similarity in appearance. From a local point of view, local observations (partial target information) can be obtained to help the estimator make a more accurate prediction. As a result, the problem is solved by integrating the global and the local information: we exploit the global correlation to handle the previously obtained targets, and leverage the local information to deal with the target priors.
4.2 Estimator Design
As presented above, the estimator is built on two kinds of information: the appearance observations of the previously obtained targets and the prior knowledge of the target. The design of the estimator is discussed below.
Target summarization. To increase computational efficiency, the tracking model represents the previously obtained targets in a compact form instead of using all appearance observations. Such a compact representation also helps maintain the subspace assumption. To this end, only a limited number of previously obtained targets, namely those that best describe the appearance changes of all obtained targets, are employed as the estimation evidence of the estimator. We follow the target template method [2] to implement the target summarization; the summarized targets are called target templates hereafter.
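For illustration only, the snippet below sketches one plausible summarization policy: a fixed pool of n templates in which a sufficiently novel target replaces the template it most resembles. This is a hedged stand-in for the method of [2], whose exact update rules we do not reproduce here; the threshold tau is hypothetical.

```python
import numpy as np

def update_templates(T, y_new, tau=0.85):
    """Illustrative template summarization.  T is a (d, n) matrix whose
    columns are the current target templates; y_new is the newly obtained
    target (d, 1).  If y_new differs sufficiently from every template
    (maximum cosine similarity below tau), it replaces the template it
    most resembles, keeping the pool size n fixed."""
    Tn = T / (np.linalg.norm(T, axis=0, keepdims=True) + 1e-12)
    yn = y_new / (np.linalg.norm(y_new) + 1e-12)
    sims = (Tn.T @ yn).ravel()       # cosine similarity to each template
    if sims.max() < tau:             # appearance has changed enough to record
        T = T.copy()
        T[:, np.argmax(sims)] = y_new.ravel()   # drop the most redundant one
    return T
```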
Target priors. In the proposed model, the prior knowledge is extracted directly from a number of pixels in the target region, because such direct partial observations are the strongest prior information about the target. There is, however, an obvious paradox: we intend to estimate the target appearance, yet the estimator needs to partially observe that appearance first. For this reason, we observe a number of pixels from each target candidate, which eliminates the paradox. Under the low-dimensional subspace assumption, the true target is expected to be estimated most accurately among all target candidates. The underlying reason is straightforward: the previously obtained targets span a low-dimensional subspace, and the current target, unlike a bad candidate, can be well represented by this subspace.
Based on the preceding analysis, the matrix completion approach is a desirable estimator for our tracking model. On one hand, matrix completion is a reliable estimator for predicting unobserved entries. On the other hand, it implicitly maintains the subspace constraint through rank minimization.
Given an appearance observation, denoted by \({\mathbf {c}}\), of a target candidate in each frame, we use a set \(\varOmega \) to index the observed pixels and treat the rest as missing values. We first generate an observed candidate \({\mathbf {c}}'\) by setting the pixels of \({\mathbf {c}}\) outside \(\varOmega \) to zero and leaving the rest unchanged. Let the matrix \({\mathbf {T}}=\left[ {\mathbf {t}}_1,{\mathbf {t}}_2,\dots ,{\mathbf {t}}_n\right] \) denote the n target templates summarized from \(\left\{ {\mathbf {y}}_1,\dots ,{\mathbf {y}}_{k-1}\right\} \). We construct a new matrix \({\mathbf {Y}}=\left[ {\mathbf {T}},{\mathbf {c}}'\right] \) and estimate the pixels outside \(\varOmega \) using matrix completion over \({\mathbf {Y}}\). For convenience, we address the matrix completion with an equivalent form of Eq. (3), obtained by introducing a slack variable \({\mathbf {E}}\):
$$\begin{aligned} \min _{\mathbf {X}}\left\| {\mathbf {X}}\right\| _*,\quad \text {s.t.}\quad {\mathbf {Y}}={\mathbf {X}}+{\mathbf {E}},\ \mathcal {P}_\varOmega \left( {\mathbf {E}}\right) =0. \end{aligned}$$
(8)
The above minimization problem (8) can be solved by the IALM approach [27]. Let \({\mathbf {X}}^*=\left[ {\mathbf {T}}^*,{\mathbf {x}}\right] \) denote the solution of problem (8), where \({\mathbf {x}}\) is the estimated candidate corresponding to the observed candidate \({\mathbf {c}}'\).
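A compact NumPy sketch of problem (8) with IALM-style updates is given below; the hyper-parameters (the penalty mu, its growth rate rho, and the stopping tolerance) are illustrative choices of ours, not values taken from [27].

```python
import numpy as np

def complete_matrix(Y, omega, max_iter=100, rho=1.5, tol=1e-7):
    """Solve  min ||X||_*  s.t.  Y = X + E,  P_Omega(E) = 0  (problem (8))
    with IALM-style updates.  `omega` is a boolean mask of observed
    entries.  Hyper-parameters are illustrative, not those of the paper."""
    X = np.zeros_like(Y)
    E = np.zeros_like(Y)
    L = np.zeros_like(Y)                          # Lagrange multiplier
    mu = 1.0 / (np.linalg.norm(Y, 2) + 1e-12)     # initial penalty
    for _ in range(max_iter):
        # X-step: singular value thresholding of Y - E + L/mu.
        U, s, Vt = np.linalg.svd(Y - E + L / mu, full_matrices=False)
        X = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # E-step: E is free off Omega, forced to zero on Omega.
        E = np.where(omega, 0.0, Y - X + L / mu)
        # Dual ascent and penalty growth.
        R = Y - X - E
        L += mu * R
        mu = min(mu * rho, 1e10)
        if np.linalg.norm(R) <= tol * np.linalg.norm(Y):
            break
    return X
```

Given the templates \({\mathbf {T}}\) and the masked candidate \({\mathbf {c}}'\), one would form Y = np.hstack([T, c_prime]) with a mask that is true on all template entries and true on \(\varOmega \) for the last column, and then read off \({\mathbf {x}}\) as the last column of the returned matrix.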
4.3 Target Localization
Within the Bayesian sequential inference framework [30, 31], given all the targets \({\mathbf {y}}_{1:k-1}\) obtained before the k-th frame, the motion state of the k-th target, denoted by \({\mathbf {z}}_k\), is predicted through the posterior
$$\begin{aligned} p\left( {\mathbf {z}}_k|{\mathbf {y}}_{1:k-1}\right) =\int p\left( {\mathbf {z}}_k|{\mathbf {z}}_{k-1}\right) p\left( {\mathbf {z}}_{k-1}|{\mathbf {y}}_{1:k-1}\right) d{\mathbf {z}}_{k-1}, \end{aligned}$$
(9)
where \(p\left( {\mathbf {z}}_k|{\mathbf {z}}_{k-1}\right) \) denotes the motion model. Then, a target candidate is generated according to its motion state \({\mathbf {z}}_k\). Thus, the corresponding appearance observation, denoted by \({\mathbf {c}}\), is obtained and the posterior is updated by
$$\begin{aligned} p\left( {\mathbf {z}}_k|{\mathbf {c}},{\mathbf {y}}_{1:k-1}\right) \propto p\left( {\mathbf {c}}|{\mathbf {z}}_k\right) p\left( {\mathbf {z}}_k|{\mathbf {y}}_{1:k-1}\right) , \end{aligned}$$
(10)
where \(p\left( {\mathbf {c}}|{\mathbf {z}}_k\right) \) denotes the observation model. The target in the k-th frame, denoted by \({\mathbf {y}}_k\), is found by
$$\begin{aligned} {\mathbf {y}}_k=\arg \max _{{\mathbf {c}}\in \mathcal {C}}p\left( {\mathbf {z}}_k|{\mathbf {c}},{\mathbf {y}}_{1:k-1}\right) , \end{aligned}$$
(11)
where \(\mathcal {C}\) denotes the set of all candidates, corresponding to a series of regions randomly sampled in the frame according to the motion model \(p\left( {\mathbf {z}}_k|{\mathbf {z}}_{k-1}\right) \).
The motion model in our work is defined as a Gaussian distribution \(p\left( {\mathbf {z}}_k|{\mathbf {z}}_{k-1}\right) \sim \mathcal {N}\left( {\mathbf {z}}_k|{\mathbf {z}}_{k-1},\mathbf {\Sigma }\right) \), where the covariance \(\mathbf {\Sigma }\) is a diagonal matrix whose entries are the variances of the 2D translation and the scaling, respectively. We set \(\mathbf {\Sigma }=\mathrm {diag}\left\{ 3,3,0.005\right\} \) in our experiments. The observation model \(p\left( {\mathbf {c}}|{\mathbf {z}}_k\right) \) reflects the likelihood of the candidate \({\mathbf {c}}\) being the target. As discussed above, a good candidate can be estimated accurately by the matrix completion under our subspace assumption, and the accuracy is measured by means of the estimation error. We define the observation model for a candidate \({\mathbf {c}}\) with motion state \({\mathbf {z}}_k\) as
$$\begin{aligned} p\left( {\mathbf {c}}|{\mathbf {z}}_k\right) \propto \exp \left( -\left\| {\mathbf {c}}-{\mathbf {x}}\right\| \right) . \end{aligned}$$
(12)
For all the candidates and their corresponding motion states, the target in the k-th frame can be located using Eq. (11). Note that under this definition of the observation model, Eq. (11) is equivalent to Eq. (7) and yields the same target location. The implementation details of the tracking algorithm are outlined in Algorithm 1.
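Putting the pieces together, the following hedged sketch performs one tracking step: candidates are sampled from the Gaussian motion model, scored by their completion errors (the smaller the error, the larger the likelihood in Eq. (12)), and the best one is selected as in Eq. (11). It relies on appearance_observation() and complete_matrix() from the earlier sketches; the particle count is our choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def track_one_frame(frame, z_prev, T, omega, n_candidates=600):
    """One tracking step: sample motion states around z_prev, score each
    candidate by its matrix-completion error, and return the best motion
    state and appearance.  omega is a boolean column vector marking the
    observed pixels."""
    # Std-devs from the diagonal variances Sigma = diag{3, 3, 0.005}.
    sigma = np.sqrt(np.array([3.0, 3.0, 0.005]))
    best_err, best_z, best_y = np.inf, None, None
    for _ in range(n_candidates):
        z = z_prev + rng.normal(0.0, sigma)       # motion model sample
        c = appearance_observation(frame, z)      # candidate appearance
        c_masked = np.where(omega, c, 0.0)        # keep only pixels in Omega
        Y = np.hstack([T, c_masked])
        mask = np.ones(Y.shape, dtype=bool)       # templates fully observed
        mask[:, -1] = omega.ravel()               # candidate observed on Omega
        X = complete_matrix(Y, mask)
        err = np.linalg.norm(c - X[:, -1:])       # estimation error
        if err < best_err:                        # min error = max likelihood
            best_err, best_z, best_y = err, z, c
    return best_z, best_y
```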
Below is a demonstration of the proposed approach. As shown in Fig. 1(a), two candidates are marked in red and blue, respectively. The representative target templates are shown in Fig. 1(b). We crop the two target candidates out of the image and resize them to the same size as the target templates, as shown in Fig. 2(a) and (e). Then, we sample a number of pixels of the two candidates at the same locations and use these pixels as the observed values, while the rest are treated as missing values, as shown in Fig. 2(b) and (f), where the missing values are set to zero. Next, we estimate the missing values of each candidate via Eq. (8). Figure 2(c) and (g) show the two estimated candidates, respectively. Their estimation errors are shown in Fig. 2(d) and (h), respectively.
From the above results, it is evident that the good candidate (in red) is estimated much more accurately than the bad one (in blue). As shown in Fig. 2(c), the estimated good candidate is barely influenced by the distractive object (the magazine); in contrast, the estimated bad candidate, shown in Fig. 2(g), differs considerably from its original version in Fig. 2(e). Similar results can be observed from their estimation errors. Most errors of the good candidate are small, and large errors appear only at the location of the distractive object, as shown in Fig. 2(d). In contrast, most errors of the bad candidate are large and scattered over the entire image, as shown in Fig. 2(h). Quantitatively, we also plot the distributions of the absolute estimation errors at the missing entries of the two candidates, as shown in Fig. 3. For most missing entries, the errors of the good candidate are much smaller than those of the bad one. In addition, the residual errors of the good candidates normally converge faster than those of the bad ones, because the good candidates better match the subspace implicitly learned via rank minimization. Typically, the matrix completion runs fewer than 30 iterations for good candidates, while about 40 iterations are required for bad candidates.
The good performance of the matrix completion in this case is attributed to two aspects: the low-dimensional subspace assumption on the previously obtained targets, and the local observations from the candidates. From a global point of view, the previously obtained targets span a low-dimensional subspace, which represents the good candidates well, so that they can be estimated more accurately than the bad ones. From a local perspective, the local observations act as strong priors and promote the accuracy of the estimation. Since the index set \(\varOmega \) is determined according to the previously obtained targets, some pixels observed from a bad candidate may be located on the distractive object, leading to a less accurate estimation.
4.4 Online Update
During tracking, the appearance of the target varies across successive frames. Thus, we need to update the tracker automatically to accommodate these appearance changes. In each frame, a number of pixels of the candidates are sampled so as to alleviate the influence of the distractive objects. Therefore, the set \(\varOmega \) is updated in every frame to exclude such unexpected pixels. Meanwhile, the target templates \({\mathbf {T}}\) are updated accordingly, in order to accurately reflect the appearance changes and satisfy the low-dimensional subspace constraint.
In our work, each pixel of an obtained target is associated with a weight that reflects the probability of this pixel being observed in the next frame. Initially, all weights are set equally. As shown in the demonstration of Figs. 1, 2 and 3, the estimation errors are normally large in the regions of the distractive objects (see Fig. 2(d)). Thus, we adjust the weights in each frame to be inversely proportional to the corresponding estimation errors. To prevent the observed pixels (which always have zero estimation errors) from dominating the update, their weights are constrained during the computation. Finally, we draw the same number of entries randomly according to their weights and use these entries as the new index set \(\varOmega \).
Specifically, in the k-th frame, the weight of the j-th pixel, denoted by \(w_j^k\), is updated by
$$\begin{aligned} w_j^k\propto {\left\{ \begin{array}{ll} \frac{1}{e_j^k}, &{} j\notin \varOmega \\ \frac{1}{e_a+e_j^{k-1}\left( e_b-e_a\right) }, &{} j\in \varOmega , \end{array}\right. } \end{aligned}$$
(13)
where \(e_j^k\) denotes the estimation error of the k-th target at the j-th pixel, and \(e_a\) and \(e_b\) are determined by
$$\begin{aligned} e_{i_1}<e_{i_2}<\dots<e_a<e_m<e_b<\dots <e_{i_N}, \end{aligned}$$
(14)
where \(e_m\) denotes the median value of the N estimation errors, and \(i_k\in \left\{ 1,2,\dots ,N\right\} \). In the above equations, we divide the pixels into two categories and update their associated weights respectively. One category contains the pixels outside the index set \(\varOmega \), i.e., the case \(j\notin \varOmega \) for the j-th pixel of the k-th target. Among these pixels, those with large estimation errors should not be observed in the next frame, since they are likely to be located on the distractive objects. Thus, we directly set their weights inversely proportional to their estimation errors \(e_j^k\). The other category contains the pixels indexed by \(\varOmega \). Because these pixels are the observed ones in the current frame, i.e., they have zero estimation errors, they are expected to be observed in the next frame as well. However, to prevent these pixels from dominating the update, we deliberately decrease their probabilities of being observed to some extent. For this reason, we constrain the probabilities of these pixels to an appropriate range, or equivalently assign them certain errors within a range \(\left[ e_a,e_b\right] \). In practice, the median of the estimation errors of the target in the last frame, i.e., the \(\left( k-1\right) \)-th target, is a reasonable reference for setting \(e_a\) and \(e_b\), so that their values are neither too low nor too high. In our experiments, \(e_a\) and \(e_b\) are set to the errors just below and above the median error, respectively.
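The sketch below implements Eq. (13) and the resampling of \(\varOmega \) under the median rule just described. We assume the previous-frame errors have been normalized to [0, 1] so that the synthetic errors assigned to observed pixels fall inside \(\left[ e_a,e_b\right] \); weighted sampling without replacement is our implementation choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_omega(err_k, err_km1, omega, eps=1e-8):
    """Update the pixel weights by Eq. (13) and redraw the index set Omega.
    err_k / err_km1 are the per-pixel estimation errors of the current and
    previous targets (flattened); omega is the current boolean index set.
    We assume err_km1 has been normalised to [0, 1], so the synthetic
    errors assigned to observed pixels fall inside [e_a, e_b]."""
    N = err_k.size
    # e_a, e_b: errors just below and above the median (Eq. (14)).
    srt = np.sort(err_k)
    m = N // 2
    e_a, e_b = srt[max(m - 1, 0)], srt[min(m + 1, N - 1)]
    w = np.empty(N)
    w[~omega] = 1.0 / (err_k[~omega] + eps)                      # j not in Omega
    w[omega] = 1.0 / (e_a + err_km1[omega] * (e_b - e_a) + eps)  # j in Omega
    w /= w.sum()
    # Draw the same number of observed pixels, weighted, without replacement.
    new_idx = rng.choice(N, size=int(omega.sum()), replace=False, p=w)
    new_omega = np.zeros(N, dtype=bool)
    new_omega[new_idx] = True
    return new_omega
```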
Figure 4 illustrates the online update strategy of \(\varOmega \) between two consecutive frames. It can be seen from Fig. 4(b) that the pixels on the distractive object (the magazine) are more likely to be excluded (i.e., not indexed by \(\varOmega \)) in the next frame. From Fig. 4(c), it is evident that fewer pixels belonging to the distractive object appear in the local observations of the target in the next frame. In our experiments, similar to the work [32], we use ten target templates.