In this section, we give a detailed description of our algorithm to compute a background separation based on the adaptive SVD.
Essential Functionalities
The algorithm in [23] was originally designed to determine the kernels of a sequence of columnwise augmented matrices using the V matrix of the SVDs. In background subtraction, on the other hand, we are interested in a low-rank approximation of the column space of A and therefore concentrate on the U matrix of the SVD, which allows us to avoid the computation and storage of V.
The adaptive SVD algorithm starts with an initialization step called SVDComp, which calculates left singular vectors and singular values on an initial set of data. Afterward, data are added iteratively in blocks of arbitrary size. For every frame in such a block, the foreground is determined, and then the SVDAppend step performs the thresholding described in Sect. 3.2 to check whether the frame is considered in the update of the singular vectors and values corresponding to the background.
SVDComp
SVDComp performs the initialization of the iterative algorithm. It is given the matrix \(A \in \mathbb {R}^{d\times n}\) and a column number \(\ell \) and computes the best rank-\(\ell \) approximation of \(A\) based on the SVD \(A = U\varSigma V^T\), partitioned as
$$\begin{aligned} U =:[U_{0}, U_{0}'], \quad \varSigma =: \begin{bmatrix}\varSigma _0 & 0 \\ 0 & \varSigma _0'\end{bmatrix},\quad V =:[V_0, V_0'], \end{aligned}$$
by means of an SVD with \(\varSigma _0 \in \mathbb {R}^{\ell \times \ell }, U_0 \in \mathbb {R}^{d\times \ell }\), and \(V_0 \in \mathbb {R}^{n\times \ell }\). Also, this SVD is conveniently computed by means of the algorithm from [23], as the thresholding of the augmented SVD will only compute and store an at most rank-\(\ell \) approximation, truncating the R matrix in the augmentation step to at most \(\ell \) columns. This holds both for initialization and update in the iterative SVD.
As mentioned already in Sect. 3.1, \(U_0\) is not stored explicitly but in the form of Householder vectors \(h_j, j = 1,\dots ,\ell \), stored in a matrix \(H_0\). Together with a small matrix \(\widetilde{U}_0 \in \mathbb {R}^{\ell \times \ell }\), we then have
$$\begin{aligned} U_0 = \widetilde{U}_0 \prod _{j=1}^{\ell } (I - h_j h_j^T), \end{aligned}$$
and multiplication with \(U_0\) is easily performed by applying \(\ell \) Householder reflections and then multiplying with an \(\ell \times \ell \) matrix. Since \(V_0\) is not needed in the algorithm, it is neither computed nor stored.
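To make the role of SVDComp concrete, here is a minimal Python/NumPy sketch of the rank-\(\ell \) initialization; all function and variable names are illustrative, and the actual implementation of [23] keeps \(U_0\) in the Householder form described above rather than as a dense matrix.

```python
import numpy as np

def svdcomp(A, ell):
    """Rank-ell initialization: best rank-ell approximation of A.

    Returns U0 (d x ell) and the leading ell singular values; V is
    neither needed nor kept. The actual algorithm stores U0 implicitly
    via Householder vectors H0 and a small ell x ell matrix, which this
    sketch omits for brevity.
    """
    U, s, _ = np.linalg.svd(A, full_matrices=False)  # V^T is discarded
    return U[:, :ell], s[:ell]

# Usage: initialize the background model from the first n frames,
# each vectorized into a column of A (d pixels, n frames).
# U0, sigma0 = svdcomp(A, ell=20)
```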
SVDAppend
This core functionality augments a matrix \(A_k\), given by \(\widetilde{U}_k, \varSigma _k, H_k\) and determined either by SVDComp or by previous applications of SVDAppend, by m new frames contained in the matrix \(B \in \mathbb {R}^{d\times m}\) as described in Sect. 3.1. The details of this algorithm based on the Householder representation can be found in [23]. By the thresholding procedure from Sect. 3.2, one can determine, even before the calculation of the SVD, whether an added column is significant relative to the threshold level \(\tau \). This saves computational effort by avoiding the expensive computation of the SVD for images that do not significantly change the singular vectors representing the background.
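The following sketch illustrates the idea of rejecting an insignificant frame before any SVD update is computed. The residual-norm test is a simplified stand-in for the thresholding of Sect. 3.2, and the names are placeholders rather than the interface of [23].

```python
import numpy as np

def is_significant(U, b, tau):
    """Check whether a new (vectorized) frame b adds a significant
    component outside the current background space span(U).

    Simplified stand-in for the thresholding of Sect. 3.2: the norm of
    the residual after projecting b onto span(U) is compared with tau.
    """
    residual = b - U @ (U.T @ b)   # component of b orthogonal to span(U)
    return np.linalg.norm(residual) > tau

# Frames failing this test leave U and Sigma unchanged, so the costly
# SVD update is skipped for images that are already explained well.
```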
The choice of \(\tau \) is significant for the performance of the algorithm. The basic assumption for the adaptive SVD is that the foreground consists of small changes between frames. Calculating SVDComp on an initial set of frames and considering the singular vectors, i.e., the columns of \(U_0\), and the respective singular values gives an estimate for the size of the singular values whose singular vectors describe the background. With a priori knowledge of the maximal size of foreground effects, \(\tau \) can even be set absolutely to the size of singular values that should be accepted. Of course, this approach requires domain knowledge and is not entirely data driven.
Another heuristic choice of \(\tau \) can be made by considering the difference between two neighboring singular values, \(\sigma _i - \sigma _{i+1}\), i.e., the discrete slope of the singular values. The last and smallest singular values describe the least dominant effects. These model foreground effects or small, negligible effects in the background. With increasing singular values, the importance of the singular vectors grows. Based on that intuition, one can set a threshold for the difference of two consecutive singular values and take as \(\tau \) the first singular value at which this difference falls below the threshold. Figure 1d illustrates a typical distribution of singular values. Since we want the method to be entirely data driven, we choose this approach. The threshold \(\tau \) is determined by \(\hat{i} := \min \left\{ i:\sigma _i - \sigma _{i+1} < \tau ^*\right\} \) and \(\tau =\sigma _{\hat{i}}\), with the threshold \(\tau ^*\) of the slope being determined in the following.
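A small sketch of this data-driven choice, directly following the definition of \(\hat{i}\) (Python/NumPy, illustrative names, zero-based indices):

```python
import numpy as np

def threshold_from_slope(sigma, tau_star):
    """Pick tau = sigma[ihat], where ihat is the first index at which the
    drop between consecutive singular values falls below tau_star."""
    diffs = sigma[:-1] - sigma[1:]        # discrete slope; sigma sorted descending
    candidates = np.flatnonzero(diffs < tau_star)
    ihat = candidates[0] if candidates.size else len(sigma) - 1  # fallback: last index
    return sigma[ihat], ihat
```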
Re-Initialization
The memory footprint at the kth step in the algorithm described in Sect. 3.1 is \(O(n_k^2 + r_k\, d)\) and grows with every frame added in the SVDAppend step. Therefore, a re-initialization of the decomposition is necessary.
One possibility is to compute an approximation of \(A_k \approx U_k \varSigma _k V_k^T \in \mathbb {R}^{d\times n_k}\) or the exact matrix \(A_k\) by applying SVDComp to \(A_k\) with a rank limit of \(\ell \) that determines the number of singular vectors after re-initialization. This strategy has two disadvantages. The first one is that it needs \(V_k\), which is otherwise not needed for modeling the background, and hence would require unnecessary computations. Even worse, though \(\widetilde{U}_0 \in \mathbb {R}^{\ell \times \ell }\), \(\varSigma _0 \in \mathbb {R}^{\ell \times \ell }\), and \(H_0 \in \mathbb {R}^{d\times \ell }\) are reduced properly, the memory consumption of \(V_0 \in \mathbb {R}^{n_k\times \ell }\) still depends on the number of frames added so far.
The second re-initialization strategy, referred to as (II), builds on the idea of a rank-\(\ell \) approximation of a set of frames representing mostly the background. For every frame \(B_i\) added in step k of the SVDAppend, the orthogonal projection
$$\begin{aligned} {U_k(:,1:\hat{i})} ({U_k(:,1:\hat{i})}^T B_i), \end{aligned}$$
i.e., the “background part” of \(B_i\), is stored successively. The value \(\sigma _{\hat{i}}\) is determined in Sect. 4.1.2 as the threshold for the SVDAppend step. If the number of stored background images exceeds a fixed size \(\mu \), the re-initialization is performed via SVDComp on the background images. No matrix V is necessary for this strategy, and the re-initialization is based on the background projection of the most recently appended frames.
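A sketch of strategy (II), assuming a current model \(U\) with threshold index \(\hat{i}\); the SVDComp step is represented here by a plain truncated SVD, and all names are illustrative:

```python
import numpy as np

def collect_background(U, ihat, B_i, buffer):
    """Append the 'background part' of frame B_i to the buffer."""
    Ub = U[:, :ihat + 1]                 # singular vectors up to the threshold index
    buffer.append(Ub @ (Ub.T @ B_i))

def reinitialize_if_full(buffer, mu, ell):
    """Strategy (II): once more than mu background images are stored,
    re-initialize via SVDComp on them; no V matrix is required."""
    if len(buffer) <= mu:
        return None
    A_bg = np.column_stack(buffer)
    buffer.clear()
    U_bg, s_bg, _ = np.linalg.svd(A_bg, full_matrices=False)  # SVDComp on the buffer
    return U_bg[:, :ell], s_bg[:ell]
```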
In the final algorithm, we use a third strategy, referred to as (III), which is inspired by the sequential Karhunen–Loeve basis extraction [16]. The setting is very similar, and the V matrix is dropped after the initialization as well. The update step with a data matrix \(B_k\) is performed just like the update step of the iterative SVD calculation in Sect. 3.1, based on the matrix \([U_k\varSigma _k, B_k]\). The matrices \(\varSigma _{k+1}\) and \(U_{k+1}\) are truncated by a thresholding of the singular values at every update step. Due to this thresholding, the number of singular values and accordingly the number of columns of \(U_k\) have an upper bound. Therefore, the maximum size of the system is fixed and no re-initialization is necessary. Calculating the SVD of \([U_k\varSigma _k, B_k]\) is sufficient since, due to
$$\begin{aligned}{}[U_k\varSigma _k, B_k] [U_k\varSigma _k, B_k]^T&= U_k\varSigma _k \varSigma _k^T U_k^T + B_k B_k^T\\&= U_k\varSigma _k V_k^T V_k \varSigma _k^T U_k^T + B_k B_k^T\\&= [U_k\varSigma _k V_k^T, B_k] [U_k\varSigma _k V_k^T, B_k]^T \end{aligned}$$
the eigenvectors and eigenvalues of the correlation matrices with respect to \([U_k\varSigma _k, B_k]\) and \([U_k\varSigma _k V_k^T, B_k]\) are the same. Therefore, the singular values of \([U_k\varSigma _k, B_k]\) and \([U_k\varSigma _k V_k^T, B_k]\) are the same, being the square roots of the eigenvalues of the correlation matrix. In our approach, we combine the adaptive SVD with the re-initialization based on \(U_k \varSigma _k\), i.e., we perform SVDComp on \(U_k \varSigma _k\), because we want to keep the thresholding of the adaptive SVD. This is essentially the same as an update step in the Karhunen–Loeve setting with \(B_k = 0\) and a more rigorous thresholding, or a simple truncation of \(U_k\) and \(\varSigma _k\). The thresholding strategy of the adaptive SVD (Sect. 3.2) is still valid, as the QR decomposition with column pivoting sorts the columns of the matrix according to their \(\ell _2\) norm and the columns of \(U_k\varSigma _k\) are ordered by the singular values due to \(\Vert (U_k \varSigma _k)_{:,i}\Vert _2 = \sigma _i\). Moreover, \(U_k \varSigma _k\) is already in SVD form, and therefore SVDComp at re-initialization reduces to a QR decomposition to regain Householder vectors and a truncation of \(U_k\) and \(\varSigma _k\), which is less costly than performing a full SVD.
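The re-initialization of strategy (III) can be sketched as follows; since \(U_k \varSigma _k\) is already in SVD form, only a QR decomposition and a truncation are needed. The sketch uses dense NumPy operations and illustrative names rather than the Householder bookkeeping of the actual implementation.

```python
import numpy as np

def reinitialize_strategy_iii(U, sigma, ell):
    """Strategy (III): re-initialize from U_k Sigma_k alone.

    U_k Sigma_k is already in SVD form, so no new SVD is needed:
    a QR decomposition recovers an orthonormal (Householder-based)
    basis and the factors are truncated to the ell leading
    singular values.
    """
    B = U[:, :ell] * sigma[:ell]   # columns ordered by singular value
    Q, _ = np.linalg.qr(B)         # Householder-based QR in LAPACK
    return Q, sigma[:ell]
```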
Since it requires the V matrix, the first re-initialization strategy will not be considered in the following, where we will compare only strategies (II) and (III).
Normalization
The concept of re-initialization via a truncation of \(U_k\) and \(\varSigma _k\), either directly through SVDComp of \(U_k \varSigma _k\) or in the Karhunen–Loeve setting with thresholding of the singular values, still has a flaw: the singular values grow with each frame appended to \(U_k \varSigma _k\), as
$$\begin{aligned} \sum _{i=1}^{n}\sigma _i^2 = \Vert A \Vert _F^2. \end{aligned}$$
Applied to the update step, this yields
$$\begin{aligned} \sum _{i=1}^{n_{k+1}}\sigma _{n_{k+1},i}^2&= \Vert U_{k+1} \varSigma _{k+1} \Vert _F^2\\&\approx \left\| [U_k \varSigma _k, B_k] \right\| _F^2 \\&= \Vert U_k \varSigma _k \Vert _F^2 + \Vert B_k \Vert _F^2. \end{aligned}$$
The approximation results from the thresholding performed at the update step. As only small singular values are truncated, the sum of the squared singular values grows essentially with the Frobenius norm of the appended frames. Growing singular values not only introduce numerical problems, they also deteriorate the thresholding strategies, and the influence of newly added frames decreases in later steps of the method. Therefore, some upper bound or normalization of the singular values is necessary.
The sequential Karhunen–Loeve method [16] introduces a forgetting factor \(\varphi \in [0,1]\) and updates with \([\varphi \, U_k\varSigma _k, B_k]\). The factor is motivated semantically: more recent frames get a higher weight. Ross et al. [26] show that this value limits the observation history: with an appending block size of m, the effective number of observations is \(m/(1-\varphi )\). By the Frobenius norm argument, the singular values then have an upper bound. By the same motivation, the forgetting factor could also be integrated into strategy (III). Moreover, due to
$$\begin{aligned} \Vert (\varphi \, U_k \varSigma _k)_{:,i}\Vert _2 = \left\| \varphi \, \sigma _i \, U_{:,i} \right\| _2 = \varphi \sigma _i, \end{aligned}$$
the multiplication with the forgetting factor keeps the order of the columns of \(U_k\varSigma _k\), scales the 2-norm linearly, and is thus compliant with the thresholding. However, the concrete choice of the forgetting factor is unclear.
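For illustration, a minimal sketch of such an update with forgetting factor \(\varphi \); the plain SVD call stands in for the thresholded update of Sect. 3.1, and all names are illustrative:

```python
import numpy as np

def kl_update_with_forgetting(U, sigma, B_k, phi, ell):
    """Update the background model from [phi * U_k Sigma_k, B_k].

    phi in [0, 1] downweights the history; with block size m the
    effective number of observations is m / (1 - phi) [26].
    """
    M = np.hstack([phi * (U * sigma), B_k])
    U_new, s_new, _ = np.linalg.svd(M, full_matrices=False)
    return U_new[:, :ell], s_new[:ell]   # truncate small singular values
```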
Another idea for normalization is to set an explicit upper bound for the Frobenius norm of observations contributing to the iterative SVD, or, equivalently, to \(\sum \sigma _i^2 = \Vert A\Vert _F^2\). At initialization, i.e., at the first SVDComp, the upper bound is determined by \(\frac{\Vert A\Vert _F^2}{n}\eta \) with n being the number of columns of A and \(\eta \) being the predefined maximum size of the system. This upper bound is a multiple of the mean squared Frobenius norm of an input frame, and we define a threshold \(\rho := \frac{\Vert A\Vert _F}{\sqrt{n}}\sqrt{\eta }\). If the Frobenius norm \(\Vert \varSigma _0\Vert _F\) of the singular values exceeds \(\rho \) after a re-initialization step, \(\varSigma _0\) gets normalized to \(\varSigma _0 \frac{\rho }{\Vert \varSigma _0\Vert _F}\). One advantage of this approach is that the effective system size can be transparently determined by the parameter \(\eta \).
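This normalization can be sketched directly from the formulas above (Python/NumPy, illustrative names):

```python
import numpy as np

def frobenius_bound(A, eta):
    """rho = ||A||_F / sqrt(n) * sqrt(eta): the bound is eta times the
    mean squared Frobenius norm of an input frame, under a square root."""
    n = A.shape[1]
    return np.linalg.norm(A, 'fro') / np.sqrt(n) * np.sqrt(eta)

def normalize_sigma(sigma, rho):
    """Rescale the singular values after re-initialization if their
    Frobenius norm exceeds rho."""
    norm = np.linalg.norm(sigma)
    return sigma * (rho / norm) if norm > rho else sigma
```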
In data science, normalization usually aims for zero mean and unit standard deviation. Making each pixel zero mean over the frames, however, amounts to subtracting the rowwise mean of A, i.e., the mean image, replacing A by \(A\left( I - \tfrac{1}{n} 1 1^T\right) \). This approach is discussed in incremental PCA, cf. [26], but since the mean image usually contributes substantially to the background, it is not suitable in our application.
A framewise unit standard deviation makes sense since the standard deviation approximates the contrast in image processing, and we are interested in the image content regardless of the often varying contrast of the individual frames. Different contrasts on a zero mean image act as a scalar multiplication, which carries over to the singular values. Singular values that differ merely because of contrast are not desirable; this effect is compensated by subtracting the mean and dividing by the standard deviation of each incoming frame B, yielding \(\frac{B-\mu }{\sigma }\). Due to the normalization of the single images, the upper bound \(\rho \) for the Frobenius norm indeed becomes a multiple of the Frobenius norm of an average image.
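A sketch of the framewise normalization, with each column of B holding one vectorized frame (illustrative names):

```python
import numpy as np

def normalize_frames(B):
    """Subtract the mean and divide by the standard deviation of each
    frame (column), removing per-frame brightness and contrast."""
    mu = B.mean(axis=0, keepdims=True)
    sigma = B.std(axis=0, keepdims=True)
    return (B - mu) / sigma
```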
Adaptive SVD Algorithm
Having described the essential components, we can now sketch our method based on the adaptive SVD in Algorithm 1.
The algorithm uses the following parameters:
- \(\ell \): Parameter used in SVDComp for the rank-\(\ell \) approximation.
- \(\eta \): Parameter setting the maximal Frobenius norm as a multiple of the Frobenius norm of an average image.
- \(\tau ^*\): Threshold value for the slope of the singular values used in SVDAppend.
- \(\theta \): Threshold value, depending on the pixel intensity range, to discard noise in the foreground image.
- \(\beta \): Number of frames put together into one block \(B_k\) for SVDAppend.
- \(n^*\): Maximum number of columns of \(U_k\). If \(n^*\) is reached, a re-initialization is triggered.
For the exposition in Algorithm 1, we use pseudo-code with a MATLAB-like syntax. Two further explanations are necessary, however. First, we remark that SVDAppend and SVDComp return the updated matrices U and \(\varSigma \) and the index of the thresholding singular value determined by \(\tau ^*\) as described in Sect. 4.1.2. Second, using the threshold value \(\theta \), the foreground resulting from the subtraction of the background from the input image is binarized. This binarization is used as a mask on the input image to obtain the parts considered foreground. \(|B - J| > \theta \) checks elementwise whether \(| B_{jk} - J_{jk} | > \theta \) and returns a matrix consisting of the Boolean values of this operation.
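The foreground extraction at the end of each iteration can be sketched as follows, where J denotes the background reconstruction of the input frame B; the names are illustrative and the snippet only mirrors the masking step described above.

```python
import numpy as np

def extract_foreground(B, J, theta):
    """Binarize the difference between input frame B and background J
    with threshold theta and use the result as a mask on B."""
    mask = np.abs(B - J) > theta   # elementwise Boolean matrix
    return B * mask                # masked input: the parts considered foreground
```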
Relaxation of the Small Foreground Assumption
A basic assumption of our background subtraction algorithm is that the changes due to the foreground are small relative to the image size. Nevertheless, this assumption is easily violated, e.g., by a truck in traffic surveillance or generally by objects close to the camera, which can then appear in singular vectors that should represent background. This has two consequences: first, the foreground object is not recognized as such; second, ghosting effects arise because of the inner product, as shown in Fig. 1.
The following modifications increase the robustness of our method against these unwanted effects.
Similarity Check
Large foreground objects can exceed the threshold level \(\tau \) in SVDAppend and are therefore falsely included in the background space. With the additional assumption that background effects have to be stable over time, frames with large moving objects can be filtered out by utilizing the block appending property of the adaptive SVD: a large moving object causes significant differences within a block of images, which can be detected by calculating the structural similarity of the new images in the block. Wang et al. [33] propose the normalized covariance of two images to capture the structural similarity. This again can be written as the inner product of normalized images, i.e.,
$$\begin{aligned} s(B_i, B_j) = \frac{1}{d-1}\sum _{l=1}^{d} \frac{B_{i,l} - \mu _{i}}{\sigma _{i}} \frac{B_{j,l} - \mu _{j}}{\sigma _{j}}, \end{aligned}$$
with \(B_i\) and \(B_j\) being two vectorized images with d pixels, mean values \(\mu _i\) and \(\mu _j\), and standard deviations \(\sigma _i\) and \(\sigma _j\). Considering that the input images are already normalized in our algorithm (see Sect. 4.1.4), this boils down to an inner product.
Given a temporally equally spaced and ordered block of images \(B := \{B_1, B_2, \ldots , B_m\}\) and one frame \(B_i\) with \(i \in \{1, 2, \ldots , m\} =: M\), the measure we seek is the structural similarity of frame \(B_i\) with respect to the block B. It can be calculated as
$$\begin{aligned} \frac{1}{m-1}\sum _{j \in M{\setminus }\{i\}} s(B_i,B_j), \end{aligned}$$
i.e., the mean structural similarity of \(B_i\) with respect to B. For the relatively short time span of one block, it generally holds that \(s(B_i, B_j) \ge s(B_i, B_k)\) for \(i, j, k \in M\) with \(i< j < k\), i.e., the structural similarity drops going further into the future as motion in the images implies growing differences. This effect causes the mean structural similarity of the first or last frames of B to be generally lower than that of the middle ones, due to the higher mean time difference to the other frames in the block.
This bias can be avoided by calculating the mean similarity regarding subsets of B. Let \(\nu > 0\) be a fixed number of pairs to be considered for the calculation of the mean similarity and \(\Delta T \in \mathbb {N}^+\) be the fixed cumulative time difference. Calculate the mean similarity \(\overline{s_i}\) of \(B_i\) regarding B by selecting pairwise distinct \(\{j_1, j_2,\ldots , j_{\nu }\}\) from \(M{\setminus }\{i\}\) with
$$\begin{aligned}\sum _{l=1}^{\nu } |j_l - i| = \Delta T\quad \text {and}\quad \overline{s_i} = \frac{1}{\nu } \left( \sum _{l=1}^{\nu } s(B_i, B_{j_l})\right) .\end{aligned}$$
If \(\overline{s_i}\) is smaller than the predefined similarity threshold \(\overline{s}\), frame i is not considered for the SVDAppend.
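As an illustration, a sketch of the similarity check for already normalized, vectorized frames; for simplicity, the \(\nu \) partner frames are chosen as the temporally nearest neighbors of frame i instead of enforcing the exact cumulative time difference \(\Delta T\), and all names are illustrative.

```python
import numpy as np

def mean_similarity(frames, i, nu):
    """Mean structural similarity of frame i with respect to the block.

    For normalized frames the similarity s(B_i, B_j) reduces to an inner
    product; the nu partner frames are picked as the temporally nearest
    neighbors of i, a simplification of choosing pairs with a fixed
    cumulative time difference Delta T.
    """
    d = frames[i].size
    partners = sorted((j for j in range(len(frames)) if j != i),
                      key=lambda j: abs(j - i))[:nu]
    sims = [frames[i] @ frames[j] / (d - 1) for j in partners]
    return np.mean(sims)

# Frames whose mean similarity falls below a threshold s_bar are not
# appended, filtering out blocks disturbed by large moving objects.
```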
Periodic Updates
Using the threshold \(\tau \) speeds up the iterative process but also has a drawback: if the incoming images stay constant over a longer period of time, the background should mostly represent the input images, and high singular values should be associated with the singular vectors describing it. Since input images that can be explained well are no longer appended, this is, however, not the case. Another drawback is that outdated effects, like objects that stayed in the focus for quite some time and then left again, are associated with higher singular values than they should be, as they are no longer relevant. Therefore, it makes sense to periodically append images even though they are considered irrelevant and do not surpass \(\tau \). This also helps to remove falsely added foreground objects much faster.
Effects of the Re-Initialization Strategy
The re-initialization strategy (II) based on the background images \({U_k(:,1:\hat{i})} ({U_k(:,1:\hat{i})} ^T B_i)\) as described in Sect. 4.1.3 supports the removal of incorrectly added foreground objects. Suppose such an object, say X, has disappeared from the scene, i.e., \(B_i\) does not contain X, and \({U_k(:,1:\hat{i})} ({U_k(:,1:\hat{i})} ^T B_i)\) does not contain it either because singular vectors not containing X approximate \(B_i\) much better. As X was added to the background, there must be at least one column \(j^*\) of \(U_k\) containing X, i.e., \({U_k(:,j^*)}^T X \gg 0\). As \({U_k(:,1:\hat{i})} ({U_k(:,1:\hat{i})} ^T B_i)\) does not contain X, the coefficient \(({U_k(:,1:\hat{i})}^T B_i)_{j^*}\) must be close to zero, as otherwise the weighted sum of singular vectors \({U_k(:,1:\hat{i})} ({U_k(:,1:\hat{i})}^T B_i)\) would have to cancel X out. The re-initialization is thus based on images not containing X, and the new singular vectors no longer contain leftovers of X.
Finally, the parameter \(\eta \) modifies the size of the maximum Frobenius norm used for normalization in re-initialization strategy (III) from Sect. 4.1.4. A smaller \(\eta \) reduces the importance of the already determined singular vectors spanning the background space and increases the impact of newly appended images. If an object X was falsely added, it is removed more quickly if current frames not containing X have a higher impact. A behavior similar to that of re-initialization strategy (II) can thus be achieved. The disadvantage is that the background model changes quickly and does not capture long-term effects as well. In the end, it depends on the application which strategy performs better.