Background Subtraction using Adaptive Singular Value Decomposition

An important task when processing sensor data is to distinguish relevant from irrelevant data. This paper describes a method for an iterative singular value decomposition that maintains a model of the background via singular vectors spanning a subspace of the image space, thus providing a way to determine the amount of new information contained in an incoming frame. We update the singular vectors spanning the background space in a computationally efficient manner and provide the ability to perform block-wise updates, leading to a fast and robust adaptive SVD computation. The effects of those two properties and the ability of the overall method to perform state-of-the-art background subtraction are shown in both qualitative and quantitative evaluations.


Motivation
Data-driven approaches are a major topic in image processing and computer vision, leading to state-of-the-art performance, for example in classification or regression tasks. One example is video surveillance used for security, traffic regulation, or as an information source in autonomous driving. The main problems with data-driven approaches are that the training data has to be well balanced, has to cover all scenarios that appear later in the execution phase, and has to be well annotated. In contrast to cameras mounted on moving objects such as vehicles, static cameras mounted on infrastructure observe a scene, e.g. houses, trees, parked cars, that is largely fixed or at least remains static over a large number of frames. If one is interested in moving objects, as is the case in the aforementioned applications, the relevant data is exactly the data that differs from the static data. The reduction of the input data, i.e., the frames taken from the static cameras, to the relevant data, i.e., the moving objects, is important for several applications such as the generation of training data for machine learning approaches or as input for classification tasks, where removing the irrelevant static part reduces false positive detections.
Calling the static part background and the moving objects foreground, the task of dynamic and static part distinction is known as foreground background separation or simply background subtraction.

Background Subtraction as Optimization Problem
Throughout the paper, we assume that the camera is static, that the background is mostly constant up to rare changes and illumination, and that the moving objects, considered as foreground, are small relative to the image size. Then background subtraction can be formulated as an optimization problem. Given an image sequence stacked in vectorized form into the matrix $A \in \mathbb{R}^{d \times n}$, with $d$ being the number of pixels of an image and $n$ being the number of images, foreground-background separation can be modeled as decomposing $A$ into a low-rank matrix $L$, the background, and a sparse matrix $S$, the foreground, cf. [7]. This leads to the optimization problem

$$\min_{L,S} \operatorname{rank}(L) + \lambda \|S\|_0 \quad \text{s.t.} \quad A = L + S. \tag{1}$$

Unfortunately, solving this problem is not feasible. Therefore, adaptations have to be made. Recall that a singular value decomposition (SVD) decomposes a matrix $A \in \mathbb{R}^{d \times n}$ into

$$A = U \Sigma V^T \tag{2}$$

with orthogonal matrices $U \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{n \times n}$ and the diagonal matrix $\Sigma \in \mathbb{R}^{d \times n}$, whose first $r = \operatorname{rank}(A)$ diagonal values are strictly positive and whose remaining entries are zero. The SVD makes no relaxation of the rank, but, given $\ell \le r$, the best (in an $\ell_2$ sense) rank-$\ell$, $\ell \in \mathbb{N}$, estimate $L$ of $A$ can be obtained by using the first $\ell$ singular values and vectors, see [17,18]. This solves the optimization problems

$$\min_{L} \|A - L\|_F \quad \text{and} \quad \min_{L} \|A - L\|_2 \quad \text{s.t.} \quad \operatorname{rank}(L) \le \ell. \tag{3}$$

We use the following notation throughout our paper: for a matrix $M$, $M_{:,1:\ell}$ denotes its first $\ell$ columns and $M_{i:j,:}$ its rows $i$ to $j$, in MATLAB-like index notation.

The aim of a surveillance application is to subtract the background from every incoming image. Modeling the background via (3) results in a batch algorithm, where the low-rank approximations are calculated based on some (recent) sample frames stacked together into the matrix $A$. Note that this allows the background to change slowly over time, for example due to changing illumination or to parked cars leaving the scene. It is well known that the computational effort to determine the SVD of $A$ with dimensions $d \gg n$ is $O(dn^2)$ using the R-SVD and computing only $U_n = U_{:,1:n}$ instead of the complete $d \times d$ matrix $U$, and the memory consumption is $O(dn)$, cf. [8].

arXiv:1906.12064v1 [cs.CV] 28 Jun 2019
Especially in the case of higher-definition images, only rather few samples $n$ can be used in this way. This results in a dependency of the background model on the sample image size and an inability to adapt to a change in the background that is not covered in the few sample frames. Hence, a naive batch algorithm is not a suitable solution.
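To make the batch formulation concrete, the following is a minimal sketch of a rank-limited background model computed from a stack of frames. It uses NumPy's thin SVD rather than the R-SVD mentioned above, and the helper name `best_rank_l` as well as the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def best_rank_l(A, l):
    """Return the best rank-l approximation of A and all singular values."""
    # Thin SVD: U is d x n, s holds n singular values, Vt is n x n (for d >= n).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    L = (U[:, :l] * s[:l]) @ Vt[:l, :]
    return L, s

# Synthetic example: a rank-3 "background" plus small perturbations.
rng = np.random.default_rng(0)
d, n, l = 200, 12, 3
background = rng.standard_normal((d, l)) @ rng.standard_normal((l, n))
A = background + 0.01 * rng.standard_normal((d, n))

L, s = best_rank_l(A, l)
S = A - L                                  # residual, the "foreground" part
err_fro = np.linalg.norm(S, "fro")
err_spec = np.linalg.norm(S, 2)
```

The residual norms equal the discarded singular values (their root sum of squares in the Frobenius norm, the largest discarded one in the spectral norm), which is what makes the truncated SVD optimal among all matrices of rank at most `l`.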

Main Contributions and Outline
The layout of this paper is as follows. In Sec. 2 we briefly review related work in background subtraction and SVD methods. Sec. 3 introduces our algorithm for iteratively calculating an SVD. The main contribution here consists in the application and adaptation of the iterative SVD to background subtraction. In Sec. 4 we propose a concrete algorithm that adapts the model of the background in a way that depends on the incoming data, which is why we call it adaptive SVD. A straightforward version of the algorithm still has limitations; we therefore present extensions of the basic algorithm that overcome these deficits. In Sec. 5 evaluations of the method give an impression of the execution time, generality, and performance capabilities of the adaptive SVD. Finally, in Sec. 6 our main conclusions are outlined.

Related Work
The "philosophical" goal of background modeling is to acquire a background image that does not include any moving objects. In realistic environments, the background may also change, due to influences like illumination or objects being introduced to or removed from the scene. Taking into account these problems as well as robustness and adaptation, background modeling methods can, according to the survey papers [3,4,2], be classified into the following categories: Statistical Background Modeling, Background Modeling via Clustering, Background Estimation and Neural Networks.
The most recent approach is, of course, to model the background via neural networks. Especially convolutional neural networks (CNNs) [10] have performed very well in many tasks of image processing. These techniques, however, usually involve a labeling of the data, i.e., the background has to be annotated, mostly manually, for a set of training images. The network then learns the background based on the labels. Background modeling is often combined with classification or segmentation tasks where every pixel of an image is assigned to one class. Based on the classes, the pixel can then be classified as background or foreground, respectively. Such techniques strongly depend on the training data and, apart from newer approaches like transfer learning [9, p. 526] or reinforcement learning [12], can only be improved by adding new data.
Statistical background modeling includes Gaussian models, support vector machines and subspace learning models. Subspace learning originates from the modeling of the background subtraction task as shown in (1). Our approach therefore also belongs to this domain. Principal Component Pursuit (PCP) [7] is based on the convex relaxation of (1) by

$$\min_{L,S} \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad A = L + S, \tag{4}$$

with $\|L\|_*$ being the nuclear norm of the matrix $L$, the sum of the singular values of $L$. The relaxation (4) can be solved by efficient algorithms such as alternating optimization. As PCP considers the $\ell_1$ error, it is more robust against outliers or salt and pepper noise than SVD based methods and thus better suited to situations that suffer from that type of noise. Since outliers are not a substantial problem in traffic surveillance, which is our main application in mind, we do not have to dwell on this type of robustness. In addition, the pure PCP method also has its limitations such as being a batch algorithm, being computationally expensive compared to SVD, and maintaining the exact rank of the low rank approximation, cf. [5]. This is a problem when it comes to data that is affected by noise in most components, which is usually the case in camera based image processing. We remark that to overcome the drawbacks of plain PCP, many extensions of the PCP have been introduced, see [5,14].

There is naturally a close relationship between our SVD based approach and incremental principal component analysis (PCA) due to the close relationship between SVD and PCA. Given a matrix $A \in \mathbb{R}^{n \times d}$ with $n$ being the number of samples and $d$ the number of features, the PCA searches for the first $k$ eigenvectors of the correlation matrix $A^T A$, which span the same subspace as the first $k$ columns of the $U$ matrix of the SVD of $A^T$, i.e., the left singular vectors of $A^T$. In practice, the PCA is therefore usually calculated by an SVD, and it produces the same subspace as our iterative SVD approach.
One difference is that PCA originates from the statistics domain, and its applications search for the main directions in which the data differs from the mean data sample. That is why the matrix $A$ usually gets normalized by subtracting the columnwise mean and dividing by the columnwise standard deviation before calculating the PCA, which, however, makes no sense in our application. This is also expressed in the work by Ross et al. [15], based on the sequential Karhunen-Loeve basis extraction from [11]. They use the PCA as a feature extractor for a tracking application. In our approach, we model the mean data, the background, by singular vectors only and dig deeper into the application to background subtraction, which, to our knowledge, has not been considered in the PCA context. Nevertheless, we will make further comparisons to the PCA approach, pointing out further similarities and differences to our approach.

Update methods for rank revealing decompositions and applications
Our background subtraction method is based on an iterative calculation of an SVD for matrices augmented by columns, cf. [13]. In this section we review the essential statements and the advantages of using this method for calculating the SVD.

Iterative SVD
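The method from [13] is outlined, in its basic form, as follows: given the SVD $A_k = U_k \Sigma_k V_k^T$ and a block $B_k \in \mathbb{R}^{d \times m_k}$ of new columns, the SVD of $A_{k+1} = [A_k, B_k]$ is computed, where $Q$ results from a QR-decomposition, and $\Sigma_{k+1}$, $\tilde{U}$ and $\tilde{V}$ result from the SVD of an $(r_k + m_k) \times (r_k + m_k)$ matrix. $P_k$ and $\tilde{P}_k$ are permutation matrices.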
For details, see [13]. In the original version of the iterative SVD, the matrix $U_k$ is (formally) of dimension $d \times d$. Since in image processing $d$ captures the number of pixels of one image, an explicit representation of $U_k$ consumes too much memory to be efficient, which suggests representing $U_k$ in terms of Householder reflections. This ensures that the memory consumption of the SVD of $A_k$ is bounded by $O(n_k^2 + r_k d)$, and the step $k + 1$ requires $O(n_{k+1}^3 + d\, m_k (r_k + m_k))$ floating point operations.

Thresholding - Adaptive SVD
There already exist iterative methods to calculate an SVD, but for our purpose the approach from [13] has two favorable aspects. The first one is the possibility to perform blockwise updates with $m_k > 1$, that is, with several frames. The second one is the ability to estimate the effect of appending $B_k$ on the singular values of $A_{k+1}$. In order to compute the SVD of $A_{k+1}$, $Z := U_k^T B_k$ is first calculated and a QR decomposition with column pivoting $Z_{r_k+1:d,:} = QRP$ is determined. The $R$ matrix contains the information in the added data $B_k$ that is not already described by the singular vectors in $U_k$. Then, the matrix $R$ can be truncated by a significance level $\tau$ such that the singular values less than $\tau$ are set to zero in the SVD calculation of

$$\begin{bmatrix} \Sigma_k & Z_{1:r_k,:} P^T \\ 0 & R \end{bmatrix}.$$

Therefore, one can determine from the (cheap) calculation of a QR decomposition alone whether the new data contains significant new information, and the threshold level $\tau$ controls how big the gain has to be for a data vector to be added to the current SVD in an iterative step.
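The update step can be illustrated with a simplified sketch. Note the assumptions: it stores the left singular vectors explicitly as a thin $d \times r$ matrix instead of using Householder reflections, it uses a plain QR decomposition without column pivoting, and it thresholds the singular values of the small matrix directly. The function name `svd_append` and the test data are hypothetical; this is not the implementation from [13].

```python
import numpy as np

def svd_append(U, s, B, tau):
    """Append the columns of B to the model (U, s), dropping singular values < tau.

    U: (d, r) orthonormal columns, s: (r,) singular values, B: (d, m) new frames.
    """
    r = U.shape[1]
    Z1 = U.T @ B                       # part of B already described by U
    residual = B - U @ Z1              # part of B orthogonal to span(U)
    Q, R = np.linalg.qr(residual)      # simplification: no column pivoting
    m = R.shape[0]
    # Small (r + m) x (r + m) matrix whose SVD yields the updated factors.
    K = np.block([[np.diag(s), Z1],
                  [np.zeros((m, r)), R]])
    Uk, sk, _ = np.linalg.svd(K)
    U_new = np.hstack([U, Q]) @ Uk
    keep = sk >= tau
    return U_new[:, keep], sk[keep]

rng = np.random.default_rng(0)
d, r, n, m = 300, 4, 20, 3
A = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))   # exact rank r
U0, s0, _ = np.linalg.svd(A, full_matrices=False)
U0, s0 = U0[:, :r], s0[:r]

# Case 1: genuinely new frames enlarge the model.
B_new = rng.standard_normal((d, m))
U1, s1 = svd_append(U0, s0, B_new, tau=1e-8)
reference = np.linalg.svd(np.hstack([A, B_new]), compute_uv=False)[: len(s1)]

# Case 2: frames already explained by U0 add no significant singular values.
U2, s2 = svd_append(U0, s0, A[:, :2], tau=1e-6)
```

In the first case the updated singular values agree with those of the augmented matrix computed from scratch; in the second case the rank of the model does not grow, which is the effect the threshold $\tau$ is meant to achieve.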

Description of the Algorithm
In this section, we give a detailed description of our algorithm to compute a background separation based on the adaptive SVD.

Essential Functionalities
The algorithm in [13] was initially designed with the goal to determine the kernels of a sequence of columnwise augmented matrices using the $V$ matrix of the SVDs. In background subtraction, on the other hand, we are interested in finding a low rank approximation of the column space of $A$ and therefore concentrate on the $U$ matrix of the SVD, which allows us to avoid computation and storage of $V$. The adaptive SVD algorithm starts with an initialization step called SVDComp, calculating left singular vectors and singular values on an initial set of data. Afterwards, data is added iteratively by blocks of arbitrary size. For every frame in such a block, the foreground is determined and then the SVDAppend step performs the thresholding described in Sec. 3.2 to check whether the frame is considered in the update of the singular vectors and values that correspond to the background.

SVDComp
SVDComp performs the initialization of the iterative algorithm. It is given the matrix $A \in \mathbb{R}^{d \times n}$ and a column number $\ell$ and computes the best rank-$\ell$ approximation $U_0 \Sigma_0 V_0^T$ of $A$ by means of an SVD with $\Sigma_0 \in \mathbb{R}^{\ell \times \ell}$, $U_0 \in \mathbb{R}^{d \times \ell}$, and $V_0 \in \mathbb{R}^{n \times \ell}$. This SVD, too, is conveniently computed by means of the algorithm from [13], as the thresholding of the augmented SVD will only compute and store an approximation of rank at most $\ell$, truncating the $R$ matrix in the augmentation step to at most $\ell$ columns. This holds both for initialization and update in the iterative SVD. As mentioned already in Sec. 3.1, $U_0$ is not stored explicitly but in the form of Householder vectors $h_j$, $j = 1, \dots, \ell$, stored in a matrix $H_0$. Together with a small $\ell \times \ell$ matrix, these vectors represent $U_0$, and multiplication with $U_0$ is easily performed by applying the Householder reflections and then multiplying with an $\ell \times \ell$ matrix. Since $V_0$ is not needed in the algorithm, it is neither computed nor stored.

SVDAppend
This core functionality augments a matrix $A_k$, given by $U_k$, $\Sigma_k$, $H_k$, determined either by SVDComp or previous applications of SVDAppend, by $m$ new frames contained in the matrix $B \in \mathbb{R}^{d \times m}$ as described in Sec. 3.1. The details of this algorithm based on the Householder representation can be found in [13]. By the thresholding procedure from Sec. 3.2 one can determine, even before the calculation of the SVD, whether an added column is significant relative to the threshold level $\tau$. This saves computational effort by avoiding the expensive computation of the SVD for images that do not significantly change the singular vectors representing the background.
The choice of τ is significant for the performance of the algorithm. The basic assumption for the adaptive SVD is that the foreground consists of small changes between frames. Calculating SVDComp on an initial set of frames and considering the singular vectors, i.e., the columns of U 0 , and the respective singular values gives an estimate for the size of the singular values that correspond to singular vectors describing the background. With a priori knowledge of the maximal size of foreground effects, τ can even be set absolutely to the size of singular values that should be accepted. Of course, this approach requires domain knowledge and is not entirely data driven.
Another heuristic choice of $\tau$ can be made by considering the difference between two neighboring singular values $\sigma_i - \sigma_{i+1}$, i.e., the discrete slope of the singular values. The last and smallest singular values describe the least dominant effects. These model foreground effects or small, negligible effects in the background. With increasing singular values, the importance of the singular vectors grows. Based on that intuition, one can set a threshold for the difference of two consecutive singular values and take the first singular value exceeding the difference threshold as $\tau$. Fig. 1d illustrates a typical distribution of singular values. Since we want the method to be entirely data driven, we choose this approach. The threshold $\tau$ is determined by $\hat{i} := \min \{ i : \sigma_i - \sigma_{i+1} < \tau^* \}$ and $\tau = \sigma_{\hat{i}}$, with the threshold $\tau^*$ of the slope being determined in the following.
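The slope-based rule can be sketched in a few lines. The function name `slope_threshold`, the 0-based indexing, and the fallback for the case that the slope never drops below $\tau^*$ are assumptions for illustration; the example values are made up.

```python
def slope_threshold(sigma, tau_star):
    """Return (i_hat, tau) for singular values sigma sorted in descending order.

    i_hat is the first 0-based index i with sigma[i] - sigma[i + 1] < tau_star,
    and tau is the singular value at that index.
    """
    for i in range(len(sigma) - 1):
        if sigma[i] - sigma[i + 1] < tau_star:
            return i, sigma[i]
    # Assumption: if the curve never flattens, fall back to the smallest value.
    return len(sigma) - 1, sigma[-1]

# Made-up singular values: steep at first, then flattening out.
sigma = [10.0, 6.0, 3.0, 2.9, 2.8]
i_hat, tau = slope_threshold(sigma, tau_star=0.5)
```

Here the differences are 4.0, 3.0, 0.1, 0.1, so the first index with a slope below 0.5 is the third singular value, which then serves as $\tau$.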

Re-initialization
The memory footprint at the $k$-th step in the algorithm described in Sec. 3.1 is $O(n_k^2 + r_k d)$ and grows with every frame added in the SVDAppend step. Therefore, a re-initialization of the decomposition is necessary.
One possibility is to apply SVDComp, with a rank limit of $\ell$ that determines the number of singular vectors after re-initialization, either to the approximation $A_k \approx U_k \Sigma_k V_k^T \in \mathbb{R}^{d \times n_k}$ or to the exact matrix $A_k$. This strategy has two disadvantages. The first one is that it needs $V_k$, which is otherwise not needed for modeling the background and hence would require unnecessary computations. Even worse, though the small $\ell \times \ell$ factor of $U_0$, $\Sigma_0 \in \mathbb{R}^{\ell \times \ell}$, and $H_0 \in \mathbb{R}^{d \times \ell}$ are reduced properly, the memory consumption of $V_0 \in \mathbb{R}^{n_k \times \ell}$ still depends on the number of frames added so far.
The second re-initialization strategy, referred to as (II), builds on the idea of a rank-$\ell$ approximation of a set of frames representing mostly the background. For every frame $B_i$ added in step $k$ of the SVDAppend, the orthogonal projection

$$U_k(:, 1:\hat{i}) \left( U_k(:, 1:\hat{i})^T B_i \right),$$

i.e. the "background part" of $B_i$, gets stored successively. The value $\sigma_{\hat{i}}$ is determined in Sec. 4.1.2 as threshold for the SVDAppend step. If the number of stored background images exceeds a fixed size $\mu$, the re-initialization gets performed via SVDComp on the background images. No matrix $V$ is necessary for this strategy, and the re-initialization is based on the background projection of the most recently appended frames.
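Strategy (II) can be sketched as a buffer of projected frames that triggers a re-initialization once it holds more than $\mu$ images. In this sketch a plain truncated NumPy SVD stands in for SVDComp, the basis `U` is a random orthonormal matrix, and all names and sizes are hypothetical.

```python
import numpy as np

def background_part(U, i_hat, B):
    """Orthogonal projection of B onto the span of the first i_hat columns of U."""
    Ub = U[:, :i_hat]
    return Ub @ (Ub.T @ B)

def reinitialize(stored, l):
    """Stand-in for SVDComp: thin SVD of the stored images, cut to rank l."""
    U, s, _ = np.linalg.svd(np.column_stack(stored), full_matrices=False)
    return U[:, :l], s[:l]

rng = np.random.default_rng(1)
d, i_hat, mu, l = 100, 3, 5, 2
U, _ = np.linalg.qr(rng.standard_normal((d, 4)))   # hypothetical current basis

stored = []
for _ in range(mu + 1):
    frame = rng.standard_normal(d)                 # hypothetical incoming frame
    stored.append(background_part(U, i_hat, frame))
    if len(stored) > mu:                           # buffer exceeds the fixed size mu
        U_new, s_new = reinitialize(stored, l)
        stored = []
```

Because only projections onto the first $\hat{i}$ singular vectors are stored, the re-initialized basis lies in their span, and no $V$ matrix is involved at any point.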
In the final algorithm we use a third strategy, referred to as (III), which is inspired by the sequential Karhunen-Loeve basis extraction [11]. The setting is very similar, and the $V$ matrix gets dropped after the initialization as well. The update step with a data matrix $B_k$ is performed just like the update step of the iterative SVD calculation in Sec. 3.1, based on the matrix $[U_k \Sigma_k, B_k]$. The matrices $\Sigma_{k+1}$ and $U_{k+1}$ get truncated by a thresholding of the singular values at every update step. Due to this thresholding, the number of singular values and accordingly the number of columns of $U_k$ has an upper bound. Therefore, the maximum size of the system is fixed and no re-initialization is necessary.
Dropping $V_k$ is justified because, with $A_k = U_k \Sigma_k V_k^T$, the eigenvectors and eigenvalues of the correlation matrices $[A_k, B_k][A_k, B_k]^T$ and $[U_k \Sigma_k, B_k][U_k \Sigma_k, B_k]^T$ are the same, and so are the singular values, being roots of the eigenvalues of the correlation matrix. In our approach we combine the adaptive SVD with the re-initialization based on $U_k \Sigma_k$, i.e. we perform SVDComp on $U_k \Sigma_k$, because we want to keep the thresholding of the adaptive SVD. This is essentially the same as an update step in the Karhunen-Loeve setting with $B_k = 0$ and a more rigorous thresholding or a simple truncation of $U_k$ and $\Sigma_k$. The thresholding strategy of the adaptive SVD from Sec. 3.2 is still valid, as the QR-decomposition with column pivoting sorts the columns of the matrix according to the 2-norm, and the columns of $U_k \Sigma_k$ are ordered by the singular values due to $\|(U \Sigma)_{:,i}\|_2 = \sigma_i$. $U_k \Sigma_k$ already is in SVD form, and therefore SVDComp at re-initialization is reduced to a QR decomposition to regain Householder vectors and a truncation of $U_k$ and $\Sigma_k$, which is less costly than performing a full SVD.
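The reason why the $V$ matrix can be dropped in the update based on $[U_k \Sigma_k, B_k]$ can be written out in one line of algebra. The following short derivation assumes that $A_k = U_k \Sigma_k V_k^T$ holds exactly, i.e. before any truncation, and uses only the orthogonality of $V_k$:

```latex
\begin{aligned}
[A_k, B_k]\,[A_k, B_k]^T
  &= A_k A_k^T + B_k B_k^T \\
  &= U_k \Sigma_k V_k^T V_k \Sigma_k^T U_k^T + B_k B_k^T \\
  &= (U_k \Sigma_k)(U_k \Sigma_k)^T + B_k B_k^T \\
  &= [U_k \Sigma_k, B_k]\,[U_k \Sigma_k, B_k]^T .
\end{aligned}
```

Both augmented matrices therefore share the same left singular vectors and singular values; once singular values have been truncated, the identity holds only approximately.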
Since it requires the V matrix, the first reinitialization strategy will not be considered in the following, where we will compare only the strategies (II) and (III).

Normalization
The concept of re-initialization via a truncation of $U_k$ and $\Sigma_k$, either directly through SVDComp of $U_k \Sigma_k$ or in the Karhunen-Loeve setting with thresholding of the singular values, still has a flaw: the absolute value of the singular values grows with each frame appended to $U_k \Sigma_k$. For the exact decomposition of $A_{k+1} = [A_k, B_k]$,

$$\sum_i \sigma_i^2 = \|A_{k+1}\|_F^2 = \|A_k\|_F^2 + \|B_k\|_F^2 .$$

This also accounts for the truncated decomposition, where

$$\|\Sigma_{k+1}\|_F^2 \approx \|\Sigma_k\|_F^2 + \|B_k\|_F^2 .$$

The approximation results from the thresholding performed at the update step. As only small singular values get truncated, the sum of the squared singular values grows essentially with the Frobenius norm of the appended frames. Growing singular values not only introduce numerical problems, they also deteriorate thresholding strategies, and the influence of newly added single frames decreases in later steps of the method. Therefore, some upper bound or normalization of the singular values is necessary. The Karhunen-Loeve approach [11] introduces a forgetting factor $\varphi \in [0, 1]$ and updates with $[\varphi U_k \Sigma_k, B_k]$. This factor is motivated semantically: more recent frames get a higher weight. Ross et al. [15] show that this value limits the observation history. With an appending block size of $m$, the effective number of observations is $m/(1 - \varphi)$. By the Frobenius norm argument, the singular values then have an upper bound. By the same motivation, the forgetting factor could also be integrated into strategy (III). Moreover, due to $\|(\varphi U_k \Sigma_k)_{:,i}\|_2 = \varphi \sigma_i \|U_{:,i}\|_2 = \varphi \sigma_i$, the multiplication with the forgetting factor keeps the order of the columns of $U_k \Sigma_k$ and affects the 2-norm linearly, and is thus compliant with the thresholding. However, the concrete choice of the forgetting factor is unclear.
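The boundedness under a forgetting factor can be checked numerically. The sketch below iterates the squared Frobenius norm of $[\varphi U_k \Sigma_k, B_k]$ under two simplifying assumptions: every appended block has the same squared Frobenius norm (`frame_energy`, a made-up value), and truncation is ignored. The function name is hypothetical.

```python
import math

def squared_norm_history(phi, frame_energy, steps):
    """Iterate e_{k+1} = phi**2 * e_k + frame_energy, starting from e_0 = 0."""
    e = 0.0
    history = []
    for _ in range(steps):
        e = phi ** 2 * e + frame_energy
        history.append(e)
    return history

phi, m = 0.95, 4
frame_energy = 1.0                          # assumed squared norm of one block
history = squared_norm_history(phi, frame_energy, steps=500)
bound = frame_energy / (1.0 - phi ** 2)     # limit of the geometric series
effective_observations = m / (1.0 - phi)    # formula quoted in the text above
```

The sequence increases monotonically towards `bound`, so the sum of squared singular values cannot grow without limit as long as $\varphi < 1$; with $\varphi = 1$ the recursion reduces to the unbounded growth described above.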
Another idea for normalization is to set an explicit upper bound for the Frobenius norm of observations contributing to the iterative SVD, or, equivalently, to $\sum_i \sigma_i^2 = \|A\|_F^2$. At initialization, i.e. at the first SVDComp, the upper bound is determined by $\frac{\|A\|_F^2}{n}\,\eta$ with $n$ being the number of columns of $A$ and $\eta$ being the predefined maximum size of the system. This upper bound is a multiple of the mean squared Frobenius norm of an input frame, and we define a threshold $\rho$ from it: if the Frobenius norm $\|\Sigma_0\|_F$ of the singular values exceeds $\rho$ after a re-initialization step, $\Sigma_0$ gets normalized to $\Sigma_0 \frac{\rho}{\|\Sigma_0\|_F}$. One advantage of this approach is that the effective system size can be transparently determined by the parameter $\eta$.
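A minimal sketch of this normalization follows. It assumes that $\rho$ is the square root of the bound $\frac{\|A\|_F^2}{n}\,\eta$, so that it can be compared with the non-squared norm $\|\Sigma_0\|_F$; this reading, the function names, and the numbers are assumptions made for illustration.

```python
import math

def frobenius_bound(frame_sq_norms, eta):
    """rho derived from the mean squared Frobenius norm of the initial frames."""
    mean_sq = sum(frame_sq_norms) / len(frame_sq_norms)
    # Assumption: rho bounds the non-squared norm, hence the square root.
    return math.sqrt(mean_sq * eta)

def normalize_sigma(sigma, rho):
    """Rescale the singular values so that their Frobenius norm is at most rho."""
    norm = math.sqrt(sum(v * v for v in sigma))
    if norm <= rho:
        return list(sigma)
    return [v * rho / norm for v in sigma]

# Four initial frames with squared norm 4 each and a system size eta = 25.
rho = frobenius_bound([4.0, 4.0, 4.0, 4.0], eta=25)
scaled = normalize_sigma([30.0, 40.0], rho)      # norm 50 exceeds rho
unchanged = normalize_sigma([3.0, 4.0], rho)     # norm 5 is below rho
```

The rescaling multiplies all singular values by the same factor, so their order, and with it the thresholding, is unaffected.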
In data science, normalization usually aims for zero mean and standard deviation one. Zero mean over the pixels in the frames, however, leads to subtracting the row-wise mean of $A$, replacing $A$ by $A\,(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T)$. This approach is discussed in incremental PCA, cf. [15], but since the mean image usually contributes substantially to the background, it is not suitable in our application.
A framewise unit standard deviation makes sense since the standard deviation approximates the contrast in image processing, and we are interested in the image content regardless of the often varying contrast of the individual frames. Different contrasts on a zero mean image can be seen as a scalar multiplication, which also applies to the singular values. Singular values that differ with the contrast are not desirable; this effect is compensated by subtracting the mean and dividing by the standard deviation of incoming frames $B$, yielding $\frac{B - \mu}{\sigma}$. Due to the normalization of single images, the upper bound $\rho$ for the Frobenius norm is rather a multiple of the Frobenius norm of an average image.
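The framewise normalization is easy to state in code. The sketch below vectorizes a frame and rescales it to zero mean and unit standard deviation; the function name and the handling of a constant frame are assumptions.

```python
import numpy as np

def normalize_frame(frame):
    """Vectorize a frame and scale it to zero mean and unit standard deviation."""
    v = np.asarray(frame, dtype=float).ravel()
    std = v.std()
    if std == 0.0:
        # Assumption: a constant frame carries no content and maps to zeros.
        return v - v.mean()
    return (v - v.mean()) / std

frame = np.array([[0.0, 50.0], [100.0, 150.0]])
b = normalize_frame(frame)
# The same scene with different brightness and contrast.
b_contrast = normalize_frame(3.0 * frame + 20.0)
```

Both versions of the frame map to the same vector, which illustrates why singular values no longer depend on the contrast of individual frames after this step.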

Adaptive SVD Algorithm
The essential components being described, we can now sketch our method based on the adaptive SVD in Alg. 1. For the exposition in Alg. 1 we use pseudo-code with a MATLAB-like syntax. Two further explanations are necessary, however. First, we remark that SVDAppend and SVDComp return the updated matrices $U$ and $\Sigma$ and the index of the thresholding singular value determined by $\tau^*$ as described in Sec. 4.1.2. Second, using the threshold value $\theta$, the foreground resulting from the subtraction of the background from the input image gets binarized. This binarization is used as a mask on the input image to obtain the parts that are considered as foreground. $|B - J| > \theta$ checks elementwise whether $|B_{jk} - J_{jk}| > \theta$ and returns a matrix consisting of the Boolean values of this operation.
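The binarization step can be sketched as follows, assuming that $J$ denotes the projection of the input $B$ onto the current background space spanned by the columns of `U`. The function name and the toy data are hypothetical, and the sketch does not reproduce Alg. 1.

```python
import numpy as np

def foreground(U, B, theta):
    """Return the Boolean foreground mask and the masked input image."""
    J = U @ (U.T @ B)                  # background estimate of B
    mask = np.abs(B - J) > theta       # elementwise comparison
    return mask, np.where(mask, B, 0.0)

# Toy example: a constant background direction and one bright "object" pixel.
d = 6
U = np.ones((d, 1)) / np.sqrt(d)
B = np.array([2.0, 2.0, 2.0, 7.0, 2.0, 2.0])
mask, fg = foreground(U, B, theta=1.0)
```

Only the pixel that deviates strongly from its background estimate survives the mask; the remaining pixels are set to zero in the foreground image.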

Relaxation of the small foreground assumption
A basic assumption of our background subtraction algorithm is that the changes due to the foreground are small relative to the image size. Nevertheless, this assumption is easily violated, e.g. by a truck in traffic surveillance or generally by objects close to the camera, which can appear in singular vectors that should represent background. This has two consequences. The first is that the foreground object is not recognized as such; the second is that it leads to ghosting effects because of the inner product, as shown in Fig. 1. The following modifications increase the robustness of our method against these unwanted effects.

Similarity Check
Big foreground objects can exceed the threshold level $\tau$ in SVDAppend and therefore are falsely included in the background space. With the additional assumption that background effects have to be stable over time, frames with large moving objects can be filtered out by utilizing the block appending property of the adaptive SVD. There, a large moving object causes significant differences in a block of images, which can be detected by calculating the structural similarity of a block of new images. Wang et al. propose in [20] the normalized covariance of two images to capture the structural similarity. This again can be written as the inner product of normalized images, i.e.,

$$s(B_i, B_j) = \frac{1}{d-1} \left( \frac{B_i - \mu_i}{\sigma_i} \right)^T \left( \frac{B_j - \mu_j}{\sigma_j} \right),$$

with $B_i$ and $B_j$ being two vectorized images with $d$ pixels, means $\mu_i$, $\mu_j$ and standard deviations $\sigma_i$ and $\sigma_j$. Taking into account that the input images already become normalized in our algorithm, see Sec. 4.1.4, the similarity of two frames reduces to their scaled inner product. For a block $B$ of frames with index set $M$, we consider the average of $s(B_i, B_j)$ over the frames $B_j$, $j \in M$, of the block, i.e., the mean structural similarity of $B_i$ regarding $B$. For the relatively short time span of one block it generally holds that $s(B_i, B_j) \geq s(B_i, B_k)$ with $i, j, k \in M$ and $i < j < k$, i.e., the structural similarity drops going further into the future, as motions in the images imply growing differences. This effect causes the mean structural similarity of the first or last frames of $B$ to be generally lower than that of the middle ones, due to the higher mean time difference to the other frames in the block.
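The similarity check can be sketched as the normalized covariance of two vectorized frames, averaged over a block. The sketch averages over all other frames of the block, which is an assumption, as are the function names, the threshold value `s_min`, and the synthetic block.

```python
import numpy as np

def structural_similarity(bi, bj):
    """Normalized covariance of two vectorized images."""
    bi = np.asarray(bi, dtype=float).ravel()
    bj = np.asarray(bj, dtype=float).ravel()
    zi = (bi - bi.mean()) / bi.std(ddof=1)
    zj = (bj - bj.mean()) / bj.std(ddof=1)
    return float(zi @ zj) / (bi.size - 1)

def mean_similarity(block, i):
    """Mean similarity of column i to all other columns of the block."""
    others = [j for j in range(block.shape[1]) if j != i]
    total = sum(structural_similarity(block[:, i], block[:, j]) for j in others)
    return total / len(others)

# Three frames with identical structure and one with reversed structure.
base = np.linspace(0.0, 1.0, 50)
block = np.column_stack([base, 2.0 * base + 1.0, 0.5 * base, base[::-1]])

s_min = 0.0                                   # hypothetical similarity threshold
scores = [mean_similarity(block, i) for i in range(block.shape[1])]
kept = [i for i, s_i in enumerate(scores) if s_i >= s_min]
```

The first three frames differ only in brightness and contrast and are kept, while the structurally different fourth frame falls below the threshold and would not be passed to SVDAppend.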
This bias can be avoided by calculating the mean similarity regarding subsets of $B$. Let $\nu > 0$ be a fixed number of pairs to be considered for the calculation of the mean similarity and $\Delta T \in \mathbb{N}_+$ be the fixed cumulative time difference. Calculate the mean similarity $s_i$ of $B_i$ with respect to $B$ by selecting $\nu$ pairwise distinct frames of the block whose time differences to $B_i$ add up to $\Delta T$. If $s_i$ is smaller than the predefined similarity threshold $s$, frame $i$ is not considered for the SVDAppend.

Periodic Updates
Using the threshold $\tau$ speeds up the iterative process, but also has a drawback: if the incoming images stay constant over a longer period of time, the background should mostly represent the input images, and there should be high singular values associated with the singular vectors describing it. Since input images that can be explained well do not get appended anymore, this is, however, not the case. Another drawback is that outdated effects, like objects that stayed in the focus for quite some time and then left again, have higher singular values than they should, as they are not relevant any more. Therefore, it makes sense to periodically append images although they are seen as irrelevant and do not surpass $\tau$. This also helps to remove falsely added foreground objects much faster.

Effects of the re-initialization strategy
The re-initialization strategy (II) based on the background images $U_k(:, 1:\hat{i})(U_k(:, 1:\hat{i})^T B_i)$ as described in Sec. 4.1.3 supports the removal of incorrectly added foreground objects. Suppose such an object, say $X$, is gone from the scene, i.e., $B_i$ does not contain $X$. Then $U_k(:, 1:\hat{i})(U_k(:, 1:\hat{i})^T B_i)$ does not contain it either, because a singular vector not containing $X$ approximates $B_i$ much better. As $X$ was added to the background, there must be at least one column $j^*$ of $U_k$ containing $X$, i.e., $U_k(:, 1:j^*)^T X \gg 0$. As $U_k(:, 1:\hat{i})(U_k(:, 1:\hat{i})^T B_i)$ does not contain $X$, $(U_k(:, 1:\hat{i})^T B_i)_{j^*}$ must be close to zero, as otherwise the weighted sum of singular vectors $U_k(:, 1:\hat{i})(U_k(:, 1:\hat{i})^T B_i)$ would have to cancel $X$ out. The re-initialization is thus based on images not containing $X$, and the new singular vectors do not contain leftovers of $X$ anymore.
Finally, the parameter $\eta$ modifies the size of the maximum Frobenius norm used for normalization in re-initialization strategy (III) from Sec. 4.1.4. A smaller $\eta$ reduces the importance of the already determined singular vectors spanning the background space and increases the impact of newly appended images. If an object $X$ was falsely added, it gets removed more quickly if current frames not containing $X$ have a higher impact. A behavior similar to that of re-initialization strategy (II) can thus be achieved. The disadvantage is that the background model changes quickly and does not capture long-term effects that well. In the end, it depends on the application which strategy performs better.

Computational Results
The evaluation of our algorithm is based on an implementation in the C++ programming language using Armadillo [16] for linear algebra computations.

Default Parameter Setting
Alg. 1 depends on parameters that are still to be specified. In the following, we introduce a default parameter setting that works well in many different applications. The parameters could be further improved or optimized for a specific application using ground truth data. Our aim here, however, is to show that the adaptive SVD algorithm is very generic and applicable almost "out of the box" in various situations. The roles of the parameters and the chosen defaults are as follows. The parameter $\ell$ determines how many singular values and corresponding singular vectors are kept after re-initialization. Setting $\ell$ too low can cause a loss of background information. In our examples, $\ell = 15$ turned out to be sufficient not to lose information. The re-initialization is triggered when $n^*$ relevant singular values have been accumulated. Choosing that parameter too big reduces the performance, as the floating point operations per SVDAppend step depend cubically on the number of singular vectors and linearly on the number of singular vectors times the number of pixels, see Sec. 3.1. The system size $\eta$ controls the impact of newly appended frames, and a large value of $\eta$ favors a stable background. The threshold value $\tau^*$ for the discrete slope of singular values in the SVDAppend step depends on the data. The heuristic factor 0.05 proved to be effective to indicate that the curve of the singular values flattens out. The block size $\beta$ and the corresponding $\nu$ and $\Delta T$ depend on the frame rate of the input. The choice is such that it does not delay the update of the background space too much, which would be the effect of a large block size. Keeping it relatively small, we are still able to evaluate the input regarding similarity and stable effects. Due to the normalization of the input images to zero mean and standard deviation one, the similarity threshold $s$ and the binarization threshold $\theta$ are stable against different input types.

Small Foreground Objects
The first example video for a qualitative evaluation is from a webcam monitoring the city of Passau, Germany, from above. The foreground objects, e.g. cars, pedestrians, boats, are small or even very small. The frame rate of 2 frames per minute is relatively low and the image size is 640 × 480 px. This situation allows for a straightforward application of the basic adaptive SVD algorithm without similarity check and regular updates. The remaining parameters are as in the default setting of Sec. 5.1.
In Fig. 2 an example frame¹ and the corresponding foreground image from the webcam video are shown. The moving boat in the foreground, the cars in the lower left and right corners, and even the cars on the bridge in the background are detected well. Small illumination changes and reparking vehicles lead to incorrect detections on the square in the front. Fig. 3 depicts these regions.

Fig. 3: Plot marking the true detections in the foreground image of Fig. 2b by green circles and incorrect detections by red circles with white stripes.

Fig. 1 is a frame from an example video¹ including the projection onto the background space, the computed foreground image, and the distribution of the singular values. To illustrate the improvements due to similarity checks and periodic updates, the same frame is depicted in Fig. 4, where the extended version of our algorithm is applied. The artifacts due to big foreground objects that were added to the background in previous frames are not visible anymore. The person in the image still gets added to the background, but only after being stationary for some frames.

Execution Time
The performance of our implementation is evaluated on an Intel® Core™ i7-4790 CPU @ 3.60 GHz × 8. The example video from Sec. 5.3 has a resolution of 1920 × 1080 px at 25 fps. For the application of our algorithm to the example video, the parameters are set as shown in Sec. 5.1.
¹ The complete sample videos can be downloaded from https://www.forwiss.uni-passau.de/en/media_and_data/.
As the video was recorded at 25 fps, there is no need to consider every frame for a background update, because the background is assumed to be constant over a series of frames and can only be detected by considering a series of frames. Therefore, only every second frame is used for a background update, while background subtraction using the current singular vectors is performed on every frame. Our implementation with the settings from Sec. 5.1 processes this example video at 8 fps.
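The schedule described above, subtraction on every frame but model updates on every second frame only, can be sketched as follows. This is a simplified illustration and not our implementation: `foreground`, `process_stream`, and the callback `update_model` are hypothetical names, the callback merely stands in for the adaptive SVD update, and the columns of `U` are assumed to be orthonormal.

```python
import numpy as np

def foreground(U, x):
    """Residual of x after orthogonal projection onto the span of U's columns."""
    return x - U @ (U.T @ x)

def process_stream(frames, U, update_model, update_every=2):
    """Subtract the background on every frame, update the model less often.

    update_model is a hypothetical callback (U, x) -> U that stands in for
    the adaptive SVD update, which is not implemented in this sketch.
    """
    residuals = []
    for i, x in enumerate(frames):
        # Subtraction always uses the currently available singular vectors.
        residuals.append(foreground(U, x))
        # Only every update_every-th frame is passed on to the model update.
        if i % update_every == 0:
            U = update_model(U, x)
    return residuals, U

# Usage with a random orthonormal basis and an update that changes nothing.
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((50, 3)))
frames = [U0 @ rng.standard_normal(3) for _ in range(5)]
residuals, U_final = process_stream(frames, U0, lambda U, x: U)
```

Since all frames in this usage example lie in the span of `U0`, the residuals vanish up to rounding errors, i.e. such frames contain no foreground.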
For surveillance applications it is important that background subtraction runs in real time, for which 8 fps is too slow. One approach would be to reduce the resolution. The effects of that will be discussed in the following section. If the resolution is left unchanged, the parameters have to be adapted. Keeping only 10 singular vectors after re-initialization and setting n* = 25 significantly reduces the number of background effects that can be captured, but turns out to be still sufficient for this particular scene. The number of images considered for background updates can be reduced as well. Downsampling the frames by averaging over a window of size 8, combined with these two settings, leads to a processing rate of 25 fps, which is real time.
In Sec. 3.1 we pointed out that the number of floating point operations for an update step depends linearly on the number of pixels d when using Householder reflections. A re-initialization step is computationally even cheaper, because only Householder vectors have to be updated. The following execution time measurements support these theoretical considerations. Our example video is resized several times, 900 images are appended, and re-initialization is performed whenever n* singular vectors are reached. Tab. 1 shows the accumulated time for the append and re-initialization steps during the iteration for the given image sizes. The number d of pixels equals 2,073,600 = 1920 · 1080. The corresponding factors are shown in Tab. 1 for image sizes d/i. These factors should be constant for increasing image sizes due to the linear dependency. They nevertheless keep increasing, although more slowly than logarithmically. This additional increase in execution time can be explained by the growing amount of memory that has to be managed and by caching being less efficient than with small images.

Evaluation on Benchmark Datasets
The quantitative evaluation is performed on example videos from the background subtraction benchmark data set CDnet 2014 [19]. The first one is the pedestrians video from the baseline category. It contains 1099 frames (360 × 240 px) of people walking and cycling in public. An example frame can be seen in Fig. 5. For frames 300 through 1099, binary ground truth annotations exist that distinguish between foreground and background. From the first 299 frames, 15 frames are sub-sampled equidistantly and used for the initial matrix M. Thereafter, Alg. 1 is executed on all frames from 300 through 1099. Instead of applying the binary mask in line 18 of Alg. 1 to the input image, the mask itself is the output, which yields binary images.
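The construction of the initial matrix M from equidistantly sub-sampled frames can be sketched as follows. The helper names `equidistant_indices` and `initial_matrix` are hypothetical, and spacing the indices with `np.linspace` followed by rounding is an assumption, as the exact sampling rule is not specified above.

```python
import numpy as np

def equidistant_indices(n_frames, n_samples):
    """Return n_samples evenly spaced 0-based indices in [0, n_frames - 1]."""
    # Assumption: first and last frame are included, positions are rounded.
    return np.linspace(0, n_frames - 1, n_samples).round().astype(int)

def initial_matrix(frames, n_samples=15):
    """Stack n_samples equidistant frames as columns of a d x n_samples matrix."""
    idx = equidistant_indices(len(frames), n_samples)
    cols = [np.asarray(frames[i], dtype=np.float64).ravel() for i in idx]
    return np.stack(cols, axis=1)

# Usage with 299 small dummy frames standing in for the first 299 video frames.
dummy_frames = [np.full((4, 6), float(k)) for k in range(299)]
M = initial_matrix(dummy_frames, n_samples=15)
```

Each column of `M` is one vectorized frame, so for 360 × 240 px frames the matrix would have 86,400 rows and 15 columns.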
With the default parameter setting of Sec. 5.1, a pixelwise precision of 0.958 and an F-measure of 0.919 are achieved at 843 fps. The thresholding that produces the binary mask is sensitive to the contrast between foreground and background. If the contrast is low, foreground pixels are not detected properly. To avoid missing pixels within foreground objects, a morphological closing operation is performed with a circular kernel. Moreover, a fixed minimal size of foreground objects can be assumed, which reduces the number of false positives. These two optimizations lead to a precision of 0.968 and an F-measure of 0.958 at 684 fps. The complete evaluation measures can be seen in Tab. 2.
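For reference, the pixelwise measures reported above can be computed from a predicted binary mask and a ground truth mask as in the following sketch. It uses the standard definitions of precision, recall, and F-measure. The function name `precision_recall_f` is hypothetical, and the sketch assumes that every pixel is labeled either foreground or background, i.e. it ignores any special labels or regions of interest that a benchmark may define.

```python
import numpy as np

def precision_recall_f(pred, truth):
    """Pixelwise precision, recall, and F-measure of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(pred & truth)    # foreground detected correctly
    fp = np.count_nonzero(pred & ~truth)   # background reported as foreground
    fn = np.count_nonzero(~pred & truth)   # foreground that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

# Tiny example: one true positive, one false positive, two missed pixels.
pred = [[1, 1, 0], [0, 0, 0]]
truth = [[1, 0, 1], [0, 0, 1]]
p, r, f = precision_recall_f(pred, truth)
```

In this example the precision is 1/2 and the recall is 1/3, which illustrates that a mask can be fairly precise while still missing most of the foreground.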
Default represents the default parameter setting and morph the version with the additional optimizations. In the following, the morphological postprocessing is always included.
Our method delivers state of the art performance for unsupervised methods. The best unsupervised method on the benchmark site, IUTIS-5 [1], achieves a precision of 0.955 and an F-measure of 0.969. It is based on genetic programming and combines other state of the art algorithms. Its execution time is not given, but it is naturally higher than the execution time of the slowest algorithm used, even assuming perfectly parallel execution. We introduced domain knowledge only in the morphological optimizations. Otherwise, nothing is adapted specifically to the test scene. Supervised learning techniques use even more domain knowledge, as object shapes are trained and irrelevant movements in the background are excluded through labeling. They are able to outperform our approach regarding the evaluation measures, achieving an overall average precision and F-measure of more than 0.98. The benchmark site cautions, however, that the supervised methods may have been trained on evaluation data, as ground truth annotations are only available for evaluation data.
The positive effect of a block-wise appending of the data with a similarity check and regular updates as shown above also applies here: our adaptive SVD algorithm on the given pedestrians video from the benchmark site without using the similarity checks and regular updates only leads to a precision of 0.933 and an F-measure of 0.931.
The performance of our algorithm on further example videos from the CDnet data set is listed in Tab. 3. The park video is recorded with a thermal camera, the tram and turnpike videos with a low frame rate, and the blizzard video while snow is falling. For highway and park the best unsupervised method is IUTIS-5, and for tram, turnpike, blizzard, and streetLight it is SemanticBGS [6]. SemanticBGS combines IUTIS-5 with a semantic segmentation deep neural network; its execution time is given as 7 fps for 473 × 473 px images on an NVIDIA GeForce GTX Titan X GPU.
Table 3: Evaluation of the adaptive SVD algorithm on example videos from the CDnet data set using precision and F-measure. Prec* and F-Meas* give the precision and F-measure of the best unsupervised method of the benchmark for the respective test video.
Except for the park video, the content is mostly vehicles driving by. The performance of our algorithm clearly drops whenever the initialization image set contains many foreground objects, as in the highway video, where the street is never empty. Moreover, a foreground object turns into background when it stops moving, which is in fact a feature of our algorithm. This, however, causes problems in many of the CDnet benchmark videos in which vehicles stop at traffic lights, as in the tram video, or people stop and start moving again. The CDnet data set contains a category of videos with intermittent object motion. On it, our algorithm achieves an average precision of 0.752 and an F-measure of 0.385, whereas SemanticBGS reaches an average precision of 0.915 and an F-measure of 0.788. The precision of our algorithm tends to be higher than the F-measure: it detects motion very well, so what it reports as foreground is usually correct, but foreground often goes undetected due to a lack of motion. To delay the addition of a static object to the background, one can, for example, reduce the regular updates. But as this feature regulates the adaptation of the background model to changes in the background, doing so only improves the performance for very stable scenes. In the streetLight video, in contrast to the other videos, no regular update was performed. With regular updates, the precision is 0.959 and the F-measure 0.622 due to cars stopping at traffic lights. The only domain knowledge we introduce is the postprocessing via morphological operations. Otherwise, the algorithm has no knowledge about the kind of background it models. Therefore, not only vehicles or people are detected as foreground, but also the movement of trees or the reflection of vehicle lights on the ground, which lowers the performance on the CDnet benchmark.
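The gap between precision and F-measure in the intermittent object motion category can be made explicit by a rough calculation. It assumes the standard definition of the F-measure as the harmonic mean of precision P and recall R, and it treats the reported averages as if they belonged to a single video, which holds only approximately because the F-measure is averaged per video.

```latex
F = \frac{2PR}{P + R}
\quad\Longrightarrow\quad
R = \frac{F\,P}{2P - F}
  = \frac{0.385 \cdot 0.752}{2 \cdot 0.752 - 0.385}
  = \frac{0.28952}{1.119}
  \approx 0.26
```

Under these assumptions, only about a quarter of the annotated foreground pixels would be detected, which is consistent with static objects being absorbed into the background.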

Conclusions
We utilized the iterative calculation of a singular value decomposition to model a common subspace of a series of frames, which is assumed to represent the background of the frames. An algorithm, the adaptive SVD, was developed and applied to background subtraction in image processing. The assumption that foreground objects have to be small was considered in more detail and relaxed by extensions of the algorithm. In an extensive evaluation, the capabilities of our algorithm were shown qualitatively and quantitatively using example videos and benchmark results. Compared to state of the art unsupervised methods, we obtain competitive performance with superior execution time. Even high definition videos can be processed in real time.
The evaluation also showed that, if an application to a domain such as video surveillance is intended, our algorithm would need to be extended to also consider semantic information. As it stands, it can only be seen as a preprocessing step, e.g. for reducing the search space of classification algorithms. In future work we aim to evaluate the benefit of using our algorithm as a preprocessing step for an object classifier. Moreover, we will address the issue of foreground objects turning into background after being static for some time, which is desirable in some cases and erroneous in others. A first approach is to use tracking, because objects do not disappear without any movement. Finally, our algorithm can be parallelized to some extent by separating the projection of incoming images onto the background from the update of the background model. Further performance improvements will be investigated.