Robust structure from motion with affine camera via low-rank matrix recovery

We present a novel approach to structure from motion that can deal with missing data and outliers with an affine camera. We model the corruptions as sparse error. Therefore the structure from motion problem is reduced to the problem of recovering a low-rank matrix from corrupted observations. We first decompose the matrix of trajectories of features into low-rank and sparse components by nuclear-norm and ℓ1-norm minimization, and then obtain the motion and structure from the low-rank components by the classical factorization method. Unlike pervious methods, which have some drawbacks such as depending on the initial value selection and being sensitive to the large magnitude errors, our method uses a convex optimization technique that is guaranteed to recover the low-rank matrix from highly corrupted and incomplete observations. Experimental results demonstrate that the proposed approach is more efficient and robust to large-scale outliers.


Introduction
Structure from motion is a classical computer vision problem of finding the three-dimensional (3D) structure of scene and the camera motion. One family of batch structure from motion algorithms is called factorization methods which were originally proposed by Tomasi and Kanade [1]. Unlike sequential methods, factorization methods can simultaneously calculate camera motion and scene geometry using all image measurements. Gross errors associated with sequence closure can be avoided because reconstruction errors can be distributed meaningfully across all measurements.
The matrix of trajectories of features lies in a low-dimensional subspace. Therefore, the structure from motion problem can be reduced to that of finding the closest low-rank approximation to the matrix of trajectories. Conventional Principal Component Analysis (PCA) [2] is commonly used to obtain the best approximation of a low-rank structure in the least-squares sense, when the observation matrix is full, that is, when all the 2D track features can be observed in different frames. As we all know, PCA is guaranteed to be optimal solution only if the noise in measurement is independent and identically distributed (i.i.d.) Gaussian noise. Unfortunately, such a linear noise model rarely exists in reality: outliers and missing data are common factors that cause trouble in the structure from motion problem.
In order to remove outliers and find correct correspondence, a remarkable amount of information in the images is exploited, including analyzing neighboring structures in the image and some prior information. Kanade-Lucas-Tomasi feature tracker (KLT) [3] is commonly used in feature-tracking and feature-matching in very efficient and fast implementations when the two images are taken from close viewpoints. However, when features are tracked over an extended time period, this algorithm will fail because the estimation errors accumulate in time. Scale invariant feature transform (SIFT) [4] is invariant to changes in rotation, scale and illumination, and is more stable than KLT. Nevertheless, it still has problems when the two images are taken by cameras with wide baseline. Because of the presence of similarity between features, and other concurrent nuisance factors such as illumination changes or partial occlusions and corruptions, false matches cannot be totally avoided even with some robust statistics methods, such as RANdom SAmple Consensus (RANSAC) [5].
On the other hand, there are various techniques for the factorization method in the presence of missing data. Tomasi et al. [1] proposed an initialization method, which obtains the initial estimation by decomposing the largest full sub-matrix, and then grows it by adding one row or column at a time. However, finding the largest full sub-matrix is an non-deterministic polynomial-time hard (NP-hard) problem. Moreover, the solution largely relies on the order in which the columns or rows are completed. There is also a category of methods extended from Wiberg algorithms [6], such as those described in [7,8], which are all based on the Gauss-Newton method, and need to solve a large linear system whose cost function is interrelated to the column of the measurement matrix. The complexity will rapidly increase with the increase of the size of column. Jacobs proposed a linear fitting method [9], in which each column with missing data is treated as an affine subspace and unknown entries can be recovered by least-squares regression. But this algorithm is often inefficient due to the large noise in the data. Buchanan et al. [10] proposed the damped Newton (DN) algorithm based on the Levenberg-Marquardt algorithm. Although this method achieves a better performance than Wiberg's method, it needs a large number of iterations, which is not suitable for large matrices. Chen et al. [11] provided an iterative algorithm to recover the missing components in a low-rank matrix without large-scale noise. Starting with a complete sub-matrix, the algorithm fills the missing entries as matrix grows by one row or column. Later, Chen et al. [12] revitalized the usage of the Levenberg-Marquardt algorithm and redefined the objective function. The algorithm greatly reduces the number of variables and is much more efficient than the previous methods, but there are still some drawbacks such as requiring a good initial value and using the assumption that the data are corrupted by i.i.d. Gaussian noise.
All the above methods have some problems such as depending on the initial value selection and being sensitive to the large magnitude errors. These issues will cause the algorithm not converge or only converge to a local minimum. Motivated by the recent breakthroughs in low-rank matrix recovery [13][14][15], we present a novel approach to structure from motion that can simultaneously deal with missing data and outliers with an affine camera. We model the corruptions as sparse errors, thus the structure from motion problem is reduced to a problem of recovering a low-rank matrix from corrupted observations. Unlike previous methods, our method uses a convex optimization technique and enables robust recovery of a highdimensional low-rank matrix from highly corrupted and incomplete observations. Experimental results demonstrate that the proposed approach is more efficient and robust to outliers. The remainder of this paper is organized as follows: In Section 2, we provide formulation associated with affine reconstruction. Section 3 gives the algorithm for solving the problem of recovering low-rank matrix from corrupted observation. Section 4 gives experimental results. Section 5 gives a summary of this paper.
2 Modeling structure from motion as matrix rank minimization

Affine projection
We first review the structure of trajectories matrix in a more general setting. In full generality, an affine camera has the form After ignoring the third row of projection matrix, a homogeneous coordinates X h j of 3D point X j and its projection on the camera imaging plane x ij are related by a 2 × 4 affine projection matrix P i : and then the full measurement matrix of rank 4 can be written as

Factorization method
We can rewrite Eq. (2) as where M i represents the first third columns of projection matrix, t i represents translation vector, X j represents 3D point. Without loss of generality, we assume that the origin is at the centroid of the 3D points. If matrix O in Eq. (3) is subtracted by its column-wise mean, we havê We can write the full measurement matrix of size 2m × n as Eq. (6), where M and S represent camera motion and object shape, respectively.
Once we haveÔ, pseudo motionM and pseudo shapeS can be given by The decomposition of Eq. (7) is not completely unique, there exists an unknown ambiguity A ∈ R 3×3 as follows: this ambiguity can be reduced or removed with normalization constraints [16]. Then the true motion M and shape S can be obtained by However, when there exist corruptions, such as outliers and missing data, this solution normally breaks down. In the following sections, we will show that our method can efficiently remove these corruptions.

Formulation
As we discussed before, the low-rank structure of the observation matrix is seldom observed because of the corruptions in data. If we denote outliers by a matrix E, the observation in Eq. (3) can be rewritten as In most cases, the number of outliers is relatively small, i.e., only a few of entries in the E are non-zero. Therefore, the structure from motion problem can be reduced to the optimization problem as Because of the self-occlusion, missing data always occur in most of structure from motion applications. Taking this factor into account, the problem in Eq. (11) can be rewritten as where Ω c denotes the locations of missing entries in the observed matrix O. π Ω c denotes the orthogonal projection operator corresponding to the Ω c , and Ω c denotes the linear subspace complementary to Ω.

Efficient solution via convex optimization
The optimization problem in Eq. (11) is an NP-hard problem and not directly solvable. With the breakthroughs in low-rank matrix recovery, it was shown [13][14][15] that Eq. (11) can be solved by the strategy of convex surrogate, which replaces rank(·) with the nuclear norm and 0 -norm with matrix 1 -norm. Then, the problem is reduced to the optimization problem where · * and · 1 represent the nuclear norm and 1 -norm, respectively, and λ > 0 is a weighting parameter. Then, our problem Eq. (12) can be rewritten as Recently, researchers have developed a lot of algorithms for solve low-rank matrix recovery [15,17,18]. Among them the augmented Lagrange multiplier(ALM) method [15,19] is faster and more stable than the others. Thus, we use ALM method [15] to solve our problem in Eq. (14), and the augmented Lagrangian is given by where Y ∈ R m×n is a Lagrange multiplier matrix, ·, · represents the matrix inner product 1) , μ is a positive constant, and · F is the Frobenius norm. The basic ALM iteration is given by where {μ k } is a monotonically increasing positive sequence (ρ > 1). A and E can be solved by an alternating minimization strategy as For the first step in Eq. (17), there is a closed-form solution given by where shrinkage is soft-thresholding operator and defined as follows: where α > 0 2) . Unfortunately, there is no close-form solution for the second step in Eq. (17). To overcome this problem, we use an iterative method so that the solution can converge reasonably. This method is based on the accelerated proximal gradient (APG) algorithm [17,18,20], which has better convergence and steadiness. The iterative solution is given as where {t i } is a positive sequence and updated by t 1 = 1 and t i+1 = 0.5 1 + 1 + 4t 2 i in each iteration. The entire algorithm is illustrated in Algorithm 1.

Experiments
In this section, we present the experimental results of our method on synthetic and real data. We compare our results with those obtained by the Levenberg-Marquardt algorithm on Grassmann manifold (LMM) [12], which is formulated under the assumption of Gaussian noise. There are two steps for LMM method, namely initializing the system by alternated least squares (ALS) method and optimizing the system by Levenberg-Marquardt algorithm. The LMM is much more efficient than the other exist methods, therefore we use this method for comparison.

Experimental results with synthetic data
In this subsection, we use synthetic data, for which the ground truth of trajectory matrix is known, to verify our method. We first create a set of points, and then create the camera motion around those points in a semi circle. A sparse noise with varying magnitudes is added to the trajectories matrix, which simulates outliers. We set λ = 0.4/ √ m in Eq. (14).

a. With outliers.
We generate a set of 200 3D points with a structure of box (see Figure 1). We then generate synthetic images with 60 cameras around those points, where the missing entries are chosen on the geometry relationship, and the locations of outliers are randomly chosen from the synthetic images. The two type of corruptions (missing entries and outliers) in the observation are with the percentages of 10% and 6%, respectively. The magnitude for outliers is randomly sampled in the range of 0-20. The estimated trajectories are illustrated in Figure 2. We use the root-mean-square(RMS) error and the maximum projection error to evaluate both methods. The maximum projection error is defined as where x g ij and x e ij are the ground truth and the estimated feature coordinate, respectively. From Table 1 we can see that our method performs much better than the LMM approach. The errors almost occur in all features with LMM approach as shown in Figure 2. Since the data of box has a distinct box structure, we present its recovered structure in Figure 3, Figure 3 (a) shows the whole structure, Figure 3 (b) shows the structure in x-y plane. Note some lines are not parallel with LMM approach, while they are kept parallel with our method.    The results have been averaged over 10 different sets of locations for outliers.

b. Effect of magnitude of outliers.
From the above experiments, we observe that the outliers have a powerful influence on the trajectory estimation. In this subsection, the effect of the magnitude of outliers is studied.
We vary the upper bound of magnitude of outlier from 10 to 60. The locations of outliers are randomly chosen from the synthetic data. We test our method on 10 different sets of locations for outliers. The average results are illustrated in Figure 4. We observe both the RMS error and the maximum projection error in the LMM estimate increase rapidly with the magnitude of outliers, while our method is significantly stable.
The experimental results demonstrate that our method is robust to the large magnitude of errors, whereas the LMM method is extremely sensitive to even small magnitude outliers in the input images. This is because our objective function uses l 1 -norm for errors, which is independent of the magnitude, while LMM approach uses l 2 -norm under the assumption of Gaussian noise, which is relevant to the magnitude of error.
c. Effect of amount of outliers. From the above experiments, we can see that our method is much more robust to outliers than the LMM method. In this experiment, we will test the effect of amount of outliers.
We generate data without missing entries, and then increase ratio of outliers. The locations of outliers are randomly chosen from synthetic images.
Again, we test our method on 10 different sets of locations for outliers. The average experimental results are shown in Figure 5. Our method can handle the data in which even 35% entries are corrupted  by outliers, while the errors of the LMM methods rapidly increases when the percentage of noise increases. Again, our method performs much better than the LMM method.
d. Computation. We use SVD for each iteration of our algorithm, which has a complexity of O(mn 2 ), where m and n represent the size of row and column, respectively. However, to speed up the algorithm, only partial SVDs are is computed in each iteration since we expect the optimal solution to have a rank of 3 at most. Thus, the complexity of each iteration reduces to O(lmn), where l is the number of singular vectors computed in each iteration. Typically, we set l to be much smaller than m and n. On the contrary, the LMM method has a complexity of O(m 3 r 3 ) for its main process, and its O(nm 2 r 2 ) for initialization step , where r is the rank of matrix. Our rank minimization approach can be used for structure from motion with much less computational cost than using the LMM approach.
For the data with Gaussian distribution in the first experiments, which consists of 200 3D points and 60 cameras with 10% missing entries and 6% outliers in observation matrix, on a Macintosh computer with 4 GB memory and a 2.66 GHz Core 2 i7 processor, our method takes about 5.1139 s, and the LMM method takes about 43.7833 s.

Qualitative evaluation with real images
In this experiments, we evaluate our method on real data. We use a set of 100 images of a house [21] taken under different views (see Figure 6). We use the KLT tracker [3] for feature tracking and intentionally relax the error threshold to test robustness of algorithms. We test on four sets of data, which contain 5.1%, 14.65%, 18.28%, 20.59% missing entries and correspond to the feature track 1-100, 100-199, 200-299 and 300-500, respectively. Since more features with similar descriptor are selected, the ratio of outliers is also increasing 3) .
The experimental results are illustrated in Figures 7 (a)  7 (c) and (d), we can observe the distinct incorrect trajectories recovered by the LMM method, while the trajectories recovered by our method are more reasonable. From the top-down view, we can also observe that the angle between the two walls recovered by our method keeps perpendicular, while the structure recovered by the LMM method is degenerated to a plane. Our method shows higher robustness to errors.

Discussion and future work
In this paper, we present a novel approach to structure from motion that can deal with missing data and outliers simultaneously with an affine camera. The corruptions in data are modeled as sparse errors, and thus the structure from motion problem is reduced to the problem of recovering the low-rank matrix from corrupted observations. Since our formulation is based on advanced convex optimization for nuclear norm and 1 -norm minimization, it does not rely on the initial value. Unlike pervious methods, which have some drawbacks such as requiring a good initialization and being sensitive to the large magnitude outliers, our method uses an convex optimization technique that is guaranteed to correctly recover low-rank matrix from highly corrupted and incomplete observations. However, this framework is still limited to the affine camera model. We will extend our method to the perspective camera model in the future.