
1 Introduction

Standard structure from motion (SfM) approaches are typically multi-stage pipelines comprising feature matching or tracking, initial structure and camera estimation, and final nonlinear refinement stages. While feature matching and tracking and the nonlinear refinement stage have well-established gold standard implementations (most notably matching using SIFT features, tracking via Lucas-Kanade and nonlinear refinement via Levenberg-Marquardt), no elegant and generally accepted framework is known for estimating the initial poses and 3D structure from feature tracks. Even when a sensible starting point (initial 3D reconstruction) is available, accumulated drift, undetected loop closures, etc., require a large basin of convergence for bundle adjustment to succeed. The essence of this work is widening the convergence basin of bundle adjustment, thereby improving SfM systems.

Fig. 1.

Visualization of the Di2 (see Table 7) tracks recovered using our two-stage meta-algorithms. In each run, each meta-algorithm is initialized from random camera poses and points (Fig. 1b). In the first stage, it performs affine bundle adjustment using either a Linear or a Nonlinear VarPro-based algorithm, reaching the best affine optimum (Fig. 1c) in 91 % of all runs. The outputs are then used to initialize projective bundle adjustment. Although Di2 has strong perspective effects, our recommended meta-algorithms (TSMA1 and TSMA2) both reach the best projective optimum (Fig. 1d) in 90–98 % of all runs.

If an affine or weak perspective camera model is given (or assumed), determining pose and 3D structure amounts to solving a matrix factorization problem, which is an easy task if all points are visible in every image. If the visibility pattern is sparse and structured as induced by feature tracking, matrix factorization algorithms employing the Variable Projection (VarPro) method are highly successful (i.e. return a global optimum in a large fraction of runs) even when the poses and the 3D structure are initialized arbitrarily [10, 17]. Thus, the initial SfM computation can be entirely bypassed in the affine case. One obvious question is whether this is also true when using a pinhole camera model. This is the main motivation of this work.

Formally, we are interested in finding global minimizers of the following nonlinear least squares projective bundle adjustment problem

$$\begin{aligned} \min _{\{\mathtt {P}_i\} \{\tilde{{\mathbf {x}}}_j\}} \sum _{\{i,j\} \in \varOmega } \Vert \varvec{\pi }\left( \mathtt {P}_i \tilde{{\mathbf {x}}}_j\right) - {\tilde{\mathbf {m}}}_{ij}\Vert _2^2 \end{aligned}$$
(1)

without requiring good initial values for the unknowns. In (1) the unknowns are as follows: \(\mathtt {P}_i \in {\mathbb R}^{{3}\times {4}}\) is the projective camera matrix for frame i and \({\tilde{{\mathbf {x}}}}_j \in \mathbb {R}^4\) is the homogeneous vector of coordinates of point j. \({\tilde{\mathbf {m}}}_{ij} \in \mathbb {R}^2\) is the observed projection of point j in frame i. \(\varOmega \) denotes a set of visible observations and \(\varvec{\pi }\left( \cdot \right) \) is the perspective division such that \( \varvec{\pi }([x, y, z ]^\top ) := [x/z, y/z]^\top \). This division introduces nonlinearity to the objective, and thus we can interpret (1) as a nonlinear matrix factorization problem.

Our quest to solve (1) directly without the help of an initial structure and motion estimation step leads to the following contributions:

  • Extension of Ruhe and Wedin algorithms: we extend the separable nonlinear least squares algorithms in [19] to apply to nonseparable problems such as (1).

  • Unification of affine and projective cases: we unify affine and projective bundle adjustment as special cases of a more general problem class. As a byproduct we obtain numerically better conditioned formulations for each of the special cases.

  • Simple two-stage meta-algorithms: we provide numerical experiments to identify the method yielding the highest success rate overall on real and synthetic datasets. We conclude that each of the two winning methods is a Variable Projection (VarPro) based two-stage approach, which uses either a traditional or a proposed numerically stable affine bundle adjustment algorithm followed by the proposed projective bundle adjustment algorithm.

This work also has limitations: the scope is confined to the \(L^2\)-norm projective formulation, and consequently we may encounter new challenges when incorporating robust kernel techniques and/or extending this work to the calibrated case, which is more frequently used in practice. We discuss the iteration complexity of our proposed algorithms but do not include timing measurements, as we believe meaningful run-time figures require comparable implementations (and our code in [11] is inefficient, as it does not incorporate the speed-up tricks mentioned in [2, 4, 10]).

1.1 Related Work

In this section we briefly summarize relevant literature. The first seminal work dates back to Wiberg [24], who investigated the task of matrix factorization under missing data and whose name is today associated with the most successful method to solve this problem. The method is based on the principle of Variable Projection, which rewrites the objective in terms of a reduced set of unknowns by “minimizing out” the remaining ones. This approach is particularly promising when the dependencies between unknowns form a bipartite graph (which is the case in the structure from motion setting). In several works it has been experimentally verified that the Wiberg/VarPro method is far superior to naive joint optimization for matrix factorization problems (e.g. [5, 10, 17]), which explains the interest in the more difficult-to-implement VarPro methods.

In computer vision the connection between matrix factorization and affine structure from motion—but without missing data—was explored in [22]. Solving projective structure from motion via matrix factorization is more difficult, and it requires iterative methods even in the fully observed case (e.g. [21]). One step towards the application of VarPro methods in projective problems is the Nonlinear VarPro extension explored by Strelow [20], which we take as a starting point for our implementation.

All methods mentioned so far are ideally designed not to require a careful initialization for the unknown cameras and 3D structure (or matrix factors in the general case), and VarPro-derived methods work well for matrix factorization tasks even with random values as the initializer. In SfM applications strong geometric constraints (we refer to [9] for a comprehensive treatment) can be used to determine sensible initial cameras and 3D structure. This initialization is subsequently used as the starting point for nonlinear least-squares optimization (termed bundle adjustment) over all unknowns (see [23] for a review). Determining a good starting point for bundle adjustment is a non-trivial problem and expensive to solve in the general case. Compared to two-view and three-view geometry and to full-scale bundle adjustment, this step of finding a good initializer also lacks theoretical understanding. Hence, it is beneficial to bypass this stage altogether and investigate initialization-free methods for bundle adjustment.

2 Known Methods for Bivariate Least-Squares Optimization

Bivariate least-squares solves

$$\begin{aligned} \min _{\mathbf {u}, \mathbf {v}}\Vert {\varvec{\varepsilon }}(\mathbf {u}, \mathbf {v})\Vert _2^2 \end{aligned}$$
(2)

where \(\mathbf {u}\) and \(\mathbf {v}\) are sets of model parameters and \({\varvec{\varepsilon }}\) is the residual vector. We can solve this by using various methods, namely Joint optimization, Variable Projection (Linear [6] and Nonlinear VarPro [20]) and Alternating least-squares (ALS), which is equivalent to RW3 (see Sects. 2.2 and 3).

The key to implementing all these methods is the use of the Levenberg-Marquardt (LM) algorithm [13, 15], which is a widely used trust-region strategy.

The Levenberg-Marquardt algorithm (LM)

LM [13, 15] is an extension of the Gauss-Newton algorithm, which minimizes \(\Vert {\varvec{\varepsilon }}(\mathbf {x})\Vert _2^2\) by iteratively solving its linearization and updating \({\mathbf {x}}\) accordingly. At each iteration, the Gauss-Newton update \(\mathbf {\Delta x}\) is the solution of the linearized problem

$$\begin{aligned} \mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {\Delta x}}\left\| {\varvec{\varepsilon }}(\mathbf {x}) + \mathtt {J}(\mathbf {x}) \mathbf {\Delta x}\right\| ^2_2 \end{aligned}$$
(3)

where \(\mathbf {x}\) denotes the parameter values from the previous iteration and \(\mathtt {J}(\mathbf {x}) := \partial {\varvec{\varepsilon }}(\mathbf {x}) / \partial \mathbf {x}\). This step is likely to lead to a lower objective if the local cost surface about \({\mathbf {x}}\) resembles a quadratic model, but otherwise may lead to a higher objective. To overcome this, LM incorporates a regularizer to control the step size and find the augmented solution

$$\begin{aligned} \mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {\Delta x}}\left\| {\varvec{\varepsilon }}(\mathbf {x}) + \mathtt {J}(\mathbf {x}) \mathbf {\Delta x}\right\| ^2_2 + \lambda \Vert \mathbf {\Delta x}\Vert _2^2 \end{aligned}$$
(4)

where \(\lambda \) is known as the damping factor which we tune to decrease the cost. This parameter indicates the size of the trust region — the smaller the value of \(\lambda \), the larger the region that can be “trusted” as quadratic.
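To make this concrete, the following is a minimal, dense-algebra sketch of one LM iteration; the helper names and the factor-of-10 damping schedule are our own illustrative choices, not prescribed by [13, 15]:

```python
import numpy as np

def lm_iteration(residual_fn, jacobian_fn, x, lam):
    """One Levenberg-Marquardt iteration for min ||eps(x)||_2^2 (cf. Eq. (4))."""
    eps = residual_fn(x)                 # current residual vector
    J = jacobian_fn(x)                   # J(x) = d eps / d x
    # Solve the damped normal equations (J^T J + lam I) dx = -J^T eps.
    dx = np.linalg.solve(J.T @ J + lam * np.eye(x.size), -J.T @ eps)
    eps_new = residual_fn(x + dx)
    if eps_new @ eps_new < eps @ eps:    # cost decreased: accept step, enlarge trust region
        return x + dx, lam / 10.0
    return x, lam * 10.0                 # cost increased: reject step, shrink trust region
```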

2.1 Joint Optimization

Joint optimization solves for all parameters simultaneously. This is achieved by stacking the parameters into the vector \({\mathbf {x}} = [\mathbf {u}; \mathbf {v}]\) and using a Newton-like solver such as LM. In general, the update at iteration k is

$$\begin{aligned}{}[{\mathbf {u}_{k+1}}; \mathbf {v}_{k+1}] = {\mathbf {x}}_{k+1} = {\mathbf {x}}_k - (\mathtt {H}({{\mathbf {x}}}_k) + \lambda \mathtt {I})^{-1} \mathbf {g}({{\mathbf {x}}}_k) \end{aligned}$$
(5)

where \(\mathtt {H}({\mathbf {x}}_k)\) is the Hessian (or its approximation) of \(\Vert {\varvec{\varepsilon }}({\mathbf {x}})\Vert _2^2\) at \({\mathbf {x}}_k\), \(\mathbf {g}({\mathbf {x}}_k)\) is the gradient and \(\lambda \) is the damping factor. A widely used Hessian approximation, which is also used by LM (and throughout this paper), is the Gauss-Newton matrix \(2 \mathtt {J}({\mathbf {x}}_k)^\top \mathtt {J}({\mathbf {x}}_k)\) where \(\mathtt {J}({\mathbf {x}}_k):=\partial {\varvec{\varepsilon }}({\mathbf {x}}_k) / \partial {\mathbf {x}}\).

2.2 Linear Variable Projection (Linear VarPro)

Linear VarPro [6] is an approach for solving separable nonlinear least-squares [19], which is a subset of bivariate optimization problems and has the property that the residual vector is linear in at least one of the two variables, e.g.

$$\begin{aligned} \min _{\mathbf {u}, \mathbf {v}} \Vert {\varvec{\varepsilon }}(\mathbf {u}, \mathbf {v})\Vert _2^2&= \min _{\mathbf {u}, \mathbf {v}} \Vert \mathtt {A}(\mathbf {u}) \mathbf {v}- \mathbf {b}\Vert _2^2 \end{aligned}$$
(6)

where \(\mathbf {u}\) and \(\mathbf {v}\) are sets of model parameters, \({\varvec{\varepsilon }}\) is the residual vector, \(\mathtt {A}(\mathbf {u})\) is a linear operator which depends on \(\mathbf {u}\) and \(\mathbf {b}\) is a constant vector. Since \({\varvec{\varepsilon }}\) is linear in \(\mathbf {v}\), we have a direct solution for \(\mathbf {v}\) that minimizes (6) given \(\mathbf {u}\) which we call

$$\begin{aligned} \mathbf {v}^*(\mathbf {u})&:= \mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {v}} \Vert \mathtt {A}(\mathbf {u}) \mathbf {v}- \mathbf {b}\Vert _2^2=\mathtt {A}^\dagger (\mathbf {u}) \mathbf {b}. \end{aligned}$$
(7)

Substituting \(\mathbf {v}^*(\mathbf {u})\) for \(\mathbf {v}\) in (6) yields

$$\begin{aligned} \min _{\mathbf {u}} \Vert {\varvec{\varepsilon }}^*(\mathbf {u})\Vert _2^2&:= \min _{\mathbf {u}} \Vert \mathtt {A}(\mathbf {u}) \mathbf {v}^*(\mathbf {u}) - \mathbf {b}\Vert _2^2 = \min _{\mathbf {u}} \Vert \left( \mathtt {A}(\mathbf {u}) \mathtt {A}^\dagger (\mathbf {u}) - \mathtt {I} \right) \mathbf {b}\Vert _2^2 \end{aligned}$$
(8)

which is a nonlinear reduced problem in \(\mathbf {u}\) that can be solved using LM.
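For illustration, the reduced residual of (8) can be evaluated with a dense least-squares solve; this sketch ignores the sparsity that a real SfM implementation (e.g. [11]) would exploit:

```python
import numpy as np

def reduced_residual(u, A_fn, b):
    """Evaluate eps*(u) = (A(u) A(u)^+ - I) b of Eq. (8)."""
    A = A_fn(u)                                      # A(u), shape (m, k)
    v_star = np.linalg.lstsq(A, b, rcond=None)[0]    # v*(u) = A(u)^+ b, Eq. (7)
    return A @ v_star - b

# The outer minimization min_u ||reduced_residual(u)||^2 is then run with LM,
# using the exact Jacobian of Eq. (12) or one of the Ruhe-Wedin approximations below.
```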

In [7], the authors claim that this reduced objective is almost always better conditioned than the original one. Although no formal proof is provided, we can find empirical evidence in matrix factorization [10].

Deriving the Jacobian of the Reduced Problem. First, we write the Jacobian of the original problem (6) as

$$\begin{aligned} \mathtt {J}(\mathbf {u}, \mathbf {v}) := \begin{bmatrix} \mathtt {J}_\mathbf {u}(\mathbf {u}, \mathbf {v})&\mathtt {J}_\mathbf {v}(\mathbf {u}) \end{bmatrix} := \begin{bmatrix} \frac{\partial {\varvec{\varepsilon }}(\mathbf {u}, \mathbf {v})}{\partial \mathbf {u}}&\frac{\partial {\varvec{\varepsilon }}(\mathbf {u}, \mathbf {v})}{\partial \mathbf {v}} \end{bmatrix} = \begin{bmatrix} \frac{\partial [\mathtt {A}(\mathbf {u}) \mathbf {v}]}{\partial \mathbf {u}}&\mathtt {A}(\mathbf {u}) \end{bmatrix} \end{aligned}$$
(9)

We then express (using the chain rule) the Jacobian of the reduced problem (8) as

$$\begin{aligned} \mathtt {J}^*(\mathbf {u})&:= \frac{d{\varvec{\varepsilon }}^*(\mathbf {u})}{d\mathbf {u}} = \frac{d{\varvec{\varepsilon }}(\mathbf {u}, \mathbf {v}^*(\mathbf {u}))}{d\mathbf {u}} = \frac{\partial {\varvec{\varepsilon }}(\mathbf {u}, \mathbf {v}^*(\mathbf {u}))}{\partial \mathbf {u}} + \frac{\partial {\varvec{\varepsilon }}(\mathbf {u}, \mathbf {v}^*(\mathbf {u}))}{\partial \mathbf {v}} \frac{d\mathbf {v}^*(\mathbf {u})}{d\mathbf {u}} \end{aligned}$$
(10)
$$\begin{aligned}&= \mathtt {J}_\mathbf {u}(\mathbf {u}, \mathbf {v}^*(\mathbf {u})) + \mathtt {J}_\mathbf {v}(\mathbf {u}) \frac{d}{d\mathbf {u}}\left[ \mathtt {A}(\mathbf {u})^\dagger \mathbf {b} \right] \end{aligned}$$
(11)
$$\begin{aligned}&= \mathtt {J}_\mathbf {u}(\mathbf {u}, \mathbf {v}^*(\mathbf {u})) + \mathtt {J}_\mathbf {v}(\mathbf {u}) \frac{d}{d\mathbf {u}}\left[ \mathtt {J}_\mathbf {v}(\mathbf {u}) ^\dagger \right] \mathbf {b}. \end{aligned}$$
(12)

If \(\mathbf {v}^*(\mathbf {u})\) is differentiable, (12) is analytically tractable (see [11]).
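Since the analytic expression (12) is easy to get wrong in code, a central finite-difference cross-check of the reduced Jacobian can be a useful sanity test; the sketch below reuses reduced_residual from the earlier sketch:

```python
import numpy as np

def reduced_jacobian_fd(u, A_fn, b, h=1e-6):
    """Central finite differences of eps*(u); for verifying implementations of Eq. (12)."""
    eps0 = reduced_residual(u, A_fn, b)
    J = np.zeros((eps0.size, u.size))
    for k in range(u.size):
        du = np.zeros_like(u)
        du[k] = h
        J[:, k] = (reduced_residual(u + du, A_fn, b)
                   - reduced_residual(u - du, A_fn, b)) / (2.0 * h)
    return J
```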

Ruhe and Wedin Algorithms for Linear VarPro. In [19], Ruhe and Wedin proposed three Newton-like algorithms each of which uses an approximation to the Hessian. The first algorithm, RW1, simply uses Gauss-Newton (\(2\mathtt {J}^\top \mathtt {J}\)). The second algorithm, RW2, approximates \(d\mathbf {v}^*(\mathbf {u}) / {d\mathbf {u}}\) in the Jacobian such that the approximated Gauss-Newton matrix is orthogonal to the column space of \(\mathtt {J}_\mathbf {v}(\mathbf {u})\). Finally, RW3 assumes independence of the two variables by setting \(d\mathbf {v}^*(\mathbf {u}) / {d\mathbf {u}}=0\), leading to alternation.

Although Ruhe and Wedin did not associate any trust region strategy with the above algorithms, we can easily incorporate this by using LM.
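In the linear case these three approximations are easy to state in code. A sketch, using dense algebra and the notation above (with \(\mathtt {J}_\mathbf {u}\) evaluated at \((\mathbf {u}, \mathbf {v}^*(\mathbf {u}))\)), is:

```python
import numpy as np

def rw_jacobian(kind, A, J_u):
    """Approximate Jacobian of the reduced problem (8) for the Ruhe-Wedin variants.

    A   : J_v(u) = A(u), shape (m, k)
    J_u : d eps / d u evaluated at (u, v*(u)), shape (m, n)
    """
    if kind == "RW3":                      # assume dv*(u)/du = 0  ->  alternation
        return J_u
    P_perp = np.eye(A.shape[0]) - A @ np.linalg.pinv(A)   # projector onto left nullspace of A
    if kind == "RW2":                      # drop the term involving d[A(u)^+]/du
        return P_perp @ J_u
    raise ValueError("RW1 uses the exact Jacobian of Eq. (12), including d[A(u)^+]/du.")
```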

2.3 Nonlinear Variable Projection (Nonlinear VarPro)

The approach in Sect. 2.2 can be applied only to separable nonlinear least-squares, where the objective is bivariate and linear in at least one of two variables. Strelow [20] extended this to apply to nonseparable problems, which can be expressed as

$$\begin{aligned} \min _{\mathbf {u}, \mathbf {v}} \Vert {\varvec{\varepsilon }}(\mathbf {u}, \mathbf {v})\Vert _2^2&= \min _{\mathbf {u}, \mathbf {v}} \Vert \mathbf {f}(\mathbf {u}, \mathbf {v}) - \mathbf {b}\Vert _2^2. \end{aligned}$$
(13)

Similar to Sect. 2.2, we wish to find \(\mathbf {v}^*(\mathbf {u}) := {{{\mathrm{arg\,min}}}}_{\mathbf {v}} \Vert {\varvec{\varepsilon }}(\mathbf {u}, \mathbf {v})\Vert _2^2\) and solve

$$\begin{aligned} \min _{\mathbf {u}} \Vert {\varvec{\varepsilon }}^*(\mathbf {u})\Vert _2^2&:= \min _\mathbf {u}\Vert {\varvec{\varepsilon }}(\mathbf {u}, \mathbf {v}^*(\mathbf {u}))\Vert _2^2. \end{aligned}$$
(14)

In this case, \(\mathbf {v}^*(\mathbf {u})\) may not have a closed form solution as the residual vector is nonlinear in both \(\mathbf {u}\) and \(\mathbf {v}\). Instead, we apply a second-order iterative solver (e.g. LM) to approximately solve \(\mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {v}} \Vert {\varvec{\varepsilon }}(\mathbf {u}, \mathbf {v})\Vert _2^2\) and store the final solution in \({\hat{\mathbf {v}}}^*_0\).

Now, assuming that \({\hat{\mathbf {v}}}^*_0\) has converged, we define \({\hat{\mathbf {v}}}^*(\mathbf {u})\) as the quantity obtained by performing one additional Gauss-Newton iteration over \(\mathbf {v}\) from \((\mathbf {u}, {\hat{\mathbf {v}}}^*_0)\). i.e.

$$\begin{aligned} {\hat{\mathbf {v}}}^*(\mathbf {u}) := {\hat{\mathbf {v}}}^*_0 + \underbrace{\mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {\Delta }\mathbf {v}} \Vert {\varvec{\varepsilon }}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0) + \mathtt {J}_\mathbf {v}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0) \mathbf {\Delta }\mathbf {v}\Vert _2^2}_\text {Additional Gauss-Newton step} = {\hat{\mathbf {v}}}^*_0 - \mathtt {J}_\mathbf {v}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0)^\dagger {\varvec{\varepsilon }}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0). \end{aligned}$$
(15)

The above expression implicitly assumes that \({\varvec{\varepsilon }}(\mathbf {u},\mathbf {v})\) is locally linear in \(\mathbf {v}\) about \({\varvec{\varepsilon }}(\mathbf {u},{\hat{\mathbf {v}}}^*_0)\). This approximation allows us to estimate \(d\mathbf {v}^*(\mathbf {u}) / d\mathbf {u}\) by computing \(d{\hat{\mathbf {v}}}^*(\mathbf {u}) / d\mathbf {u}\):

$$\begin{aligned} \frac{d\mathbf {v}^*(\mathbf {u})}{d\mathbf {u}} \approx \frac{d{\hat{\mathbf {v}}}^*(\mathbf {u})}{d\mathbf {u}} = - \frac{\partial }{\partial \mathbf {u}}[\mathtt {J}_\mathbf {v}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0)^\dagger {\varvec{\varepsilon }}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0)]. \end{aligned}$$
(16)

Combining the results of (10) and (16) yields

$$\begin{aligned} \tilde{\mathtt {J}}^*(\mathbf {u})&:= \mathtt {J}_\mathbf {u}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0) - \mathtt {J}_\mathbf {v}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0) \frac{\partial }{\partial \mathbf {u}}\left[ \mathtt {J}_\mathbf {v}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0)^\dagger {\varvec{\varepsilon }}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0) \right] \\&~= \left( \mathtt {I}- \mathtt {J}_\mathbf {v}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0) \mathtt {J}_\mathbf {v}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0)^\dagger \right) \mathtt {J}_\mathbf {u}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0) \nonumber \end{aligned}$$
(17)
$$\begin{aligned}&\qquad \quad - \mathtt {J}_\mathbf {v}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0) \frac{\partial }{\partial \mathbf {u}} \left[ \mathtt {J}_\mathbf {v}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0)^\dagger \right] {\varvec{\varepsilon }}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0) \end{aligned}$$
(18)

where \(\tilde{\mathtt {J}}^*(\mathbf {u})\) is the approximate Jacobian used by Nonlinear VarPro. This expression can be further simplified using the differentiation rule for matrix pseudo-inverses [6].

In summary, one iteration of Nonlinear VarPro amounts to solving one inner minimization over \(\mathbf {v}\) given \(\mathbf {u}\), which outputs \({\hat{\mathbf {v}}}^*_0 \approx \mathbf {v}^*(\mathbf {u})\), followed by one outer minimization over \(\mathbf {u}\), which is achieved by linearizing the residual vector in \(\mathbf {v}\) about \({\varvec{\varepsilon }}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0)\).

For separable problems, the residual vector is always linear in \(\mathbf {v}\), and therefore (18) becomes the exact Jacobian.

3 Ruhe and Wedin Algorithms for Nonlinear VarPro

We acquire the nonlinear extensions of the original Ruhe and Wedin algorithms [19] as follows: since the original RW1 applies the Gauss-Newton algorithm to the reduced problem (8), we propose that Nonlinear RW1 employ the Gauss-Newton algorithm using the Jacobian derived in (18) (this is essentially the same as Strelow’s General Wiberg [20]). The original RW2 projects the exact Jacobian of the original RW1 to the left nullspace of \(\mathtt {J}_\mathbf {v}(\mathbf {u})\), resulting in an approximated Gauss-Newton matrix. For Nonlinear RW2, we project the Jacobian of Nonlinear RW1 to the left nullspace of \(\mathtt {J}_\mathbf {v}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0)\), which is equivalent to discarding the latter term of (18). Lastly, the original RW3 assumes \(\mathbf {u}\) and \(\mathbf {v}\) to be independent. For the nonlinear case, this translates to \(d{\hat{\mathbf {v}}}^*(\mathbf {u}) / d\mathbf {u}= 0\), yielding \(\mathtt {J}_\mathbf {u}(\mathbf {u}, {\hat{\mathbf {v}}}^*_0)\) as the approximate Jacobian of Nonlinear RW3 (Table 1).

Table 1. A list of approximate Jacobians used by our nonlinear extension of the Ruhe and Wedin algorithms. Nonlinear RW1 applies the Gauss-Newton (GN) algorithm on the reduced problem (14), Nonlinear RW2 makes an approximation to the Jacobian as described in Sect. 3 and Nonlinear RW3 makes a further approximation which turns it into alternation. \({\hat{\mathbf {v}}}^*_0\) is obtained using a second-order iterative solver (see Sect. 2.3).
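As an illustrative, unoptimized sketch, one outer iteration of Nonlinear RW2 could be organised as follows; the inner solver is reduced to plain Gauss-Newton for brevity, whereas our implementation uses a full second-order (LM) inner loop:

```python
import numpy as np

def nonlinear_rw2_iteration(u, v, eps_fn, Ju_fn, Jv_fn, lam, inner_iters=20):
    """One outer LM iteration of Nonlinear RW2 (Table 1), sketched with dense algebra."""
    # Inner minimization over v given u, yielding v_hat ~ v*(u).
    for _ in range(inner_iters):
        v = v - np.linalg.lstsq(Jv_fn(u, v), eps_fn(u, v), rcond=None)[0]
    # Outer step over u using the approximate Jacobian (I - Jv Jv^+) Ju.
    eps, Ju, Jv = eps_fn(u, v), Ju_fn(u, v), Jv_fn(u, v)
    P_perp = np.eye(Jv.shape[0]) - Jv @ np.linalg.pinv(Jv)
    J_star = P_perp @ Ju
    du = np.linalg.solve(J_star.T @ J_star + lam * np.eye(u.size), -J_star.T @ eps)
    return u + du, v
```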

3.1 The Sparsity of the Hessian Approximations

Okatani et al. [17] and Strelow [20] pointed out a structural similarity between the Schur complement reduced system for Joint optimization and Linear VarPro. Analysing the exact differences between the two methods is a research question in its own right, and therefore in this paper we just confirm numerically that the Hessian approximations of both RW1 and RW2 (linear and nonlinear) have the same sparsity pattern as, but are not equal to, the Schur complement reduced system for Gauss-Newton based Joint optimization. Furthermore, we found (through code implementation) that the iteration complexity of our proposed nonlinear extensions is similar (the same for RW2, higher for RW1) to standard bundle adjustment with embedded point iterations [12]. Hence, one LM iteration of both standard bundle adjustment and our method will take a roughly similar amount of time.

4 A Unified Notation for Uncalibrated Camera Models

In this section, we present a unified notation for uncalibrated (affine and projective) cameras which allows a modularized compilation of bundle adjustment algorithms.

Affine and projective cameras are widely-used uncalibrated camera models which can be expressed in the homogeneous or the inhomogeneous form. We can incorporate both models and forms into a single camera matrix by defining

$$\begin{aligned} \mathtt {P}_i := \mathtt {P}(\mathbf {p}_i, \mathbf {q}_i, s_i, \mu _i) := \begin{bmatrix} p_{i1}&p_{i2}&p_{i3}&p_{i4} \\ p_{i5}&p_{i6}&p_{i7}&p_{i8} \\ \mu _i q_{i1}&\mu _i q_{i2}&\mu _i q_{i3}&s_i \\ \end{bmatrix} =: \begin{bmatrix} \mathbf {p}_{i1:}^\top \\ \mathbf {p}_{i2:}^\top \\ \begin{bmatrix} \mu _i \mathbf {q}_i^\top&s_i \end{bmatrix} \end{bmatrix} \end{aligned}$$
(19)

where \({\mathbf {p}}_i = [ {\mathbf {p}}_{i1:}^\top , {\mathbf {p}}_{i2:}^\top ]^\top = [ p_{i1}, \cdots , p_{i8} ]^\top \) and \(\mathbf {q}_i = [ q_{i1}, q_{i2}, q_{i3} ]^\top \) are the projective camera parameters for frame i, \(\mu _i \in [0, 1]\) indicates the degree of “projectiveness” of frame i, and \(s_i\) is the scaling factor of the i-th camera.

Table 2. A summary of uncalibrated camera models using the unified notation. \(\mathtt {H}\in \mathbb {R}^{4\times 4}\) is an arbitrary invertible matrix, and \(\mathtt {A}\in \mathbb {R}^{4\times 4}\) is an arbitrary invertible matrix with the last row set to [0, 0, 0, 1]. \(\alpha _i, \beta _j\in \mathbb {R}\) are arbitrary scale factors.

Now each point is typically parametrized as

$$\begin{aligned} {\tilde{{\mathbf {x}}}}_j&:= {\tilde{{\mathbf {x}}}}({\mathbf {x}}_j, t_j) := \begin{bmatrix} {\mathbf {x}}_j^\top&t_j \end{bmatrix}^\top := \begin{bmatrix} x_{j1}&x_{j2}&x_{j3}&t_j \end{bmatrix}^\top \end{aligned}$$
(20)

where \({\mathbf {x}}_j = \begin{bmatrix} x_{j1}, x_{j2}, x_{j3} \end{bmatrix}^\top \) is the vector of unscaled inhomogeneous coordinates of point j and \({\tilde{{\mathbf {x}}}}_j\) is the vector of homogeneous coordinates of point j. (19) and (20) lead to a unified projection function

$$\begin{aligned} \varvec{\pi }\left( \mathtt {P}_i {\tilde{{\mathbf {x}}}}_j \right) = \varvec{\pi }\left( \mathtt {P}({\mathbf {p}}_i, \mathbf {q}_i, s_i, \mu _i)\, {\tilde{{\mathbf {x}}}}({\mathbf {x}}_j, t_j) \right) = \frac{1}{\mu _i \mathbf {q}_i^\top {\mathbf {x}}_j + s_i t_j} \begin{bmatrix} \mathbf {p}_{i1:}^\top {\tilde{{\mathbf {x}}}}_j \\ \mathbf {p}_{i2:}^\top {\tilde{{\mathbf {x}}}}_j \end{bmatrix} \end{aligned}$$
(21)

We show in Table 2 that the affine and the projective models (in both homogeneous and inhomogeneous forms) are specific instances of the unified model described above.
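As an illustration of this notation (the function names below are ours, not from [11]), the unified camera of (19) and the projection of (21) can be written as:

```python
import numpy as np

def unified_camera(p, q, s, mu):
    """Assemble P_i of Eq. (19) from p_i (8,), q_i (3,), s_i and mu_i."""
    P = np.empty((3, 4))
    P[0, :] = p[0:4]
    P[1, :] = p[4:8]
    P[2, :3] = mu * q
    P[2, 3] = s
    return P

def unified_projection(P, x, t):
    """Project the homogeneous point of Eq. (20) and apply the division of Eq. (21)."""
    x_tilde = np.append(x, t)
    y = P @ x_tilde                  # third entry equals mu * q.x + s * t
    return y[:2] / y[2]
```

Setting \(\mu _i = 0\) (with \(s_i = t_j = 1\)) reduces the denominator to 1, recovering the affine model, while \(\mu _i = 1\) gives the projective model of (1).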

5 Compilation of Affine/projective Bundle Adjustment Algorithms

In this section, we present the building blocks of our bundle adjustment algorithms for uncalibrated cameras which stem from Sects. 2, 3 and 4. To simplify notation, we stack the variables introduced in Sect. 4 across all cameras or points by omitting the corresponding subscript, e.g. \(\mathbf {p} = [{\mathbf {p}}_1^\top , \cdots , {\mathbf {p}}_f^\top ]^\top \) and \({\mathbf {x}} = [{\mathbf {x}}_1^\top , \cdots , {\mathbf {x}}_n^\top ]^\top \), where f is the number of frames and n is the number of points in the dataset used. We also define \({\tilde{\mathbf {p}}}\) to be the collection of the camera parameters \({\mathbf {p}}\), \(\mathbf {q}\) and \(\mathbf {s}\). (Note that \(\varvec{\mu }\) is not included.) We can now rewrite (1) as

$$\begin{aligned} \min _{{\mathbf {p, q, s, x, t}}}\Vert {{\varvec{\varepsilon }}(\mathbf {p,\, q,\, s,\, x,\, t,}\, {\varvec{\mu }})}\Vert _2^2. \end{aligned}$$
(22)

In this paper, we assume that \({\varvec{\mu }}\) (the projectiveness vector) is fixed during optimization. (Finding an optimal way to adjust \(\varvec{\mu }\) at each iteration is future work.) Our algorithms first eliminate points (\({\tilde{{\mathbf {x}}}}\)), generating a reduced problem over camera poses (\({\tilde{\mathbf {p}}}\)), but we could reverse the order to eliminate poses first as described in Sect. 6.1 of [23].

5.1 Required Derivatives

We only need three types of derivatives to implement all the algorithms mentioned in Sect. 2 irrespective of the camera model used.

The first two derivatives are the Jacobian with respect to camera poses (\(\mathtt {J}_{{\tilde{\mathbf {p}}}}\)) and the Jacobian with respect to feature points (\(\mathtt {J}_{{\tilde{{\mathbf {x}}}}}\)) which are the first order derivatives of the original objective (1). These Jacobians are used by both Joint optimization and VarPro but are evaluated at different points in the parameter space — at each iteration, Joint optimization evaluates the Jacobians at \(({\tilde{\mathbf {p}}}, {\tilde{{\mathbf {x}}}})\) whereas Linear and Nonlinear VarPro evaluate them at \(({\tilde{\mathbf {p}}}, {\tilde{{\mathbf {x}}}}^*({\tilde{\mathbf {p}}}))\), where \({\tilde{{\mathbf {x}}}}^*({\tilde{\mathbf {p}}})\) denotes a set of feature points which locally minimizes (1) given the camera parameters \({\tilde{\mathbf {p}}}\) (Table 3).

The third derivative, which involves a second-order derivative of the objective, is only required by Linear and Nonlinear RW1.

Table 3. A list of derivatives required for implementing affine and projective bundle adjustment algorithms based on the methods illustrated in Sect. 2. The camera parameters (\({\tilde{\mathbf {p}}}\)) consist of \(\mathbf {p}\), \(\mathbf {q}\) and \(\mathbf {s}\), and the point parameters (\({\tilde{{\mathbf {x}}}}\)) consist of \(\mathbf {x}\) and \(\mathbf {t}\). Note that the effective column size of these quantities will vary depending on the parameterization of the camera model used. The Jacobians are the first-order derivatives of \({\varvec{\varepsilon }}({\mathbf {p}}, {\mathbf {q}}, {\mathbf {s}}, {\mathbf {x}}, {\mathbf {t}}, \varvec{\mu })\) in (22).
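For a single observation, the first two derivatives follow directly from the quotient rule. The sketch below (our notation, not the paper's code) computes the residual and the \(2\times 4\) Jacobian with respect to the homogeneous point; the camera Jacobian is obtained analogously:

```python
import numpy as np

def residual_and_point_jacobian(P, x_tilde, m):
    """Residual of Eq. (1) for one observation and its 2x4 Jacobian w.r.t. x_tilde."""
    y = P @ x_tilde                                   # [p1.x~, p2.x~, w]
    proj = y[:2] / y[2]
    eps = proj - m                                    # 2D reprojection residual
    # Quotient rule: d(y_k / y_3)/d x~ = P[k,:] / y_3 - (y_k / y_3) * P[2,:] / y_3.
    J_x = (P[:2, :] - np.outer(proj, P[2, :])) / y[2]
    return eps, J_x
```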

5.2 Constraining Local Scale Freedoms in Homogeneous Camera Models

Homogeneous camera models have local scale freedoms for each camera and point (see Table 2). We need to constrain these scales appropriately for the second-order update to be numerically stable — manually fixing an entry in each camera and point (as in the inhomogeneous coordinate system) may lead to numerical instability if some points or cameras are located at extreme positions.

To do this, we apply a Riemannian manifold optimization framework [1, 14]. The intuition behind this is that scaling each point and each camera arbitrarily does not change the objective, and therefore, each point and each camera can be viewed as lying on the Grassmann manifold (which is a particular Riemannian manifold).

In essence, optimization on the Grassmann manifold can be achieved [14] by projecting each Jacobian to its tangent space, computing the second-order update of parameters on the tangent space then retracting back to the manifold by normalizing each camera and/or point. This is numerically stable since the parameters are always updated orthogonal to the current solution. Details of our implementation can be found in [11].
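For a single homogeneous camera (or point) vector, the projection-and-retraction step can be sketched as follows; this is a simplified dense illustration of the idea, not our actual implementation (see [11]):

```python
import numpy as np

def grassmann_lm_step(p, J, eps, lam):
    """Damped update of a unit-norm parameter vector p with tangent-space projection."""
    p = p / np.linalg.norm(p)
    T = np.eye(p.size) - np.outer(p, p)        # projector onto the tangent space at p
    Jt = J @ T                                 # Jacobian restricted to tangent directions
    dp = T @ np.linalg.solve(Jt.T @ Jt + lam * np.eye(p.size), -Jt.T @ eps)
    p_new = p + dp                             # update orthogonal to the current solution
    return p_new / np.linalg.norm(p_new)       # retraction: normalize back onto the manifold
```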

5.3 Constraining Gauge Freedoms for the VarPro-Based Algorithms

Unlike scale freedoms, gauge freedoms are present in all camera models listed in Sect. 4.

Our VarPro-based algorithms eliminate points such that the matrix of all camera parameters lies on the Grassmann manifold. This means that any set of cameras which shares the same column space as the current set leaves the objective unchanged. With the inhomogeneous affine camera model, the matrix of all cameras lies on a more structured variant of the Grassmann manifold as the scales are fixed to 1.

Since homogeneous camera models require both scale and gauge freedoms to be removed simultaneously (and the Jacobians are already projected to remove the scale freedoms), we incorporate a technique introduced in [17] which penalizes updates of the stacked camera matrix along the column space of the current matrix, and this constrains all 16 gauge freedoms. (This approach can be viewed [10] as a relaxed form of the manifold optimization framework described in Sect. 5.2.) With the inhomogeneous affine model, we adapt this technique to avoid overconstraining the problem, which removes 9 out of 12 gauge freedoms. More details are included in [11].

We have not implemented a gauge-constraining technique for Joint optimization but [16] could be applied.

5.4 Remarks

Combining all the aforementioned techniques yields 16 algorithms, which are listed in [11]. We use 4 of them (see Table 4) to synthesize two-stage meta-algorithms in Sect. 6.

As mentioned in Sect. 2.3, Nonlinear VarPro requires iterative inner minimization over points given cameras at each iteration. Our algorithms initialize points from the closest algebraic solution obtained using the Direct Linear Transformation (DLT) method [9].
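A minimal sketch of this DLT initialization for a single point (standard homogeneous triangulation, cf. [9]) is:

```python
import numpy as np

def dlt_point(Ps, ms):
    """Algebraic estimate of a homogeneous 3D point from its visible projections.

    Ps : list of 3x4 camera matrices of the frames where the point is visible.
    ms : list of corresponding observed 2D projections.
    """
    rows = []
    for P, m in zip(Ps, ms):
        rows.append(m[0] * P[2, :] - P[0, :])
        rows.append(m[1] * P[2, :] - P[1, :])
    _, _, Vt = np.linalg.svd(np.vstack(rows))
    return Vt[-1]                  # right singular vector of the smallest singular value
```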

Table 4. A list of affine and projective bundle adjustment algorithms used for our two-stage meta-algorithms in Sect. 6. We compile these algorithms using the building blocks from Sect. 5.

6 Two-Stage Meta-Algorithms for Projective Bundle Adjustment

Initially, we attempted to use the projective bundle adjustment algorithms compiled in Sect. 5 directly on the datasets listed in Tables 6 and 7. However, our preliminary investigation showed that none of these work out of the box as the Linear VarPro-based algorithms do for the inhomogeneous affine case.

To resolve this, we propose the following strategy: perform affine bundle adjustment first and then use the outputs to initialize projective bundle adjustment. This is inspired by the fact that some projective algorithms such as projective matrix factorization [9, 21] and trilinear projective bundle adjustment [20] initialize all camera depths to 1, which is equivalent to employing the affine camera model. The key difference between our strategy and the aforementioned methods is that our approach enforces the affine model throughout the first stage whereas other methods can switch to the projective model straight after initialization. Since our strategy essentially places a prior on the affine model, it is important to check how this performs on strong perspective scenes.
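In pseudocode-like form, one run of such a meta-algorithm reads as below; affine_ba and projective_ba stand for any of the stage algorithms in Table 4 and are placeholders, not actual function names from our code:

```python
import numpy as np

def two_stage_run(observations, f, n, affine_ba, projective_ba, rng):
    """One run of a two-stage meta-algorithm (Table 5), starting from random values."""
    cams = rng.standard_normal((f, 8))      # random affine camera parameters p_i
    pts = rng.standard_normal((n, 3))       # random inhomogeneous points x_j
    # Stage 1: affine bundle adjustment (mu = 0), VarPro-based.
    cams, pts = affine_ba(cams, pts, observations)
    # Stage 2: projective bundle adjustment (mu = 1), warm-started from stage 1.
    return projective_ba(cams, pts, observations)
```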

Table 5. A list of two-stage meta-algorithms used in our experiments.
Table 6. Synthetic tracks of 319 points randomly generated on a sphere of radius 10.0 viewed from 36 cameras. The cameras are equidistantly positioned and form a ring of radius d, looking down the sphere at \(60\,^\circ \) from the vertical axis. We employ structured visibility patterns with high missing rates to depict real tracks with occlusions and tracking failures. \(\mathcal {N}(\mathbf {0}, \mathtt {I})\) noise is added.

For the first (affine) stage, we choose AIRW2P (Affine Inhomogeneous RW2 with manifold Projection) and AHRW2P (Affine Homogeneous RW2 with manifold Projection). We opt for the VarPro-based algorithms, which have large convergence basins for the affine case. We drop the RW1 series as they are substantially slower than the RW2 series with comparable success rates. (A similar phenomenon is reported in [8, 10].)

For the second (projective) stage, we choose PHRW1P (Projective Homogeneous RW1 with manifold Projection) and PHJP (Projective Homogeneous Joint optimization with manifold Projection). We drop PHRW2P after observing its poor performance on some of the datasets used. None of the inhomogeneous projective algorithms are selected due to numerical stability issues (see Sect. 5.2).

Table 7. Real datasets used for the experiments. f denotes the number of frames and n denotes the number of feature points. \(^*\)Di2 was generated by projecting real points from synthetic camera poses made deliberately close to the 3D structure thereby inducing strong perspective effects.

7 Experiments

All experiments were carried out on a workstation with a 2.2 GHz Intel Xeon E5-2660 processor and 32 GB of 1600 MHz DDR3 memory. We used MATLAB R2015b in single-threaded mode to run all the experiments.

We tested on various small synthetic (Table 6) and real SfM datasets (Table 7) derived from circular motion (Din, Dio, Di2, Hou), non-circular motion (Btb), forward movement (Cor, R47, Sth, Wil) and a small number of frames (Lib, Me1, Me2, Wad).

Fig. 2.

The figures show the success rates of each meta-algorithm on each dataset. (A run is counted as successful if and only if it reaches the best known optimum of the dataset used.) We conclude that TSMA1 and TSMA2 are winners by narrow margins.

On each dataset, we ran all four two-stage meta-algorithms listed in Table 5 for 100 runs. On each run, initial camera poses and points were drawn from \(\mathcal {N}(\mathbf {0}, \mathtt {I})\). The first stage of each meta-algorithm minimized the affine version of (1), and the second stage minimized the projective version of the same problem. We set the maximum number of iterations in each stage to 1000 and the function value tolerance to \(10^{-9}\).

We then compared the fraction of runs in which each meta-algorithm converged to the best observed minimum on each dataset, defining this quantity as the success rate. The success rates of different meta-algorithms are compared in Fig. 2a and b.

Throughout this paper, we report the normalized cost values which can be computed as follows:

\(\sqrt{\text {Equation}~(1)~/~(2 \times \text {Total number of visible points over all frames})}\).
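Equivalently, given the stacked residual vector of all visible observations (two components per observation), the reported value is the RMS per residual component; a minimal sketch, under that assumption, is:

```python
import numpy as np

def normalized_cost(residuals):
    """RMS reprojection error per residual component (2 components per visible observation)."""
    return np.sqrt(np.sum(residuals ** 2) / residuals.size)
```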

8 Discussions

Figure 2a and b show that TSMA1 and TSMA2 return the global optimum in a large fraction of runs on most datasets. Considering that each run is initialized from arbitrary cameras and points, we believe that these are novel and valuable results.

On the synthetic datasets (Fig. 2a), all our meta-algorithms yield high success rates (74–100 %) until the ground truth cameras are moved radically close to the sphere (e.g. S10.5/L). As discussed in Sect. 6, this is somewhat expected since our strategy is inevitably biased towards affine reconstruction. However, one should bear in mind that these are extreme cases where the cameras are located only 0.5 units away from the surface of the sphere of radius 10.0, and our strategy still succeeds with high probability on strong perspective datasets such as S11/L, S12/L and S13/L. The presence of loop closure does not seem to influence success rates massively.

On the real sequences (Fig. 2b), each of TSMA1 and TSMA2 achieves 88–100 % on all datasets but one (Lib for TSMA1 and Cor for TSMA2). This demonstrates that these methods work well in practice, as they provide consistent performance across different kinds of camera motion.

Regarding the first (affine) stage algorithms, we do not observe a clear boost in success rates from employing AHRW2P instead of AIRW2P. (We only observe this on the Cor dataset (see Fig. 2b), which comprises forward camera movements.) This contradicts our hypothesis that AHRW2P, which is a numerically-stable reformulation of AIRW2P, should perform better on strong perspective sequences. The results imply that the potential numerical instability caused by the use of inhomogeneous coordinates is not a major issue in the affine case. (It is still an issue for the projective model.)

In addition to the main experiments, we investigated whether our meta-algorithms could serve as an initializer for the full bundle adjustment process. We ran a projective bundle adjustment algorithm, namely PHRW1P, on the full dinosaur dataset (Dio) with the initial camera values set to those of the global optimum of the trimmed dataset (Din). This allowed PHRW1P to reach the global optimum of the full sequence within 10 iterations. Based on this observation, we believe that our meta-algorithms could be applied to a segment of a large dataset to trigger incremental or full bundle adjustment.

We also tried incrementing \(\mu \) (the projectiveness parameter) gradually to make the affine-projective transition smoother, but this strategy performed comparably to projective bundle adjustment without affine initialization. Implementing a fully unified algorithm still remains a challenge.

9 Conclusions

In this work we analysed whether the Variable Projection (VarPro) method, which is highly successful in finding global minima of affine factorization problems without careful initialization, is equally effective in the projective scenario. Unfortunately, the answer is that the success rate of VarPro-based algorithms cannot be directly replicated in the projective setting. Thus, we proposed and evaluated several meta-algorithms to overcome this shortcoming, and each of the winning methods (TSMA1 and TSMA2) obtained success rates between 88 and 100 % on all real datasets but one. Experimentally it turns out that using an affine factorization based on VarPro to warm-start projective bundle adjustment is essential to boost the success rate.

We demonstrated that the convergence basin can be greatly enhanced using the right combination of methods. By unifying affine and projective factorization problems we also derived numerically better conditioned formulations to solve these instances.

Future work includes the following: addressing outliers in the measurements and therefore robustness in the cost function (e.g. by incorporating a robust kernel reformulation [25]) and operating in metric instead of projective space by restricting the unknowns to the respective Lie group. A highly ambitious goal is to solve large datasets as introduced in [3] via an initialization-free approach.