1 Introduction

The geometry underlying two views of a rigid scene has been studied for a very long time. Using different projection models, the structure from motion problem, i.e., the problem of estimating both the scene geometry and the camera geometry from only image data, has been solved for many cases [8]. In his seminal paper from 1979, Ullman used the orthographic projection model to formulate and solve a number of structure from motion problems, for a small number of views and points [27]. During the following years, these theories were developed and refined. The main concern was to develop methods and algorithms to accurately estimate the geometry, in the presence of image noise [7, 9, 10, 14]. During recent years, more focus has been given to robustly estimating geometry in the presence of outliers in the data. To enable this, a number of approaches have been followed. A classic way to handle outliers is through robust estimation schemes based on RANSAC. In these frameworks, one needs methods for estimating model parameters given a small or minimal number of data points. For calibrated two-view projective geometry, the five-point algorithm estimates the essential matrix minimally, using five image point correspondences. In another direction, algorithms for global inlier maximization have been developed. These are often based on relaxing an initial non-convex problem into a more tractable problem, using different error norms or branch-and-bound methods [3, 12, 13, 16]. Arguably, in many cases it is more beneficial to instead relax the underlying camera models (e.g., by using an orthographic camera model), to get tractable problems. Using approximate models to estimate epipolar geometry has been investigated previously; in [6], Goshen and Shimshoni used two matching SIFT descriptors, with dominant directions and scale, to construct eight correspondences. Perdoch et al. [20] used a fixed partial calibration to estimate the fundamental matrix from two local affine frames. 
In [21], it was shown that the affine fundamental matrix is the first-order Taylor approximation of the full perspective fundamental matrix. Using two matching MSER feature points, an approximate fundamental matrix could be estimated. Lately, methods that find, in polynomial time, models that optimally maximize the number of inliers have been developed [5, 25]. In this paper, we investigate a number of specific geometric problems, related to orthographic projections of rigid scenes in two views, that have received little or no attention earlier. We believe that this work gives some additional theoretical insights and at the same time provides powerful new tools to robustly establish image point correspondences in the presence of outliers.

Our main contribution is three algorithms for estimating orthographic epipolar geometry for two views:

  • A minimal three-point solver that uses only three points and yields only two solutions, making it very suitable for RANSAC-based estimation.

  • An optimal solver, maximizing the number of inliers using recent methods for optimal estimation and based on three specialized minimal solvers.

  • A least squares solver that minimizes a relevant reprojection error, by finding all stationary points of the corresponding Lagrangian.

2 Essential Matrix Estimation

In [18] the general form of the fundamental matrix for two affine cameras was given,

$$\begin{aligned} F = \begin{bmatrix} 0&\quad 0&\quad a \\ 0&\quad 0&\quad b \\ c&\quad d&\quad e \end{bmatrix}. \end{aligned}$$
(1)

This matrix has five parameters but is only defined up to scale, which leaves four degrees of freedom. It can also be described using the directions and offsets of the corresponding epipolar lines [17]. It can be linearly estimated from at least four point correspondences [22] or two ellipses [1].

The corresponding calibrated entity—the essential matrix corresponding to two orthographic projections—will fulfill an additional nonlinear constraint. This fact has been mentioned in passing earlier [9, 15], but the implications have not been pursued in any depth. The extra constraint means that the essential matrix in this case only has three degrees of freedom. This low dimension will give us powerful tools for estimating the epipolar geometry. We will state the constraint in terms of the essential matrix.

Theorem 1

The essential matrix,

$$\begin{aligned} E = \begin{bmatrix} 0&\quad 0&\quad a \\ 0&\quad 0&\quad b \\ c&\quad d&\quad e \end{bmatrix}, \end{aligned}$$
(2)

corresponding to two orthographic views, will fulfill

$$\begin{aligned} a^2 + b^2 = c^2 + d^2 . \end{aligned}$$
(3)

Proof

Two orthographic views can be represented by the camera matrices

$$\begin{aligned} P_1 = \begin{bmatrix} 1&\quad 0&\quad 0&\quad 0 \\ 0&\quad 1&\quad 0&\quad 0 \\ 0&\quad 0&\quad 0&\quad 1 \end{bmatrix} \quad P_2 = \begin{bmatrix} r_{11}&\quad r_{12}&\quad r_{13}&\quad t_{1} \\ r_{21}&\quad r_{22}&\quad r_{23}&\quad t_{2} \\ 0&\quad 0&\quad 0&\quad 1 \end{bmatrix}, \end{aligned}$$
(4)

where \(\mathbf{r_1}=[ r_{11} \,, r_{12} \,, r_{13} ]\) and \(\mathbf{r_2}=[ r_{21} \,, r_{22} \,, r_{23} ]\) are orthonormal vectors. The corresponding essential matrix is then given by

$$\begin{aligned} E = \begin{bmatrix} 0&\quad 0&\quad a \\ 0&\quad 0&\quad b \\ c&\quad d&\quad e \end{bmatrix} = \begin{bmatrix} 0&\quad 0&\quad r_{13}r_{21} - r_{11}r_{23} \\ 0&\quad 0&\quad r_{13}r_{22} - r_{12}r_{23} \\ r_{23}&\quad -r_{13}&\quad r_{13}t_2 - r_{23}t_1 \end{bmatrix}. \end{aligned}$$
(5)

Let

$$\begin{aligned} \mathbf{r_3}&= [ r_{31} \,, r_{32} \,, r_{33} ] = \mathbf{r_1} \times \mathbf{r_2} \end{aligned}$$
(6)
$$\begin{aligned}&= [ r_{12}r_{23} - r_{13}r_{22} \,, r_{13}r_{21} - r_{11}r_{23} \,, r_{11}r_{22} - r_{12}r_{21} ]. \end{aligned}$$
(7)

From (5) and the orthonormality constraints, it is clear that

$$\begin{aligned} c^2+d^2 = r_{23}^2 +r_{13}^2 = 1-r_{33}^2. \end{aligned}$$
(8)

From (5) and (7), we have that \(a^2 + b^2 = r_{32}^2 + r_{31}^2.\) The orthonormality constraint now gives that

$$\begin{aligned} a^2 + b^2 = r_{32}^2 + r_{31}^2 = 1- r_{33}^2, \end{aligned}$$
(9)

and hence \(a^2 + b^2 = c^2 + d^2.\) \(\square \)
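Theorem 1 is easy to check numerically. The following is a minimal sketch, not part of the original implementation: the helper names `rotation_zyx` and `orthographic_essential` and the particular angles and translation are assumptions chosen for illustration. It builds E from (5) for one rotation and compares the two sides of (3).

```python
import math

def rotation_zyx(alpha, beta, gamma):
    """Rotation matrix R = Rz(alpha) Ry(beta) Rx(gamma) as nested lists."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [ca * cb, ca * sb * sg - sa * cg, ca * sb * cg + sa * sg],
        [sa * cb, sa * sb * sg + ca * cg, sa * sb * cg - ca * sg],
        [-sb, cb * sg, cb * cg],
    ]

def orthographic_essential(R, t1, t2):
    """Entries (a, b, c, d, e) of E in (5), from the first two rows of R."""
    r11, r12, r13 = R[0]
    r21, r22, r23 = R[1]
    a = r13 * r21 - r11 * r23
    b = r13 * r22 - r12 * r23
    c = r23
    d = -r13
    e = r13 * t2 - r23 * t1
    return a, b, c, d, e

# Illustrative values only (hypothetical camera motion).
R = rotation_zyx(0.3, -0.8, 1.1)
t1, t2 = 0.4, -0.2
a, b, c, d, e = orthographic_essential(R, t1, t2)
lhs = a * a + b * b
rhs = c * c + d * d
print(abs(lhs - rhs))  # expected to be at rounding-error level
```

Both sides also equal \(1-r_{33}^2\), as used in the proof, so the same script can be used to check (8) and (9) separately.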

For two corresponding points \(\mathbf{u} =[x, y, 1]\) and \(\mathbf{u^{\prime }} = [x^{\prime }, y^{\prime }, 1]\), we have that \(\mathbf{u}^TE \mathbf{u^{\prime }} = 0 \Leftrightarrow e + ax + by + cx^{\prime } + dy^{\prime } = 0 \). The essential matrix is only determined up to scale. To get appropriate and symmetric epipolar line constraints, we fix the scale by setting \(a^2 + b^2 = 1 (= c^2+d^2)\). With this normalization, \((a,b)\) and \((c,d)\) are the unit normal directions of the two epipolar lines. The offsets of the two epipolar lines from the origin are then given by \(e + cx^{\prime } + dy^{\prime }\) and \(e + ax + by\), respectively. This gives us a natural way to define a symmetric and geometrically valid error. The mean perpendicular distance to the epipolar lines in the two images is given by

$$\begin{aligned} D&= \frac{1}{2} ( |ax + by + e + cx^{\prime } + dy^{\prime }| \end{aligned}$$
(10)
$$\begin{aligned}&\quad +|cx^{\prime } + dy^{\prime } +e + ax + by|) \end{aligned}$$
(11)
$$\begin{aligned}&= | e + ax + by + cx^{\prime } + dy^{\prime }|. \end{aligned}$$
(12)

In [22], three different cost functions, for affine geometry estimation, were discussed. Our choice of normalization makes these three different cost functions collapse into the one described above. For a thorough discussion on model selection functions for two-view geometry see [26].
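As a small illustration of (12), the sketch below normalizes an essential matrix so that \(a^2+b^2=1\) and evaluates the epipolar distance D for one correspondence. The function names and the numerical values are hypothetical; the only assumption taken from the text is that the input satisfies (3), so that \(c^2+d^2=1\) follows from the same rescaling.

```python
import math

def normalize_essential(a, b, c, d, e):
    """Rescale (a, b, c, d, e) so that a^2 + b^2 = 1.

    Assumes the input satisfies (3), so c^2 + d^2 = 1 holds as well.
    """
    s = math.hypot(a, b)
    if s == 0.0:
        raise ValueError("a and b must not both be zero")
    return tuple(v / s for v in (a, b, c, d, e))

def epipolar_distance(E, corr):
    """Distance D of (12) for a correspondence (x, y, x', y')."""
    a, b, c, d, e = E
    x, y, xp, yp = corr
    return abs(e + a * x + b * y + c * xp + d * yp)

# Illustrative values: (3, 4, 0, 5, -2.5) satisfies 3^2 + 4^2 = 0^2 + 5^2.
E = normalize_essential(3.0, 4.0, 0.0, 5.0, -2.5)
D = epipolar_distance(E, (1.0, 0.0, 2.0, 0.2))
print(E, D)
```

Because both normals have unit length, the same number D is the perpendicular distance from \((x, y)\) to its epipolar line in the first image and from \((x^{\prime }, y^{\prime })\) to its epipolar line in the second.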

It is known that given two orthographic views of a rigid scene, the full camera geometry cannot be established [9, 10]. But the image data must fulfill the epipolar constraint in order for the geometry to be valid. We will use this rigidity constraint to enable estimates of the geometry, based on a low number of tentative point correspondences. These estimates can in turn be used to find image feature point correspondences in a robust manner.

We will in the next sections show how the essential matrix can be estimated. In Sect. 3, we solve the minimal case of three point correspondences. This gives a very fast solver that can be used in a standard RANSAC framework. In Sect. 4, we derive solvers that can be used to find the globally optimal essential matrix that maximizes the number of inliers, given a set of point correspondences corrupted by outliers. Finally in Sect. 5, we show how to find the nonlinear least squares solution, given four or more point correspondences.

3 A Minimal Three-Point Solver

The essential matrix has three degrees of freedom, and each point correspondence gives one constraint on the parameters. We should hence be able to estimate the essential matrix minimally from three point correspondences. We will now show how this can be done. According to (12), we have the three constraints \(D_i = 0, i=1,2,3.\) These are linear in the five parameters \((a,b,c,d,e)\). In addition, we have the two quadratic constraints \(a^2+b^2=1\) and \(c^2+d^2=1\). We start by eliminating e by taking \(D_2-D_1\) and \(D_3-D_1\). This gives two linear constraints in \((a,b,c,d)\), which we use to express \((a,b)\) in terms of \((c,d)\). Substituting these expressions into \(a^2+b^2=1\) gives a quadratic polynomial in c and d. We can use the second quadratic constraint \(c^2= 1- d^2\) to eliminate all powers of c of degree higher than one. This gives a polynomial of the form \(p_1 = k_1cd+ k_2d^2+ k_3\), where \(k_i\) only depends on image point measurements. Multiplying this polynomial by c and again eliminating higher powers of c gives a polynomial of the form \(p_2 = q_1cd^2+q_2c+ q_3d^3+q_4d\). We can write these two polynomial constraints as

$$\begin{aligned} B \begin{bmatrix} c \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\0 \end{bmatrix}, \end{aligned}$$
(13)

with

$$\begin{aligned} B = \begin{bmatrix} k_1d&k_2d^2+k_3 \\ q_1d^2+q_2&q_3d^3 + q_4d \end{bmatrix}. \end{aligned}$$
(14)

In order for (13) to have a solution, the determinant of B must vanish. The determinant of B is a fourth-degree polynomial in d that contains only even powers, i.e., a quadratic in \(d^2\). It is clear that if \((a,b,c,d,e)\) is a solution to our three-point problem, then \((-a,-b,-c,-d,-e)\) is also a solution, but this corresponds to the same essential matrix. This means that we only need to consider two of the solutions for d. The solver is extremely fast since, in the end, we only need to solve a second-degree polynomial.
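The elimination steps of this section can be sketched in a few lines. This is an illustrative re-derivation in Python, not the released MATLAB solver: the function name, the degeneracy thresholds, and the test data are assumptions. It expresses \((a,b)\) linearly in \((c,d)\), forms \(k_1, k_2, k_3\) of \(p_1\), and solves the resulting quadratic in \(d^2\), fixing the arbitrary global sign by taking \(d > 0\).

```python
import math

def three_point_orthographic(corrs):
    """Sketch of the minimal solver. corrs: three tuples (x, y, x', y').

    Returns up to two tuples (a, b, c, d, e) with d > 0.
    """
    x1, y1, u1, v1 = corrs[0]
    # D_j - D_1 = 0 eliminates e; one row (dx, dy, du, dv) for j = 2, 3.
    (dx2, dy2, du2, dv2), (dx3, dy3, du3, dv3) = [
        (x - x1, y - y1, u - u1, v - v1) for (x, y, u, v) in corrs[1:3]
    ]
    det = dx2 * dy3 - dy2 * dx3
    if abs(det) < 1e-12:
        return []  # degenerate sample: (a, b) cannot be expressed in (c, d)
    # (a, b) = G (c, d), obtained from the two linear constraints.
    g11 = -(dy3 * du2 - dy2 * du3) / det
    g12 = -(dy3 * dv2 - dy2 * dv3) / det
    g21 = -(dx2 * du3 - dx3 * du2) / det
    g22 = -(dx2 * dv3 - dx3 * dv2) / det
    # a^2 + b^2 = alpha c^2 + beta c d + gamma d^2 = 1; use c^2 = 1 - d^2.
    alpha = g11 * g11 + g21 * g21
    beta = 2.0 * (g11 * g12 + g21 * g22)
    gamma = g12 * g12 + g22 * g22
    k1, k2, k3 = beta, gamma - alpha, alpha - 1.0  # p1 = k1 cd + k2 d^2 + k3
    # Vanishing determinant: (k1^2 + k2^2) s^2 + (2 k2 k3 - k1^2) s + k3^2 = 0
    # with s = d^2.
    qa = k1 * k1 + k2 * k2
    qb = 2.0 * k2 * k3 - k1 * k1
    qc = k3 * k3
    disc = qb * qb - 4.0 * qa * qc
    if qa == 0.0 or k1 == 0.0 or disc < 0.0:
        return []  # no real solution, or a special case this sketch skips
    solutions = []
    for sign in (1.0, -1.0):
        s = (-qb + sign * math.sqrt(disc)) / (2.0 * qa)
        if not 0.0 < s <= 1.0:
            continue
        d = math.sqrt(s)
        c = -(k2 * s + k3) / (k1 * d)  # from p1 = 0
        a = g11 * c + g12 * d
        b = g21 * c + g22 * d
        e = -(a * x1 + b * y1 + c * u1 + d * v1)  # from D_1 = 0
        solutions.append((a, b, c, d, e))
    return solutions

# Noise-free example built from an assumed ground truth (illustrative values).
truth = (math.cos(0.7), math.sin(0.7), math.cos(-1.1), math.sin(-1.1), 0.3)
corrs = []
for x, y, u in [(0.1, 0.5, -0.3), (0.9, -0.2, 0.4), (-0.6, 0.3, 0.8)]:
    a0, b0, c0, d0, e0 = truth
    corrs.append((x, y, u, -(e0 + a0 * x + b0 * y + c0 * u) / d0))
sols = three_point_orthographic(corrs)
print(sols)
```

In a RANSAC loop each returned candidate would be scored with the epipolar distance (12) on all correspondences; the sketch skips the special cases \(k_1 = 0\) and \(d = 0\), which a production solver would need to handle.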

4 Maximizing the Number of Inliers

In the presence of outliers, finding accurate correspondences is difficult, and robust methods are highly desirable. A common approach is to estimate a model that optimizes the number of inliers. Measuring reprojection errors is normally the preferred choice, as this accurately models the limited precision of feature detection techniques. Although such formulations lead to challenging optimization problems, using recent advances in robust estimation it is sometimes possible to develop tractable methods. In [5], it was shown how the number of inliers can be maximized in polynomial time, for a fixed-dimensional model, where the computational complexity follows directly as a consequence of the theory of optimization. One requirement is that the parameter space is a differentiable manifold embedded in \(\mathbb {R}^m\) with a set of equality constraints. The authors used this to produce algorithms for optimal image stitching and 2D-registration. In [24, 25], the authors used similar methods to perform large-scale image-based localization. We will here describe how these ideas can be applied to orthographic essential matrix estimation, resulting in an optimal method.

The main theorem from [5] shows that one can find the optimal solution with respect to the number of inliers by enumerating a finite set of so-called critical points, which are essentially the Karush–Kuhn–Tucker (KKT) points. These critical points divide the solution space into regions that contain different combinations of inliers and outliers, and the optimal solution with respect to the number of inliers will be found at one of the critical points. Our parameter space \((a,b,c,d,e)\) is embedded in \(\mathbb {R}^5\) with the constraints \(h_1 = a^2+b^2=1\) and \(h_2 = c^2+d^2=1\). The critical points satisfy the KKT conditions for local optimality of the optimization problems

$$\begin{aligned} \min _{(a,b,c,d,e)} \quad&f(a,b,c,d,e) \end{aligned}$$
(15)
$$\begin{aligned}&h_j(a,b,c,d) = 1,\quad j = 1,2, \end{aligned}$$
(16)
$$\begin{aligned}&D_i(a,b,c,d,e) \le \epsilon \text{ for } \text{ all } i \in C, \end{aligned}$$
(17)

where C runs over all subsets of correspondences of size \(|C| \le 3\) (the number of degrees of freedom of our problem) and \(D_i\) are the epipolar distances (12). The function f is an auxiliary goal function, which can be chosen arbitrarily, as long as the KKT points constitute a finite set. Most often a linear function will yield simple equations, and we will show that choosing the goal function

$$\begin{aligned} f(a,b,c,d,e) = a + c, \end{aligned}$$
(18)

will give us a finite set of KKT points. For an inlier bound of \(\epsilon \), the KKT points are then found by looking at points where a number of the constraints (17) are active, i.e.,

  1. \(D_i^2=\epsilon ^2, i = i_1,i_2,i_3.\)

  2. \(D_i^2=\epsilon ^2, i = i_1,i_2\) and \((\nabla f,\,\nabla h_j,\,\nabla D_i^2)\) are linearly dependent.

  3. \(D_i^2=\epsilon ^2, i = i_1\) and \((\nabla f,\, \nabla h_j,\,\nabla D_i^2)\) are linearly dependent.

  4. \(\nabla f=0\).

To find the solution, we need solvers for the first three cases above (the last case gives only a trivial solution); the optimal solution is then found by evaluating all KKT points. The time complexity of the algorithm is determined by case 1, where three constraints are active. There are \(\mathcal {O}(n^3)\) triplets of correspondences, and counting the inliers for each candidate solution takes \(\mathcal {O}(n)\) time, so the number of inliers can be maximized in \(\mathcal {O}(n^4)\) time for n tentative correspondences. The steps of the method are summarized in Algorithm 1. In Sect. 4.1, we show how we construct the main optimal three-point solver. The optimal two- and one-point solvers are discussed in Sect. 4.2.

figure b
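The enumeration described above can be outlined as follows. This is only a sketch of the outer loop under stated assumptions, not a reproduction of Algorithm 1: the `solvers` argument is a hypothetical mapping from subset size (1, 2 or 3) to a callable that returns candidate essential matrices, standing in for the solvers of Sects. 4.1 and 4.2, and the small tolerance `tol` is an assumption added so that correspondences lying exactly on the bound are counted as inliers despite rounding.

```python
from itertools import combinations

def epipolar_distance(E, corr):
    """Distance (12) for E = (a, b, c, d, e) and corr = (x, y, x', y')."""
    a, b, c, d, e = E
    x, y, xp, yp = corr
    return abs(e + a * x + b * y + c * xp + d * yp)

def count_inliers(E, corrs, eps, tol=1e-9):
    """Number of correspondences with D_i <= eps (up to a small tolerance)."""
    return sum(1 for p in corrs if epipolar_distance(E, p) <= eps + tol)

def maximize_inliers(corrs, eps, solvers):
    """Evaluate every candidate critical point and keep the best one.

    solvers: dict {subset_size: callable(subset, eps) -> list of E tuples};
    the callables are placeholders for the one-, two- and three-point solvers.
    """
    best, best_count = None, -1
    for size, solve in solvers.items():
        for subset in combinations(corrs, size):  # O(n^size) subsets
            for E in solve(subset, eps):
                n_in = count_inliers(E, corrs, eps)  # O(n) per candidate
                if n_in > best_count:
                    best, best_count = E, n_in
    return best, best_count

# Toy usage with a dummy solver that always proposes the model x - x' = 0.
def dummy(subset, eps):
    return [(1.0, 0.0, -1.0, 0.0, 0.0)]

corrs = [(0.0, 0.0, 0.0, 0.0), (1.0, 2.0, 1.0, 5.0),
         (2.0, -1.0, 2.0, 3.0), (0.5, 0.0, 3.0, 1.0)]
best, best_count = maximize_inliers(corrs, 0.1, {3: dummy, 2: dummy, 1: dummy})
print(best, best_count)
```

The triplet loop dominates the cost, which is where the \(\mathcal {O}(n^4)\) bound comes from.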

4.1 The Three-Point Solver for Inlier Maximization

Given an inlier bound \(\epsilon \) and three point correspondences, we want to find an essential matrix E such that

$$\begin{aligned} D_i^2=\epsilon ^2, \quad i = 1,2,3. \end{aligned}$$
(19)

These equations are quadratic in the unknowns \((a,b,c,d,e)\), but we can simplify them by noting that the solution must fulfill

$$\begin{aligned} D_i=\pm \epsilon , \quad i = 1,2,3. \end{aligned}$$
(20)

This gives eight sign combinations in total. We linearly eliminate e, giving for each combination a system of the form

$$\begin{aligned} D_i-D_1=w_i, \quad i = 2,3, \end{aligned}$$
(21)

where \(w_i\) can take the values \(-2\epsilon \), 0 or \(2\epsilon \). This gives a system very similar to the one we solved in Sect. 3, but slightly more complicated due to the constant terms. We will solve it by explicitly constructing a Gröbner basis [4]. We start in the same way and express \((a,b)\) in terms of \((c,d)\) using the linear constraints. Substituting into \(a^2+b^2=1\) gives an equation of the form

$$\begin{aligned} k'_1c^2+ k'_2cd + k'_3c + k'_4d^2 + k'_5d + k'_6= 0, \end{aligned}$$
(22)

where \(k'_i\) only depends on image measurements. We can eliminate the \(c^2\)-term in favor of \(d^2\) using the constraint \(c^2+d^2=1\). We then get

$$\begin{aligned} p_1 = cd + k_1 + k_2c + k_3d + k_4d^2, \end{aligned}$$
(23)

where each \(k_i\) depends on the \(k'_i\). Multiplying this equation by c and by d gives two new equations. Again we eliminate all \(c^2\)-terms. Taking a linear combination of these two polynomials gives us a polynomial with monomials \(\{cd, c, d^3, d^2, d, 1\}\). We can eliminate the \(cd\)-term by solving for cd in \(p_1=0\) and substituting. This gives us

$$\begin{aligned} p_2 = d^3 + q_1 + q_2c + q_3d +q_4d^2, \end{aligned}$$
(24)

where again \(q_i\) only depends on the image coordinates. The polynomials \(p_1\) and \(p_2\) together with \(p_0=c^2+d^2-1\) constitute a Gröbner basis for our problem. We can now construct our solver using the action matrix method [4]. We use the linear basis \(\{1\,, c\,, d\,, d^2\}\) and d as action variable to construct the action matrix A,

$$\begin{aligned} d\begin{bmatrix}1 \\c \\d \\d^2\end{bmatrix} = \begin{bmatrix}d \\cd \\d^2 \\d^3\end{bmatrix} = A \begin{bmatrix}1 \\c \\d \\d^2\end{bmatrix} , \end{aligned}$$
(25)

with

$$\begin{aligned} A = \begin{bmatrix} 0&\quad 0&\quad 1&\quad 0\\ -k_1&\quad -k_2&\quad -k_3&\quad -k_4\\ 0&\quad 0&\quad 0&\quad 1\\ -q_1&\quad -q_2&\quad -q_3&\quad -q_4 \end{bmatrix}. \end{aligned}$$
(26)
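The two nontrivial rows of A can be read off directly from the Gröbner basis elements. Setting \(p_1 = 0\) in (23) and \(p_2 = 0\) in (24) and solving for the leading monomials gives the following reductions (a worked restatement of (23) and (24), added for clarity):

```latex
\begin{aligned}
d \cdot c   &= cd  = -k_1 - k_2 c - k_3 d - k_4 d^2, \\
d \cdot d^2 &= d^3 = -q_1 - q_2 c - q_3 d - q_4 d^2.
\end{aligned}
```

These are rows two and four of A in the basis \(\{1, c, d, d^2\}\), while rows one and three simply state \(d \cdot 1 = d\) and \(d \cdot d = d^2\).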

We can now solve for c and d by finding the eigenvectors of A, giving four solutions. In total, we get \(8\cdot 4= 32\) solutions to the three-point problem.

4.2 The Two- and One-Point Solvers for Inlier Maximization

We will here give an outline of our two- and one-point solvers, i.e., the solutions to the two cases: \(D_i^2=\epsilon ^2, i = i_1,i_2\), with \((\nabla f,\,\nabla h_j,\,\nabla D_i^2)\) linearly dependent; and \(D_i^2=\epsilon ^2, i = i_1\), with \((\nabla f,\,\nabla h_j,\,\nabla D_i^2)\) linearly dependent. To get expressions that are as simple as possible, we have chosen the goal function \(f = a + c\). The linear dependence of the gradients can be expressed as the vanishing of the determinant of the corresponding stacked gradients, yielding for the two-point case an expression of the form

$$\begin{aligned} s_1ad +s_2bc + s_3bd = 0, \end{aligned}$$
(27)

where each \(s_i\) only depends on image measurements. Taking the difference of the two epipolar line distance equations eliminates e and gives a linear constraint on \((a,b,c,d)\). Together with the two embedding constraints \(a^2+b^2=c^2+d^2=1\), this gives a system of four polynomial equations in the four variables \((a,b,c,d)\). Using techniques similar to those above, this system can be solved, yielding at most eight real solutions. Since each of the two points has a distance of \(\pm \epsilon \), we get \(4 \cdot 8 = 32\) solutions in total.

For the one-point case, the point is used to express e in terms of the other unknowns. The linear dependence of the gradients in this case gives the constraints \(bd=0, ad=0, bc=0\) and \(s_1^{\prime }ad + s_2^{\prime }bc + s_3bd=0\). Together with the embedding constraints, this only gives the solutions \((a,b,c,d) = (1,0,1,0)\) and \((a,b,c,d) = (-1,0,1,0)\). In total, we get four solutions, since there are two solutions for e corresponding to \(\pm \epsilon \).

5 A Least Squares Solver

Having more than three point correspondences will lead to an overdetermined system of equations if we want to solve \(D_i = 0\). We will solve this in a least squares sense, i.e., given n point correspondences,

$$\begin{aligned} \min _{E} \sum _{i=1}^n D_i^2, \quad s.t.\,\, a^2+b^2=1, \, c^2+d^2=1. \end{aligned}$$
(28)

In fact, Harris described the same problem at the very first ECCV [7], and it was noted that the solution can be found as one of the roots of a polynomial of degree eight. However, no details on how to construct this polynomial were given. Here we go through the solution in detail and construct a solver that is both robust to noise and efficient. We solve the least squares problem by finding all the stationary points of (28), differentiating the corresponding Lagrangian function,

$$\begin{aligned}&L(a,b,c,d,e,\lambda ,\gamma ) \nonumber \\&\quad = \sum _{i=1}^n D_i^2 + \lambda (1-a^2-b^2) + \gamma ( 1-c^2-d^2), \end{aligned}$$
(29)

with

$$\begin{aligned} \nabla L = 2\begin{bmatrix} \sum _{i=1}^n x_iD_i -a\lambda \\ \sum _{i=1}^n y_iD_i -b\lambda \\ \sum _{i=1}^n x'_iD_i -c\gamma \\ \sum _{i=1}^n y'_iD_i -d\gamma \\ \sum _{i=1}^n D_i \\ 1-a^2-b^2 \\ 1-c^2-d^2 \end{bmatrix}. \end{aligned}$$
(30)

From \(\nabla L = 0\) it is clear that

$$\begin{aligned} e = -\frac{1}{n}\sum _{i=1}^n [x_i \,\,y_i\,\, x'_i\,\, y'_i]\begin{bmatrix} a \\b\\c\\d\end{bmatrix}. \end{aligned}$$
(31)

Thus e can be eliminated by removing the centroid of each of the image point sets, i.e., \(\tilde{\mathbf{u}}_i = \mathbf{u}_i - \bar{\mathbf{u}}\) and \(\tilde{\mathbf{u}}_i' = \mathbf{u}_i' - \bar{\mathbf{u}}'\). The vanishing of the gradient of L (without the normalization constraints) can then be written

$$\begin{aligned} Mv = Sv, \end{aligned}$$
(32)

with \(v = [a, b, c, d]^T\) and

$$\begin{aligned} M_{4 \times 4} = \begin{bmatrix} \tilde{\mathbf{u}} \\\tilde{\mathbf{u}}' \end{bmatrix} \begin{bmatrix} \tilde{\mathbf{u}}^T&\tilde{\mathbf{u}}'^T\end{bmatrix}, \quad S = \begin{bmatrix} \lambda&\quad 0&\quad 0&\quad 0\\ 0&\quad \lambda&\quad 0&\quad 0 \\ 0&\quad 0&\quad \gamma&\quad 0\\ 0&\quad 0&\quad 0&\quad \gamma \end{bmatrix}. \end{aligned}$$
(33)

This is almost an eigenvalue problem, but the two different multipliers \(\lambda \) and \(\gamma \) make it slightly more complicated to solve. Since (32) is homogeneous in \((a,b,c,d)\), we can scale the equations by 1/d, giving equations in \((a/d,b/d,c/d,1) = (\hat{a},\hat{b},\hat{c},1)\). We can then use three of the four equations to linearly express \((\hat{a},\hat{b},\hat{c})\) in terms of \(\lambda \) and \(\gamma \). We insert these expressions into the fourth equation of (32) and into the normalization constraint \(\hat{a}^2+\hat{b}^2 = \hat{c}^2+1\). This gives two equations of degree four in \(\lambda \) and degree two in \(\gamma \). Multiplying these two equations by \(\gamma \) gives two new equations. We can now write the four equations as

$$\begin{aligned} B_{4\times 4}(\lambda ) \begin{bmatrix} \gamma ^3 \\ \gamma ^2 \\ \gamma \\ 1\end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}. \end{aligned}$$
(34)

Since this system must have a solution, the determinant of B must be zero. Expanding this determinant in \(\lambda \) gives a polynomial of degree 12, which factors into a polynomial of degree eight and a polynomial of degree four. The degree-four factor (containing spurious solutions due to the denominator from the linear solution of (32)) can be divided out, leaving a polynomial of degree eight to solve. The remaining unknowns are found by back substitution. The least squares solution is one of the eight solutions, and we simply choose the one with the smallest error.
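The building blocks of the stationarity system can be sketched as follows. This is not the degree-eight solver itself; it only forms the centered moment matrix M of (33), evaluates the residual of (32) for given multipliers, and recovers e from the condition \(\sum _i D_i = 0\), which is enough to check a candidate solution. All function names and the test data are assumptions made for illustration.

```python
import math

def centered_moment_matrix(corrs):
    """Return (M, mean): M is the 4x4 sum of outer products of the centered
    vectors (x, y, x', y'), and mean is the centroid of those vectors."""
    n = len(corrs)
    mean = [sum(p[k] for p in corrs) / n for k in range(4)]
    M = [[0.0] * 4 for _ in range(4)]
    for p in corrs:
        q = [p[k] - mean[k] for k in range(4)]
        for i in range(4):
            for j in range(4):
                M[i][j] += q[i] * q[j]
    return M, mean

def stationarity_residual(M, v, lam, gam):
    """Residual M v - S v of (32), with S = diag(lam, lam, gam, gam)."""
    s = (lam, lam, gam, gam)
    return [sum(M[i][j] * v[j] for j in range(4)) - s[i] * v[i]
            for i in range(4)]

def offset_from_centroid(mean, v):
    """e implied by sum_i D_i = 0, with D_i = e + a x + b y + c x' + d y'."""
    return -sum(m * w for m, w in zip(mean, v))

# Noise-free data from an assumed ground truth: all D_i vanish, so the truth
# is a stationary point with lam = gam = 0.
a0, b0, c0, d0, e0 = (math.cos(0.7), math.sin(0.7),
                      math.cos(-1.1), math.sin(-1.1), 0.3)
corrs = []
for x, y, u in [(0.1, 0.5, -0.3), (0.9, -0.2, 0.4), (-0.6, 0.3, 0.8),
                (0.2, 0.7, 0.1), (-0.4, -0.9, 0.6)]:
    corrs.append((x, y, u, -(e0 + a0 * x + b0 * y + c0 * u) / d0))
M, mean = centered_moment_matrix(corrs)
res = stationarity_residual(M, (a0, b0, c0, d0), 0.0, 0.0)
e_rec = offset_from_centroid(mean, (a0, b0, c0, d0))
print(res, e_rec)
```

With noisy data, the multipliers at a stationary point are no longer zero, and the candidate \((a,b,c,d)\) values come from the roots of the degree-eight polynomial described above.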

6 Experimental Validation

We have conducted a number of experiments, on both synthetic and real data, to show the performance of our solvers. We have implemented all solvers in MATLAB, and they are publicly released at https://github.com/hamburgerlady/ortho-gem. In Table 1, the running times for our MATLAB implementations are shown. All tests were conducted on a desktop computer running Ubuntu, with an Intel Core i7 3.6 GHz processor.

Table 1 Running times of our minimal and least squares solvers. The solvers were implemented in MATLAB and run on a desktop computer with an Intel Core i7 3.6 GHz processor

6.1 Synthetic Data

In order to test the numerical stability of our minimal solvers, we did some simple tests.

Fig. 1 Evaluation of the numerical performance of our minimal solvers. The graph shows the histograms of the errors for the minimal three-point solver and the three-point solver for the optimal inlier solver, respectively

We randomly generated true orthographic epipolar geometry, with corresponding essential matrices and image data (without added noise, 10,000 instances). We then ran our solvers: the minimal three-point solver and the three-point solver used in the optimal inlier solver. For the optimal inlier solver, we also used a randomly chosen inlier bound. We then calculated the equation residuals for all solutions ((3) and (12) for the minimal three-point solver and (3) and (19) for the optimal three-point solver). The resulting histogram of the logarithm of the errors is shown in Fig. 1, where one can see that we get errors close to machine precision.
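A data generator of the kind used in this test might look as follows. This is a hedged sketch: the sampling distributions, the seed, the number of instances (1,000 rather than 10,000), and all function names are assumptions, since the text does not specify how the random geometry was drawn. It draws a random rotation from a normalized Gaussian quaternion, projects random 3D points with the cameras in (4), and reports the largest residual of (12) for the ground truth essential matrix.

```python
import math
import random

def random_rotation(rng):
    """Rotation matrix from a normalized Gaussian 4-vector (unit quaternion)."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(v * v for v in q))
    w, x, y, z = (v / norm for v in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def synthetic_instance(rng, n_points=3):
    """Return (E, corrs): normalized E from (5) and noise-free matches."""
    R = random_rotation(rng)
    t1, t2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    (r11, r12, r13), (r21, r22, r23) = R[0], R[1]
    E = (r13 * r21 - r11 * r23, r13 * r22 - r12 * r23,
         r23, -r13, r13 * t2 - r23 * t1)
    # Normalize to a^2 + b^2 = 1; the scale is near zero only for rotations
    # about the optical axis, which this sketch does not treat specially.
    scale = math.hypot(E[0], E[1])
    E = tuple(v / scale for v in E)
    corrs = []
    for _ in range(n_points):
        X = [rng.uniform(-1, 1) for _ in range(3)]
        xp = r11 * X[0] + r12 * X[1] + r13 * X[2] + t1
        yp = r21 * X[0] + r22 * X[1] + r23 * X[2] + t2
        corrs.append((X[0], X[1], xp, yp))
    return E, corrs

rng = random.Random(0)
max_residual = 0.0
for _ in range(1000):
    (a, b, c, d, e), corrs = synthetic_instance(rng)
    for x, y, xp, yp in corrs:
        max_residual = max(max_residual, abs(e + a * x + b * y + c * xp + d * yp))
print(max_residual)
```

Feeding the generated correspondences to a solver and histogramming the base-10 logarithm of its residuals would give a plot in the style of Fig. 1.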

Fig. 2 Histograms of the logarithm of error norms between the estimated essential matrix and the ground truth, using our least squares solver. We get small errors in general, with slightly larger errors for the almost minimal case of four points, which is to be expected

Fig. 3 Least squares estimates of the essential matrix using four and ten points. Top: the angle deviations from the ground truth epipolar line directions, measured in degrees (\(\theta \) in the first image, and \(\phi \) in the second). Bottom: the deviation from the ground truth epipolar line offset. Note that since we are working with normalized coordinates, noise with standard deviation equal to 0.05 corresponds to a large pixel error

Fig. 4 Result of the semisynthetic RANSAC experiment, comparing our minimal three-point solver to the calibrated five-point solver. Also included are the results of running our optimal method, which maximizes the number of inliers. Left: the average running times for the two methods as functions of the outlier ratio. Middle and right: the recall and fallout, respectively, for the three methods, as functions of the outlier ratio

Fig. 5 One example of the output of the semisynthetic RANSAC experiment, for a high outlier ratio (70%). Top: input correspondences. Middle: the five-point algorithm. Bottom: the minimal three-point solver. We have in this case few inliers, and the five-point solver finds an erroneous solution with a larger number of inliers due to the relatively large number of degrees of freedom in the fitted model. The more restricted orthographic model gives in this case a better estimate of the true epipolar geometry

In order to test our least squares solver, we again randomly generated true orthographic epipolar geometry, without noise. We then ran our least squares solver on a large number of examples (10,000), and for each solution we recorded the Frobenius norm of the difference between the estimated essential matrix and the ground truth essential matrix. The distributions of errors are shown in Fig. 2. We get slightly different behavior for different numbers of points, but overall we get results with very small errors. To test the dependence of the least squares solver on image point errors, we ran a similar experiment in which we added Gaussian noise to the image measurements before running the solver. In Fig. 3, the resulting errors in the final estimates are shown. The top row shows the mean absolute angular error between the estimated and ground truth epipolar line directions for the two images. The bottom row shows the corresponding absolute error in line offset (the last entry in the essential matrix). We show the results using 4 and 10 correspondences.

6.2 Semisynthetic Data

To test our algorithms’ behavior in the presence of outliers and real errors, we have conducted a number of tests based on real data.

Alcatraz In a first experiment, we used the Alcatraz dataset, as described in [19]. We used two images (images 1 and 39) with SIFT correspondences, including outliers. Since we had access to the ground truth, we knew which correspondences were inliers and which were outliers. We compared our minimal three-point solver to the calibrated five-point solver, using the MATLAB-mex implementation from [23], which runs in around 0.15 ms. We used 41 outliers in the correspondences and varied the number of inliers between 10 and 180. This gave us a number of sets with outlier ratios varying between 19 and 80%. The input for a large outlier ratio is shown in Fig. 5. Using a standard RANSAC loop, with the number of iterations set so that we would find an inlier set with probability 99.9%, we compared the five-point solver to our minimal three-point solver. The results are shown in Fig. 4. The left graph shows the running times as a function of the outlier ratio, on a logarithmic scale. Our method is faster due to three factors. Firstly, the minimal solver itself is faster. Secondly, since our solver gives at most two solutions, we in general get fewer real solutions that need to be evaluated on the whole point set. Thirdly, since we only need to sample three points, the likelihood of drawing an outlier-free hypothesis set is higher. We get these benefits at the cost of a more restrictive camera model, which might not capture the complete geometry. This will of course be highly scenario dependent, but as shown in Fig. 4, where the recall and fallout of the estimated inlier set are shown, we match the five-point solver well in terms of finding the true inlier set. In Sect. 6.4, we give more results on how well the orthographic model approximates the true epipolar geometry. We have also compared with our optimal method that maximizes the number of inliers. The running time of this method does not depend on the inlier ratio, only on the number of initial correspondences, but it is in general much slower than the RANSAC methods.
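The third factor can be quantified with the standard RANSAC stopping criterion. The text does not state which formula was used to set the number of iterations, so the expression below, \(N = \lceil \log (1-p) / \log (1-w^s) \rceil \) with inlier ratio w and sample size s, is an assumption; the function name is likewise hypothetical.

```python
import math

def ransac_iterations(success_prob, outlier_ratio, sample_size):
    """Iterations needed to draw at least one all-inlier sample with
    probability success_prob (standard RANSAC criterion, assumed here)."""
    w = (1.0 - outlier_ratio) ** sample_size  # P(one sample is outlier free)
    if w <= 0.0:
        raise ValueError("no inliers: an outlier-free sample is impossible")
    if w >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - w))

# Compare three-point and five-point sampling at a few outlier ratios.
for ratio in (0.2, 0.5, 0.7, 0.8):
    n3 = ransac_iterations(0.999, ratio, 3)
    n5 = ransac_iterations(0.999, ratio, 5)
    print(f"outlier ratio {ratio:.1f}: 3-point {n3}, 5-point {n5}")
```

Under this criterion the gap between the two sample sizes widens rapidly as the outlier ratio grows, which is consistent with the running-time trend described for Fig. 4.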

In the middle of Fig. 5, the resulting inlier set for the five-point solver is shown. Below this, the result of the minimal three-point solver is shown. In this case, there are very few inliers, and the more restricted orthographic camera model serves as a regularization. On a systems level, this would of course be easy to handle using the five-point solver, since the full 3D geometry probably would not fit, but we still believe that this shows the benefit of using a simpler model for establishing correspondences.

Dinosaur To test our optimal inlier maximization algorithm, as described in Sect. 4, we used the well-known dinosaur sequence. This sequence contains very few outliers, and the camera geometry has been shown to be well approximated by an affine model, see, e.g., [11].

We used two calibrated images (images 20 and 22) with a subset of 51 true correspondences. Using the full inlier set, we estimated the essential matrix using our least squares solver. This was used to define the ground truth of our experiment. We then randomly corrupted a subset of the correspondences to simulate gross outliers and ran our optimal inlier maximization. As a comparison, we also ran our minimal three-point solver using RANSAC exhaustively.

Fig. 6 Top: absolute error between the ground truth and estimated epipolar line angles as functions of outlier ratio, for the semisynthetic optimal estimation experiment. Bottom: absolute error between the ground truth and estimated offset element of the essential matrix, as function of the outlier ratio

Fig. 7 To the left, the four input images from the Heinz experiment are shown, with the initial feature point matches for all pairs of images (indicated by lines) overlaid. To help visibility, only 50 random correspondences for each pair are shown. As can be seen, there are multiple mismatches due to the repetitive scene. To the right, the initial points (red) and reprojected points (yellow) are shown, after running a RANSAC loop with our minimal three-point, respectively, optimal inlier solver and subsequent bundle adjustment (Color figure online)

The errors between the estimated essential matrices and the ground truth, for different outlier rates, are shown in Fig. 6. We show the error in angle and offset in a similar way as in Sect. 6.1. As can be seen in the figure, the errors degrade gracefully as functions of the outlier ratio.

6.3 Real Data and Repetitive Structures

One scenario where the matching of features is difficult is when the depicted scene contains repetitive structures. This is quite common and can occur due to similar textures of objects. To test the use of our minimal solver in such a setting, we conducted a small experiment. The setup of our Heinz experiment is shown in Fig. 7. We took four images of a number of cans, all with the same texture. We extracted SURF features and matched these based on their feature vectors, for all pairs of the four images. A subset of the initial matches can be seen to the left in Fig. 7. For the six pairs, we got between 330 and 530 tentative correspondences. We then ran 5,000 iterations of our minimal three-point solver to estimate the inlier set. We then used this inlier set to estimate the full projective epipolar geometry, with subsequent bundle adjustment. The original points and reprojected points are shown in the middle of the figure. The corresponding rms-value for all views was 1.14 in un-normalized coordinates. We have also run our optimal inlier method, giving a corresponding rms-value for all views of 1.58 after bundle adjustment. The reprojected points can be seen to the right of Fig. 7. The number of inliers for all pairs is given in Table 2.

Table 2 Number of initial correspondences and the number of estimated inliers for the two repetitive structures experiments

In a second experiment, facade, we used four images of a building facade. These images contain a number of repeating structures, such as windows, roof texture, dormer windows, and chimneys. We again extracted SURF features and matched these, for all pairs of the four images. The estimated inlier set using 10,000 iterations of RANSAC with our minimal three-point solver is shown in Fig. 8. We have also run the optimal inlier maximization on this dataset; the details are given in Table 2. In both these experiments, running the optimal solver is orders of magnitude slower than RANSAC, so it is not directly practical to use it for these amounts of initial correspondences. In order for it to be tractable, some form of initial pruning would be needed, but this is left for future work.
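To give a rough sense of the scales involved, the sketch below evaluates two standard quantities: the textbook number of RANSAC samples needed to draw at least one outlier-free minimal sample with a given confidence, and the number of distinct point triples among n correspondences. The inlier ratio of 0.3 and the confidence of 0.99 are hypothetical values chosen for illustration, not figures from the experiments, and the triple count is only a proxy for exhaustive search, not a model of the optimal solver's actual running time.

```python
import math


def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Samples needed so that, with probability `confidence`, at least one
    random sample of `sample_size` correspondences contains only inliers."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    p_clean = inlier_ratio ** sample_size  # chance one sample is all inliers
    if p_clean >= 1.0:
        return 1
    # log1p keeps the denominator accurate when p_clean is very small.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_clean))


def count_triples(num_correspondences):
    """Number of distinct three-point subsets of the correspondences."""
    return math.comb(num_correspondences, 3)


if __name__ == "__main__":
    w = 0.3  # hypothetical inlier ratio, not taken from the paper
    for s in (3, 5):  # three-point vs. five-point minimal sample
        print(f"sample size {s}: {ransac_iterations(w, s)} iterations")
    for n in (330, 530):  # range of tentative correspondences per pair
        print(f"n = {n}: {count_triples(n)} triples")
```

For the 330 to 530 tentative correspondences reported above, the triple count is roughly 5.9 million to 24.7 million per image pair, compared with the 5,000 to 10,000 RANSAC iterations used in these experiments.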

Fig. 8

Inlier image points (red) and reprojected points (yellow), after running a RANSAC loop with our minimal three-point solver and subsequent bundle adjustment on the facade dataset (Color figure online)

6.4 Orthographic Model Fit

To investigate how well the orthographic model approximates the true epipolar geometry, we ran the following test on the Alcatraz dataset. We used the true inlier set for all pairs of images and used our least squares solver to find the orthographic essential matrix. We then calculated the mean reprojection error (12) for all pairs. We looked at how this error varies with different parameters, such as the mean depth of all scene points, the variance of the depth of the scene points, and the baseline between the two cameras. Our conclusion for this test set was that the important parameter was the baseline between the two cameras. In Fig. 9, we show the mean error as a function of the distance between the two corresponding camera centers. The metric scale for the cameras was manually estimated from the images. The average depth of the scene points was 8.2 m for this dataset (with a standard deviation of 1.4 m).
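The bookkeeping behind such an error-versus-baseline plot can be sketched as follows. The data layout is an assumption: `camera_centers` maps a view index to a metric 3D camera center, and `mean_errors` maps an image pair to its mean reprojection error; both names, and the function name, are hypothetical.

```python
import math


def baseline_error_pairs(camera_centers, mean_errors):
    """Return (baseline, mean_error) tuples sorted by increasing baseline.

    camera_centers: dict view_id -> (x, y, z), assumed to be in metres.
    mean_errors:    dict (view_i, view_j) -> mean reprojection error of
                    the orthographic fit for that image pair.
    """
    rows = []
    for (i, j), error in mean_errors.items():
        # Baseline = Euclidean distance between the two camera centers.
        baseline = math.dist(camera_centers[i], camera_centers[j])
        rows.append((baseline, error))
    return sorted(rows)
```

Plotting the second element against the first gives a curve of the kind described above, and the same table can be re-sorted by depth statistics to check the other parameters mentioned.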

Fig. 9

Mean reprojection error as a function of the baseline between the two cameras. The essential matrix was estimated with our least squares solver from a validated set of inlier correspondences. One can see that for this image sequence we get reasonable errors for a camera baseline below 2.5 m

7 Conclusion

In this paper, we have presented methods and algorithms for estimating two-view geometry for orthographic cameras. We have shown how to estimate the corresponding essential matrix minimally (using three point correspondences), in a least squares sense, or optimally with respect to the number of inliers. These methods can be used to robustly find inlier correspondences in the presence of high degrees of outliers. They depend on an orthographic camera model, but the experiments indicate that in many cases this model is a very good approximation. Our low-dimensional solvers offer several benefits over full projective estimation. Due to the simplicity of our minimal solver, we get a faster solver that also gives fewer solutions than, e.g., the five-point calibrated solver, which leads to faster validation in a RANSAC loop. Of course, these benefits come at the cost of assuming a more restrictive camera model. This model might not capture the complete geometry and may be biased toward affine geometry. This caveat aside, we believe that the result is a very fast framework for robust two-view correspondence estimation. Even though our optimal inlier estimation is based on only three point correspondences, it is in many cases not tractable to check all combinations of three points. Future work will include investigating recent methods that can reduce the number of initial correspondences without sacrificing optimality [2, 25].