1 Introduction

Geometric computation has long been one of the major issues in computer vision. In particular, two-view geometry computation is a central building block for three-dimensional (3D) modeling and camera motion estimation; for example, self-driving builds on simultaneous localization and mapping (SLAM) and structure from motion (SfM), both of which rely on two-view geometry. Among the core algorithms, the eight-point algorithm [1] computes the fundamental matrix from a set of eight or more point correspondences between two views and has the advantage of simplicity of implementation. However, it is extremely susceptible to image noise and hence was of very limited practical use until Hartley devised the normalized eight-point algorithm in his seminal work [2], which shows that by preceding the algorithm with a data normalization (translation and scaling) of the coordinates of the correspondences, the results obtained are comparable to those of the best iterative algorithms. As a consequence, with its simple strategy of translation and scaling, this isotropic normalization, now termed Hartley's normalization, has gradually become an indispensable component of many geometric computations, not only for fundamental matrix estimation [3] but also for homography estimation [4], ellipse fitting [5], bundle adjustment [6], etc.

One particular benefit of Hartley's normalization for the direct linear transformation (DLT) formulation of the fundamental matrix computation is that it gives the DLT system a better condition number. Therefore, when the solution matrix is enforced to have rank 2, a much more stable estimate of the fundamental matrix is obtained; this matters because the fundamental matrix is the starting point of all subsequent structure and motion computations, such as guided correspondence search, camera and structure optimization, and 3D reconstruction from more than two views. Consequently, satisfying the rank-2 constraint as closely as possible already at the DLT stage is an interesting topic of study. For example, Mühlich and Mester [7] performed a statistical analysis to obtain an optimal data normalization for DLT fundamental matrix computation and showed that Hartley's normalization can be expected to work well, although it is not identical to the optimal transformation. Mair et al. [8] performed a further error analysis to obtain better performance than Hartley's eight-point algorithm. The work of da Silveira and Jung [9] presented a perturbation analysis of the eight-point algorithm for wide field-of-view cameras. In contrast to these works based on statistical analysis, this paper seeks to determine the mechanism of data normalization through deep learning, without specific statistical modeling. Considering that fundamental matrix estimation is strongly affected by the error distribution of the feature matching algorithm, we argue that a data normalization scheme can be learned from the data themselves to achieve DLT solutions with an improved rank-2 condition; this view coincides with those of Refs. [7, 9, 10]. In particular, as displayed in Fig. 1, we propose to learn a data-driven normalization scheme under the standard configuration of eight correspondences.

Figure 1

Distributions of normalized image coordinates by using Hartley's normalization algorithm (upper right) and our learning-based normalization approach (bottom right), respectively. Eight pairs of point correspondences are obtained from the two street images on the left. Note that in the right figure, the coordinate axes represent the normalized image coordinates in the horizontal and vertical directions, and the “error” refers to the symmetric epipolar distance, which better characterizes the estimation accuracy of the two-view geometry. Our approach learns a robust normalization scheme adapted to the input data, obtains a better distribution spread of the normalized point coordinates, and eventually leads to improved performance in the computation of the fundamental matrix

Recently, the success of deep learning in high-level vision tasks has gradually been extended to multi-view geometry problems such as homography [11], fundamental matrix [12], bundle adjustment [13], plane sweeping [14, 15], and rolling-shutter modeling [16, 17]. However, this success has not yet reached the normalized eight-point algorithm: no deep learning pipeline has so far produced a different or better normalization scheme. This is mainly due to the following obstacles: (1) gradient descent cannot be trivially applied, as mentioned in Ref. [12]; (2) the network must be invariant to the permutation of the correspondences, i.e., different orderings of the input data should produce the same normalization; and (3) supervised learning would require a large amount of labeled data (in this case, the input is eight point correspondences and the output is the optimal data normalization). In this paper, we overcome these problems by back-propagating through a singular value decomposition (SVD) layer and by using a self-supervised learning mechanism within a permutation-invariant network architecture; this also removes the need for large amounts of labeled training data. Our approach not only produces an interpretable pipeline for fundamental matrix estimation but can also be easily embedded in other robust frameworks such as differentiable random sample consensus (RANSAC) [18]. In our experiments, our learning-based normalization demonstrates superior performance to Hartley's normalization and good generalization across different datasets. Our main contributions can be summarized as follows.

(1) We propose a self-supervised learning-based deep solution for normalizing DLT fundamental matrix estimation under the standard configuration of eight point correspondences.

(2) We make a theoretical contribution by demonstrating the existence of different and better normalization algorithms beyond Hartley's normalization.

(3) Extensive experiments on both synthetic and real images demonstrate the effectiveness and good generalizability of our proposed approach.

2 Related work

In this section, we briefly review related work in traditional two-view geometry computation and deep learning-based multi-view geometry learning.

2.1 Two-view geometry estimation

The normalized eight-point algorithm [2] significantly improves the numerical accuracy of the fundamental matrix and extends its scope of application, owing to the improved condition number produced by the hand-designed normalization scheme. Since this seminal work, there have been various follow-up studies on the uncertainty in fundamental matrix estimation and on the relationships between the epipolar constraint and correspondence errors. Csurka et al. [19] proposed a method to simultaneously estimate the fundamental matrix and its uncertainty. Mühlich and Mester [10] concluded that the normalization strategy ensures that the non-iterative two-view motion estimation algorithm remains unbiased and consistent. They further introduced a normalization transformation scheme based on a bound on the epipolar constraint errors obtained by assuming a known feature matching covariance, which was also used to extend the first-order error propagation analysis of the eight-point algorithm in Ref. [8]. However, this approach was still not optimal because the error distribution of the input data was not considered [7]. A closed-form computation of the uncertainty of the fundamental matrix was presented in Ref. [20] to recover correspondences via the uncertain equilibrium of motion estimation. Chojnacki and Brooks [21] revisited the normalized eight-point algorithm and presented a statistical model of the data distribution by merging in the statistical approach of Ref. [10]; this was further extended in Ref. [22] by introducing a structured model of the data distribution. In addition, da Silveira and Jung [9] performed a perturbation analysis of fundamental matrix estimation without assuming any particular matching error distribution.

2.2 Deep learning-based geometry estimation

Recently, the success of deep learning in high-level vision tasks has been gradually extended to various multi-view geometry estimation problems. DeTone et al. [11] employed a deep convolutional neural network (CNN) to regress a homography from a pair of input images in an end-to-end manner. A follow-up study [23] developed the unsupervised variant by replacing direct supervision with image-based loss. This pipeline has been extended to fundamental matrix estimation, where a fundamental matrix is directly regressed from a pair of stereo images without correspondences [24]. Ranftl and Koltun [12] treated the fundamental matrix estimation problem as a weighted homogeneous least-squares problem, where the matching weights and fundamental matrix are simultaneously estimated by using supervised deep networks. With the availability of camera intrinsics, Yi et al. [25] recovered the essential matrix from putative correspondences with little training data and limited supervision, thus finding good correspondences for wide-baseline stereo. Furthermore, Probst et al. [26] proposed an unsupervised learning framework for consensus maximization in the context of solving 3D vision problems such as 3D-3D matching [27, 28] and image-to-image matching (homography and fundamental matrix). DSAC [18] is a differentiable counterpart of RANSAC and can also be leveraged as a robust optimization component for other deep learning pipelines.

Different from existing work in deep learning-based multi-view geometry computation, our self-supervised learning strategy removes the need for supervisory signals and thus generalizes well across different datasets. Furthermore, our learning-based normalization module can be integrated with both traditional and deep learning frameworks.

3 A revisit of the normalized eight-point algorithm

We use capital letters, A, B, etc., to denote matrices. The operation of reshaping a matrix into a vector is denoted by \(\mathrm{vec}(\cdot )\), defined as \(\mathrm{vec}(\boldsymbol{A}) = [\boldsymbol{a}_{1}^{T},\ldots, \boldsymbol{a}_{N}^{T}]^{T}\), where \(\boldsymbol{a}_{i}\) is the i-th column vector of A and N is the number of columns. Its inverse operation is denoted as \(\mathrm{mat}(\cdot )\).

Given a pair of correspondences \(\boldsymbol{u}'_{i}\) and \(\boldsymbol{u}_{i}\) between two views, the epipolar constraint is expressed as

$$\begin{aligned} {\boldsymbol{u}'_{i}{^{T}}} \boldsymbol{F} \boldsymbol{u}_{i} = 0, \end{aligned}$$
(1)

where \(\boldsymbol{F}= [f_{ij}]\) is a \(3 \times 3\) matrix of rank 2, termed the fundamental matrix. Collecting \(N=8\) point correspondences \(\{ (\boldsymbol{u}'_{i}, \boldsymbol{u}_{i}) | i=1,\ldots,8\}\), i.e., the standard configuration, we may rewrite Eq. (1) as a homogeneous linear system in \(\boldsymbol{f}\):

$$\begin{aligned} \boldsymbol{A} \boldsymbol{f} = \mathbf{0}, \end{aligned}$$
(2)

where \(\boldsymbol{f}= \mathrm{vec}(\boldsymbol{F}^{T})\) is a nine-dimensional vector composed of stacked columns of \(\boldsymbol{F}^{T}\), and \(\boldsymbol{A}= [\boldsymbol{a}_{1}, \ldots, \boldsymbol{a}_{8}]^{T}\) is the \(8\times 9\) coefficient matrix with \(\boldsymbol{a}_{i} = \mathrm{vec}(\boldsymbol{u}_{i} { \boldsymbol{u}'}_{i}^{T})\) for \(i=1,\ldots,8\). This approach provides the DLT formulation for computing F, and a solution may be obtained through SVD of A.
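For concreteness, the DLT step can be written in a few lines of NumPy. The following is a minimal sketch under the \(\mathrm{vec}(\cdot )\) convention above (the helper name is ours); the normalization discussed next is deliberately omitted:

```python
import numpy as np

def dlt_fundamental(u, u_prime):
    """Unnormalized DLT of the eight-point algorithm (a minimal sketch).

    u, u_prime: (N, 3) arrays of homogeneous points with N >= 8, satisfying
    u_prime[i]^T F u[i] = 0 for the sought fundamental matrix F.
    """
    # Row i is a_i: raveling u'_i u_i^T row-major realizes the paper's
    # a_i = vec(u_i u'_i^T) under its column-stacking convention, so that
    # A @ F.ravel() collects the epipolar constraints u'_i^T F u_i.
    A = np.stack([np.outer(u_prime[i], u[i]).ravel() for i in range(len(u))])
    # The solution is the right singular vector of A associated with the
    # smallest singular value (the exact null vector when N = 8).
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)
```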

Despite its simplicity, the computation of the DLT for the eight-point algorithm [1] is extremely susceptible to noise in the image coordinate measurements. In the seminal work [2], Hartley showed that the precision of the eight-point algorithm can be greatly improved by proper normalization of the image coordinates; this approach is the classic normalized eight-point algorithm. Hartley’s normalization is designed to compute image translation and scaling such that the average distance of the transformed coordinates from the origin is \(\sqrt {2}\):

$$\begin{aligned} \boldsymbol{T}_{\mathrm{H}} = \begin{bmatrix} s & 0 & -so_{1} \\ 0 & s & -so_{2} \\ 0 & 0 & 1 \end{bmatrix}, \end{aligned}$$
(3)

with s, \(o_{1}\) and \(o_{2}\) given by

$$\begin{aligned} o_{j} = \frac{1}{N}\sum_{i = 1}^{N} \boldsymbol{u}_{i}^{(j)} \quad\text{and}\quad s = \frac{\sqrt{2}}{\frac{1}{N}\sum_{i = 1}^{N} \Vert \boldsymbol{u}_{i} - \boldsymbol{o} \Vert _{2}}, \end{aligned}$$
(4)

where the superscript j denotes the j-th entry of vector \(\boldsymbol{u}_{i} \). Given two normalization matrices \(\boldsymbol{T}'\) and T, Eq. (2) is transformed to

$$\begin{aligned} \hat{{\boldsymbol{A}}} \hat{\boldsymbol{f}} = \mathbf{0}, \end{aligned}$$
(5)

where \(\hat{{\boldsymbol{A}}} = [\hat{{\boldsymbol{a}}}_{1},\ldots, \hat{{\boldsymbol{a}}}_{8} ]^{T} \) is the transformed coefficient matrix with \(\hat{{\boldsymbol{a}}}_{i} = {\mathrm{{vec}}}( \hat{\boldsymbol{u}}_{i}\hat{\boldsymbol{u}}'_{i}{^{T}}) = { \mathrm{{vec}}}( \boldsymbol{T}\boldsymbol{u}_{i} \boldsymbol{u}'_{i}{^{T}} \boldsymbol{T}^{\prime T} )\). In summary, the normalized eight-point algorithm mainly includes the following three steps.

(1) Normalization: Transform the input image coordinates according to \(\hat{\boldsymbol{u}}'_{i} = \boldsymbol{T}' \boldsymbol{u}'_{i}\) and \(\hat{\boldsymbol{u}}_{i} = \boldsymbol{T} \boldsymbol{u}_{i}\).

(2) Compute the fundamental matrix \(\hat{{\boldsymbol{F}}}'\) corresponding to the normalized data by:

(a) Direct linear transform: Determine \(\hat{{\boldsymbol{F}}} = {\mathrm{mat}}(\hat{\boldsymbol{f}})\) from the right singular vector \(\hat{\boldsymbol{f}}\) corresponding to the smallest singular value of \(\hat{\boldsymbol{A}}\) defined in Eq. (5).

(b) Singularity constraint enforcement: Replace \(\hat{{\boldsymbol{F}}}\) by \(\hat{{\boldsymbol{F}}}' = \hat{{\boldsymbol{U}}}{\mathrm{diag}}(r_{1},r_{2},0)\hat{{\boldsymbol{V}}}^{T}\), where \(\hat{{\boldsymbol{F}}} = \hat{{\boldsymbol{U}}}\hat{{\boldsymbol{D}}}\hat{{\boldsymbol{V}}}^{T}\) with \(\hat{\boldsymbol{D}} = {\mathrm{diag}}(r_{1},r_{2},r_{3})\) and \(r_{1} \ge r_{2} \ge r_{3}\).

(3) Denormalization: Set \({{\boldsymbol{F}}} = {{\boldsymbol{T}}}^{\prime T} \hat{{\boldsymbol{F}}}' {{\boldsymbol{T}}}\).
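A compact sketch of the three steps, reusing dlt_fundamental from above; the helper names are ours, but each line maps to one step of the algorithm just listed:

```python
def hartley_normalization(u):
    """Hartley's similarity T_H of Eqs. (3) and (4): move the centroid to
    the origin and scale the mean distance from the origin to sqrt(2)."""
    pts = u[:, :2] / u[:, 2:]                      # inhomogeneous coordinates
    o = pts.mean(axis=0)                           # centroid (o_1, o_2)
    s = np.sqrt(2) / np.linalg.norm(pts - o, axis=1).mean()
    return np.array([[s, 0.0, -s * o[0]],
                     [0.0, s, -s * o[1]],
                     [0.0, 0.0, 1.0]])

def normalized_eight_point(u, u_prime):
    """Steps (1)-(3) of the normalized eight-point algorithm."""
    T, Tp = hartley_normalization(u), hartley_normalization(u_prime)
    u_hat, up_hat = u @ T.T, u_prime @ Tp.T        # (1) normalization
    F_hat = dlt_fundamental(u_hat, up_hat)         # (2a) DLT
    U, r, Vt = np.linalg.svd(F_hat)                # (2b) rank-2 enforcement
    F_hat = U @ np.diag([r[0], r[1], 0.0]) @ Vt
    return Tp.T @ F_hat @ T                        # (3) denormalization
```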

The condition number of A is defined as \({\kappa }(\boldsymbol{A}) = \Vert \boldsymbol{A} \Vert _{2} \Vert \boldsymbol{A}^{+} \Vert _{2}\), where \(\boldsymbol{A}^{+}\) is the pseudo-inverse of A. Equivalently, it may be defined as the ratio of the greatest to the second-smallest singular value, \({\kappa }(\boldsymbol{A}) = \sqrt {d_{1}/d_{8}}\), where \(\boldsymbol{A}^{T}\boldsymbol{A} = \boldsymbol{U} \mathrm{diag}(d_{1},d_{2},\ldots,d_{8},d_{9})\boldsymbol{U}^{T}\). It has been reported in the literature [2, 9, 21, 22] that the unsatisfactory performance of the eight-point algorithm is mainly due to the poor numerical conditioning of the coefficient matrix A. In fact, the condition number \(\kappa (\boldsymbol{A})\) is extremely large, so the two smallest eigenvalues are relatively close to one another and their corresponding eigenvectors become mixed up and indistinguishable. As a result, a negligible perturbation of the matrix entries tends to cause a significant change in the smallest eigenvector, since it may fall anywhere in the proximity of the eigensubspace spanned by the eigenvectors associated with these nearly degenerate eigenvalues [21]. It has been found that a proper normalization of the input image coordinates results in better numerical conditioning of the linear DLT computation, and that the improved conditioning makes the smallest eigenvector of \(\hat{\boldsymbol{A}}\) far less susceptible to perturbation [2, 22]. From this point, a natural question arises: Can we achieve the ultimate optimal condition number \(\kappa (\hat{\boldsymbol{A}})=1\)? Below we show that the condition number of the transformed coefficient matrix cannot reach the optimum of 1. A follow-up question is then: Can we find a better normalization transformation? This paper provides a positive answer in the next section: we develop a self-supervised CNN-based technique that learns the network weights from a geometric loss function; it requires no ground truth labeling and shows highly improved performance in various experiments.
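For reference, this conditioning measure follows directly from the singular values of the coefficient matrix; a small sketch consistent with the definition above:

```python
def condition_number(A):
    """kappa(A) = sqrt(d_1 / d_8): the ratio of the largest to the
    second-smallest singular value of the 8 x 9 coefficient matrix
    (the nine eigenvalues of A^T A are the eight squared singular
    values plus one exact zero)."""
    sv = np.linalg.svd(A, compute_uv=False)        # descending order
    return sv[0] / sv[7]
```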

Proposition 1

There is no pair of normalization matrices \(\boldsymbol{T}'\) and T that results in \(k(\hat{\boldsymbol{A}}) = 1\).

Proof

(Proof by contradiction) For the matrix A of full row rank, there must exist an invertible matrix \(\boldsymbol{P} = [p_{ij}]\) such that \(\boldsymbol{A}=\boldsymbol{P}^{-1}\boldsymbol{Q}\) holds, where the matrix \(\boldsymbol{Q} \in \mathbb{R}^{8\times 9}\) also has full row rank [29]. Moreover, one can assume that the rows of Q are orthonormal vectors in the 9-dimensional space, which is easily achieved by matrix decomposition [29], such as Gram–Schmidt orthogonalization, QR decomposition, or SVD.

The condition number \({\kappa (\hat{\boldsymbol{A}}) =1}\) if and only if \(\hat{\boldsymbol{A}}\hat{\boldsymbol{A}}^{T} = c \boldsymbol{I}\), where c is a positive constant [29, 30]; this implies that the rows of \(\hat{\boldsymbol{A}}\) form eight orthogonal vectors in the 9-dimensional space, all of the same length \(\sqrt{c}\). Therefore, to achieve \(\kappa (\hat{\boldsymbol{A}}) = 1\), the two invertible transformations \(\boldsymbol{T}'\) and T would have to make \(\hat{\boldsymbol{A}}= \boldsymbol{Q} = \boldsymbol{P} \boldsymbol{A}\) hold, i.e.,

$$\begin{aligned} \mathrm{mat}\bigl(\hat{\boldsymbol{a}}_{i}^{T}\bigr) = \sum_{j = 1}^{8} p_{ij}\,\mathrm{mat}\bigl(\boldsymbol{a}_{j}^{T}\bigr),\quad i=1,\ldots,8. \end{aligned}$$
(6)

Note that \({\mathrm{{rank}}}({\mathrm{{mat}}}(\hat{\boldsymbol{a}}_{i}^{T})) = {\mathrm{{rank}}}({ \mathrm{{mat}}}(\boldsymbol{a}_{i}^{T})) = {\mathrm{{rank}}}(\hat{\boldsymbol{u}}_{i} \hat{\boldsymbol{u}}'_{i}{^{T}}) = {\mathrm{{rank}}}(\boldsymbol{u}_{i} \boldsymbol{u}'_{i}{^{T}}) = 1 \) for any \(i\in [1,8] \). Except for the trivial configuration \({\boldsymbol{P}} = \boldsymbol{I}\), the sum of rank-1 matrices on the right-hand side must have rank 3 for any \(\boldsymbol{T}'\) and T (e.g., of the form given in Eq. (3)), whereas the left-hand side has rank 1; hence Eq. (6) cannot hold. That is, there are no normalization matrices \(\boldsymbol{T}'\) and T that make \({\kappa (\hat{\boldsymbol{A}}) =1}\) tenable. □

4 Learning-based normalization with self-supervised CNNs

This section develops a machine learning model that produces T and \(\boldsymbol{T}'\), the two data normalization matrices, resulting in a better estimate of F than Hartley's normalization for eight input correspondences. As discussed in Sect. 3, the estimation of the fundamental matrix has two main steps. First, the input image coordinates are normalized by T and \(\boldsymbol{T}'\) to construct the data matrix \(\hat{\boldsymbol{A}}\), and the solution \(\hat{\boldsymbol{f}}\) is obtained. Second, \(\hat{\boldsymbol{F}}\) is reconstructed by enforcing the singularity constraint. Two observations can be made regarding this estimation process:

(1) The goal of Hartley's normalization is to achieve a better computation of \(\hat{\boldsymbol{f}}\). However, this does not guarantee the singularity condition \(\mathrm{det}(\mathrm{mat}(\hat{\boldsymbol{f}})) = 0\), which is why the singularity constraint enforcement (SCE) is necessary.

(2) There are cases where enforcing the singularity (\(\hat{\boldsymbol{F}} = \hat{\boldsymbol{U}}\mathrm{diag}(r_{1},r_{2},0) \hat{\boldsymbol{V}}^{T}\)) brings about large nonlinear projection errors and leads to an unsatisfactory estimate of \(\hat{\boldsymbol{F}}\). This happens especially when \(\rho =r_{2}/r_{3}\) is not large enough.

It is evident that the singularity constraint should be considered alongside the numerical conditioning when the normalization matrices T and \(\boldsymbol{T}'\) are designed, which implies the existence of better normalization schemes.
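The following small sketch (our own illustration, not part of the algorithm) makes observation (2) concrete by measuring ρ and the relative perturbation introduced by the SCE step:

```python
def sce_gap(F_hat):
    """Quantify observation (2): when rho = r_2 / r_3 is small, zeroing r_3
    in the SCE step perturbs F_hat strongly (the Frobenius gap equals r_3)."""
    U, r, Vt = np.linalg.svd(F_hat)
    rho = r[1] / r[2]
    F_rank2 = U @ np.diag([r[0], r[1], 0.0]) @ Vt
    rel_gap = np.linalg.norm(F_hat - F_rank2) / np.linalg.norm(F_hat)
    return rho, rel_gap
```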

Our approach adopts a CNN-based model trained with a self-supervised learning algorithm. Given eight correspondences as input, the model outputs the parameters of the normalization matrices. Following the conjecture of the affine structure of the normalization matrix proposed in Ref. [10], the normalization matrix is designed here to have two more parameters than Hartley's normalization:

$$\begin{aligned} \boldsymbol{T}_{L} = \begin{bmatrix} \alpha _{1} & 0 & 0 \\ 0 & \alpha _{2} & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos \theta & -\sin \theta & 0 \\ \sin \theta & \cos \theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & -o_{1} \\ 0 & 1 & -o_{2} \\ 0 & 0 & 1 \end{bmatrix}, \end{aligned}$$
(7)

which can characterize the data distribution better and enables more general normalization schemes to be implemented by CNNs. Nevertheless, robustly determining the normalization parameters (especially \(\alpha _{1}\), \(\alpha _{2}\), and θ) has always been a difficult problem; notably, after Hartley's seminal solution [2], there has been no substantial progress in designing hand-crafted normalization strategies. In contrast, we extend Hartley's normalization and develop a deep solution for normalization. The performance of the CNN model for this parametrization is evaluated and visualized through various experiments in Sect. 5. The overall computation pipeline of our framework is illustrated in Fig. 2.
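To make the parametrization concrete, Eq. (7) can be assembled as follows. This is a minimal sketch in which \(\alpha _{1}\), \(\alpha _{2}\), and θ are the network outputs, while \((o_{1}, o_{2})\) is assumed to be the centroid of the points, consistent with the centralization mentioned in Sect. 5:

```python
def learned_normalization_matrix(alpha1, alpha2, theta, o1, o2):
    """Assemble T_L of Eq. (7): anisotropic scaling x rotation x translation."""
    S = np.diag([alpha1, alpha2, 1.0])
    R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
    C = np.array([[1.0, 0.0, -o1],
                  [0.0, 1.0, -o2],
                  [0.0, 0.0, 1.0]])
    return S @ R @ C
```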

Figure 2

Overall framework comparison of Hartley's eight-point algorithm and our learning-based eight-point algorithm, both of which take eight points as input. Our approach provides an interpretable pipeline that predicts the parameters of each normalization matrix (\(\alpha _{1}\), \(\alpha _{2}\), and θ in particular), which is also beneficial for a more accurate estimation of the intrinsic epipolar geometry. DLT refers to the direct linear transformation and SCE refers to the singularity constraint enforcement

4.1 Self-supervised learning for normalization

Network architecture. The overall network architecture is illustrated in Fig. 3. We adopt a stack of 12 consecutive ResNet blocks as the first stage of the network, consistent with classic two-view geometry estimation networks [12, 25]. The eight input points \(\boldsymbol{u}'\) or u are first processed by multi-layer perceptrons of 128 neurons whose weights are shared [25] between correspondences. The 128-dimensional features of each correspondence are then passed through the 12 ResNet blocks [25, 31]. Global information is integrated by the weight-sharing operations between correspondences, followed by instance normalization [32] after each layer. Max-pooling and instance normalization are applied to the input of the first ResNet block and to the output of each of the 12 ResNet blocks, extracting 13 global features of dimension \(1\times 128\); this keeps the CNN permutation invariant and fixes the size of the global feature maps. The 13 feature maps are then concatenated and fed to a two-dimensional (2D) convolutional layer with eight channels, \(3\times 3\) square kernels, and unequal strides (four along columns and one along rows). The output of the 2D convolution is passed through two fully-connected layers, each of dimension 256 and followed by ReLU. Finally, the three parameters corresponding to \(\boldsymbol{u}'\) or u are regressed. Note that our network supports more than eight input correspondences; this flexibility is mainly due to the max-pooling and instance normalization design and is valuable in practice.
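The sketch below captures the permutation-invariant skeleton of this architecture in PyTorch. It is a condensed approximation rather than our exact implementation: the layer widths, the 12 blocks, and the 13 max-pooled global features follow the description above, while the 2D convolutional stage is folded into the first fully-connected layer for brevity:

```python
import torch
import torch.nn as nn

class ResNetBlock(nn.Module):
    """Point-wise residual block with instance normalization; per-point
    weights are shared, so the block is permutation equivariant."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel_size=1)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size=1)
        self.norm1 = nn.InstanceNorm1d(dim)
        self.norm2 = nn.InstanceNorm1d(dim)

    def forward(self, x):                          # x: (B, 128, N)
        y = torch.relu(self.norm1(self.conv1(x)))
        y = self.norm2(self.conv2(y))
        return torch.relu(x + y)

class NormalizationNet(nn.Module):
    """Regress (alpha1, alpha2, theta) of Eq. (7) from N >= 8 points."""
    def __init__(self, depth=12):
        super().__init__()
        self.embed = nn.Conv1d(2, 128, kernel_size=1)   # shared per-point MLP
        self.blocks = nn.ModuleList([ResNetBlock() for _ in range(depth)])
        self.head = nn.Sequential(nn.Linear(13 * 128, 256), nn.ReLU(),
                                  nn.Linear(256, 256), nn.ReLU(),
                                  nn.Linear(256, 3))

    def forward(self, pts):                        # pts: (B, N, 2)
        x = self.embed(pts.transpose(1, 2))
        feats = [x.max(dim=2).values]              # max-pool: order invariant
        for blk in self.blocks:
            x = blk(x)
            feats.append(x.max(dim=2).values)      # 13 global 128-d features
        return self.head(torch.cat(feats, dim=1))  # (B, 3)
```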

Figure 3

Overview of our network architecture, corresponding to the CNN layer in Fig. 2. Our approach estimates the parameters of the normalization matrix (\(\alpha _{1}\), \(\alpha _{2}\), and θ in particular). 2D convolutional layer refers to two-dimensional convolutional layer

Our network is inspired by 3DRegNet [33] but has significant differences in architecture design: we utilize weight sharing across point correspondences, an instance normalization module for better performance, and fewer parameters in the 2D convolution. Specifically, compared to the representative two-view geometry estimation methods [12, 25], our network is invariant to the permutation of the correspondences.

Self-supervised learning. To train our model through self-supervised learning, the outputs of the CNN model are used to construct the normalization matrices T and \(\boldsymbol{T}'\), which are fed into the next module performing (1) the data normalization, (2) the DLT to compute \(\hat{\boldsymbol{f}}\), and (3) the SVD to compute the singularity-constrained \(\hat{\boldsymbol{F}}\). Finally, the output F is evaluated with a loss function chosen to be the symmetric epipolar distance [34]:

$$\begin{aligned} \mathcal{L}\bigl(\boldsymbol{F}; \boldsymbol{u}_{i}, \boldsymbol{u}'_{i}\bigr) = \bigl\vert \boldsymbol{u}'_{i}{}^{T} \boldsymbol{F} \boldsymbol{u}_{i} \bigr\vert \biggl( \frac{1}{ \Vert (\boldsymbol{F}^{T} \boldsymbol{u}'_{i})^{(1:2)} \Vert _{2}} + \frac{1}{ \Vert (\boldsymbol{F} \boldsymbol{u}_{i})^{(1:2)} \Vert _{2}} \biggr). \end{aligned}$$
(8)

We tested several variants of distance functions, including the Sampson distance and the algebraic distance, and chose the symmetric epipolar distance because it showed superior results in our experiments. Interestingly, these findings contrast with those of Ref. [34].
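The forward pass of the self-supervised pipeline can be sketched as follows. Every step, including the two spectral decompositions, is differentiable in PyTorch, which is what enables back-propagation through the SVD layer mentioned in Sect. 1. Here, build_T is a hypothetical helper assembling Eq. (7) from the network output, and the DLT is solved via an eigendecomposition of \(\hat{\boldsymbol{A}}^{T}\hat{\boldsymbol{A}}\), an implementation choice of this sketch rather than a statement about our exact code:

```python
def symmetric_epipolar_distance(F, u, u_prime):
    """Loss of Eq. (8); u, u_prime: (N, 3) homogeneous correspondences."""
    Fu = u @ F.T                                   # rows are (F u_i)^T
    Ftup = u_prime @ F                             # rows are (F^T u'_i)^T
    algebraic = (u_prime * Fu).sum(dim=1).abs()    # |u'_i^T F u_i|
    return (algebraic * (1.0 / Ftup[:, :2].norm(dim=1)
                         + 1.0 / Fu[:, :2].norm(dim=1))).mean()

def forward_pass(net, u, u_prime):
    """Predict T and T', normalize, run DLT and SCE, denormalize, score."""
    # 'build_T' is a hypothetical torch analogue of Eq. (7) assembly.
    T = build_T(net(u[:, :2].unsqueeze(0)).squeeze(0), u)
    Tp = build_T(net(u_prime[:, :2].unsqueeze(0)).squeeze(0), u_prime)
    u_hat, up_hat = u @ T.T, u_prime @ Tp.T
    A = torch.stack([torch.outer(up_hat[i], u_hat[i]).ravel()
                     for i in range(len(u_hat))])
    # DLT: the null direction of A is the eigenvector of A^T A with the
    # smallest eigenvalue (differentiable, unlike a full-matrices SVD).
    _, evecs = torch.linalg.eigh(A.T @ A)          # eigenvalues ascending
    F_hat = evecs[:, 0].reshape(3, 3)
    U, s, Vh = torch.linalg.svd(F_hat)             # SCE: zero the last s
    F_hat = U @ torch.diag(torch.stack([s[0], s[1], s.new_zeros(())])) @ Vh
    F = Tp.T @ F_hat @ T                           # denormalization
    return symmetric_epipolar_distance(F, u, u_prime)
```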

By minimizing this loss, we can train the network without any ground truth data at all, contrary to Refs. [33] and [12]; the network is self-supervised in the geometric sense. This also enables us to exploit a very large number of frames from video sequence datasets under various kinds of camera motion.

Addressing the ordering invariance. Our network model is designed to be invariant to the order of the input image points, similar to Refs. [35] and [33], thereby making the subsequent fundamental matrix computation order invariant as well.

Proposition 2

As long as the computation of the normalization matrices \(\boldsymbol{T}'\) and T is permutation invariant, so is the computation of the fundamental matrix.

Proof

Because \(\boldsymbol{T}'\) and T remain invariant for any ordering of the input data \(\boldsymbol{u}'\) and u, the normalized points \(\hat{\boldsymbol{u}}'\) and \(\hat{\boldsymbol{u}}\) keep the same order as \(\boldsymbol{u}'\) and u; changing the order of \(\boldsymbol{u}'\) and u is therefore equivalent to applying a row permutation to the transformed coefficient matrix \(\hat{{\boldsymbol{A}}}\) in Eq. (5). However, a row permutation of \(\hat{{\boldsymbol{A}}}\) does not change the right singular vector corresponding to its smallest singular value [29], i.e., the estimate of \(\hat{{\boldsymbol{F}}}\) is not affected. Consequently, the final fundamental matrix F also has permutation invariance. □

Training procedure. The network is implemented in PyTorch. We adopt the Adamax optimizer [36] with an initial learning rate of \(10^{-3}\), decayed by a factor of 0.8 every 10 epochs. The batch size is 16 and the network is trained for 150 epochs. Each input set is pre-filtered by the residual of the original eight-point algorithm with a threshold (60 pixels) sufficiently large to enhance the stability of the training process.
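In code, the procedure reduces to a standard PyTorch recipe; here loader is an assumed data loader yielding batches of pre-filtered eight-point samples, and NormalizationNet and forward_pass refer to the sketches above:

```python
model = NormalizationNet()
optimizer = torch.optim.Adamax(model.parameters(), lr=1e-3)
# Decay the learning rate by a factor of 0.8 every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.8)

for epoch in range(150):
    for batch in loader:                           # batches of 16 samples
        # Samples whose original eight-point residual exceeds 60 pixels
        # are assumed to have been filtered out beforehand.
        loss = torch.stack([forward_pass(model, u, up)
                            for u, up in batch]).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```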

5 Experimental results

To prove that our approach can learn normalization matrices adapted to the input data and obtain more accurate fundamental matrix estimations, we benchmark the performance of our approach on three typical datasets with varying regularity. Furthermore, we perform cross-dataset validation to prove the generalizability of our approach.

5.1 Datasets

KITTI dataset. The KITTI odometry dataset [37] consists of 22 distinct sequences from a car driving around a residential area. This dataset exhibits dominant forward motion with high regularity but shows difficult data associations. We choose the first 11 sequences with ground truth from GPS and a Velodyne LiDAR. Specifically, we employ sequences “00” to “05” for training and sequences “06” to “10” for testing in our experiment, which enables a fair comparison with recent state-of-the-art methods [12].

TUM dataset. We use the indoor sequences from the TUM RGB-D dataset [38], which contains several hand-held sequences with ground truth obtained by an external motion capture system. This dataset exhibits rich camera motion and scene geometry, and represents the most general case for fundamental matrix estimation. We use cross-validation on the sequence “fr3_long_office” during training. To better test the generalizability of the proposed method, we resize the images of the TUM RGB-D dataset to match the image size of the KITTI dataset.

Cambridge dataset. The Cambridge dataset [39] is a large-scale outdoor urban localization dataset containing six challenging scenes with changes in perspective and illumination; this setting is quite different from the TUM and KITTI datasets. We adopt the “St Mary's Church” scene to evaluate the generalizability of our proposed approach, and report only qualitative results in the following section.

We generate two different correspondence datasets for each of the KITTI and TUM datasets, stored in a manner similar to Ref. [40]. For the first, 1000 correspondences based on SIFT [41] are pre-filtered by a ratio test with a threshold of 0.8. The second does not use the ratio test to pre-filter the correspondences, yielding a challenging dataset with high noise. The ratio test is a frequently used strategy for improving the robustness and accuracy of feature matching; therefore, unless otherwise stated, we use the pre-filtered datasets in our experiments. Moreover, each input sample is generated by shuffling all the correspondences between two views in the dataset.
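A sketch of how the pre-filtered variant can be generated with OpenCV; the function name and matcher setup are assumptions, while the SIFT features and the 0.8 ratio threshold follow the text:

```python
import cv2

def generate_correspondences(img1, img2, ratio=0.8, max_pairs=1000):
    """SIFT matching with Lowe's ratio test (pre-filtered dataset variant)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = []
    for m, n in matcher.knnMatch(des1, des2, k=2):
        if m.distance < ratio * n.distance:        # drop ambiguous matches
            pairs.append((kp1[m.queryIdx].pt, kp2[m.trainIdx].pt))
    return pairs[:max_pairs]
```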

5.2 Evaluation protocols

To evaluate the performance of our approach, we report the average better rate per input sample, i.e., the average percentage of input samples on which our learning-based normalization outperforms Hartley's normalization in terms of the symmetric epipolar distance (see Eq. (8)). In addition, in the experiments within the RANSAC framework, we evaluate the average percentage of inliers (correspondences with errors below 1 pixel or 0.1 pixels), as well as the F1 score (the average percentage of correspondences within 1 pixel error of the ground truth epipolar line).

5.3 Experimental evaluations

In the first experiment, we evaluate the performance of our approach per input sample. We first optimize F based on Eq. (8) under the singularity constraint, initialized by Hartley's normalization and by our learning-based normalization, on the KITTI test set; the results are summarized in Fig. 4(a). Our direct results are nearly equivalent to the Hartley-based results after optimization, which indicates that our approach can provide better initial values for more sophisticated nonlinear optimization methods. Unlike the constant \(\sqrt{2}\) distance from the origin in Hartley's normalization, Fig. 5 shows that our learning-based normalization predicts a distance tailored to each input, exploiting the inherent regularity of the data.

Figure 4

(a) Average pixel error per sample, with and without optimization, for the first 20 frames of sequence “06”. Our direct results are almost the same as those based on Hartley's normalization with optimization. (b) Average pixel error per sample for the different eight-point methods. Input samples with an original eight-point error greater than 60 pixels are discarded for better visualization

Figure 5

Learning-based normalized distances from the origin in the left and right camera views, respectively. Hartley's normalization fixes them at \(\sqrt{2}\), while our approach learns a robust normalization scheme adapted to the input data

Then, we quantitatively evaluate the average rate of improvement for each input sample, which is our primary concern. Since Hartley's normalization is the most widely used normalization method [34], we compare only with it here. As presented in Table 1, our learning-based normalization outperforms Hartley's normalization for each input sample. Interestingly, the model trained on the KITTI dataset generalizes well to the TUM dataset, and vice versa, which demonstrates the generalizability of our approach. To further analyze the impact of the training sets, we also evaluate the average improvement rate per input sample when the KITTI and TUM datasets are used jointly for training. The performance improves further for each input sample, which shows that our approach can learn a better and more general normalization scheme from more training data containing diverse regularities. Finally, in Fig. 4(b), we report the distribution of the symmetric epipolar distance for the original eight-point algorithm, for Hartley's normalization, and for our learning-based normalization. While both normalizations achieve great improvements over the unnormalized version, our learning-based normalization consistently outperforms Hartley's normalization, achieving lower errors for eight input correspondences.

Table 1 Results of the average improvement rate per input sample for diverse training sets. Our approach not only takes into account the inherent regularity of the input data but also learns a better and more general normalization scheme

Given the superior per-sample performance of our learning-based normalization, we further verify that it can be effectively integrated into the traditional RANSAC framework [45]. In the experimental comparison, we follow the most related and classic work [12]. We compare our approach with the least median of squares (LMEDS) [43], MLESAC [42], USAC [44], Ranftl's method [12], and RANSAC [45], where RANSAC is based on Hartley's normalization while our approach uses the learning-based normalization. Note that USAC is a state-of-the-art robust estimation framework, and “RANSAC + normalized eight-point algorithm” represents the gold standard [34] for geometric tasks such as visual odometry and SLAM. Ranftl's method [12] uses matching scores as additional information to guide the estimation, which results in an obvious improvement in average accuracy. By contrast, we use only the original RANSAC for performance evaluation. It is also worth noting that, as a supervised framework, Ranftl's method requires ground truth correspondences for training, while our approach is fully self-supervised. Additionally, designing an ensemble network to improve overall performance, such as DSAC [18], is outside the scope of this paper, as our focus is a better normalization for each sample.
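For illustration, the sketch below shows how a minimal solver based on either normalization plugs into plain RANSAC; symmetric_epipolar_residuals is a hypothetical per-correspondence variant of Eq. (8), and the loop is textbook RANSAC rather than our exact implementation:

```python
import numpy as np

def ransac_eight_point(u, u_prime, solver, iters=1000, thresh=1.0, rng=None):
    """RANSAC with a pluggable minimal solver: 'solver' may be
    normalized_eight_point above or its learned counterpart."""
    rng = rng or np.random.default_rng()
    best_F, best_inliers = None, np.zeros(len(u), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(u), size=8, replace=False)
        F = solver(u[idx], u_prime[idx])
        # Hypothetical per-correspondence variant of Eq. (8).
        errors = symmetric_epipolar_residuals(F, u, u_prime)
        inliers = errors < thresh
        if inliers.sum() > best_inliers.sum():
            best_F, best_inliers = F, inliers
    return best_F, best_inliers
```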

Table 2 summarizes the results on the KITTI dataset. Within the RANSAC framework, our learning-based normalization performs on par with Hartley's normalization on the KITTI benchmark. Furthermore, we evaluate the performance on the challenging testing set without the ratio test, and the results are presented in Table 3. Note that our approach achieves a higher inlier percentage on the TUM dataset. We remark that recent analyses in Refs. [46, 47], as well as related experiments in Ref. [48], indicate that RANSAC paradigms with supporting heuristics can only increase the chance of finding a good final solution and are not completely governed by the internal minimal solver; this is one possible reason for the modest improvement of our method when it is embedded into RANSAC. Overall, the effectiveness of our learning-based normalization combined with RANSAC is demonstrated.

Table 2 Results on the KITTI test set using the ratio test at different inlier thresholds
Table 3 Performance of the proposed method in combination with RANSAC in the test set without the ratio test

Finally, we directly employ the network model trained on the KITTI dataset, which differs greatly from the Cambridge dataset. The qualitative generalization results on the Cambridge dataset are reported in Fig. 6. One can see that our approach achieves an accurate two-view fundamental matrix estimation, which reflects its good generalizability. Moreover, since we always centralize the correspondences first, varying image sizes and feature distributions do not have a significant impact on the final results. Currently, our forward propagation time is approximately 5 times that of Hartley's normalization due to the 12-layer ResNet architecture. This efficiency sacrifice buys an improved normalization and hence a more accurate epipolar geometry for each sample.

Figure 6

Image pairs from the KITTI and Cambridge datasets. Odd row: First image with inliers (blue) and outliers (red). Even row: The estimated epipolar lines of a random subset of inliers in the second image. The images are scaled for visualization

Influence of the number of correspondences. We perform additional experiments to analyze the influence of the number of correspondences in the input. We take the median of 1000 trials based on a random testing image. The results are shown in Fig. 7, which indicate that better fundamental matrices can be obtained with an increasing number of correspondences.

Figure 7

Influence of the number of correspondences on the median error of 1000 trials in a random testing frame

Condition numbers. We conduct another experiment to compare the condition numbers arising in the fundamental matrix computation, and the results are reported in Fig. 8. We observe that our learning-based normalization yields better numerical conditioning of the transformed coefficient matrix, which is one of the keys to our improved performance.

Figure 8

Effect of diverse normalization schemes on the average condition number

Nonlinear projection. The singularity of \(\hat{\boldsymbol{F}}\) is evaluated by calculating \(\rho =r_{2}/r_{3}\) for every 100 consecutive frames of the KITTI test set. The results are displayed in Fig. 9, which shows that our learning-based approach achieves smaller nonlinear projection errors. These findings also support our argument that the conditioning of the transformed coefficient matrix obtained with a better normalization is more conducive to imposing the singularity constraint on the resulting fundamental matrix. Note that these experimental results all highlight the superiority of our learning-based normalization approach.

Figure 9

Average ρ of every 100 consecutive samples in the KITTI test set

6 Conclusion

In this paper, we revisit the classic two-view geometry computation from eight point correspondences and employ CNNs to provide a novel perspective on normalization. First, we show that the ideal condition number cannot be attained by any normalization, and that the normalization should instead be made consistent with the subsequent singularity constraint enforcement step. Second, we propose a self-supervised deep neural network that learns a robust normalization scheme for more accurate fundamental matrix estimation. Our approach enables a data-driven, interpretable, and generalizable fundamental matrix estimation pipeline. Our learning-based normalization is superior to Hartley's normalization for each input sample and comparable to it when integrated with RANSAC. Its potential advantages are to provide better initial values for nonlinear optimization and better interpretability for an ensemble network. In the future, we plan to design a lightweight network to balance time and quality, to exploit ground truth correspondences or matching scores for supervised two-view geometry estimation, and to extend our deep solution to other multi-view geometry problems such as triangulation and trifocal tensor estimation.