1 Introduction

3D point cloud registration is the task of finding the rotation and translation that aligns the source point cloud to the partially overlapping target point cloud. It arises as a subtask in many computer vision applications, such as 3D reconstruction [1], object recognition and categorization [2], shape retrieval [3] and robot navigation [4], and remains an active research area [5, 6].

The typical registration pipeline consists of several steps: detecting features by finding salient points or patches of the point clouds, extracting features by describing those detected points or patches, matching features by finding the correspondences between the features of the point clouds, removing outlier correspondences that fail a specific criterion, and estimating the transformation by using only confident correspondences to find the alignment with the highest inlier ratio. These steps can be learning-based or handcrafted, as in the traditional approaches.

The most recent advances have been inspired by the successes of deep learning and the development of novel architectures convenient for point cloud processing, such as PointNet [7] and KPConv [8]. Most of the learning-based approaches follow the typical pipeline by first extracting point cloud features [9,10,11,12] and then either applying RANSAC for creating feature-based matches [13,14,15] and filtering out the bad matches [16,17,18] or learning the whole registration pipeline end-to-end [19,20,21]. These methods achieve remarkable performance on public benchmarks [13, 22,23,24], even on very difficult examples with an overlap smaller than 30% [14].

A major limitation of the state-of-the-art methods, typical for deep-learning-based approaches [25,26,27], is that their performance drops on benchmark data that differ from the training data. Most recent methods address the generalizability question by training on the 3DMatch dataset [13] and evaluating their generalization capabilities on the KITTI [22] or ETH [23] benchmarks. As we argue in Sect. 4.1, however, these benchmark datasets lack data variability: they are biased toward similar data on which the feature extraction pipeline can focus. Additionally, as we show in Sect. 4.1, the current benchmarks cover a restricted range of the registration parameters (rotation, translation and overlap), and therefore provide less information about the actual quality of a method. Moreover, none of the benchmarks provide the option to assess the quality and robustness of a 3D registration method w.r.t. a single registration parameter. Therefore, none of the current benchmarks can provide an adequate in-depth analysis of the performance and generalization of a 3D registration method.

To address these limitations, we propose a methodology for creating a 3D registration benchmark, starting from a point cloud dataset. The methodology improves on the current benchmarks by allowing an in-depth analysis toward concrete registration parameters and providing a larger range of variability in the registration parameters. We present the methodology steps by creating a new version of the FAUST-partial (FP) benchmark [28] based on the FAUST [29] dataset, but the process can be extended to any point cloud dataset, including the already mentioned 3DMatch, KITTI and ETH benchmark datasets. By using the human body point clouds from the FAUST dataset, we additionally address the bias in the current benchmarks, which are mostly comprised of similar objects, providing a substantially different point cloud distribution than the current datasets. We start by creating 3 different settings for the FP benchmark, where each setting changes the difficulty (easy, medium or hard) for one of the 3 following parameters: rotation range, translation range, or overlap range; whilst fixing the remaining two. By fixing two out of three registration parameters, we can isolate the analysis of the quality of a particular 3D registration method to a single parameter, which is not possible with the current benchmarks. The three difficulty levels of the newly created benchmark datasets provide a larger variety of parameter ranges for all three registration parameters, which allows for determining the robustness of a method toward each parameter. We compare the newly created benchmark in detail with the existing benchmarks and conclude that the FP benchmark provides a much more detailed analysis, allowing us to answer questions related to the generalization onto different point cloud distributions and different registration parameters. This comparison additionally provides us with a general methodology for assessing the difficulty of a 3D registration benchmark based on the registration parameter range.

Using the newly created FP benchmark, we carry out a thorough analysis of the state-of-the-art methods in Sect. 4, addressing the research gap for a more recent, in-depth survey (evaluation) of 3D registration methods. Our analysis suggests that most methods are very sensitive to a decrease in overlap, somewhat sensitive to a larger rotation range, and not sensitive at all to a larger translation range.

To address the generalization downside of the current feature-based methods, we further propose a straightforward featureless traditional 3D registration method to use as a baseline for comparing with the state-of-the-art methods. We extend the work from [28], which is based on a grid search of the quantized rotation \(\text {SO}(3)\) and translation \({\mathbb {R}}^3\) spaces. The best transformation candidate is selected as the solution with the maximum cross-correlation between the voxelized source and target point clouds. Thus, we name the method exhaustive grid search (EGS). The EGS shows competitive performance, outperforming most traditional and deep learning methods, and achieving state-of-the-art results on the ETH benchmark and several FP benchmarks. The results suggest that the learning-based methods, although remarkable on many public benchmarks, are still not robust enough to be applied to arbitrary 3D data. On the other hand, our EGS method performs consistently, regardless of the data distribution and regardless of its parameter choices, providing a robust method with higher applicability (see Sect. 5.6).

In summary, we:

  • Propose a new 3D registration method which performs an exhaustive search of the rotation and translation spaces and selects the transformation candidate based on the maximum weighted cross-correlation between the voxelized point clouds;

  • Provide a methodology to create a 3D registration benchmark, starting from an existing 3D point cloud dataset, that provides a more informative evaluation w.r.t. existing benchmarks and allows assessing the difficulty of existing 3D registration benchmarks;

  • Using the newly proposed methodology, generate a novel FAUST-partial (FP) 3D registration benchmark, which addresses the bias toward similar data in the current benchmarks, provides greater parameter range variability than observed in the current benchmarks, and allows evaluating the strengths and weaknesses of a 3D registration method w.r.t. a single registration parameter;

  • Thoroughly evaluate the generalization performance of a large number of state-of-the-art methods under a common set of 3D registration metrics and benchmarks and analyze in detail the influence of the three registration parameters.

2 Related work

We divide the related works into traditional and deep learning methods. Along with the standard optimization-based and handcrafted feature-based traditional methods, we additionally overview the cross-correlation and Fourier-based methods, since the EGS uses the cross-correlation computed in the Fourier domain as a guiding signal for the registration. Among the deep learning methods, we distinguish feature learning, robust estimation learning and end-to-end learning methods.

2.1 Traditional methods

Optimization-based The most popular traditional registration method is the iterative closest point (ICP) algorithm. The algorithm selects a subset of points as correspondences, calculates the optimal transformation between the clouds using SVD, and iterates until convergence. The original implementations used point-to-point [30] and point-to-plane [31] distances for finding the tentative correspondences, but many other strategies have been proposed [32,33,34,35]. GO-ICP [33] proposes a branch-and-bound scheme and claims the global optimality of the algorithm. The 4-point congruent sets (4PCS) algorithm [36] and its variants [37, 38] are based on the idea that there exist sets of four coplanar points whose alignment corresponds to the alignment of the point clouds. To select the correspondences, RANSAC is used, and ICP is applied for refinement.
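The core ICP loop described above fits in a few lines. The following is a minimal point-to-point sketch (not the exact implementation of [30]); `source` and `target` are NumPy arrays of shape (N, 3) and (M, 3):

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(source, target, iters=50, tol=1e-8):
    """Minimal point-to-point ICP sketch for (N,3)/(M,3) arrays."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(target)
    prev_err = np.inf
    for _ in range(iters):
        src = source @ R.T + t
        # 1) tentative correspondences: nearest neighbor in the target
        dist, idx = tree.query(src)
        corr = target[idx]
        # 2) optimal rigid transform for these matches via SVD (Kabsch)
        mu_s, mu_c = src.mean(0), corr.mean(0)
        H = (src - mu_s).T @ (corr - mu_c)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # no reflections
        R_step = Vt.T @ D @ U.T
        t_step = mu_c - R_step @ mu_s
        # 3) accumulate the incremental transform and check convergence
        R, t = R_step @ R, R_step @ t + t_step
        err = np.mean(dist ** 2)
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R, t
```

Each step is optimal only for the current tentative correspondences, which is why ICP needs a close initial alignment and why globally optimal schemes such as the branch-and-bound of GO-ICP [33] were proposed.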

Cross-correlation based We single out related methods that use the cross-correlation to find the 3D alignment. [39] determines the 3D rotation using a sensor (accelerometer or magnetometer) attached to the 3D scanner, followed by a 3D cross-correlation between the voxelized point clouds to determine the translation. A group of works [40,41,42] uses the 2D cross-correlation between the reprojected 3D data to a 2D space to either determine the correspondences or the final registration.

Differently, our approach uses the 3D cross-correlation to determine both the 3D rotation and translation in order to align the point clouds. Additionally, our method does not mix 2D and 3D information, but rather uses only the 3D information of the given point clouds to align them. To the best of our knowledge, there are no works that explicitly use the 3D cross-correlation to determine both the rotation and translation to register two point clouds.

Fourier-based We overview related works that use the Fourier domain to compute the 3D registration between two point clouds. [43] uses the Fourier domain to align slices (2D images) of 3D brain MRI volumes by searching for only a single rotation angle around the z-axis that obtains the largest cross-correlation. A set of works [44,45,46,47,48,49] find the 3D alignment completely in the frequency domain by leveraging the fact that the magnitude of the Fourier transform of the displaced voxelized point cloud decouples the rotational from the translational component of the 3D alignment. The translational component is then found using phase correlation or phase matching.

Differently, our method is 3D-based (not 2D) and does not find the rotation in the Fourier domain. Instead, we find the rotation by sampling the \(\text {SO}(3)\) space, which increases the rotation estimation robustness, since the Fourier rotation theorems used in these works are only valid in the continuous case and introduce numerical issues when discretized. Moreover, these methods perform significantly worse when the point clouds do not (nearly) completely overlap.

Handcrafted feature-based These methods first extract potential correspondences between the point clouds using the computed features and then find the transformation using RANSAC. Similar to the image keypoint-based methods such as SIFT [50], 3D feature-based methods focus on keypoint detection [51,52,53] and their distinctive description [54,55,56,57,58,59,60,61,62]. Fast global registration [63] refines the initial correspondences computed using the FPFH [55] descriptor and optimizes the Black-Rangarajan duality between robust estimation and line processes to estimate the 3D alignment.

Differently, our method does not require any features, keypoints, or their description to estimate the transformation. Instead, it exhaustively searches the rotation and translation spaces, avoiding the common pitfalls of the feature-based methods, and increasing its generalization capabilities.

2.2 Deep learning methods

Feature learning Instead of handcrafting distinctive features, keypoint detection and description can be learned. 3DMatch [13] transforms patches into volumetric voxel grids of truncated distance function (TDF) values and processes them through a 3D convolutional network [64, 65] to output local descriptors. Following 3DMatch and the rise of deep learning, many other works propose to learn keypoint detection [14, 15, 66, 67] and description [27, 68,69,70,71,72,73,74,75]. Most of these works are trained by optimizing some version of the contrastive loss [76, 77] between the descriptors of matching and non-matching points, and then apply RANSAC to select the final correspondences and find the transformation.

Robust estimation learning Instead of creating even better features for the registration process, these methods focus on removing outliers from a given set of features or correspondences, prior to estimating the transformation with RANSAC, GC-RANSAC [78] or CG-SAC [79]. [80] classifies the given correspondences into inliers or outliers and computes the transformation from the found inliers. [17] uses triplets of correspondences to cast votes in the 6D Hough space for a particular transformation. [16] selects the transformation with the most inliers from a list of transformation estimations computed by using the confidence of each given correspondence. [81] removes outliers from given correspondences by using a two-stage branch-and-bound algorithm that decomposes the search into simpler (1+2) and (2+1) degree-of-freedom problems for the rotation and translation, respectively. [82] finds consistent correspondences between two sets of features by building the adjacency matrix of a graph whose nodes represent the potential correspondences and whose edge weights encode the pairwise agreements between them. [18] uses non-local channel spatial attention layers to obtain more reliable contextual information and uses the work from [82] to find consistent correspondences. [83] proposes a decoupled approach that solves in cascade for the scale, rotation and translation of a truncated least squares registration formulation using given correspondences. [84] considers a second-order spatial compatibility measure to compute the similarity between correspondences. From these, they find reliable initial correspondences that form consensus sets, based on which a rigid transformation can be found. [85] jointly learns the FCGF [68] features along with the outlier removal. [86] learns a matching matrix to match DGCNN [87] rectified virtual point features, after which they use Procrustes to solve for the transformation matrix.

End-to-end registration learning Many recent approaches learn not only the feature description, but also the subsequent matching step, thus learning end-to-end. The first group of these methods [9,10,11,12, 88,89,90,91], pioneered by the deep closest point [9], follow the ICP idea by (iteratively) establishing soft correspondences and then applying weighted SVD to obtain the transformation. The second group of methods [19,20,21, 92, 93], represented by PointNetLK [19], use the PointNet architecture [7] or a similar global description strategy to iteratively regress the transformation based on the global feature vectors. The third group of methods [94,95,96,97] use self-attention and cross-attention mechanisms to densely propagate the encoded superpoint features and choose the final transformation from candidates of superpoint matches.

2.3 Generalization to other datasets

Several recent methods [16, 27, 71, 72] attempt to generalize to datasets other than the ones they were trained on. All of these methods demonstrate significant performance retention on novel datasets when, for example, evaluating 3DMatch-trained models on KITTI [16, 27, 71]. However, most of the results [15, 27, 71, 72] only show that the computed descriptors match well by presenting the feature-matching-recall metric, never actually evaluating the quality of the 3D registration. As will be seen, many methods still struggle to generalize when confronted with completely unseen data.

2.4 3D registration surveys

Recent survey papers on 3D point cloud registration [98,99,100,101,102] provide a grouping of the traditional and learning methods, a detailed overview of the key elements of each method, the current benchmarks used in the literature and the different evaluation metrics used. Additionally, these papers present the results for some of the methods. However, most of the results have been gathered from previous papers. Since there are multiple benchmarks with multiple metrics, the gathered results mostly cover only a few methods. Therefore, the current literature is lacking an in-depth analysis of the results of the current state-of-the-art 3D registration methods. In order to work toward the goal of a fully robust and generalizable method, a thorough comparison is necessary. We provide a detailed analysis of 33 of the current state-of-the-art methods on three established benchmarks (3DMatch, KITTI and ETH) and our newly created FP benchmark. For comparison, the survey papers mention only 10 (or fewer, depending on the paper) out of the 33 methods we compare.

Fig. 1 The proposed pipeline. The method is divided into 3 consecutive steps: pre-processing, cross-correlation and estimation, after which an optional refinement step can be added. The pre-processing step prepares the initial data and outputs N voxelized source volumes and one target volume. The cross-correlation step performs the 3D cross-correlation between each source volume and the target volume. The cross-correlation volumes \(CC_i(x,y,z)\) are heatmaps that should indicate higher (yellow) or lower (purple) matching between the source \({\textbf{X}}_i\) and target \({\textbf{Y}}\) volumes at the corresponding voxels. White spaces are present because the cross-correlation values are clipped so that only the upper cross-correlation range is visible. Finally, the estimation step finds the solution by using the maximal cross-correlation over all the given volumes

3 Method description

Let \({\mathcal {X}} \in {\mathbb {R}}^{N \times 3}\) be the source point cloud and \({\mathcal {Y}} \in {\mathbb {R}}^{M \times 3}\) the target point cloud. The goal of rigid 3D registration is to find the homogeneous transformation \({\textbf{T}} \in \text {SE}(3)\) that best aligns \({\mathcal {X}}\) to \({\mathcal {Y}}\). The rigid transformation \({\textbf{T}}\) is composed of a rotation component \({\textbf{R}} \in \text {SO}(3)\) and a translation component \({\textbf{t}} \in {\mathbb {R}}^3\).

To find the correct rotation and translation, we perform an exhaustive search over the parametrization of the rotation and translation spaces (also called the search space). We divide our method into 3 consecutive steps: pre-processing, cross-correlation and estimation, as shown in Fig. 1. Optionally, an additional refinement step can be added to further refine the results. In this section, we first introduce the search space parametrization and the general pipeline, whereas in Sect. 5 we discuss the results of the different tested strategies for each of the 4 steps. The final estimation of the rigid transformation is provided in Eq. 13.

Search space parametrization To parametrize the \(\text {SO}(3)\) space, we first create a geodesic polyhedron \(\{3,5+\}_{4,0}\) comprised of 162 vertices [103], each lying on a unit 2-sphere equidistant from its neighbors. These vertices are used as a uniform sample of \({\mathbb {S}}^2\). Next, for each point on the 2-sphere, we uniformly sample \({\mathbb {S}}^1\) using an angle step S. Each combination of a point on \({\mathbb {S}}^2\) (denoted as axis) and a point on \({\mathbb {S}}^1\) (denoted as angle) creates an angle-axis representation of a rotation. This results in \(N = 162 \times (360 / S) \) non-unique rotations that can be converted to rotation matrices \(R_i, i=1,\dots , N\). The non-uniqueness of the rotations follows from having opposite axes present in the sampling of \({\mathbb {S}}^2\). We remove these duplicate rotation matrices by iteratively rejecting the ones where the norm of their difference equals 0. Note that this step is only computed once, prior to any registration.
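A minimal sketch of this sampling, assuming a helper `icosphere_vertices()` that returns the 162 unit vertices of the geodesic polyhedron (hypothetical here; any uniform \({\mathbb {S}}^2\) sampling works), could look as follows:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotation_candidates(axes, angle_step_deg=10.0, tol=1e-6):
    """Build the SO(3) grid: one rotation matrix per (axis, angle) pair,
    then drop duplicates caused by opposite axes with opposite angles."""
    angles = np.deg2rad(np.arange(0.0, 360.0, angle_step_deg))
    mats = [Rotation.from_rotvec(ang * axis).as_matrix()
            for axis in axes for ang in angles]   # axes: (162, 3) unit vectors
    unique = []
    for R in mats:  # reject R when ||R - U|| is (numerically) 0 for a kept U
        if all(np.linalg.norm(R - U) > tol for U in unique):
            unique.append(R)
    return np.stack(unique)

# `icosphere_vertices()` is a hypothetical helper returning the 162 vertices
# of the geodesic polyhedron {3,5+}_{4,0}:
# R_all = rotation_candidates(icosphere_vertices(), angle_step_deg=10.0)
```

For \(S = 10^\circ \), this yields the \(162 \times 36 = 5832\) candidates mentioned in Sect. 4.3, which the deduplication reduces to 3536.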

The translation space is inherently parameterized by the voxelization process of the given point clouds. The possible translations hence correspond to the centers of the source point cloud voxels and are therefore dependent on the voxelization resolution (VR). More details are provided in the next few sections.

Pre-processing First, we center and rotate the source point cloud \({\mathcal {X}}\) around the origin using the N precomputed rotation matrices \(R_i\):

$$\begin{aligned} {\mathcal {X}}_i&= R_i ({\mathcal {X}} - t_{{\mathcal {X}}}^{\textsc {center}}) \end{aligned}$$
(1)
$$\begin{aligned} t_{{\mathcal {X}}}^{\textsc {center}}&= \frac{1}{N} \sum _{i=1}^{N} {\mathcal {X}}[i,:] \in {\mathbb {R}}^3 \end{aligned}$$
(2)

obtaining \({\mathcal {X}}_i,\, i=1,\dots ,N\), where \([:,:]\) indicates row- and column-wise indexing.

Next, we make all the point cloud coordinates positive by translating their minimal bounding box point into the origin:

$$\begin{aligned} {\mathcal {X}}_i&= R_i ({\mathcal {X}} - t_{{\mathcal {X}}}^{\textsc {center}}) + t_{{\mathcal {X}}}^{\textsc {posit}} \end{aligned}$$
(3)
$$\begin{aligned} {\mathcal {Y}}&= {\mathcal {Y}} + t_{{\mathcal {Y}}}^{\textsc {posit}} \end{aligned}$$
(4)

where

$$\begin{aligned} t_{{\mathcal {X}}}^{\textsc {posit}} = - \begin{bmatrix} \min {\mathcal {X}}[:,1] \\ \min {\mathcal {X}}[:,2] \\ \min {\mathcal {X}}[:,3] \end{bmatrix} \quad t_{{\mathcal {Y}}}^{\textsc {posit}} = - \begin{bmatrix} \min {\mathcal {Y}}[:,1] \\ \min {\mathcal {Y}}[:,2] \\ \min {\mathcal {Y}}[:,3] \end{bmatrix} \in {\mathbb {R}}^3 \end{aligned}$$
(5)

and min indicates the minimal element of an array. This step is performed to facilitate the voxelization process.

We then voxelize each source point cloud \({\mathcal {X}}_i\) and the target point cloud \({\mathcal {Y}}\) with a voxel resolution of VR cm. We experiment with different voxelization resolutions and strategies and discuss them in more depth in Sect. 5. Generally, voxelizing a point cloud results in a 3D grid volume where a value of 1 indicates that a point from the point cloud is present in that specific grid box (voxel), and a value of 0 indicates that no points are present in it. Instead of having a 3D grid with ones and zeros, we set a value of PV (positive voxel) for the filled voxels and a value of NV (negative voxel) for the empty ones. This results in N voxelized source volumes \({\textbf{X}}_i\) and one voxelized target volume \({\textbf{Y}}\):

$$\begin{aligned} {\textbf{X}}_i(x,y,z), \, {\textbf{Y}}(x,y,z) = {\left\{ \begin{array}{ll} PV, &{} \text {if voxel } (x,y,z) \text { is filled} \\ NV, &{} \text {if voxel } (x,y,z) \text { is empty} \end{array}\right. } \end{aligned}$$
(6)
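The whole pre-processing step, Eqs. (1)-(6), reduces to a few NumPy operations. A minimal sketch, with the PV and NV values of Sect. 4.3 and the FP voxel resolution as defaults:

```python
import numpy as np

def preprocess_and_voxelize(points, R_i, VR=0.06, PV=5.0, NV=-1.0):
    """Eqs. (1)-(6): center the cloud, rotate it by R_i, shift it into the
    positive octant, and voxelize with PV (filled) / NV (empty) values."""
    pts = (points - points.mean(axis=0)) @ R_i.T   # Eqs. (1)-(2)
    pts = pts - pts.min(axis=0)                    # Eqs. (3)-(5): t^posit
    idx = np.floor(pts / VR).astype(int)           # voxel index per point
    vol = np.full(idx.max(axis=0) + 1, NV, dtype=np.float32)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = PV
    return vol

# The target volume uses the same routine with the identity rotation:
# Y_vol = preprocess_and_voxelize(Y, np.eye(3))
```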

Cross-correlation For each source volume \({\textbf{X}}_i\), we perform a 3D cross-correlation with the target volume \({\textbf{Y}}\). Essentially, the central voxel of the target volume is translated over each voxel of the source volume, and at each position the cross-correlation is computed by multiplying the overlapping voxel values of the two volumes and summing them together. This results in N cross-correlation volumes \(CC_i(x,y,z)\) with the same 3 dimensions as the source volume. The volumes can be thought of as discrete heatmaps where higher values should represent higher degrees of matching between the voxelized point clouds. Prior to the cross-correlation, each source volume is padded in order for the target volume to slide over the whole source volume. We mark with \({\textbf{P}} = [n_{\text {left}}, n_{\text {right}},n_{\text {bottom}},n_{\text {top}},n_{\text {front}},n_{\text {back}}] \in {\mathbb {R}}^6\) the padding applied to each source volume \({\textbf{X}}_i\), where the values represent the number of voxels padded to the left, right, bottom, top, front and back of the volume, respectively. We experiment with different padding sizes and choices in Sect. 5. We make use of the Fourier domain to accelerate the computation of the cross-correlation. Both volumes are first transformed into the Fourier space using the FFT algorithm [104], after which the cross-correlation simplifies to an element-wise multiplication [105]. The output is then transformed back with an inverse FFT. More details are given in Sect. 5.6.
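In code, the FFT-accelerated cross-correlation is a one-liner with SciPy. In this sketch the padding value is left as a free choice (Sect. 5 studies it), and \({\textbf{P}}\) must make the padded source at least as large as the target along every dimension:

```python
import numpy as np
from scipy.signal import correlate

def cc_volume(X_vol, Y_vol, P, pad_value=0.0):
    """3D cross-correlation of the padded source volume with the target
    volume; method='fft' makes SciPy take the FFT route of [104, 105]."""
    # P = [n_left, n_right, n_bottom, n_top, n_front, n_back]
    Xp = np.pad(X_vol, ((P[0], P[1]), (P[2], P[3]), (P[4], P[5])),
                constant_values=pad_value)
    return correlate(Xp, Y_vol, mode='valid', method='fft')
```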

Estimation We estimate the rotation matrix \({\hat{R}}\) that aligns (rotation-wise) \({\mathcal {X}}\) to \({\mathcal {Y}}\) using one of the N precomputed rotation matrices \(R_i\). We select the matrix \(R_i\) that corresponds to \({\textbf{X}}_i\) with the maximal cross-correlation value from the \(CC_i(x,y,z)\) volumes. More concretely, we use the index

$$\begin{aligned} i^* = \mathop {\mathrm {\arg \!\max }}\limits _{i} \, \max _{x,y,z} \, CC_i(x,y,z) \end{aligned}$$
(7)

to select the estimated rotation matrix \({\hat{R}} = R_{i^*}\). To estimate the translation, we find the voxel with the maximal cross-correlation value in \(CC_{i^*}\). Then, we translate the central voxel of the target volume \({\textbf{Y}}\) to the voxel just found in the \(CC_{i^*}\) volume. Since the \(CC_{i^*}\) volume corresponds to the source \({\textbf{X}}_{i^*}\) volume, we essentially translate the central voxel of the target volume to the voxel of the source volume with the maximal cross-correlation. More concretely, we find the index of the voxel with the maximal cross-correlation value with

$$\begin{aligned} (x^*,y^*,z^*) = \mathop {\mathrm {\arg \!\max }}\limits _{x,y,z} CC_{i^*} (x,y,z). \end{aligned}$$
(8)

Then, to translate the central voxel of the target volume to it, we use the translation:

$$\begin{aligned} t_{\text {est}} = \Bigg ( \underbrace{- \underbrace{ {\textbf{Y}}_{\textsc {cv}}}_{\begin{array}{c} \text {target} \\ \text {volume} \\ \text {central} \\ \text {voxel} \end{array}}}_{\begin{array}{c} \text {move to} \\ \text {origin} \end{array}} - \underbrace{\underbrace{ \begin{bmatrix} {\textbf{P}}[0] \\ {\textbf{P}}[2] \\ {\textbf{P}}[4] \end{bmatrix}}_{\begin{array}{c} \text {padding} \\ \text {displacement} \end{array}} + \underbrace{\begin{bmatrix} x^* \\ y^* \\ z^* \end{bmatrix}}_{\begin{array}{c} \text {max cc} \\ \text {voxel} \end{array}} + \underbrace{\begin{bmatrix} 0.5 \\ 0.5 \\ 0.5 \end{bmatrix}}_{\begin{array}{c} \text {move to} \\ \text {center of} \\ \text {voxel} \end{array}}}_{\text {move to } (x^*,y^*,z^*) } \Bigg ) \times \text {VR} \end{aligned}$$
(9)

where each value is multiplied by the voxel resolution \(\text {VR}\) to transform from voxel indices to Euclidean coordinates. The central voxel of the target volume can be computed as:

$$\begin{aligned} {\textbf{Y}}_{\textsc {cv}} = \begin{bmatrix} \lceil V_x / 2 \rceil \\ \lceil V_y / 2 \rceil \\ \lceil V_z / 2 \rceil \end{bmatrix} \end{aligned}$$
(10)

where \(V_x,V_y,V_z\) are the number of voxels of \({\textbf{Y}}\) along the 3 dimensions. Intuitively, the central voxel along a dimension is the middle voxel if the number of voxels is odd, and the voxel to the left of the midpoint if it is even.

Following all of the steps above, the rigid registration can be summarized as:

$$\begin{aligned} \left( {\hat{R}} \left( {\mathcal {X}} - t_{{\mathcal {X}}}^{\textsc {center}} \right) \right) + t_{{\mathcal {X}}}^{\textsc {posit}} \sim \left( {\mathcal {Y}} + t_{{\mathcal {Y}}}^{\textsc {posit}} \right) - t_{\text {est}} \end{aligned}$$
(11)

where \(\sim \) indicates that the left and right parts are aligned.

Since the final rigid transformation needs to align \({\mathcal {X}}\) to \({\mathcal {Y}}\), Equation (11) can be rewritten as:

$$\begin{aligned} \left( {\hat{R}} \left( {\mathcal {X}} - t_{{\mathcal {X}}}^{\textsc {center}} \right) \right) + t_{{\mathcal {X}}}^{\textsc {posit}} + t_{\text {est}} - t_{{\mathcal {Y}}}^{\textsc {posit}} \sim {\mathcal {Y}} \end{aligned}$$
(12)

Therefore, the final rotation and translation estimations are:

$$\begin{aligned} {\hat{R}} = R_{i^*}, \hspace{3em} {\hat{t}} = -{\hat{R}} \, t_{{\mathcal {X}}}^{\textsc {center}} + t_{{\mathcal {X}}}^{\textsc {posit}} + t_{\text {est}} - t_{{\mathcal {Y}}}^{\textsc {posit}} \end{aligned}$$
(13)
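Putting Eqs. (7)-(13) together, the estimation step amounts to two argmax lookups and some bookkeeping. A sketch (0- vs. 1-based voxel indexing is glossed over; `t_posit_X` holds the positivizing shift of each rotated source \({\mathcal {X}}_i\)):

```python
import numpy as np

def estimate_transform(CC, R_all, P, Y_dims, VR,
                       t_center, t_posit_X, t_posit_Y):
    """Eqs. (7)-(13): pick the best rotation index and peak voxel from the
    N cross-correlation volumes, then assemble the 4x4 transformation."""
    i_star = int(np.argmax([cc.max() for cc in CC]))                 # Eq. (7)
    xyz = np.unravel_index(np.argmax(CC[i_star]), CC[i_star].shape)  # Eq. (8)
    Y_cv = np.ceil(np.asarray(Y_dims) / 2.0)                         # Eq. (10)
    t_est = (-Y_cv - np.array([P[0], P[2], P[4]])
             + np.asarray(xyz) + 0.5) * VR                           # Eq. (9)
    R_hat = R_all[i_star]
    t_hat = -R_hat @ t_center + t_posit_X[i_star] + t_est - t_posit_Y  # Eq. (13)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R_hat, t_hat
    return T
```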

Refinement Since the rotation and translation spaces are discretized, the initial alignment can be further refined. We derive the numerical and analytical rotation and translation upper bounds RB and TB in Appendix A. It can be concluded from these bounds that the rough initial alignment provides a good initialization for a fine registration algorithm. We experiment with different refining strategies in Sect. 5.4. In the final configuration, we use generalized ICP [106] to refine the initial solution, since it provided slightly better results. We run \(i=500\) iterations with an adaptive distance threshold based on the q-th quantile of the nearest neighbor distances between the two point clouds. Using an adaptive threshold makes the method more robust, since the point clouds from different benchmarks have very different resolutions. As we show in Table 8, however, the final results vary only slightly for different values of i and q, which makes the EGS largely independent of the refinement strategy.
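A sketch of this refinement using Open3D's generalized ICP; the q-th quantile threshold follows the text, while the covariance neighborhood size is an assumption of this sketch:

```python
import numpy as np
import open3d as o3d

def refine(source, target, T_init, q=0.25, iters=500):
    """Generalized-ICP refinement of the EGS estimate with an adaptive
    threshold: the q-th quantile of nearest-neighbor distances between
    the coarsely aligned clouds."""
    src = o3d.geometry.PointCloud(source)
    src.transform(T_init)
    thr = np.quantile(np.asarray(src.compute_point_cloud_distance(target)), q)
    for pc in (source, target):  # GICP needs per-point covariances
        pc.estimate_covariances(o3d.geometry.KDTreeSearchParamKNN(20))
    result = o3d.pipelines.registration.registration_generalized_icp(
        source, target, thr, T_init,
        o3d.pipelines.registration.TransformationEstimationForGeneralizedICP(),
        o3d.pipelines.registration.ICPConvergenceCriteria(max_iteration=iters))
    return result.transformation
```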

4 Experiments

We evaluate the traditional state-of-the-art methods and the deep learning ones trained on 3DMatch [13], and compare them to the EGS method. We use three established benchmarks: 3DMatch [13], ETH [23] and KITTI [22]; and create a novel FAUST-partial benchmark based on the FAUST dataset [29]. These benchmarks test the generalization abilities in terms of different sensor modalities (RGB-D, laser scanner), different environments (indoor, outdoor), resolution (6 mm to 5 cm) and completely different structure (from indoor objects to human bodies). Implementation details for all the compared methods are listed in Sect. 6.

Fig. 2 FAUST-partial benchmark generation. For a given scan from the FAUST [29] dataset, we translate its minimal bounding box point to the origin. Next, we surround the scan with a regular icosahedron. Each point of the icosahedron acts as a viewpoint used to create a partial scan using the hidden point removal algorithm [107]. For two partial scans with a desired overlap, we use a random rotation and translation from the desired ranges to obtain a registration pair for the FAUST-partial benchmark

4.1 Benchmarks

3DMatch The 3DMatch [13] benchmark dataset contains 46 training, 8 validation and 8 test indoor scenes. Each scene is fragmented into multiple 3D scans that need to be aligned. The scenes represent indoor scans of various rooms such as offices, hotel rooms, kitchens and laboratories. The benchmark has been created by joining several existing benchmarks into one. Hence, it shows the largest variability in terms of the rotation, translation and overlap parameters. Following standard practice [12,13,14, 94, 95], we evaluate our EGS method on the 8 test scenes and align all fragments with a minimum overlap of \(30\%\), including the neighboring fragment pairs that the original benchmark excluded.

KITTI The KITTI [22] benchmark dataset is comprised of 11 sequences of outdoor driving scenarios obtained by a lidar scanner. Compared to 3DMatch, the fragments are much larger, have lower resolution and a different structure. Following common practice [14, 15, 68, 71, 95], we evaluate our EGS method on scenes 8 to 10 using pairs which are at least 10 m away from each other. The ground-truth transformation matrices are refined using ICP [14, 15, 27, 95], since the ground-truth alignment parameters are obtained from the imperfect GPS coordinates of the moving vehicle.

Table 1 The 9 FAUST-partial benchmark dataset versions

ETH The ETH [23] benchmark dataset consists of 4 scenes mostly comprised of outdoor vegetation. Compared to 3DMatch, the fragments are larger, have lower resolution and have more complex geometries. Following common practice [71, 72, 108], we use only point clouds with an overlap greater than \(30\%\).

FAUST-partial The state-of-the-art learning methods that train on the 3DMatch dataset usually test their generalization capabilities [12, 16, 71, 72, 95, 108] on the KITTI or ETH benchmarks. We argue that the data from these benchmarks are comprised of many similar flat objects, such as walls, tables or floors, which make the generalization process easier for a method. When confronted with completely unseen data, however (such as 3D human scans), the methods have difficulty generalizing. Additionally, as we show in Figs. 3, 4 and 5, the existing benchmarks lack parameter range variability and do not provide any insights into the robustness of a method w.r.t. a single registration parameter.

To improve the generalization testing of a 3D registration method, we propose a methodology for creating a novel 3D registration benchmark, starting from a point cloud dataset. The methodology improves on the current benchmarks by providing larger 3D registration parameter variability and more informative evaluations. We illustrate the methodology steps by creating a novel FAUST-partial benchmark based on the FAUST [29] dataset, but the process can be extended to any point cloud dataset, including the already mentioned 3DMatch, KITTI and ETH benchmark datasets.

The FAUST [29] dataset is comprised of 100 human body scans. We divide each scan into multiple overlapping fragments that need to be aligned. The steps to create the fragments are illustrated in Fig. 2, with a code sketch below. We begin by moving each scan so the xz-plane acts as the floor. We do this by moving the minimal bounding box point of the scan to the origin. Next, we create a regular icosahedron centered at the center of mass of each scan. A regular icosahedron contains 12 points that lie on a unit sphere around its center, each equidistant from its neighbors. We scale the icosahedron to a 1.5-m-radius sphere so each scan fits inside it. The icosahedron points are used as the viewpoints for creating the partial views (fragments). For each viewpoint, we use the hidden point removal [107] algorithm to create a partial point cloud. Finally, for each pair of viewpoints (i, j) that satisfies a desired overlap criterion (discussed below), we sample a random rotation using 3 Euler angles and a random translation along the x, y and z axes. We rotate and translate the partial point cloud obtained from viewpoint i to finally get the benchmark registration pair (i, j).
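A sketch of the per-scan fragment generation, using Open3D's built-in hidden point removal [107]; the projection radius `hpr_radius` is an assumption of this sketch, and any large multiple of the scene size behaves similarly:

```python
import numpy as np
import open3d as o3d

def faust_fragments(pcd, ico_radius=1.5, hpr_radius=100.0):
    """Fig. 2 sketch: floor the scan on the xz-plane, surround it with a
    regular icosahedron of 12 viewpoints, and cut one partial view per
    viewpoint with hidden point removal [107]."""
    pcd = o3d.geometry.PointCloud(pcd)
    pcd.translate(-pcd.get_min_bound())          # min bbox point -> origin
    center = pcd.get_center()
    phi = (1 + 5 ** 0.5) / 2                     # 12 icosahedron vertices:
    v = np.array([[0.0, s1, s2 * phi] for s1 in (-1, 1) for s2 in (-1, 1)])
    v = np.vstack([np.roll(v, r, axis=1) for r in range(3)])
    v = ico_radius * v / np.linalg.norm(v, axis=1, keepdims=True)
    fragments = []
    for viewpoint in center + v:
        # hpr_radius: spherical projection radius of [107] (assumption)
        _, idx = pcd.hidden_point_removal(viewpoint, hpr_radius)
        fragments.append(pcd.select_by_index(idx))
    return fragments
```

Each pair of fragments (i, j) that meets the overlap criterion is then perturbed with a rotation and translation drawn from the difficulty ranges of Table 1.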

To achieve the variability w.r.t. the registration parameters and have a clear distinction between datasets, we create 3 different benchmark settings. Each setting changes one of the 3 following parameters: rotation range, translation range or overlap range, whilst fixing the remaining two. For the changing parameter, we distinguish 3 levels of difficulty, namely easy, medium and hard. Therefore, we obtain 9 different benchmarks; 3 for each setting. Table 1 summarizes the different settings and the acronyms attributed to each of the benchmark datasets for easier reference. As can be seen, the datasets are denoted as FP–\({\mathcal {S}}\)–\({\mathcal {D}}\), where \({\mathcal {S}} \in \{R,T,O\}\) indicates the setting (R for rotation, T for translation and O for overlap), whilst \({\mathcal {D}} \in \{E,M,H\}\) indicates the difficulty level (E for easy, M for medium and H for hard). For every benchmark instance (for example FP–R–E), the difficulty of the setting \({\mathcal {S}}\) (the rotation R) is determined by the chosen difficulty level \({\mathcal {D}}\) (the difficulty E), whilst the difficulty of the other two parameters (translation T and overlap O in this case) is kept at easy. Therefore, the triplet of datasets FP–R–E, FP–R–M and FP–R–H, for example, indicates an increasing difficulty in the rotation range and a fixed easy difficulty for the remaining translation T and overlap O ranges. By fixing two out of three registration parameters, we can isolate the analysis of the quality of a particular 3D registration method to a single parameter and determine its robustness regarding that parameter. Even though the datasets FP–R–E, FP–T–E and FP–O–E all have the same ranges for all three parameters (since the difficulty for all the parameters is easy), the datasets are not equal, since we sample the rotations and translations for each dataset independently. We choose to keep all three seemingly equal datasets in order to showcase the results for different samplings of the same dataset.

To create the three difficulties for each registration parameter, we start by determining the theoretical bounds for each registration parameter: \(-180^{\circ }\) to \(180^{\circ }\) for the rotation, 0 m for the lower translation bound, and \(0\%\) to \(100\%\) for the overlap. The goal is to find three non-overlapping parameter ranges within those bounds, with increasing difficulty of alignment. To create sensible bounds for each difficulty, we observe the parameter ranges of the existing benchmarks. We plot the kernel density estimate (KDE) [109, 110], along with its carpet plot, for each parameter in Figs. 3, 4 and 5. The KDE approximates the underlying continuous probability density function generated by the data by smoothing the binned observation frequencies with a Gaussian kernel. The carpet plot addresses the smoothing pitfalls of the KDE (for example, the continuity of the curve where there is no data) and shows a histogram-like plot of the observed data values. For easier comparison between benchmarks, we additionally normalize each KDE using its maximal binned frequency, so that each curve displays a maximal probability density of 1.
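The plots in Figs. 3, 4 and 5 can be reproduced along these lines (a sketch; we peak-normalize the density curve itself, a close stand-in for the maximal-binned-frequency normalization described above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_parameter_distribution(values_per_benchmark, xlabel):
    """Peak-normalized KDE plus a carpet (rug) of the raw observations."""
    fig, ax = plt.subplots()
    for name, vals in values_per_benchmark.items():
        vals = np.asarray(vals, dtype=float)
        grid = np.linspace(vals.min(), vals.max(), 512)
        dens = gaussian_kde(vals)(grid)
        ax.plot(grid, dens / dens.max(), label=name)         # max -> 1
        ax.plot(vals, np.full_like(vals, -0.02), '|', ms=5)  # carpet plot
    ax.set_xlabel(xlabel)
    ax.set_ylabel('normalized density')
    ax.legend()
    return fig
```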

To determine the easy, medium and hard rotation bounds, we observe Fig. 3, where we plot the Euler angle ranges for the rotations around the x, y and z axes, respectively. As can be seen, the angles around the x and y axes for the KITTI and ETH benchmarks are around \(\left[ -10^\circ , 10^\circ \right] \). Additionally, the angles around the z axis for the KITTI benchmark are around \(\left[ -20^\circ , 20^\circ \right] \). Since these are small rotations, we choose a similar range, \(\left[ -15^\circ , 15^\circ \right] \), for our easy rotation range. Since we do not want the parameter ranges for the different difficulties to overlap, the lower bound for the medium rotation range therefore needs to be \(15^\circ \). By looking at the 3DMatch Euler angle distributions for the three axes, we notice that the range increases to \(40^\circ \). Therefore, we create the medium range from \(\left[ -45^\circ , -15^\circ \right\rangle \) and \(\left\langle 15^\circ , 45^\circ \right] \). Again, since we do not overlap the parameter ranges for the different difficulties, the lower bound for the hard range needs to be \(45^\circ \). This range is mostly covered by the z axis in the ETH benchmark, peaking at around \(75^\circ \). Therefore, we use all the remaining angles as the hard parameter ranges \(\left[ -180^\circ , -45^\circ \right\rangle \) and \(\left\langle 45^\circ , 180^\circ \right] \).

To determine the translation bounds, we observe Fig. 4, where we plot the translation distances in meters for each benchmark. As can be seen, the 3DMatch benchmark translations peak at 0.5 m and decrease rapidly after 1 m. Therefore, we choose the range \(\left[ 0 \text {m}, 1 \text {m} \right] \) as the easy translation range. Next, we see that for the ETH benchmark, the translations are mostly between 1 m and 3 m. Therefore, we choose the medium difficulty translation range as \(\left\langle 1 \text {m}, 3 \text {m} \right] \). Finally, since the KITTI registration pairs are sampled at a 10 m distance, we create the hard difficulty translation range as \(\left\langle 5 \text {m}, 10 \text {m} \right] \), in order to include the upper bound of the KITTI translations.

To determine the overlap bounds, we observe Fig. 5. All the benchmarks peak at around \(40\%\) overlap, with 3DMatch and ETH having only examples with overlap larger than \(30\%\). Therefore, we determine that the hardest difficulty should have overlaps lower than \(30\%\). On the other hand, anything below \(10\%\) overlap is not enough to register, so we deem that the hard overlap range should be \(\left[ 10\%, 30\%\right\rangle \). As we can see from Fig. 5, the KITTI and ETH overlaps begin to drop more drastically after \(60\%\). Therefore, we use the \(\left[ 30\%, 60\%\right\rangle \) as the medium overlap range. Finally, the remaining \(\left[ 60\%, 100\%\right] \) is used for the easy overlap difficulty range. Note that the actual overlap in the FP benchmark is never \(100\%\), since it makes little sense to register two fully overlapping point clouds.

Fig. 3 Euler angle ranges for the x, y and z axes. For each axis, we plot the KDE plot and the carpet plot. To facilitate comparison, we normalize each KDE using the maximal binned frequency

Benchmark comparison To motivate the creation of the FAUST-partial benchmark, we provide a detailed overview of the three established benchmarks, namely 3DMatch, ETH and KITTI, and compare them to the newly generated versions of the FAUST-partial benchmark. Using this analysis, we explain why the current benchmarks are insufficient for an in-depth analysis of a registration method.

As already mentioned, Fig. 3 plots the KDE and carpet plot for the Euler angle ranges around the x, y and z axes, respectively. As can be seen, the 3DMatch benchmark has the largest variability in angle ranges. Since the benchmark is itself a combination of existing 3D registration benchmarks, this makes intuitive sense. The ETH and KITTI benchmarks show a very small angle variation around the x and y axes and a large angle variation around the z axis. This again makes intuitive sense, since these benchmark datasets are obtained with a 3D scanner mounted upright, either on a car (KITTI) or on a tripod (ETH). The newly created FP\(-R-\{E,M,H\}\) benchmarks, on the other hand, provide a large variety of angle ranges for all three axes, covering the whole spectrum of possible values. Additionally, the benchmarks provide a clear difficulty increase of the sampled rotation range without any overlap of the ranges between them. Hence, the benchmark can be used to thoroughly analyze the rotation robustness of a method.

Fig. 4 Translation ranges for each dataset. We plot the KDE and carpet plot using the translation vector norms of each example. To facilitate the comparison, we normalize each KDE using the maximal binned frequency

Figure 4 overviews the translation ranges for each benchmark. As can be seen, the translation ranges are clearly connected to the type of scene that is being scanned. The 3DMatch benchmark dataset is comprised of indoor scenes that limit the scanner movement range; hence, it shows the smallest translation ranges, peaking at 0.5 m. The ETH benchmark dataset is comprised of outdoor scenes that allow for a greater scanner movement range; hence, the translation ranges increase, lying mostly between 1 m and 3 m. The KITTI benchmark dataset is comprised of lidar scans from a moving vehicle sampled approximately every 10 m, as clearly visible in the figure. The newly created FP\(-T-\{E,M,H\}\) benchmarks cover the whole range from 0 to 10 m over the three difficulty levels, without any overlap. Therefore, they can be used to determine the robustness of a 3D registration method toward an increasing level of translation between point clouds.

Fig. 5 Overlap ranges for each dataset. We plot the KDE and carpet plot using the overlap between the registration examples. To facilitate the comparison, we normalize each KDE using the maximal binned frequency

Figure 5 overviews the overlap range for each benchmark. To compute the overlap percentage, we compute the inlier ratio between the two registration examples. We use an adaptive distance threshold of 3 times the median resolution of the source point cloud. As can be seen, all the datasets have most examples at around \(40\%\) overlap. A clear cut at \(30\%\) can be seen for the 3DMatch and ETH benchmarks, which explicitly only take examples with overlap greater than \(30\%\). The newly created FP\(-O-\{E,M,H\}\) benchmarks span three non-overlapping ranges covering an overlap from \(10\%\) to \(100\%\). By differentiating the difficulty levels, the benchmarks allow evaluating the robustness of a method w.r.t. a decreasing overlap between registration pairs.
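A sketch of this overlap computation; we read "resolution" as the median nearest-neighbor distance within the source cloud, an assumption of this sketch, and `T_gt` is the ground-truth alignment:

```python
import numpy as np
from scipy.spatial import cKDTree

def overlap(source, target, T_gt):
    """Overlap as the inlier ratio under the ground-truth alignment, with a
    threshold of 3x the (median nearest-neighbor) resolution of the source."""
    src = source @ T_gt[:3, :3].T + T_gt[:3, 3]
    d_self, _ = cKDTree(src).query(src, k=2)   # distance to the closest
    thr = 3.0 * np.median(d_self[:, 1])        # other point = "resolution"
    d, _ = cKDTree(target).query(src)
    return float(np.mean(d < thr))
```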

Even though the current benchmarks are not sufficient for an in-depth analysis, we emphasize that they are not rendered obsolete by introducing the FP benchmark. On the contrary, the FP benchmark is complementary and should be used in addition to the current benchmarks when evaluating a 3D registration method. The existing benchmarks still provide a different point cloud distribution in their datasets (indoor and outdoor scans, different resolution, different number of points, etc.) compared to the FP benchmark (human bodies). We provide some of those dataset statistics in Table 2.

Table 2 Benchmark statistics for the 3DMatch, KITTI, ETH and FP benchmarks

As can be seen from Table 2, the point clouds in the different benchmarks can still provide an answer to the generalization question regarding different statistics, such as point cloud resolution, number of points, and point cloud size. Therefore, to get a complete picture of a 3D registration method, it should be evaluated using all the provided benchmarks.

Benchmark difficulty The benchmark comparison from the previous section provides a clear indicator of the benchmark difficulty. Having a small rotation range, a small translation range, and a large overlap range decreases the difficulty of a benchmark. Conversely, having a large rotation range, a large translation range, and a small overlap range increases the difficulty of a benchmark. This intuition is clearly backed by the evaluation results from Sect. 4.4, where increasing parameter difficulty decreases the number of successful registrations. Therefore, we propose that the analysis given in the previous section be used as an indicator of the difficulty of any 3D registration benchmark.

4.2 Metrics

Following [12, 15, 71, 85, 95], we evaluate the results using the relative rotation error (RRE), the relative translation error (RTE) and the registration recall (RR) measures. The relative rotation error measures the relative angle in degrees between the ground truth \(R^*\) and estimated \({\hat{R}}\) rotation matrices:

$$\begin{aligned} RRE = \arccos \left( \frac{\text {trace}({{\hat{{\textbf{R}}}}}^T \mathbf {R^*}) - 1}{2}\right) \frac{180}{\pi } \end{aligned}$$
(14)

The relative translation error measures the distance between the ground truth \(\mathbf {t^*}\) and the estimated \({\hat{\textbf{t}}}\) translation vectors:

$$\begin{aligned} \text {RTE} = \Vert \mathbf {t^*} - {\hat{\textbf{t}}} \Vert _{2} \end{aligned}$$
(15)

The registration recall measures the fraction of successfully registered pairs of point clouds. A registration is deemed successful (or a true positive in terms of the recall measure) if its RRE and RTE are below predefined thresholds \(\tau _r\) and \(\tau _t\):

$$\begin{aligned} RR = \frac{1}{|\Omega |} \sum _{(i,j) \in \Omega } \mathbbm {1}_{\left\{ \text {RRE}(i,j)< \tau _r \hspace{0.2cm} \wedge \hspace{0.2cm} \text {RTE}(i,j) < \tau _t \right\} } \end{aligned}$$
(16)

where \(\Omega \) is the set of all the point cloud registration pairs (i, j) in the dataset, \(\mathbbm {1}\) is an indicator function and \(\text {RRE}(i,j), \text {RTE}(i,j)\) indicate the \(\text {RRE}\) and \(\text {RTE}\) for the registration pair (i, j). Following standard practice, the final RRE and RTE measurements are averaged only over the successfully registered pairs (i, j) obtained from the RR.
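For reference, the three measures amount to a few lines of NumPy (a sketch; the thresholds default to the 3DMatch values used in Sect. 4.4, and the clip guards the arccos against numerical noise in the trace):

```python
import numpy as np

def registration_metrics(R_hat, t_hat, R_gt, t_gt, tau_r=15.0, tau_t=0.30):
    """RRE (degrees), RTE, and the success flag entering the recall, Eq. (16)."""
    cos = np.clip((np.trace(R_hat.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)  # Eq. (14)
    rre = np.degrees(np.arccos(cos))
    rte = np.linalg.norm(t_gt - t_hat)                                # Eq. (15)
    return rre, rte, bool(rre < tau_r and rte < tau_t)
```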

4.3 Parameters

To fully define the proposed registration method, we need to set the parameters of the angle step S, voxelization values PV and NV, and voxelization resolution VR. We use an angle step of \(S=10^\circ \), which determines the number of rotation matrices \(N=162 \times (360/S) = 5832\). As already mentioned, we remove duplicate rotations and obtain \(N=3536\) rotation matrices \(R_i\). We use a voxel value of \(PV=5\) for the positive voxel and \(NV=-1\) for the negative one. Intuitively, these values promote high cross-correlations in regions with good overlap where filled and empty voxel spaces in both the source and target volumes coincide. More details are provided in Sect. 5.

The only parameter we vary for each benchmark is the voxel resolution VR, since the datasets vary greatly in their dimensions, ranging from average volumes of \(152.7\text {m} \times 95.4\text {m} \times 11.3\text {m}\) for KITTI to \(0.7\text {m} \times 1.7\text {m} \times 0.5\text {m}\) for FP. We use a voxel resolution VR of 7 cm, 75 cm, 60 cm and 6 cm for the 3DMatch, KITTI, ETH and FP benchmarks, respectively. For the refinement strategy, we use \(i=500\) iterations of the generalized ICP with \(q=0.25\) for 3DMatch and FP, and \(q=0.80\) for ETH and KITTI.

4.4 Results

3DMatch Following standard practice [12,13,14, 94], we evaluate our EGS method on the 8 test scenes and align all fragments with a minimum overlap of \(30\%\). We use the common thresholds \(\tau _r = 15^\circ \) and \(\tau _t = 30\)cm. For a fair comparison, we use the overlaps computed in [14], which slightly differ from ours computed in the previous section. Instead of finding the overlap between complete point clouds, [14] first voxel-downsamples the point clouds and then computes the overlap. The difference between their overlaps and ours is 14 percentage points (pp) on average.

Table 3 Results for \(\tau _r = 15^\circ \) and \(\tau _t = 30\)cm on the 3DMatch benchmark

The results presented in Table 3 are divided into two parts: traditional and deep learning methods. Naturally, the deep learning methods clearly dominate the traditional methods, since they are evaluated on the same dataset they were trained on.

Among the traditional methods, the handcrafted feature-based ones perform better than the optimization-based ones. The reasons are the large initial displacements of the point clouds and the noise in the scans, which are known to influence the optimization-based methods. The best traditional methods, however, are those that focus on filtering the outliers and finding the good correspondences. As can be seen, SC2-PCR achieves the best result among the current traditional methods. It shows that using a second-order spatial compatibility measure as guidance for sampling inliers facilitates obtaining an outlier-free correspondence set.

The deep learning methods are further divided into feature learning methods, robust estimation methods and end-to-end learning-based methods. Here, the differences between the best methods of each category are much smaller, slightly favoring the end-to-end learning-based methods. Feature learning methods achieve the lowest recall measures, with GeDi at the forefront. Robust estimation methods build on top of those learned features and try to filter out bad correspondences. The best such method, CSCE-Net, combines the second-order spatial compatibility from SC2-PCR and the spectral matching from SM [82] with a novel channel-spatial contextual layer that uses self-attention mechanisms to aggregate information in the channel and spatial dimensions. The state-of-the-art results are achieved by the end-to-end learning methods. These methods use self-attention and cross-attention mechanisms [12, 95] to exchange contextual information between the features of the two point clouds to register. This allows for information sharing between the point clouds that traditional and robust estimation methods cannot achieve.

The EGS achieves the best results among the traditional methods, and results comparable to the best deep learning methods, by successfully registering \(84.11\%\) of the 3DMatch pairs. Additionally, the EGS achieves the lowest rotation error, which indicates that the cross-correlation measure does give an indication of the fitness between two point clouds. Contrary to the deep learning methods, however, the EGS is simple and explainable. The effects of different strategies (discussed in Sect. 5) are clear and intuitive, providing insights into the registration process.

Interestingly, most methods average around \(2^{\circ }\) for the RRE measure, while having a much larger RTE measure of around 7 cm. This could indicate noisy point clouds: the dataset contains many flat surfaces that make the rotation click into place, but leave the translation to handle the noise.

KITTI Following common practice [14, 15, 68, 71, 95], we test our EGS method on scenes 8 to 10 using pairs which are at least 10 m away from each other. We use the common thresholds \(\tau _r=5^\circ \) and \(\tau _t=2\)m. The stricter rotation threshold, compared to 3DMatch, reflects the fact that the rotation ranges are much smaller. The more lenient translation threshold reflects the fact that the scenes are large and have a large translation distance between them. We evaluate the generalization from the 3DMatch benchmark dataset to the KITTI benchmark dataset, which poses several challenges: a change in scanning modality (from time-of-flight scanner to lidar scanner) and in point cloud size (from 2.2 m to 86.4 m, averaging the three axes).

Table 4 Results for \(\tau _r=5^\circ \) and \(\tau _t=2\)m on the KITTI benchmark

As can be seen from Table 4, some methods are somewhat able to generalize onto the KITTI dataset. The registration pairs from the dataset are gravity aligned, which is reflected in lower RRE errors compared to the 3DMatch benchmark, since most of the ground-truth rotation comes from rotating around one axis. The fragments are also much bigger than those from 3DMatch (\(152.7\text {m} \times 95.4\text {m} \times 11.3\text {m}\) on average in size compared to \(2.5 \text {m} \times 2.0 \text {m} \times 2.2 \text {m}\) in 3DMatch) which is reflected in higher RTE errors.

Generally, the learning methods outperform the traditional methods, which intuitively makes sense, since the registration pairs showcase a smaller overlap with more noise.

Among the traditional methods, neither ICP variant registers a single example. Intuitively, the point-to-point ICP fails since it is a fine registration method, expecting an already close initial alignment of the point clouds. GO-ICP should address this downside and find the global solution to the registration. However, as noted in previous works [33, 83], the registration method is very sensitive to its parameters. We unsuccessfully test several different parameter settings, as noted in Sect. 6. The exception among the traditional methods is SC2-PCR, which achieves good results using the same FPFH features as the RANSAC methods. This could indicate that the FPFH features are not unique enough for different points in the scene or that the scene contains a lot of similar structures. Only by filtering out bad correspondences using geometric properties does SC2-PCR remove the non-unique matches and achieve good results.

The same trend is also present in the learning methods, where the best performing methods are in the category of robust estimation methods, which filter out the bad correspondences. We can also notice a gap between these methods and the remaining feature-based and end-to-end learning methods. Intuitively, since the KITTI benchmark data are much noisier than the 3DMatch benchmark data, robust estimation methods have the advantage of addressing that noise by eliminating bad correspondences.

The translation error (RTE) ranges from around 7 cm to around 50 cm, with the exception of GeoTransformer, whose RTE is 103.03 cm. Even though it displays the highest RTE measurement, it still achieves a recall of \(67.93\%\), which indicates that using the common-practice threshold of 2 m can be misleading. Hence, we additionally show the results for a stricter \(\tau _t=60\)cm threshold in Table 5.

Table 5 Results for \(\tau _r=5^\circ \) and \(\tau _t=60\)cm on the KITTI dataset

As we immediately notice when comparing Table 4 with Table 5, the biggest drop in performance is seen for the end-to-end deep learning methods GeoTransformer and YOHO, with an average drop of 30.59 recall percentage points. The remaining methods showcase a drop of less than 1 percentage point in recall (fewer than 55 registration pairs), which indicates that the translation error was already low for those methods. The performance drop of the end-to-end learning methods comes from their large memory footprint [95]. Because these methods have very large models that need to fit onto a GPU, they need to compromise by subsampling and scaling down the point clouds. Since larger point clouds, such as the ones in the KITTI dataset, mean a greater number of voxels, these methods need to use a much coarser voxelization in order to be able to register the examples, which in turn affects the results.

The EGS outperforms most deep learning methods, achieving \(94.95\%\) and \(94.59\%\) recall (for the two thresholds, respectively), which makes it competitive with the best methods, lagging behind the best result by only 4 percentage points on average. Moreover, the EGS achieves the best RRE and RTE measures, outperforming the second best RTE result by 3.49cm on average.

Compared to the 3DMatch results, we notice that the robust estimation learning methods, along with our EGS, follow an upward trend, achieving better results on the KITTI benchmark. The reason is that, w.r.t. the registration parameters, KITTI is (on average) an easier benchmark to register than 3DMatch. It is, however, a harder benchmark w.r.t. the noise in the point clouds; because these methods are robust to noise, their results improve. On the other hand, the end-to-end and feature-based methods are affected by the noise and, therefore, follow a downward trend.

Table 6 Results for \(\tau _r=5^\circ \) and \(\tau _t=30\)cm on the ETH dataset

ETH Following common practice [71, 72, 108], we register only point clouds with an overlap greater than \(30\%\). We set the thresholds to \(\tau _r=5^\circ \) and \(\tau _t=30\)cm. The stricter rotation threshold, compared to 3DMatch, reflects the fact that the rotation ranges are much smaller. Similarly to the 3DMatch benchmark, we use the overlaps computed in [108] for a fair comparison. The average difference between their overlaps and ours is 8 percentage points. We evaluate the generalization from the 3DMatch benchmark dataset, which poses several challenges: a change in scanning modality (from time-of-flight scanner to laser scanner), in point cloud size (from 2.2m\(^3\) to 26.3m\(^3\), averaging the three axes) and in point cloud structure (from flat tables, floors and walls to scattered vegetation).

As can be seen in Table 6, the learning methods again outperform the traditional methods. The general performance of the traditional methods is worse than on the KITTI benchmark. The slight increase in performance of the optimization-based methods can be attributed to a few closer initial alignments of the registration pairs, since the recall increases by only 1.82 percentage points. The loss in performance of the feature-based traditional methods, on the other hand, can be attributed to the difference in point cloud distribution; whereas the KITTI point clouds contain much more structure, inferred from the roads, the ETH point clouds contain more fuzzy vegetation.

The learning methods tend to struggle with generalizing from the 3DMatch dataset. The best learning method is SC2-PCR, with a recall of \(92.85\%\). In contrast to the results on the KITTI benchmark, however, the remaining robust estimation method, PointDSC, does not achieve good results. The key difference is that PointDSC learns to embed the input correspondences into a higher-dimensional space using the 3DMatch dataset, whereas SC2-PCR does not. This, in turn, makes generalization harder for PointDSC. Surprisingly, GeoTransformer achieves the lowest recall of \(4\%\). Similarly to the KITTI benchmark, this result can be partly explained by the scaling of the input point clouds. As the authors point out [95], the method suffers from a big memory footprint, which means that the downsampling rate needs to balance performance and efficiency. For bigger and denser point clouds, the solution is to scale them in order to simulate the density and inlier rate of the training 3DMatch dataset, which does not always work.

Our EGS method achieves the best result on the ETH benchmark, outperforming all the traditional and learning methods. Additionally, the EGS achieves the lowest RRE and RTE measures. Since the ETH benchmark has fewer big flat surfaces than the previous two benchmarks, the EGS suffers less from its main limitation and therefore achieves better results. More details are provided in Sect. 5.5.

Compared to the 3DMatch results, we notice a downward trend in the results of the learning methods, which affirms our hypothesis that learning methods have difficulties generalizing onto different datasets. The larger noise factor in the ETH benchmark influences all the learning methods, which seem to prefer more rigid structures, like those they have been trained on. Compared to the KITTI results, the feature-based and end-to-end methods seem to improve. This indicates that the learned features transfer better onto the ETH dataset than onto the KITTI dataset. The robust estimation methods, on the other hand, perform worse, because they try to eliminate the outliers using geometric properties in a dataset with a much noisier point cloud distribution coming from the vegetation.

FAUST-partial We set the thresholds to \(\tau _r=10^\circ \) and \(\tau _t=3\)cm. The stricter thresholds reflect the fact that the fragments from FAUST-partial are much smaller in volume than those of all the other datasets; misalignments on smaller fragments are much more noticeable. We evaluate the generalization from the 3DMatch dataset, which poses several challenges: a change in scanning modality (from time-of-flight scanner to multi-view stereo), in point cloud size (from 2.2m\(^3\) to 0.9m\(^3\), averaging the three axes) and in point cloud structure (from flat tables, floors and walls to curvy human bodies).

Table 7 FAUST-partial results for different dataset difficulties

In Table 7, we evaluate GO-ICP [33], SC2-PCR [84], GeDi [27] and GeoTransformer [95] on the FAUST-partial benchmark. The chosen methods represent the state of the art from each category in Table 3, namely the traditional optimization, traditional feature, deep learning feature, robust estimation and end-to-end learning-based categories. The FAUST-partial benchmark datasets allow analyzing each method w.r.t. the three primary registration parameters: the rotation range, the translation range and the overlap range.

As can be seen from FP–R in Table 7, GeDi is very robust to the increasing rotation range, achieving an almost perfect recall of \(99\%\) on all three difficulties. All the other methods show a small dip in recall at the medium difficulty and a larger dip at the hard difficulty, w.r.t. the easy difficulty. The RRE and RTE metrics naturally follow the inverse relationship and increase with increasing difficulty. The exception to this is GO-ICP, which registers only a few examples of the FP–R–M dataset. Following the author guidelines, we center and scale the point clouds and run several different parameter choices to improve the results. However, the method still fails on the remaining examples. Analyzing the results, it would seem that the method suffers mostly from wrongly estimated translations, with the rotations just above the RRE threshold of \(10^{\circ }\). More details are provided in Sect. 6. SC2-PCR seems more robust to the rotation changes when using the FCGF features than when using the FPFH ones, which indicates that the FCGF training produces rotation-invariant features. The EGS achieves the second best results on two out of three datasets, namely FP–R–E and FP–R–M. Additionally, it achieves the best RRE and RTE measures on all three datasets. These results indicate, however, that our \(\text {SO}(3)\) parametrization still lacks uniformity, since bigger rotation ranges should not, in theory, affect the method.

Analyzing FP–T in Table 7, we can clearly see that the change in translation does not affect the registration recall; all the methods seem to be invariant to the translation. Again, the exception is GO-ICP, which does not register a single example. The EGS outperforms all the methods on FP–T–M and achieves the second best result on FP–T–E. Again, it achieves the best RRE and RTE measurements.

Analyzing FP–O in Table 7, we can clearly see that the overlap greatly affects the registration results. GeDi and GeoTransformer take the biggest hit from a lower overlap, where the recall drops to \(8.70\%\) and \(2.64\%\), respectively. As noted by the authors [27], the low overlap between the point clouds leaves partial structures with little geometric information, which leads to registration failure. Interestingly, contrary to the FP–R scenario, the FPFH features seem to be more robust than the FCGF ones w.r.t. the smaller overlap. This would indicate that, even though the FCGF features are more distinctive, they fail in overlap regions with little geometric information. In contrast, the EGS achieves the best recall on FP–O–M and the second best result on FP–O–H. This would indicate that the cross-correlation is a good indicator of the overlap region between two partial point clouds.

5 Ablation study

In Table 8, we evaluate the different strategies tested on the 3DMatch benchmark. Each strategy is then analyzed to gain key insights into the registration process. We mark with Ref the reference result (under column #) already presented in Table 3. Each of the remaining results varies a single part of the method pipeline of the reference result Ref, such as the voxelization, filling, rotation or refinement strategy.

Table 8 The different strategies tested for the EGS method

5.1 Voxelization strategy

We test several different voxelization resolutions VR, ranging from 3 to 9cm. As can be seen from results #1 - #5, the voxelization resolution does not significantly affect the results. Contrary to intuition, however, we notice that a finer voxel resolution does not automatically improve the results. The reason is that a coarser resolution registers the global shape of the scene instead of being affected by the noise, which is more prominent at finer resolutions.

5.2 Filling strategy

As already noted, voxelizing a point cloud results in a 3D grid volume where a value of \(PV=1\) indicates that a point from the point cloud is present in that specific grid box (voxel). Conversely, a value of \(NV=0\) indicates that no points from the point cloud are present in that voxel. We test this simplest voxelization strategy under result #6 in Table 8. As can be seen, this strategy does not perform well compared to the others. To improve the registration results, we experiment with \(PV=5\) and \(NV=-1\). Intuitively, the idea is to promote high cross-correlations in regions with good overlap, where filled and empty voxel spaces in both volumes coincide. More concretely, we encourage filled voxels to coincide by giving them a higher value in the cross-correlation (\(5\times 5=25\) in the cross-correlation operation); encourage, to a lesser extent (but still do not penalize), empty voxels to coincide by giving them a small positive value (\(-1\times -1=1\)); and discourage any of the leftover cases (\(5\times -1=-5\)). As can be seen from result #7 in Table 8, this strategy slightly degrades the RR measure, whilst improving the RTE.

Along with the PV and NV values, we additionally experiment with the value of the padded voxels, controlled by the parameter PDV. As seen from the reference result Ref, using a padding value of −1 drastically improves the results w.r.t. results #6 and #7, which use a value of 0. This strategy allows the method to register point clouds with smaller overlap, because it does not penalize moving the source point cloud beyond the volume of the target point cloud.
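The filling strategy can be summarized in a few lines. The following is a minimal sketch under the notation above (VR, PV, NV, PDV); the function names are ours and the actual implementation may differ:

```python
import numpy as np

def voxelize(points, vr=0.07, pv=5.0, nv=-1.0):
    """Voxelize an (N, 3) point cloud at resolution vr (meters):
    voxels containing points get PV, empty voxels get NV."""
    idx = np.floor((points - points.min(axis=0)) / vr).astype(int)
    grid = np.full(tuple(idx.max(axis=0) + 1), nv, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = pv
    return grid

def pad_volume(grid, pad, pdv=-1.0):
    """Pad the target volume with PDV so that moving the source beyond
    the target bounds is not penalized in the cross-correlation."""
    return np.pad(grid, pad, mode="constant", constant_values=pdv)
```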

We experiment with additional voxelization strategies that are based on the same idea of promoting higher cross-correlations in salient regions: importance and layering. The importance strategy tries to emphasize salient points on the point cloud that should be prioritized in the registration process. The voxels containing these important points receive a higher value than the remaining filled voxels. Therefore, this strategy uses the already seen \(PV=5\) and \(NV=-1\) with the addition of an intermediary voxel value \(IV=2\). Voxels with salient points are given a value of PV, whereas voxels with non-salient points are given a value of IV. The reasoning for this strategy is discussed further in the Limitations section. To find non-salient points, we use the difference of normals (don) [112] to find points on flat surfaces. First, we compute the normals at each point using two neighborhood radius distances, \(r_1 = 1\)cm and \(r_2 = 3\)cm. The neighborhood radius distance determines the points used for finding the spanning plane from which the normal is computed. Intuitively, if a point in the point cloud is located on a flat surface, its two computed normals should have similar directions. Hence, we compute the difference of the two normal vectors and find its \(L^2\) norm. Finally, we threshold these norms to retain only salient points not located on flat surfaces. An example of the don filtering can be seen in Fig. 6. As we see from result #8 in Table 8, this strategy does not improve the recall, but improves the rotation and translation errors. Hence, we do not choose it for our method, since the voxelization process is more complex than the one in the reference Ref strategy.
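A sketch of the don filtering under the stated radii is given below, assuming a recent Open3D version; the threshold value is illustrative and the normal orientations are assumed consistent:

```python
import copy
import numpy as np
import open3d as o3d

def don_filter(pcd, r1=0.01, r2=0.03, threshold=0.2):
    """Keep points whose normals differ between a small and a large
    neighborhood radius, i.e., points not lying on flat surfaces.
    Assumes consistently oriented normals (sign flips would inflate
    the difference norm)."""
    small, large = copy.deepcopy(pcd), copy.deepcopy(pcd)
    small.estimate_normals(o3d.geometry.KDTreeSearchParamRadius(r1))
    large.estimate_normals(o3d.geometry.KDTreeSearchParamRadius(r2))
    don = np.linalg.norm(
        np.asarray(small.normals) - np.asarray(large.normals), axis=1)
    return pcd.select_by_index(np.where(don > threshold)[0].tolist())
```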

Fig. 6

Removing points from flat surfaces using the difference of normals (don) [112]. The blue points represent the whole point cloud. The red points represent the retained points after the don filtering. As can be seen, points from flat surfaces are mostly rejected

The layering strategy adds a layer of \(LV=-2\) voxel values around flat surfaces, besides the classical \(PV=5\) and \(NV=-1\). The reasoning behind this is that sometimes the registration is very close to the solution but needs a further push to snap into place. By providing negative values on the borders of flat surfaces, we promote exactly that. As can be seen from result #9 in Table 8, however, the recall is very similar to the importance and reference Ref results and does not additionally improve the result.

In Fig. 7, we show the cross-correlation volumes for different voxelization strategies under the ground-truth rotation matrix. As can be seen, most strategies tend to suffer from wrong or local maxima, except our strategy of choice. Since, in this registration example, the correct translation voxel is located on the bounds of the volume, the strategies using \(PDV=0\) fail and push the translation toward a bigger overlap. The importance strategy prioritizes the translation onto falsely important points not located on the floor, and the layering strategy moves the translation further from the bounds, since the LV values push it further along the floor plane. The strategy of choice, with \(PV=5\) and \(NV=-1\), suppresses these local maxima and pushes the translation into the correct voxel location, as can be seen from the image.

Fig. 7

The different voxelization strategies tested. PV and NV are the voxelization values for the filled and empty voxels. PDV is the padding value. Yellow regions indicate higher cross-correlation values, purple regions indicate lower cross-correlation values. There are white spaces in the cross-correlation volume because the values are clipped so that only the top \(10\%\) of values are shown. The light green dot indicates the ground-truth location where the central voxel of the red point cloud should be located. The blue dot indicates the estimated location where the central voxel of the red point cloud should be located according to the cross-correlation from the EGS method. For more details, refer to Sect. 5

5.3 Rotation strategy

We test several different rotation parametrizations to cover the \(\text {SO}(3)\) space. We start with the simplest case of uniformly sampling the 3 Euler angles \(\alpha , \beta \) and \(\gamma \) that rotate around the x, y and z axes, respectively. Each angle is sampled with an angle step S, from the range \(\langle -\pi , \pi \rangle \) for \(\alpha \) and \(\gamma \), and the range \(\langle -\frac{\pi }{2}, \frac{\pi }{2} \rangle \) for \(\beta \). Starting with result #10 from Table 8, we use an angle step of \(S=15\) and obtain a good recall of \(80.56\%\). This strategy, however, comprises \(N=6364\) rotation matrices, unnecessarily increasing the runtime of the algorithm. We therefore limit the range for all three Euler angles to \(\langle -\frac{\pi }{2}, \frac{\pi }{2} \rangle \) in result #11 to reduce the computation time. As we can notice, the recall further increases to \(82.27\%\). The reason behind this slight increase is that, by decreasing the range of the rotation, we eliminate big rotations as an option for the EGS to choose from. Since the 3DMatch dataset comprises indoor scans, the rotations in the registration pairs are always in the lower ranges. Consequently, we remove the possibility for the EGS to make obscure registrations between floors and walls, or wrong walls (discussed in more detail in Sect. 5.5). We further reduce the angle step to \(S=10\) in result #12 to obtain a finer discretization of the rotation space. This parametrization achieves the best recall of \(86.41\%\). However, it increases the number of rotations again, to \(N=6177\), and does not behave consistently on the other benchmark datasets. Hence, we do not use it as the final solution.
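A minimal sketch of this Euler-angle grid follows (our own illustration; the exact endpoint handling, and hence the resulting N, may differ from the paper's):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def euler_grid(step=15.0, full_range=True):
    """Uniform Euler-angle grid over SO(3): alpha and gamma span a full
    turn (or a half turn when full_range is False), beta spans a half turn."""
    lim = 180.0 if full_range else 90.0
    alphas = np.arange(-lim, lim, step)
    betas = np.arange(-90.0, 90.0, step)
    gammas = np.arange(-lim, lim, step)
    angles = [(a, b, g) for a in alphas for b in betas for g in gammas]
    return Rotation.from_euler("xyz", angles, degrees=True).as_matrix()
```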

Uniformly sampling the Euler angles does not result in uniformly sampling the rotation space \(\text {SO}(3)\). Intuitively, however, a more uniform parametrization could represent the rotation space better and potentially require fewer rotations N. In the search for a more uniform discretization of the rotation space, we experiment with the Hopf [113] and Super-Fibonacci [114] parametrizations. For the Hopf parametrization, we use the rotations computed in [115], originally proposed in [113]. They start by sampling the sphere \({\mathbb {S}}^2\) with the HEALPix [116] representation and sampling the circle \({\mathbb {S}}^1\) uniformly. Next, the sphere sampling is converted into spherical coordinates, which, together with the \({\mathbb {S}}^1\) sampling, can be used as a sampling of the Hopf coordinates. The Hopf coordinates are finally converted into \(N=4608\) quaternions. The Super-Fibonacci [114] parametrization uses the Fibonacci sampling twice to generate points in a cylinder, which are then mapped to the 3-sphere using a volume-preserving mapping. As can be seen from #13 and #14, however, the results do not reflect our intuition, since the recalls drop slightly to \(79.51\%\) and \(75.71\%\) from the Ref result.

Inspired by the uniformity of \(\text {SO}(3)\), we test a similar sampling method that ends up being our method of choice. The main idea is to sample the \({\mathbb {S}}^2\) sphere using a geodesic polyhedron and to sample the \({\mathbb {S}}^1\) circle with a simple uniform interval sampling. The points on the sphere then act as the axes and the points on the circle as the angles of the angle-axis rotation representation. We experiment with several samplings of the \({\mathbb {S}}^2\) sphere and several angle steps S for the \({\mathbb {S}}^1\) sampling. In results #15 and #16, we sample the \({\mathbb {S}}^2\) sphere with the geodesic polyhedron \(\{3,5+\}_{2,0}\), a regular shape with 42 vertices on the unit sphere, all with equidistant neighbors. In #15, we use an angle step of \(S=15\) for sampling the circle \({\mathbb {S}}^1\). The results for this option are quite low, which intuitively makes sense: this parametrization is coarse and is not a representative cover of \(\text {SO}(3)\). Decreasing the sampling angle step to \(S=10\) in result #16 helps the EGS achieve better results, however still lower than the reference result. In results #17 and #18, as well as the reference result Ref, we increase the number of vertices of the polyhedron by splitting the edges of each face, resulting in \(\{3,5+\}_{4,0}\) with 162 vertices. This in turn increases the number of rotation matrices N, since each vertex represents an axis of rotation. All the results seem to benefit from this increase, except result #18, in which we further increase the number of vertices but keep only those in the positive hemisphere of the sphere. The reasoning for this experiment stems from the fact that an axis and its opposite negative axis represent the same set of rotations if we sample the angle from \({\mathbb {S}}^1\) uniformly. By removing the negative axes and doubling the positive ones, we could, in theory, obtain a better and finer representation of \(\text {SO}(3)\). Unfortunately, the intuition is not reflected in the results, which achieve a \(79.45\%\) recall. Result #17, on the other hand, does achieve better results than its counterpart with polyhedron \(\{3,5+\}_{2,0}\). Improving on that, the reference result Ref decreases the angle step to \(S=10\), achieving a finer resolution and better results. Increasing the number of rotations even further, with the polyhedron \(\{3,5+\}_{8,0}\), does not improve the results, as can be seen from result #19, indicating that the discretization of the rotation space is unnecessarily saturated, which introduces noise into the rotation selection.
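The following sketch illustrates this axis-angle sampling, assuming the trimesh library for the geodesic polyhedron (subdivisions 1 and 2 give the 42- and 162-vertex polyhedra \(\{3,5+\}_{2,0}\) and \(\{3,5+\}_{4,0}\)); duplicate rotations from opposite axes are not removed here, so N may differ from the reported values:

```python
import numpy as np
import trimesh
from scipy.spatial.transform import Rotation

def geodesic_rotations(subdivisions=2, step=10.0):
    """Axis-angle sampling of SO(3): axes are the vertices of a subdivided
    icosahedron (a geodesic polyhedron), angles are sampled uniformly."""
    sphere = trimesh.creation.icosphere(subdivisions=subdivisions)
    axes = np.asarray(sphere.vertices)
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    angles = np.deg2rad(np.arange(step, 360.0, step))
    rotvecs = [axis * angle for axis in axes for angle in angles]
    return Rotation.from_rotvec(rotvecs).as_matrix()
```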

Comparing the number of rotation matrices N with the recall RR, we can discern a pattern. The highest and lowest recalls are achieved for the biggest and smallest numbers of rotations, \(N=6177\) and \(N=281\). Whilst a higher number of rotations gives the method more rotations to choose from, it is also important to have a good representation of the \(\text {SO}(3)\) space, as seen from result #11, where using only \(N=1886\) rotations achieves a recall of \(82.27\%\); just under 2 percentage points lower than the reference Ref result.

Comparing the angle step S with the recall RR, we can see a clear pattern: as the angle step decreases, the recall improves. Comparing results #11 - #12, #15 - #16 and #17 - Ref, we see an average increase of 2.43 percentage points when going from an angle step of \(S=15\) to \(S=10\).

5.4 Refinement strategy

We experiment with a few refinement strategies, namely point-to-point ICP (P2Point) [30], point-to-plane ICP (P2Plane) [31] and generalized ICP [106]. As can be seen from results #20 to #29, the different versions of ICP perform similarly, with generalized ICP having a slight advantage. Therefore, we choose it as our refinement strategy. We further test several choices for the hyperparameters of the generalized ICP algorithm. In results #20 - #25, we test different options for the ICP inlier ratio hyperparameter. To find the inlier ratio, we use a quantile threshold q of the nearest-neighbor distances of all the points in the source point cloud. As we can see, staying in a reasonable inlier threshold range (\(<50\%\)) for the 3DMatch data does not gravely affect the results. In results #26 and #27, as well as the reference result Ref, we test the maximum number of iterations generalized ICP can perform. As can be seen, the best recall is achieved using 500 iterations. However, the results for 100 and 30 iterations achieve very similar recalls of \(83.91\%\) and \(83.39\%\).
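A sketch of this refinement step, assuming a recent Open3D version; the quantile-based inlier threshold follows our reading of the text above, and all names are ours:

```python
import copy
import numpy as np
import open3d as o3d

def refine(source, target, init_T, q=0.5, max_iter=500):
    """Generalized-ICP refinement of a coarse alignment init_T (4x4).
    The inlier distance is the q-quantile of source-to-target
    nearest-neighbor distances under the coarse alignment."""
    src = copy.deepcopy(source)
    src.transform(init_T)
    dists = np.asarray(src.compute_point_cloud_distance(target))
    inlier_thresh = float(np.quantile(dists, q))
    result = o3d.pipelines.registration.registration_generalized_icp(
        source, target, inlier_thresh, init_T,
        o3d.pipelines.registration.TransformationEstimationForGeneralizedICP(),
        o3d.pipelines.registration.ICPConvergenceCriteria(max_iteration=max_iter))
    return result.transformation
```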

All the experiments for the refinement strategy indicate that the EGS does not depend on the performance of the refinement algorithm and is not sensitive to its hyperparameter choices.

5.5 Limitations

The main limitation of the EGS stems from aligning the wrong big flat surfaces. This can manifest as wrong translations, where walls or floors are slightly offset; wrong rotations, where one point cloud is rotated so that the floors coincide; or simply wrong matching of flat surfaces, such as aligning the wrong walls, floors or tables. One such extreme example can be seen in Fig. 8, where the wall of one point cloud is registered onto the floor of the other. Each point cloud in this example has more than \(80\%\) of its points located on a flat surface: either the floor, a wall or a table.

Fig. 8

EGS registration failure case. The wall of one point cloud is registered onto the floor of the other point cloud

To fully alleviate this issue, further knowledge would have to be introduced into the model. Since our method is featureless, it does not have the information necessary to distinguish walls from floors or tables; in short, all big flat surfaces are dealt with in the same manner. Introducing further knowledge, however, opposes our approach of having a simple method, along with our initial hypothesis that learning feature-based methods are unable to generalize onto different benchmarks. We experiment, therefore, with geometric properties that can be deterministically extracted from each point cloud and are pose invariant, such as the difference of normals (don), farthest point sampling (fps) and the importance strategy. Below, we overview the strategies we experiment with to address the problem of aligning flat surfaces.

We start by purposefully voxelizing the point clouds using \(PV=5\) and \(NV=-1\) to discourage such registrations and promote those in which both the positive voxels (those with value PV) and negative voxels (those with value NV) are aligned onto each other. However, when big flat surfaces are present in the point clouds, there are cases where the alignment of the positive voxels from these surfaces trumps other, more meaningful alignments. In other words, aligning flat surfaces results in big cross-correlation values because of the numerous alignments of positive voxels, and consequently the numerous \(5 \times 5\) summands in the cross-correlation.

To address this issue further, we try to remove points from flat surfaces with the subsampling strategy. In result #31, we use the difference of normals (don) filtering mentioned in the filling strategy section to remove points from flat surfaces. In result #32, we reduce that subsampled set of points even further, using the fps algorithm to pick the 5000 points with the farthest distances. As can be seen from the results, these strategies lower the recall. The reason is that, in many cases, the flat surfaces also guide the method to align the surfaces together. Therefore, to leave the flat surfaces in the point clouds, we experiment with the importance strategy, which lies halfway between the subsampling and reference Ref strategies regarding the filling: whereas the subsampling strategy removes points on flat surfaces, the importance strategy assigns lower weights to them. As already discussed in the filling strategy section, we emphasize salient points using the don sampling. Even though this strategy restores the original recall, it does not improve on the reference Ref result.

By using a subsampling strategy that does not discriminate against flat surface points, such as farthest point sampling, the results barely change compared to using all the points. As can be seen from result #30, when subsampling to 5000 points from an average of 338000 points per cloud in the 3DMatch dataset, the results drop by only 0.72 percentage points. This means that the EGS is robust to the number of points in the point cloud and does not require dense point clouds for registration.
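For reference, farthest point sampling admits a very short implementation; the following numpy sketch is our own illustration:

```python
import numpy as np

def farthest_point_sampling(points, k=5000, seed=0):
    """Iteratively pick the point farthest from the already selected set."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(points)))]
    dists = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))  # farthest point from the current set
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]
```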

Even though the alignment of flat surfaces presents a problem for our method, analyzing the results on the 3DMatch, KITTI and ETH benchmarks, we notice that it does not pose a severe problem. Despite the fact that the point clouds from these datasets mostly contain big flat surfaces (walls, floors, roads, pavements, etc.), we can see from Tables 3, 4, 5 and 6 that our method still obtains very high results on all three benchmarks, even achieving the best results on the ETH benchmark. This is because the point clouds also contain objects with distinguishable shapes, such as indoor appliances or outdoor vegetation, which steers our method toward focusing on these shapes instead of prioritizing the flat surfaces.

5.6 Computational complexity

We discuss the computational complexity of the EGS method in terms of the average runtime and resources needed for registering two point clouds from the 3DMatch benchmark. These metrics directly correlate with the applicability of a method, so we compare them to those of the latest learning methods. Because of their large memory footprint, the learning methods use only 5000 points from each point cloud. Therefore, for a fair comparison, we use farthest point sampling to get 5000 points from each point cloud and re-run the EGS on the 3DMatch benchmark.

Before comparing the runtime and resources, however, we provide the results for the EGS on the 3DMatch, KITTI and ETH benchmarks using only 5000 points from each point cloud.

Table 9 Results for the EGS method on the 3DMatch, KITTI and ETH benchmarks using only 5000 points from each point cloud obtained with the farthest point sampling algorithm
Table 10 Average runtime in seconds for registering a pair of point clouds from the 3DMatch benchmark
Table 11 Average runtime breakdown in seconds for a registration pair from the 3DMatch benchmark

As can be seen from Table 9, the results stay almost the same compared to using all the points from the point clouds. Therefore, we can fairly compare the runtime and resources with the learning methods, knowing that the final results of the EGS are not impaired. In the final version of the EGS, however, we still use all the available points, in order to eliminate the information loss and the additional overhead of running the farthest point sampling algorithm.

As can be seen from Table 10, the learning methods are very fast; with the exception of GeDi, SC2-PCR and GeoTransformer operate in under a second. GeDi, on the other hand, runs for 41.51 seconds on average; the longer runtime comes from the computation of a local reference frame at each extracted patch. Our non-learning, featureless EGS method runs for 12.11 seconds on average, with a deviation of 1.99 seconds. Compared to GeoTransformer and SC2-PCR, the EGS has a significantly higher runtime. However, as we show in the following discussion, these methods also require far more computing resources than the EGS, without the benefit of improving the registration results. Therefore, their lower runtime might not justify their usage, since their applicability is impaired. We note that Table 10 does not measure the time necessary to load the model onto the GPU, nor the time for obtaining the farthest point sampling indices, which, differently from the EGS, are a necessity for the learning methods.

We break down the average runtime of each part of our pipeline in Table 11 and present the cumulative runtime results. To measure the cumulative runtimes, we omit the GPU parallelization, therefore obtaining higher numbers than the presented 12.11 seconds. Nevertheless, these results still provide an intuition about the various parts of the EGS method. To recap, the EGS pipeline comprises parts that are repeated N times (such as rotating the source point cloud or performing the cross-correlation) and parts that are executed only once (such as estimating the final rotation and translation). For the pipeline parts that are repeated N times, we sum up their N runtimes to obtain the cumulative running time.

As can be seen from Table 11, the largest portion of time is spent on the cross-correlation part. Even though this runtime seems large, we note that the average size of the voxelized point clouds from the 3DMatch dataset is \(35.7 \times 28.5 \times 31.4\) voxels for the voxel resolution of \(VR = 7\text {cm}\). This, in addition to repeating the cross-correlation \(N=3536\) times, once for each precomputed rotation \(R_i\), makes the 10.28 seconds more plausible. To accelerate the cross-correlation, we make use of the FFT algorithm. Running the cross-correlation without an FFT implementation in the background (like the one in PyTorch [117], for example) is on average around 17 times slower than our method. The problem arises because these implementations are not built for having both the source and target volumes of such great size. The FFT, in turn, reduces the complexity from \(O(K^6)\) to \(O(K^3\log (K))\) [118], where K indicates the size of the voxelized volumes. Since K can reach very high values (even \(203.65 \times 127.24 \times 15.08\) voxels on average for the larger KITTI dataset), using the FFT algorithm greatly reduces the runtime.
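The core of this acceleration is the FFT convolution theorem. The sketch below is our simplification: it uses implicit zero padding, whereas the EGS pads the volumes with PDV before the transform. It shows a linear 3D cross-correlation via torch.fft:

```python
import torch

def fft_cross_correlation(target, source):
    """Linear 3D cross-correlation of two voxel volumes via the FFT.
    Non-negative offsets occupy the front of each axis; negative
    offsets wrap around to the end."""
    shape = [t + s - 1 for t, s in zip(target.shape, source.shape)]
    f_target = torch.fft.rfftn(target, s=shape)
    f_source = torch.fft.rfftn(source, s=shape)
    # Correlation theorem: corr(source, target) = IFFT(conj(F(source)) * F(target)).
    return torch.fft.irfftn(torch.conj(f_source) * f_target, s=shape)
```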

In contrast to the learning methods, the EGS does not limit the number of points it can use during the registration process. To register two point clouds from the 3DMatch dataset with 338000 points each on average, the EGS takes 27.80 seconds. The rise in running time, compared to using only 5000 points, comes from the pre-processing of the source point cloud and the refinement parts. Since the point clouds are much bigger (67.6 times more points on average) and these parts are computed on the CPU, their runtime gets inflated. The runtimes of the remaining parts, namely the pre-processing of the target point cloud, the cross-correlation and the estimation, remain approximately the same, since the number of voxels does not drastically change. Therefore, to increase the efficiency of the EGS when using a great number of points, the pre-processing step should be parallelized on the GPU.

To further improve our sub-optimal implementation, we can make use of the hierarchical property of the rotation sampling. Since the vertices of the polyhedron \(\{3,5+\}_{2,0}\) are a subset of the vertices of \(\{3,5+\}_{4,0}\), we can hierarchically query the rotations of \(\{3,5+\}_{4,0}\) using only the candidate solutions of \(\{3,5+\}_{2,0}\). Since the runtime of the EGS amounts to around 2 seconds when using \(\{3,5+\}_{2,0}\) as the sampling of \({\mathbb {S}}^2\), the total runtime would benefit from such a strategy. We leave the pre-processing parallelization and the hierarchical implementation as future work.

We compute the average resources needed to run a method as the average GPU memory occupancy necessary for registering a pair of point clouds. An increasing trend in recent works is to tackle the registration problem by adding more compute power, which translates to bigger learning models. Therefore, all of the recent methods have a very large memory footprint, which makes them inefficient for many practical use cases.

Table 12 Average GPU occupancy in GBs for registering the 3DMatch benchmark using only 5000 points from each point cloud

As can be seen from Table 12, the most memory-efficient learning method uses at least 6 GB of GPU RAM. In contrast, the EGS method uses only 1.29 GB, which makes it much more efficient and applicable in scenarios with limited resources, such as robots or mobile phones. Additionally, the advantage of the EGS over the learning methods is that its parameters can be changed to address the needs of the user, without any re-training. For example, to lower the resource usage even more, the user can choose a different rotation parametrization, such as the one shown in Table 8 under result #11, without any trade-off in the results; the GPU occupancy then drops to only 0.89 GB. Therefore, our method provides broader applicability than the learning methods, without any trade-offs in the quality of the results.
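For reference, the peak GPU occupancy of a method can be measured with PyTorch as sketched below; register_pair is a hypothetical stand-in for any of the evaluated methods:

```python
import torch

def peak_gpu_gb(register_pair, source, target, device="cuda"):
    """Run one registration and report the peak GPU memory in GB."""
    torch.cuda.reset_peak_memory_stats(device)
    register_pair(source, target)  # hypothetical registration call
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 1024 ** 3
```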

6 Implementation details

All the experiments are done on a single desktop computer with an Intel Core i7-9700 CPU (3GHz, 8-core), 16GB RAM and an NVIDIA TITAN Xp, under Ubuntu 18.04 LTS. We use PyTorch [117] to implement the EGS and Open3D [119] for the generalized ICP algorithm. The cross-correlation is a wrapper around the PyTorch FFT implementation, made to mimic a standard PyTorch 3D convolution (which, by definition, performs a cross-correlation). We thereby realize a fast cross-correlation that can handle two big voxelized volumes, which a standard PyTorch 3D convolution implementation could not do. The padding discussed in the paper, therefore, refers to the cross-correlation volume padding and not to the FFT padding.

All the other results are reproduced using the respective authors' code and data pre-processing guidelines. We provide the implementation details in Appendix B.

7 Visualizations

We visualize the registration results on the two most challenging datasets from the novel FAUST-partial benchmark: FP–R–H and FP–O–H. In Figs. 9 and 10, we show the worst registrations for each method on the respective datasets. More concretely, each row represents the worst performing registration pair for a given method based on the average distance (AD) between the points of the transformed source point cloud:

$$\begin{aligned} AD = \frac{1}{K} \sum _{\begin{array}{c} i=1 \\ x_i \in \, \text {src} \end{array}}^K \Vert (R_{gt} x_i + t_{gt}) - (R_{\text {est}} x_i + t_{\text {est}}) \Vert _2 \end{aligned}$$
(17)

where K is the number of points, \(R_{gt}\) and \(t_{gt}\) represent the ground-truth rotation and translation, and \(R_{\text {est}}\) and \(t_{\text {est}}\) the estimated ones. Whereas the RRE and RTE metrics are useful to distinguish between rotational and translational errors, the AD metric aggregates this information into a single number and lets us single out the worst examples over one metric.
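A direct transcription of Eq. (17) (our own sketch) reads:

```python
import numpy as np

def average_distance(src, R_gt, t_gt, R_est, t_est):
    """Eq. (17): mean distance between the ground-truth-transformed and
    the estimated-transformed source points. src has shape (K, 3)."""
    gt = src @ R_gt.T + t_gt
    est = src @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()
```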

The figures reflect well the results already presented in Table 7: methods with a higher recall (the RR measure) in the table also produce better-aligned worst-case examples in the figures.

Fig. 9

Visualization of the worst registration results of the GO-ICP, FPFH+SC2PCR, GeDi, FCGF+SC2PCR, GeoTransformer and our proposed EGS method on the FP–R–H dataset. Each row represents the registration example that obtained the worst AD metric result for a particular method. Each column represents the registration result for a particular method. We additionally indicate the ground-truth rotation Euler angles in degrees for each example

Fig. 10

Visualization of the worst registration results of the GO-ICP, FPFH+SC2PCR, GeDi, FCGF+SC2PCR, GeoTransformer and our proposed EGS method on the FP–O–H dataset. Each row represents the registration example that obtained the worst AD metric result for a particular method. Each column represents the registration result for a particular method. We additionally indicate the ground-truth overlap for each example

8 Conclusion

The proposed traditional approach provides exceptionally strong 3D registration results. Even though the method is simple and featureless, it is still very effective, demonstrating great results on public benchmarks and even achieving the best results on several of them when compared to the generalization performance of the current state-of-the-art methods. Following a thorough analysis, we see that the EGS is robust to the change of its parameters and does not depend on the choice of refinement strategy. To further advance the analysis of 3D registration methods, we provide a methodology for creating better 3D registration benchmarks and assessing their difficulty. Using this methodology, we propose the novel FAUST-partial benchmark, which addresses the lack of registration parameter range variability in the current benchmarks, as well as their bias toward similar data. The benchmark provides the option to isolate the analysis of the quality of a particular 3D registration method to a single registration parameter and determine its robustness regarding that parameter. Comparing the state of the art on the novel benchmark, we observe a clear drop in performance for lower-overlapping point clouds, almost no influence of the translation parameter on the difficulty of the registration, and some influence of the rotation range on the difficulty of the registration. The EGS baseline achieves competitive results and outperforms all the methods on the medium translation FP–T–M and medium overlap FP–O–M benchmarks.