1 Introduction

Depth sensors (e.g., ToF cameras and LiDAR sensors) are widely used in research and industrial applications thanks to the variety of available and affordable products. These sensors are often utilized in telepresence and robotic applications for tasks such as 3D reconstruction, SLAM, or object recognition. In order to cover large spaces or to avoid occlusion, multiple cameras are used. In these cases, the sensors must be calibrated intrinsically (individually) and extrinsically (to each other) to obtain a common point cloud.

For multi-camera calibration and registration, the classical approach is based on feature detection on flat checkerboards since those features can be detected reliably and accurately in color and infrared images. The first step is to detect the checkerboard itself; then, the inner corners can be extracted using known corner detectors (e.g., [1]), which serve as point correspondences. Using these correspondences, a rigid transformation between different sensors is computed. This approach has been improved continuously (e.g., [2, 3]) and generally leads to robust and accurate results.

However, this method is not always applicable, for example, when infrared or color images are not available. In addition, Reyes-Aviles et al. [4] reported that, depending on the camera model, the infrared images and the corresponding depth images may have different projections (observed, for example, with different Orbbec RGB-D sensors), which leads to errors when infrared images are used for registration. On the other hand, if registration is performed directly on depth images or point clouds, the problem of inherently noisy depth data arises, making accurate feature detection difficult [5].

Our registration procedure, which we present here, requires only 3D point clouds obtained from depth images (no color or infrared data needed), works independently of scene brightness, can register depth sensors facing in the same direction as well as sensors facing each other, and is very easy and quick to perform. At the same time, we achieve excellent results in our evaluations with the Microsoft Azure Kinect, namely:

  • low registration errors (avg. 1.6 mm at 2 m distance),

  • robust target detection (PPV > 0.99),

  • fast target detection (avg. 20 ms).

We accomplish this by using a custom-designed lattice-like 3D registration target that can be easily replicated and a pipeline designed to detect the target’s features accurately in noisy depth images. Our implementation is available both as a self-contained small C++ library and as an Unreal Engine 4 plugin with a sample project, which also allows registering the depth sensors into a virtual world.

This paper is an extension of our CW 2021 paper [6]. In addition to a more detailed explanation of the algorithm and evaluation, we present a new experimental setup and perform an evaluation in which we compare the accuracy of our registration method to commonly known registration methods (see Sect. 4.1). Furthermore, we supplement this paper with a new small C++ library, which allows the use of our registration procedure in other software projects.

2 Related work

As mentioned earlier, the calibration and registration of depth sensors is usually done via the accompanying IR or RGB sensor images: Macknojia et al. [7] synchronously captured a checkerboard in the color and IR images of a Kinect for extrinsic calibration between RGB and depth sensors. Registration (or extrinsic calibration) between multiple Kinect cameras was similarly performed using the respective IR images. Chen et al. [8] captured a checkerboard in the color and IR images for homography-based calibration, while Darwish et al. [9] tracked two orthogonal checkerboards and aimed to improve depth accuracy.

In other publications, e.g., [10, 11] and [12], the use of the checkerboard approach is also described, but external optical tracking systems were added for depth correction and registration of multiple cameras in a common coordinate system, respectively. Although in this case the viewing areas of the cameras do not need to overlap, the need for a tracking system is a major limitation.

Herrera et al. [13] proposed a calibration approach that works directly in the depth image by detecting the outer corners of the checkerboard through a time-consuming manual selection. Reyes-Aviles et al. [4] proposed using a 3D checkerboard as a calibration target and a method that includes normal estimation, edge detection, and thresholding to detect it in the depth image. The registration method proposed by Song et al. [5] is based on a special checkerboard with regularly spaced hollow squares. Depth variations are handled by a model-based approach that considers the centers of the holes. Some works such as [14, 15] have replaced checkerboard-like registration targets with static marker-free 3D objects with known or previously scanned geometry, e.g., a stack of boxes that can be detected in depth images or point clouds. Furthermore, spherical targets have been presented for camera calibration (e.g., [16, 17]) and for registration in multi-camera setups (e.g., [18]), in which the target was detected in the RGB image using ellipse fitting and in the depth image using background subtraction or spherical area detection followed by sphere fitting of the point cloud.

Another popular approach is to perform a target-less registration directly on the 3D point clouds of the surrounding 3D scene obtained from multiple depth cameras. The most well-known algorithm is Iterative Closest Point (ICP), which alternately searches for the closest point-to-point correspondences and an optimal rigid transformation. However, its main drawback is the tendency to converge to a local minimum and, therefore, its high dependency on a good initial guess [19]. Numerous variants were proposed in order to improve the convergence [20], the computational speed [21], and the robustness to noise [22]. Typical global methods for 3D registration are based on feature matching (including detection and description) and transformation estimation via RANSAC. Generally, these methods tend to be less precise and, depending on the number of outliers, more time-consuming [23]. A major problem with target-less registration occurs when the point clouds overlap only partially or barely at all. Several recent approaches try to obtain a registration even in these cases, e.g., [24] and [25]. However, if the scenes do not overlap at all, e.g., because the depth sensors are facing each other, these target-less methods are not applicable.

3 Our approach

We designed a lattice-like 3D target and developed a lattice detection algorithm that allows for quick and easy registration (extrinsic calibration) of multiple depth sensors. Our lattice-like 3D target (see Fig. 1) consists of 12 bars of size 44 cm x 4 cm x 0.2 cm, available in a common DIY store. By leaving 4 cm of space between the bars in the vertical and horizontal directions, 25 holes of size 4 cm x 4 cm are created that can be detected in the point cloud generated by depth sensors.

To perform the registration, the lattice has to be moved in the field of view of the depth sensors for a few seconds, while our lattice detection algorithm detects up to 25 point correspondences per frame for all depth sensors. Based on these point correspondences of multiple frames, we determine a rigid transformation matrix that describes the transformation of the sensors with respect to each other.

Fig. 1 Left: photograph of our lattice with 25 holes. Right: two point clouds registered using our method

The major challenge in classical camera registration is the correct and accurate recognition of feature points. Since we only use depth data and no RGB or IR data, we cannot easily reuse the image-based recognition algorithms.

In the following, we present a novel approach that is fast and easy to implement while still achieving robust results. The detection of the lattice consists of the following steps:

  1. Identification of plausible lattice candidates.

  2. Detection of hole centers.

  3. Identification of the center and axes of the lattice and outlier removal among hole centers.

  4. Correspondence mapping.

  5. Correspondence rejection.

  6. SVD-based transformation estimation.

In the following, we will explain the individual steps of the lattice detection. Note that our algorithm expects an array of 3D coordinates as input—i.e., a point cloud—that is available in scanline order of the original depth image.

3.1 Lattice candidates

Fig. 2 The steps in which we identify lattice candidates

Initially, we are given a point cloud in scanline order from the depth image, without indication as to whether a lattice is visible and where the lattice is located. To obtain clues as to where lattices might be located in the point cloud, we exploit the property that the lattice has many regularly spaced holes that result in many gaps in scanline order.

So, in the first step, we search for gaps along the scanlines which correspond to the regular geometry of the lattice (see Fig. 2). We do this by segmenting individual scanlines based on the Euclidean distance between neighboring points. Scanline segments that are at most as long as the diagonal of a hole and that are surrounded by two scanline segments of plausible length are identified as gap segments.

Now we have many individual segments lying along individual scanlines, each of which could be within a hole of the lattice. However, since we want to find a section of the point cloud that fully encloses the lattice, we cluster all gap segments using their Virtual Gap Centers (see Footnote 1) based on their proximity to each other. Using PCA, we calculate eigenvectors and eigenvalues for each cluster. Based on the proportions of the eigenvalues, we can efficiently discard clusters that obviously cannot represent lattice candidates: due to the symmetric structure of the lattice, we expect the first two eigenvalues to be similar in size, while the third eigenvalue is many times smaller (we use a factor of 10 as the threshold), since the lattice is flat. All remaining clusters of gap segments are considered as lattice candidates.
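To make the scanline segmentation more concrete, the following C++ sketch splits a single scanline into segments based on the Euclidean distance between neighboring points. The data layout, names, and threshold are illustrative assumptions rather than our exact implementation; the subsequent gap classification and the PCA-based plausibility test are omitted here.

```cpp
#include <Eigen/Core>
#include <vector>

struct ScanSegment { int begin, end; };  // index range [begin, end) within one scanline

// Split one scanline of 3D points into segments: a new segment starts whenever the
// Euclidean distance between two neighboring points exceeds maxJump. Short segments
// surrounded by two segments of plausible length would then be classified as gap segments.
std::vector<ScanSegment> segmentScanline(const std::vector<Eigen::Vector3f>& row,
                                         float maxJump)
{
  std::vector<ScanSegment> segments;
  if (row.empty()) return segments;
  int start = 0;
  for (int i = 1; i < static_cast<int>(row.size()); ++i) {
    if ((row[i] - row[i - 1]).norm() > maxJump) {  // depth jump between neighbors
      segments.push_back({start, i});
      start = i;
    }
  }
  segments.push_back({start, static_cast<int>(row.size())});
  return segments;
}
```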

3.2 Hole center detection

Using the clusters left over from the previous step, we now know areas where a lattice may be located. In this step, we try to identify the exact hole centers of the lattice for each area found and discard lattice candidates that turn out to be implausible in the following steps.

We first determine all 3D points that potentially belong to the physical lattice due to their proximity to the gap segments using their Virtual Gap Centers (see footnote 1). A point is considered a lattice point if it is within a certain radius to at least one Virtual Gap Center (we use \(r=0.16\,\text {m}\) to completely cover the lattice in case some holes were missed in the previous step).

Fig. 3 Side view of the lattice in the point cloud (blue) before and after filtering noise (e.g., due to the flying pixel effect). This is done by fitting a plane (black) to these points via RANSAC and then culling points by thresholding

Fig. 4 A hole of the lattice in image space. The red line visualizes the separation of both segments. The dark blue points are those 3D inlier points used to calculate the hole center by averaging them and projecting the average onto the estimated plane in which the lattice is located

Fig. 5 Visualization of our heuristic that can filter incorrectly detected hole centers (gray) and detect the directions of the x- and y-axes. In this example, the red and gray colored points were detected as hole centers previously

To effectively filter out the noise that typically occurs with depth sensors (e.g., due to the flying pixel effect), we use RANSAC [26] to fit a plane to the point cloud section and define all points closer than a certain threshold to the plane as lattice points (see Fig. 3). We store the indices of these lattice points in an inlier set. All remaining points are defined as outliers.
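As an illustration, this plane fit could be realized with PCL's RANSAC-based plane segmentation roughly as follows; the function name and the 1 cm distance threshold are illustrative assumptions and not necessarily the values of our implementation.

```cpp
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/ModelCoefficients.h>
#include <pcl/PointIndices.h>
#include <pcl/sample_consensus/model_types.h>
#include <pcl/sample_consensus/method_types.h>
#include <pcl/segmentation/sac_segmentation.h>

// Fit a plane to the points of a lattice candidate region; the returned indices are
// the lattice points (inliers), all remaining points are treated as outliers.
pcl::PointIndices fitLatticePlane(const pcl::PointCloud<pcl::PointXYZ>::Ptr& candidate,
                                  pcl::ModelCoefficients& plane)
{
  pcl::PointIndices inliers;
  pcl::SACSegmentation<pcl::PointXYZ> seg;
  seg.setModelType(pcl::SACMODEL_PLANE);  // plane model
  seg.setMethodType(pcl::SAC_RANSAC);     // robust against flying pixels
  seg.setDistanceThreshold(0.01);         // keep points within 1 cm of the plane
  seg.setInputCloud(candidate);
  seg.segment(inliers, plane);
  return inliers;
}
```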

To identify the holes of the lattice, we again iterate over all scanlines of the input point cloud, each from the first inlier to the last inlier. We create segments similar to Sect. 3.1, but this time we create segments of lattice points (which are contained in the inlier set defined above) and segments of outliers. All outlier segments that are enclosed by inlier segments are assumed to be part of a hole. Since we have only iterated over the horizontal scanlines so far and thus only have horizontal segments, we now vertically join adjacent outlier segments if they overlap horizontally to obtain one segment for each hole of the lattice. This vertical joining of the horizontally running outlier segments is done efficiently using a union-find structure.

The remaining, joined outlier segments represent the individual holes. However, since the points of these joined outlier segments do not lie in the plane of the lattice, we consider the directly adjacent inlier points in each scanline (see Fig. 4), project them onto the previously fitted plane, and use their average as the hole center.
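A minimal sketch of the joining step with a union-find (disjoint-set) structure is given below; the segment representation and the pairwise overlap test are simplified assumptions for illustration and not the exact data layout of our implementation.

```cpp
#include <numeric>
#include <vector>

struct Segment { int row, colStart, colEnd; };  // one outlier run on a scanline

struct UnionFind {
  std::vector<int> parent;
  explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
  int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
  void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Join outlier segments of vertically adjacent scanlines that overlap horizontally;
// each resulting set of segments then corresponds to one hole of the lattice.
void joinSegments(const std::vector<Segment>& segs, UnionFind& uf)
{
  for (std::size_t i = 0; i < segs.size(); ++i)
    for (std::size_t j = i + 1; j < segs.size(); ++j)
      if (segs[j].row == segs[i].row + 1 &&
          segs[i].colStart <= segs[j].colEnd && segs[j].colStart <= segs[i].colEnd)
        uf.unite(static_cast<int>(i), static_cast<int>(j));
}
```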

3.3 Axes detection and outlier removal

For each lattice candidate, we have found a set H of hole center candidates. However, there may still be incorrectly identified hole centers, and we additionally need correspondence points between multiple sensors, i.e., we have to match the detected hole centers across different sensors.

For this purpose, we have developed a heuristic that can recognize the axes of the lattice based on the potential hole centers and that can cope even with quite noisy data. This heuristic works as follows:

  (1)

    Given the set of all found hole centers H, we now define the following set V of vectors:

    $$\begin{aligned} V = \{ n-m \mid d - \delta< dist(m,n) < d + \delta , m, n \in H \}, \end{aligned}$$
    (1)

    where \(d\) is the hole spacing and \(\delta \) is a tolerance (in our case \(d=8\,\text {cm}\) and \(\delta =2\,\text {cm}\), which matches the geometry of our lattice). These vectors can be interpreted as edges between the hole centers. Together, they form a proximity graph (see Fig. 5).

  (2)

    Sort the vectors \(v \in V\) by the angle \(\alpha \) they subtend with the x-axis:

    $$\begin{aligned} \alpha (v) = {\left\{ \begin{array}{ll} \text {atan2}(v_y, v_x) &{} \text {if } \text {atan2}(v_y, v_x) \ge 0 \\ \text {atan2}(v_y, v_x) + \pi &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
    (2)
  (3)

    Cluster these vectors based on their angle \(\alpha \). As can be seen in Fig. 5, this results in two very large clusters (color-coded by the blue and orange edges) as well as several other very small clusters. Thinking of these vectors as edges of a proximity graph, we define the set of all “good” edges in one of those two largest clusters as G and the set of all “bad” edges in the remaining small clusters as B.

  (4)

    For each hole center \(h \in H\) we now consider its incident edges E(h) and remove hole centers based on the number of incident “good” edges and incident “bad” edges, leaving the following set:

    $$\begin{aligned} H_\text {filtered}=\{h\mid \#E_G(h) > \#E_B(h), h \in H\}, \end{aligned}$$
    (3)

    where \(E_G(h) = E(h) \cap G\) and \(E_B(h) = E(h) \cap B\). This way, we very reliably remove incorrect hole centers.

  (5)

    Using the remaining hole centers \(H_\text {filtered}\), we look for the hole center candidate \(h^*\in H\) that is closest to the average:

    $$\begin{aligned} h'=\frac{1}{|H|} \sum _{h_i \in H} h_i, \end{aligned}$$
    (4)

    The \(h^*\) closest to this \(h'\) will be considered the center of the lattice.

  (6)

    By selecting the median vector in each of the two largest clusters, we obtain two very stable vectors that point along the x-axis and the y-axis, respectively (see Fig. 5).

  (7)

    At this point, we do not know which of the two vectors represents the x-axis and which the y-axis, nor their signs. To resolve this ambiguity, we consider the points of the point cloud surrounding the lattice: these are usually the points of the hands and arms holding the lattice. So, we calculate a vector from \(h^*\) to the center of these hand points and flip and assign the two previously found vectors such that the x-axis always points in the direction of the hands. Let us assume for the moment that the front side of the lattice is always visible in the depth image: then we can set the plane normal found by RANSAC as the z-axis, which in turn determines the y-axis.

If we find that the two vectors determined in step (6) are not roughly orthogonal, or if \(\#H_\text {filtered}\) is too small, we discard this candidate, since in this case it is most likely not the lattice. Using this heuristic and making the preliminary assumption that the lattice is always visible from the front, we can determine the center and axes of the lattice as well as remove incorrect hole centers.
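The following sketch illustrates the core of steps (2) to (4): folding each edge direction into \([0, \pi )\) as in Eq. (2) and keeping only hole centers with more incident “good” than “bad” edges as in Eq. (3). The 2D edge vectors are assumed to be the hole-center differences projected into the lattice plane; the angle clustering itself is only indicated here via precomputed per-center edge counts.

```cpp
#include <Eigen/Core>
#include <cmath>
#include <vector>

constexpr float kPi = 3.14159265358979f;

// Angle between an edge vector and the x-axis, folded into [0, pi), cf. Eq. (2).
float edgeAngle(const Eigen::Vector2f& v)
{
  const float a = std::atan2(v.y(), v.x());
  return a >= 0.0f ? a : a + kPi;
}

// Keep only hole centers whose incident edges are mostly "good", cf. Eq. (3).
// goodCount[i] and badCount[i] hold #E_G(h_i) and #E_B(h_i) for hole center i.
std::vector<int> filterHoleCenters(const std::vector<int>& goodCount,
                                   const std::vector<int>& badCount)
{
  std::vector<int> kept;
  for (std::size_t i = 0; i < goodCount.size(); ++i)
    if (goodCount[i] > badCount[i])
      kept.push_back(static_cast<int>(i));
  return kept;
}
```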

3.4 Point correspondences

Generally, it would be possible to use only the centers of the lattice across multiple sensors as point correspondences over multiple frames. However, the more point correspondences are used over a large space, the more accurate and stable the registration is expected to become. Therefore, it is desirable to use all the hole centers instead of just the center one. By using all found hole centers, we get up to 25 times more point correspondences over a larger space. This also considerably reduces the time needed to perform a registration.

However, since we still lack the information on whether the lattice in the original depth image is seen from the front or the back, we cannot yet establish a clear mapping of hole centers between multiple sensors. Therefore, we first perform a less precise registration with only two point correspondences per frame, namely (a) the lattice center as well as (b) the lattice center shifted in the direction of the x-axis, since the x-axis always points in the direction of the hands. After that rough registration, we can transform the z-axis vector of the lattice seen in sensor A into the coordinate system of sensor B and use \(z'_B = \pm z_B\) as the z-axis and \(y'_B = \pm y_B\) as the y-axis (since we created the y-axis from the z-axis), with the sign chosen according to the sign of the dot product \(z_A \cdot z_B\) between the transformed z-axes. In this way, we resolve the ambiguity of the lattice side and ensure that lattices visible from the same side in different cameras are assigned the same axis signs.
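A possible realization of this sign resolution is sketched below, assuming the rough registration yields a rigid transform that maps sensor A coordinates into sensor B's frame; the names are illustrative.

```cpp
#include <Eigen/Geometry>

// Flip sensor B's lattice axes if the two sensors see opposite sides of the lattice.
void alignLatticeAxes(const Eigen::Isometry3f& T_BA,  // maps sensor A coords into sensor B's frame
                      const Eigen::Vector3f& zA,      // lattice z-axis seen by sensor A
                      Eigen::Vector3f& zB,            // lattice z-axis seen by sensor B (in/out)
                      Eigen::Vector3f& yB)            // lattice y-axis seen by sensor B (in/out)
{
  const Eigen::Vector3f zA_inB = T_BA.linear() * zA;  // rotate A's z-axis into B's frame
  if (zA_inB.dot(zB) < 0.0f) {                        // opposite lattice sides
    zB = -zB;
    yB = -yB;                                         // y was derived from z, so flip it too
  }
}
```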

3.5 Correspondence rejection and registration

Up to this point, we have found point correspondences for which the 3D coordinates in the camera space of multiple sensors are known. However, since our lattice detection is not completely immune to errors, very rare errors in the point correspondences are possible. To ensure that in these cases the accuracy of the registration is not affected, we filter the point correspondences using the RANSAC-based correspondence rejection of the Point Cloud Library (PCL) [27]. Finally, we perform registration with the remaining point correspondences using SVD-based transformation estimation implemented by the PCL [27].
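Using PCL, this final step could look roughly like the following sketch, where `src` and `tgt` are assumed to hold the accumulated hole-center correspondences of the two sensors in matching order; the inlier threshold and iteration count are illustrative values.

```cpp
#include <Eigen/Core>
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/correspondence.h>
#include <pcl/registration/correspondence_rejection_sample_consensus.h>
#include <pcl/registration/transformation_estimation_svd.h>

Eigen::Matrix4f registerClouds(const pcl::PointCloud<pcl::PointXYZ>::Ptr& src,
                               const pcl::PointCloud<pcl::PointXYZ>::Ptr& tgt)
{
  // One-to-one correspondences by index (same index = same hole center).
  pcl::Correspondences all;
  for (std::size_t i = 0; i < src->size(); ++i)
    all.emplace_back(static_cast<int>(i), static_cast<int>(i), 0.0f);

  // RANSAC-based rejection of the (rare) faulty correspondences.
  pcl::registration::CorrespondenceRejectorSampleConsensus<pcl::PointXYZ> rejector;
  rejector.setInputSource(src);
  rejector.setInputTarget(tgt);
  rejector.setInlierThreshold(0.02);   // 2 cm, an illustrative value
  rejector.setMaximumIterations(1000);
  pcl::Correspondences inliers;
  rejector.getRemainingCorrespondences(all, inliers);

  // SVD-based rigid transformation from the remaining correspondences.
  pcl::registration::TransformationEstimationSVD<pcl::PointXYZ, pcl::PointXYZ> svd;
  Eigen::Matrix4f transform = Eigen::Matrix4f::Identity();
  svd.estimateRigidTransformation(*src, *tgt, inliers, transform);
  return transform;
}
```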

4 Results

To investigate the accuracy, reliability, and runtime performance (see Footnote 2) of our registration procedure, we designed and conducted several experiments, which we present in this section. In all experiments, the Microsoft Azure Kinect was used as the depth sensor. The following is a list of the experiments we conducted, whereby experiments A and B concern the whole registration procedure, while experiments C, D, and E refer to the lattice detection in particular:

  A.

    Accuracy Measurement and Comparison: We present an experimental setup that allows for very accurate determination of the distance error of a point as well as an angular error between two registered depth sensors. We compare our lattice-based registration procedure with three variants of the well-known checkerboard registration procedure. Both depth sensors were synchronized in time, so that two matched frames were always recorded with a time offset of exactly 160 \(\mu \)s.

  B.

    Common Coordinate System Registration: We register a depth sensor into a ground truth coordinate system given by Optitrack, a high-precision optical tracking system. The pose of the lattice is tracked both by Optitrack and by the depth sensor. The purpose of this experiment is also to determine the accuracy of the registration procedure, but in a different application—namely, the registration of a depth sensor with a third system. The major differences to experiment A are that (a) only the lattice center is taken as the point correspondence instead of all hole centers, since Optitrack is not able to detect the latter, (b) there is no exact time synchronization of the frames between the two systems, and (c) the error is determined not only over one very accurate correspondence point, but over many correspondence points in a larger volume.

  C.

    Rotational Robustness: In this experiment, we rotate the lattice while its center position is fixed. This allows us to determine the minimum angle at which the lattice is still detected by our method and to detect possible systematic errors that depend on the angle of rotation. In this way, we can estimate whether our registration procedure is reliable in all situations (e.g., when the angle between the direction vectors of both depth sensors is very large) and whether higher errors can possibly be expected there than those we obtain in Experiments A and B.

  D.

    Runtime Performance: We investigate the runtime performance of the lattice detection depending on the lattice distance since the lattice detection takes by far the largest part of the runtime (see footnote 2).

  E.

    Precision and Recall: The reliability of our registration method essentially depends on how reliably the lattice, including its hole centers, is found. Therefore, in this experiment, for three different scenarios, we look at how often the lattice was correctly detected when our algorithm detected something (Precision) and how often the lattice was correctly detected when a lattice should have been visible (Recall).

4.1 Accuracy measurement and comparison

With this experiment, we examine the accuracy of our lattice registration procedure and compare it to the accuracy of conventional checkerboard registration procedures.

The following registration procedures are evaluated:

  • Checkerboard (RGB): We capture a moving checkerboard (with 8x8 inner corners) in the RGB image, detect its corners with OpenCV’s checkerboard corner detection by Duda et al. [3] in image space, and then we use OpenCV’s stereoCalibrate function [28] to obtain a registration for the RGB sensors.

  • Checkerboard (IR): We capture a moving checkerboard (with 8x8 inner corners) in the infrared image, detect its corners with OpenCV’s checkerboard corner detection by Duda et al.  [3] in image space and then we use OpenCV’s stereoCalibrate function [28] to obtain a registration for the infrared/depth sensors.

  • Checkerboard (IR+D): We capture a moving checkerboard (with 8x8 inner corners) in the infrared image, detect its corners with OpenCV’s checkerboard corner detection by Duda et al. [3] in image space, use the corresponding 3D points by the depth image, and apply the correspondence rejection and SVD-based transformation estimation implemented in PCL [27], which is also used by our method.

  • Lattice (D): We capture a moving lattice in the point cloud given by the depth image, detect it using our lattice detection and perform registration which is based on the correspondence rejection and SVD-based transformation estimation implemented in PCL [27].

Note that the first three checkerboard registration methods (RGB, IR and IR+D) are well-known approaches in the community that we compare to our lattice-based registration (D).
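For reference, the image-based baselines can be sketched with OpenCV roughly as follows; variable names are placeholders, the intrinsics `K1, D1, K2, D2` are assumed to be the factory parameters of the two sensors and are kept fixed, and `objectPoints` holds the physical 3D coordinates of the checkerboard corners per view. The returned matrix follows OpenCV's convention of mapping points from the first camera's frame into the second camera's frame.

```cpp
#include <opencv2/calib3d.hpp>
#include <vector>

// Checkerboard corner detection by Duda et al. (subpixel-accurate), used for the
// RGB, IR, and IR+D variants alike.
bool detectCorners(const cv::Mat& image, std::vector<cv::Point2f>& corners)
{
  return cv::findChessboardCornersSB(image, cv::Size(8, 8), corners);
}

// Stereo registration for the Checkerboard (RGB) and Checkerboard (IR) variants.
cv::Mat stereoRegister(const std::vector<std::vector<cv::Point3f>>& objectPoints,
                       const std::vector<std::vector<cv::Point2f>>& corners1,
                       const std::vector<std::vector<cv::Point2f>>& corners2,
                       cv::Mat K1, cv::Mat D1, cv::Mat K2, cv::Mat D2, cv::Size imageSize)
{
  cv::Mat R, T, E, F;
  cv::stereoCalibrate(objectPoints, corners1, corners2, K1, D1, K2, D2, imageSize,
                      R, T, E, F, cv::CALIB_FIX_INTRINSIC);
  // Assemble the 4x4 rigid transform that maps camera-1 coordinates to camera-2
  // coordinates (OpenCV's stereoCalibrate convention).
  cv::Mat T_cam1_to_cam2 = cv::Mat::eye(4, 4, CV_64F);
  R.copyTo(T_cam1_to_cam2(cv::Rect(0, 0, 3, 3)));
  T.copyTo(T_cam1_to_cam2(cv::Rect(3, 0, 1, 3)));
  return T_cam1_to_cam2;
}
```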

Fig. 6 Setup of the recordings made per run. In each of the 10 runs, we recorded (a) the moving lattice, (b) the moving checkerboard, and (c) the static whiteboard with a checkerboard in the center. The whiteboard in (c) was always located approximately in the middle of the registration volume (Reg. Volume) and on average about 2 meters away from the sensors

Our experiment consists of 10 runs, for each of which we make (a) a recording with a moving lattice target, (b) a recording with a moving checkerboard target, and (c) a recording of a stationary whiteboard with a checkerboard pattern at its center (see Fig. 6). In each run, recordings (a) and (b) are used to perform a registration, while recording (c) is an evaluation recording used to determine a point correspondence between the cameras’ coordinate systems very precisely for the distance error measurement. After each run, we slightly changed the position and orientation of both Azure Kinects to get some variation while maintaining a distance of approximately 1.5–2.5 m to the center of the registration volume. Furthermore, we alternated whether the lattice recording (a) or the checkerboard recording (b) was done first. Both Azure Kinects were synchronized in time to avoid errors caused by a larger location offset of the registration target in the same frame of different sensors. However, to avoid interference between multiple Azure Kinects, the second Azure Kinect was delayed by 160 \(\upmu \)s as recommended by the manufacturer, which is negligible with regard to the expected error.

Fig. 7 Unevenly perceived surface and plane fitting. In (a), the unevenly perceived surface of the checkerboard is visible. In (b), the area of points on the whiteboard (white) that we considered for plane fitting is shown

To obtain a point correspondence for error measurement in recording (c), which is needed to determine the distance error, we proceed as follows:

  (1)

    We detect the checkerboard pattern glued to the whiteboard using OpenCV’s checkerboard corner detection in the infrared image. According to [3], this yields subpixel accuracy.

  (2)

    Since, in the case of the Azure Kinect, the depth image and the IR image are generated by the same sensor with the same projection, we can directly obtain the 3D point for each corner detected in the IR image and average over all 3D corner points to get a mid-point of the checkerboard.

  (3)

    While the 3D mid-point is assumed to be very precise along the axes in image space, there may be small deviations along the depth axis due to the alternating colors of the checkerboard fields (see Fig. 7a). To control for that, we fit a plane to the 3D points within a rectangle centered on the checkerboard (see Fig. 7b). The previously calculated mid-point is then projected onto this plane, and the result is finally used to estimate the error between both point clouds.

Note that the registration transformation obtained by the Checkerboard (RGB) procedure differs somewhat from those obtained by the other procedures. This should be taken into account when comparing the 3D accuracy of the different calibration procedures. The reason is that the Azure Kinect (and potentially many other RGB-D cameras) has two sensors: one sensor for both the IR and depth image, and a separate sensor for the RGB image. In order to transform 3D points of Kinect B into the reference frame of Kinect A, the following concatenation of transformations should be used in the case of the RGB calibration procedure:

$$\begin{aligned} T_{D_A \leftarrow D_B} = T_{D_A \leftarrow C_A} \cdot T_{C_A \leftarrow C_B} \cdot T_{C_B \leftarrow D_B} \end{aligned}$$
(5)

where

  • \(T_{D_A \leftarrow D_B}\) denotes the transformation from the depth sensor of Kinect B to the depth sensor of Kinect A,

  • \(T_{C_B \leftarrow D_B}\) denotes the transformation from Kinect B’s depth to its color sensor (given by the factory calibration),

  • \(T_{C_A \leftarrow C_B}\) denotes the transformation from Kinect B’s color sensor to Kinect A’s color sensor (known from the checkerboard registration),

  • \(T_{D_A \leftarrow C_A}\) is the transformation from Kinect A’s color to its depth sensor.

Obviously, the error of the Checkerboard (RGB) registration procedure using the RGB sensors and the error of the factory calibration between depth and color sensors accumulate. Therefore, it is to be expected that the Checkerboard (RGB) registration has a higher error than the other registration procedures. This is verified by the results of our experiments (see Table 1).
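For illustration, the concatenation in Eq. (5) can be written directly with Eigen, assuming all transforms are available as rigid transformations in the "target \(\leftarrow \) source" convention used above; the function name is hypothetical.

```cpp
#include <Eigen/Geometry>

// Composition of Eq. (5): depth B -> depth A via the two factory calibrations and
// the checkerboard registration between the color sensors.
Eigen::Isometry3d depthBtoDepthA(const Eigen::Isometry3d& T_DA_CA,  // color A -> depth A (factory)
                                 const Eigen::Isometry3d& T_CA_CB,  // color B -> color A (checkerboard)
                                 const Eigen::Isometry3d& T_CB_DB)  // depth B -> color B (factory)
{
  return T_DA_CA * T_CA_CB * T_CB_DB;  // errors of all three transforms accumulate
}
```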

Table 1 Results of single registration runs. Error (mm) is the distance error measured via the correspondence point, while Error (deg) is measured via the angle between the fitted plane normals. The value of the method that performed best in the respective run is marked in bold

In all runs, the lattice was detected by both sensors in an average of 87.2 frames (SD: 32.0) of recording (a). In the RGB image of recording (b), the checkerboard was detected in 108.3 frames (SD: 39.7) on average, and in the IR image in 87.4 frames (SD: 39.0) on average. Note that recordings (a) and (b) are not identical—individual recordings of the checkerboard methods and the lattice method can therefore not be directly compared. Furthermore, the difference in the number of detected checkerboards between the RGB image and the IR image, both using recording (b), is due to the uneven brightness of the checkerboard at different distances in the IR image, so that the checkerboard was not detected in the IR image in some cases.

The results (see Table 1) show that the average distance error of our method Lattice (D) (which only requires the depth image) is 1.6 mm and the average angular error is 0.17 deg. The Checkerboard (IR+D) method performs considerably better, with an average distance error of 0.7 mm and an average angular error of 0.08 deg, but requires both the infrared image and the depth image. Compared to the Checkerboard (IR) method with an average error of 6.4 mm and the Checkerboard (RGB) method with an average error of 15.4 mm, our registration method Lattice (D) and the Checkerboard (IR+D) method both perform significantly better.

Fig. 8 Distance error after registration by procedures and axes

Since both the depth image and the infrared image are combined as input in the Checkerboard (IR+D) method, this method is expected to be more accurate than the Lattice (D) method, which uses only the noisy depth image as input. However, even if the average error of the Checkerboard (IR+D) method is smaller than the average error of the Lattice (D) method, both errors are very small in absolute terms considering the accuracy of depth sensors, whose data is much noisier and suffers from distortions at edges and the flying pixel effect. Furthermore, one has to take into account that the lattice in the experiment has a thickness of 4 mm (two layers of 2 mm thick bars) and was built with a precision of about 1–2 mm. The motions of the lattice in the recordings, as well as the higher standard deviation in the X-direction for Lattice (D) in Fig. 8, indicate that the thickness of the lattice may have mattered. Thinner and more precisely built lattices could improve the result of the Lattice (D) method.

Figure 8 shows that the Checkerboard (RGB) and Checkerboard (IR) procedures, in particular, contain a systematic error that could be caused by the factory calibration of the Kinect, whose parameters are used in OpenCV’s stereo calibration methods. In our experimental setup, the distance between two adjacent pixels in the registration volume is about 4 mm in physical space, so even small errors in the intrinsic calibration could have a significant impact on the distance error in both cases. On the other hand, these two methods do not use depth information, so a possible systematic offset in the depth direction is not corrected by them. In the case of the Checkerboard (RGB) method, as described above, the errors in the transformations between the separate depth and RGB sensors also add up, leading to higher expected errors compared to the other three methods.

Finally, this experiment shows that our method is a valuable alternative registration method in cases where no IR or color image is available, and that it is also very accurate in terms of absolute error values, considering the accuracy of depth sensors.

4.2 Registration into a common coordinate system

In the previous section, we looked at the accuracy of a registration between two depth sensors using our registration method. In this section, we examine the accuracy with which we can register a depth sensor into Optitrack’s coordinate system using our method. To do so, we tracked the lattice using both Optitrack and a Microsoft Azure Kinect combined with our detection algorithm. To track the lattice with Optitrack, we attached seven markers to the lattice to achieve sufficient accuracy. These Optitrack markers were detected in the Azure Kinect depth image by our method as hole center candidates. However, our heuristic generally classified them as incorrect hole centers and removed them, so that there was no noticeable effect on the precision of the lattice detection.

Fig. 9 Two scenarios used for ground truth evaluation, from the perspective of the depth sensor

Since Optitrack and the Kinect both use infrared light at a wavelength of 850 nm, there was occasionally a pulsating noise throughout the depth image (see Fig. 10). We ran Optitrack at 30 fps (almost the same frame rate at which the Azure Kinect recorded), as the pulsating noise was least likely to show up this way. The pulsating noise caused a greatly increased runtime of the algorithm in the affected frames, since many lattice candidates were detected in flat background objects. However, all these false lattice candidates were successfully discarded by our detection algorithm.

We performed the evaluation in front of two different backgrounds (see Fig. 9), hereafter also called scenarios, while slowly moving the lattice. In scenario B, the distance between the sensor and the lattice was between 1.0 m and 1.95 m, while the center of the lattice stayed within a volume of about 0.6 m\(^3\) (see Footnote 3). In scenario A, the distance between the sensor and the lattice ranged from 0.85 m to 2.05 m, with the center of the lattice in a volume of about 1.2 m\(^3\) (the entire lattice covered about 2.2 m\(^3\)).

Fig. 10 Pulsating noise in the recorded point cloud due to interference between Optitrack and the Azure Kinect, which both use infrared light with a wavelength of 850 nm. In the image sequence shown, the noise varies from no noise (a) to very intense noise (d)

Table 2 Mean error between ground truth lattice center and detected lattice center after registration

Using the lattice centers as point correspondences, we registered the Microsoft Azure Kinect’s point cloud into Optitrack’s coordinate system. We then measured the deviation of the registered center point from the center point detected by Optitrack. In both scenarios, we observed quite similar errors of only 3.83 mm and 4.40 mm on average (see Table 2).

Note that, compared to experiment A (see Sect. 4.1), instead of just measuring the distance error of one point which was located in the center of the registration volume and was smoothed over time, we captured multiple correspondences in a specific volume which were still affected by the typical noise of the Azure Kinect. Additionally, Optitrack and the Kinect were not synchronized in time. Therefore, we always searched for the Optitrack frame closest in time to a Kinect frame to find point correspondences. Although we set both Optitrack and the Kinect to 30 fps, they did not run at exactly the same speed. The closest frames in time between Optitrack and the Kinect were always time-shifted by 0 to about 1/60 second, averaging 1/120 second. With an average movement speed of 18.8 cm per second in scenario A, this gives an expected error of \(0.0083\,\text {s} \cdot 18.8\,\text {cm/s} = 1.56\,\text {mm}\). The average error of the lattice detection by Optitrack was specified by Optitrack’s Motive software as 0.7 mm.

4.3 Rotational robustness

With this experiment, we tried to determine the robustness of our method with respect to the angle between the line of sight of the camera and the normal of the lattice. It is to be expected that at grazing angles (i.e., when the angle between the line of sight and the lattice normal approaches 90 degrees), our registration procedure will fail.

We used a thin thread to hang the lattice as symmetrically as possible between two tripods, leaving one degree of freedom (see Fig. 11). In the experiment, the lattice then slowly rotated around its y-axis in the range \([-90,90]\) degrees. Assuming the rotation axis accurately passes through the lattice’s real center, the position of the detected lattice center is not expected to change regardless of the orientation of the lattice.

Fig. 11 Experiment setup to determine the deviation of the detected center point during the rotation of the lattice around its own axis

Fig. 12 Deviation of the center point during rotation around the y-axis of a lattice, which was clamped between two tripods with sewing threads

After recording the lattice at a distance of about 1.5 m, we obtained an average deviation from the mean center along the x-axis of 0.9 mm (SD: 1.0 mm), along the y-axis of 0.4 mm (SD: 0.3 mm), and along the z-axis of 1.8 mm (SD: 1.3 mm) over a range of \(-57.0\text {\,deg}\) to \(59.7\text {\,deg}\) (see Fig. 12)—at larger angles the lattice was no longer detected. As expected, the deviation along the y-axis (upward axis) was very small. The slightly higher deviation along the z-axis compared to the x-axis could also be due, at least in part, to the expected error of the Azure Kinect camera depth values (which mainly affect the z-value). Also, our lattice has a thickness of 4 mm and was built with an accuracy of only about 1-2 mm.

4.4 Runtime performance

We expect the runtime of the lattice detection to depend on the distance between the lattice and the sensor since fewer points of the point cloud have to be processed if the lattice is further away. Therefore, we created a recording in which we moved the lattice back and forth at a distance of about 0.9 m to 3.9 m (see Fig. 13).

Fig. 13 Scenario C for runtime measurement and precision and recall estimation

Fig. 14 Dependence of the runtime on the distance between lattice and sensor in scenario C only (considering frames where a lattice was detected)

We observed an average runtime of 19.2 ms (SD: 6.4 ms) per processed frame for the lattice detection on a single core of an AMD Ryzen 9 3900X processor in Scenario C. For only 198 of the 4022 considered frames (4.9%), the lattice detection took more than 33.33 ms, while the maximum runtime in this scenario was 60.5 ms. As expected, the runtime clearly depends on the distance of the lattice to the sensor, see Fig. 14.

For completeness, we also report the results of our runtime measurements for Scenario A and Scenario B in Table 3. It can be seen there that the average runtimes of 19.8 ms and 24.5 ms in these scenarios are quite similar. However, a few frames of both recordings were affected by pulsating noise due to interference between the Azure Kinect and Optitrack (see Fig. 10). As a result, many lattice candidates were detected in these frames, and although they were correctly rejected, the maximum runtime was abnormally high.

4.5 Precision and recall

Another quality metric for registration methods is the robustness of the detection of the target object in the images, which can be measured by the well-known classification scores precision (defined as \(\text {PPV} = \frac{\text {TP}}{\text {TP} + \text {FP}}\)) and recall (defined as \(\text {TPR} = \frac{\text {TP}}{\text {TP} + \text {FN}}\)). In our case, precision gives the percentage of lattice detections that were correct, while recall describes the percentage of actually visible lattices (i.e., lattices in the camera’s field of view) that our algorithm detected. In Scenario C, we ensured that the lattice was fully visible in the camera’s field of view at all times; hence, there are no true negatives in this case.

Our results can be found in Table 4.

Table 3 Runtimes of our lattice detection algorithm in different scenarios (without parallelization)
Table 4 Precision and recall of the lattice detection algorithm in Scenario C

4.6 Limitations

Fig. 15 Since the lattice can be detected from both sides, our registration method can be used in most applications despite the limited detection at some angles (see Sect. 4.3). As a rough rule of thumb, if the angle between the lines of sight of two cameras is greater than 90 deg, the lattice should be positioned such that the cameras see different sides of the lattice (a), while for camera angles less than 90 deg, the lattice should be held so that both sensors see the same side of the lattice (b)

We observed that with the Azure Kinect, some of the holes of the lattice are occasionally invisible in the original depth image. In our experiments, we found that this seems to depend on the background behind the lattice (e.g., surface normal and reflectivity) and primarily occurs when the distance to the background is greater than the Azure Kinect’s working range (we used the NFOV unbinned mode, which has a working range of 0.5–3.86 m; see Footnote 4). We suspect this may be related to a filter in the Azure Kinect or the Azure Kinect SDK that appears to bridge areas between invalid pixels. In scenario B, this effect likely had a significant impact on the number of false-negative detections and, consequently, on the recall, due to the distant background (partially \(>6\,\text {m}\)), while the effect was nearly negligible in scenarios A and C. Although this effect affected neither the registration success nor the accuracy in our scenarios, there might be special application areas where registration with our method could be difficult when the Azure Kinect is operated outside its working range.

As shown in the experiment regarding rotational robustness (Sect. 4.3), our method can detect the lattice reliably only if the angle between the lattice normal and the camera viewing direction is smaller than about 55 deg. However, since our method is robust against viewing the lattice from the front or from the back, this is only a minor limitation for most applications, as Fig. 15 shows.

Finally, probably the most obvious, yet very minor, limitation is the fact that our lattice needs to be held with one or two hands on only one side while the registration is performed.

5 Virtual world registration

Some applications require not only registration between depth sensors, but also registration of these depth sensors with a virtual world. For example, in VR applications, it might be required to stream a point cloud into the virtual world for use as a user’s avatar or so that real objects can cast shadows from virtual lights (see Fig. 16).

To perform such a registration, we need to find correspondences between virtual world space and one of the depth sensors. To do so, we designed a physical bracket to which our lattice and two motion controllers can be attached. This allows moving the lattice and the motion controllers simultaneously while maintaining a fixed relative distance between them (see Fig. 17). We manually measured the transformations from the lattice to both motion controllers, \(T_\text {RightController}\) and \(T_\text {LeftController}\), respectively. Given those, we can derive the position of the controllers in camera space. Since their position in virtual world space is also given (by the VR system’s tracking), their positions in both reference frames can be collected simultaneously over multiple frames, which can then be used as corresponding point pairs. Thus, the virtual world and the camera reference frames can be registered using the SVD-based transformation estimation of the PCL [27].
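The per-frame collection of corresponding point pairs could look like the following sketch, where `latticePoseCam` is assumed to be the detected lattice pose in camera space, the lattice-to-controller transforms are assumed to express the controller poses in the lattice frame, and the controller world positions come from the VR system's tracking; the names are illustrative.

```cpp
#include <Eigen/Geometry>
#include <vector>

struct PointPair { Eigen::Vector3f cameraSpace, worldSpace; };

void collectFrame(const Eigen::Isometry3f& latticePoseCam,      // lattice pose in camera space
                  const Eigen::Isometry3f& T_RightController,   // right controller pose in lattice frame
                  const Eigen::Isometry3f& T_LeftController,    // left controller pose in lattice frame
                  const Eigen::Vector3f& rightWorldPos,         // from the VR tracking
                  const Eigen::Vector3f& leftWorldPos,
                  std::vector<PointPair>& pairs)
{
  // Controller positions expressed in the depth camera's reference frame.
  pairs.push_back({(latticePoseCam * T_RightController).translation(), rightWorldPos});
  pairs.push_back({(latticePoseCam * T_LeftController).translation(),  leftWorldPos});
  // After enough frames, the pairs are fed to the SVD-based estimation (Sect. 3.5).
}
```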

Fig. 16 Rendering of the point clouds of two Kinects registered into a virtual scene, seen from a first-person view in an HMD (a) and from a third-person perspective (b)

Fig. 17 The lattice with the motion controllers of the HP Reverb G2 attached using a bracket, allowing for precise registration of depth sensors with the virtual scene

Instead of manually measuring the transformation between the motion controllers and the lattice, it is also possible to estimate this transform from the rigid motions of both the controllers and the lattice. This problem is known as hand-eye calibration and simply requires recording multiple poses of the motion controllers and the lattice. As a feasibility check, we performed this calibration offline using the method of [29]. Instead of only calculating the transformation between controller and lattice, we could also directly compute the world-to-camera-space transformation. As the norm of this transformation is typically larger, though, a small error in the lattice orientation leads to an overall higher calibration error.
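As an illustration (not necessarily the method of [29]), such a hand-eye calibration could also be set up with OpenCV's generic solver: mapping "gripper \(\rightarrow \) base" to the controller poses in world space and "target \(\rightarrow \) cam" to the camera pose expressed in the lattice frame (i.e., the inverse of the detected lattice pose), the solver returns the fixed lattice-to-controller transform. The variable names below are illustrative.

```cpp
#include <opencv2/calib3d.hpp>
#include <vector>

// Estimate the fixed lattice-to-controller transform from paired controller poses
// (in world space) and inverted lattice poses (camera pose in the lattice frame).
cv::Mat latticeToController(const std::vector<cv::Mat>& R_controller2world,
                            const std::vector<cv::Mat>& t_controller2world,
                            const std::vector<cv::Mat>& R_cam2lattice,
                            const std::vector<cv::Mat>& t_cam2lattice)
{
  cv::Mat R, t;
  cv::calibrateHandEye(R_controller2world, t_controller2world,
                       R_cam2lattice, t_cam2lattice, R, t,
                       cv::CALIB_HAND_EYE_TSAI);
  cv::Mat T = cv::Mat::eye(4, 4, CV_64F);
  R.copyTo(T(cv::Rect(0, 0, 3, 3)));
  t.copyTo(T(cv::Rect(3, 0, 1, 3)));
  return T;  // 4x4 rigid transform mapping lattice coordinates to controller coordinates
}
```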

6 Source code

We provide two implementations of our registration procedure. First, to register arbitrary depth sensors with each other, we provide a new small C++ library, which can be found at https://gitlab.informatik.uni-bremen.de/cgvr_public/lattice_registration_library. Second, we provide a slightly larger Unreal Engine 4 project that allows for registration of multiple Microsoft Azure Kinects with each other as well as with a virtual world: https://gitlab.informatik.uni-bremen.de/cgvr_public/lattice_based_registration_ue4.

7 Conclusion

We presented a novel approach for the registration (extrinsic calibration) of depth sensors based exclusively on depth data. As a registration target, we designed a lattice-like board with regularly spaced holes which are visible in the depth image. More importantly, we developed an algorithm that can detect such boards reliably and accurately in depth images and is very easy to implement.

In our test scenarios, which we performed using a Microsoft Azure Kinect under real-world conditions, we achieved a precision of more than 0.99 with our lattice detection algorithm. At the same time, the lattice detection has an average running time of roughly 20 ms per frame on a single core of an AMD Ryzen 9 3900X. Using the features of the detected lattices for registration, we measured an average registration error of 1.6 mm between two point clouds in the middle of the registration volume at a capture distance of approximately 2 m from the sensors.

We provide an open, small C++ library that can be used to register any kind of depth sensors. Furthermore, we make an Unreal Engine 4 project available that is capable of registering depth sensors with a virtual world; it is demonstrated with the Microsoft Azure Kinect but can be adapted easily to other depth cameras as well.

In future work, our method could be made even more accurate by integrating the work of Deng et al. [30], which is able to compensate for slight distortions of the depth camera within a larger volume. Additionally, one could implement the optimization presented by Beck et al. [12] to obtain better registration results with hardware that is not synchronized in time. Finally, it could be useful to extend the method to optionally detect the lattice in an IR or color image as well. This might improve the accuracy a bit further, e.g., if one wants to register depth-only sensors or LiDAR sensors, which do not provide an infrared or color image, with RGB-D sensors, which do provide such an additional image.