To investigate the accuracy, reliability, and runtime performance (see footnote 2) of our registration procedure, we designed and conducted several experiments, which we present in this section. In all experiments, the Microsoft Azure Kinect was used as the depth sensor. The following is a list of the experiments we conducted, where experiments A and B concern the whole registration procedure, while experiments C, D and E refer to the lattice detection in particular:
- (A) Accuracy Measurement and Comparison: We present an experimental setup that allows for very accurate determination of the distance error of a point as well as an angular error between two registered depth sensors. We compare our lattice-based registration procedure with three variants of the well-known checkerboard registration procedure. Both depth sensors were synchronized in time, so that two matched frames were always recorded with a time offset of exactly 160 \(\upmu \)s.
- (B) Common Coordinate System Registration: We register a depth sensor into a ground truth coordinate system given by Optitrack, a high-precision optical tracking system. The pose of the lattice is tracked both by Optitrack and by the depth sensor. The purpose of this experiment is also to determine the accuracy of the registration procedure, but in a different application—namely, the registration of a depth sensor with a third system. Major differences to experiment A are that (a) only the lattice center is taken as the point correspondence instead of all hole centers, since Optitrack is not able to detect these, (b) there is no exact time synchronization of the frames between the two systems, and (c) the error is determined not only over one very accurate correspondence point, but over many correspondence points in a larger volume.
- (C) Rotational Robustness: In this experiment, we rotate the lattice while its center position is fixed. This allows us to determine up to which angle the lattice is still detected by our method and to detect possible systematic errors that depend on the angle of rotation. In this way, we can estimate whether our registration procedure is reliable in all situations (e.g., when the angle between the viewing directions of the two depth sensors is very large) and whether higher errors are to be expected there than those we obtained in Experiments A and B.
- (D) Runtime Performance: We investigate the runtime performance of the lattice detection depending on the lattice distance, since the lattice detection takes by far the largest part of the runtime (see footnote 2).
- (E) Precision and Recall: The reliability of our registration method essentially depends on how reliably the lattice, including its hole centers, is found. Therefore, in this experiment, for three different scenarios, we look at how often the lattice was correctly detected when our algorithm detected something (Precision) and how often the lattice was correctly detected when a lattice should have been visible (Recall).
Accuracy measurement and comparison
With this experiment, we examine the accuracy of our lattice registration procedure and compare it to the accuracy of conventional checkerboard registration procedures.
The following registration procedures are evaluated:
- Checkerboard (RGB): We capture a moving checkerboard (with 8x8 inner corners) in the RGB image, detect its corners with OpenCV’s checkerboard corner detection by Duda et al. [3] in image space, and then use OpenCV’s stereoCalibrate function [28] to obtain a registration for the RGB sensors.
- Checkerboard (IR): We capture a moving checkerboard (with 8x8 inner corners) in the infrared image, detect its corners with OpenCV’s checkerboard corner detection by Duda et al. [3] in image space, and then use OpenCV’s stereoCalibrate function [28] to obtain a registration for the infrared/depth sensors.
- Checkerboard (IR+D): We capture a moving checkerboard (with 8x8 inner corners) in the infrared image, detect its corners with OpenCV’s checkerboard corner detection by Duda et al. [3] in image space, use the corresponding 3D points from the depth image, and apply the correspondence rejection and SVD-based transformation estimation implemented in PCL [27], which is also used by our method.
- Lattice (D): We capture a moving lattice in the point cloud given by the depth image, detect it using our lattice detection, and perform the registration based on the correspondence rejection and SVD-based transformation estimation implemented in PCL [27]; a minimal sketch of this estimation step is given below.
Note that the first three checkerboard registration methods (RGB, IR and IR+D) are well-known approaches in the community that we compare to our lattice-based registration (D).
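For illustration, the following sketch shows how such a registration can be computed with PCL from matched 3D points (hole centers or checkerboard corners): one-to-one correspondences are built, outliers are removed by a correspondence rejector, and the rigid transformation is estimated in closed form via SVD. The choice of the sample-consensus rejector, the threshold values, and the function name estimateRegistration are illustrative assumptions; the paper only states that PCL's correspondence rejection and SVD-based transformation estimation are used.

```cpp
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/correspondence.h>
#include <pcl/registration/correspondence_rejection_sample_consensus.h>
#include <pcl/registration/transformation_estimation_svd.h>
#include <Eigen/Core>

// Sketch: estimate the rigid transform that maps points of sensor B onto sensor A.
// The i-th point of 'src' (sensor B) is assumed to correspond to the i-th point of 'tgt' (sensor A).
Eigen::Matrix4f estimateRegistration(
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& src,   // matched 3D points, sensor B
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& tgt)   // matched 3D points, sensor A
{
  // Build one-to-one correspondences from the matched detections.
  pcl::CorrespondencesPtr all(new pcl::Correspondences);
  for (std::size_t i = 0; i < src->size(); ++i)
    all->push_back(pcl::Correspondence(static_cast<int>(i), static_cast<int>(i), 0.0f));

  // RANSAC-based correspondence rejection (rejector choice and threshold are assumptions).
  pcl::registration::CorrespondenceRejectorSampleConsensus<pcl::PointXYZ> rejector;
  rejector.setInputSource(src);
  rejector.setInputTarget(tgt);
  rejector.setInlierThreshold(0.01);   // 1 cm, illustrative value
  rejector.setMaximumIterations(1000);
  pcl::Correspondences inliers;
  rejector.getRemainingCorrespondences(*all, inliers);

  // Closed-form SVD estimation of the rigid transformation over the inliers.
  pcl::registration::TransformationEstimationSVD<pcl::PointXYZ, pcl::PointXYZ> svd;
  Eigen::Matrix4f T = Eigen::Matrix4f::Identity();
  svd.estimateRigidTransformation(*src, *tgt, inliers, T);
  return T;  // maps coordinates of sensor B into the frame of sensor A
}
```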
Our experiment consists of 10 runs, for each of which we make (a) a recording with a moving lattice target, (b) a recording with a moving checkerboard target, and (c) a recording of a stationary whiteboard with a checkerboard pattern at its center (see Fig. 6). In each run, recordings (a) and (b) are used to perform a registration, while recording (c) is an evaluation recording used to determine a point correspondence between the cameras’ coordinate systems very precisely for distance error measurement. After each run, we slightly changed the position and orientation of both Azure Kinects to introduce some variation while maintaining a distance of approximately 1.5–2.5 m to the center of the registration volume. Furthermore, we alternated whether the lattice recording (a) or the checkerboard recording (b) was made first. Both Azure Kinects were synchronized in time to avoid errors caused by the moving registration target being captured at noticeably different positions in matched frames of the two sensors. However, to avoid interference between multiple Azure Kinects, the second Azure Kinect was delayed by 160 \(\upmu \)s as recommended by the manufacturer, which is negligible with regard to the expected error.
To obtain a point correspondence for error measurement in recording (c), which is needed to determine the distance error, we proceed as follows:
(1) We detect the checkerboard pattern glued to the whiteboard using OpenCV’s checkerboard corner detection in the infrared image. According to [3], this yields subpixel accuracy.
(2) Since, in the case of the Azure Kinect, the depth and IR images are generated by the same optics and sensor, we obtain a 3D point for each corner detected in the IR image and average over all 3D corner points to get the mid-point of the checkerboard.
(3) While the 3D mid-point is assumed to be very precise along the axes in image space, there may be small deviations along the depth axis due to the alternating colors of the checkerboard fields (see Fig. 7a). To control for that, we fit a plane to the 3D points within a rectangle centered on the checkerboard (see Fig. 7b). The previously calculated mid-point is then projected onto this plane, and the projected point is finally used to estimate the error between both point clouds (a minimal sketch of this step is given below).
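The sketch below illustrates steps (2) and (3). The plane fit via a principal-component analysis of the rectangle points and the function name refineMidPoint are illustrative assumptions; the paper does not specify which plane-fitting method is used.

```cpp
#include <Eigen/Dense>
#include <vector>

// Sketch of steps (2) and (3): average the 3D checkerboard corners to get a
// mid-point, fit a plane to the depth points of a rectangle centered on the
// checkerboard, and project the mid-point onto that plane.
Eigen::Vector3d refineMidPoint(
    const std::vector<Eigen::Vector3d>& corners3d,     // 3D corner points, step (2)
    const std::vector<Eigen::Vector3d>& planePoints)   // rectangle of depth points, step (3)
{
  // Step (2): mid-point as the mean of all 3D corner points.
  Eigen::Vector3d mid = Eigen::Vector3d::Zero();
  for (const auto& p : corners3d) mid += p;
  mid /= static_cast<double>(corners3d.size());

  // Step (3): least-squares plane through the rectangle points via PCA.
  Eigen::Vector3d centroid = Eigen::Vector3d::Zero();
  for (const auto& p : planePoints) centroid += p;
  centroid /= static_cast<double>(planePoints.size());

  Eigen::Matrix3d cov = Eigen::Matrix3d::Zero();
  for (const auto& p : planePoints) {
    const Eigen::Vector3d d = p - centroid;
    cov += d * d.transpose();
  }
  // The plane normal is the eigenvector belonging to the smallest eigenvalue.
  Eigen::SelfAdjointEigenSolver<Eigen::Matrix3d> eig(cov);
  const Eigen::Vector3d normal = eig.eigenvectors().col(0);

  // Project the mid-point onto the fitted plane.
  return mid - normal * normal.dot(mid - centroid);
}
```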
Note that the registration transformation obtained by the Checkerboard (RGB) procedure differs somewhat from those obtained by the other procedures. This should be taken into account when comparing the 3D accuracy of the different calibration procedures. The reason is that the Azure Kinect (and potentially many other RGB-D cameras) has two separate sensors: one for both the IR and depth image, and another for the RGB image. In order to transform 3D points of Kinect B into the reference frame of Kinect A, the following concatenation of transformations has to be used in the case of the RGB calibration procedure:
$$\begin{aligned} T_{D_A \leftarrow D_B} = T_{D_A \leftarrow C_A} \cdot T_{C_A \leftarrow C_B} \cdot T_{C_B \leftarrow D_B} \end{aligned}$$
(5)
where
- \(T_{D_A \leftarrow D_B}\) denotes the transformation from the depth sensor of Kinect B to the depth sensor of Kinect A,
- \(T_{C_B \leftarrow D_B}\) denotes the transformation from Kinect B’s depth to its color sensor (given by the factory calibration),
- \(T_{C_A \leftarrow C_B}\) denotes the transformation from Kinect B’s color sensor to Kinect A’s color sensor (known from the checkerboard registration),
- \(T_{D_A \leftarrow C_A}\) is the transformation from Kinect A’s color to its depth sensor.
Obviously, the error of the Checkerboard (RGB) registration procedure using the RGB sensors and the error of the factory calibration between depth and color sensors accumulate. Therefore, it is to be expected that the Checkerboard (RGB) registration has a higher error than the other registration procedures. This is verified by the results of our experiments (see Table 1).
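For illustration, Eq. (5) corresponds to a simple product of homogeneous transformations, e.g. expressed with Eigen; the function name and types below are illustrative, with variable names mirroring the symbols above.

```cpp
#include <Eigen/Geometry>

// Sketch of Eq. (5): map 3D points from Kinect B's depth frame into Kinect A's
// depth frame when the extrinsics were calibrated between the RGB (color) sensors.
Eigen::Isometry3d depthBToDepthA(
    const Eigen::Isometry3d& T_DA_CA,   // Kinect A: color -> depth (factory calibration)
    const Eigen::Isometry3d& T_CA_CB,   // Kinect B color -> Kinect A color (checkerboard registration)
    const Eigen::Isometry3d& T_CB_DB)   // Kinect B: depth -> color (factory calibration)
{
  return T_DA_CA * T_CA_CB * T_CB_DB;   // T_{D_A <- D_B}
}

// Usage: Eigen::Vector3d p_in_A = depthBToDepthA(T_DA_CA, T_CA_CB, T_CB_DB) * p_in_B;
```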
Table 1 Results of single registration runs. Error (mm) is the distance error measured at the correspondence point, while Error (deg) is measured as the angle between the fitted plane normals. The value of the method that performed best in the respective run is marked in bold
In all runs, the lattice was detected by both sensors in an average of 87.2 frames (SD: 32.0) of recording (a). In the RGB image of recording (b), the checkerboard was detected in 108.3 frames (SD: 39.7) on average, and in the IR image in 87.4 frames (SD: 39.0) on average. Note that recordings (a) and (b) are not identical—individual recordings of the checkerboard methods and the lattice method can therefore not be directly compared. Furthermore, the difference in the number of detected checkerboards in the RGB image and the IR image, both using recording (b), is due to the uneven brightness of the checkerboard at different distances in the IR image, so that the checkerboard was not detected in some cases in the IR image.
The results (see Table 1) show that the average distance error of our Lattice (D) method (which only requires the depth image) is 1.6 mm and the average angular error is 0.17 deg. The Checkerboard (IR+D) method performs considerably better with an average distance error of 0.7 mm and an average angular error of 0.08 deg, but requires both the infrared image and the depth image. Both our Lattice (D) method and the Checkerboard (IR+D) method perform significantly better than the Checkerboard (IR) method, with an average error of 6.4 mm, and the Checkerboard (RGB) method, with an average error of 15.4 mm.
Since the Checkerboard (IR+D) method combines both the depth image and the infrared image as input, it is expected to be more accurate than the Lattice (D) method, which uses only the noisy depth image as input. However, even though the average error of the Checkerboard (IR+D) method is smaller than that of the Lattice (D) method, both errors are very small in absolute terms considering the accuracy of depth sensors, which are much noisier and suffer from distortions at edges and the flying-pixel effect. Furthermore, one has to take into account that the lattice in the experiment has a thickness of 4 mm (two layers of 2 mm thick bars) and was built with a precision of about 1–2 mm. The motions of the lattice in the recordings, as well as the higher standard deviation in the X-direction for Lattice (D) in Fig. 8, indicate that the thickness of the lattice may have mattered. Thinner and more precisely built lattices could improve the result of the Lattice (D) method.
Figure 8 shows that the Checkerboard (RGB) and Checkerboard (IR) procedures, in particular, exhibit a systematic error that could be caused by the factory calibration of the Kinect, whose parameters are used in OpenCV’s stereo calibration methods. In our experimental setup, the distance between two adjacent pixels in the registration volume is about 4 mm in physical space, so even small errors in the intrinsic calibration could have a significant impact on the distance error in both cases. Moreover, these two methods do not use depth information, so a possible systematic offset in the depth direction is not corrected by them. In the case of the Checkerboard (RGB) method, as described above, the errors in the transformations between the separate depth and RGB sensors also add up, leading to expected higher errors compared to the three other methods.
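For intuition, the quoted pixel spacing of roughly 4 mm is consistent with nominal values for the Azure Kinect’s NFOV unbinned depth mode (assumed here: about a 75 deg horizontal field of view over 640 columns) at a mid-range sensor distance of \(d \approx 1.75\) m:
$$\begin{aligned} \Delta x \approx \frac{2\,d\,\tan (\text {FOV}_h/2)}{N_\text {cols}} = \frac{2 \cdot 1.75\,\text {m} \cdot \tan (37.5^{\circ })}{640} \approx 4.2\,\text {mm} \end{aligned}$$
Under these assumed values, an intrinsic error of even a fraction of a pixel already corresponds to a millimeter-scale offset in physical space.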
Finally, this experiment shows that our method is a valuable alternative registration method in cases where no IR or color image is available and that, given the accuracy of depth sensors, it is also very accurate in terms of absolute error values.
Registration into a common coordinate system
In the previous section, we looked at the accuracy of a registration between two depth sensors using our registration method. In this section, we examine the accuracy with which we can register a depth sensor into Optitrack’s coordinate system using our method. To do so, we tracked the lattice using both Optitrack and a Microsoft Azure Kinect combined with our detection algorithm. To track the lattice with Optitrack, we attached seven markers to the lattice to achieve sufficient accuracy. These Optitrack markers are visible in the Azure Kinect depth image and were picked up by our method as hole-center candidates. However, our heuristic generally classified them as incorrect hole centers, so they had no noticeable effect on the precision of the lattice detection.
Since Optitrack and the Kinect both use infrared light at a wavelength of 850 nm, there was occasionally a pulsating noise throughout the depth image (see Fig. 10). We ran Optitrack at 30 fps (almost the same frame rate at which the Azure Kinect recorded), as the pulsating noise was least likely to show up this way. The pulsating noise caused a greatly increased runtime of the algorithm in the affected frames, since many lattice candidates were detected in flat background objects. However, all these false lattice candidates were successfully discarded by our detection algorithm.
We performed the evaluation in front of two different backgrounds (see Fig. 9), hereafter also called scenarios, while slowly moving the lattice. In scenario B, the distance between the sensor and the lattice was between 1.0 m and 1.95 m, while the center of the lattice stayed within a volume of about 0.6 m\(^3\) (see footnote 3). In scenario A, the distance between the sensor and the lattice ranged from 0.85 m to 2.05 m, with the center of the lattice in a volume of about 1.2 m\(^3\) (the volume covered by the entire lattice was about 2.2 m\(^3\)).
Table 2 Mean error between ground truth lattice center and detected lattice center after registration
Using the lattice centers as point correspondences, we registered the Microsoft Azure Kinect’s point cloud into Optitrack’s coordinate system. We then measured the deviation of the registered center point from the center point detected by Optitrack. In both scenarios, we observed quite similar errors averaging only 3.83 mm to 4.40 mm (see Table 2).
Note that, compared to experiment A (see Sect. 4.1), instead of just measuring the distance error of one point that was located in the center of the registration volume and was smoothed over time, we captured multiple correspondences in a specific volume that were still affected by the typical noise of the Azure Kinect. Additionally, Optitrack and the Kinect were not synchronized in time. Therefore, for each Kinect frame, we searched for the Optitrack frame closest in time to find point correspondences. Although we set both Optitrack and the Kinect to 30 fps, they did not run at exactly the same speed. The closest frames in time between Optitrack and the Kinect were always time-shifted by 0 to about 1/60 second, averaging 1/120 second. With an average movement speed of 18.8 cm per second in scenario A, this gives an expected error of \(0.0083\,\text {s} \cdot 18.8\,\text {cm/s} = 1.56\,\text {mm}\). The average error of the lattice detection by Optitrack was reported by Optitrack’s Motive software as 0.7 mm.
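The frame matching can be sketched as a nearest-timestamp lookup; the function name and the use of a sorted timestamp list are illustrative assumptions, as the paper only states that the closest Optitrack frame in time is selected for each Kinect frame.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: for a Kinect frame timestamp, find the index of the Optitrack frame
// whose timestamp is closest in time. 'optitrackTimes' must be sorted ascending.
std::size_t closestOptitrackFrame(const std::vector<int64_t>& optitrackTimes,
                                  int64_t kinectTime)
{
  auto it = std::lower_bound(optitrackTimes.begin(), optitrackTimes.end(), kinectTime);
  if (it == optitrackTimes.begin()) return 0;
  if (it == optitrackTimes.end()) return optitrackTimes.size() - 1;
  // Compare the neighbors before and after the insertion point.
  auto prev = std::prev(it);
  return (kinectTime - *prev <= *it - kinectTime)
             ? static_cast<std::size_t>(prev - optitrackTimes.begin())
             : static_cast<std::size_t>(it - optitrackTimes.begin());
}
```

With both streams running at about 30 fps, the resulting time shift lies between 0 and about 1/60 second, which matches the offsets reported above.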
Rotational robustness
With this experiment, we examined the robustness of our method with respect to the angle between the line of sight of the camera and the normal of the lattice. It is to be expected that at grazing angles (i.e., when the angle between the line of sight and the lattice normal approaches 90 degrees), our registration procedure will fail.
We used a thin thread to hang the lattice as symmetrically as possible between two tripods, leaving one degree of freedom (see Fig. 11). In the experiment, the lattice then slowly rotated around its y-axis in the range \([-90,90]\) degrees. Assuming the rotation axis accurately passes through the lattice’s real center, the position of the detected lattice center is not expected to change regardless of the orientation of the lattice.
After recording the lattice at a distance of about 1.5 m, we obtained an average deviation from the mean center along the x-axis of 0.9 mm (SD: 1.0 mm), along the y-axis of 0.4 mm (SD: 0.3 mm), and along the z-axis of 1.8 mm (SD: 1.3 mm) over a range of \(-57.0\text {\,deg}\) to \(59.7\text {\,deg}\) (see Fig. 12)—at larger angles the lattice was no longer detected. As expected, the deviation along the y-axis (upward axis) was very small. The slightly higher deviation along the z-axis compared to the x-axis could, at least in part, be due to the expected error of the Azure Kinect depth values (which mainly affects the z-value). Also, our lattice has a thickness of 4 mm and was built with an accuracy of only about 1–2 mm.
Runtime performance
We expect the runtime of the lattice detection to depend on the distance between the lattice and the sensor since fewer points of the point cloud have to be processed if the lattice is further away. Therefore, we created a recording in which we moved the lattice back and forth at a distance of about 0.9 m to 3.9 m (see Fig. 13).
In Scenario C, we observed an average runtime of 19.2 ms (SD: 6.4 ms) per processed frame for the lattice detection on a single core of an AMD Ryzen 9 3900X processor. For only 198 of the 4022 considered frames (4.9%), the lattice detection took more than 33.33 ms, while the maximum runtime in this scenario was 60.5 ms. As expected, the runtime clearly depends on the distance of the lattice to the sensor, see Fig. 14.
For completeness, we also report the results of our runtime measurements for Scenario A and Scenario B in Table 3. There it can be seen that the average runtimes of 19.8 ms and 24.5 ms in these scenarios are quite similar. However, a few frames of both recordings were affected by pulsating noise due to interference between the Azure Kinect and Optitrack (see Fig. 10). As a result, many lattice candidates were detected in these frames, and although they were correctly rejected, the maximum runtime was abnormally high.
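Per-frame runtimes of this kind can be collected with a simple wall-clock wrapper around the detection call; detectLattice stands in for the detection routine and, like the other names here, is a hypothetical placeholder used for illustration only.

```cpp
#include <chrono>
#include <vector>

// Sketch: measure the wall-clock runtime of the (single-threaded) lattice
// detection per frame in milliseconds. detectLattice(frame) is a placeholder.
template <typename Frame, typename DetectFn>
std::vector<double> measureRuntimesMs(const std::vector<Frame>& frames, DetectFn detectLattice)
{
  std::vector<double> runtimesMs;
  runtimesMs.reserve(frames.size());
  for (const auto& frame : frames) {
    const auto t0 = std::chrono::steady_clock::now();
    detectLattice(frame);  // runs on a single core, without parallelization
    const auto t1 = std::chrono::steady_clock::now();
    runtimesMs.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
  }
  return runtimesMs;
}
```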
Precision and recall
Another quality metric for registration methods is the robustness of the detection of the target object in the images, which can be measured by the well-known classification scores precision (defined as \(\text {PPV} = \frac{\text {TP}}{\text {TP} + \text {FP}}\)) and recall (defined as \(\text {TPR} = \frac{\text {TP}}{\text {TP} + \text {FN}}\)). In our case, precision gives the percentage of lattice detections that were correct, while recall gives the percentage of frames with a visible, actual target (i.e., in the camera’s field of view) in which our algorithm correctly detected the lattice. In Scenario C, we ensured that the lattice was fully visible in the camera’s field of view at all times; hence, there are no true negatives in this case.
Our results can be found in Table 4.
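As a minimal illustration, both scores can be computed from per-frame counts as follows; the function and variable names are illustrative.

```cpp
#include <utility>

// Sketch: precision and recall from per-frame counts.
// TP: frames in which the visible lattice was correctly detected,
// FP: frames in which something else was reported as a lattice,
// FN: frames with a visible lattice but no (correct) detection.
std::pair<double, double> precisionRecall(int tp, int fp, int fn)
{
  const double precision = static_cast<double>(tp) / (tp + fp);
  const double recall    = static_cast<double>(tp) / (tp + fn);
  return {precision, recall};
}
```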
Table 3 Runtimes of our lattice detection algorithm in different scenarios (without parallelization)
Table 4 Precision and recall of the lattice detection algorithm in Scenario C
Limitations
We observed that with the Azure Kinect, some of the holes of the lattice are occasionally invisible in the original depth image. In our experiments, we found that this seems to depend on the background behind the lattice (e.g., its surface normal and reflectivity) and primarily occurs when the distance to the background is greater than the Azure Kinect’s working range (we used the NFOV unbinned mode, which has a working range of 0.5–3.86 m; see footnote 4). We suspect this may be related to a filter in the Azure Kinect or the Azure Kinect SDK that appears to bridge areas between invalid pixels. In scenario B, this effect likely had a significant impact on the number of false-negative detections and, consequently, the recall, due to the distant background (partially \(>6\) m away), while the effect was nearly negligible in scenarios A and C. Although this effect affected neither the registration success nor the accuracy in our scenarios, there might be special application areas where registration with our method could be difficult when the Azure Kinect is operated outside its working range.
As shown in the experiment regarding rotational robustness, our method can detect the lattice reliably only if the angle between the lattice normal and the camera viewing direction is smaller than about 55 deg. However, since our method works regardless of whether the lattice is viewed from the front or from the back, this is only a minor limitation for most applications, as Fig. 15 shows.
Finally, probably the most obvious, yet very minor, limitation is that our lattice needs to be held with one or two hands on only one side while the registration is performed.