# Robust and Practical Depth Map Fusion for Time-of-Flight Cameras

## Abstract

Fusion of overlapping depth maps is an important part of many 3D reconstruction pipelines. Ideally, fusion robustly produces an accurate and nonredundant point cloud even from noisy and partially poorly registered depth maps. In this paper, we improve an existing fusion algorithm towards this ideal. Our method builds a nonredundant point cloud from a sequence of depth maps so that new measurements are either added to the existing point cloud, if they fall in an area which is not yet covered, or used to refine the existing points. The method is robust to outliers and erroneous depth measurements as well as to small depth map registration errors caused by inaccurate camera poses. The results show that the method outperforms its predecessor in both accuracy and robustness.

## Keywords

Depth map merging, RGB-D reconstruction

## 1 Introduction

Merging partially overlapping depth maps into a single point cloud is an essential part of every depth map based 3-dimensional (3D) reconstruction software. A simple registration of depth maps may lead to a huge number of redundant points even with relatively small objects, which makes further processing very slow.

The number of points could be reduced afterwards by simplifying the cloud, but it is more reasonable to aim directly at a nonredundant point cloud. This saves both time and memory.

In this paper, we further develop a method which merges a sequence of depth maps into a single nonredundant point cloud [7]. The method takes the measurement accuracy of the obtained depths into account and merges nearby depth measurements into a single point in 3D space by giving more weight to the more certain measurement. Thus, only those points that do not have other neighbouring points are added to the cloud. The proposed method significantly reduces the number of outliers in the depth maps and rejects incorrectly measured or badly registered points.

## 2 Related Work and Our Contributions

Fusion of depth maps for 3D reconstruction has been studied widely in recent years [4, 9, 11, 18, 21]. The work most relevant to ours is the one presented in [11]. There, the authors proposed a depth map fusion method which is capable of building 3D reconstructions from live video in real time. The method is designed for passive stereo depth maps, and thus does not use uncertainties for depth measurements.

Since the release of Microsoft Kinect, interest in real-time reconstruction has grown considerably. These methods mostly represent the models as voxels [1, 14, 16, 17, 19], which means that their resolution is limited by the available memory. This restriction is successfully avoided in [14], but that method is designed for live video and therefore may not work as well with wide baseline depth maps. Choi et al. [1] have also achieved impressive results recently. In their method, loop closures play a significant role, which has to be taken into account when capturing the data. The voxel based approach is also used in [3] for merging depth maps with multiple scales, but there the depth maps were acquired with a range scanner or with a multi-view stereo system.

Kyöstilä et al. proposed a method where the point cloud is created iteratively from a sequence of depth maps so that the added depth maps do not increase the redundancy of the cloud [7]. That is, starting with a point cloud back projected from a single depth map, the method either adds new points to the cloud from other depth maps, if they are in an area which has not yet been covered by other points, or uses the new measurements to refine the existing points. The refinement merges nearby points by giving more weight to measurements that have lower empirical, depth dependent variances.
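The weighting idea in the refinement step can be illustrated with a minimal sketch. It assumes the textbook inverse-covariance (minimum-variance) combination of two measurements; this is not necessarily the exact update rule of [7], and the function name `merge_measurements` and the numeric covariances are hypothetical.

```python
import numpy as np

def merge_measurements(p1, C1, p2, C2):
    """Fuse two 3D points by inverse-covariance weighting (illustrative only)."""
    W1 = np.linalg.inv(C1)          # information (inverse covariance) of point 1
    W2 = np.linalg.inv(C2)          # information of point 2
    C = np.linalg.inv(W1 + W2)      # covariance of the fused point
    p = C @ (W1 @ p1 + W2 @ p2)     # the more certain point gets more weight
    return p, C

# Hypothetical example: the new measurement is more certain along z.
p_old = np.array([0.0, 0.0, 2.00])
C_old = np.diag([1e-4, 1e-4, 4e-4])
p_new = np.array([0.0, 0.0, 2.10])
C_new = np.diag([1e-4, 1e-4, 1e-4])

p_fused, C_fused = merge_measurements(p_old, C_old, p_new, C_new)
print(p_fused, np.diag(C_fused))
```

In this example the fused depth lies closer to the new measurement because its variance along z is four times smaller, and the fused variance is smaller than either input variance.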

However, Kyöstilä’s method is mainly designed for merging redundant depth maps and it cannot handle outliers. In addition, the method was designed for the first generation Kinect device (Kinect V1) and thus does not take all the characteristics of the newer Kinect device (Kinect V2) into account. These differences and our solutions are discussed in more detail in Sect. 2.1.

### 2.1 Our Contributions

As described in Sect. 2, the method in [7] cannot handle outliers and does not work properly with Kinect V2. For our method, the most essential difference between the Kinect devices is the depth measuring technique. Kinect V1 calculates the depths using an infrared dot pattern projected into the scene, whereas Kinect V2 is based on the time-of-flight (ToF) technique and derives the depths from the phase shift between the emitted and received infrared signals [15]. Generally, the measurements acquired with Kinect V2 are more accurate, but in certain cases the sensor might receive multiple reflected or scattered signals from the same direction, which might cause significant measurement errors as presented in Fig. 1. This multipath interference problem [13] is not taken into account in [7].
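The effect of multipath interference on a phase-based measurement can be sketched as follows. The sketch assumes the standard continuous-wave ToF relation \(d = c\varphi / (4\pi f)\) and models the received signal as the sum of a direct and one indirect phasor; the modulation frequency, path lengths and indirect amplitude are illustrative values, not Kinect V2 specifications or values from the paper.

```python
import cmath
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s

def depth_to_phase(depth_m, mod_freq_hz):
    """Phase shift accumulated over the round trip to depth_m."""
    return 4.0 * math.pi * mod_freq_hz * depth_m / C_LIGHT

def phase_to_depth(phase_rad, mod_freq_hz):
    """Invert the phase relation: d = c * phi / (4 * pi * f)."""
    return C_LIGHT * phase_rad / (4.0 * math.pi * mod_freq_hz)

def multipath_depth(direct_m, indirect_m, indirect_amp, mod_freq_hz):
    """Depth recovered when a weaker, longer path mixes with the direct one."""
    direct = cmath.rect(1.0, depth_to_phase(direct_m, mod_freq_hz))
    indirect = cmath.rect(indirect_amp, depth_to_phase(indirect_m, mod_freq_hz))
    mixed_phase = cmath.phase(direct + indirect) % (2.0 * math.pi)
    return phase_to_depth(mixed_phase, mod_freq_hz)

F_MOD = 16e6  # hypothetical modulation frequency in Hz
d_mixed = multipath_depth(direct_m=1.0, indirect_m=1.5,
                          indirect_amp=0.3, mod_freq_hz=F_MOD)
print(f"true depth 1.000 m, measured {d_mixed:.3f} m")
```

Under these assumptions the recovered depth is biased towards the longer indirect path, which matches the observation that multipath interference tends to yield too long distances.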

To address these shortcomings, we extend [7] with three components:

1. pre-filtering of depth maps to reduce the number of outliers,
2. an improved uncertainty covariance to compensate for the measurement variances and make the method more accurate, and
3. filtering of the final point cloud based on a simple visibility violation rule to reduce the number of erroneous and badly registered measurements caused by multipath interference [13] and incorrect camera poses, respectively.

The experiments show that the extensions significantly improve the results when compared with [7], which makes the proposed method a potential post-processing step for methods like ORB-SLAM [12] or [2]. In addition, the nonredundant point clouds produced with the proposed method can be further transformed into a mesh, as e.g. in [1, 11], using [6] or [8] for example.

## 3 Method

As presented in Fig. 2, the proposed method takes a set of depth maps and calibrated RGB images with known camera poses as input and outputs a point cloud. The method improves the algorithm described in [7] with three extensions which are marked with darker boxes in Fig. 2. Similarly to [7], our method can be used as a pipeline to process one depth map at a time and therefore the only thing that limits the size of the reconstruction is the available memory for storing the created point cloud.

### 3.1 Pre-filtering of Depth Maps

Typically, backprojected Kinect depth maps (both V1 and V2) have outliers or inaccurate measurements near depth edges and near the corners of the depth image. Usually, their distances to the nearest neighbouring points are well above the average. To remove such measurements from the depth maps, we first calculate a reference curve which describes the average distance from a point to its *n*th nearest neighbour (NN) (\(n=4\) in all our experiments) in 3D space at a certain depth. The left part of Fig. 3 presents the calculation of a reference distance at depth \(d_z\) for one pixel. The final reference distance at depth \(d_z\) is the average of such distances over all pixels. The average distances are calculated for depths from 0.5 m to 4.5 m at 0.1 m intervals, and the reference curve (blue solid line in the right subfigure) is then acquired by fitting a line to these values.
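The construction of the reference curve can be sketched as follows. The sketch assumes that the reference distance at depth \(d_z\) is obtained by backprojecting every pixel of a fronto-parallel plane at that depth through a pinhole model; this is one reading of the description above. The intrinsics, the reduced image size and all function names are hypothetical.

```python
import numpy as np

def nth_nn_distances(points, n):
    """Distance from every point to its n-th nearest neighbour (brute force)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    dist.sort(axis=1)              # column 0 is the point itself (distance 0)
    return dist[:, n]

def backproject_plane(depth, width, height, fx, fy, cx, cy):
    """Backproject all pixels of a constant-depth plane into 3D."""
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = np.full(u.shape, depth, dtype=float)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def reference_curve(n=4, width=32, height=24, fx=365.0, fy=365.0):
    """Fit a line to the mean n-th NN distance at depths 0.5 m to 4.5 m."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    depths = np.arange(0.5, 4.5 + 1e-9, 0.1)
    means = [
        nth_nn_distances(
            backproject_plane(d, width, height, fx, fy, cx, cy), n
        ).mean()
        for d in depths
    ]
    slope, intercept = np.polyfit(depths, means, 1)
    return slope, intercept

slope, intercept = reference_curve()
print(f"reference distance at 2 m: {slope * 2.0 + intercept:.5f} m")
```

A measurement at depth z could then be compared against `slope * z + intercept`; how far above the reference a distance must lie before the measurement is discarded is not specified in this section, so no threshold is assumed here.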

### 3.2 Improved Depth Map Fusion

The actual depth map fusion is based on [7] with two exceptions: (1) the device dependent parameter values were calibrated for Kinect V2 and (2) the orientations of uncertainty ellipsoids were improved to match with the ToF measuring technique. The details are described later in the section.

**C** which determines the location uncertainty of the measurement in the x, y and z directions as depth dependent variances

*z* is the measured depth and \(\lambda _1\), \(\lambda _2\), \(\beta _x\), \(\beta _y\), \(\alpha _2\), \(\alpha _1\) and \(\alpha _0\) are parameters which were calibrated for Kinect V2 using the approach presented in [7].

**C** can be expressed in the world frame with

### 3.3 Post-filtering of the Final Point Cloud

If, in the refinement part of the fusion, at least one of the distances \(d_1\) and \(d_2\) (Eqs. (6) and (7), respectively) is greater than the threshold \(\tau \), the existing measurement is not updated, but the measurements might violate the visibility of each other depending on their locations. To resolve possible visibility violations, we need normals for every point. The normals are estimated from a plane fitted to the *k* nearest neighbours of the point (\(k=50\) in all our experiments) in the original back projected depth map.

In this paper, we consider three alternatives, illustrated in Fig. 4, for how the measurements may be located with respect to each other. In the first case, point **A** occludes point **B**, but they are far away from each other, so this is not a visibility violation. In the second case, point **C** occludes the nearby point **D**, but the normal of measurement **D** does not point towards the half space where the camera under consideration is located, and therefore this is not a visibility violation either. In the third case, point **E** occludes the nearby point **F** whose normal points towards the camera. In this case, there is a visibility violation because it is very unlikely that both of these measurements really exist in the scene. In practice, the points are considered near enough when the distance between them is less than 10% of the depth of the new measurement. This kind of violation may occur due to inaccurate camera poses or calibration, noise, or multipath interference.
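The three cases can be expressed as a small predicate. This is a minimal sketch under stated assumptions: both points are taken to lie on the same line of sight, the caller already knows which point is in front, and the function and parameter names are hypothetical. Only the 10% proximity rule and the normal test come from the text.

```python
import numpy as np

def visibility_violation(cam_center, front, back, back_normal,
                         new_depth, rel_dist=0.10):
    """Return True if `front` occluding `back` counts as a visibility violation.

    front, back : 3D points on the same line of sight, `front` nearer the camera
    back_normal : estimated surface normal of the occluded point
    new_depth   : depth of the new measurement (sets the proximity limit)
    """
    near_enough = np.linalg.norm(front - back) < rel_dist * new_depth
    # The normal faces the camera if it points into the camera's half space.
    faces_camera = np.dot(back_normal, cam_center - back) > 0.0
    return bool(near_enough and faces_camera)

cam = np.zeros(3)  # camera at the origin, looking along +z

# Case A/B: occlusion, but the points are far apart.
case_ab = visibility_violation(cam, np.array([0, 0, 1.0]), np.array([0, 0, 3.0]),
                               np.array([0, 0, -1.0]), new_depth=3.0)
# Case C/D: nearby, but the occluded normal points away from the camera.
case_cd = visibility_violation(cam, np.array([0, 0, 2.0]), np.array([0, 0, 2.1]),
                               np.array([0, 0, 1.0]), new_depth=2.1)
# Case E/F: nearby and the occluded normal faces the camera.
case_ef = visibility_violation(cam, np.array([0, 0, 2.0]), np.array([0, 0, 2.1]),
                               np.array([0, 0, -1.0]), new_depth=2.1)
print(case_ab, case_cd, case_ef)
```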

The post-filtering consists of two parts. The first part is built into the depth map fusion and collects point-wise statistics, which are utilized in the second part that does the actual filtering after the fusion. The statistics are two values which record the number of merges and the number of visibility violations.

That is, if two points that project onto the same pixel are not close enough to be merged together but still violate the visibility of each other in 3D space, either the existing measurement or the new one is probably an outlier or too inaccurate to be added to the final point cloud. If the existing measurement has already been merged with another point more than once, it can be considered more reliable, and the visibility violation value of the new measurement is incremented by one. Otherwise, the reliability is based on an unreliability weight \(w = (1/\cos (\alpha ))^2\), where \(\alpha \) is the angle between the line of sight and the normal of the point. That is, the bigger the angle, the more unreliable the point is, and the violation value of the more unreliable measurement is incremented.
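The bookkeeping described above can be sketched as follows. The `PointStats` record and the function names are hypothetical, and the behaviour when both weights are equal is not specified in the text; this sketch arbitrarily penalises the new measurement in that case.

```python
import math
from dataclasses import dataclass

@dataclass
class PointStats:
    alpha: float          # angle between line of sight and normal, in radians
    merges: int = 0       # how many times the point has been merged
    violations: int = 0   # how many visibility violations were charged to it

def unreliability_weight(alpha):
    """w = (1 / cos(alpha))^2; grows as the surface is seen more obliquely."""
    return (1.0 / math.cos(alpha)) ** 2

def record_violation(existing, new):
    """Charge a detected visibility violation to the less reliable point."""
    if existing.merges > 1:
        # Merged more than once: the existing point is trusted.
        new.violations += 1
    elif unreliability_weight(existing.alpha) > unreliability_weight(new.alpha):
        existing.violations += 1
    else:
        # Tie-breaking towards the new point is an assumption of this sketch.
        new.violations += 1

existing = PointStats(alpha=math.radians(60.0), merges=1)
new = PointStats(alpha=math.radians(10.0))
record_violation(existing, new)
print(existing.violations, new.violations)
```

Here the existing point has been merged only once and is observed at a steeper angle, so the violation is charged to it rather than to the new measurement.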

## 4 Experiments

The experiments were carried out using three data sets captured with Kinect V2: CCorner, Office1 and Office2. The last two are complicated office environments, whereas the first one is a simple concave corner bounded by the floor and two walls. Figure 5 presents a sample image of each data set. The checkerboards in the CCorner data set were used to acquire the poses of the cameras as well as to create a ground truth for quantitative evaluation. The data sets consist of RGB images and depth maps, and they were captured with Kinect by moving the device around the room and holding it still while capturing. The sets were captured so that the depth maps had redundant measurements and sequential RGB images had common areas with rich texture, in order to obtain camera pose estimates that are as accurate as possible, as described below.

The calibration parameters, the depth maps and the sparse point cloud produced by SfM were used to set the scale of the obtained poses to match the metric scale used by the depth sensor of Kinect V2. The poses, the depth maps and the RGB images were then fed to the algorithm pipeline.

The method in [7] was used as a baseline in the evaluations. The results presented in the following sections show step by step how each extension iteratively enhances the results made with the baseline algorithm. In Sect. 4.1, we present the enhancement achieved with pre-filtering. Then, Sect. 4.2 compares the results produced by our method and the baseline extended with the pre-filtering, and finally, in Sect. 4.3 the influence of every extension, including the pre-filtering (PRF), re-aligned covariances (RAC) and post-filtering (POF), is shown by three quantitative analyses.

### 4.1 Depth Map Pre-filtering

### 4.2 Re-Aligned Covariances and Post-filtering

Figure 8 compares the results of the baseline method extended with the pre-filtering and the proposed method. Our method is able to remove the outliers between the backrests of the chairs and the table, as shown in the top part of the figure (green rectangles). As the bottom part of the figure illustrates, the method is also able to remove the incorrect measurements under the table (red dashed ellipses) and the misplaced measurements above it (green solid ellipses). The incorrect measurements below the table suffered from multipath interference via the backrest of the chair, and Kinect obtained too long distances for those measurements (cf. Fig. 1). The misplaced measurements above the table exist due to an inaccurate pose of the camera from which the measurements originate.

### 4.3 Quantitative Analyses

In the last experiment, the methods and extensions were tested against each other with three quantitative analyses. First, Table 1 gives an overview of the sizes of the data sets used and the sizes of the final results. The abbreviations PRF, RAC and POF refer to the proposed extensions to the baseline method, i.e. pre-filtering, re-aligned covariances and post-filtering, respectively. As the table shows, every extension increases the point count reduction ratio.

Table 1. An overview of the sizes of the data sets used and the achieved point reduction ratios.

As shown in the left subfigure, each extension enhances the accuracy of the fusion. The re-aligned covariance extension in particular improves the result significantly (the red square curve versus the green diamond curve). Pre-filtering and post-filtering bring only moderate improvement in this data set because, due to the simplicity of the data set, the number of outliers is moderate and there are practically no badly misplaced measurements, since the camera poses were acquired relatively accurately as described earlier.

## 5 Conclusion

In this paper, we proposed a method for merging a sequence of overlapping depth maps into a single non-redundant point cloud. Starting with a point cloud back projected from a single depth map, the method iteratively adds points from other depth maps so that the new measurements refine the existing points in overlapping areas. The refinement is based on an uncertainty covariance calculated for every measurement. The proposed method improves the algorithm [7] with three extensions: (1) depth map pre-filtering, (2) depth map fusion with directed uncertainty covariances and (3) post-filtering of the final point cloud. The performance of each extension was demonstrated with several experiments. The proposed method outperformed the baseline algorithm both in robustness and accuracy.

## References

1. Choi, S., Zhou, Q.Y., Koltun, V.: Robust reconstruction of indoor scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5556–5565 (2015)
2. Córdova-Esparza, D.M., Terven, J.R., Jiménez-Hernández, H., Herrera-Navarro, A.M.: A multiple camera calibration and point cloud fusion tool for Kinect V2. In: Science of Computer Programming (2017, in press)
3. Fuhrmann, S., Goesele, M.: Fusion of depth maps with multiple scales. In: Proceedings of the 2011 SIGGRAPH Asia Conference, pp. 148:1–148:8. ACM (2011)
4. Goesele, M., Curless, B., Seitz, S.M.: Multi-view stereo revisited. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2006)
5. Herrera, C.D., Kannala, J., Heikkilä, J.: Joint depth and color camera calibration with distortion correction. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) **34**(10), 2058–2064 (2012)
6. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Eurographics Symposium on Geometry Processing (2006)
7. Kyöstilä, T., Herrera C., D., Kannala, J., Heikkilä, J.: Merging overlapping depth maps into a nonredundant point cloud. In: Kämäräinen, J.-K., Koskela, M. (eds.) SCIA 2013. LNCS, vol. 7944, pp. 567–578. Springer, Heidelberg (2013). doi: 10.1007/978-3-642-38886-6_53
8. Labatut, P., Pons, J.P., Keriven, R.: Robust and efficient surface reconstruction from range data. Comput. Graph. Forum (CGF) **28**(8), 2275–2290 (2009)
9. Li, J., Li, E., Chen, Y., Xu, L., Zhang, Y.: Bundled depth-map merging for multi-view stereo. In: IEEE Conference on Computer Vision and Pattern Recognition (2010)
10. Mendel, J.: Lessons in Estimation Theory for Signal Processing, Communications and Control. Prentice Hall, Englewood Cliffs (1995)
11. Merrell, P., et al.: Real-time visibility-based fusion of depth maps. In: IEEE International Conference on Computer Vision (ICCV) (2007)
12. Mur-Artal, R., Montiel, J.M.M., Tardós, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Robot. **31**(5), 1147–1163 (2015)
13. Naik, N., Kadambi, A., Rhemann, C., Izadi, S., Raskar, R., Kang, S.B.: A light transport model for mitigating multipath interference in time-of-flight sensors. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 73–81 (2015)
14. Nießner, M., Zollhöfer, M., Izadi, S., Stamminger, M.: Real-time 3D reconstruction at scale using voxel hashing. ACM Trans. Graph. (TOG) **32**(6), 169 (2013)
15. Pagliari, D., Pinto, L.: Calibration of Kinect for Xbox One and comparison between the two generations of Microsoft sensors. Sensors **15**(11), 27569–27589 (2015)
16. Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A.: KinectFusion: real-time dense surface mapping and tracking. In: IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 127–136, October 2011
17. Roth, H., Vona, M.: Moving volume KinectFusion. In: British Machine Vision Conference (2012)
18. Tola, E., Strecha, C., Fua, P.: Efficient large-scale multi-view stereo for ultra high-resolution image sets. Mach. Vis. Appl. **23**(5), 903–920 (2012)
19. Whelan, T., Kaess, M., Fallon, M., Johannsson, H., Leonard, J., McDonald, J.: Kintinuous: spatially extended KinectFusion. Technical report (2012)
20. Ylimäki, M., Kannala, J., Heikkilä, J.: Optimizing the Accuracy and Compactness of Multi-view Reconstructions, pp. 171–183, September 2015
21. Zach, C., Pock, T., Bischof, H.: A globally optimal algorithm for robust TV-\(L^1\) range image integration. In: IEEE International Conference on Computer Vision (ICCV) (2007)