High-quality indoor scene 3D reconstruction with RGB-D cameras: A brief review

High-quality 3D reconstruction is an important topic in computer graphics and computer vision with many applications, such as robotics and augmented reality. The advent of consumer RGB-D cameras has brought about profound advances in indoor scene reconstruction. For the past few years, researchers have spent significant effort developing algorithms to capture 3D models with RGB-D cameras. As depth images produced by consumer RGB-D cameras are noisy and incomplete when surfaces are shiny, bright, transparent, or far from the camera, obtaining high-quality 3D scene models remains a challenge for existing systems. Here we review high-quality 3D indoor scene reconstruction methods using consumer RGB-D cameras. In this paper, we make comparisons and analyses from the following aspects: (i) depth processing methods in 3D reconstruction are reviewed in terms of enhancement and completion, (ii) ICP-based, feature-based, and hybrid camera pose estimation methods are reviewed, and (iii) surface reconstruction methods are reviewed in terms of surface fusion, optimization, and completion. The performance of state-of-the-art methods is also compared and analyzed. This survey will be useful for researchers who want to follow best practices in designing new high-quality 3D reconstruction methods.


Introduction
Real-world 3D reconstruction is a longstanding goal in computer vision. Many tools have been applied to accurately perceive the 3D world, including stereo cameras, laser range finders, monocular cameras, and RGB-D cameras. Advances in consumer RGB-D cameras, such as the Microsoft Kinect, Asus Xtion Live, Intel RealSense, Google Tango, and Occipital's Structure Sensor, facilitate numerous new and exciting applications, e.g., in augmented reality (AR) to fuse supplementary elements with the real-world environment (e.g., Holoportation [1]), in virtual reality (VR) to provide users with reliable environment perception [2], in digital cultural heritage protection for realistic modeling [3], and in simultaneous localization and mapping (SLAM) for automatic robot navigation. These advances have led to a wide range of research into 3D reconstruction with consumer RGB-D cameras. A typical pipeline for RGB-D based 3D scene reconstruction is summarized in Fig. 1, and consists of three modules: image processing, camera pose estimation, and surface reconstruction. Camera pose estimation finds the transformation between two RGB-D images, while surface reconstruction takes RGB-D data as input and fuses the dense overlapping depth frames into one reconstructed model using some specific representation. A complete scene is reconstructed from views acquired along the camera trajectory, each of which covers only a small part of the environment. The pre-processed RGB-D images with estimated camera poses are integrated into a complete 3D scene model.
The advent of affordable consumer grade RGB-D cameras has brought about profound advances in visual scene reconstruction methods. Researchers in the fields of computer graphics and computer vision have expended significant effort to develop entirely new algorithms to capture comprehensive shape models of real-world scenes with RGB-D cameras. Figure 2 gives a brief history of research into indoor scene 3D reconstruction with RGB-D cameras, and indicates some representative methods from the past decade. KinectFusion [4] is a seminal RGB-D based real-time indoor scene 3D reconstruction system. It uses a volumetric representation based on the truncated signed distance function (TSDF) [5], in conjunction with fast iterative closest point (ICP) [6] pose estimation, to provide a real-time fused dense model. A major limitation of KinectFusion is that camera pose estimation is performed by frame-to-model registration using an ICP algorithm, which is only reliable for RGB-D data with small shifts between consecutive frames acquired by high-frame-rate RGB-D cameras. Since then, improved variants of systems and methods have been proposed. We classify tasks as below and give representative methods:
• Large-scale fusion, e.g., Kintinuous [7], LSD-RGBD SLAM [8], large-scale 3D reconstruction [9,10], and SG-NN [11].
• Semantic fusion, e.g., SLAM++ [12], automatic semantic modeling [13,14], SemanticFusion [15], 3D-SIS [16], and SISNet [17].
• Dynamic fusion, e.g., DynamicFusion [18], Fusion4D [19], FusionMLS [20], and PIFu [21].
• Efficient fusion, e.g., VoxelHashing [22], FastFusion [23], and InfiniTAM [24,25].
• High-quality fusion, e.g., Redwood [3], BundleFusion [26], Intrinsic3D [27], and UncertaintyAware [28].
Kintinuous [7] extends KinectFusion and creates highly detailed maps of extended-scale environments in real time. SLAM++ [12] is the first work on semantic scene reconstruction. It focuses on joint 3D object recognition and RGB-D SLAM, and creates semantically meaningful maps by combining geometric and semantic information. Figure 3 shows an example of semantic reconstruction from SemanticFusion [15], a real-time visual SLAM system capable of semantically annotating a dense 3D scene using CNNs. VoxelHashing [22] uses a simple spatial hashing scheme that compresses space and allows real-time access and updates of implicit surface data. An inherent problem in scene reconstruction is tracking drift due to accumulated pose estimation errors. Redwood [3] deals with these accumulated errors by reconstructing locally smooth scene fragments and deforming the fragments to align them with each other, obtaining high-quality 3D scene models offline. DynamicFusion [18] presents the first system capable of reconstructing non-rigidly deforming scenes in real time. Figure 4 shows an example of dynamic reconstruction from Fusion4D [19], a real-time human volumetric capture system using consumer RGB-D cameras. ElasticFusion [29] proposes surfel-based fusion coupled with frequent model refinement through non-rigid surface deformations. BundleFusion [26] uses additional color features for registration and global bundle adjustment to obtain precise scene geometry in real time. Intrinsic3D [27] obtains high-quality 3D reconstructions by simultaneously optimizing for reconstructed geometry, surface albedo, camera pose, and scene lighting.
SG-NN [11] converts partial and noisy RGB-D scans into high-quality 3D scene reconstructions by inferring unobserved scene geometry through self-supervised learning.
In this paper, we focus on high-quality 3D reconstruction of indoor scenes with consumer RGB-D cameras, and review the methods in terms of depth image processing, camera pose estimation, and surface reconstruction. The cited methods are drawn mainly from articles published in leading conferences and journals in recent years. This review will be useful for researchers who want to follow best practices in designing new high-quality 3D reconstruction methods. The main contributions of our paper are as follows: 1. depth image processing methods in 3D scene reconstruction are analyzed and discussed in terms of depth enhancement and depth completion, 2. camera pose estimation methods are analyzed and discussed in terms of ICP-based, feature-based, and hybrid methods, 3. surface reconstruction methods are analyzed and discussed in terms of surface fusion, surface optimization, and surface completion, and 4. evaluation methods are compared and the performance of state-of-the-art systems is analyzed. The remainder of this survey is organized as follows. Section 2 discusses related work on indoor scene 3D reconstruction and gives the motivation for our review. Sections 3-5 review the methods used in 3D reconstruction in terms of image processing, camera pose estimation, and surface reconstruction respectively. The performance of state-of-the-art methods is compared and analyzed in Section 6, while Sections 7 and 8 present a summary and concluding remarks, and consider future developments.

Related work
In this section, we provide a brief review of related work in high-quality 3D scene reconstruction methods, RGB-D datasets, benchmarks for 3D scene reconstruction, and related surveys on 3D reconstruction.

High-quality 3D scene reconstruction
High-quality 3D scene reconstruction aims to obtain complete 3D models with highly-detailed geometry or high-quality surface textures. Existing indoor scene reconstruction methods can be classified as online (e.g., dense SLAM or dynamic reconstruction) or offline (typically with higher accuracy). To ensure accuracy, low-level geometric and texture information, as well as high-level semantic information, can be used in the reconstruction algorithm. High-quality 3D reconstruction of the real world is a key component in AR/VR and digital cultural heritage protection. With semantic information, indoor scene reconstruction systems have potential applications in intelligent systems like autonomous robot navigation and human-computer interaction. Figure 5 shows examples of high-quality indoor scene reconstructions by accurate geometric registration [3], joint appearance and geometry optimization [30], and semantic segmentation [31], respectively. Due to noisy depth data, inaccurate registration, camera tracking drift, and the lack of accurate surface details, 3D models reconstructed from consumer RGB-D cameras are not yet widely used in applications. The purpose of this state-of-the-art report is to review current approaches that try to solve this problem.

Datasets and benchmarks
There are many RGB-D datasets for evaluating real-world and synthetic scene reconstruction methods. We collect and analyze state-of-the-art RGB-D datasets for 3D reconstruction in Table 1, which gives their magnitude, availability of ground-truth camera poses and surface models, and semantic annotation. Real-world scenes are scanned by hand-held cameras or robots equipped with RGB-D cameras, while synthetic scenes are obtained by technologies such as rendering and ray tracing. The synthetic ICL-NUIM dataset [35] and real-world TUM RGB-D dataset [32] are two benchmarks widely used to compare and analyze 3D scene reconstruction systems in terms of camera pose estimation and surface reconstruction. Choi et al. [3] provided code and executables to evaluate global registration algorithms for 3D scene reconstruction systems, and proposed the augmented (Aug) ICL-NUIM dataset. As can be seen from Table 1, with new applications in robotics and HCI, a trend in RGB-D datasets is towards large-scale scenes with dynamic objects. The newly proposed Replica dataset [42] contains 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale; each scene consists of a dense mesh, high-dynamic-range (HDR) textures, semantic class and instance information, and so on. The corresponding benchmarks need to be further standardized to push the development of high-quality 3D reconstruction.

Fig. 5 High-quality indoor scene reconstruction with consumer RGB-D cameras by accurate geometric registration (left) [3], joint appearance and geometry optimization (middle) [30], and semantic segmentation (right) [31]. Reproduced with permission from Ref. [3], © IEEE 2015; Ref. [30], © Springer-Verlag Berlin Heidelberg 2016; Ref. [31], © IEEE 2017.

Fig. 6 High-quality 3D reconstruction by joint appearance and geometry optimization. Models have fine-detail geometry (left) and compelling visual appearance (right); close-up views below. Reproduced with permission from Ref. [27], © IEEE 2017.

Other surveys
There are several surveys related to our work. Berger et al. [45] surveyed the field of surface reconstruction, and provided a categorization with respect to priors, data imperfections, and reconstruction output. Chen et al. [46] provided an overview of recent advances in indoor scene modeling techniques, as well as public datasets and code libraries which can facilitate experiments and evaluation. Stotko [47] reviewed several registration algorithms developed in recent years and compared their performance. Xu et al. [48] gave an overview of the main concepts and components of data-driven shape analysis and processing techniques. Zollhöfer et al. [49] presented a survey of the state-of-the-art in 3D reconstruction with RGB-D cameras, and reviewed recent developments in RGB-D scene reconstruction for static and dynamic scenes. Han et al. [50] reviewed the state-of-the-art and trends in 3D object reconstruction in the deep learning era. Roldão et al. [51] identified, compared, and analyzed techniques of semantic scene completion (SSC) in terms of both methods and datasets. Recently, Liu et al. [52] covered SLAM-related datasets, including an overview and comparison of existing datasets, a review of evaluation criteria, and discussions of current limitations and future directions. The above surveys do not specifically analyze the factors influencing high-quality 3D scene reconstruction. Thus, we give the first comprehensive and critical review of high-quality indoor scene 3D reconstruction with RGB-D cameras, focusing on image processing, camera pose estimation, surface reconstruction, and performance comparison, providing a summary and discussion, and looking ahead to future trends.

Depth image processing
As Fig. 1 shows, depth image processing is the first stage of 3D scene reconstruction. Consumer RGB-D cameras employ one of two main approaches to depth sensing: triangulation and time-of-flight (ToF). Triangulation is realized by structured light: an active system projects an infrared light pattern onto the scene and estimates the disparity given by the perspective distortion of the pattern due to variations in object depth. ToF cameras measure the time that light emitted by an illumination unit requires to travel to an object and back to a detector. Consumer grade RGB-D cameras relying on these methods often suffer from significant noise and distortion, and cannot capture subtle details. These imperfections in the raw depth images have to be taken into account when developing algorithms for high-quality 3D reconstruction.

Depth enhancement
To enhance the quality of depth images used in 3D reconstruction, many approaches focus on depth denoising and depth super-resolution. Researchers have also exploited shape from shading (SfS) and shape from polarization (SfP) techniques to enhance depth images.

Depth denoising
The noise of depth images captured by consumer RGB-D cameras depends on a variety of parameters, such as the distance to the acquired object and the pixel position in the depth image. Many researchers have evaluated and analyzed the accuracy of depth images [53-55]. A commonly used method is bilateral filtering [56], which effectively smooths depth images while preserving edges, and is widely used in RGB-D based 3D reconstruction systems. Li et al. [57] processed depth images with a depth-adaptive bilateral filter to effectively improve the accuracy of 3D scene models.
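As an illustration, the snippet below shows one way to apply bilateral filtering to a consumer-camera depth image using OpenCV. This is a generic sketch rather than the depth-adaptive filter of Ref. [57]; the kernel size and sigma values are assumptions chosen for typical indoor depth noise.

```python
import cv2
import numpy as np

# Load a 16-bit depth image (depth in millimetres, 0 = missing).
depth_mm = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED)

# cv2.bilateralFilter expects 8-bit or 32-bit float input,
# so convert to metres as float32 first.
depth_m = depth_mm.astype(np.float32) / 1000.0

# d: neighbourhood diameter; sigmaColor: depth difference (in metres)
# tolerated before an edge is preserved; sigmaSpace: spatial extent
# of smoothing. All three values are assumptions, not Ref. [57]'s.
smoothed = cv2.bilateralFilter(depth_m, d=7,
                               sigmaColor=0.03, sigmaSpace=4.0)

# Keep invalid (zero-depth) pixels invalid rather than smoothed-in.
smoothed[depth_m == 0] = 0.0
```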
In recent years, deep depth denoising techniques (e.g., Ref. [58]) which can better capture the global context of each scene have attracted more attention.

Depth super-resolution
Consumer RGB-D cameras can capture high resolution (HR) RGB images (e.g., 1280 × 1024), but only low resolution (LR) depth images (e.g., 640 × 480). To facilitate reconstruction, most 3D reconstruction systems use both RGB and depth images at low resolution. Super-resolution techniques recover high-resolution images from observed low-resolution images: high-resolution depth maps can be inferred from low-resolution depth measurements and an additional high-resolution intensity image of the same scene. Although there are depth super-resolution techniques that do not use color information (e.g., example-based methods [59]), most existing methods improve the resolution of depth images using high-resolution color images [60-62]. Deep learning depth super-resolution techniques [63-65] also exist. Hui et al. [63] used two CNNs, downsampling an HR color image concurrently with upsampling the LR depth image: RGB features generated by the downsampling CNN were used to fine-tune the upsampling of the depth images. Riegler et al. [65] used an energy minimization model to guide the generation of HR depth images without the need for reference images.
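To make the color-guided idea concrete, the following is a naive NumPy sketch in the spirit of joint bilateral upsampling (not the method of any specific reference above): each high-resolution depth estimate is a weighted average of nearby low-resolution samples, with weights combining spatial distance and color similarity in the HR guide image. The parameter values are assumptions, and border wrap-around from np.roll is ignored for brevity.

```python
import numpy as np

def joint_bilateral_upsample(depth_lr, color_hr, scale,
                             radius=3, sigma_s=2.0, sigma_c=0.1):
    """Upsample depth_lr (h, w) by integer `scale`, guided by
    color_hr (H, W, 3) in [0, 1]; zeros in depth_lr are invalid."""
    H, W, _ = color_hr.shape
    # Nearest-neighbour upsampling provides the candidate samples.
    depth_nn = np.repeat(np.repeat(depth_lr, scale, 0),
                         scale, 1)[:H, :W]
    num = np.zeros((H, W))
    den = np.zeros((H, W))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            d_shift = np.roll(depth_nn, (dy, dx), axis=(0, 1))
            c_shift = np.roll(color_hr, (dy, dx), axis=(0, 1))
            valid = d_shift > 0
            # Spatial Gaussian weight and guide-color similarity weight.
            w_s = np.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
            w_c = np.exp(-np.sum((color_hr - c_shift) ** 2, axis=2)
                         / (2 * sigma_c ** 2))
            w = w_s * w_c * valid
            num += w * d_shift
            den += w
    return np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)
```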

Shading-based methods
Shape from shading [66] recovers shape from the gradual variation of shading in images. It can capture high-quality shape details of a dynamic object under natural illumination, and is widely used to enhance depth images from consumer grade RGB-D cameras [67-70]. For instance, Han et al. [67] estimated the detailed shape of diffuse objects with uniform albedo from a single RGB-D image. Yu et al. [68] presented a shading-based shape refinement algorithm which uses a noisy, incomplete depth image from a Kinect to help resolve ambiguities in SfS. Wu et al. [69] presented the first real-time method for refining depth images using SfS in general uncontrolled scenes with consumer RGB-D cameras. RGBD-fusion [70] uses a lighting model to handle natural scene illumination, and enhances the depth image by fusing intensity and depth information to create more detailed range profiles. Nevertheless, the robustness of SfS methods is limited by their reliance on an illumination model.

Polarization-based methods
Shape from polarization is an application of polarization imaging which aims to digitize the shape of the observed object. Polarization reveals surface normal information, and is thus helpful for propagating depth to featureless regions. Researchers have exploited polarization techniques [44,71-73] to enhance depth images. Polarized 3D [44] enhances coarse depth images using shape information from polarization cues; an experimental result of this method is shown in Fig. 7, which compares shading enhancement and polarization enhancement. Cui et al. [71] combined per-pixel photometric information from polarization and obtained good reconstruction performance, especially on featureless 3D objects. Deep SfP [72] made the first attempt to bring the SfP problem to the realm of deep learning, and performs well. Since then, many methods (e.g., Ref. [73]) have tried to combine deep learning with polarization techniques for depth enhancement. The equipment used in polarization-based methods is expensive, limiting their wider application in 3D reconstruction. As can be seen from the above, depth enhancement approaches have been applied in RGB-D reconstruction; most techniques belong to traditional image processing and have been widely applied in practice.

Depth completion
Raw depth images produced by consumer grade RGB-D cameras are often incomplete when surfaces are shiny, bright, transparent, or far from the camera. To address this problem, various approaches have emerged which try to complete sparse depth measurements into a dense depth image. Figure 8 shows an example of depth completion [74] for an indoor scene. Techniques to complete the depth data of RGB-D images can be divided into traditional and data-driven methods.

Traditional methods
A few early works addressed depth completion through image filtering or optimization. To fill holes in a raw depth image, NYU v2 [75] uses cross-bilateral filtering to produce a visually pleasing depth map, but introduces artifacts. Xiao et al. [34] improved the depth image by using a TSDF to voxelize the space, accumulating the depth maps from nearby frames using camera poses, and then using ray casting to obtain a reliable depth image. Chen and Koltun [76] developed a global high-resolution MRF optimization approach to improve the accuracy of depth images. These methods use traditional image processing algorithms for depth completion, but their predictive ability is limited given large data loss.

Data-driven methods
With recent advances in deep learning and the availability of various RGB-D datasets, researchers have started to look at data-driven approaches to depth estimation. Most algorithms [74,77-82] utilize the RGB image and additional information that can be inferred from the depth map, such as surface normals, to provide geometric guidance to the training process. Sparse-to-dense [77] first introduced a robust and accurate depth estimation method from RGB images with additional sparse depth samples acquired from a low-resolution depth sensor; it was used in a SLAM system. Later, Chen et al. [78] presented a deep model that can accurately produce dense depth images given an RGB image with known depths at a very sparse set of pixels. Zhang and Funkhouser [74] trained a deep network to predict dense surface normals and occlusion boundaries, and combined those predictions with raw depth observations to solve for the depths of all pixels, including those missing in the original observation. To address depth smearing between objects, Imran et al. [83] proposed a depth coefficient representation which enables convolutions to more easily avoid inter-object depth mixing. In recent work, Zhu et al. [84] introduced a local implicit neural representation built on ray-voxel pairs that generalizes to unseen transparent objects and provides fast inference.
Research into depth completion has developed with the advent of consumer RGB-D cameras, and most existing approaches focus on deep learning. To train deep networks, a large corpus of training data with accurate ground truth is required. This limits the application of data-driven depth completion methods to 3D reconstruction. Li et al. [85] attempted to obtain high-quality 3D reconstruction with depth super-resolution and completion, and evaluated its feasibility on the synthetic ICL-NUIM dataset, but application to real-world scenes remains to be studied.

Camera pose estimation
In 3D reconstruction, the goal of camera pose estimation is to find the transformation T between two images. To obtain accurate camera poses, a complete estimation pipeline often contains two phases: (i) front-end camera tracking (e.g., frame-to-frame tracking [86] or frame-to-model tracking [4]), and (ii) back-end optimization (e.g., loop closure and global optimization [3,87,88]). According to their tracking characteristics, camera pose estimation methods can be divided into ICP-based and feature-based frameworks. In the following, we discuss both, as well as hybrid methods.

ICP-based methods
ICP-based methods estimate camera pose by maximizing the consistency of geometric information, as well as color information, between pairs of adjacent frames. The ICP algorithm introduced by Besl and McKay [89] is a popular method for 3D reconstruction with RGB-D cameras.
It aligns two partially overlapping point clouds given an initial guess for the relative transform. Each point in one data set is paired with the closest point in the other data set to form correspondence pairs. Given two scene scans P and Q, the transformation T = [R | t] between them is estimated by minimising

E_{ICP}(T) = \sum_i \| T p_i - q_i \|^2    (1)

where p_i and q_i are corresponding points from P and Q respectively. This error metric is the sum of the squared distances between points in each correspondence pair. Figure 9 illustrates the ICP algorithm using the point-to-plane error, \arg\min_T \sum_i ((T p_i - q_i) \cdot n_i)^2, where n_i is the surface normal at q_i. The process is iterated until the error becomes smaller than a threshold or stops changing.

Fig. 9 Point-to-plane error between two surfaces in an iterative closest point algorithm. Reproduced with permission from Ref. [90].

In scenes containing textureless regions (e.g., walls and floors), depth information alone is insufficient to compute the camera pose. The direct method reported in dense visual odometry (DVO) SLAM [91] uses color information to overcome this issue. The goal of the direct method is to estimate the camera motion such that the warped second image matches the first image, based on the photo-consistency assumption; Fig. 10 shows this process for two images. The photometric error E_{rgb} is defined as

E_{rgb}(\xi) = \sum_i ( I_2( w(\xi, p_i) ) - I_1(p_i) )^2    (2)

where \xi is the camera motion, p_i is a pixel, and w is the warping function that maps pixels of the current image I_2 into the previous image I_1.

Fig. 10 Photo-consistency assumption in the direct method. Reproduced with permission from Ref. [91], © IEEE 2013.

DVO SLAM estimates the camera pose by combining geometric and photometric errors, in what it calls combined ICP [92]. The combined error E_{combined} is given by Eq. (3):

E_{combined} = E_{ICP} + \lambda E_{rgb}    (3)

where \lambda is a weight. Both error functions use the same correspondences and their limitations do not affect each other. This idea is further used in Kintinuous and ElasticFusion. In addition to combined ICP, variants of ICP (e.g., Color-ICP [93], efficient ICP [6], non-rigid ICP [94], Generalized-ICP [95], NICP [96]) have been proposed. For instance, non-rigid ICP is capable of modeling non-rigid objects, while Generalized-ICP constructs point-to-point, point-to-plane, and plane-to-plane error metrics. In recent years, ICP-based methods combined with deep learning have also been proposed. Deep closest point (DCP) [97] replaces the Euclidean nearest-point step of ICP with a learnable per-point embedding network, followed by high-dimensional feature matching. Following DCP, many iterative methods [98,99] extend the feature-matching idea; the general scheme is to learn a matching, apply the inferred transformation to the source point cloud, and learn a new alignment map, until convergence. Recently, deep weighted consensus [100] presented a new paradigm for rigid alignment based on a learnable weighted consensus which is robust to noise.
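To make the point-to-point formulation of Eq. (1) concrete, the following is a minimal NumPy/SciPy sketch of a classical ICP loop: nearest-neighbour pairing with a k-d tree and a closed-form SVD (Kabsch) solve for T = [R | t] at each iteration. It is a didactic sketch of the textbook algorithm, not the GPU pipeline used in KinectFusion; iteration counts and tolerances are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(P, Q):
    """Closed-form least-squares solve of Eq. (1) for fixed
    correspondences P[i] <-> Q[i] (Kabsch/Umeyama, no scale)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the least-squares rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp

def icp(P, Q, iters=30, tol=1e-8):
    """Align source points P (N, 3) to target Q (M, 3); returns R, t."""
    tree = cKDTree(Q)                       # closest-point queries
    R_tot, t_tot = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(iters):
        src = P @ R_tot.T + t_tot
        dists, idx = tree.query(src)        # pair with closest points
        R, t = best_rigid_transform(src, Q[idx])
        R_tot, t_tot = R @ R_tot, R @ t_tot + t
        err = np.mean(dists ** 2)           # Eq. (1) over current pairs
        if abs(prev_err - err) < tol:       # error stops changing
            break
        prev_err = err
    return R_tot, t_tot
```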

Feature-based methods
Feature-based methods introduce RGB features into camera pose estimation, maximizing the 3D position consistency of corresponding feature points between frames to improve the robustness of camera tracking. Existing 3D reconstruction systems within a SLAM framework often use sparse features to establish 2D-3D matches between features in a query image and points in a 3D map. In general, the transformation T is estimated by minimizing the feature reprojection error

E_{reproj}(T) = \sum_i \| x'_i - \pi( K T x_i ) \|^2

where x_i and x'_i denote the position of a 3D feature and its matched 2D feature respectively, K is the camera intrinsic matrix, and \pi is the perspective projection. Point features are a popular choice for feature extraction and matching in 3D reconstruction, such as SIFT [34], FPFH [3], ORB [101], and some learned features (e.g., 3DMatch [87] and PixLoc [102]). Lines and planes are the most common structures in indoor scenes and are less sensitive to lighting variation than points. Researchers are increasingly studying methods [103-109] that use them for high-quality 3D reconstruction. Such methods can generally achieve good performance under both constant and varying lighting conditions. Based on line features, Choi et al. [103] presented a 3D edge detection approach for RGB-D point clouds, which exploits the organized structure of the RGB-D image to efficiently detect edges, and makes use of both 3D shape information and photometric texture information. Lu and Song [104] fused point and line features to form a robust RGB-D visual odometry algorithm, which extracts 3D points and lines from RGB-D images, analyzes their measurement uncertainties, and computes camera motion using maximum likelihood estimation. Zhou and Koltun [105] proposed a depth camera tracking method with contour cues, which can be used to establish correspondence constraints that carry information about scene geometry and constrain pose estimation. The contour constraints reliably improve camera tracking accuracy.
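As a concrete illustration of this pipeline, the sketch below estimates a camera pose from 2D-3D matches using ORB features and RANSAC PnP in OpenCV, which minimizes the reprojection error above over inlier matches. The inputs query_rgb, pts3d, desc_map, and K are hypothetical placeholders for a query frame, the 3D map points, their descriptors, and the camera intrinsics; the feature count and RANSAC threshold are assumptions.

```python
import cv2
import numpy as np

def estimate_pose(query_rgb, pts3d, desc_map, K):
    """Estimate a camera pose T = [R | t] from 2D-3D matches.

    query_rgb : query image (uint8)
    pts3d     : (N, 3) map points; desc_map : their ORB descriptors
    K         : 3x3 camera intrinsic matrix
    """
    orb = cv2.ORB_create(2000)
    kps, desc = orb.detectAndCompute(query_rgb, None)

    # 2D-3D matching by Hamming distance on binary ORB descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc, desc_map)

    obj_pts = np.float32([pts3d[m.trainIdx] for m in matches])
    img_pts = np.float32([kps[m.queryIdx].pt for m in matches])

    # RANSAC PnP minimizes reprojection error over inlier matches.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts, img_pts, K, None, reprojectionError=3.0)
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return ok, R, tvec
```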
Based on plane features, Taguchi et al. [106] presented a point-plane 3D reconstruction system, which uses a minimal set of features in a RANSAC framework to robustly compute correspondences and estimate camera pose. Dense planar SLAM [107] densely maps the environment using bounded planes and surfels extracted from depth images. It takes direct advantage of the planarity of many parts of the scene via a data-driven process to regularize planar regions and represent their accurate extent efficiently using an occupancy approach with on-line compression. CPA-SLAM [108] consistently integrates frame-to-keyframe and frame-to-plane alignment, and models the environment with a global plane model. It makes use of the dense image information available in keyframes for accurate short-term camera tracking, and uses the global model to reduce drift. PlaneMatch [109] densely models the environment with plane information through a CNN that takes in RGB, depth, and normal information of a planar patch in an image, and outputs a descriptor used to find coplanar patches in other images for scene reconstruction.

Hybrid methods
In practical applications, feature-based methods often combine multiple features (e.g., points, edges, lines, and planes) to improve camera tracking stability. For instance, Manhattan SLAM [110] makes use of point, line, and plane features for robust tracking in challenging scenes, allowing for accurate camera tracking and efficient dense mapping. Generally speaking, feature-based methods are better than ICP-based ones at handling RGB-D data with large shifts, since they simply run a quadratic minimization problem to directly compute the relative transformation between two consecutive frames.
To obtain high-quality 3D scene models robustly, some systems combine ICP-based and feature-based methods into hybrid methods for camera pose estimation. SDF-2-SDF [86] proposes an implicit-to-implicit surface registration scheme, which can be used both for frame-to-frame camera tracking and for global optimization. BundleFusion employs correspondences based on sparse features as well as dense geometric and photometric matching, and obtains highly accurate camera poses. Kehl et al. [111] formulated a joint contour and ICP tracking approach. 3DMatch [87] presents a data-driven model that learns a local volumetric patch descriptor for establishing correspondences between partial 3D data. Semantic information [112] and optical flow [113] have also been used in camera pose estimation. Schönberger et al. [112] proposed the first semantic visual localization method, which is robust to missing observations where previous approaches failed. GeoNet [113] estimates dense depth, optical flow, and camera pose using unsupervised learning.
Recently, Tang et al. [114] estimated camera pose using dense scene matching (DSM), where a cost volume is constructed between a query image and a scene. The cost volume and the corresponding coordinates are processed by a CNN to predict dense coordinates.

Surface reconstruction
Surface reconstruction fuses RGB-D images from different camera views into a complete 3D model. In this section, we consider surface reconstruction methods in terms of surface fusion, surface optimization, and surface completion.

Surface fusion
The basic surface fusion approaches for dense 3D reconstruction are volume-based or surfel-based. Existing high-quality 3D reconstruction systems [3,25,26,115] are mainly based on these or their improvements.

Volume-based fusion
Volume-based fusion provides efficient and simple ways of integrating multiple RGB-D images into a complete 3D model. In a volume-based framework, the TSDF is discretized into a voxel grid to represent a physical volume of space: see Fig. 11. On the left is a two-dimensional example of signed distance values stored at voxels within the truncation distance of the observed surface, with rays cast from the observing sensor; on the right is the voxel grid underlying the reconstruction volume. Each voxel contains a signed distance function (SDF) value indicating the distance from the cell to a surface, and a weight representing confidence in the accuracy of the distance. For a given voxel v in the fused scene model F, the update of the signed distance value F(v) on processing the ith depth frame is defined by

F(v) \leftarrow \frac{ W(v) F(v) \pm w_i(v) f_i(v) }{ W(v) \pm w_i(v) }, \qquad W(v) \leftarrow W(v) \pm w_i(v)

where addition is used for integration, and subtraction for de-integration. The signed distance function f_i(v) is the projective distance between a voxel and the ith depth frame, the weighting function w_i(v) represents the confidence in the accuracy of the distance, and W(v) accumulates the weights.
The original idea of volumetric 3D reconstruction from depth images dates back to Ref. [5]. Later, the advent of consumer RGB-D cameras and massively parallel GPUs led to the seminal KinectFusion system, which inspired a wide range of further work. Volume-based fusion is at the core of many state-of-the-art RGB-D reconstruction frameworks [3,4,22,26,116]. One disadvantage of volume-based fusion is its large memory footprint, as the required memory grows linearly with the overall volume that is represented rather than with its surface area. This issue has been addressed by sparse volumetric representations, such as multi-scale octrees [23,117] and hierarchical structures [24]. For non-rigid fusion, the embedded deformation for shape manipulation algorithm [118] has been introduced in some volume-based dynamic fusion systems (e.g., DynamicFusion [18] and DeepHuman [119]). 3D deep learning volumetric methods, such as deep implicit representations (DIR) [120-122], have also been proposed and studied. For instance, DeepSDF [120] introduces a learned continuous SDF representation of a class of shapes that enables high-quality shape representation, interpolation, and completion from partial and noisy 3D input data. Scene representation networks (SRNs) [121] represent scenes as continuous functions that map world coordinates to a feature representation, encoding both geometry and appearance. Neural sparse voxel fields (NSVF) [122] define a set of voxel-bounded implicit fields organized in a sparse voxel octree to model local properties, and can successfully represent complex 3D scenes.
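The following is a minimal NumPy sketch of the weighted TSDF update above for a single depth frame, using the projective signed distance as in KinectFusion. The flat-array volume layout, pinhole projection, and parameter values are assumptions made for brevity.

```python
import numpy as np

def integrate(tsdf, weight, vox_world, depth, K, T_cw,
              trunc=0.04, w_i=1.0):
    """Fuse one depth frame into a TSDF volume.

    tsdf, weight : (N,) running SDF values F(v) and weights W(v)
    vox_world    : (N, 3) voxel centres in world coordinates (metres)
    depth        : (H, W) depth image in metres, 0 = invalid
    K, T_cw      : 3x3 intrinsics, 4x4 world-to-camera transform
    Passing -w_i instead of w_i would de-integrate the frame.
    """
    # Transform voxel centres into the camera frame and project.
    pc = vox_world @ T_cw[:3, :3].T + T_cw[:3, 3]
    z = pc[:, 2]
    ok = z > 1e-6
    u = np.zeros(len(z), dtype=int)
    v = np.zeros(len(z), dtype=int)
    u[ok] = np.round(K[0, 0] * pc[ok, 0] / z[ok] + K[0, 2]).astype(int)
    v[ok] = np.round(K[1, 1] * pc[ok, 1] / z[ok] + K[1, 2]).astype(int)
    H, W = depth.shape
    ok &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.zeros_like(z)
    d[ok] = depth[v[ok], u[ok]]
    ok &= d > 0

    # Projective signed distance f_i(v), truncated to [-1, 1];
    # voxels far behind the observed surface are left untouched.
    sdf = np.clip((d - z) / trunc, -1.0, 1.0)
    upd = ok & (sdf > -1.0)
    tsdf[upd] = (weight[upd] * tsdf[upd] + w_i * sdf[upd]) \
                / (weight[upd] + w_i)
    weight[upd] += w_i
```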

Surfel-based fusion
Surfel-based fusion is a powerful paradigm for efficiently rendering complex geometric objects. Figure 12 shows a surfel (surface element) representation [123] in object space and texture space. The maximum distance between adjacent surfels in object space is the radius r_0^{pre} of the tangent disk. A surfel is a point sample of an object's surface that includes geometric attributes, such as position and normal, as well as photometric attributes, such as diffuse color.
During surface fusion, for a given surfel M_s with position p ∈ R^3, normal n ∈ R^3, color c ∈ R^3, weight w ∈ R, and radius r ∈ R, the update rules for each component are

\hat{p} = \frac{w p + w' p'}{w + w'}, \quad \hat{n} = \frac{w n + w' n'}{\| w n + w' n' \|}, \quad \hat{c} = \frac{w c + w' c'}{w + w'}, \quad \hat{r} = \frac{w r + w' r'}{w + w'}, \quad \hat{w} = w + w'

where the prime superscript (e.g., p') and hat operator (e.g., \hat{p}) denote the newly associated measurement and the new updated value for a given surfel respectively.
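Below is a minimal sketch of these weighted update rules, in the style of point-based fusion; the dataclass layout is an assumption about how surfel attributes are stored.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Surfel:
    p: np.ndarray   # position (3,)
    n: np.ndarray   # unit normal (3,)
    c: np.ndarray   # color (3,)
    r: float        # radius
    w: float        # confidence weight

def fuse(s: Surfel, m: Surfel) -> Surfel:
    """Merge a new measurement m (weight w') into an existing
    surfel s, following the weighted update rules above."""
    w_sum = s.w + m.w
    p = (s.w * s.p + m.w * m.p) / w_sum
    n = s.w * s.n + m.w * m.n
    n /= np.linalg.norm(n)                 # renormalize averaged normal
    c = (s.w * s.c + m.w * m.c) / w_sum
    r = (s.w * s.r + m.w * m.r) / w_sum
    return Surfel(p, n, c, r, w_sum)
```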
Andersen et al. [124] proposed a surfel-based geometry reconstruction method for determining a piecewise smooth surface from noisy data. Surfels are well suited to modeling dynamic geometry, because there is no need to compute topological information such as adjacency lists. This surfel-based fusion strategy is used by several reconstruction systems, such as dynamic scene reconstruction [125], dense planar SLAM [107], ElasticFusion, and InfiniTAM v3 [25]. Based on ElasticFusion, SemanticFusion allows semantic predictions from multiple viewpoints to be probabilistically fused into a semantic map. GravityFusion [115] incorporates gravity measurements into the surfels to avoid the typical curving of 3D maps in long hallways. DeepSurfels [126] combines explicit and neural building blocks to jointly encode geometry and appearance information, and scales better to larger scenes than existing methods. Through combination with prior information and deep learning, these methods have improved reconstruction performance for indoor scenes. A related point-based representation [125] is also simple and memory-efficient: a 3D shape is represented as an unordered set S = {(x_i, y_i, z_i)}_{i=1}^{N} of N points. It is well suited to objects with interacting parts and fine details, and many papers on point-based 3D object reconstruction, such as DensePCR [127], have appeared in recent years.

Surface optimization
An initial 3D model reconstructed using consumer RGB-D cameras often contains noisy geometry and blurred surface textures. Surface optimization is a classical task in computer vision and 3D reconstruction. In the following, we describe, analyze, and classify methods proposed for surface optimization in recent years.

Shape denoising
Shape denoising techniques can be applied to points (e.g., Refs. [128,129]), meshes (e.g., Refs. [130,131]), and surfaces (e.g., Refs. [132,133]) to improve the quality of 3D models. Wolff et al. [128] removed noise and geometrically or photometrically inconsistent outliers in a point cloud. Wang et al. [131] presented a data-driven approach for mesh denoising via cascaded normal regression. High-frequency details are added to the coarse base mesh using color and displacement maps. Schertler et al. [132] proposed a field-aligned online surface reconstruction algorithm that sidesteps the signed-distance computation of classical reconstruction techniques in favor of direct filtering, parametrization, and mesh and texture extraction. Tsai et al. [133] proposed a surface optimization framework for non-line-of-sight imaging. Shape denoising algorithms abound in computer vision and computer graphics, but most of them are suitable for object denoising. Scene surface denoising is still challenging and needs to be further explored.

Surface refinement
Surface refinement methods used in 3D reconstruction mainly include shading-based geometry refinement [136,137], joint appearance and geometry optimization [27], and deep learning [138]. Representative shading-based work is VSBR [136], which obtains fine-scale detail through volumetric shading-based refinement of a distance field, solving the problem of over-smoothing in RGB-D reconstructions. To obtain high-quality 3D reconstructions, Intrinsic3D [27] introduces a method that simultaneously optimizes geometry encoded in an SDF and textures from automatically-selected key-frames. It dramatically increases the level of detail in the reconstructed scene geometry and contributes to consistent surface texture recovery. DECOR-GAN [138] details 3D shapes by conditional refinement through a generative adversarial network (GAN), which can refine a coarse shape into a variety of detailed shapes with different styles.

Color textures
Image-based texture mapping is a common way of producing texture maps for 3D geometric models. Although a high-quality texture map can easily be computed given accurate geometry and calibrated cameras, texture map quality degrades significantly in the presence of inaccuracies. Researchers have explored several methods [30,135,139,140] for producing high-quality texture maps. The large-scale scene model with texture map shown in Fig. 5 (middle) was acquired by optimizing the texture coordinates of the 3D model to maximize photometric consistency among multiple key frames [30]. 3DLite [135] extrapolates high-level scene geometry, and uses image inpainting to generate sharp surface textures. Liu et al. [139] realized high-quality textured 3D shape reconstruction with cascaded fully convolutional networks. Recently, Huang et al. [140] proposed an approach to produce photo-realistic textures for approximate surfaces, even from misaligned images, by learning an objective function. Reconstructed scene models with realistic color textures are very useful in AR/VR and the digitization of cultural heritage; research using consumer RGB-D cameras is of great interest but remains challenging in practice.

Surface completion
3D models are quite often incomplete due to occlusion between objects. Surface completion is used to recover a complete object or scene model from one or more images. Inferring a dense 3D scene from 2D or sparse 3D inputs is in fact an ill-posed problem since the input data are insufficient to resolve all ambiguities. Most existing works rely on deep learning to learn semantics and geometric priors from large scale datasets. Figure 13 compares object completion and scene completion approaches. Initial reconstructions are shown on the left while the completed surface models are shown on the right. We next discuss object and scene completion methods in turn.
Object completion
Davis et al. [141] addressed situations in which the holes are too geometrically and topologically complex to fill using triangulation algorithms, and applied a diffusion process to extend the SDF throughout the volume until its zero set bridges whatever holes may be present. Harary et al. [143] introduced a context-based completion algorithm to synthesize missing geometry for a given triangle mesh that has holes. Rock et al. [142] recovered a complete 3D model using an exemplar-based approach, which retrieves similar objects in a database of 3D models using view-based matching and transfers the symmetries and surfaces from the retrieved models. Firman et al. [134] hypothesized that objects of dissimilar semantic classes often share similar 3D shape components, and estimated the hidden geometry for a wide range of objects using a limited dataset. ShapeNet [144] was the first work to apply deep learning to learn a 3D representation on a large-scale CAD model database, with the capability for shape completion. Following the success of ShapeNet, various works have emerged that complete 3D shapes using data-driven methods [145-151]. VConv-DAE [145] proposes a fully convolutional volumetric autoencoder to learn a volumetric representation from noisy data by estimating voxel occupancy grids. OctNetFusion [147] presents a 3D CNN architecture to predict an implicit surface representation; it outperforms the traditional volumetric fusion approach in terms of noise reduction and outlier suppression. Dai et al. [148] completed partial 3D shapes through a combination of volumetric deep neural networks and 3D shape synthesis. X-Section [149] predicts the endpoint of an object along a ray, which can be used with volumetric SDF fusion to obtain completed shapes. RevealNet [150] enables a semantically meaningful decomposition of a scanned scene into individual, complete, 3D objects. GAN-style approaches are also widely used in object completion. For instance, 3D-GAN [146] can generate high-quality 3D objects from a probabilistic space. The recently proposed ShapeInversion [151] introduces GAN inversion to shape completion, and gives robust results for real-world scans and partial inputs of various forms and incompleteness levels.

Scene completion
Scene completion often uses prior information, such as scene structural priors [152-155] and semantic priors [11,17,156,157]. Silberman et al. [152] proposed a method for scene completion that can infer the layout of a complete room and the full extent of partially occluded objects. Sung et al. [153] used a collection of example 3D shapes to build structural part-based priors for shape completion. Song et al. [154] output semantic labels for all voxels in the camera view frustum given a single depth image as input. Dzitsiuk et al. [155] used plane priors to complete 3D reconstructions. ScanComplete [156] applies 3D CNNs in a hierarchical fashion to take an incomplete 3D scene scan as input and predict a complete 3D model along with per-voxel semantic labels. SISNet [17] reconstructs a complete 3D scene with precise voxel-wise semantics and presents a novel scene-instance-scene network, which takes advantage of both instance- and scene-level semantic information. Recent work, PALNet [157], utilizes a two-stream network to extract both 2D and 3D features from multiple stages, using fine-grained depth information to capture the context of the scene. Following the proliferation of large-scale 3D datasets, SSC has gained significant momentum in the research community in recent years, as it still holds unresolved challenges.

Fig. 13 Comparison of object completion and scene completion. Left: initial reconstruction [134]. Right: completed geometry with sharp surface textures [135]. Reproduced with permission from Ref. [134], © IEEE 2016; Ref. [135], © ACM 2017.
Performance comparison

DVO SLAM and RGBD SLAM apply pose graph optimization to achieve a globally consistent trajectory; the global scene model is then constructed by integrating all depth images in a volumetric representation. Kintinuous and ElasticFusion achieve a globally consistent model in a map-centric manner by deforming the global model according to global or local constraints. VoxelHashing and InfiniTAM v3 use a spatial hashing scheme to compress space, and can quickly realize surface reconstruction. SUN3D SfM takes a data-driven brute-force approach to RGB-D structure from motion (SfM), and can reconstruct large scenes with object labels. Redwood, BundleFusion, and UncertaintyAware divide the global model into submaps and obtain a globally consistent model by optimizing between submaps. To align submaps globally, Redwood uses dense geometric correspondences, while BundleFusion uses sparse as well as dense correspondences. UncertaintyAware exploits sparse features to align submaps.

Camera tracking accuracy
The accuracy of camera tracking is evaluated by comparing the estimated trajectory with the ground truth. Two prominent error measures are the absolute trajectory error (ATE) and the relative pose error (RPE). The ATE directly measures the difference between points of the true and the estimated trajectory, and is well suited to measuring the performance of visual SLAM systems.
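For reference, the TUM benchmark [32] defines the ATE at time step i via the rigid alignment S (a least-squares alignment between the estimated poses P_{1:n} and the ground-truth poses Q_{1:n}), and reports the RMSE of its translational components:

```latex
F_i := Q_i^{-1}\, S\, P_i, \qquad
\mathrm{ATE}_{\mathrm{RMSE}} =
\left( \frac{1}{n} \sum_{i=1}^{n}
\left\| \operatorname{trans}(F_i) \right\|^{2} \right)^{1/2}
```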
The metric most commonly used for quantitative evaluation is the root mean square error (RMSE). For evaluating the accuracy of camera tracking, there are two commonly used benchmarks: the synthetic ICL-NUIM benchmark [35] and the TUM benchmark [32]. Table 2 presents camera tracking accuracy (ATE RMSE) on the living room sequences kt0-kt3 from the ICL-NUIM synthetic benchmark for the chosen state-of-the-art reconstruction systems; figures are quoted from the corresponding papers.
We also compared these systems using four common sequences from the TUM RGB-D benchmark: fr1 desk, fr2 xyz, fr3 office, and fr3 nst. The real-world scenes were scanned by a robot using a Microsoft Kinect for Windows. The data were recorded at full frame rate (30 Hz) with a sensor resolution of 640 × 480. Table 3 shows the camera tracking accuracy (ATE RMSE) on the TUM RGB-D benchmark. Note that the ground-truth (GT) trajectories are provided by the corresponding benchmarks; the results are quoted from the corresponding papers. The speeds of the methods are estimated using the data provided in the corresponding papers, as are the computer configurations. It can be seen that BundleFusion and UncertaintyAware outperform the other systems with respect to camera tracking. InfiniTAM v3 has the highest speed on the GPU, while DVO SLAM has the highest speed on the CPU.

Surface reconstruction accuracy
The accuracy of surface reconstruction is measured by comparing the reconstructions produced by state-of-the-art methods against the ground-truth 3D surface model. Five standard statistics are computed over the distances from all vertices in the reconstruction to the ground-truth surface: mean, median, standard deviation, min, and max. The benchmark commonly used for quantitative evaluation of surface reconstruction comprises the living room sequences (kt0-kt3) of the synthetic ICL-NUIM benchmark [35]. Figure 14 shows the interior of the synthetic living room scene without color information. Each sequence partly covers the room, and the average trajectory length is 7 m. Later, Choi et al. [3] augmented the original ICL-NUIM dataset in a number of ways to adapt it for the evaluation of complete scene reconstruction pipelines, giving the Aug ICL-NUIM benchmark. The average trajectory length of each sequence is 36 m, and the average surface area coverage reaches 88%.
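As an illustration, these per-vertex statistics can be computed with a nearest-neighbour query, as in the sketch below; approximating point-to-surface distance by point-to-vertex distance is an assumption that slightly overestimates the error for coarse meshes.

```python
import numpy as np
from scipy.spatial import cKDTree

def surface_error_stats(recon_verts, gt_verts):
    """Distance from every reconstructed vertex (N, 3) to the
    nearest ground-truth vertex (M, 3); returns the five statistics."""
    d, _ = cKDTree(gt_verts).query(recon_verts)
    return {"mean": d.mean(), "median": np.median(d),
            "std": d.std(), "min": d.min(), "max": d.max()}
```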
We have compared the state-of-the-art methods on both the ICL-NUIM and Aug ICL-NUIM benchmarks. Table 4 gives surface reconstruction error on the ICL-NUIM benchmark (median distance, in mm), while Table 5 gives surface reconstruction error on the Aug ICL-NUIM benchmark (median distance, in mm). The ground-truth 3D surface models are provided by the corresponding benchmarks, and the results are quoted from the corresponding papers. It can be seen that UncertaintyAware has the best reconstruction performance on the ICL-NUIM benchmark. On the Aug ICL-NUIM benchmark, Redwood's reconstruction accuracy is closest to that obtained with the GT trajectory, benefiting from offline optimization.

Evaluation of pre-processing and post-processing
For high-quality 3D reconstruction, there are two important components in addition to the core 3D reconstruction pipeline: pre-processing and post-processing. The former focuses on handling noise or missing data in RGB-D images, while the latter focuses on handling noise or missing data in 3D models. To evaluate the performance of depth enhancement and depth completion, experiments commonly use the NYU v2 dataset [75], downsampling, adding noise, or making holes in the depth images. Furthermore, a quantitative comparison (e.g., RMSE) on the ToFMark dataset [159] can also be used to benchmark depth super-resolution methods. To validate the performance of surface optimization and surface completion, reconstructed models are qualitatively compared through visual observation or perceptual evaluation. Quantitative evaluation is suitable for comparisons on synthetic scenes (e.g., the ICL-NUIM dataset), but is challenging on real-world scenes, as there is typically no ground-truth surface model. In particular, the geometric intersection over union (IoU) and mean intersection over union (mIoU) may be evaluated over occluded and observed surfaces on SSC datasets (e.g., the synthetic SUNCG-RGBD [160]). These benchmarks have not been commonly used to evaluate state-of-the-art 3D reconstruction systems, and need to be further standardized.
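For completeness, the geometric IoU over occupancy grids, and mIoU as its mean over semantic classes, can be computed as in the sketch below; the boolean/integer grid inputs are assumptions about the data layout.

```python
import numpy as np

def voxel_iou(pred, gt):
    """IoU of two boolean occupancy grids of identical shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def mean_iou(pred_labels, gt_labels, num_classes):
    """mIoU over semantic classes for integer-labelled voxel grids,
    skipping classes absent from both prediction and ground truth."""
    ious = [voxel_iou(pred_labels == c, gt_labels == c)
            for c in range(num_classes)
            if np.any(gt_labels == c) or np.any(pred_labels == c)]
    return float(np.mean(ious))
```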

Summary and discussions
In this section, we discuss the key techniques and limitations in high-quality scene reconstruction with RGB-D cameras, and summarize application scenarios, challenges, and future directions.

Key techniques and limitations
Based on the pipeline of 3D scene reconstruction, the key issues are how to reduce errors in camera pose estimation and how to improve the accuracy of surface reconstruction. Over the past decades, most successful RGB-D based reconstruction systems have mainly focused on camera localization methods using various features, and on volume data fusion methods with elastic registration or local-to-global registration. Introducing deep learning into 3D reconstruction is a direction being explored, but it is hard to make substantial progress in a short time. Efficient methods for depth image processing and 3D model processing can improve the quality of 3D reconstruction with consumer RGB-D cameras. Currently, data-driven approaches have obvious advantages in depth completion and surface completion. However, they usually need a large amount of RGB-D scene data to support model training, and robust performance is often limited to specific scenarios.

Applications
As can be seen from the performance comparison in Section 6, the average error of state-of-the-art online and offline systems is just a few millimetres for the ICL-NUIM benchmark. Online scene reconstruction systems (e.g., InfiniTAM v3) with low requirements on computational performance can be applied in mobile devices. For instance, there are some apps (e.g., Polycam) available for mobile phones and tablets. Offline scene reconstruction systems (e.g., Redwood) are usually used for high-quality 3D map creation and digital cultural heritage protection. Real scene models built offline can be used in smart venues and virtual tours. Scene models with semantic information have potential applications in intelligent systems like autonomous robot navigation, HCI, and so on. High-quality dynamic 3D reconstruction can further be used in human action capture for human action analysis applications (e.g., sports performance analysis).

Challenges and future work
High-quality 3D scene reconstruction is computationally expensive, and the major challenge is how to quickly obtain realistic scene models with convenient devices. In addition to increasing accuracy and efficiency, future work can address the following: (i) task-oriented 3D scene understanding is a key research topic in 3D vision, and different 3D scenes should be reconstructed for different task-oriented purposes, and (ii) the quality of a reconstruction depends not only on reconstructing the geometry and appearance of the scene, but also on exploring invisible information (e.g., purpose and utility) underpinning the scene.

Conclusions
The area of high-quality 3D reconstruction with RGB-D cameras has grown to encompass a variety of methods, which can be divided into three phases: image processing, camera pose estimation, and surface reconstruction.
Our survey provides insight into this wide array of methods, highlighting the strengths and limitations of current approaches. We find that research trends in state-of-the-art methods mainly concentrate on: (i) combining multiple methods, e.g., BundleFusion and 3DLite, (ii) greater use of CNNs and deep learning, e.g., for scan completion and semantic fusion, and (iii) using more information, e.g., object shape priors and scene structural priors.
To inspire researchers to propose new methods, we also suggest directions for future work in high-quality 3D reconstruction. Future directions may move: (i) from static to dynamic, e.g., real-time dynamic fusion, (ii) from local to global, e.g., local-to-global optimization and large-scale scene completion, (iii) from 2D to 3D processing, e.g., occlusion recovery, (iv) from single goal to multiple goals, e.g., scene reconstruction with semantics, geometric reconstruction with color texture, and (v) from low-level to high-level, e.g., 3D reconstruction with scene understanding.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.