
1 Introduction

High-quality reconstruction of 3D objects and scenes is key to 3D environment understanding, mixed reality applications, as well as the next generation of robotics, and has been one of the major frontiers of computer vision and computer graphics research for years [13, 18, 30, 39, 42]. Meanwhile, the availability of consumer-grade RGB-D sensors, such as the Microsoft Kinect and the Intel RealSense, has drawn more novice users into the process of scanning surrounding 3D environments, increasing the need for robust reconstruction algorithms that are resilient to errors in the input data (e.g., noise, distortion, and missing areas).

In spite of recent advances in 3D environment reconstruction, acquiring high-fidelity 3D shapes from the imperfect data produced by casual scanning procedures and consumer-level RGB-D sensors remains a particularly challenging problem. Since the pioneering KinectFusion work [39], many 3D reconstruction systems, both real-time [18, 29, 32, 52, 59] and offline [13], have been proposed, which often use a volumetric representation of the scene geometry, i.e., the truncated signed distance function (TSDF) [17]. However, depth measurements acquired by consumer depth cameras contain a significant amount of noise, and limited scanning angles lead to missing areas, so vanilla depth fusion suffers from blurred surface details and incomplete geometry. Another line of research [30, 40, 46] focuses on reconstructing complete geometry from noisy and sparsely-sampled point clouds, but cannot process point clouds with a large percentage of missing data and may produce bulging artifacts.

Fig. 1.

Illustration of a two-stage 3D-CFCN architecture. Given partial and noisy raw depth scans as input, a fused low-resolution TSDF volume is fed to the stage-1 3D fully convolutional network (3D-FCN), producing an intermediate representation. Exploiting this intermediate feature, the network then (1) regresses a low-resolution but complete TSDF and (2) predicts which TSDF patches should be further refined. For each patch that needs further refinement, the corresponding block is cropped from a fused high-resolution input TSDF, and the stage-2 3D-FCN uses it to infer a detailed high-resolution local TSDF volume, which replaces the corresponding region in the regressed TSDF and thus increases the output resolution. Note that a patch of the global intermediate representation also flows into stage 2 to provide structural guidance. The rightmost column shows the high-quality reconstruction. Close-ups show accurately reconstructed details, e.g., facial details, fingers, and wrinkles on clothes. Note that the input scan is fused from 4 viewpoints.

The wider availability of large-scale 3D model repositories [6, 61] stimulates the development of data-driven approaches for shape reconstruction and completion. Assembly-based methods, such as [10, 49], require carefully segmented 3D databases as input, operate on a few specific classes of objects, and can only generate shapes with limited variety. On the other hand, recent deep learning-based approaches [14, 22, 28, 48, 50, 51, 54, 55, 60, 63, 64] mostly focus on inferring 3D geometry from single-view images [22, 50, 51, 54, 55, 63, 64] or high-level information [48, 60] and, due to high memory consumption, are often limited to low resolutions (typically \(32^3\) voxels), which is far too coarse for recovering geometric details.

In this work, we present a coarse-to-fine approach to high-fidelity volumetric reconstruction of 3D shapes from noisy and incomplete inputs using a 3D cascaded fully convolutional network (3D-CFCN) architecture, which outperforms state-of-the-art alternatives in terms of the resolution and accuracy of the reconstructed models. Our approach adopts recently introduced octree-based, memory-efficient 3D deep learning data structures [43, 53, 56] as the basic building block; however, instead of employing a standard single-stage convolutional neural network (CNN), we propose to use multi-stage network cascades to reconstruct detailed shape information, where the object geometry is predicted and refined progressively via a sequence of sub-networks. The rationale for choosing the cascaded structure is two-fold. First, to predict high-resolution (e.g., \(512^3\), \(1024^3\), or even higher) geometry information, one may have to deploy a deeper 3D neural network, which could significantly increase memory requirements even with memory-efficient data representations. Second, by splitting the geometry inference into multiple stages, we also simplify the learning tasks, since each sub-network now only needs to learn to reconstruct 3D shapes at a certain resolution.

Training a cascaded architecture is a nontrivial task, particularly when octree-based data representations are employed, where both the structure and the value of the output octree need to be predicted. We thus design the sub-networks to learn where to refine the 3D space partitioning of the input volume, and the same information is used to guide the data propagation between consecutive stages as well, which makes end-to-end training feasible by avoiding exhaustively propagating every volume block.

The primary contribution of our work is a novel learning-based, progressive approach for high-fidelity 3D shape reconstruction from imperfect data. To train and quantitatively evaluate our model on real-world 3D shapes, we also contribute a dataset containing both detailed full body reconstructions and raw depth scans of 10 subjects. We then conduct careful experiments on both simulated and real-world datasets, comparing the proposed framework to a variety of state-of-the-art alternatives. These experiments show that, when dealing with noisy and incomplete inputs, our approach produces 3D shapes with significantly higher accuracy and quality than other existing methods.

2 Related Work

There has been a large body of work focused on 3D reconstruction over the past few decades. We refer the reader to [2] and [9] for detailed surveys of methods for reconstructing 3D objects from point clouds and RGB-D streams, respectively. Here we only summarize the most relevant previous approaches and categorize them as geometric, assembly-based, and learning-based approaches.

Geometric Approaches. In the presence of sample noise and missing data, many methods exploit a smoothness assumption, which constrains the reconstructed geometry to satisfy a certain level of smoothness. Gradient-domain methods [1, 4, 30] require that the input point clouds be equipped with (oriented) normals and utilize them to estimate an implicit soft indicator function that discriminates the interior region of a 3D shape from its exterior. Similarly, [5, 36] use globally supported radial basis functions (RBFs) to interpolate the surface. On the other hand, a series of moving least squares (MLS)-based methods [25, 41] attack 3D reconstruction by fitting the input point clouds to a spatially varying low-degree polynomial. By assuming local or global surface smoothness, these approaches are, to a certain extent, robust to noise, outliers, and missing data.

Sensor visibility is another widely used prior in scan integration for object and scene reconstruction [17, 23], which acts as an effective regularizer for structured noise [65] and can be used to infer empty space. For large-scale indoor scene reconstruction, since the prominent KinectFusion work, plenty of systems [13, 18, 29] have been proposed. However, they mostly focus on improving the accuracy and robustness of camera tracking in order to obtain a globally consistent model.

Compared to these methods, we propose to learn natural 3D shape priors from massive training samples for shape completion and reconstruction, which better explores the 3D shape space and avoids the undesired reconstructed geometries that result from hand-crafted priors.

Assembly-Based Approaches. Another line of work assumes that a target object can be described as a composition of primitive shapes (e.g., planes, cuboids, spheres, etc.) or known object parts. [8, 45] detect primitives in input point clouds of CAD models and optimize their placement as well as the spatial relationships between them via graph cuts. The method introduced in [47] first interactively segments the input point cloud and then retrieves a complete and similar 3D model to replace each segment, while [10] extends this idea by exploiting contextual knowledge learned from a scene database to automate the segmentation as well as improve the accuracy of shape retrieval. To increase the granularity of the reconstruction to the object component level, [49] proposes to reassemble parts from different models, aiming to find the combination of candidates that best conforms to the input RGB-D scan. Although these approaches can deal with partial input data and bring in semantic information, the 3D models they produce still suffer from a lack of geometric diversity.

Learning-Based Approaches. 3D deep neural networks have achieved impressive results on various tasks [7, 15, 61], such as 3D shape classification, retrieval, and segmentation. As for generative tasks, previous research mostly focuses on inferring 3D shapes from (single-view) 2D images, either with only RGB channels [14, 28, 50, 54, 55, 60, 63] or with depth information [22, 51, 64]. While showing promising advances, these techniques are only capable of generating rough 3D shapes at low resolutions. Similarly, in [48, 57], shape completion is also performed on low-resolution voxel grids due to the high demand for computational resources.

Aiming to complete and reconstruct 3D shapes at higher resolutions, [19] proposes a 3D Encoder-Predictor Network (3D-EPN) to first predict a coarse but complete shape volume and then refine it via an iterative volumetric patch synthesis process, which copy-pastes voxels from k-nearest-neighbor shapes to improve the resolution of each predicted patch. [26] extends 3D-EPN by introducing a local 3D CNN to perform patch-level surface refinement. However, both methods need separate and time-consuming steps before local inference, either nearest neighbor queries [19] or 3D boundary detection [26]. By contrast, our approach only requires a single forward pass for 3D shape reconstruction and produces higher-resolution results (e.g., \(512^3\) vs. \(128^3\) or \(256^3\)). On the other hand, [27, 53] propose efficient 3D convolutional architectures by using octree representations, which are designed to decode high-resolution geometry information from dense intermediate features; nevertheless, no volumetric convolutional encoders and corresponding shape reconstruction architectures are provided. While [42] presents an OctNet-based [43] end-to-end deep learning framework for depth fusion, it refines the intermediate volumetric output globally, which makes it infeasible for producing reconstruction results at higher resolutions even with memory-efficient data structures. Instead, our 3D-CFCN learns to refine output volumes at the level of local patches, and thus significantly reduces the memory and computational cost.

3 Method

This section introduces our 3D-CFCN model. We first give a condensed review of relevant concepts and techniques in Sect. 3.1. Then we present the proposed architecture and its corresponding training pipeline in Sects. 3.2 and 3.3. Section 3.4 summarizes the procedure for collecting and generating the data used to train our model.

3.1 Preliminaries

Volumetric Representation and Integration. The choice of underlying data representation for fusing depth measurements is key to high-quality 3D reconstruction. Approaches vary from point-based representations [31, 58] and 2.5D fields [24, 38] to volumetric methods based on occupancy maps [62] or implicit surfaces [17, 18]. Among them, TSDF-based volumetric representations have become the preferred choice due to their ability to model continuous surfaces, their efficiency for incremental parallel updates, and their simplicity for extracting surface interfaces. In this work, we adopt the definition of TSDF from [39]:

$$\begin{aligned} \mathrm{V}(\mathbf{p})&= \Psi (\mathrm{S}(\mathbf{p})) , \end{aligned}$$
(1)
$$\begin{aligned} \mathrm{S}(\mathbf{p})&= \left\{ \begin{array}{cl} \Vert \mathbf{p} - \partial {\Omega } \Vert _2, \quad & \text{if } \mathbf{p} \in {\Omega } \\ -\Vert \mathbf{p} - \partial {\Omega } \Vert _2, \quad & \text{if } \mathbf{p} \in {\Omega }^c \end{array} \right. , \end{aligned}$$
(2)
$$\begin{aligned} \Psi (\eta )&= \left\{ \begin{array}{cl} \min (1, \frac{\eta }{\mu }) \, \mathrm{sgn}(\eta ), \quad & \text{if } \eta \ge -\mu \\ \text{invalid}, \quad & \text{otherwise} \end{array} \right. , \end{aligned}$$
(3)

where \(\mathrm{S}\) is the standard signed distance function (SDF) with \({\Omega }\) being the object volume, and \(\Psi \) denotes the truncation function with \(\mu \) being the corresponding truncation threshold. The truncation is performed to avoid surface interference, since in practice during scan fusion, the depth measurement is only locally reliable due to surface occlusions. In essence, a TSDF implicitly encodes free space, uncertain measurements, and unknown areas.

Given a set of depth scans at hand, we follow the approach in [17] to integrate them into a TSDF volume:

$$\begin{aligned} \mathrm{V}(\mathbf{p}) = \frac{\sum _i w_i(\mathbf{p}) \, \mathrm{V}_i(\mathbf{p})}{\sum _i w_i(\mathbf{p})}, \end{aligned}$$
(4)

where \(\mathrm{V}_i(\mathbf{p})\) and \(w_i(\mathbf{p})\) are the TSDF and weight function of the i-th depth scan, respectively.
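To make the integration concrete, the following minimal NumPy sketch implements the truncation of Eq. (3) and the weighted average of Eq. (4) on dense voxel grids. This is a simplification: our pipeline actually operates on grid-octrees, and the function and variable names here are illustrative only.

```python
import numpy as np

def truncate_sdf(sdf, mu):
    """Eq. (3): scale signed distances by the truncation threshold mu,
    clamp to [-1, 1], and mark values below -mu as invalid (NaN)."""
    tsdf = np.clip(sdf / mu, -1.0, 1.0)
    tsdf[sdf < -mu] = np.nan          # occluded / unobserved region
    return tsdf

def fuse_tsdfs(tsdf_list, weight_list):
    """Eq. (4): per-voxel weighted average of the per-scan TSDFs V_i with
    weights w_i; voxels that are never observed keep the invalid value (NaN)."""
    num = np.zeros_like(tsdf_list[0])
    den = np.zeros_like(tsdf_list[0])
    for v_i, w_i in zip(tsdf_list, weight_list):
        valid = ~np.isnan(v_i)
        num[valid] += w_i[valid] * v_i[valid]
        den[valid] += w_i[valid]
    fused = np.full(num.shape, np.nan)
    fused[den > 0] = num[den > 0] / den[den > 0]
    return fused
```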

OctNet. 3D CNNs are a natural choice for operating on TSDF volumes within an end-to-end learning framework. However, the cubic growth of computational and memory requirements becomes a fundamental obstacle for training and deploying 3D neural networks at high resolutions. Recently, several works [43, 53, 56] have emerged that exploit the sparsity of 3D data and employ octree-based data structures to reduce memory consumption, among which we take OctNet [43] as our basic building block.

In OctNet, features and data are organized in the grid-octree data structure, which consists of a grid of shallow octrees with maximum depth 3. The structure of each shallow octree is encoded as a bit string so that the features and data of sparse octants can be packed into contiguous arrays. Common operations in convolutional networks (e.g., convolution, pooling, and unpooling) are defined correspondingly on the grid-octree structure. The computational and memory costs are therefore significantly reduced, while OctNet itself, as a processing module, can be plugged into most existing 3D CNN architectures transparently. However, one major limitation of OctNet is that the structure of the grid-octrees is determined by the input data and remains fixed during training and inference, which is undesirable for reconstruction tasks where hole filling and detail refinement need to be performed. We propose an approach to eliminate this drawback in Sect. 3.2.
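As a small illustration of the bit-string encoding, consider one plausible breadth-first layout for a depth-3 shallow octree (1 + 8 + 64 = 73 bits); the exact indexing convention used by OctNet may differ, so the helpers below are a sketch only.

```python
def children(i):
    """Child indices of octant i in a 73-bit breadth-first layout:
    root = 0, its children 1..8, grandchildren 9..72."""
    return range(8 * i + 1, 8 * i + 9)

def is_leaf(bits, i):
    """An octant stores features iff its split bit is 0 (or it sits at
    the maximum depth); `bits` is a sequence of 73 {0, 1} flags."""
    return bits[i] == 0

# Example: the root and its first child are split; everything else is a leaf.
bits = [0] * 73
bits[0], bits[1] = 1, 1
print([j for j in children(1) if is_leaf(bits, j)])   # data-bearing octants 9..16
```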

Fig. 2.

Architecture of a two-stage 3D-CFCN. In this case, the network takes a pair of low- and high-resolution (i.e., \(128^3\) and \(512^3\)) noisy TSDF volumes \(\{V_l, V_h\}\) as input and produces a refined TSDF at \(512^3\) voxel resolution.

3.2 Architecture

Our 3D-CFCN is a cascade of volumetric reconstruction modules, which are OctNet-based fully convolutional sub-networks aiming to infer missing surface areas and refine geometric details. Each module \(\mathcal {M}^i\) operates at a given voxel resolution and spatial extent. In our experiments, we find that a \({512^3}\) voxel resolution and the corresponding two-stage architecture suffice for common 3D scanning tasks, and thus concentrate on this architecture in the rest of the paper; nevertheless, the proposed 3D-CFCN framework can easily be extended to support arbitrary resolutions and numbers of stages.

In our implementation, for both sub-networks, we adopt the U-net architecture [44] while substituting convolution and pooling layers with the corresponding operations from OctNet. Skip connections are also employed between corresponding encoder and decoder layers to make sure the structures of input volumes are preserved in the inferred output predictions. To complete the partial input data and refine its grid-octree structure, we refrain from using OctNet’s unpooling operation and propose a structure refinement module, which learns to predict whether an octant needs to be split for recovering finer geometric details.

The first sub-network, \(\mathcal {M}^0\), receives as input the encoded low-resolution (i.e., \(128^3\)) TSDF volume \({V}^l\) (see Sect. 3.4), which is fused from raw depth scans \(\{\mathcal {D}_i\}\) of a 3D object \(\mathcal {S}\), and produces a feature map \(F^l\) as well as a reconstructed TSDF volume \(R^l\) at the same resolution. Then, for each \(16^3\) patch \(\tilde{F}^l_k\) of \(F^l\), we use a modified structure refinement module to predict whether its corresponding block in \(R^l\) needs further improvement.

If a TSDF patch \(\tilde{R}^l_k\) is predicted to need further refinement, we crop its corresponding \(64^3\) patch \(\tilde{V}^h_k\) from \(V^h\), an encoded TSDF volume fused from the same depth scans \(\{\mathcal {D}_i\}\) but at a higher voxel resolution, i.e., \(512^3\). \(\tilde{V}^h_k\) is then fed to the second stage \(\mathcal {M}^1\) to produce a local feature map \(\tilde{F}^h_k\) with increased spatial resolution and reconstruct a more detailed local 3D patch \(\tilde{R}^h_k\) of \(\mathcal {S}\). Meanwhile, since the input local TSDF patches \(\{\tilde{V}^h_k\}\) may suffer from a large portion of missing data, we also propagate \(\{\tilde{F}^l_k\}\) to incorporate global guidance. More specifically, a propagated \(\tilde{F}^l_k\) is concatenated with the high-level 3D feature map after the second pooling layer in \(\mathcal {M}^1\) (see Fig. 2). Note that this extra path, in turn, also helps refine \({F}^l\) during back propagation. Finally, each regressed local TSDF patch \(\tilde{R}^h_k\) is substituted back into the global TSDF, which can then be used to extract surfaces. The sketch below summarizes this data flow.
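The following is a minimal sketch of the two-stage data flow, written with dense tensors standing in for grid-octrees; `stage1`, `stage2`, and `refine_head` are assumed callables rather than our exact OctNet modules, the overlap handling described in the next paragraph is omitted, and keeping the trilinearly upsampled coarse prediction in unrefined regions is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def cascaded_forward(stage1, stage2, refine_head, V_l, V_h, rho=0.5):
    """V_l: (1, C, 128, 128, 128) low-res TSDF-Hist encoding;
       V_h: (1, C, 512, 512, 512) high-res TSDF-Hist encoding."""
    F_l, R_l = stage1(V_l)                        # global features + coarse 128^3 TSDF
    out = F.interpolate(R_l, scale_factor=4,      # 512^3 canvas; unrefined regions keep
                        mode='trilinear',         # the upsampled coarse prediction
                        align_corners=False)
    for i in range(0, 128, 16):
        for j in range(0, 128, 16):
            for k in range(0, 128, 16):
                F_p = F_l[..., i:i+16, j:j+16, k:k+16]
                if refine_head(F_p).mean() <= rho:        # average "split score"
                    continue                              # patch needs no refinement
                V_p = V_h[..., 4*i:4*i+64, 4*j:4*j+64, 4*k:4*k+64]
                _, R_p = stage2(V_p, F_p)                 # refined 64^3 local TSDF
                out[..., 4*i:4*i+64, 4*j:4*j+64, 4*k:4*k+64] = R_p
    return out
```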

To avoid inconsistency across TSDF patch boundaries, we add overlaps when cropping feature maps and TSDF volumes. When cropping \(\{\tilde{F}^l_k\}\), we expand two extra voxels on each side of the 3D patch, growing the actual resolution of \(\{\tilde{F}^l_k\}\) to \(20^3\); similarly, for \(\{\tilde{V}^h_k\}\) and \(\{\tilde{F}^h_k\}\), we apply an 8-voxel overlap and increase their resolution to \(80^3\). When substituting \(\{\tilde{R}^h_k\}\) back, however, the overlapping regions are discarded. In essence, this cropping approach acts as a smart padding scheme (see the helper sketched below). Note that all local patches are still organized in grid-octrees.
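A dense-array sketch of this padding scheme follows; zero-padding at the volume border is an assumed choice, as the text only specifies the overlap widths (e.g., margin = 2 for \(16^3\) feature patches and margin = 8 for \(64^3\) TSDF patches).

```python
import numpy as np

def crop_with_margin(vol, start, size, margin):
    """Crop a (size + 2*margin)^3 block around the cube starting at `start`,
    zero-padding wherever the expanded window leaves the volume."""
    s = np.asarray(start) - margin
    e = np.asarray(start) + size + margin
    out = np.zeros((size + 2 * margin,) * 3, dtype=vol.dtype)
    src = tuple(slice(max(s[d], 0), min(e[d], vol.shape[d])) for d in range(3))
    dst = tuple(slice(src[d].start - s[d], src[d].stop - s[d]) for d in range(3))
    out[dst] = vol[src]
    return out

def paste_without_margin(global_vol, patch, start, size, margin):
    """Write the interior of a refined patch back, discarding the overlap."""
    i, j, k = start
    global_vol[i:i+size, j:j+size, k:k+size] = \
        patch[margin:margin+size, margin:margin+size, margin:margin+size]
```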

Structure Refinement Module. Since the unpooling operation of OctNet precludes refining the octree structure on the fly, inspired by [42, 53], we propose to replace unpooling layers with a structure refinement module. Instead of inferring new octree structures implicitly from reconstructions as in [42], we use \(3^3\) convolutional filters to directly predict from feature maps whether an octant should be further split. In contrast, OGN [53] predicts three-state masks using \(1^3\) filters followed by a three-way softmax. To determine whether a 3D local patch needs to be fed to \(\mathcal {M}^1\), we take the average “split score” of all the octants in this patch and compare it with a confidence threshold \(\rho \) (\(=0.5\)), as sketched below. By employing this adaptive partitioning and propagation scheme, we achieve high-resolution volumetric reconstruction while keeping the computational and memory cost to a minimum.
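A dense-voxel stand-in for this module might look as follows; the grid-octree version predicts one score per octant rather than per voxel, and the layer names and sigmoid output are illustrative.

```python
import torch
import torch.nn as nn

class StructureRefinement(nn.Module):
    """A 3^3 convolution maps C-channel features to a per-cell split probability."""
    def __init__(self, channels):
        super().__init__()
        self.split = nn.Conv3d(channels, 1, kernel_size=3, padding=1)

    def forward(self, feat):
        return torch.sigmoid(self.split(feat))    # split score in [0, 1]

def needs_refinement(split_scores, rho=0.5):
    """Patch-level decision: compare the average split score with rho."""
    return split_scores.mean().item() > rho
```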

3.3 Training

The 3D-CFCN is trained in a supervised fashion on a TSDF dataset \(\{ \mathcal {F}_n = \{V^l, V^h, G^l, G^h \} \}\) in two phases, where \(V^l\) and \(V^h\) denote the incomplete input TSDFs at low and high voxel resolution, while \(G^l\) and \(G^h\) are low- and high-resolution ground-truth TSDFs, respectively.

In the first phase, \(\mathcal {M}^0\) is trained alone with a hybrid of \(\ell _1\), binary cross entropy, and structure loss:

$$\begin{aligned} \mathcal {L}(\theta ;\, V^l, G^l) = \mathcal {L}_{\ell _1} + \lambda _1 \mathcal {L}_{bce} + \lambda _2 \mathcal {L}_{s}. \end{aligned}$$
(5)

The \(\ell _1\) term is designed for TSDF denoising and reconstruction, and we employ the auxiliary binary cross entropy loss \(\mathcal {L}_{bce}\) to provide the network with more guidance for learning shape completion; in our experiments, we find that \(\mathcal {L}_{bce}\) also leads to faster convergence. Our structure refinement module is learned with \(\mathcal {L}_{s}\), where

$$\begin{aligned} \mathcal {L}_{s} = \frac{1}{| \mathcal {O} |} \sum _{o \in \mathcal {O}} \mathrm{BCE} \left( 1 - f(o', T_{gt}),\ p(o) \right) . \end{aligned}$$
(6)

Here, \(\mathcal {O}\) represents the set of octants in the current grid-octree, and BCE denotes the binary cross entropy. p(o) is the prediction of whether the octant o should be split, while \(o'\) is the octant corresponding to o in the ground-truth grid-octree structure \(T_{gt}\) (in this case, the structure of \(G^l\)). We define \(f(o', T_{gt})\) as an indicator function that identifies whether \(o'\) exists in \(T_{gt}\):

$$\begin{aligned} f(o', T_{gt}) = \left\{ \begin{array}{cl} 1, \quad & \text{if } \exists \, \tilde{o}' \in T_{gt}\ \text{such that}\ h(\tilde{o}') \le h(o') \\ 0, \quad & \text{otherwise} \end{array} \right. , \end{aligned}$$
(7)

where h denotes the height of an octant in the octree.
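A dense-tensor sketch of this hybrid objective is given below. The occupancy targets for \(\mathcal {L}_{bce}\) are derived here by thresholding the ground-truth TSDF at the zero level set, which is an assumption about the auxiliary supervision (following the sign convention of Eq. (2), where interior voxels are positive), and `gt_split` holds the per-octant targets \(1 - f(o', T_{gt})\) of Eq. (6).

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred_tsdf, gt_tsdf, split_logits, gt_split,
                lambda1=0.5, lambda2=0.1):
    """Eq. (5): l1 reconstruction + auxiliary occupancy BCE + structure loss."""
    l1 = F.l1_loss(pred_tsdf, gt_tsdf)
    occ_pred = torch.sigmoid(pred_tsdf)           # Eq. (2): interior voxels positive
    occ_gt = (gt_tsdf > 0).float()
    bce = F.binary_cross_entropy(occ_pred, occ_gt)
    structure = F.binary_cross_entropy_with_logits(split_logits, gt_split)
    return l1 + lambda1 * bce + lambda2 * structure
```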

Furthermore, we employ multi-scale supervision [15, 20] to alleviate potential gradient vanishing. Specifically, after each pooling operation, the feature map is concatenated with a downsampled input TSDF volume at the corresponding resolution, and we evaluate the downscaled hybrid loss at each structure refinement layer.
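Reusing the `hybrid_loss` sketch above, the multi-scale supervision can be summarized as accumulating a downscaled hybrid loss at every structure refinement layer; average pooling is an assumed downsampler for the ground-truth TSDF.

```python
import torch.nn.functional as F

def multiscale_loss(preds_by_scale, gt_tsdf, gt_splits, lambda1=0.5, lambda2=0.1):
    """preds_by_scale maps a scale key to (pred_tsdf, split_logits);
       gt_splits maps the same key to the per-scale structure targets."""
    total = 0.0
    for scale, (pred_tsdf, split_logits) in preds_by_scale.items():
        gt_s = F.adaptive_avg_pool3d(gt_tsdf, pred_tsdf.shape[-3:])
        total = total + hybrid_loss(pred_tsdf, gt_s, split_logits,
                                    gt_splits[scale], lambda1, lambda2)
    return total
```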

In the second phase, \(\mathcal {M}^1\) is trained while \(\mathcal {M}^0\) is fine-tuned. To alleviate over-fitting and speed up training, among all the local patches that are predicted to be fed to \(\mathcal {M}^1\), we randomly keep only K of them and discard the rest (we set \(K = 2\) across our experiments). At this stage, the inferred global features \(\tilde{F}^l_k\) flow into \(\mathcal {M}^1\) to guide the shape completion, while the refined local features in turn provide feedback and improve \(\mathcal {M}^0\). The same strategy, i.e., the hybrid loss (see Eq. 5) and multi-scale supervision, is adopted when training \(\mathcal {M}^1\) together with \(\mathcal {M}^0\).

3.4 Training Data Generation

Synthetic Dataset. Our first dataset is built upon the synthetic 3D shape repository ModelNet40 [61]. We choose a subset of 10 categories, with 4051 shape instances in total (3245 for training, 806 for testing). Similar to existing approaches, we set up virtual cameras around the objects and render depth maps, then simulate the volumetric fusion process [17] to generate ground-truth TSDFs. To produce noisy and partial training samples, previous methods [18, 26, 42] add random noise and holes to the depth maps to mimic sensor noise. However, synthetic noise reproduced this way usually does not conform to real noise distributions. We therefore instead implement a synthetic stereo depth camera [21]. Specifically, we virtually illuminate 3D shapes with a structured light pattern, which is extracted from Asus XTion sensors using [12, 37], and apply the PatchMatch Stereo algorithm [3] to estimate disparities (and hence depth maps) across stereo speckle images. In this way, the distribution of noise and missing areas in the synthesized depth images is much closer to that of real sensors, which makes the trained network generalize better to real-world data. In our experiments, we randomly pick 2 or 4 virtual viewpoints when generating training samples.

In essence, apart from shape completion, learning volumetric depth fusion amounts to seeking a function \(g(\{\mathcal {D}_1, \ldots , \mathcal {D}_n\})\) that maps raw depth scans to a noise-free TSDF. Therefore, to retain information from all input depth scans, we adopt the histogram-based TSDF representation (TSDF-Hist) proposed in [42] as the encoding of our input training samples. A 10D smoothed histogram, which uses 5 bins for negative and 5 bins for positive distances, with the first and the last bin reserved for truncated distances, is allocated for each voxel. The contribution of a depth observation is distributed linearly between the two closest bins (see the sketch below). For outputs, we simply choose plain 1-dimensional TSDFs as the representation.
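A minimal sketch of this per-voxel encoding follows, assuming evenly spaced bin centers over the truncated range \([-1, 1]\); the exact bin boundaries used in [42] may differ.

```python
import numpy as np

BIN_CENTERS = np.linspace(-1.0, 1.0, 10)   # 5 negative + 5 positive bins

def accumulate_tsdf_hist(hist, tsdf_value):
    """Add one truncated depth observation to a voxel's 10-D histogram,
    splitting its contribution linearly between the two closest bins;
    fully truncated values fall entirely into the first or last bin."""
    t = np.clip(tsdf_value, -1.0, 1.0)
    upper = np.searchsorted(BIN_CENTERS, t)
    if upper == 0:
        hist[0] += 1.0
    else:
        lower = upper - 1
        w = (t - BIN_CENTERS[lower]) / (BIN_CENTERS[upper] - BIN_CENTERS[lower])
        hist[lower] += 1.0 - w
        hist[upper] += w
    return hist
```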

Since we employ a cascaded architecture and use multi-scale supervision during network training, we need to generate training and ground-truth sample pairs at multiple resolutions. Specifically, TSDFs at \(32^3\), \(64^3\), \(128^3\), \(256^3\), and \(512^3\) voxel resolutions are simultaneously generated in our experiments.

Real-World Dataset. We construct a high-quality dynamic 3D reconstruction (or free-viewpoint video, FVV) system similar to [16] and collect 10 4D sequences of human actions, each capturing a different subject. A total of 9746 frames are randomly sampled from the sequences and split into training and test sets at a ratio of 4:1. We name this dataset Human10. For each frame, we fuse 2 or 4 randomly picked raw depth scans to obtain the TSDF-Hist encoding of the training sample, while the ground-truth TSDFs are produced by virtually scanning (see the previous section) the corresponding output triangle mesh of our FVV system. The sophisticated pipeline of our FVV system guarantees the quality and accuracy of the output mesh; however, the design and details of the FVV system are beyond the scope of this paper.

Fig. 3.

Results gallery. (a): Input scans fused from 2 randomly picked viewpoints. (b): Reconstruction results of the first stage of our 3D-CFCN. (c): Full-resolution reconstruction results of the two-stage 3D-CFCN architecture. (d): Ground-truth references.

4 Experiments

We have evaluated our 3D-CFCN architecture on both ModelNet40 and Human10 and compared different aspects of our approach with other state-of-the-art alternatives.

4.1 High-Resolution Shape Reconstruction

In our experiments, we train the 3D-CFCN separately on each dataset for 20 epochs (12 for stage 1, 8 for the two stages jointly), using the ADAM optimizer [33] with a learning rate of 0.0001, which takes \({\approx }80\) h to converge. The balancing weights in Eq. 5 are set to \(\lambda _1 = 0.5\) and \(\lambda _2 = 0.1\). During inference, it takes \({\approx }3.5\,\text {s}\) on average to perform a forward pass through both stages on an NVIDIA GeForce GTX 1080 Ti. The Marching Cubes algorithm [35] is used to extract surfaces from the output TSDFs. Figures 1, 3, and 4 illustrate the high-quality reconstruction results achieved with our 3D-CFCN architecture.

In Fig. 3 we show a variety of test cases from both the Human10 and ModelNet40 datasets. All the input TSDF-Hists were fused using depth maps from 2 viewpoints, and the same TSDF truncation threshold was applied. Despite the presence of substantial noise and missing data, our approach was able to reduce the noise and infer the missing structures, producing clean and detailed reconstructions. Comparing the second and the third columns, for Human10 models, stage 2 of our 3D-CFCN significantly improves quality by bringing more geometric details to the output meshes; on the other hand, a \(128^3\) voxel resolution suffices for the ModelNet40 models, so stage 2 does not yield significant improvements in those cases.

Auxiliary Visual Hull Information. In practice, most depth sensors can also capture synchronized color images, which opens up the possibility of obtaining auxiliary segmentation masks [11]. Given the segmentation masks from each view, a corresponding visual hull [34], which is essentially an occupancy volume, can be extracted (see the sketch below). Visual hulls provide additional information about the distribution of occupied and empty space, which is important for shape completion. We thus evaluated the performance of our 3D-CFCN when visual hull information is available. Towards this goal, we added corresponding visual hull input branches to both stages, which are concatenated with intermediate features after two \(3^3\) convolutional layers. Table 1 reports the average Hausdorff RMS distance between predicted and ground-truth 3D meshes, showing that using additional visual hull volumes as input brought a performance gain of around \(11\%\). Both the TSDF-Hists and the visual hull volumes in this experiment were generated using 2 viewpoints. Note that we also scaled the models in Human10 to fit into a \(3^3\) bounding box.
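For reference, a standard visual hull can be carved from the masks as in the following sketch, which uses voxel-center projection with assumed \(3 \times 4\) camera matrices; this is a generic formulation, not our exact implementation.

```python
import numpy as np

def visual_hull(masks, projections, grid_points):
    """A voxel is occupied iff its center projects inside the foreground
    mask of every view. `projections` are 3x4 camera matrices and
    `grid_points` is an (N, 3) array of voxel centers in world coordinates."""
    occupied = np.ones(len(grid_points), dtype=bool)
    pts_h = np.concatenate([grid_points, np.ones((len(grid_points), 1))], axis=1)
    for mask, P in zip(masks, projections):
        uvw = pts_h @ P.T                                   # project into the image
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        inside = (u >= 0) & (u < mask.shape[1]) & (v >= 0) & (v < mask.shape[0])
        fg = np.zeros(len(grid_points), dtype=bool)
        fg[inside] = mask[v[inside], u[inside]] > 0
        occupied &= fg
    return occupied
```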

Fig. 4.

Comparison of our reconstruction results with other state-of-the-art alternatives. (a): Input scans. (b): PSR [30]. (c): 3D-EPN [19]. (d): OctNetFusion [42]. (e): Ours. (f): Ground-truth references. Note the bulging artifacts on PSR’s results.

Table 1. Quantitative comparison of shape reconstruction techniques. Relative Hausdorff RMS distances, normalized by the diagonals of the bounding boxes, are measured against the ground-truth triangle meshes. All baseline methods use input data fused from 2 views.
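For clarity, the metric reported in Table 1 can be computed roughly as follows; the point sampling on the meshes and the symmetric formulation are assumptions, as the caption only names the metric.

```python
import numpy as np
from scipy.spatial import cKDTree

def relative_hausdorff_rms(pred_pts, gt_pts):
    """Symmetric RMS of nearest-neighbor distances between points sampled
    on the predicted and ground-truth surfaces, normalized by the
    ground-truth bounding-box diagonal."""
    d_pg = cKDTree(gt_pts).query(pred_pts)[0]     # predicted -> ground truth
    d_gp = cKDTree(pred_pts).query(gt_pts)[0]     # ground truth -> predicted
    rms = np.sqrt(np.mean(np.concatenate([d_pg, d_gp]) ** 2))
    diag = np.linalg.norm(gt_pts.max(axis=0) - gt_pts.min(axis=0))
    return rms / diag
```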

Number of Viewpoints. Here we evaluate the impact of the completeness of the input TSDF-Hists, i.e., the number of viewpoints used for fusing the raw depth scans, on reconstruction quality. We trained and tested the 3D-CFCN architecture using TSDF-Hists fused from 2 and 4 viewpoints, listing the results in Table 1. As expected, using more depth scans led to increased accuracy of the output meshes, since the input TSDF-Hists were more complete.

Robustness to Calibration and Tracking Error. Apart from sensor noise, calibration and tracking error is another major factor that can degrade scanned models. To evaluate the robustness of the proposed approach to such errors, we added random perturbations (from \(2.5\%\) to \(10\%\)) to the ground-truth camera poses, generated corresponding test samples, and predicted the reconstruction results using 3D-CFCN (a sketch of the perturbation is given below). As shown in Fig. 5(a), although the network has not been trained on samples with calibration error, it can still infer geometric structures reasonably well.
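One way to realize such a perturbation is sketched below; the text only specifies the perturbation magnitude, so the axis-angle rotation noise and the uniform translation offset are assumptions.

```python
import numpy as np

def perturb_pose(R, t, magnitude, scene_scale=1.0, seed=0):
    """Apply a random rotation (axis-angle) and translation offset whose
    size is a given fraction (e.g., 0.025-0.10) of the scene scale."""
    rng = np.random.default_rng(seed)
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = magnitude * rng.uniform(-np.pi, np.pi)     # assumed angular scale
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    dR = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)
    dt = magnitude * scene_scale * rng.uniform(-1.0, 1.0, size=3)
    return dR @ R, t + dt
```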

Fig. 5.

Evaluation and comparisons.

4.2 Comparison with Existing Approaches

Figure 4 and Table 1 compare our 3D-CFCN architecture with three learning-based state-of-the-art alternatives for 3D shape reconstruction, i.e., OctNetFusion [42], 3D-EPN [19], and OGN [53], as well as the widely used geometric method Poisson surface reconstruction (PSR) [30].

OctNetFusion. Similar to our approach, OctNetFusion adopts OctNet as the building block and learns to denoise and complete input TSDFs in a multi-stage manner. However, each stage in OctNetFusion is designed to take an upsampled TSDF and refine it globally (i.e., each stage needs to process all the octants of the grid-octree at the current resolution), making it infeasible to reconstruct 3D shapes at higher resolutions: learning at higher resolutions (e.g., \(512^3\)) not only increases the memory cost at the input and output layers, but also requires deeper network structures, which further strains the limited computational resources. Figure 4 and Table 1 summarize the comparison of our reconstruction results at \(512^3\) voxel resolution with OctNetFusion's results at \(256^3\).

3D-EPN. Without using octree-based data structures, 3D-EPN employs a hybrid approach, which first completes the input model at a low resolution (\(32^3\)) via a 3D CNN and then uses voxels from similar high-resolution models in the database to produce output distance volumes at \(128^3\) voxel resolution. However, as shown in Fig. 4, while being able to infer the overall shape of input models, this approach fails to recover fine geometric details due to the limited resolution.

OGN. As another work relevant to our 3D-CFCN architecture, OGN is an octree-based convolutional decoder. Although it scales well to high-resolution outputs, recovering accurate and detailed geometry from encoded shape features through deconvolution operations alone remains challenging. To compare our approach with OGN, we trained the proposed 3D-CFCN on the Human10 dataset to predict occupancy volumes, extracted \(32^3\) intermediate feature maps from the stage-1 3D-FCN of our architecture, and used these feature maps to train an OGN. Figure 5(b) compares the occupancy maps decoded by OGN with the corresponding occupancy volumes predicted by the proposed 3D-CFCN (both at \(512^3\) resolution), showing that our method performs significantly better than OGN with respect to fidelity and accuracy.

5 Conclusions

We have presented a cascaded 3D convolutional network architecture for efficient and high-fidelity shape reconstruction at high resolutions. Our approach refines the volumetric representations of partial and noisy input models in a progressive and adaptive manner, which substantially simplifies the learning task and reduces computational cost. Experimental results demonstrate that the proposed method can produce high-quality reconstructions with accurate geometric details. We also believe that extending the proposed approach to reconstructing sequences is a promising direction.