Neural 3D reconstruction from sparse views using geometric priors

Sparse view 3D reconstruction has attracted increasing attention with the development of neural implicit 3D representation. Existing methods usually only make use of 2D views, requiring a dense set of input views for accurate 3D reconstruction. In this paper, we show that accurate 3D reconstruction can be achieved by incorporating geometric priors into neural implicit 3D reconstruction. Our method adopts the signed distance function as the 3D representation, and learns a generalizable 3D surface reconstruction model from sparse views. Specifically, we build a more effective and sparse feature volume from the input views by using corresponding depth maps, which can be provided by depth sensors or directly predicted from the input views. We recover better geometric details by imposing both depth and surface normal constraints in addition to the color loss when training the neural implicit 3D representation. Experiments demonstrate that our method both outperforms state-of-the-art approaches, and achieves good generalizability.

animation, etc. Recently, much progress has been made with the development of neural implicit 3D representation. Unlike traditional methods which directly triangulate explicit 3D surface points by feature matching, neural implicit methods use multi-layer perceptrons (MLPs) to parameterize the underlying scene to be reconstructed using occupancy or signed distance fields. Such an implicit representation is usually learned by imposing photometric consistency via RGB image reconstruction loss through a volume rendering technique.
Solely applying photometric consistency obviously leads to an underconstrained surface reconstruction problem, since many geometries may exist that can reproduce the same colors upon volume rendering. Furthermore, the photometric consistency constraint is less effective in noisy or weakly textured regions. Thus, geometry directly recovered from a photometric consistency based neural implicit representation, such as NeRF [1], usually suffers from over-smoothing and noise. Reconstructing the underlying geometry with better, finer detail requires a dense set of 2D views.
In this paper, we show that a more accurate and detailed geometry can be achieved by incorporating monocular geometric cues when training the neural 3D implicit reconstruction method. Depths and normals are the two most common kinds of monocular geometric cues provided by the local geometry of the underlying surface. These cues can be estimated from monocular images with reasonable quality using deep learning approaches, such as Omnidata [2] and MVSNet [3], or the data can be provided directly by depth sensors. Photometric cues and geometric cues are complementary: depths and normals can help to infer the geometry in textureless regions, while photometric cues can help to enrich details. Thus, in addition to RGB image reconstruction loss, we also impose depth and normal reconstruction loss terms to the volume rendering results.
The geometric cues can also serve as an initial guess for the underlying geometry, which helps to build a reliable and sparse feature volume for the neural implicit 3D representation, making it easier to train. To obtain accurate reconstruction, previous methods [4] usually adopted a hierarchical implicit representation, which first reconstructs coarse geometry and then recovers details with a finer resolution. Though such a hierarchical architecture can be efficient, the final results are still affected by uncertainty and noise in the coarse geometry. If depths are available, we can avoid the use of a hierarchical architecture by directly determining an initial feature volume from the point cloud reprojected from the depth maps. This initial geometry is sparse and has comparable accuracy to the learned geometry [4], and can also be further refined. Including geometry priors during training of the neural implicit representation and determining the initial geometry directly from the geometric cues (i) improves the quality of the reconstructed geometry, (ii) improves the generalizability of the method, and (iii) leads to faster convergence, giving an overall better approach for sparse view 3D reconstruction.
In summary, our method makes the following contributions:
• a general approach exploiting geometric priors to improve 3D reconstruction quality and generalization for neural sparse view implicit 3D reconstruction,
• initial geometric reasoning for neural implicit surface models, which is more effective and easier to train, and
• extensive experiments which demonstrate that our method achieves state-of-the-art sparse view 3D reconstruction.

Multi-view 3D reconstruction using an explicit representation
Typical multi-view 3D reconstruction is conducted by first extracting features from 2D views, then reasoning about the underlying geometry producing each view, and finally fusing the view geometry to obtain the final 3D geometry for the whole object or scene. The fusion process differs depending on the 3D representation used, such as voxels [5-8], a point cloud [9, 10], or depths [3, 11-14]. Compared to traditional methods [15-17] based on hand-crafted geometric features, current CNN-based methods are more capable of extracting robust features and have produced promising results. In particular, using readily available depth estimation neural networks [2, 3, 11, 18], depth-based methods [19], together with dedicated fusion [20-22], can provide high-quality reconstruction from densely captured images. However, these methods have shortcomings when facing image noise or weak textures, and can fail to recover a complete surface given insufficient images or sparse views.

Neural implicit 3D reconstruction
With the success of novel view synthesis using neural radiance fields (NeRF) [1, 23-27], neural 3D implicit representations were quickly applied to multiview 3D reconstruction [4, 28-36]. Such methods usually extract the geometry from the predicted voxel density, occupancy field, or signed distance function (SDF). However, the geometry can suffer from image noise, and can be unreliable given weak textures or incomplete data, since these methods usually only impose photometric consistency when learning the implicit representation. To learn a better surface for incompletely visible objects, SNeS [37] explored the use of an object symmetry prior to help recover the invisible parts. Some recent reconstruction methods consider depths [38] or normals [39, 40] as geometric priors to help reconstruction; however, to obtain finely detailed geometry, these methods usually require a large number of input views to perform per-scene optimization, making it difficult to generalize to new scenes.
To increase the generalizability of the networks, efforts have been made to enable the network to memorize the input views or geometric cues. MVSNeRF [41] encodes the input views into a feature volume for better view synthesis. StereoNeRF [42] learns stereo correspondences from the input sparse views. PixelNeRF [43] and IBRNet [44] encode pixel colors into the network in addition to 3D coordinates and viewing directions. Depth [45, 46] and normal [47] priors have also been explored in neural rendering to better constrain the underlying geometry. Although these methods can synthesize plausible images, the extracted geometries can still suffer from noise, since all 3D spaces are considered in their representations. Our method, in contrast, can determine a more accurate, sparse initial geometry with the help of geometric cues, especially using depth estimation from the input views.

Overview
The pipeline of our method is illustrated in Fig. 1.
It uses a volume rendering scheme for surface reconstruction [4]. Given sparse input views (typically 3-7 views), our method first obtains a depth map for each view. Then the sparse geometry reasoning module builds a sparse feature volume from these depth maps by projecting them back to the world space to assemble a sparse point cloud of the underlying geometry. The resulting sparse feature volume is then fed into the geometry-guided surface reconstruction module to reconstruct an accurate surface with fine details, doing so by imposing geometric constraints based on losses from rendered depths and normals, in addition to the photometric loss.

Sparse geometry reasoning with depths
We parameterize the underlying geometry to be reconstructed using the signed distance function (SDF), following previous methods [1, 4, 47]. To efficiently reconstruct the underlying geometry, previous methods usually exploit a coarse-to-fine process, first estimating the coarse geometry for the whole scene, and then only refining details in occupied (non-empty) regions as determined by the coarse geometry. While this approach is efficient, it is not robust to noise and unreliable photometric consistency, since it usually only uses RGB images to build the underlying geometry. Given that depth maps of monocular images can be reliably and efficiently estimated by deep learning techniques [2, 3], or easily acquired by depth sensors, we make use of these depth maps for monocular RGB images to build a sparse and reliable coarse geometry for our neural implicit model. Specifically, given $N$ sparse images, we first determine or obtain corresponding depth maps $D_i$. These depth maps are then reprojected to the world coordinate system to produce a composite point cloud $P$, using the camera intrinsic parameters and poses of all images.
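To illustrate the reprojection step, the following minimal sketch back-projects depth maps to world space with a pinhole camera model and concatenates the results. The function names `backproject_depth` and `fuse_point_cloud`, the intrinsics tuple `(fx, fy, cx, cy)`, and the camera-to-world convention `X_w = R X_c + t` are assumptions made for this example; they are not taken from the paper.

```python
def backproject_depth(depth, intrinsics, rotation, translation):
    """Back-project a depth map (list of rows) to world-space 3D points.

    Assumes a pinhole camera and a camera-to-world pose X_w = R X_c + t.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    fx, fy, cx, cy = intrinsics
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            # Pixel (u, v) with depth z -> camera coordinates.
            xc = ((u - cx) / fx * z, (v - cy) / fy * z, z)
            # Camera coordinates -> world coordinates.
            xw = tuple(
                sum(rotation[r][k] * xc[k] for k in range(3)) + translation[r]
                for r in range(3)
            )
            points.append(xw)
    return points


def fuse_point_cloud(views):
    """Concatenate the back-projected points of all views into one cloud."""
    cloud = []
    for depth, intrinsics, rotation, translation in views:
        cloud.extend(backproject_depth(depth, intrinsics, rotation, translation))
    return cloud


# Toy example: one 2x2 view with one invalid pixel, shifted by 1 along z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
depth = [[2.0, 0.0], [2.0, 2.0]]
P = fuse_point_cloud([(depth, (1.0, 1.0, 0.0, 0.0), identity, (0.0, 0.0, 1.0))])
```

In practice the depth maps would be dense images and the loop would be vectorized; the explicit loops are kept here only to make each step visible.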
Fig. 1 Pipeline. Our method takes sparse views as input and reconstructs the underlying 3D surface by (i) reliably determining a coarse geometry from the depth maps, and (ii) in addition to photometric consistency $L_c$, imposing both depth consistency $L_d$ and normal consistency $L_n$, leading to a more general and accurate framework for the neural implicit model.
To construct a feature volume $G$ for geometry reasoning, we follow a prior method [3] for multiview depth estimation to build a cost volume $C$. We first extract a 2D feature map $F_i$ from each input image using a 2D feature extraction network, and voxelize the fused point cloud $P$ into regular voxels with voxel size $d$. We then project all points $P(v)$ contained in each voxel $v$ onto feature map $F_i$ and determine the voxel's feature $F_i(v)$ as the average of the features at the projected locations. The cost feature volume of a voxel $v$ is then obtained by computing the variance of the voxel's features $F_i(v)$ over all input views.
Finally, the geometric feature volume $G$ is obtained by applying a sparse 3D CNN to $C$. Note that this feature volume is inherently sparse because of the sparsity of the point cloud. To account for possible errors in depth maps, we also dilate each voxel by a distance $\delta_d$.
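The construction of the sparse cost volume can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the names `voxelize` and `cost_volume` are hypothetical, each entry of `feature_samplers` stands in for "project the point into view i and read the feature map there", and the variance is taken as the population variance per feature channel, which is an assumption.

```python
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group 3D points by the integer index of the voxel containing them."""
    voxels = defaultdict(list)
    for p in points:
        index = tuple(int(c // voxel_size) for c in p)
        voxels[index].append(p)
    return voxels


def cost_volume(points, voxel_size, feature_samplers):
    """Per-voxel variance of view features, stored sparsely in a dict.

    feature_samplers[i](p) is assumed to return the feature vector of
    view i at the projection of the 3D point p (hypothetical interface).
    """
    cost = {}
    for index, pts in voxelize(points, voxel_size).items():
        # Voxel feature per view: average over the points in the voxel.
        per_view = []
        for sample in feature_samplers:
            feats = [sample(p) for p in pts]
            dim = len(feats[0])
            per_view.append(
                [sum(f[k] for f in feats) / len(feats) for k in range(dim)]
            )
        # Cost: channel-wise (population) variance across the views.
        n = len(per_view)
        mean = [sum(f[k] for f in per_view) / n for k in range(dim)]
        cost[index] = [
            sum((f[k] - mean[k]) ** 2 for f in per_view) / n for k in range(dim)
        ]
    return cost


# Toy example: two views that disagree on the first feature channel only.
points = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.5, 0.2, 0.2)]
sampler_a = lambda p: [1.0, 0.0]
sampler_b = lambda p: [3.0, 0.0]
C = cost_volume(points, 1.0, [sampler_a, sampler_b])
```

Only voxels that contain at least one point appear as keys, which mirrors the sparsity of the feature volume noted in the text.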
In this way, our method directly reconstructs the underlying geometry using only one level of implicit field, i.e., the finest level of previous methods, while still being as efficient as possible, and achieving more accurate results.
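The voxel dilation used to absorb depth errors can be sketched as below. The function name `dilate_voxels` is hypothetical, and the cubic neighbourhood (all voxels within `radius` steps along each axis) is an assumption, since the shape of the dilation is not specified in the text; clipping to the bounds of the voxel grid is omitted.

```python
from itertools import product


def dilate_voxels(occupied, radius):
    """Return all voxel indices within `radius` voxels of an occupied voxel.

    Uses a cubic neighbourhood (Chebyshev distance), which is an assumption.
    """
    offsets = list(product(range(-radius, radius + 1), repeat=3))
    dilated = set()
    for (i, j, k) in occupied:
        for (di, dj, dk) in offsets:
            dilated.add((i + di, j + dj, k + dk))
    return dilated


# A single occupied voxel grows to a 3x3x3 block with radius 1.
dilated = dilate_voxels({(0, 0, 0)}, 1)
```

The number of active voxels grows quickly with the radius, which is consistent with a larger dilation range costing more computation.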

Geometry guided surface reconstruction
Following Ref. [4], given a query 3D position $q$, our implicit 3D representation directly predicts its signed distance $\mathrm{SDF}(q)$. Specifically, an MLP network is applied to predict the surface from the geometric features of $G$ interpolated at $q$, concatenated with the positional encoding $\mathrm{PE}(q)$ of $q$. NeuS [28] used dedicated volume rendering for multiview 3D reconstruction using a neural implicit surface, and was later extended to sparse view 3D reconstruction [4]. However, the neural implicit surface is optimized only using photometric supervision, and may therefore suffer from noise and unreliable photometric consistency in textureless regions.
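As a rough illustration of how the MLP input could be assembled, the sketch below concatenates an interpolated volume feature with a frequency-based positional encoding of the query point. The NeRF-style sine and cosine encoding, the inclusion of the raw coordinates, the default of six frequency levels, and the names `positional_encoding` and `sdf_network_input` are all assumptions for this example; the text does not specify the form of the encoding.

```python
import math


def positional_encoding(q, num_freqs=6):
    """Encode a 3D point with sines and cosines at octave frequencies.

    The raw coordinates are kept as the first three entries (an assumption).
    """
    encoded = list(q)
    for level in range(num_freqs):
        freq = 2.0 ** level
        for c in q:
            encoded.append(math.sin(freq * c))
            encoded.append(math.cos(freq * c))
    return encoded


def sdf_network_input(feature_at_q, q, num_freqs=6):
    """Concatenate the interpolated volume feature with the encoding of q."""
    return list(feature_at_q) + positional_encoding(q, num_freqs)


# Toy example: a 2-channel feature at the origin with 4 frequency levels.
x = sdf_network_input([0.5, -0.2], (0.0, 0.0, 0.0), num_freqs=4)
```

The resulting vector would then be passed to the MLP that predicts the signed distance; the interpolation of the feature volume itself is not shown.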
In addition to photometric consistency, we exploit complementary geometric priors to reconstruct more accurate and detailed geometries from the sparse input views. Specifically, to render the depth $d(r)$ and normal $n(r)$ as well as the color $c(r)$ of a ray $r$ going through the underlying scene, we first query the depths $d_i$, normals $n_i$, colors $c_i$, and SDF values $s_i$ for all $M$ sampled points $\{p(t_i)\}$ along the ray; then we convert $s_i$ to densities $\sigma_i$ following NeuS [28], using the sigmoid function $\Phi_s(x) = (1 + e^{-sx})^{-1}$, where $s$ is a learnable parameter. Finally, we combine the depths, normals, and colors with the densities to obtain the rendered values, where $\delta_i = \|p(t_{i+1}) - p(t_i)\|_2$ is the distance between two consecutive sample points and $T_i = \exp(-\sum_{j=0}^{i-1} \delta_j \sigma_j)$ is the accumulated transmittance. Note that we follow Ref. [4] to calculate the color at each sampled point, which blends the colors of pixels or patches from the input views. $d_i$ is the distance along the ray from the ray origin to the sampled point, and $n_i$ is the spatial gradient of the predicted SDF at the sampled point.
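The accumulation along a ray can be sketched with the usual volume rendering quadrature, in which sample $i$ receives the weight $T_i (1 - e^{-\sigma_i \delta_i})$. This weighting is an assumption that is consistent with the definitions of $\delta_i$ and $T_i$ above, not a formula quoted from the paper. The densities are taken as given here, so the SDF-to-density conversion is not shown, and the function name `render_ray` is hypothetical.

```python
import math


def render_ray(points, sigmas, depths, normals, colors):
    """Composite per-sample depth, normal and color along one ray.

    points are the sample positions ordered from near to far. The last
    sample has no following interval and is therefore not composited.
    """
    rendered_depth = 0.0
    rendered_normal = [0.0, 0.0, 0.0]
    rendered_color = [0.0, 0.0, 0.0]
    optical_depth = 0.0  # running sum of delta_j * sigma_j for j < i
    for i in range(len(points) - 1):
        delta = math.dist(points[i + 1], points[i])
        transmittance = math.exp(-optical_depth)  # T_i
        alpha = 1.0 - math.exp(-sigmas[i] * delta)
        weight = transmittance * alpha
        rendered_depth += weight * depths[i]
        for k in range(3):
            rendered_normal[k] += weight * normals[i][k]
            rendered_color[k] += weight * colors[i][k]
        optical_depth += sigmas[i] * delta
    return rendered_depth, rendered_normal, rendered_color


# Toy example: a nearly opaque surface at the second sample of the ray.
points = [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3)]
sigmas = [0.0, 50.0, 0.0, 0.0]
depths = [0.0, 1.0, 2.0, 3.0]
normals = [(0.0, 0.0, -1.0)] * 4
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
d, n, c = render_ray(points, sigmas, depths, normals, colors)
```

Because depth, normal, and color share the same weights, supervising any one of them also constrains the density field that produces the others.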
We can now train a more accurate neural implicit surface by imposing consistency on depth and normal as well as color, using the total loss in Eq. (9), where $L_c$, $L_n$, and $L_d$ are the photometric loss, normal loss, and depth loss, respectively, $R$ is the set of all sample rays, $c(r)$, $n(r)$, and $d(r)$ are the ground-truth color, normal, and depth of the sample ray $r$, respectively, and $w_n$ and $w_d$ weight the normal and depth losses.
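A minimal sketch of such a weighted loss is given below. It assumes that the total loss is $L_c + w_n L_n + w_d L_d$, that each term is an L1 distance averaged over the rays, and that the color term is unweighted; the exact form of each term is not recoverable from the text, so these choices and the names `l1` and `total_loss` are illustrative only. The default weights are the values suggested in the parameter study.

```python
def l1(a, b):
    """Sum of absolute differences between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_loss(rendered, ground_truth, w_n=0.9, w_d=0.1):
    """Weighted sum of color, normal and depth losses over a set of rays.

    rendered and ground_truth are lists of (color, normal, depth) per ray.
    L1 distances and averaging over rays are assumptions of this sketch.
    """
    num_rays = len(rendered)
    loss_c = loss_n = loss_d = 0.0
    for (c, n, d), (c_gt, n_gt, d_gt) in zip(rendered, ground_truth):
        loss_c += l1(c, c_gt)
        loss_n += l1(n, n_gt)
        loss_d += abs(d - d_gt)
    loss_c, loss_n, loss_d = (x / num_rays for x in (loss_c, loss_n, loss_d))
    return loss_c + w_n * loss_n + w_d * loss_d


# Toy example: one ray with a color error of 1 and a depth error of 1.
rendered = [((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 2.0)]
ground_truth = [((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.0)]
loss = total_loss(rendered, ground_truth)
```

Setting `w_n` or `w_d` to zero reproduces the corresponding ablation setting in which that prior is switched off.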

Dataset
We trained our generalizable sparse view 3D reconstruction model on the DTU multiview stereo dataset [48], which contains 75 scenes for training and a further 15 non-overlapping scenes for testing. We centrally cropped the images to a resolution of 512 × 640 for both training and testing. To train our network, ground-truth normals were estimated from the ground-truth point cloud for the underlying scene provided with the dataset. While depth maps could be estimated using current learning-based methods [2, 3] for each view, to ensure depth consistency between views, we used the ground-truth depth to determine the initial geometry required by our network for both training and testing. To account for sparsity and errors in the depth map, we set the dilation range $\delta_d$ to 7 voxels. Our network was trained with 6 views, including 1 reference view and 5 source views.

Implementation details
We adopted a feature pyramid network [49] as the multi-scale 2D image feature extraction network and used a U-net like sparse 3D convolution network [50]. The 3D voxel resolution was set to 192×192×192 and the weights in the loss function were set as $w_d = 0.1$ and $w_n = 0.9$. We trained our model for 20k iterations with an initial learning rate of $2 \times 10^{-4}$, adjusted using a cosine decay schedule, with a factor of 0.5 at 10k and 15k steps. Our model was trained using the Jittor deep learning framework [51] on a Titan RTX GPU using a batch size of 512 rays.

Metrics
To evaluate the accuracy of 3D reconstruction, we adopt the commonly used chamfer distance, which measures point distances between the predicted and ground-truth geometries. Following Ref. [4], we make use of the foreground object masks provided in IDR [34] to remove the background from the reconstructed results when computing metrics.
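For reference, a brute-force version of a symmetric chamfer distance between two point sets is sketched below. Averaging the two directed mean nearest-neighbour distances is one common convention and is an assumption here; the evaluation used in the paper follows Ref. [4] and may differ in sampling, masking, and outlier handling. The function name `chamfer_distance` is hypothetical.

```python
import math


def chamfer_distance(pred, gt):
    """Symmetric chamfer distance between two point sets (brute force)."""

    def mean_nearest(src, dst):
        # Mean distance from each source point to its nearest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    # Average of the two directed distances.
    return 0.5 * (mean_nearest(pred, gt) + mean_nearest(gt, pred))


# Toy example: one matching point and one point displaced by 1 unit.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(pred, gt)
```

The quadratic cost of this version is only acceptable for small point sets; practical evaluations typically use a spatial index for the nearest-neighbour queries.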

Quantitative and qualitative results
We first compared our method to a baseline neural surface reconstruction method SparseNeuS [4], which ignores both depth and normal cues when learning the 3D implicit field. For a fair comparison, we followed the setting of SparseNeuS by performing evaluation on two sets of three views for each test scene. The final metrics average these pairs of results. We also compared our method to other leading general sparse view 3D reconstruction methods, including (i) generic neural rendering methods such as PixelNeRF [43], IBRNet [44], and MVSNeRF [41], where the reconstructed mesh is extracted from the learned implicit field, and (ii) the widely used classic MVS method COLMAP [19], where the reconstructed mesh is extracted from the reconstructed point cloud. Note that, to test the generalizability of all methods, they were not fine-tuned to suit each scene. Per-scene chamfer distances and mean chamfer distance on the DTU test set are reported in Table 1. All values except those for our method are directly drawn from Ref. [4]. We also present a visual comparison of sample output from SparseNeuS and our method in Fig. 2, which additionally shows the cosine similarity of the predicted geometry's normals to the ground-truth values. Our method achieves the most accurate reconstruction as assessed in terms of chamfer distance. Generic neural rendering methods, such as PixelNeRF, IBRNet, and MVSNeRF, struggle to recover fine geometric details using only the input images. Compared to SparseNeuS, with the help of depth guided sparse geometry reasoning and the constraints from both depth and normal priors, our method can reconstruct more accurate and detailed geometry. Furthermore, unlike SparseNeuS, our method does not need to train two (coarse and fine) networks, making training easier. Note that our method can outperform the COLMAP classical MVS method without the need for per-scene fine-tuning, which is required by SparseNeuS.
Indeed, our results are even a little better than the fine-tuned results of SparseNeuS, having a mean chamfer distance of 1.27 on the DTU test dataset.
The reconstructed results for other scenes from the DTU test set are shown in Fig. 3.
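The cosine similarity between a predicted normal and a ground-truth normal, as visualized in the comparison above, can be computed as in the following sketch. The function name `cosine_similarity` and the small `eps` guard against zero-length vectors are assumptions of this example.

```python
import math


def cosine_similarity(n_pred, n_gt, eps=1e-12):
    """Cosine of the angle between two normal vectors (1 means identical)."""
    dot = sum(a * b for a, b in zip(n_pred, n_gt))
    norm = math.sqrt(sum(a * a for a in n_pred)) * math.sqrt(
        sum(b * b for b in n_gt)
    )
    return dot / max(norm, eps)


aligned = cosine_similarity((0.0, 0.0, 2.0), (0.0, 0.0, 1.0))
orthogonal = cosine_similarity((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
flipped = cosine_similarity((0.0, 0.0, 1.0), (0.0, 0.0, -1.0))
```

Values close to 1 indicate that the predicted surface orientation agrees with the ground truth, independently of the lengths of the two vectors.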

Ablation study
To demonstrate the effectiveness of our approach, we conducted experiments by ablating the core modules of our method providing sparse geometry reasoning, depth constraints, and normal constraints. In the following evaluation, results were reconstructed using 6 views for consistency with the training process. These views were selected according to view pairs provided by the DTU dataset [48], where the first three views are one set of three views used in SparseNeuS.
The study was performed as follows:
• Without sparse geometry reasoning (SGR). We replaced our sparse geometry reasoning module with the two-stage geometry reasoning module from SparseNeuS [4] and used the same settings of resolutions for coarse and fine voxels. Depth and normal constraints were imposed using the same weights as in our unaltered method.
• Without normal prior. We set the weight for the normal loss to zero in our full method.
• Without depth prior. We set the weight for the depth loss to zero in our full method.
A quantitative analysis of 3D reconstruction results is given in Table 2 and a qualitative assessment is presented in Fig. 4. As we can see, eliminating the SGR module leads to over-smoothed surfaces and less accurate geometry; both depth and normal priors contribute to the geometric details. We also present novel view synthesis results in Fig. 5 by rendering an unseen view for several DTU test scenes. The results show that, though originally designed for 3D reconstruction, our geometric priors, especially the sparse geometry reasoning, are also beneficial when synthesizing novel views.
Fig. 4 Ablation study. Left to right: (a) reference image, (b) our method without sparse geometry reasoning (SGR), (c) our method without depth priors, (d) our method without normal priors, and (e) our full model. The SGR module helps to recover more accurate and detailed geometry; both depth and normal cues improve local details of the underlying geometry.

Parameter study
Our method is affected by four main parameters, namely the weight for depth loss $w_d$, the weight for normal loss $w_n$, the voxel size $d$ (inversely proportional to the number of voxels in each direction), and the depth dilation range $\delta_d$. We varied both $w_d$ and $w_n$ from 0.0 to 0.9 with a step size of 0.1 and tested two settings of voxel resolution, i.e., 96 or 192 voxels in each dimension. The mean chamfer distances for different parameter configurations, using the DTU dataset test scenes, are listed in Table 3. We can observe that (1) $d$ is responsible for geometric reconstruction accuracy: the smaller $d$ is, the more accurate the geometry, but at a cost of increased computation. (2) As the depth weight starts to increase (from 0.01 to 0.1), the geometry becomes more and more accurate; however, when it continues to increase, the accuracy drops. This may be somewhat affected by the sparse geometry reasoning module, which uses the depth information to construct an initial geometry for the underlying scene. (3) A larger normal loss weight recovers more geometric details. (4) A larger depth dilation range can cover more true surface regions and thus be more robust to noise in the depth maps, achieving more accurate reconstruction results, but at a cost of a greater computational burden. To balance the accuracy of the recovered geometry and computational effort, we suggest setting $w_d = 0.1$, $w_n = 0.9$, and using a higher voxel resolution and a smaller depth dilation range.

Conclusions and future work
This paper has presented a general framework for neural implicit 3D reconstruction from sparse views. By leveraging geometric priors, our method can determine a sparse and reliable coarse implicit geometry for optimization. This is done by imposing both depth consistency and normal consistency, as well as photometric consistency, on the training loss function. This makes the framework more general and accurate.
Currently, we set a fixed dilation range for the depth when constructing the initial geometry. This could be further improved in practice if the uncertainty of the depth map is known. Our model can also be per-scene fine-tuned given more views of a specific scene, using only the color loss.
In future, we would also like to apply our method to outdoor large scene 3D reconstruction from remote sensing images or aerial images, for which accurate depths and normals are even harder to obtain, by leveraging accurate mapping data.
Tai-Jiang Mu is an assistant researcher in the Department of Computer Science and Technology at Tsinghua University. He received his bachelor degree and Ph.D. degree in computer science and technology from Tsinghua University in 2011 and 2016, respectively. His research interests include computer graphics, visual media learning, 3D reconstruction, and 3D understanding.
Hao-Xiang Chen received his bachelor degree in computer science from Jilin University in 2020. He is currently a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University. His research interests include 3D reconstruction and 3D computer vision.
Jun-Xiong Cai is currently a postdoctoral researcher at Tsinghua University, where he received his Ph.D. degree in computer science and technology in 2020. His research interests include computer graphics, computer vision, and 3D geometry processing.
Ning Guo is an assistant researcher at the Academy of Military Sciences. He received his bachelor degree, master degree, and Ph.D. degree in information and communication engineering from the National University of Defense Technology in 2014, 2016, and 2020, respectively. His research interests include digital earth, 3D GIS, 3D reconstruction, and spatial databases.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.