1 Introduction

Major advances have been made in scene modelling over the past decades through the development of both classical and, more recently, deep learning approaches. These leverage different cues and features, whether handcrafted or learnt, to infer scene geometry, often with an impressive degree of fidelity. However, existing techniques usually remain reliant on assumptions about the scene properties (surface reflectance, texture, geometry) or capture conditions (illumination, camera placement) to produce reliable reconstructions. Surface reflectance in particular is one of the main factors that currently prevents generalisation of modelling techniques to arbitrary scenes, as it is often assumed either to follow a particular model (e.g. Lambertian, or a specific parametric model) or to be known a priori. This restricts the applicability of existing techniques to specific classes of objects which meet those requirements.

Recent developments in neural radiance fields (Mildenhall et al., 2021; Wang et al., 2021; Yariv et al., 2021; Oechsle et al., 2021) have relaxed the requirement to have a model of the Bidirectional Reflectance Distribution Function (BRDF). Another promising approach that makes the reconstruction process agnostic to the scene’s reflectance is Helmholtz Stereopsis (HS). The approach exploits Helmholtz reciprocity (von Helmholtz, 1924) to derive a constraint that is independent of the BRDF and can be used to retrieve both scene depth and normal information. Whilst promising, its formulations to date have been almost exclusively limited to 2.5D reconstruction, framing the problem in terms of estimating the depth and normals at each pixel from a given viewpoint (usually a virtual view) and thereby allowing only a partial reconstruction. These approaches also ignore visibility since all the cameras are confined to one side of the object. Aside from our recent work which we extend here, the few HS approaches that considered full 3D reconstruction were either based on local optimisation or refinement in conjunction with another modality (structured light).

This paper aims to fill the gap in this area by proposing a family of methods for full 3D modelling of scenes with arbitrary unknown reflectance based on Helmholtz reciprocity. It explores different paradigms to extract a full 3D model from a set of reciprocal image pairs each acquired by swapping the positions of the camera and light source. The first approach is based on fusing view-dependent reconstructions, leveraging confidence metrics to optimise the full 3D surface recovery. Different formulations are investigated depending on the geometry of the grid used to recover the intermediate view-dependent representations (orthographic or perspective). The second approach is based on a volumetric optimisation process which overcomes the need to compute intermediate representations. The last approach uses a mesh-based formulation to also allow direct optimisation of the full 3D surface while at the same time reducing the computational footprint compared to the volumetric formulation.

The paper makes the following contributions. Firstly, it introduces three novel approaches to perform full 3D modelling of scenes with complex reflectance. The formulations share the use of Markov Random Fields (MRFs) to provide a principled optimisation framework. A tailored visibility handling approach is also introduced to overcome the shortcomings of previous 2.5D formulations. Secondly, the paper contributes novel datasets for full 3D reconstruction using HS. These include both synthetic scenes generated using POV-Ray and real scenes acquired using a versatile capture setup built from consumer hardware (a pair of DSLR cameras equipped with lens-mounted flashes). The datasets are publicly released with this paper (see data access statement at the end of the paper). These datasets alongside other publicly available datasets are used to conduct an extensive experimental validation of the proposed approaches including a comparison against recent Multi-View Stereo (MVS), MultiView Photometric Stereo (MVPS) and neural radiance field-based methods.

The paper extends our previous work presented in Addari and Guillemaut (2019a, 2019b, 2020) in several ways. First, all three paradigms are brought together into a common article with full detail provided and a common evaluation protocol to facilitate their analysis and comparison. Second, the fused view-dependent formulation is generalised to perspective cameras and confidence measures are introduced. The former allows segmentation information to be leveraged during the modelling process while the latter improves the fusion process. Third, a substantially expanded experimental validation is conducted. This incorporates evaluation on a large number of scenes as well as a thorough analysis using additional performance metrics, new results and for the first time a comparison against the state of the art in \(\hbox {MVS}\), \(\hbox {MVPS}\) and neural radiance field-based methods to validate experimentally the benefits of HS for reconstruction of surfaces with complex reflectance.

The paper is structured as follows. Section 2 reviews the main 3D reconstruction approaches, highlighting their dependency on the scene’s reflectance properties as well as the state of the art in HS and its underpinning principles. Section 3 describes the three proposed approaches to full 3D modelling of complex scenes. Section 4 conducts an experimental evaluation of the different approaches on both synthetic and real data. Finally, Sect. 5 concludes the paper and discusses avenues for future work.

2 Related Work

2.1 3D Reconstruction Overview

This section provides an overview of the main categories of 3D reconstruction techniques, with an emphasis on their abilities to handle different types of surface reflectance properties.

Shape from Silhouettes (SfS) approaches are based on intersecting the set of visual cones defined by backprojecting the object’s silhouette in each image (Baumgart, 1974) to obtain its \(\hbox {Visual Hull (VH)}\) (Laurentini, 1994). Independence from the surface properties is the main advantage offered by this class of methods, as long as good background segmentation can be obtained. SfS methods can be further divided into image-based (Matusik et al., 2000), volumetric (Szeliski, 1993; Tarini et al., 2002; Liu et al., 2006), surface-based (Cipolla & Blake, 1992; Forbes et al., 2004; Liang & Wong, 2010) and hybrid approaches (Boyer & Franco, 2003). Further improvements have been introduced by using \(\hbox {Convolutional Neural Networks (CNNs)}\) to obtain a probabilistic \(\hbox {VH}\) (Gilbert et al., 2018). However, this class of techniques remains inherently limited to the reconstruction of convex objects, since concavities are not observable in silhouettes.

Binocular Stereo and Multi-View Stereo (MVS) methods (Szeliski et al., 2008; Seitz et al., 2006) use point correspondences across images to infer surface depth. Contrary to \(\hbox {SfS}\) they are not limited to the reconstruction of convex objects. However, the reconstructed surfaces are often assumed to be Lambertian or sufficiently textured to perform point matching across views. This assumption is the main drawback as it is often violated in practice. Certain approaches attempt to jointly estimate surface reflectance and scene geometry. In Oxholm and Nishino (2014) and Lombardi and Nishino (2016), the geometry is obtained iteratively from a previous estimate of the reflectance and vice versa. This hinders performance, as each estimate is only as good as the previously estimated other term. In Holroyd et al. (2010), instead, light descattering is used in conjunction with an active \(\hbox {MVS}\) technique to obtain both geometry and BRDF. The results are highly precise; however, the method requires a complex setup comprising two coaxial camera/light source assemblies to capture stacks of coaxial and reciprocal images of the object. More recently, several \(\hbox {MVS}\) formulations have explored the use of \(\hbox {CNNs}\). Examples include Kar et al. (2015), in which reconstruction is limited to a specific set of categories, and Choy et al. (2016), which produces a class-independent volumetric reconstruction of the analysed scenes. Despite the potential of CNN-based methods, they have yet to reach the level of accuracy of other types of approaches when a larger number of images is used. A further hindrance that is still holding back machine learning approaches in this context is the availability of training data. Recently, unsupervised solutions such as Dai et al. (2019) have been proposed; however, the quality of the results is still far from that of traditional techniques.

Photometric Stereo (PS) (Woodham, 1980) allows reconstruction of non-Lambertian surfaces; however, the surface reflectance needs to be known a priori. An extensive survey of \(\hbox {PS}\) techniques was published in Ackermann and Goesele (2015). Han and Shen (2015) use a complex setup to obtain a very densely sampled set of light and viewing directions. This allows specularities and shadows to be exploited to reconstruct both isotropic and anisotropic objects. The BRDF is modelled by dividing it into its components: diffuse, specular and shadows. Capturing many images under varying lighting enables both BRDF modelling and \(\hbox {PS}\) reconstruction. The main drawbacks of this approach are its computational complexity and the unsatisfactory accuracy obtained when few lighting directions are used. In Ikehata (2018), a CNN-based approach is proposed to reconstruct non-convex objects. The approach learns relations between the reflections in the input images and surface normal orientation, using synthetic datasets for training. A similar approach is applied to near-field \(\hbox {PS}\) in Logothetis et al. (2020). Despite showing promising results, the surface estimation remains tied to the per-point network predictions and no explicit surface optimisation strategy is implemented. PS removes the need to establish correspondences across views required in stereo approaches, allowing scenes with more complex reflectance to be reconstructed. Further, it achieves high-quality reconstruction under Lambertian conditions or when an accurate model of the scene reflectance is available. However, the Lambertian assumption is often violated and it remains challenging to obtain an accurate model of the scene reflectance (Ward, 1992; Tunwattanapong et al., 2013). Many challenges remain in achieving accurate reconstruction of non-Lambertian surfaces which present spatially varying or anisotropic BRDFs, and most \(\hbox {PS}\) approaches still rely on the assumption of isotropic reflectance to operate. \(\hbox {Multi-View Photometric Stereo (MVPS)}\) methods generalise \(\hbox {PS}\) to multiple viewpoints, enabling both geometric and photometric cues to be leveraged in the same framework (Park et al., 2016; Logothetis et al., 2019; Li et al., 2020).

Another important set of techniques used to perform 3D reconstruction is based on separating specular and diffuse reflections. For example, in Mallick et al. (2005), a data-driven colour space conversion is performed on the RGB images, allowing the two components to be separated. The authors present this method for surfaces that can be modelled using dichromatic reflectance, a special case of BRDF. Performing this separation allows only the two-channel diffuse component, which can be approximated as Lambertian, to be considered, producing accurate reconstructions using \(\hbox {PS}\). Similarly, Ma et al. (2007) propose the use of spherical gradient illumination patterns to separate the specular and diffuse components of objects made up of complex materials such as human faces. The main drawbacks of this technique are the single viewpoint and the trade-off at the edge of the objects. Conversely, in Ghosh et al. (2011), the authors use two polarised spherical gradient illumination patterns to perform the reconstruction with multiple viewpoints. The cameras are equipped with polarisers to selectively capture the correct pattern. Finally, in Fyffe et al. (2016) the authors perform colour space conversion and use the results to obtain the diffuse normals and albedo and the specular normals for each colour sub-space. They further compute a per-pixel specular exponent to refine the resulting mesh. An advantage of this method with respect to Ma et al. (2007) and Ghosh et al. (2011) is that off-the-shelf components are used and the total capture time is significantly lower. However, the setup is still fairly complex and requires a large number of cameras and flashes.

Recent advances in neural scene representations and volume rendering techniques have enabled an unprecedented level of photorealism in novel view synthesis from only a sparse set of input views, pioneered by the introduction of NeRF (Mildenhall et al., 2021). The approach is based on combining a neural radiance field scene representation, which uses a neural network to model both radiance and volume density at each point in space, with a volume rendering approach which, being differentiable, is particularly well suited for learning the network’s weights. Although the approach is primarily aimed at novel view synthesis, scene geometry can be retrieved from the volume density. However, the quality of the surface that can be extracted is limited since the network is tailored for novel view synthesis rather than reconstruction. Several concurrent subsequent works have extended the approach to address this limitation and enable high-quality surface modelling by incorporating an implicit surface representation into the framework (Wang et al., 2021; Yariv et al., 2021; Oechsle et al., 2021). Wang et al. (2021) and Yariv et al. (2021) achieve this by parametrising the volume density based on a signed distance function, while Oechsle et al. (2021) introduce a continuous occupancy field to represent the surface. All approaches demonstrate the ability to recover fine detail including thin structures. Interestingly, these techniques do not make any assumption about the scene reflectance. Other approaches combine inverse rendering techniques with deep learning to recover shape alongside material properties (Bi et al., 2020; Zhang et al., 2021). Bi et al. (2020) use a volumetric representation to model opacity, surface normal and reflectance at each voxel, and require the images to be acquired with a collocated light source. Zhang et al. (2021) use a signed distance function to represent shape and spherical Gaussians to approximate the light transport. Their approach is able to infer the scene properties from a set of images captured under static illumination.

The vast majority of existing methods, aside from some of the more recent learning-based approaches using neural radiance fields, fail to address surfaces presenting complex and varying reflectance as they mainly rely on simplifying assumptions to constrain the problem. An interesting alternative to these techniques that allows reconstruction of surfaces with complex BRDFs is HS.

2.2 Helmholtz Stereopsis

Helmholtz Stereopsis (HS) exploits the principle of Helmholtz reciprocity (von Helmholtz, 1924) to make the reconstruction process agnostic to the reflectance. This was first introduced in Magda et al. (2001) and subsequently developed into what is known as HS in Zickler et al. (2002). The key idea is to exploit Helmholtz reciprocity as a constraint to identify 3D points located on the surface of an object and their corresponding normals. Helmholtz reciprocity states that the measured BRDF at a surface point remains invariant when illumination and viewing directions are swapped (Nicodemus et al., 1977; Snyder, 2002). This invariance is exploited to obtain the normal of a surface point, using multiple reciprocal pairs of images from which the point is visible. In its original formulation, HS is posed as a maximum likelihood problem, with no regularisation, which can yield noisy results.

Given a point light source with strength \(\kappa \) located at \(\textbf{O}_\text {r}\), the intensity of the surface point \(\textbf{P}\) imaged by a camera positioned at \(\textbf{O}_\text {l}\) satisfies:

$$\begin{aligned} i_\text {l}=f(\textbf{v}_\text {r},\textbf{v}_\text {l}) \frac{\textbf{n} \cdot \textbf{v}_\text {r}}{\Vert \textbf{O}_\text {r} - \textbf{P}\Vert ^2} \kappa , \end{aligned}$$
(1)

where \(f(\textbf{v}_\text {r},\textbf{v}_\text {l})\) is the BRDF at surface point \(\textbf{P}\) (see Fig. 1). The unit vectors \(\textbf{v}_\text {r}\) and \(\textbf{v}_\text {l}\) indicate the lighting and viewing directions respectively, while \(\textbf{n}\) denotes the surface normal at point \(\textbf{P}\). When inverting camera and light positions, the same point \(\textbf{P}\) viewed by the camera located at \(\textbf{O}_\text {r}\) and lit by a light source with strength \(\kappa \) located at \(\textbf{O}_\text {l}\) satisfies:

$$\begin{aligned} i_\text {r}=f(\textbf{v}_\text {l},\textbf{v}_\text {r}) \frac{\textbf{n} \cdot \textbf{v}_\text {l}}{\Vert \textbf{O}_\text {l} - \textbf{P}\Vert ^2} \kappa . \end{aligned}$$
(2)
Fig. 1

Helmholtz Stereopsis (HS) acquisition principle. A reciprocal pair of images is taken by swapping camera and light positions (\(\textbf{O}_\text {l}\) and \(\textbf{O}_\text {r}\)). The \(\hbox {BRDF}\) at a surface point (\(\textbf{P}\)) is invariant in the two scenarios, allowing its normal (\(\textbf{n}\)) to be obtained when at least three reciprocal pairs from different positions are considered

Helmholtz reciprocity states that the reflectance measured at point \(\textbf{P}\) remains unchanged when the positions of cameras and light source are interchanged, i.e. \(f(\textbf{v}_\text {r},\textbf{v}_\text {l}) = f(\textbf{v}_\text {l},\textbf{v}_\text {r})\), from which the following BRDF-agnostic constraint is obtained:

$$\begin{aligned} \bigg ( i_\text {l} \frac{\textbf{v}_\text {l}}{\Vert \textbf{O}_\text {l}-\textbf{P}\Vert ^2} - i_\text {r} \frac{\textbf{v}_\text {r}}{\Vert \textbf{O}_\text {r}-\textbf{P}\Vert ^2}\bigg ) \cdot \textbf{n} = \textbf{w} \cdot \textbf{n} = 0. \end{aligned}$$
(3)

With three or more camera pairs, both the surface location and its normal can be recovered. Stacking all vectors \(\textbf{w}\) into a matrix \(\textbf{W}\) and performing \(\hbox {Singular Value Decomposition (SVD)}\), the ratio of the second and third singular values \(\sigma _2\) and \(\sigma _3\) can be used to define a measure of the coplanarity of the \(\textbf{w}\) vectors which, in turn, provides a means to identify surface points. Once a surface point has been identified, the singular vector associated with the smallest singular value provides an estimate of its normal.
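
To make this concrete, the following minimal NumPy sketch (our illustration under assumed data layouts and function names, not the authors' code) builds the matrix \(\textbf{W}\) for a candidate point and extracts both the saliency ratio \(\sigma _2/\sigma _3\) and the normal estimate via SVD:

```python
import numpy as np

def helmholtz_constraint_matrix(P, pairs):
    """Stack one w vector (Eq. 3) per reciprocal pair for a candidate point P.

    `pairs` holds (O_l, O_r, i_l, i_r) tuples: the two camera/light centres of
    a reciprocal pair and the intensities P projects to in the two images.
    """
    rows = []
    for O_l, O_r, i_l, i_r in pairs:
        d_l, d_r = np.linalg.norm(O_l - P), np.linalg.norm(O_r - P)
        v_l, v_r = (O_l - P) / d_l, (O_r - P) / d_r
        rows.append(i_l * v_l / d_l**2 - i_r * v_r / d_r**2)
    return np.stack(rows)

def saliency_and_normal(W):
    """SVD-based coplanarity measure sigma2/sigma3 and normal estimate."""
    _, s, Vt = np.linalg.svd(W)
    saliency = s[1] / max(s[2], 1e-12)  # large when the w vectors are coplanar
    normal = Vt[2]                      # right singular vector of sigma3: W @ n ~ 0
    return saliency, normal
```

At true surface points the \(\textbf{w}\) vectors span (at most) a plane, so \(\sigma _3\) collapses and the ratio becomes large, while the null direction of \(\textbf{W}\) aligns with the normal.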

To relax the requirement of having at least three camera/light pairs, binocular formulations (Tu & Mendonca, 2003; Zickler et al., 2003) were subsequently introduced, employing a single pair of reciprocal images for reconstruction. Both works utilise a partial differential equation to compute the surface depth along each epipolar line, obtaining a set of solutions that are then disambiguated with further optimisation. In Guillemaut et al. (2004) the authors extend the classes of surfaces that can be reconstructed to strongly textured and rough ones by performing HS over image patches instead of single pixels. Janko et al. (2004) propose a radiometric calibration approach to account for variations in pixel sensitivity and non-isotropic illumination. In Zickler (2006), specular highlights, which have precise correspondences across reciprocal pairs, are used to perform radiometric and geometric calibration without the need to acquire additional images. Finally, in Guillemaut et al. (2008) a different normal error measure is proposed in the form of the radiometric distance function.

Fig. 2

Pipeline for the fused view-dependent HS approaches. The methods are initialised using a Visual Hull (VH) obtained from silhouettes of the input images. Multiple partial reconstructions are then obtained using either orthographic sampling (VDo approach) or perspective sampling (VDp approach). The partial surfaces are then integrated using confidence scores and Poisson surface reconstruction

An alternative to maximum likelihood for classic HS was proposed in Roubtsova and Guillemaut (2014a, 2014b, 2017, 2018). They present a maximum a posteriori formulation for both classic HS (Roubtsova & Guillemaut, 2014a, 2018) and colour HS (Roubtsova & Guillemaut, 2014b, 2017). In the latter, wavelength multiplexing is used to extend HS to dynamic surfaces. All the aforementioned techniques are applied to 2.5D surfaces and do not handle visibility or occlusions directly. Furthermore, they all use virtual orthographic cameras to perform the reconstruction, which does not fully leverage the available segmentation information and can cause artefacts when the methods attempt to estimate depth at pixels where the object’s surface may not be visible.

In Weinmann et al. (2012), HS is used in conjunction with a structured light approach as a refinement step to perform full 3D reconstruction. The technique is employed on areas of the surface where fine details are present, while the low-frequency shape of the object is obtained using structured light. A complex setup consisting of a light dome is used, making this method difficult to reproduce and constrained to a limited set of scenes. HS was first used on its own to perform full 3D reconstruction in Delaunoy et al. (2010), where gradient descent is proposed to perform the optimisation. The faces of the initial surface are iteratively moved towards a lower energy solution. Despite obtaining convincing results, the use of gradient descent makes the method prone to local minima and does not guarantee a globally optimal solution.

To date, most HS approaches have been limited to performing 2.5D reconstructions. In contrast, this paper introduces several methods to achieve full 3D reconstruction of scenes with arbitrary unknown reflectance, exploring different strategies to improve performance. This paper presents our previous work from Addari and Guillemaut (2019a, 2020, 2019b) in a unified form and extends it in several ways. Firstly, it advances our earlier formulation by generalising the fused view-dependent approach to full perspective, enabling us to leverage segmentation information and improve modelling accuracy. This generalisation is shown to translate into a significant improvement in performance compared to our previous orthographic formulation. We also introduce several confidence metrics to improve the fusion of view-dependent reconstructions. Secondly, the experimental evaluation has been substantially expanded in terms of the number of datasets considered, the depth of the analysis, which now incorporates a quantitative evaluation for each dataset, and additional benchmarking results on the DiLiGenT-MV dataset. The proposed approaches are evaluated against a wide range of approaches including two \(\hbox {MVS}\) methods, one neural radiance field-based approach and two \(\hbox {MVPS}\) methods. The final contribution is the release of datasets to support further research in the field and allow benchmarking of algorithms for reconstruction of scenes with complex surface reflectance.

3 Methodology

3.1 Fused View-Dependent Helmholtz Stereopsis

In this first approach, partial reconstructions are performed from multiple viewpoints around the object and then fused to obtain a full 3D model (see illustration of the pipeline in Fig. 2). Different approaches are possible depending on the type and placement of the viewpoints used to perform the partial reconstructions. In particular, two variants are considered. The first one consists of using virtual orthographic cameras. In practice, six orthographic views are considered, aligned with the three world axes, with two directions per axis. This provides an intuitive way of sampling the scene volume, allowing full coverage of the object from a minimal number of reconstructions. The second variant is based on using perspective views. In this case, each partial reconstruction can be performed directly from the viewpoint of one of the input cameras used to acquire the scene. This second approach presents the benefit of being able to leverage segmentation information during the reconstruction process by ensuring that all pixels for which a depth is estimated correspond to the object (something that is not feasible in the case of an orthographic reconstruction, where no image has been acquired from the reconstruction viewpoint).

More formally, given a set of reciprocal pairs of images \({\mathcal {I}}_0^\text {l}, {\mathcal {I}}_0^\text {r}, \dots , {\mathcal {I}}_{n-1}^\text {l}, {\mathcal {I}}_{n-1}^\text {r}\) and a camera from which reconstruction is performed, a 2-dimensional grid \({\mathcal {G}}\) is created over the image plane, where each node directly corresponds to a pixel in the frame of the camera used for reconstruction. Each node, or pixel, will be assigned a label indicating at what depth the surface is found when backprojecting the image point. The depth labels are denoted by \(\{d_0,\dots ,d_{N-1}\}\), where \(d_0\) and \(d_{N-1}\) correspond to the closest and furthest points respectively.

To estimate the depth at which the surface is located at each pixel of the camera used for reconstruction, an \(\hbox {MRF}\) is constructed. Each node corresponds to an entry in the grid and an energy function, containing a data term to express the surface occupancy likelihood and a smoothness term to regularise the solution across all nodes, is defined. The energy function is formulated as follows:

$$\begin{aligned} E(\textbf{d}) = (1 - \alpha )\sum _{\textbf{p} \in {\mathcal {G}}} D_\text {2D}(B(\textbf{p},d_\textbf{p})) + \alpha \sum _{(\textbf{p}, \textbf{q}) \in {\mathcal {N}}_\text {2D}} S_\text {2D}(B(\textbf{p},d_\textbf{p}),B(\textbf{q},d_\textbf{q})), \end{aligned}$$
(4)

where \(\alpha \) is a balancing parameter between the data and smoothness terms, \({\mathcal {G}}\) is the 2-dimensional grid defined by the virtual camera, and \(D_\text {2D}(B(\textbf{p},d_\textbf{p}))\) is the data term evaluated at the 3D point \(B(\textbf{p},d_\textbf{p})\), obtained by backprojecting image point \(\textbf{p}\) at the depth corresponding to label \(d_\textbf{p}\). \({\mathcal {N}}_\text {2D}\) is the set of interacting nodes defined by a 4-connected neighbourhood in the image grid, and \(S_\text {2D}(B(\textbf{p},d_\textbf{p}),B(\textbf{q},d_\textbf{q}))\) is the smoothness term, which corresponds to the normal consistency term between 3D points \(B(\textbf{p},d_\textbf{p})\) and \(B(\textbf{q},d_\textbf{q})\).

In this approach, the data term is computed as:

$$\begin{aligned} D_\text {2D}(\textbf{P}) = {\left\{ \begin{array}{ll} 1 &{} \text {if} \; |\text {vis}(\textbf{P}) |< n_\text {vis}, \\ e^{- \mu \times \frac{\sigma _2(\textbf{P})}{\sigma _3(\textbf{P})}} &{} \text {otherwise}, \\ \end{array}\right. } \end{aligned}$$
(5)

where \(\text {vis}(\textbf{P})\) indicates the set of reciprocal pairs of cameras from which point \(\textbf{P}\) is visible, \(n_\text {vis}\) is a threshold representing the minimum number of reciprocal pairs of cameras for reliable normal estimation, \(\mu \) is set to \(0.2 \ln (2)\) to replicate the weight used in Roubtsova and Guillemaut (2018), and \(\sigma _2\) and \(\sigma _3\) are the second and third singular values of the matrix \(\textbf{W}\) obtained by stacking the \(\textbf{w}\) vectors from Eq. 3. To prevent potential numerical instability when \(\sigma _3\) becomes numerically close to zero, a very small \(\epsilon \) value can be added to the denominator of the exponent in (5).
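
A hedged sketch of this data term, reusing the constraint matrix from the earlier snippet (function and parameter names are our own):

```python
import numpy as np

MU = 0.2 * np.log(2)  # mu from Roubtsova and Guillemaut (2018)

def data_term_2d(W, n_visible, n_vis=3, eps=1e-12):
    """Data term of Eq. (5): unit penalty when too few reciprocal pairs see
    the point, otherwise an exponential in the saliency ratio sigma2/sigma3.
    `W` is the stacked constraint matrix from the earlier sketch."""
    if n_visible < n_vis:
        return 1.0
    s = np.linalg.svd(W, compute_uv=False)
    return float(np.exp(-MU * s[1] / (s[2] + eps)))
```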

An important contribution, to enable application to complex 3D scenes, is the introduction of the visibility term. The first criterion to determine visibility is to only consider the cameras whose axes stand at an angle smaller than \(80^\circ \) with respect to the virtual camera axis. Then, occlusions are computed by approximating each point’s visibility based on the visibility at its closest point on the surface of the \(\hbox {VH}\). If an intersection is found between the \(\hbox {VH}\) and the segment connecting the camera centre to the approximated point, the camera and its reciprocal are considered to be occluded and therefore are not used, as shown in Fig. 3.

Fig. 3

A point \(\textbf{P}\) on the surface is approximated as its closest point \(\textbf{P}'\) on the \(\hbox {VH}\) before occlusions are taken into consideration for visibility computation

Fig. 4

Illustration of the computation of \(\delta _{\textbf{P},\textbf{Q}}\) used to define the smoothness term. Given two neighbouring nodes’ depth estimates (\(d_\textbf{p}\) and \(d_\textbf{q}\)), \(\delta _{\textbf{P},\textbf{Q}}\) is computed as the distance between \(d_\textbf{p}\) and the estimated surface position at the same node based on \(d_\textbf{q}\) and its estimated normal \(\textbf{n}(\textbf{Q})\). \(\textbf{z}_{\text {cam}}\) denotes the unit vector defining the virtual camera axis, oriented such that it points towards the camera

Fig. 5

Pipeline for the volumetric approach to full 3D optimisation. The \(\hbox {VH}\) is used for initialisation. Then an orthographic grid is used to compute the probability of voxel occupancy for the surface. The optimisation is then performed using a Markov Random Field (MRF) to obtain the final representation of the object

The smoothness function used here is the distance-based DN prior (Roubtsova & Guillemaut, 2018), which enforces a smooth surface that is consistent with the normals obtained through HS. This term is calculated as follows:

$$\begin{aligned} S_\text {2D}(\textbf{P},\textbf{Q}) = {\left\{ \begin{array}{ll} \frac{1}{2} (\delta _{\textbf{P},\textbf{Q}}^2 + \delta _{\textbf{Q},\textbf{P}}^2) &{} \text {if} \; \delta _{\textbf{P},\textbf{Q}} \; \text {and} \; \delta _{\textbf{Q},\textbf{P}} < t, \\ t^2 &{} \text {otherwise}, \\ \end{array}\right. } \end{aligned}$$
(6)

where t is the truncation threshold for \(\delta _{\textbf{P},\textbf{Q}}\) and \(\delta _{\textbf{Q},\textbf{P}}\). \(\delta _{\textbf{P},\textbf{Q}}\) is the distance between point \(\textbf{P}\) and the projection of \(\textbf{Q}\), perpendicular to its estimated normal, onto the viewing ray of the pixel where \(\textbf{P}\) lies, as illustrated in Fig. 4. It is calculated as follows:

$$\begin{aligned} \delta _{\textbf{P},\textbf{Q}} = \frac{ |\textbf{PQ} \cdot \textbf{n}(\textbf{Q}) |}{ \textbf{n}(\textbf{Q}) \cdot \textbf{z}_{\text {cam}} }, \end{aligned}$$
(7)

where \(\textbf{PQ}\) is the vector connecting \(\textbf{P}\) and \(\textbf{Q}\), \(\textbf{n}(\textbf{Q})\) indicates the estimated normal at point \(\textbf{Q}\) and \(\textbf{z}_{\text {cam}}\) defines the virtual camera axis. We adopt the convention that \(\textbf{z}_{\text {cam}}\) is oriented towards the camera. Consequently, \(\textbf{n}(\textbf{Q}) \cdot \textbf{z}_{\text {cam}}\) is guaranteed to be positive for any visible surface point. Whenever \(\delta _{\textbf{P},\textbf{Q}}\) or \(\delta _{\textbf{Q},\textbf{P}}\) is greater than a threshold t, dependent on the reconstruction resolution, this term is truncated to \(t^2\) in order to avoid heavy penalties where a strong discontinuity is present on the surface.
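
The following sketch mirrors Eqs. 6 and 7 directly (names are assumptions; in the orthographic variant \(\textbf{z}_{\text {cam}}\) is simply the grid axis):

```python
import numpy as np

def delta(P, Q, n_Q, z_cam):
    """Depth disparity of Eq. (7): how far P lies from the tangent plane at Q,
    measured along the (virtual) camera axis z_cam."""
    return abs(np.dot(Q - P, n_Q)) / np.dot(n_Q, z_cam)

def smoothness_2d(P, Q, n_P, n_Q, z_cam, t):
    """Truncated normal-consistency prior of Eq. (6)."""
    d_pq, d_qp = delta(P, Q, n_Q, z_cam), delta(Q, P, n_P, z_cam)
    if d_pq < t and d_qp < t:
        return 0.5 * (d_pq**2 + d_qp**2)
    return t**2
```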

The energy function is then minimised using Sequential Tree-Reweighted Message Passing (TRW-S) (Kolmogorov, 2015) to obtain the depth maps from each viewing direction. This optimisation framework was chosen because it is able to perform optimisation in a multi-label \(\hbox {MRF}\) while giving strong optimality guarantees. The partial surface reconstructions are then fused together using Poisson surface reconstruction (Kazhdan et al., 2006). This process may lead to inconsistencies across overlapping areas of the partial surfaces, therefore a confidence score is employed to weight the different samples based on their reliability during the fusion process. During the energy function computation of the view-dependent methods, two confidence scores are computed for each point to be used during the surface integration process.

The first confidence score is derived from the saliency and defined from Eq. 5 as follows:

$$\begin{aligned} C_{s}(\textbf{P}) = 1 - D_\text {2D}(\textbf{P}), \end{aligned}$$
(8)

where \(\textbf{P}\) is the selected point after the optimisation process. The second confidence score is derived from the angle between the viewing direction and the estimated normal and is defined as:

$$\begin{aligned} C_{\textbf{n}}(\textbf{P}) = \textbf{n}(\textbf{P}) \cdot \textbf{z}_{\text {cam}}, \end{aligned}$$
(9)

where \(\textbf{n}(\textbf{P})\) is the estimated normal at 3D point \(\textbf{P}\) and \(\textbf{z}_{\text {cam}}\) is the camera axis of the chosen view, orthographic or perspective depending on the reconstruction method used. The overall confidence score is obtained by multiplying the two confidence scores, that is:

$$\begin{aligned} C_{s\textbf{n}}(\textbf{P}) = C_{s}(\textbf{P}) C_{\textbf{n}}(\textbf{P}). \end{aligned}$$
(10)

Utilising these confidence scores allows ambiguities to be resolved between points that are in close proximity but stem from different views. In particular, the saliency score may be low for specific camera combinations due to object occlusions or poor visibility, while \(C_{\textbf{n}}\) can detect a decline in normal estimation precision at the edges of a reconstructed view or when the surface is heavily slanted with respect to the camera axis. In these scenarios the fusion algorithm may favour a different view with higher confidence scores to increase the overall reconstruction accuracy.
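
As a minimal illustration, the combined score of Eq. 10 can be computed directly from the optimised data term value and the normal estimate (the function name is our own):

```python
import numpy as np

def fusion_confidence(data_term_value, n_P, z_cam):
    """Combined per-point confidence of Eq. (10): the saliency confidence
    of Eq. (8) multiplied by the normal/view agreement of Eq. (9)."""
    return (1.0 - data_term_value) * float(np.dot(n_P, z_cam))
```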

3.2 Volumetric Full 3D Helmholtz Stereopsis

This second method performs direct optimisation over a voxel grid, where each voxel is labelled as outside the surface, inside it, or containing a section of the surface itself. The method enforces coherency between neighbouring voxels over the whole surface, presenting a strong advantage with respect to the fusion of multiple partial 2.5D reconstructions, where the optimisation is performed separately for each partial reconstruction prior to merging. After dividing the volume into a regular voxel grid, a multi-label \(\hbox {MRF}\) is constructed, where each node corresponds to a voxel and its labels indicate whether the voxel is occupied by the surface or found inside or outside the object. Weights are then assigned to each voxel depending on its occupancy probability, together with a regularising term based on the HS-estimated normals across neighbouring nodes. The optimisation is performed using a modified version of \(\hbox {Iterative Conditional Modes (ICM)}\) (Besag, 1986), where the labelling is changed iteratively to obtain the lowest score possible across local node clusters. The result is finally integrated using Poisson surface reconstruction to obtain the full 3D object representation. See Fig. 5 for an overview of the pipeline.

A 3D orthographic grid that encompasses the whole object is first created and each voxel is assigned a node in a multi-label \(\hbox {MRF}\) graph. The label set used is the following: \(\{I,O,L_0,\ldots ,L_{N-1}\}\), where I and O indicate respectively whether the voxel is found inside or outside the reconstructed surface, while the remaining labels are assigned when the surface crosses the voxel. The sampling strategy consists of subdividing the voxel into equally sized subvoxels, where the centre of each subvoxel corresponds to a label between \(L_0\) and \(L_{N-1}\). In this paper, each voxel is subdivided into 27 subvoxels, subdividing by 3 along each dimension, as in the sketch below. Multiple surface labels are used to generate enough variation in the orientation of the segments connecting neighbouring surface samples, which makes it possible to exploit the normals estimated through HS. Using a single surface label would not allow the normals to be fully exploited during the energy function regularisation, as the segments connecting the voxels would only be sampled at \(45^\circ \) steps. It must be noted that sampling the voxel regularly means that the surface may cross a voxel at multiple labels, in which case the algorithm will prioritise the label that achieves the lower weight for the overall solution.
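
A small sketch of the subvoxel sampling described above (assuming an axis-aligned cubic voxel; names are ours):

```python
import itertools
import numpy as np

def surface_label_positions(voxel_centre, voxel_size, splits=3):
    """3D positions of the surface labels L_0..L_{N-1} for one voxel: the
    centres of the splits**3 equal subvoxels (27 when splits=3)."""
    offsets = (np.arange(splits) - (splits - 1) / 2) * (voxel_size / splits)
    return [voxel_centre + np.array(o)
            for o in itertools.product(offsets, repeat=3)]
```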

Performing a full 3D optimisation requires improving visibility estimation with respect to how it was approximated in the view-dependent approaches. To do so, a probabilistic approach is presented here, which is then applied in the methods outlined in this section. A first selection of cameras is performed by approximating the chosen point by its closest neighbour on the initialisation surface and computing occlusions for said point. A further selection is then performed on these cameras by finding the k pairs that have the highest likelihood of producing coplanar \(\textbf{w}\) vectors. The parameter k may be chosen to be 3, since this is the minimum number of camera pairs needed to perform HS. The method consists of iterating over all possible combinations of k camera pairs and computing the matrix \(\textbf{W}\) obtained by stacking the corresponding \(\textbf{w}\) vectors from Eq. 3. The subset that satisfies the following equation is then selected:

$$\begin{aligned} \{c_0,\dots ,c_{k-1}\} = \mathop {\mathrm {arg\,max}}\limits _{\{c_0,\dots ,c_{k-1}\}} \frac{\sigma _2}{\sigma _3}, \end{aligned}$$
(11)

where \(c_0\) to \(c_{k-1}\) indicate the selected camera pairs and \(\sigma _2\) and \(\sigma _3\) are respectively the second and third singular values of the obtained matrix.

It must be noted that during the visibility computation it is unknown whether the chosen point is actually on the surface or not. This approach maximises the probability of computing the correct visibility for points on the surface. In case of points outside or inside the object, it is expected that the agreement of the camera pairs will generally be low. An added benefit of selecting a fixed number of cameras to perform the HS calculations is that it provides saliency scores that are consistent across the whole surface, making the optimisation process more robust.
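
A sketch of this selection strategy, reusing `helmholtz_constraint_matrix` from the earlier snippet (an illustration of the criterion in Eq. 11, not the authors' code):

```python
import itertools
import numpy as np

def select_camera_pairs(P, unoccluded_pairs, k=3):
    """Among the unoccluded reciprocal pairs, keep the subset of k pairs whose
    stacked w vectors are most coplanar, i.e. with the largest sigma2/sigma3
    ratio (Eq. 11)."""
    best_subset, best_ratio = None, -np.inf
    for subset in itertools.combinations(unoccluded_pairs, k):
        s = np.linalg.svd(helmholtz_constraint_matrix(P, subset),
                          compute_uv=False)
        ratio = s[1] / max(s[2], 1e-12)
        if ratio > best_ratio:
            best_subset, best_ratio = subset, ratio
    return best_subset
```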

Now that the graph construction, visibility handling and labelling strategy have been established, the energy function used for the \(\hbox {MRF}\) optimisation is defined as:

$$\begin{aligned} E(\textbf{L}) = (1 - \beta )\sum _{\textbf{v} \in {\mathcal {V}}} D_\text {3D}(\textbf{v},L_\textbf{v}) + \beta \sum _{(\textbf{v},\textbf{w}) \in {\mathcal {N}}_\text {3D}} S_\text {3D}(\textbf{v},L_\textbf{v},\textbf{w},L_\textbf{w}), \end{aligned}$$
(12)

where \(D_\text {3D}\) and \(S_\text {3D}\) are respectively the data and smoothness terms, \(\beta \) is a weight to balance their effects, \({\mathcal {V}}\) indicates the volume in which the 3D grid is constructed and \({\mathcal {N}}_\text {3D}\) is the set of interacting nodes defined by a 6-connected neighbourhood in the 3D grid.

The data term is based on the HS saliency, as in the methods previously presented: the higher the ratio between the singular values \(\sigma _2\) and \(\sigma _3\) of matrix \(\textbf{W}\), the higher the probability of the point being located on the surface. Unlike the previous methods, however, the additional inside and outside labels must be considered in this approach, and the full data term is computed as follows:

$$\begin{aligned} D_\text {3D}(\textbf{v},L_\textbf{v}) ={\left\{ \begin{array}{ll} 0 &{} \text {if} \; L_\textbf{v} \in \{I,O\}, \\ 1 &{} \text {if} \; L_\textbf{v} \not \in \{I,O\} \; \text {and} \; |\text {vis}(M(\textbf{v},L_\textbf{v})) |< n_\text {vis}, \\ e^{- \mu \times \frac{\sigma _2(M(\textbf{v},L_\textbf{v}))}{\sigma _3(M(\textbf{v},L_\textbf{v}))}} &{} \text {otherwise}, \\ \end{array}\right. } \end{aligned}$$
(13)

where \(M(\textbf{v},L_\textbf{v})\) indicates the 3D position of the surface point at node \(\textbf{v}\) when assigned label \(L_\textbf{v}\) and \(\text {vis}(\textbf{P})\) represents the set of camera pairs from which point \(\textbf{P}\) is visible. Usually \(n_\text {vis}\) is set to 5, which is greater than the minimum number of camera pairs required in HS, to avoid geometric ambiguities. Whenever a voxel is assigned a surface label (\(L_0,\dots ,L_{N-1}\)) and is visible from a sufficient number of cameras, the assigned weight is computed as in the data term of the view-dependent methods, shown in Eq. 5.

The smoothness term is used as a regularising constraint to ensure that neighbouring voxels have coherent normals and that there is always a surface voxel between an outside and inside voxel pair, avoiding holes in the final surface. It is computed as follows:

$$\begin{aligned} S_\text {3D}(\textbf{v},L_\textbf{v},\textbf{w},L_\textbf{w}) ={\left\{ \begin{array}{ll} \Gamma (M(\textbf{v},L_\textbf{v}),M(\textbf{w},L_\textbf{w})) &{} \text {if} \; L_\textbf{v}, L_\textbf{w} \in \{L_0,\dots ,L_{N-1}\}, \\ \infty &{} \text {if} \; L_\textbf{v}, L_\textbf{w} \in \{I,O\} \; \text {and} \; L_\textbf{v} \ne L_\textbf{w}, \\ 0 &{} \text {otherwise}, \\ \end{array}\right. } \end{aligned}$$
(14)

where

$$\begin{aligned} \Gamma (\textbf{V},\textbf{W}) ={\left\{ \begin{array}{ll} \frac{1}{2}(\gamma _{\textbf{V},\textbf{W}}^2+\gamma _{\textbf{W},\textbf{V}}^2) &{} \text {if} \; \gamma _{\textbf{V},\textbf{W}} \; \text {and} \; \gamma _{\textbf{W},\textbf{V}} < t, \\ t^2 &{} \text {otherwise}, \\ \end{array}\right. } \end{aligned}$$
(15)

represents the normal consistency between points \(\textbf{V}\) and \(\textbf{W}\). t is used as a truncation term for \(\gamma _{\textbf{V},\textbf{W}}\) which is defined as follows:

$$\begin{aligned} \gamma _{\textbf{V},\textbf{W}} = |\textbf{VW} \cdot \textbf{n}(\textbf{W}) |, \end{aligned}$$
(16)

where \(\textbf{VW}\) is the vector connecting points \(\textbf{V}\) and \(\textbf{W}\), while \(\textbf{n}(\textbf{W})\) indicates the unit normal estimated via HS at point \(\textbf{W}\). This term represents the distance between \(\textbf{V}\) and the plane perpendicular to \(\textbf{n}(\textbf{W})\) passing through point \(\textbf{W}\). Figure 6 illustrates how this term is calculated. The \(\infty \) term is used to constrain inside and outside voxels to be separated by surface voxels, thus avoiding an empty solution where all nodes are labelled as either inside or outside. The truncation is performed on the regularisation term to avoid heavy penalties where a corner may result in two neighbouring voxels having severely different normals.
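
The pairwise cost of Eqs. 14-16 can be sketched as follows (encoding labels as 'I', 'O' or a surface sample index is our own convention for illustration):

```python
import numpy as np

INF = float("inf")

def smoothness_3d(L_v, L_w, V, W, n_V, n_W, t):
    """Pairwise term of Eqs. (14)-(16): truncated normal consistency between
    two surface voxels, an infinite cost for adjacent inside/outside voxels,
    and zero otherwise. V and W are the sampled positions M(v, L_v) and
    M(w, L_w)."""
    if L_v not in ("I", "O") and L_w not in ("I", "O"):
        g_vw = abs(np.dot(W - V, n_W))        # Eq. (16)
        g_wv = abs(np.dot(V - W, n_V))
        if g_vw < t and g_wv < t:
            return 0.5 * (g_vw**2 + g_wv**2)  # Eq. (15)
        return t**2
    if {L_v, L_w} == {"I", "O"}:
        return INF                            # no hole between inside/outside
    return 0.0
```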

Fig. 6

Illustration of the computation of \(\gamma _{\textbf{V},\textbf{W}}\). The term compares the position assigned to node \(\textbf{V}\) with its predicted position based on the label of \(\textbf{W}\) and its estimated normal \(\textbf{n}(\textbf{W})\)

Once the graph has been initialised, optimisation is performed using a tailored version of \(\hbox {ICM}\) (Besag, 1986). \(\hbox {ICM}\) is an exhaustive search algorithm that iterates through an \(\hbox {MRF}\) graph and changes one variable at a time, trying to optimise its local neighbourhood cost. In its classic formulation, \(\hbox {ICM}\) would not work in this scenario because of the constraint on the surface: changing the label of a surface node to either outside or inside would create a hole in the surface, which is prevented by the infinite weight assigned when outside and inside voxels are neighbours. However, by changing two neighbouring variables at a time and considering all surrounding nodes when computing the cost change, the surface can be shifted closer to its optimal solution through multiple iterations, as sketched below. Only tuples where one node is on the current surface of the reconstruction are considered at each iteration. To compute their local neighbourhood cost, all possible configurations of said tuple are considered together with their neighbours, selecting the solution with the lowest energy. If the problem is initialised close to the actual surface, this step typically converges after a small number of iterations. During experimentation, \(\hbox {TRW-S}\) was also tried for this optimisation; however, due to the high number of nodes and low number of labels, it obtained poor results, while \(\hbox {ICM}\) proved suitable in this specific scenario.
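
A schematic sketch of this pairwise ICM variant (the data structures and the `energy_of` callback are assumptions made for illustration, not the authors' implementation):

```python
import itertools

def icm_pair_sweep(labels, adjacency, candidate_labels, energy_of):
    """One sweep of the tailored pairwise ICM move: for each edge touching the
    current surface, exhaustively try joint relabellings of the two nodes and
    keep the assignment minimising the energy of their combined neighbourhood.
    `energy_of(nodes, labels)` is assumed to return the sum of the data terms
    at `nodes` plus all pairwise terms incident to them."""
    changed = False
    for u in labels:
        for v in adjacency[u]:
            if labels[u] in ("I", "O") and labels[v] in ("I", "O"):
                continue  # only move tuples with a node on the current surface
            scope = {u, v} | set(adjacency[u]) | set(adjacency[v])
            best_pair = (labels[u], labels[v])
            best = energy_of(scope, labels)
            for l_u, l_v in itertools.product(candidate_labels, repeat=2):
                labels[u], labels[v] = l_u, l_v
                e = energy_of(scope, labels)
                if e < best:
                    best, best_pair, changed = e, (l_u, l_v), True
            labels[u], labels[v] = best_pair  # keep the best joint assignment
    return changed
```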

Finally, the nodes labelled to be on the surface are extracted together with their Helmholtz estimated normals and integrated using Poisson surface reconstruction to obtain a mesh representation.

Fig. 7

Pipeline overview for the mesh-based approach. Two initialisation surfaces that contain the object’s boundary are used to construct an \(\hbox {MRF}\) graph where each node corresponds to a pair of points on the surfaces and its labels are points sampled between them. The output surface is obtained by performing global optimisation on the graph and connecting the nodes’ estimated positions

3.3 Mesh-Based Full 3D Helmholtz Stereopsis

The key idea of this third approach is to introduce a single-step method to perform full 3D reconstruction from a coarse initialisation, providing a mechanism to perform global optimisation and recover a solution with strong optimality properties through the use of state-of-the-art \(\hbox {MRF}\) solvers. This method offers some major advantages with respect to the volumetric approach, which is much more computationally expensive, by providing a mesh-based optimisation that seeks to find the optimal 3D positions of the vertices in the solution space. The target surface is obtained through a Maximum a Posteriori (MAP) approach using an \(\hbox {MRF}\) graph.

The 3D reconstruction is performed through the pipeline illustrated in Fig. 7. To initialise the method and identify the search space over which the optimisation is performed, two surfaces are defined. The first one corresponds to the outer boundary of the solution space and must completely encompass the object. For instance, the \(\hbox {VH}\) of the object, or an accordingly dilated approximation of a previous solution obtained from a different technique, could be used. The second surface must instead lie completely inside the object while maintaining a topology similar to the outer surface; this can be achieved by carving the outer surface. Non-rigid \(\hbox {Iterative Closest Point (ICP)}\) (Audenaert et al., 2019) is used to draw correspondences between the two surfaces, allowing key features to be matched despite the difference in scale between the surfaces. The result is a dense matching between the two surfaces, where a point of the target surface is necessarily found between each pair of registered vertices.

These correspondences are used to construct a multi-label \(\hbox {MRF}\) graph. Each pair of corresponding vertices between the two surfaces is assigned to a node, and its neighbours are established according to the surface topology. In practice, the surface is represented as a 3D mesh in which each mesh edge corresponds to a graph edge, defining the neighbourhood of the two bound vertices. Each node is then assigned a set of labels \(\{L_0,L_1,\dots ,L_{N-1}\}\), where each label indicates a 3D point on the segment connecting the two surfaces at the node’s corresponding vertices. In particular, \(L_0\) coincides with the vertex on the outer surface, \(L_{N-1}\) with the corresponding vertex on the inner surface, and all the intermediate labels are spaced regularly between them. In the remainder of this section, the energy function used to perform the optimisation is detailed.
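
As a minimal sketch, the candidate position of a node under each label is a linear interpolation between its registered outer and inner vertices (names are assumptions):

```python
import numpy as np

def label_positions(outer_vertex, inner_vertex, n_labels):
    """Candidate 3D positions X(v, L_0..L_{N-1}) for one node: N regularly
    spaced samples on the segment from the outer-surface vertex (L_0) to its
    registered inner-surface vertex (L_{N-1})."""
    return [(1 - t) * outer_vertex + t * inner_vertex
            for t in np.linspace(0.0, 1.0, n_labels)]
```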

To reconstruct the target surface, an energy function is defined as follows:

$$\begin{aligned} E(\textbf{L}) = (1 - \alpha )\sum _{\textbf{v} \in {\mathcal {M}}} D_\text {M}(X(\textbf{v},L_\textbf{v})) + \alpha \sum _{(\textbf{v},\textbf{w}) \in {\mathcal {N}}_\text {M}} S_\text {M}(X(\textbf{v},L_\textbf{v}),X(\textbf{w},L_\textbf{w})), \end{aligned}$$
(17)

where \(\textbf{L}\) indicates the labels assigned across the entire set of nodes, \(\alpha \) is a weighting factor to balance the effect of data and smoothness terms and \({\mathcal {N}}_\text {M}\) is the set of interacting nodes defined by the connectivity of the mesh representing the optimised surface \({\mathcal {M}}\). \(D_\text {M}\) and \(S_\text {M}\) are respectively the data and smoothness terms, calculated for all the nodes and edges of the graph, while the operator \(X(\textbf{v},L_\textbf{v})\) is used to identify the resulting position when node \(\textbf{v}\) is assigned the label \(L_\textbf{v}\).

The data term is based on the HS saliency measure and computed in a similar fashion to Eq. 5 whenever the point is deemed visible by at least a certain number of cameras:

$$\begin{aligned} D_\text {M}(\textbf{P}) = {\left\{ \begin{array}{ll} 1 &{} \text {if} \; |\text {vis}(\textbf{P}) |< n_\text {vis}, \\ e^{- \mu \times \frac{\sigma _2(\textbf{P})}{\sigma _3(\textbf{P})}} &{} \text {otherwise}, \\ \end{array}\right. } \end{aligned}$$
(18)

where \(\sigma _2\) and \(\sigma _3\) indicate the second and third singular values of the matrix \(\textbf{W}\) of stacked \(\textbf{w}\) vectors, and \(\text {vis}(\textbf{P})\) is the set of camera pairs from which point \(\textbf{P}\) is visible. Points that are not visible from enough cameras are given a fixed weight that still allows points affected by self-occlusions to be reconstructed, in accordance with their neighbouring points, which may be visible.

The smoothness term serves as a regularising weight to ensure the surface is smooth and consistent with the photometric normals calculated through HS. It is based on a depth disparity measure, here referred to as \(\delta (\textbf{V}, \textbf{W})\), calculated between pairs of neighbouring nodes. \(\delta (\textbf{V}, \textbf{W})\) represents the distance between a point and its predicted position based on the estimated normal of its neighbour. It is calculated as follows:

$$\begin{aligned} \delta (\textbf{V}, \textbf{W}) = \frac{\textbf{VW} \cdot \textbf{n}(\textbf{W})}{\textbf{n}(\textbf{W}) \cdot \textbf{s}(\textbf{V})}, \end{aligned}$$
(19)

where \(\textbf{VW}\) indicates the vector connecting the two points and \(\textbf{s}(\textbf{V})\) is a unit vector representing the direction of the segment that connects the inner and outer surfaces at the node corresponding to point \(\textbf{V}\). The disparity error is computed as the distance between point \(\textbf{V}\) and the projection of \(\textbf{W}\), perpendicular to its estimated normal \(\textbf{n}(\textbf{W})\), onto said segment. Figure 8 illustrates the definition of the smoothness term.
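
A one-function sketch of Eq. 19 (names are ours):

```python
import numpy as np

def delta_mesh(V, W, n_W, s_V):
    """Depth disparity of Eq. (19): like Eq. (7), but measured along the
    outer-to-inner search direction s(V) instead of a camera axis."""
    return np.dot(W - V, n_W) / np.dot(n_W, s_V)
```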

Fig. 8

Illustration of the computation of the smoothness term. The point \(\textbf{W}_{L_y}\) is projected perpendicularly to its estimated normal towards the segment \(\textbf{V}_i\textbf{V}_o\) and the error is measured as the distance between the projection and point \(\textbf{V}_{L_x}\)

This term generalises the depth disparity measure presented in Eq. 7 to perspective sampling in full 3D. Moreover, the error measure presented here is more discriminative than the one used in Eq. 14, which is tied to the chosen voxel size and in which strong discontinuities do not result in a considerable error. In contrast, the proposed distance penalises more heavily depth and normal assignments that are inconsistent between neighbouring nodes, as it is not bound by the volume sampling resolution.

The smoothness term is then computed as the average of the squared disparity terms for the two neighbours and it is truncated at a threshold of \(t^2\):

$$\begin{aligned} S_\text {M}(\textbf{V},\textbf{W}) ={\left\{ \begin{array}{ll} \frac{1}{2} (\delta (\textbf{V}, \textbf{W})^2 + \delta (\textbf{W}, \textbf{V})^2) &{} \text {if} \; \delta (\textbf{V}, \textbf{W}) \; \text {and} \; \delta (\textbf{W}, \textbf{V}) < t, \\ t^2 &{} \text {otherwise}. \\ \end{array}\right. } \end{aligned}$$
(20)

The threshold allows for natural surface discontinuities and also handles cases where occlusions prevent HS from producing an estimated normal.

The final aspect of this methodology is the technique used to perform the final optimisation. The energy function chosen to represent the problem violates the submodularity constraint, and non-submodular functions cannot be properly minimised by classic graph-cut approaches, as indicated in Kolmogorov and Rother (2007). However, many techniques exist to approximate the solution of a non-submodular function with a high degree of confidence. The approach used here is \(\hbox {Tree-Reweighted Message Passing (TRW)}\) (Wainwright et al., 2005) in its more recent formulation called \(\hbox {TRW-S}\) (Kolmogorov, 2006, 2015), which, contrary to \(\hbox {TRW}\), guarantees that the energy lower bound does not decrease during optimisation and introduces the condition of weak tree agreement to identify local maxima of the energy bound.

After the final solution is obtained, it is not necessary to perform surface integration, since the initialisation surface topology is maintained in the obtained result. It was however found beneficial to perform some integration using Poisson surface reconstruction on the obtained vertices, as an extra regularisation step to leverage the normals obtained using HS. The extra integration step is also able to smooth some artefacts caused by minor differences which may exist between the topology of the outer and inner surfaces.

4 Experimental Evaluation

This section performs a comparative evaluation of the proposed approaches using both synthetic and real datasets. The methods evaluated will be referred to hereafter as: VH for \(\hbox {SfS}\); VDo for orthographic fused view-dependent HS (Sect. 3.1); VDp for perspective fused view-dependent HS (Sect. 3.1); 3DHSv for volumetric full 3D HS (Sect. 3.2) and 3DHSm for mesh-based full 3D HS (Sect. 3.3). An additional method referred to as 3DHSc is also introduced. This uses a volumetric graph-cut approach (Vogiatzis et al., 2007) which we adapted to HS for comparison purposes. The method estimates the voxel occupancy in a grid by performing graph-cut on a binary-label \(\hbox {MRF}\). The data term is a constant ballooning term applied to outside voxels to avoid an empty solution. The smoothness term is computed at the edges of neighbouring voxels and is based on the HS saliency measure, similarly to the data terms proposed in the previous methods. The proposed approaches are also compared against state-of-the-art \(\hbox {MVS}\), \(\hbox {MVPS}\) and neural radiance field-based techniques in the case of real scenes. An ablation study analysing the effects of the proposed confidence scores on the performance of the view-dependent approaches (VDo and VDp) is also included in Appendix A.

Fig. 9

Example images from the Armadillo and Bunny datasets

Fig. 10

Illustration of the different image noise levels applied to the synthetic dataset: no noise (left), Gaussian noise with standard deviation 0.01% (middle) and 0.1% (right). The inset images show magnifications of the portion inside the square

Fig. 11

Results including error maps obtained on the synthetic scenes Armadillo and Bunny with 0.01% noise level

4.1 Evaluation on Synthetic Scenes

A novel synthetic dataset was generated using the Stanford Bunny (Turk & Levoy, 1994) and the Armadillo (Krishnamurthy & Levoy, 1996). To ensure the synthesised images are physically plausible, and in particular satisfy Helmholtz reciprocity, images were rendered using the modified Phong reflectance model (Lewis, 1994). The model combines a diffuse and a specular component, with the BRDF defined as:

$$\begin{aligned} f(k_d, k_s) = k_d \frac{1}{\pi } + k_s \frac{\frac{1}{r}+2}{2 \pi } (\textbf{h} \cdot \textbf{n})^{\frac{1}{r}}, \end{aligned}$$
(21)

where \(k_d\) and \(k_s\) respectively represent the diffuse and specular coefficients. The specular component further depends on the surface roughness r and the angle between the normal \(\textbf{n}\) at the observed surface point and the vector \(\textbf{h}\) bisecting the incoming and outgoing light directions. In each case, 40 reciprocal pairs of images were generated by sampling viewpoints on a sphere of radius 600 mm with the object located at the centre of the sphere. Armadillo and Bunny are 151 mm and 153 mm tall respectively. The camera used to render the views had a horizontal field of view of \(40^\circ \), equivalent to a focal length of 2638 pixels. Images were rendered at a resolution of \(1920 \times 1080\). Different datasets were generated by adding Gaussian noise at three different levels, with standard deviations of 0%, 0.01% and 0.1% of the full 16-bit image range. Example images in the case of the 0.01% noise level are shown in Fig. 9 for each object. Figure 10 shows a close-up of an image corrupted at the different noise levels in the case of the Bunny dataset. Ground truth models are shown in the first column of Fig. 11.
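
The reciprocity of this model is straightforward to verify numerically: since the half vector \(\textbf{h}\) is symmetric in the two directions, swapping them leaves Eq. 21 unchanged. A minimal sketch (the coefficient values are arbitrary illustrative choices):

```python
import numpy as np

def modified_phong_brdf(k_d, k_s, r, v_in, v_out, n):
    """Evaluate the BRDF of Eq. (21). v_in/v_out are unit vectors towards the
    light and the camera; h is the normalised half vector, so swapping the two
    directions leaves the value unchanged (Helmholtz reciprocity)."""
    h = v_in + v_out
    h = h / np.linalg.norm(h)
    return k_d / np.pi + k_s * (1.0 / r + 2.0) / (2.0 * np.pi) \
        * np.dot(h, n) ** (1.0 / r)

# Reciprocity check with arbitrary unit vectors
v1 = np.array([0.0, 0.6, 0.8])
v2 = np.array([0.6, 0.0, 0.8])
n = np.array([0.0, 0.0, 1.0])
assert np.isclose(modified_phong_brdf(0.5, 0.3, 0.1, v1, v2, n),
                  modified_phong_brdf(0.5, 0.3, 0.1, v2, v1, n))
```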

Fig. 12: Examples of partial reconstructions for the Armadillo and Bunny datasets produced using the orthographic and perspective view-dependent approaches. It must be noted that the reconstructed portions differ slightly between the two methods due to the different viewpoints and image formation models they use

Fig. 13: Geometric accuracy, normal accuracy and completeness graphs for the synthetic scenes Armadillo and Bunny

Fig. 14: Error maps of the results obtained on the Armadillo and Bunny datasets at different levels of noise for the proposed methods

Figure 11 shows the reconstructions for the different methods and their respective error maps in the case of the intermediate noise level (standard deviation 0.01%). Figure 12 also shows an example of intermediate view-dependent reconstruction for each object in the case of the VDo and VDp methods. As can be observed, the VH method performs poorly, especially with regard to reconstructing concavities. In comparison, 3DHSc achieves better results due to the use of HS for normal estimation, but its results still lack surface detail. This shows the limitations of relying solely on HS saliency and a ballooning term for energy regularisation. VDo achieves better results by introducing a tailored regularisation term enforcing the consistency of the depth and normal estimates. However, artefacts are present on the surface due to the fusion of the separately computed partial surfaces, which exhibit some imperfections due to the use of orthographic cameras. While this is somewhat mitigated by the use of Poisson surface reconstruction and the confidence scores, some fine details present on the surfaces are lost in the process. VDp obtains notably better results, presenting considerably fewer artefacts and achieving a faithful reconstruction of the object. The improvements from switching to a perspective grid with VDp are clearly visible in the partial reconstructions shown in Fig. 12, and ultimately yield better final reconstructions after fusion into complete models compared to their orthographic counterparts obtained using VDo. Results obtained using 3DHSv overcome some of the limitations of the fused view-dependent approaches by introducing a volumetric optimisation step. However, ICM is a local iterative scheme and is unable to fully retrieve some of the minute concavities present on the surfaces. Furthermore, implementing the method as a voxel grid ties its results to the chosen resolution, which is limited by the computational resources available since memory consumption grows cubically with resolution in a vanilla voxel implementation. Experiments reported here were run on a server with 60 GB of memory. These limitations could be alleviated through the use of an octree implementation; however, this was not pursued in our implementation. In contrast, 3DHSm is able to retrieve fine surface details such as the shell grooves of the Armadillo and the facial features of the Bunny.
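To clarify why ICM can miss fine concavities, the sketch below gives a minimal greedy ICM for a binary occupancy MRF: each voxel is updated locally given its current neighbours, so the scheme can only descend to a local minimum of the energy. The Potts smoothness term and 6-connectivity are simplifying assumptions. The memory issue mentioned above is also easy to see: a dense grid at resolution \(1024^3\) with a single 4-byte value per voxel already requires roughly 4.3 GB.

```python
# Hedged sketch of ICM for a binary occupancy MRF; the energy terms and
# neighbourhood are simplified assumptions, not the paper's exact model.
import numpy as np

def icm_binary(data_cost, lam=1.0, n_iters=10):
    """data_cost: (D, H, W, 2) per-voxel cost of labels 0/1.
    Smoothness: Potts penalty `lam` per disagreeing 6-neighbour pair."""
    labels = np.argmin(data_cost, axis=-1)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    D, H, W = labels.shape
    for _ in range(n_iters):
        changed = False
        for z in range(D):
            for y in range(H):
                for x in range(W):
                    cost = data_cost[z, y, x].copy()
                    for dz, dy, dx in offsets:
                        zz, yy, xx = z + dz, y + dy, x + dx
                        if 0 <= zz < D and 0 <= yy < H and 0 <= xx < W:
                            nb = labels[zz, yy, xx]
                            cost[1 - nb] += lam   # penalise disagreement
                    new = int(np.argmin(cost))
                    changed |= new != labels[z, y, x]
                    labels[z, y, x] = new
        if not changed:
            break                                  # local minimum reached
    return labels
```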

Table 1 Results obtained on the synthetic scenes using all methods under different noise levels

A quantitative evaluation is performed by computing the Middlebury geometric accuracy, normal accuracy and completeness scores (Seitz et al., 2006) for all methods. These are shown in Fig. 13. The graphs show how the geometric and normal accuracies vary for each method as a wider percentage of the surface is taken into consideration (shown on the horizontal axis), while for completeness a varying distance threshold is considered. The graphs confirm that VH and 3DHSc produce poor results, highlighting the inability of the first approach to reconstruct concavities and the lack of an effective regularising term in the second. It can then be observed that VDo, VDp, 3DHSv and 3DHSm significantly outperform the other approaches, achieving markedly better geometric accuracy, normal accuracy and completeness on both objects. VDo performs slightly worse on the Bunny dataset, but overall still obtains submillimetre accuracy and above 90% completeness at thresholds below a millimetre. It can be noted that VDp achieves better results than 3DHSv, despite being based on fusing multiple partial reconstructions. This is likely due to the resolution limitation in the 3DHSv experiments mentioned previously. In contrast, 3DHSm obtains the best results thanks to the global optimisation performed directly on the mesh vertices.
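A minimal sketch of the metric computation is given below, assuming point samples from both surfaces and SciPy; the percentile and threshold values follow those used in the text.

```python
# Hedged sketch of Middlebury-style accuracy and completeness metrics.
import numpy as np
from scipy.spatial import cKDTree

def accuracy(recon_pts, gt_pts, pct=90.0):
    """Distance d such that pct% of reconstructed points lie within d
    of the ground truth (geometric accuracy)."""
    d, _ = cKDTree(gt_pts).query(recon_pts)
    return np.percentile(d, pct)

def completeness(recon_pts, gt_pts, thresh=0.5):
    """Fraction of ground-truth points within `thresh` (here in mm) of
    the reconstruction."""
    d, _ = cKDTree(recon_pts).query(gt_pts)
    return np.mean(d <= thresh)
```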

Fig. 15: Acquisition setup consisting of a pair of cameras fitted with external flashes mounted on each camera lens

Fig. 16: Timing graph showing how synchronisation between the camera pair is handled to capture both the HS image and the direct flash image with consistent exposure

Fig. 17: Example images for the Fox, Corgi, Llama, Bee, Duck and Giraffe datasets

Fig. 18: Results obtained on the real datasets Fox, Corgi, Llama, Bee, Duck and Giraffe for the different methods

Fig. 19: Results obtained on the real datasets Fish and Dragon for the different methods

Fig. 20: Examples of partial reconstructions obtained using the orthographic and perspective view-dependent methods on the Fox, Corgi, Llama, Bee, Duck, Giraffe, Fish and Dragon datasets. Reconstructed portions differ slightly between the approaches due to different viewpoints and image formation models

Fig. 21: Error maps for the results obtained on the real datasets Fox, Corgi and Llama for the different methods

Fig. 22: Geometric accuracy, normal accuracy and completeness graphs for the real datasets Fox, Corgi and Llama for the different HS methods

4.2 Analysis of Robustness to Image Noise

To assess robustness, the different approaches are evaluated on the datasets generated with three different noise levels, with standard deviation of 0%, 0.01% and 0.1% of the full 16-bit image range. Figure 14 shows the error maps on the fully reconstructed objects with respect to the ground truth at those noise levels. Table 1 shows the geometric accuracy, normal accuracy and completeness for each noise level at the typical 90% accuracy and 0.5 mm completeness thresholds. Overall, only a small deterioration can be noticed in the results when noise is introduced. Areas such as the shell, ears and neck of the Armadillo and the concavities between the legs of the Bunny show the main differences between the presented approaches. As can be observed, for each method the errors are not significantly increased by the introduction of noise, which demonstrates the robustness of the different approaches proposed. This can be explained by the use of window-based averaging to compute the data terms, and by the regularisation approaches, both of which mitigate the effects of noise.
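For reference, a minimal sketch of the noise model assumed above: zero-mean Gaussian noise whose standard deviation is expressed as a percentage of the full 16-bit range, with the result clipped back to valid values.

```python
# Hedged sketch of the synthetic noise corruption used in the study.
import numpy as np

def add_noise(img_u16, pct):
    """pct = 0.01 means sigma is 0.01% of the 65535-level range."""
    sigma = pct / 100.0 * 65535.0
    noisy = img_u16.astype(np.float64) \
        + np.random.normal(0.0, sigma, img_u16.shape)
    return np.clip(noisy, 0, 65535).astype(np.uint16)
```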

4.3 Evaluation on Real Scenes

Two real datasets are used for evaluation: a novel dataset released with this paper and the Dragon and Fish scenes from Delaunoy et al. (2010). The new dataset was acquired using a pair of Canon EOS 5DS cameras equipped with Canon Macro Ring Lite MR-14 EX II flashes, which are external macro flashes mounted around each lens to approximate camera/light collocation. The flash on a given lens is synchronised with the other camera, allowing acquisition of a reciprocal pair by triggering each camera separately. Moreover, synchronisation of the two cameras allows acquisition of an extra pair of images in which the object is lit directly by the flash mounted on the capturing camera, which can be used for segmentation. Figures 15 and 16 show respectively the acquisition setup and the timing of the left and right cameras and flashes during the acquisition of a reciprocal pair. This provides a flexible setup for HS capture, which can either remain static with the object placed on a turntable, or be moved around the object. The former configuration was used to acquire this dataset. The dataset is acquired at a resolution of \(2928 \times 4368\) and comprises six objects: Fox, Corgi, Llama, Bee, Duck and Giraffe (see Fig. 17). These present several types of materials with varying BRDFs and are for the most part untextured, making them difficult to reconstruct with traditional methods without prior knowledge of their reflectance properties. Duck also exhibits some subsurface scattering. For each object, 20 image pairs were captured. The cameras were positioned at about 1.5 m from the object and each pair had a baseline of about 60 cm. To perform the capture, the objects were placed on a rotating table, with each image pair taken after a rotation of \(18^\circ \) from the previous pair.
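A minimal sketch of the nominal capture geometry described above (camera distance, baseline and turntable step) is given below; working in an object-centred frame, rotating the turntable by \(18^\circ \) is modelled as rotating the camera pair by the same amount around the object. The planar layout and exact values are idealisations.

```python
# Hedged sketch of the nominal turntable geometry: 20 reciprocal pairs,
# cameras ~1.5 m from the object with a ~60 cm baseline.
import numpy as np

def turntable_positions(n_pairs=20, dist=1.5, baseline=0.6):
    half = np.arcsin((baseline / 2) / dist)   # half-angle of the baseline
    positions = []
    for i in range(n_pairs):
        theta = np.radians(18.0 * i)          # turntable step per pair
        for s in (-half, +half):              # left and right cameras
            a = theta + s
            positions.append(dist * np.array([np.cos(a), np.sin(a), 0.0]))
    return np.array(positions)                 # (40, 3) camera centres
```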

Figures 18 and 19 show the results obtained on the real scenes for each approach. Further, Fig. 20 shows examples of partial reconstructions for the view-dependent methods. As can be observed, the VH reconstructions are coarse. The 3DHSc method also produces low-quality results due to the poor regularisation it relies on. The VDo results are somewhat noisy due to the fusion of multiple, sometimes incoherent, 2.5D surfaces. This can be observed especially at the base of the tail of the Fox and Fish or in the nose area of the Corgi. The perspective approach VDp significantly improves the results with respect to the orthographic method, thanks to the possibility of leveraging segmentation information during the reconstruction. The use of perspective cameras has the further advantage of reconstructing views that were captured directly in the input images, reducing possible effects from occlusions; this benefit can be observed in the Dragon and Fish datasets, where segmentation is not possible due to the different setup used for capture. The artefacts produced by the fusion are partially mitigated in the 3DHSv results; however, due to the resolution used, some finer details are lost. Finally, the results from 3DHSm present an accurate reproduction of minute object details, such as the ears of the Corgi, the ornaments on the neck of the Llama or the patterns on the neck of the Giraffe. In the Fish scene, the small cavity inside the eye is maintained, as well as the features on the side of the tail fin and the two small concavities behind the dorsal fin. The high-frequency details of the scales on the side of the object are also faithfully reconstructed, while they are overly smoothed by the previous approaches. In the Dragon scene, the scale pattern on the side is accurately reconstructed and more detail can be seen on the face compared to the other approaches.

We also perform a quantitative evaluation based on laser scans of Fox, Corgi and Llama, which are the only three fully rigid objects suitable for use as ground truth. The other three objects could not be used as they were either made of deformable materials (rubber in the case of Duck and Giraffe) or contained moving parts (wheels and axles in the case of Bee). Due to their specular nature, the objects had to be spray-coated with talc powder to enable laser scanning. The ground truth meshes were then manually registered to the set of images, followed by ICP alignment against the reconstruction of each approach, in order to compute the geometric accuracy, normal accuracy and completeness metrics. Figure 21 shows the error maps for each method. The geometric accuracy, normal accuracy and completeness results are reported in Fig. 22 and Table 2. Results indicate good performance overall, with the different methods achieving close to millimetre accuracy and high completeness. Not surprisingly, the VDo approach performs the worst of the four proposed HS methods. The VDp approach was found to be the best performer overall, followed by 3DHSv and then 3DHSm. It may be that the volumetric and mesh-based approaches are more sensitive to calibration errors than the view-dependent approach, which would explain their lower performance on this dataset.
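A minimal sketch of the alignment step is shown below, assuming Open3D point clouds sampled from the ground truth scan and each reconstruction; the correspondence distance and identity initialisation (after manual registration) are our assumptions.

```python
# Hedged sketch: refining the manual registration with point-to-point ICP
# before computing the error metrics.
import numpy as np
import open3d as o3d

def align_to_ground_truth(recon_pcd, gt_pcd, max_corr=2.0):
    result = o3d.pipelines.registration.registration_icp(
        recon_pcd, gt_pcd, max_corr, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    recon_pcd.transform(result.transformation)  # apply the refined pose
    return recon_pcd
```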

Finally, a run-time analysis is performed to benchmark the different methods. All experiments were performed on the same server with 16 cores and no GPU. Results are reported in Table 3. The view-dependent techniques are the slowest, requiring optimisation to be performed separately for each viewpoint. While this allows some parallelisation, it also generates redundant processing, particularly when a large number of viewpoints is considered. The mesh-based approach is the fastest, while at the same time having a considerably lower memory footprint than the volumetric approach. None of the approaches have been optimised for speed and they all run entirely on the CPU. It is likely that run times could be further reduced with some optimisation, including porting some of the operations to the GPU.

Table 2 Results obtained on the real datasets Fox, Corgi and Llama
Table 3 Runtime comparison for the different methods benchmarked on the Bee dataset

4.4 Comparison Against Multi-view Reconstruction Approaches

Fig. 23: Results on the DiLiGenT-MV datasets Bear, Buddha, Cow, Pot2 and Reading

The proposed approach is compared against three 3D reconstruction techniques (one classical, two learning-based) on the real dataset released with this paper. The first approach, referred to as COLMAP, is based on Schönberger and Frahm (2016) and Schönberger et al. (2016). It is a classical approach that jointly estimates depth and normal information by performing pixelwise view selection using photometric and geometric priors, and then minimising a multi-view geometric consistency term. This technique is one of the top performers on the Middlebury datasets. The second approach, referred to as CasMVSNet, is a deep learning approach combining MVSNet (Yao et al., 2018, 2019) with a cascade cost volume (Gu et al., 2020) to achieve higher-resolution outputs during reconstruction. The third approach, referred to as NeuS, is a recent learning-based approach that combines a neural scene representation with a volume rendering technique (Wang et al., 2021).

Table 4 Results on the DiLiGenT-MV datasets Bear, Buddha, Cow, Pot2 and Reading

To ensure a fair comparison, all the objects in the dataset were re-acquired with the same number of views and similar camera placement as for the HS dataset. For this multi-view dataset, constant ambient illumination was used instead of flash illumination, and a camera was moved around the object instead of rotating the object on a turntable. This keeps the illumination unchanged throughout the acquisition process, ensuring the objects are captured under optimal conditions for these methods. The results obtained on the real scenes are shown in Fig. 18, with error maps shown in Fig. 21. Figure 22 and Table 2 show the geometric accuracy, normal accuracy and completeness results. Note that, for a fair comparison, prior to computing the error maps and error metrics for these methods, the reconstructions were truncated to remove any protrusions located below the ground plane level.

As can be observed, COLMAP and CasMVSNet often produce poor reconstructions of the objects. The most notable examples for COLMAP are the Fox and Llama scenes, where untextured regions on the objects result in holes. This stems from the point clouds being rather sparse and containing gaps, which then lead to poor reconstructions once the points are fused using Poisson surface reconstruction. The only exception is the Giraffe which, while untextured, presents a Lambertian surface. Similarly, CasMVSNet produces poor reconstructions in the case of the Fox, Duck, Llama and Giraffe, obtaining very few point correspondences. This is likely due to the network not being trained on non-Lambertian scenes, or to the lack of textured surfaces. Although less dramatically affected, the other objects show very little detail and are noisy in places, ultimately resulting in worse results than those obtained using our proposed approaches. In contrast, NeuS was able to produce clean reconstructions, although these tend to smooth out some of the detail. The quantitative analysis confirms that NeuS is the top performer amongst the multi-view reconstruction approaches considered and comes close to the proposed methods.

4.5 Comparison Against Multi-View Photometric Stereo Approaches

The proposed approach is also evaluated on the DiLiGenT-MV dataset from Li et al. (2020). This dataset is tailored for reconstruction using MVPS approaches. It consists of five objects (Bear, Buddha, Cow, Pot2 and Reading), each captured from 20 different viewpoints under 96 different illumination conditions, i.e. a total of 1920 images per object. To test our approaches, we used a subset of the dataset consisting of 20 pairs of images that were found to approximately satisfy the Helmholtz reciprocity constraint. More specifically, we defined these 20 pairs by considering all pairs of adjacent cameras, with View i lit by LED 20 paired with View \(i+1\) lit by LED 68, in order to most closely approximate a reciprocal configuration. As such, our approaches are only able to use a fraction of the original dataset (approximately 2%).
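The selected subset can be indexed as in the short sketch below; the wrap-around pairing of the last view with the first is our assumption.

```python
# Hedged sketch of the reciprocal pair selection: View i under LED 20
# paired with View i+1 under LED 68.
N_VIEWS = 20
pairs = [((i, 20), ((i + 1) % N_VIEWS, 68)) for i in range(N_VIEWS)]
# 20 pairs = 40 images out of 20 x 96 = 1920, i.e. roughly 2% of the data
print(len(pairs), 40 / 1920)
```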

We evaluate our two proposed view-dependent approaches VDo and VDp as well as the volumetric and mesh-based methods 3DHSv and 3DHSm. Performance is also compared against two MVPS approaches: the first from Park et al. (2016), referred to hereafter as Park, and the second from Li et al. (2020), referred to as Li. Results for these two approaches are based on the mesh reconstructions provided as part of the DiLiGenT-MV dataset. The reconstructed meshes for the different methods together with their corresponding error maps are shown in Fig. 23. Table 4 shows the geometric accuracy, normal accuracy and completeness metrics.

The proposed approaches based on Helmholtz reciprocity achieve good reconstruction results, generally recovering the geometric detail of the objects well, even in the presence of complex topologies such as the Buddha and Pot2. Reconstruction errors are most prominent in the case of Reading, which contains concavities causing inter-reflections that violate the reciprocity assumption. Amongst the four variants, VDo performs the worst. In contrast, VDp performs particularly well, outperforming the other three proposed approaches on all objects except Bear, for which 3DHSm is the top performer. This trend is confirmed by a qualitative analysis of the results, which shows that VDp and 3DHSm produce more realistic results than VDo and 3DHSv, which suffer from noisier reconstructions.

When compared against the MVPS approaches, our proposed approaches are found to perform less well on this dataset. This is not surprising considering that they only utilise about 2% of the data the MVPS methods have access to. It is worth noting that our proposed approaches still achieve performance reasonably close to that of Li, and even outperform the earlier method of Park. These results demonstrate the potential of the proposed approach as an alternative paradigm for modelling scenes from multi-view data under different illumination conditions. However, further evaluation is necessary to better understand how it compares against MVPS approaches under more controlled evaluation conditions where each class of methods has access to the same amount of data.

5 Conclusions and Future Work

This paper introduced a family of BRDF-agnostic approaches for full 3D reconstruction. The first approach is based on fusing a set of orthographic or perspective view-dependent reconstructions from viewpoints distributed around the object. The second approach instead casts the problem as a volumetric optimisation seeking the optimal surface location within a voxel grid. The third approach uses a mesh-based formulation optimising vertex positions for a given mesh topology. Further, the paper contributes novel datasets to allow future benchmarking of full 3D HS approaches. An extensive experimental evaluation demonstrates that the proposed approaches are able to accurately model objects with complex materials, achieving sub-millimetre accuracy on the synthetic scenes and exhibiting robustness to image noise. A comparison against multi-view reconstruction techniques shows how the proposed HS approaches improve reconstruction quality on challenging non-Lambertian, low-texture objects where MVS approaches typically perform poorly. The proposed approaches were also compared against MVPS approaches and found to achieve good quality results while using only a fraction of the number of images.

Table 5 Results on synthetic scenes using the view-dependent methods with different confidence scores: no confidence (denoted by no conf.), saliency-driven confidence only (denoted by \(C_s\)), surface normal-driven confidence only (denoted by \(C_n\)) and the complete confidence score combining saliency-driven and surface normal-driven scores (denoted by \(C_{sn}\))

The proposed approaches all suffer from the following limitations. Firstly, they are unable to handle scenes with significant subsurface scattering, as these cannot be described by a BRDF, which requires light to enter and leave the surface at the same point. In practice, we have observed that the approach is still able to reconstruct scenes exhibiting small amounts of subsurface scattering, such as the Duck scene in our dataset. Secondly, performance degrades in the presence of inter-reflections, particularly in concavities where these are likely to be significant, as in the Reading scene from the DiLiGenT-MV dataset. Finally, the approach usually fails in the presence of highly reflective (mirror-like) surfaces due to the limited dynamic range of the sensors used during acquisition; this may be alleviated by making use of high dynamic range imaging.

An interesting avenue for future work would be to extend the approach to enable capture in less controlled environments such as outdoor settings. This could in principle be achieved by capturing an additional image under ambient illumination only, which could then be factored out of the images acquired using the light sources for Helmholtz reciprocal pair acquisition. Another avenue for future work would be to extend the approach to dynamic scenes. To date, use of HS for dynamic scenes has been limited to 2.5D reconstruction. The extension to full 3D could be made possible via multi-spectral imaging using a larger number of frequency bands, or using temporal multiplexing. Finally, another direction would be to explore the use of deep neural networks. This may prove beneficial in overcoming the remaining limitations related to subsurface scattering and inter-reflections, which are currently difficult to model explicitly.
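As a sketch of the ambient-factoring idea, under the assumption of a linear (radiometrically calibrated) sensor response and fixed exposure between the two captures:

```python
# Hedged sketch: removing ambient illumination by subtracting an extra
# flash-off exposure from each flash image.
import numpy as np

def remove_ambient(flash_img, ambient_img):
    """Both images as linear float arrays with identical exposure."""
    return np.clip(flash_img - ambient_img, 0.0, None)
```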