1 Introduction

The wide variety of images around us are the outcome of interactions between lighting, shapes and materials. In recent years, the advent of convolutional neural networks (CNNs) has led to significant advances in recovering shape using just a single image [9, 12]. In contrast, material estimation has not seen as much progress, which might be attributed to multiple causes. First, material properties can be more complex. Even discounting more complex global illumination effects, materials are represented by a spatially-varying bidirectional reflectance distribution function (SVBRDF), which is an unknown high-dimensional function that depends on exitant and incident lighting directions [25]. Second, while large-scale synthetic and real datasets have been collected for shape estimation [8, 24], there is a lack of similar data for material estimation. Third, pixel observations in a single image contain entangled information from factors such as shape and lighting, besides material, which makes estimation ill-posed.

In this work, we present a practical material capture method that can recover an SVBRDF from a single image of a near-planar surface, acquired using the camera of an off-the-shelf consumer mobile phone, under unconstrained environment illumination. This is in contrast to conventional BRDF capture setups that usually require significant equipment and expense [11, 21]. We address this challenge by proposing a novel CNN architecture that is specifically designed to account for the physical form of BRDFs and the interaction of light with materials, which leads to a better learning objective. We also propose to use a dataset of SVBRDFs that has been designed for perceptual accuracy of materials. This is in contrast to prior datasets that are limited to homogeneous materials, or conflate material properties with other concepts such as object categories.

Fig. 1.

We propose a deep learning-based light-weight SVBRDF acquisition system. From a single image of a near planar surface captured with a flash-enabled mobile phone camera under arbitrary lighting, our network recovers surface normals and spatially-varying BRDF parameters – diffuse albedo and specular roughness. Rendering the estimated parameters produces an image almost identical to the input image.

We introduce a novel CNN architecture that encodes the input image into a latent representation, which is decoded into components corresponding to surface normals, diffuse texture, and specular roughness. We propose a differentiable rendering layer that recombines the estimated components with a novel lighting direction. This gives us additional supervision from images of the material rendered under arbitrary lighting directions during training; only a single image is used at test time. We also observe that coarse classification of BRDFs into material meta-categories is an easier task, so we additionally include a material classifier to constrain the latent representation. The inferred BRDF parameters from the CNN are quite accurate, but we achieve further improvement using densely-connected conditional random fields (DCRFs) with novel unary and smoothness terms that reflect the properties of the underlying microfacet BRDF model. We train the entire framework in an end-to-end manner.

Our approach – using our novel architecture and SVBRDF dataset – can outperform the state of the art. We demonstrate that we can further improve these results by leveraging a form of acquisition control that is present on virtually every mobile phone – the camera flash. We turn on the flash of the mobile phone camera during acquisition; our images are thus captured under a combination of unknown environment illumination and the flash. The flash illumination helps further improve our reconstructions. First, it minimizes shadows caused by occlusions. Second, it allows better observation of high-frequency specular highlights, which allows better characterization of material type and more accurate estimation. Third, it provides a relatively simple setup for acquisition that eases the burden on estimation and allows the use of better post-processing techniques.

In contrast to recent works such as [2] and [1] that can reconstruct BRDFs with stochastic textures, we can handle a much larger class of materials. Also, our results, both with and without flash, are a significant improvement over the recent method of Li et al. [19] even though our trained model is more compact. Our experiments demonstrate advantages over several baselines and prior works in quantitative comparisons, while also achieving superior qualitative results. In particular, the generalization ability of our network trained on the synthetic BRDF dataset is demonstrated by strong performance on real images, acquired in the wild, in both indoor and outdoor environments, using multiple different phone cameras. Given the estimated BRDF parameters, we also demonstrate applications such as material editing and relighting of novel shapes. To summarize, we propose the following novel contributions:

  • A lightweight method for high quality acquisition of SVBRDF and normal map using a single mobile phone image in an unconstrained environment.

  • A physically-motivated CNN and DCRF framework for joint SVBRDF reconstruction and material classification.

  • Use of a large-scale SVBRDF dataset specifically attuned to complex materials.

2 Related Work

BRDF Acquisition: The bidirectional reflectance distribution function (BRDF) is a 4-D function that characterizes how a surface reflects lighting from an incident direction toward an outgoing direction [25]. Alternatively, BRDFs are represented using low-dimensional parametric models [4, 10, 27, 35]. In this work, we use the physically-based microfacet model [16] that our SVBRDF dataset uses.

Traditional methods for BRDF acquisition rely on densely sampling this 4-D space using expensive, calibrated acquisition systems [11, 21, 22]. Recent work has demonstrated that assuming BRDFs lie in a low-dimensional subspace allows them to be reconstructed from a small set of measurements [26, 37]. However, these measurements still need to be taken under controlled settings. We assume a single image captured under largely uncontrolled settings.

Photometric stereo-based methods recover shape and BRDF from images. Some of these methods recover a homogeneous BRDF given one or both of the shape and illumination [28, 31, 32]. Chandraker et al. [5,6,7] utilize motion cues to jointly recover shape and BRDF from images under known directional illumination. Hui et al. [14] recover SVBRDFs and shape from multiple images under known illuminations. All of those methods require some form of controlled acquisition, while we estimate SVBRDFs and normal maps “in-the-wild”.

Recent work has shown promising results for “in-the-wild” BRDF acquisition. Hui et al. [15] demonstrate that the collocated camera-light setup on mobile devices is sufficient to reconstruct SVBRDFs and normals. They require over 30 calibrated images, while we aim to do the same with a single image. Aittala et al. [2] propose using a flash and no-flash image pair to reconstruct stochastic SVBRDFs and normals using an optimization-based scheme. Our method can handle a larger class of materials and is orders of magnitude faster.

Deep Learning-Based Material Estimation: Inspired by the success of deep learning for a variety of vision and graphics tasks, recent work has considered CNN-based material recognition and estimation. Bell et al. [3] train a material parsing network using crowd-sourced labeled data. However, their material recognition is driven more by object context than by appearance. Liu et al. [20] demonstrate image-based material editing using a network trained to recover homogeneous BRDFs. Methods have been proposed to decompose images into their intrinsic image components, which are an intermediate representation for material and shape [23, 33, 34]. Rematas et al. [29] train a CNN to reconstruct the reflectance map – a convolution of the BRDF with the illumination – from a single image of a shape from a known class. In subsequent work, they disentangle the reflectance map into the BRDF and illumination [13]. Neither of these methods handles SVBRDFs, nor do they recover fine surface normal details. Kim et al. [17] reconstruct a homogeneous BRDF by training a network to aggregate multi-view observations of an object of known shape.

Similar to us, Aittala et al. [1] and Li et al. [19] reconstruct SVBRDFs and surface normals from a single image of a near-planar surface. Aittala et al. use a neural style transfer-based optimization approach to iteratively estimate BRDF parameters; however, they can only handle stationary textures and there is no correspondence between the input image and the reconstructed BRDF [1]. Li et al. use supervised learning to train a CNN to predict SVBRDF and normals from a single image captured under environment illumination [19]. Their training set is small, which necessitates a self-augmentation method to generate training samples from unlabeled real data. Further, they train a different set of networks for each parameter (diffuse texture, normals, specular albedo and roughness) and each material type (wood, metal, plastic). We demonstrate that by using our novel CNN architecture, supervised training on a high-quality dataset and acquisition under flash illumination, we are able to (a) reconstruct all these parameters with a single network, (b) learn a latent representation that also enables material recognition and editing, and (c) obtain results that are significantly better qualitatively and quantitatively.

3 Acquisition Setup and SVBRDF Dataset

In this section, we describe the setup for single image SVBRDF acquisition and the dataset we use for learning.

Setup. Our goal is to reconstruct the spatially-varying BRDF of a near-planar surface from a single image captured by a mobile phone with the flash turned on for illumination. We assume that the z-axis of the camera is approximately perpendicular to the planar surface (we explicitly evaluate against this assumption in our experiments). For most mobile devices, the position of the flash light is usually very close to the position of the camera, which provides us with a univariate sampling of an isotropic BRDF [15]. We argue that by imaging with a collocated camera and point light, we obtain additional constraints that yield better BRDF reconstructions compared to acquisition under just environment illumination.

Our surface appearance is represented by a microfacet parametric BRDF model [16]. Let \(\mathbf {d}_{i}\), \(\mathbf {n}_{i}\), \(r_{i}\) be the diffuse color, normal and roughness, respectively, at pixel i. Our BRDF model is defined as:

$$\begin{aligned} \rho (\mathbf {d}_{i}, \mathbf {n}_{i}, r_{i}) = \mathbf {d}_{i} + \frac{D(\mathbf {h}_{i}, r_{i})F(\mathbf {v}_{i}, \mathbf {h}_{i})G(\mathbf {l}_{i}, \mathbf {v}_{i}, \mathbf {h}_{i}, r_{i})}{4(\mathbf {n}_{i} \cdot \mathbf {l}_{i})(\mathbf {n}_{i}\cdot \mathbf {v}_{i})} \end{aligned}$$

where \(\mathbf {v}_{i}\) and \(\mathbf {l}_{i}\) are the view and light directions and \(\mathbf {h}_{i}\) is the half angle vector. Given an observed image \(I(\mathbf {d}_{i}, \mathbf {n}_{i}, r_{i}, \mathbf {L})\), captured under unknown illumination \(\mathbf {L}\), we wish to recover the parameters \(\mathbf {d}_{i}\), \(\mathbf {n}_{i}\) and \(r_{i}\) for each pixel i in the image. Please refer to the supplementary material for more details on the BRDF model.
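As an illustration, the BRDF model above can be evaluated for a single pixel as in the following minimal sketch. The text defers the exact forms of D, F and G to the supplementary material, so the choices here (a GGX normal distribution with \(\alpha = r^{2}\), Schlick's Fresnel approximation with a hypothetical constant `f0 = 0.05`, and a Smith-Schlick geometry term) are assumptions made only to obtain runnable code, not the definitions used in this work.

```python
import numpy as np

def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def microfacet_brdf(d, n, r, l, v, f0=0.05):
    """Evaluate rho = d + D*F*G / (4 (n.l)(n.v)) for one pixel.

    D, F and G are ASSUMED forms (GGX, Schlick, Smith-Schlick);
    f0 is a hypothetical constant Fresnel reflectance at normal incidence.
    """
    n, l, v = normalize(n), normalize(l), normalize(v)
    h = normalize(l + v)                      # half-angle vector
    n_l = max(np.dot(n, l), 1e-6)             # clamp to avoid division by zero
    n_v = max(np.dot(n, v), 1e-6)
    n_h = max(np.dot(n, h), 0.0)
    v_h = max(np.dot(v, h), 0.0)
    alpha2 = (r * r) ** 2                     # GGX uses alpha = r^2 (assumption)
    D = alpha2 / (np.pi * (n_h ** 2 * (alpha2 - 1.0) + 1.0) ** 2)
    F = f0 + (1.0 - f0) * (1.0 - v_h) ** 5
    k = (r + 1.0) ** 2 / 8.0
    G = (n_l / (n_l * (1.0 - k) + k)) * (n_v / (n_v * (1.0 - k) + k))
    return np.asarray(d, dtype=float) + D * F * G / (4.0 * n_l * n_v)

# Collocated flash and camera looking straight down at a flat pixel.
z = [0.0, 0.0, 1.0]
rho = microfacet_brdf(d=[0.2, 0.2, 0.2], n=z, r=0.5, l=z, v=z)
```

With a collocated light and camera, \(\mathbf {l}_{i} = \mathbf {v}_{i} = \mathbf {h}_{i}\), so the specular term depends only on the angle between the normal and the view direction, which is the univariate sampling mentioned in the setup.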

Dataset. We train our network on the Adobe Stock 3D Material dataset, which contains 688 materials with high resolution (\(4096 \times 4096\)) spatially-varying BRDFs. Part of the dataset is created by artists while others are captured using a scanner. We use 588 materials for training and 100 materials for testing. For data augmentation, we randomly crop 12, 8, 4, 2, 1 image patches of size 512, 1024, 2048, 3072, 4096, respectively. We resize the image patches to a size of \(256 \times 256\) for processing by our network. We flip patches along x and y axes and rotate them in increments of 45\(^\circ \). Thus, for each material type, we have 270 image patches. We randomly scale the diffuse color, normal and roughness for each image patch to prevent the network from overfitting and memorizing the materials. We manually segment the dataset into 8 material types. The distribution is shown in Table 1, with an example visualization of each material type in Fig. 2. More details on rendering the dataset are in the supplementary material.
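The crop schedule described above can be sketched as follows. The function and constant names are hypothetical, and uniform sampling of the crop position is an assumption; only the patch sizes and counts come from the text.

```python
import numpy as np

# Crop schedule from the text: patch size -> number of random crops.
CROPS_PER_SIZE = {512: 12, 1024: 8, 2048: 4, 3072: 2, 4096: 1}
FULL_RES = 4096

def sample_crop_boxes(rng):
    """Return (top, left, size) crop boxes for one 4096 x 4096 material.

    Crop positions are drawn uniformly (an assumption).
    """
    boxes = []
    for size, count in CROPS_PER_SIZE.items():
        for _ in range(count):
            top = int(rng.integers(0, FULL_RES - size + 1))
            left = int(rng.integers(0, FULL_RES - size + 1))
            boxes.append((top, left, size))
    return boxes

boxes = sample_crop_boxes(np.random.default_rng(0))
num_crops = len(boxes)  # 12 + 8 + 4 + 2 + 1
```

The schedule yields 27 crops, so the stated total of 270 patches would correspond to ten flipped or rotated variants per crop; the exact combination of flips and rotations is not spelled out here.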

Fig. 2.

Examples of our material types.

Table 1. Distribution of materials in our training and test sets.

4 Network Design for SVBRDF Estimation

In this section, we describe the components of our CNN designed for single-image SVBRDF estimation. The overall architecture is illustrated in Fig. 3.

Fig. 3.

Our network for SVBRDF estimation consists of an encoder, three decoder blocks with skip links to retrieve SVBRDF components, a rendering layer and a material classifier, followed by a DCRF for refinement (not visualized). See Sect. 4 for how our architectural choices are influenced by the problem structure of SVBRDF estimation and supplementary material for the hyperparameter details.

4.1 Considerations for Network Architecture

Single-image SVBRDF estimation is an ill-posed problem. Thus, we adopt a data-driven approach with a custom-designed CNN that reflects physical intuitions.

Our basic network architecture consists of a single encoder and three decoders which reconstruct the three spatially-varying BRDF parameters: diffuse color \(\mathbf {d}_{i}\), normals \(\mathbf {n}_{i}\) and roughness \(r_{i}\). The intuition behind using a single encoder is that different BRDF parameters are correlated; thus, representations learned for one should be useful to infer the others, which allows a significant reduction in the size of the network. The input to the network is an RGB image, augmented with the pixel coordinates as a fourth channel. We add the pixel coordinates since the distribution of light intensities is closely related to the location of pixels; for instance, the center of the image will usually be much brighter. Since CNNs are spatially invariant, we need the extra signal to let the network learn to behave differently for pixels at different locations. Skip links are added to connect the encoder and decoders to preserve details of BRDF parameters.

Another important consideration is that in order to model global effects over whole images like light intensity fall-off or large areas of specular highlights, it is necessary for the network to have a large receptive field. To this end, our encoder network has seven convolutional layers of stride 2, so that the receptive field of every output pixel covers the entire image.
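A short calculation illustrates why seven stride-2 layers suffice for a \(256 \times 256\) input. The kernel size is not stated in this section (hyperparameters are in the supplementary material), so the \(4 \times 4\) kernel used below is an assumption, as is "same"-style padding.

```python
def receptive_field(kernel_size, strides):
    """Receptive field (in input pixels) of one output unit of a conv stack."""
    rf, jump = 1, 1
    for s in strides:
        rf += (kernel_size - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= s                       # spacing between adjacent units, in input pixels
    return rf

def output_size(input_size, strides):
    """Spatial size after the stack, assuming 'same'-style padding."""
    size = input_size
    for s in strides:
        size = -(-size // s)            # ceiling division
    return size

strides = [2] * 7
rf_k4 = receptive_field(4, strides)     # assumed 4x4 kernels
rf_k3 = receptive_field(3, strides)     # for comparison: 3x3 kernels
bottleneck = output_size(256, strides)
```

Under these assumptions the receptive field is 382 pixels with \(4 \times 4\) kernels, which exceeds the 256-pixel input, whereas \(3 \times 3\) kernels would give 255 pixels, just short of full coverage. The bottleneck feature map is \(2 \times 2\).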

4.2 Loss Functions for SVBRDF Estimation

For each BRDF parameter, we have an L2 loss for direct supervision. We now describe other losses for learning a good representation for SVBRDF estimation.

Rendering Layer. Since our eventual goal is to model the surface appearance, it is important to balance the contributions of different BRDF parameters. Therefore, we introduce a differentiable rendering layer that renders our BRDF model (Eq. 1) under the known input lighting. We add a reconstruction loss based on the difference between these renderings with the predicted parameters and renderings with ground-truth BRDF parameters. The gradient can be backpropagated through the rendering layer to train the network. In addition to rendering the image under the input lighting, we also render images under novel lights. For each batch, we create novel lights by randomly sampling the point light source on the upper hemisphere. This ensures that the network does not overfit to collocated illumination and is able to reproduce appearance under other light conditions. The final loss function for the encoder-decoder part of our network is:

$$\begin{aligned} \mathcal {L} = \lambda _{d}\mathcal {L}_{d} + \lambda _{n}\mathcal {L}_{n} + \lambda _{r}\mathcal {L}_{r} + \lambda _{rec}\mathcal {L}_{rec} , \end{aligned}$$

where \(\mathcal {L}_{d}\), \(\mathcal {L}_{n}\), \(\mathcal {L}_{r}\) and \(\mathcal {L}_{rec}\) are the L2 losses for diffuse, normal, roughness and rendered image predictions, respectively. Here, \(\lambda \)’s are positive coefficients to balance the contributions of various terms, which are set to 1 in our experiments.
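The combined objective and the sampling of novel lights can be sketched as below. This is only a NumPy illustration of the forward computation; an actual training setup would use an automatic differentiation framework so that gradients flow through the rendering layer. The dictionary keys and function names are hypothetical, and uniform sampling over the hemisphere is an assumption, since the text only says the light is sampled randomly on the upper hemisphere.

```python
import numpy as np

def sample_upper_hemisphere(rng):
    """Sample a unit light direction with z >= 0, uniform in area (assumption)."""
    z = rng.uniform(0.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * np.pi)
    s = np.sqrt(1.0 - z * z)
    return np.array([s * np.cos(phi), s * np.sin(phi), z])

def l2(pred, target):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def total_loss(pred, gt, lambdas=None):
    """Weighted sum of L2 losses on diffuse, normal, roughness and renderings.

    pred / gt: dicts with keys 'd', 'n', 'r', 'rec' (hypothetical layout).
    All lambdas default to 1, as stated in the text.
    """
    lambdas = lambdas or {"d": 1.0, "n": 1.0, "r": 1.0, "rec": 1.0}
    return sum(lambdas[k] * l2(pred[k], gt[k]) for k in lambdas)

light_dir = sample_upper_hemisphere(np.random.default_rng(0))
```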

Since we train on near-planar surfaces, the majority of the normal directions are flat. Table 2 shows the normal distributions in our dataset. To prevent the network from over-smoothing the normals, we group the normal directions into different bins and for each bin we assign a different weight when computing the L2 error. This balances the various normal directions in the loss function.

Table 2. The \(\theta \) distribution of the normal vector in the dataset, where \(\theta \) is the angle between the normal vector and the z axis. To prevent the network from over-smoothing the normal map, we group normal vectors into three bins according to \(\theta \). With probability \(P_{i}\) for bin i, its weight is \(W_{i} = 0.7 + 1/(10P_{i})\).
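The bin weighting can be illustrated with a small sketch. It reads the weight formula as \(W_{i} = 0.7 + 1/(10 P_{i})\), so that rarer (less flat) bins receive larger weights, which is an interpretation of the caption. The probabilities used below are hypothetical placeholders, not the values of Table 2.

```python
import numpy as np

def bin_weights(probabilities):
    """W_i = 0.7 + 1 / (10 * P_i): rarer bins receive larger weights."""
    p = np.asarray(probabilities, dtype=float)
    return 0.7 + 1.0 / (10.0 * p)

def weighted_normal_l2(pred, gt, bin_index, weights):
    """Per-pixel squared error on normals, scaled by the weight of each pixel's bin.

    pred, gt: (P, 3) arrays of normals; bin_index: (P,) integer bin per pixel.
    """
    per_pixel = np.sum((pred - gt) ** 2, axis=-1)
    return float(np.mean(weights[bin_index] * per_pixel))

# Hypothetical bin probabilities (flat, moderate, steep); NOT taken from Table 2.
weights = bin_weights([0.80, 0.15, 0.05])
```

For these placeholder probabilities the weights are roughly 0.83, 1.37 and 2.7, so an error on a steep normal counts about three times as much as the same error on a flat one.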

Material Classification. The distribution of BRDF parameters is closely related to the surface material type. However, training separate networks for different material types similar to [19] is expensive. Also the size of the network grows linearly with the number of material types, which limits utility. Instead, we propose a split-merge network with very little computational overhead.

Given the highest level of features extracted by the encoder, we send the feature to a classifier to predict its material type. Then we evaluate the BRDF parameters for each material type and use the classification results (the output of the softmax layer) as weights. This averages the predictions from different material types to obtain the final BRDF reconstruction results. Suppose we have N channels for BRDF parameters and K material types. To output the BRDF reconstruction for each type of material, we only modify the last convolutional layer of the decoder so that the number of output channels will be \(K\times N\) instead of N. In practice, we set K to be 8, as shown in Table 1.
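The merge step can be sketched as a softmax-weighted average over the K per-type predictions. The type-major channel layout assumed in the reshape and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def merge_predictions(decoder_out, class_logits, n_channels):
    """Fuse K per-type predictions into one using classifier probabilities.

    decoder_out: (K * N, H, W), assumed to be laid out type-major.
    class_logits: (K,) classifier outputs before the softmax.
    Returns an (N, H, W) array.
    """
    k = class_logits.shape[0]
    per_type = decoder_out.reshape(k, n_channels, *decoder_out.shape[1:])
    weights = softmax(class_logits)
    # Contract the K axis: sum_k weights[k] * per_type[k]
    return np.tensordot(weights, per_type, axes=1)

rng = np.random.default_rng(0)
decoder_out = rng.normal(size=(8 * 3, 4, 4))   # K = 8 types, N = 3 channels
merged = merge_predictions(decoder_out, np.zeros(8), n_channels=3)
```

Because only the last convolutional layer grows from N to \(K\times N\) channels, the extra cost over a single-type decoder is small, in line with the motivation given above.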

The classifier is trained together with the encoder and decoder from scratch, with the weights of each label set to be inversely proportional to the number of examples in Table 1 to balance different material types in the loss function. The overall loss function of our network with the classifier is

$$\begin{aligned} \mathcal {L} = \lambda _{d}\mathcal {L}_{d} + \lambda _{n}\mathcal {L}_{n} + \lambda _{r}\mathcal {L}_{r} + \lambda _{rec}\mathcal {L}_{rec} + \lambda _{cls}\mathcal {L}_{cls}, \end{aligned}$$

where \(\mathcal {L}_{cls}\) is cross entropy loss and \(\lambda _{cls} = 0.0005\) to limit the gradient magnitude.
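A minimal sketch of the class-balanced cross entropy follows. The per-type counts are hypothetical placeholders (Table 1 is not reproduced here), and rescaling the weights to have mean 1 is an assumption; the text only states that the weights are inversely proportional to the number of examples.

```python
import numpy as np

LAMBDA_CLS = 0.0005  # weight of the classification term, from the text

def inverse_frequency_weights(counts):
    """Class weights proportional to 1 / count, rescaled to mean 1 (assumption)."""
    counts = np.asarray(counts, dtype=float)
    w = 1.0 / counts
    return w / w.sum() * len(counts)

def weighted_cross_entropy(logits, label, class_weights):
    """Cross entropy of one sample, scaled by the weight of its true class."""
    z = logits - np.max(logits)
    log_probs = z - np.log(np.sum(np.exp(z)))
    return float(-class_weights[label] * log_probs[label])

# Hypothetical training counts for 8 material types; NOT the values of Table 1.
class_weights = inverse_frequency_weights([120, 90, 80, 75, 70, 60, 50, 43])
cls_term = LAMBDA_CLS * weighted_cross_entropy(np.zeros(8), 3, class_weights)
```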

4.3 Designing DCRFs for Refinement

The prediction of our base network is quite reasonable. However, accuracy may further be enhanced by post-processing through a DCRF (trained end-to-end).

Diffuse Color Refinement. For diffuse prediction, when capturing the image of specular materials, parts of the surface might be saturated by specular highlights. This can sometimes lead to artifacts in the diffuse color prediction since the network has to hallucinate the diffuse color from nearby pixels. To remove such artifacts, we incorporate a densely connected continuous conditional random field (DCRF) [30] to smooth the diffuse color prediction. Let \(\mathbf {\hat{d}}_{i}\) be the diffuse color prediction of the network at pixel i, \(\mathbf {p}_{i}\) be its position and \(\mathbf {\bar{I}}_{i}\) be the normalized diffuse RGB color of the input image. We use the normalized color of the input image to remove the influence of light intensity when measuring the similarity between two pixels. The energy function of the densely connected CRF that is minimized over \(\{\mathbf {d}_{i}\}\) for diffuse prediction is defined as:

$$\begin{aligned} \sum _{i=1}^{N} \alpha _{i}^{d}(\mathbf {d}_{i} - \mathbf {\hat{d}}_{i})^{2} + \sum _{i, j}^{N}(\mathbf {d}_{i} - \mathbf {d}_{j})^{2} \!\! \left( \beta _{1}^{d}\kappa _{1}(\mathbf {p}_{i}; \mathbf {p}_{j}) + \beta _{2}^{d}\kappa _{2}(\mathbf {p}_{i}, \mathbf {\bar{I}}_{i}; \mathbf {p}_{j},\mathbf {\bar{I}}_{j}) + \beta _{3}^{d}\kappa _{3}(\mathbf {p}_{i}, \mathbf {\hat{d}}_{i}; \mathbf {p}_{j},\mathbf {\hat{d}}_{j}) \right) . \end{aligned}$$

Here \(\kappa _{i}\) are Gaussian smoothing kernels, while \(\alpha _{i}^{d}\) and \(\{\beta _{i}^{d}\}\) are coefficients to balance the contribution of unary and smoothness terms. Notice that we have a spatially varying \(\alpha _{i}^{d}\) to allow different unary weights for different pixels. The intuition is that artifacts usually occur near the center of images with specular highlights. For those pixels, we should have lower unary weights so that the CRF learns to predict their diffuse color from nearby pixels.
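Since the energy above is quadratic in \(\{\mathbf {d}_{i}\}\), its minimizer on a toy problem can be obtained by solving a linear system, as in the sketch below. This is an illustration only: it builds dense \(N \times N\) kernels, which is feasible for a handful of pixels but not for full images, and it is not the inference procedure of [30]. The kernel form \(\exp (-\Vert \mathbf {f}_{i} - \mathbf {f}_{j}\Vert ^{2} / 2\sigma ^{2})\) with a single bandwidth per kernel, and all numeric values, are assumptions.

```python
import numpy as np

def gaussian_kernel(features, sigma):
    """Pairwise exp(-||f_i - f_j||^2 / (2 sigma^2)) with zero diagonal."""
    diff = features[:, None, :] - features[None, :, :]
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))
    np.fill_diagonal(k, 0.0)
    return k

def crf_energy(d, d_hat, alpha, w):
    """Unary term plus pairwise smoothness term, summed over ordered pairs (i, j)."""
    unary = np.sum(alpha[:, None] * (d - d_hat) ** 2)
    pair = np.sum(w * np.sum((d[:, None, :] - d[None, :, :]) ** 2, axis=-1))
    return float(unary + pair)

def crf_minimizer(d_hat, alpha, w):
    """Setting the gradient to zero gives (diag(alpha) + 2 (D - W)) d = diag(alpha) d_hat."""
    laplacian = np.diag(w.sum(axis=1)) - w
    return np.linalg.solve(np.diag(alpha) + 2.0 * laplacian, alpha[:, None] * d_hat)

rng = np.random.default_rng(0)
n = 16
p = rng.uniform(size=(n, 2))        # pixel positions
i_bar = rng.uniform(size=(n, 3))    # normalized input colors
d_hat = rng.uniform(size=(n, 3))    # network diffuse predictions
w = (0.5 * gaussian_kernel(p, 0.2)
     + 0.3 * gaussian_kernel(np.hstack([p, i_bar]), 0.3)
     + 0.2 * gaussian_kernel(np.hstack([p, d_hat]), 0.3))
alpha = np.full(n, 1.0)             # spatially varying in general
d_star = crf_minimizer(d_hat, alpha, w)
```

Lowering `alpha` at a pixel weakens its unary term, so its refined color is pulled toward similar neighbours, which mirrors the intuition given for saturated highlight regions.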

Normal Refinement. Once we have the refined diffuse color, we can use it to improve the prediction of other BRDF parameters. To reduce the noise in normal prediction, we use a DCRF with two smoothness kernels. One is based on the pixel position while the other is a bilateral kernel based on the position of the pixel and the gradient of the diffuse color. The intuition is that pixels with similar diffuse color gradients often have similar normal directions. Let \(\hat{\mathbf {n}}_{i}\) be the normal predicted by the network. The energy function for normal prediction is defined as

$$\begin{aligned} \min _{\{\mathbf {n}_{i}\}}: \sum _{i=1}^{N}\alpha ^{n}(\mathbf {n}_{i} - \mathbf {\hat{n}}_{i})^{2} + \sum _{i, j}^{N}(\mathbf {n}_{i} - \mathbf {n}_{j})^{2}\Big (\beta _{1}^{n}\kappa _{1}(\mathbf {p}_{i};\mathbf {p}_{j}) + \beta _{2}^{n}\kappa _{2}(\mathbf {p}_{i}, \varDelta \mathbf {d}_{i}; \mathbf {p}_{j}, \varDelta \mathbf {d}_{j}) \Big ) \end{aligned}$$

Roughness Refinement. Since we use a collocated light source to illuminate the material, once we have the normal and diffuse color predictions, we can use them to estimate the roughness term by either grid search or a gradient-based method. However, since the microfacet BRDF model is neither convex nor monotonic with respect to the roughness term, there is no guarantee that we can find a global minimum. Also, due to noise from the normal and diffuse predictions, as well as environment lighting, it is difficult to get an accurate roughness prediction using optimization alone, especially when the glossiness in the image is not apparent. Therefore, we propose to combine the output of the network and the optimization method to get a more accurate roughness prediction. We use a DCRF with two unary terms, \(\hat{r}_{i}\) and \(\tilde{r}_{i}\), given by the network prediction and the coarse-to-fine grid search method of [15], respectively:

$$\begin{aligned} \min _{ \{r_{i}\} }: \sum _{i=1}^{N}\alpha _{i0}^{r}(r_{i} - \hat{r}_{i})^{2} + \alpha _{i1}^{r}(r_{i} - \tilde{r}_{i})^{2} + \sum _{i, j}^{N} (r_{i} - r_{j})^{2} \Big (\beta _{0}\kappa _{0}(\mathbf {p}_{i}; \mathbf {p}_{j}) + \beta _{1}\kappa _{1}(\mathbf {p}_{i}, \mathbf {d}_{i}; \mathbf {p}_{j}, \mathbf {d}_{j}) \Big ) \end{aligned}$$

All DCRF coefficients are learned in an end-to-end manner using [36]. Here, we have a different set of DCRF parameters for each material type to increase model capacity. During both training and testing, the classifier output is used to average the parameters from different material types, to determine the DCRF parameters. More implementation details are in supplementary material.
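For intuition about the second unary term \(\tilde{r}_{i}\), a generic coarse-to-fine grid search over a scalar parameter can be sketched as follows. This is not the specific procedure of [15]; the objective would in practice be a rendering error as a function of roughness, and the quadratic used here is a hypothetical stand-in.

```python
import numpy as np

def coarse_to_fine_search(objective, lo, hi, samples=9, levels=4):
    """Minimise a scalar objective on [lo, hi] by repeatedly refining a grid.

    At each level the best grid point is kept and the interval shrinks
    to one grid step on either side of it.
    """
    best = lo
    for _ in range(levels):
        grid = np.linspace(lo, hi, samples)
        values = [objective(g) for g in grid]
        best = float(grid[int(np.argmin(values))])
        step = (hi - lo) / (samples - 1)
        lo, hi = max(lo, best - step), min(hi, best + step)
    return best

# Hypothetical stand-in objective with its minimum at r = 0.37.
r_tilde = coarse_to_fine_search(lambda r: (r - 0.37) ** 2, 0.0, 1.0)
```

For a non-convex objective such a search may settle in a local minimum, which is one reason for combining it with the network prediction rather than relying on it alone.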

5 Experiments

In this section, we demonstrate our method and compare it to baselines on a wide range of synthetic and real data.

Rendering Synthetic Training Dataset. To create our synthetic data, we apply the SVBRDFs on planar surfaces and render them using a GPU-based renderer [19] with the BRDF importance sampling suggested in [16]. We choose a camera field of view of \(43.35^{\circ }\) to mimic typical mobile phone cameras. To better model real-world lighting conditions, we render images under a combination of a dominant point light (flash) and an environment map. We use the 49 environment maps used in [19], with random rotations. We sample the light source position from a Gaussian distribution centered at the camera to make the inference robust to differences in real-world mobile phones. We render linear images, though clamped to (0, 1) to mimic cameras with insufficient dynamic range. However, we still wish to reconstruct the full dynamic range of the SVBRDF parameters. To aid in this, we can render HDR images using our in-network rendering layer and compute the reconstruction error w.r.t. HDR ground truth images. In practice, this leads to unstable gradients in training; we mitigate this by applying a gamma of 2.2 and minor clamping to (0, 1.5) when computing the image reconstruction loss. We find that this, in combination with our L2 losses on the SVBRDF parameters, allows us to hallucinate details from saturated images.
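The two clamping steps can be illustrated with the following sketch. Interpreting "a gamma of 2.2" as raising values to the power 1/2.2, and clamping before applying the gamma, are both assumptions; the function names are hypothetical.

```python
import numpy as np

def simulate_ldr_input(hdr):
    """Clamp a linear rendering to (0, 1) to mimic limited dynamic range."""
    return np.clip(hdr, 0.0, 1.0)

def tonemap_for_loss(hdr, gamma=2.2, max_value=1.5):
    """Clamp to (0, 1.5), then apply x^(1/gamma); order and exponent are assumptions."""
    return np.clip(hdr, 0.0, max_value) ** (1.0 / gamma)

def reconstruction_loss(rendered_pred, rendered_gt):
    """L2 loss between tone-mapped renderings."""
    diff = tonemap_for_loss(rendered_pred) - tonemap_for_loss(rendered_gt)
    return float(np.mean(diff ** 2))

hdr = np.array([-0.1, 0.5, 3.0])
network_input = simulate_ldr_input(hdr)
loss_target = tonemap_for_loss(hdr)
```

Clamping at 1.5 rather than 1 keeps some signal from highlights that are saturated in the input, while the gamma compresses large values and so limits the gradient magnitude they produce.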

Training Details. We use the Adam optimizer [18] to train our network. We set \(\beta _{1} = 0.5\) when training the encoder and decoders and \(\beta _{1} = 0.9\) when training the classifier. The initial learning rate is set to be \(10^{-4}\) for the encoder, \(2\times 10^{-4}\) for the three decoders and \(2\times 10^{-5}\) for the classifier. We halve the learning rate every two epochs. Since we find that the diffuse color and normal direction contribute much more to the final appearance, we first train their encoder-decoders for 15 epochs, then we fix the encoder and train the roughness decoder separately for 8 epochs. Next, we fix the network and train the parameters for the DCRFs, using the Adam optimizer to update their coefficients (Fig. 4).
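The learning-rate schedule can be written as a one-line step decay. That the halving happens at epoch boundaries with 0-indexed epochs is an assumption, and the dictionary keys are hypothetical names.

```python
# Initial learning rates from the text; the keys are hypothetical names.
INITIAL_LR = {"encoder": 1e-4, "decoders": 2e-4, "classifier": 2e-5}

def learning_rate(part, epoch):
    """Halve the initial rate after every two completed epochs (epoch is 0-indexed)."""
    return INITIAL_LR[part] * 0.5 ** (epoch // 2)

# Encoder rate during the last of the 15 diffuse/normal epochs.
final_encoder_lr = learning_rate("encoder", 14)
```

Under this reading, the encoder rate has been halved seven times by the end of the 15 epochs, ending at \(10^{-4} / 128\).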

5.1 Results on Synthetic Data

Qualitative Results. Figure 4 shows results of our network on our synthetic test dataset. We can observe that spatially varying surface normals, diffuse albedo and roughness are recovered at high quality, which allows relighting under novel light source directions that are very different from the input. To further demonstrate our BRDF reconstruction quality, in Fig. 5, we show relighting results under different environment maps and point lights at oblique angles. Note that our relighting results closely match the ground truth even under different lighting conditions; this indicates the accuracy of our reconstructions.

We next perform quantitative ablation studies to evaluate various components of our network design and study comparisons to prior work.

Fig. 4.

BRDF reconstruction results from our full method (\(\mathtt {clsCRF}\)-\(\mathtt {pt}\) in Table 3) on the test set. We compare the ground truth parameters with our reconstructions as well as renderings of these parameters under novel lighting. The accuracy of our renderings indicates the accuracy of our method.

Fig. 5.

Materials estimated with our method and rendered under two environment lights and three point lights (placed on a unit sphere at \(\theta = 50^{\circ }\) and various \(\phi \) angles).

Effects of Material Classifier and DCRF: The ablation study summarized in Table 3 shows that adding the material classifier reduces the L2 error for SVBRDF and normal estimation, as well as rendering error. This validates the intuition that the network can exploit the correlation between BRDF parameters and material type to produce better estimates. We also observe that training the classifier together with the BRDF reconstruction network results in a material classification accuracy of \(73.65\%\), which significantly improves over our pure material classification network that achieves \(54.96\%\). This indicates that features trained for BRDF estimation are also useful for material recognition. In our experiments, incorporating the classifier without using its output to fuse BRDF reconstruction results does not improve BRDF estimation. Figure 6 shows the reconstruction result on a sample where the classifier and the DCRF qualitatively improve the BRDF estimation, especially for the diffuse albedo.

Table 3. Left to right: basic encoder-decoder, adding material classifier, adding DCRF and a pure material classifier. \(-\mathtt {pt}\) indicates training and testing with dominant point and environment lighting.
Fig. 6.

Qualitative comparison of BRDF reconstruction results of different variants of our network. The notation is the same as Table 3 and \(-\mathtt {env}\) represents environment illumination.

Table 4. BRDF reconstruction accuracy for different material types in our test set. Albedo-N is normalized diffuse albedo as in [19], that is, the average norm of each pixel will be 0.5.
Fig. 7.

The first two inputs rendered under different environment maps are very different. Thus, the normals recovered using [19] are inaccurate. Our method uses point illumination (third input) which alleviates the problem, and produces better normals.

Effect of Acquisition Under Point Illumination. Next we evaluate the effect of using point illumination during acquisition. For this, we train and test two variants of our full network – one on images rendered under only environment illumination (-\(\mathtt {env}\)) and another on images illuminated by a point light besides environment illumination (-\(\mathtt {pt}\)). Results are in Table 4 with qualitative visualizations in Figure 6. The model from [19] in Table 4, which is trained for environment lighting, performs slightly worse than our environment lighting network cls-env. But our network trained and evaluated on point and environment lighting, cls-pt, easily outperforms both. We argue this is because a collocated point light creates more consistent illumination across training and test images, while also capturing higher frequency information. Figure 7 illustrates this: the appearance of the same material under different environment lighting can significantly vary and the network has to be invariant to this, limiting reconstruction quality.

Fig. 8.

SVBRDF estimation errors for relative intensities of environment against point light ranging from 0 to 0.8.

Relative Effects of Flash and Environment Light Intensities. In Fig. 8, we train and test on a range of relative flash intensities. Note that as relative flash intensity decreases, errors increase, which justifies our use of flash light. Using flash and no-flash pairs can help remove environment lighting, but needs alignment of two images, which limits applicability.

5.2 Results on Real Data

Acquisition Setup. To verify the generalizability of our method to real data, we show results on real images captured with different mobile devices in both indoor and outdoor environments. We capture linear RAW images (with potentially clipped highlights) with the flash enabled, using the Adobe Lightroom Mobile app. The mobile phones were hand-held and the optical axis of the camera was only approximately perpendicular to the surfaces (see Fig. 1).

Qualitative Results with Different Mobile Phones. Figure 9 presents SVBRDF and normal estimation results for real images captured with three different mobile devices: Huawei P9, Google Tango and iPhone 6s. We observe that even with a single image, our network successfully predicts the SVBRDF and normals, and images rendered using the predicted parameters appear very similar to the input. Also, the exact same network generalizes well to different mobile devices, which shows that our data augmentation successfully helps the network factor out variations across devices. For some materials with specular highlights, the network can hallucinate information lost due to saturation. The network can also reconstruct reasonable normals even for complex instances.

A Failure Case. In Fig. 10, we show a failure case. Here, the material is misclassified as metal which causes the specular highlight in the center of image to be over-suppressed. In future work, we may address this with more robust material classification, potentially exploiting datasets like [3].

Fig. 9.

BRDF reconstruction results on real data. We tried different mobile devices to capture raw images using the Adobe Lightroom Mobile app. The input images were captured using a Huawei P9 (first three rows), Google Tango (fourth row) and iPhone 6s (fifth row), all with a handheld mobile phone where the z-axis of the camera was only approximately perpendicular to the sample surface.

5.3 Further Comparisons with Prior Works

Comparison with Two-Shot BRDF Method [2]. The two-shot method of [2] can only handle images with stationary texture while our method can reconstruct arbitrarily varying SVBRDFs. For a meaningful comparison, in Fig. 12, we compare our method with [2] on a rendered stationary texture. We can see that even for this restrictive material type, the normal maps reconstructed by the two methods are quite similar, but the diffuse map reconstructed by our method is closer to ground truth. While [2] takes about 6 h to reconstruct a patch of size \(192\times 192\), our method requires 2.4 s. The aligned flash and no-flash pair for [2] is not trivial to acquire (especially on mobile cameras with effects like rolling shutter), making our single image BRDF estimation more practical.

Comparison of Normals with Environment Light and Photometric Stereo. In Fig. 11, we compare our normal map with the results from (a) [19] (from a single image captured under environment lighting) and (b) photometric stereo [14]. We observe that the normals reconstructed by our method are of higher quality than [19], with details comparable to or sharper than photometric stereo.

Fig. 10.

A failure case, due to incorrect material classification into metal, which causes the specularity to be over-smoothed.

Fig. 11.

Comparison of normal maps using our method and [19], with photometric stereo as reference. Even with a lightweight acquisition system, our network predicts high quality normal maps.

Fig. 12.

Comparison with [2], which requires two images, assumes stationary textures and takes over 6 h (with GPU acceleration), yet our result is more accurate.

The supplementary material provides more information, including: details of data augmentation and continuous DCRF, error distributions of BRDF, distribution of material categories, material editing and relighted images and further qualitative results on synthetic and real data.

6 Discussion

We have proposed a framework for acquiring spatially-varying BRDF using a single mobile phone image. Our solution uses a convolutional neural network whose architecture is specifically designed to reflect various physical insights into the problem of BRDF estimation. We propose to use a dataset that is larger and better-suited to material estimation as compared to prior ones, as well as simple acquisition settings that are nevertheless effective for SVBRDF estimation. Our network generalizes very well to real data, obtaining high-quality results in unconstrained test environments. A key goal for our work is to take accurate material estimation from expensive and controlled lab setups, into the hands of non-expert users with consumer devices, thereby opening the doors to new applications. Our future work will take the next step of acquiring SVBRDF with unknown shapes, as well as study the role of semantic priors.