An attention-embedded GAN for SVBRDF recovery from a single image

Learning-based approaches have made substantial progress in capturing spatially-varying bidirectional reflectance distribution functions (SVBRDFs) from a single image with unknown lighting and geometry. However, most existing networks only consider per-pixel losses which limit their capability to recover local features such as smooth glossy regions. A few generative adversarial networks use multiple discriminators for different parameter maps, increasing network complexity. We present a novel end-to-end generative adversarial network (GAN) to recover appearance from a single picture of a nearly-flat surface lit by flash. We use a single unified adversarial framework for each parameter map. An attention module guides the network to focus on details of the maps. Furthermore, the SVBRDF map loss is combined to prevent paying excess attention to specular highlights. We demonstrate and evaluate our method on both public datasets and real data. Quantitative analysis and visual comparisons indicate that our method achieves better results than the state-of-the-art in most cases.


Introduction
The complex interaction between light and the surfaces of objects with various appearances explains the variations in photographs captured in the real world. Acquiring the surface reflection parameters of objects is one of the major tasks in computer vision and realistic graphics [1][2][3][4][5], with various applications including appearance transfer, restoration, and augmented reality. Breaking appearance down into reflectance and lighting leads to powerful image editing applications, such as material transfer [6] and illumination editing [7].
There is a trend towards data-driven approaches when capturing surface appearance, as they are more expressive than analytic models. Recently, lightweight capture processes using consumer devices and uncalibrated lighting have drawn more attention. It would benefit content creation for virtual worlds if a user could create the appearance from a few or even a single image. Since different lighting and material properties may produce the same visual effects, recovering material parameters based on a single image is an ill-posed problem.
Researchers have proposed many learning-based approaches to address the task; some focus on estimating spatially-varying surface material parameters from a single image [8][9][10][11]. The numerical pixel error in these works is low, but the visual effect is not satisfactory in certain cases. Others use multiple image inputs to capture materials [12,13], but require tedious optimization to support an arbitrary number of input images. In this paper, inspired by the significant progress in using generative adversarial networks (GANs) [14] in image processing, we propose an end-to-end learning framework to reconstruct spatially-varying bidirectional reflectance distribution functions (SVBRDFs) from a single image.
Traditional convolutional neural networks (CNNs) for SVBRDF recovery cannot effectively express smooth highlight details in the reconstructed appearance. Another challenge is generalizability, especially when trained on synthetic datasets. Some methods [15,16] add multiple adversarial losses to optimize individual material parameters. Considering the inherent instability in GAN training, increasing the adversarial loss inevitably increases redundancy and the difficulty of determining network parameters. We aim to obtain high-quality material parameters to efficiently reconstruct the appearance of the input, so in our framework we use a unified discriminator for all maps. Specifically, we apply a binary classification discriminant network to judge the authenticity of the generated results, and use it to optimize the latent features in the generative network, so that the generator can produce images of high perceptual quality.
Parameter estimation not only needs to consider overall quality, but also must be optimized for certain local details, which are often the main visual clues for human determination of material properties: when a person observes an image, they first scan it globally, and then pay attention to details. This focusing of attention tends to allocate more resources to the target of interest, while suppressing the less useful features. Therefore, we design an attention mechanism to improve the quality of SVBRDF maps, especially for local high-frequency details.
However, the attention mechanism may also increase error, especially in specular regions, because it focuses on highlights and cannot decompose the result of the joint contributions of multiple maps, resulting in false bright regions in the specular map. We thus add an SVBRDF map loss to guide weight learning after adopting the attention mechanism. Our new joint loss consists of adversarial loss, rendering loss, and SVBRDF map loss. We have validated that our GAN-based reconstruction framework can produce convincing results.
To summarize, our contributions include:
• a unified GAN framework for supervised high-quality SVBRDF map recovery,
• an attention mechanism specifically designed to improve the visual quality resulting from the reconstructed SVBRDFs, and
• a new joint loss consisting of a weighted sum of the adversarial loss, rendering loss, and SVBRDF map loss.

Related work
This overview mainly discusses works that allow non-expert users to employ image-driven appearance modeling tools with commodity devices and lightweight capturing processes. Some use a few or multiple images as input and fit the SVBRDF parameters using them, without any prior knowledge. Others are deep learning-based, and use large-scale datasets to train network parameters.

Multi-image appearance modeling
Several works use multiple images as input to capture SVBRDFs. Chandraker [17] used motion cues to jointly optimize the shape and reflectance of objects, under known lighting conditions. Hui and Sankaranarayanan [18] captured several images of an object from a fixed viewing angle under varied lighting to estimate shape and reflectance. Riviere et al. [19] used a handheld camera or mobile phone to collect spatially varying material samples and utilized handcrafted heuristics to estimate specular and diffuse reflectance. Xia et al. [20] estimated SVBRDFs and detailed geometric shapes from videos of rotating objects under unknown natural lighting. Although these methods can obtain accurate reflectance, they need heuristic regularization or assumptions due to the paucity of input samples. In contrast, our method can perform well even for a single input image.

Single or few image appearance modeling
Other works aim to estimate material properties by inputting one or a minimal number of images. Boivin and Gagalowicz [21] used a single image and a 3D geometric model to recover surface reflectance. Their algorithm first classifies the materials in the image, and finds the optimal values of material parameters by continuous layering and iteration. Aittala et al. [22] recovered reflectance from only two photos assuming that each local area is statistically similar. Xu et al. [23] obtained the BRDF of a homogeneous plane sample from two photos; the approach can also be simply extended to acquisition of SVBRDFs by clustering the materials. These methods usually require strict constraints and complex fitting or optimization to achieve their goals.

Learning-based appearance modeling
At present, most successful works are deep learning-based, as this allows use of prior knowledge to help solve this ill-posed problem. Li et al. [8] obtained diffuse albedo and normal maps from nearly-planar samples under global illumination, using a self-augmentation strategy to train the model using a small training set. Deschaintre et al. [9] proposed a parallel network structure that combines a fully connected layer with a traditional U-Net [24] to extract global features, in order to reduce artifacts. They extended this work to flexibly allow a varying number of images by using an order-independent fusion layer [25]. Li et al. [26] designed a complicated cascade network that can recover shape and SVBRDFs simultaneously. They added a rendering layer to the network to estimate global illumination effects; this is essential for real-world scenes. The method proposed by Gao et al. [12] can estimate SVBRDF maps for a flat sample from any number of input photos. They train an autoencoder to build the latent space of the SVBRDF maps, and then optimize the material maps within it. With more input images, the SVBRDF maps become more accurate, but the optimization takes more time.
Zhao et al. [27] proposed an unsupervised generative adversarial neural network that can generate high-quality SVBRDFs from a single photograph with a repetitive structure. Guo et al. [13] proposed MaterialGAN to solve the problem of SVBRDF reconstruction from multiple input images. Generally, the multi-view method requires relatively correct viewing angles and light directions to provide high-quality output, but non-expert users cannot accurately determine these parameters, increasing difficulty and decreasing robustness of the method. Asselin et al. [28] used a new portable capture device to obtain real datasets, and estimate material maps based on the deep learning architecture of StyleGANv2 [29]. Zhou and Kalantari [15] adopted multiple adversarial losses and added some real-world images during training to improve the quality of the reconstructed parameter maps, but the generated maps have artifacts in saturated highlight regions. Guo et al. [16] designed a two-stream neural network to obtain SVBRDF maps, which contains two independent feature extraction modules and four feature fusion modules to reduce artifacts caused by input highlights, but there are still large errors in some cases. Our solution is an end-to-end GAN architecture; an attention mechanism is embedded in the generator to keep details. A comprehensive comparison of results demonstrates the superiority of this method.

Method
Inspired by the progress of GANs in image processing tasks, we propose a new GAN architecture, which can generate reliable SVBRDF maps from a single image. The main structure of our GAN architecture is shown in Fig. 1. We input a picture taken by a mobile phone or camera into the generation network to give an initial result. We then input the predicted SVBRDF maps and ground truth maps into the rendering layer to randomly render multiple images and concatenate them. The discriminant network is used to distinguish between true and false, and the final difference is combined as the loss function.

Network structure
Our generator consists of an encoder and a decoder, and finally generates a normal map, a diffuse albedo map, a roughness map, and a specular albedo map. For convenience, we denote the layers producing outputs of the same resolution as a level. Figure 2 shows our generator architecture for one such level, which includes down-sampling, up-sampling, and a parallel fully connected layer to fuse global information. A typical down-sampling block contains 4 layers: a convolution layer, an attention layer, an InstanceNorm layer, and a Leaky ReLU [30] activation layer. The attention mechanism module will be explained in Section 3.2. A typical up-sampling block contains 4 layers: a deconvolution layer, an InstanceNorm layer, a Leaky ReLU activation layer, and a dropout layer. For image generation tasks like ours, the generated result mainly depends on one image instance, so we use instance normalization instead of normalizing the entire batch. In order to increase the nonlinear relationship between the layers of the network, the choice of activation function is crucial. We use the Leaky ReLU function with a weight of 0.2, as, compared to ReLU, doing so can speed up convergence and effectively avoid gradient vanishing and dead neurons during training. We introduce skip connections to sample blocks of the same size to reintroduce missing high-frequency details. Deschaintre et al. [9] showed that the current task cannot be readily solved using only a network structure similar to U-Net, as the convolution operation is usually used to extract spatially local features. The practical receptive field of the CNN is actually much smaller than the theoretical value especially at high levels, as shown by Zhou et al. [31]. Therefore, a global feature extraction module is needed to fuse far-away information. Specifically, we add a network composed of fully connected (FC) layers parallel to the U-Net as the global feature extractor, following Ref. [9]. 
The output of the InstanceNorm layer in each sampling block is added to the output of the FC layer in the current level and then passed to the activation layer. The output of the FC layer is further concatenated with the mean vector of the activation layer output, as the input to the FC layer in the next level. The discriminator follows Isola et al. [32].
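As a rough illustration of the per-level fusion described above, the following pure-Python sketch normalizes each channel of a single image, adds a per-channel global feature (standing in for the FC-layer output), and applies Leaky ReLU. The function names, the tiny 2×2 feature maps, and the reading of the "weight of 0.2" as the negative slope are assumptions made for illustration; this is not the authors' implementation.

```python
import math

def instance_norm(channel, eps=1e-5):
    """Normalize one channel of one image to zero mean and unit variance."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in channel]

def leaky_relu(x, slope=0.2):
    """Leaky ReLU; slope=0.2 is our reading of the 'weight of 0.2' in the text."""
    return x if x >= 0.0 else slope * x

def fuse_level(feature_maps, global_features, slope=0.2):
    """Add one global scalar per channel to the normalized map, then activate."""
    fused = []
    for channel, g in zip(feature_maps, global_features):
        normed = instance_norm(channel)
        fused.append([[leaky_relu(v + g, slope) for v in row] for row in normed])
    return fused

def channel_means(feature_maps):
    """Mean of each activated channel (the 'mean vector' of the activation output)."""
    return [sum(v for row in ch for v in row) / sum(len(row) for row in ch)
            for ch in feature_maps]

# Hypothetical two-channel 2x2 feature map and FC-layer output.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
global_out = [0.5, -0.5]
activated = fuse_level(features, global_out)
# The FC output is concatenated with the channel means as input to the next FC layer.
next_fc_input = global_out + channel_means(activated)
```

Normalizing per image and per channel, rather than over the batch, matches the stated motivation that each generated result depends mainly on one image instance.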

Attention module
Attention mechanisms [33,34] are widely used in natural language processing, speech processing, and image processing [35][36][37]. We designed an attention module that considers multi-scale features (see Fig. 3) to improve both the global and local quality of the reconstructed results. Intuitively speaking, "high-level" attention tends to concentrate on highlight-saturated areas, while "low-level" attention focuses on details such as local high-frequency variations in the SVBRDF maps. Experimental results of ablation studies of the attention module are given in Section 4.3.
The attention module produces an intermediate feature map $X$ from the input $X_l$ of level $l$. In its definition, $\alpha_i$ represents average pooling, $W_c$ denotes a 1×1 convolution kernel, $f_i$ is the bilinear interpolation for up-sampling, $\sigma$ is the activation function, $\otimes$ represents convolution, and $\odot$ represents the element-wise product.

Fig. 3 Attention module. Arrows indicate directions of data flows. $X_l$ is the input to level $l$. $X$ is the result of multiplying the output of the activation function and the input feature map. $X_{l+1}$ is the output of the entire attention module to the next level, concatenating $X_l$ and $X$.
After adding this mechanism, detailed features are enhanced, as can be seen later from the results in Fig. 6.
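To make the data flow of the attention module more concrete, here is a minimal single-channel sketch in pure Python. It assumes one plausible combination of the listed operations: average pooling at several scales, a 1×1 convolution (a scalar weight for a single channel), bilinear up-sampling, summation, an activation, an element-wise product with the input, and concatenation with the input. The pooling sizes, the sigmoid activation, the summation across scales, and all function names are assumptions, so this may differ from the authors' module.

```python
import math

def avg_pool(img, k):
    """Average-pool a square 2D list with a k x k window and stride k."""
    n = len(img) // k
    return [[sum(img[i * k + a][j * k + b] for a in range(k) for b in range(k)) / (k * k)
             for j in range(n)] for i in range(n)]

def bilinear_resize(img, size):
    """Bilinearly resize a square 2D list to size x size (pixel-centre alignment)."""
    n = len(img)
    out = []
    for i in range(size):
        # Map the output pixel centre back to source coordinates and clamp.
        y = min(max((i + 0.5) * n / size - 0.5, 0.0), n - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, n - 1)
        wy = y - y0
        row = []
        for j in range(size):
            x = min(max((j + 0.5) * n / size - 0.5, 0.0), n - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, n - 1)
            wx = x - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def attention(x_l, pool_sizes=(1, 2, 4), w_c=1.0):
    """Return [x_l, gated]: the input concatenated with the attention-weighted input."""
    size = len(x_l)
    acc = [[0.0] * size for _ in range(size)]
    for k in pool_sizes:
        pooled = avg_pool(x_l, k)
        projected = [[w_c * v for v in row] for row in pooled]  # 1x1 conv, one channel
        up = bilinear_resize(projected, size)
        for i in range(size):
            for j in range(size):
                acc[i][j] += up[i][j]
    gated = [[sigmoid(acc[i][j]) * x_l[i][j] for j in range(size)] for i in range(size)]
    return [x_l, gated]  # concatenation along the channel axis

# Hypothetical 4x4 single-channel input.
x = [[1.0] * 4 for _ in range(4)]
x_next = attention(x)
```

Because the gate lies strictly between 0 and 1, the weighted branch can only attenuate features; concatenating it with the unmodified input keeps the original information available to the next level.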

Loss function
Choice of loss function is critical for generators. Deschaintre et al. [9] showed that an L1 loss on the SVBRDF maps alone cannot recover appearances relatively consistent with the ground truth, so they used a rendering loss instead. Although re-rendering of the restored SVBRDF maps can produce an appearance relatively consistent with the input, there are still large errors for some reflection parameters, especially the roughness map and the specular albedo map. In order to account for both per-pixel error and consistency, we apply a joint loss function which is a weighted sum of adversarial loss, rendering loss, and SVBRDF map loss, with weights $\lambda_1$, $\lambda_2$, and $\lambda_3$, respectively.

In the objective function optimized by the generative adversarial network, $G$ is the generative network, which represents the mapping of training samples to generated data. $D$ is the discriminant network, which discriminates the input samples and maximizes the distance between the real data and the generated data. $x \sim p_{\mathrm{data}}(x)$ is the real data, and $z \sim p_z(z)$ is the input data. The two networks optimize the objective function through alternating iterative training.

The rendering loss is an L1 loss between the renderings of the predicted SVBRDF maps and those of the ground truth SVBRDF maps under several lighting and viewing directions. The logarithmic transformation in this loss aims to enhance details, especially in dark regions, following Ref. [38]. $N$ represents the number of generated images rendered in random directions, and $R_i$ is the rendering layer in the network. The rendering layer acts as a pixel shader that evaluates the rendering equation at each pixel of the SVBRDFs, given a pair of viewing and lighting directions. The process is performed in SVBRDF coordinate space. We use the Cook-Torrance [39] BRDF model to render the image following Aittala et al. [40].
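The three losses described in this subsection can be written out as follows. This is a reconstruction from the surrounding definitions rather than a verbatim copy of the paper's equations: the symbols $L$, $L_{\mathrm{render}}$, $\hat{M}$ (predicted maps), $M$ (ground truth maps), and the small constant $\epsilon$ are assumed names, and the adversarial objective is given in the standard GAN minimax form, which matches the stated roles of $G$, $D$, $p_{\mathrm{data}}$, and $p_z$.

```latex
% Joint loss: weighted sum of adversarial, rendering, and SVBRDF map losses.
L = \lambda_1 L_{\mathrm{adv}} + \lambda_2 L_{\mathrm{render}} + \lambda_3 L_{\mathrm{svbrdf}}

% Adversarial objective in the standard GAN minimax form (assumed).
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]

% Rendering loss: L1 distance between log-transformed renderings,
% averaged over N random lighting and viewing directions (assumed form).
L_{\mathrm{render}} = \frac{1}{N} \sum_{i=1}^{N}
  \left\| \log\left(R_i(\hat{M}) + \epsilon\right) - \log\left(R_i(M) + \epsilon\right) \right\|_1
```

The constant $\epsilon$ only keeps the logarithm finite for black pixels; its value is not stated in the text.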
The SVBRDF map loss is defined as

$$L_{\mathrm{svbrdf}} = \lambda_n L_n + \lambda_d L_d + \lambda_r L_r + \lambda_s L_s \tag{5}$$

where $L_n$, $L_d$, $L_r$, $L_s$ are the L1 loss of the normal map, diffuse albedo map, roughness map, and specular map, respectively. In our experiments, $\lambda_n = \lambda_d = 1$, $\lambda_r = \lambda_s = 0.5$.
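The weighted sums above can be sketched numerically as follows. The map weights and the loss weights are the values reported in the paper; everything else is an assumption for illustration: maps and renderings are flat Python lists, the L1 loss is taken as a mean absolute difference, `eps` is a hypothetical constant that keeps the logarithm finite, and the adversarial loss is passed in as a plain number because no discriminator is modeled here.

```python
import math

# Per-map weights from Eq. (5) and joint-loss weights from the experiments.
MAP_WEIGHTS = {"normal": 1.0, "diffuse": 1.0, "roughness": 0.5, "specular": 0.5}
LAMBDA_ADV, LAMBDA_RENDER, LAMBDA_SVBRDF = 0.1, 0.5, 0.5

def l1(a, b):
    """Mean absolute difference between two equally sized flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def svbrdf_map_loss(pred, truth):
    """Weighted sum of the per-map L1 losses."""
    return sum(w * l1(pred[name], truth[name]) for name, w in MAP_WEIGHTS.items())

def rendering_loss(pred_renders, true_renders, eps=0.01):
    """L1 loss between log-transformed renderings, averaged over all renderings.
    eps is an assumed constant, not a value taken from the paper."""
    total = 0.0
    for p, t in zip(pred_renders, true_renders):
        total += l1([math.log(v + eps) for v in p], [math.log(v + eps) for v in t])
    return total / len(pred_renders)

def joint_loss(adv, render, svbrdf):
    """Weighted sum of adversarial, rendering, and SVBRDF map losses."""
    return LAMBDA_ADV * adv + LAMBDA_RENDER * render + LAMBDA_SVBRDF * svbrdf

# Hypothetical two-pixel maps.
pred = {"normal": [0.0, 0.0], "diffuse": [0.5, 0.5],
        "roughness": [0.2, 0.4], "specular": [0.1, 0.1]}
truth = {"normal": [0.0, 0.0], "diffuse": [0.5, 0.7],
         "roughness": [0.4, 0.4], "specular": [0.1, 0.1]}
map_loss = svbrdf_map_loss(pred, truth)
```

In this toy example the diffuse and roughness maps have the same mean absolute error, but the roughness error contributes half as much to the map loss because of its smaller weight.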

Experiments
We now introduce the dataset and experimental parameters used in our experiments, and give a quantitative and qualitative evaluation of our proposed methods.

Datasets and implementation
For evaluation we use a synthetic dataset provided by Deschaintre et al. [9]. It contains approximately 200,000 synthetic samples, including training and test samples. Each sample contains the original image, normal map, diffuse albedo map, roughness map, and specular albedo map; each image is 256 × 256 pixels. We implemented our model using the TensorFlow [41] deep learning framework. Training was performed on a Tesla V100 GPU. The generator and the discriminator were trained alternately, with the discriminator updated once for every 5 generator updates, on average. Training used the Adam optimizer [42], with an initial learning rate of 0.00002, halved every two epochs. All other hyperparameters were set to the default values for TensorFlow. We set $\lambda_1 = 0.1$, $\lambda_2 = 0.5$, and $\lambda_3 = 0.5$ in the joint loss. The batch size was 8. We trained our GAN architecture for 20 epochs, which took about 7 days.
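The training schedule described above can be sketched as two small helpers. The initial learning rate, the halving interval, and the 5:1 update ratio come from the text; the zero-based epoch indexing, the strictly periodic update pattern (the paper says "on average"), and the function names are assumptions.

```python
def learning_rate(epoch, initial=2e-5, halve_every=2):
    """Learning rate for a zero-based epoch, halved every `halve_every` epochs."""
    return initial * 0.5 ** (epoch // halve_every)

def update_plan(num_generator_steps, generator_steps_per_disc=5):
    """List which network is updated at each step: "G" for the generator,
    with one "D" (discriminator) update after every five generator updates."""
    plan = []
    for step in range(1, num_generator_steps + 1):
        plan.append("G")
        if step % generator_steps_per_disc == 0:
            plan.append("D")
    return plan

# Learning rates over the 20 training epochs, and a short update plan.
rates = [learning_rate(epoch) for epoch in range(20)]
plan = update_plan(10)
```

Under these assumptions the learning rate in the last two epochs is the initial rate divided by 2^9, so late training makes only very small parameter updates.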

Comparison
We conducted quantitative and qualitative comparisons on the synthetic dataset to verify the validity of our method. For further experimental results, refer to the Electronic Supplementary Material. The test dataset is a selection from the synthetic dataset to provide ground truth, and was not used in training.

Synthetic data
We chose several state-of-the-art methods [9,12,15,16] for comparison. Gao et al. [12] used an arbitrary number of input images; to be fair, we set N = 1. The root mean square error (RMSE) on each reflectance map was calculated. Merely calculating the numerical error of the material map does not suffice to show the superiority of our method, so the RMSE between the re-rendered image and the original image under 6 random lighting and viewing directions was also calculated. Average results for all test sets are given in Table 1. To further demonstrate the superiority of our method, we also considered two more advanced evaluation metrics, SSIM and LPIPS, with results shown in Tables 2 and 3. Our method achieves better results on several error metrics.
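For reference, the RMSE metric used in this comparison can be computed as in the short sketch below. Images are represented as flat lists of pixel values, and the function names are hypothetical; the averaging over re-renderings mirrors the six random lighting and viewing directions mentioned in the text.

```python
import math

def rmse(pred, truth):
    """Root mean square error between two equally sized flat lists of pixel values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))

def mean_render_rmse(pred_renders, true_renders):
    """Average RMSE over several re-rendered image pairs."""
    scores = [rmse(p, t) for p, t in zip(pred_renders, true_renders)]
    return sum(scores) / len(scores)

# Hypothetical two-pixel example: errors of 3 and 4 give sqrt((9 + 16) / 2).
example = rmse([0.0, 0.0], [3.0, 4.0])
```

Because the errors are squared before averaging, a few badly reconstructed pixels (for example around saturated highlights) raise the RMSE more than many small errors do.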
To make a qualitative comparison, we randomly selected some synthetic data, and provide visual results for each algorithm in Fig. 4. When the method of Deschaintre et al. [9] processes input images with strong highlights, the output material maps have noticeable artifacts. The method of Gao et al. [12] needs a reconstruction method to provide the initial material maps; we used the material maps produced by the method of Deschaintre et al. [9]. If these initial maps are not close to the ground truth, the optimization result is likely to fall into a local minimum. The method of Zhou and Kalantari [15] suffered from artifacts or color distortion when dealing with areas with significant highlight saturation, as shown in the upper row of Fig. 4. The method of Guo et al. [16] can effectively suppress artifacts, but there are still noticeable visual errors in some cases, as shown in the left column of Fig. 4. Our results are better than those of previous methods in terms of overall visual appearance and some local details, due to our joint loss and attention mechanism, which help to restore global and local consistency of feature details.

Real data
We compared our results to those of other methods [9,12,15,16] using the collected real samples [9,43]. Pictures taken by cameras or mobile phones were used as inputs, and the output SVBRDF maps were re-rendered under the same lighting conditions; final results are displayed in Fig. 5. As can be seen, results of our method after re-rendering are closer to the real input.

Ablation studies
We performed a set of ablation experiments to verify the contribution of each component of our method, and compared our GAN architecture to ablated versions. The variants considered were
• without $L_{\mathrm{adv}}$. To analyze the importance of the adversarial loss in restoring material maps, we removed the discriminator network and set $\lambda_1 = 0$.
• without AM. To verify that the attention mechanism module can effectively enhance details, we removed the module.
• without $L_{\mathrm{svbrdf}}$. To verify that, if SVBRDF map loss is not used ($\lambda_3 = 0$), there will be incorrect bright regions in the specular map.
• MultiGAN. To demonstrate that our unified GAN framework can produce better results, multiple adversarial loss experiments were conducted.
Deschaintre et al. [9] have already demonstrated the importance of rendering loss.
We trained the above models on the same dataset under the same training conditions as before. Quantitative results of the ablation studies are shown in Table 4.
The visual comparison in Fig. 6 further shows that our approach provides the best results. Figure 6(a) shows that when the adversarial loss is removed, the overall error increases, especially in the normal map: the adversarial loss has a significant impact on the quality of the generator. In Fig. 6(b), there is higher error in the roughness map: it lacks details after removing the attention module. Simply using the rendering loss to update the network parameters, without $L_{\mathrm{svbrdf}}$, leads to weight reduction for the specular albedo map, resulting in higher errors, as shown in Fig. 6(c). If the SVBRDF map loss is omitted and a picture with specular highlights is input, the network pays more attention to the highlights because of the attention mechanism, but cannot decompose the maps, resulting in a false highlight in the specular albedo map, as shown in Fig. 6(d).

Fig. 4 Comparison to RADN of Deschaintre et al. [9], DIR of Gao et al. [12], ASSE of Zhou and Kalantari [15], and HATS of Guo et al. [16], using synthetic data. The parameters from our method and its results after re-rendering are closer to the ground truth.

Fig. 5 Comparison to RADN of Deschaintre et al. [9], DIR of Gao et al. [12], ASSE of Zhou and Kalantari [15], and HATS of Guo et al. [16] using a single real image as input. All re-rendered results were generated with the same lighting conditions and viewing direction. It can be seen that our results are the closest to the input pictures, effectively reconstructing the SVBRDF maps.

When adding multiple adversarial losses for training as in MultiGAN, the recovery of SVBRDF maps is not significantly improved. This is due to the inherent difficulty in achieving the Nash equilibrium between the generator and multiple discriminators, so it is difficult and time-consuming to obtain the optimal solution. In order to further study the role of SVBRDF map loss in the training process when the input image has specular highlights, we analyzed its effect by changing the value of $\lambda_3$, as shown in Fig. 7. As $\lambda_3$ increases, the false highlights in the specular albedo map are eliminated, and the learned results become closer to the ground truth.

Conclusions
This paper presents a novel GAN-based solution for recovering SVBRDF maps from a single image. It generates more accurate material maps, and the re-rendered appearance is more realistic. However, our method also has limitations.
Although we achieve reasonable results for input images containing specular highlights, when the highlights are too large, low-saturation pixels will dominate the entire image. In this case, the network cannot learn enough features to generate plausible SVBRDF maps, resulting in color distortion around highlights in the re-rendered image. One possible way to improve this is to use multiple inputs from different lighting and viewing directions. A more flexible network structure should also be designed to support multiple inputs.
Currently, a synthetic dataset is used for training, and then the model is applied to real data to generate material maps, because labelled real datasets are scarce and difficult to acquire. Therefore, another potential direction for future work is to find a way to utilize real datasets in training.