Learning Physically-based Material and Lighting Decompositions for Face Editing

Lighting is crucial for portrait photography, yet the complex interactions between skin and incident light are expensive to model computationally in graphics and hard to reconstruct analytically via computer vision. Instead, to allow fast and controllable reflectance and lighting editing, we form a physically-based decomposition through deep learned priors from path-traced portrait images. Previous approaches that use simplified material models or low-frequency or low-dynamic-range lighting struggle to model specular reflections, or relight directly without an intermediate decomposition. Instead, we estimate surface normals, skin albedo and roughness, and high-frequency HDRI maps, and propose an architecture to estimate both diffuse and specular reflectance components. In experiments, we show that this approach can better represent the true appearance function than simpler baseline methods, leading to better generalization and higher-quality editing.


Introduction
Lighting is a crucial factor in successful portrait photography. Photographers set up studio lights and reflectors to enhance the appearance of subjects, with careful consideration for the appearance of skin to avoid unwanted gloss and highlights. For casual camera users, this level of control is difficult to achieve. Manually editing lighting and material appearance after a photo has been taken might simplify the creation process for novices, but current tools require manual operation and skill to produce convincing effects.
Decomposing an image into useful channels could help the portrait manipulation task. Under Lambertian reflectance assumptions, intrinsic decomposition separates the material color (the albedo) from the received illumination at each pixel of an image. This allows edits and recombination with novel lighting or materials. However, faces are not Lambertian, and require complex lighting and material models to more accurately decompose an image into useful intermediate channels. Further, such decompositions are ill-posed, and so must consider how to incorporate assumptions or priors to produce plausible answers.
Our work focuses on the problem of single-image face decomposition using physically-based lighting and material models. First, we consider diffuse and specular reflectance under a Cook-Torrance SVBRDF model, consisting of separate skin albedo, specular scaling coefficient (ρ), and roughness (m) maps. Next, we consider that high-dynamic-range lighting with high spatial frequency is critical for specular appearance. As such, we create realistic synthetic data using real-world face geometry captures, real-world reflectometer measurements of skin, and real-world HDRI illumination with self-shadowing via path tracing.
To produce plausible decompositions, we supervise training of a deep neural network to estimate from a single face image a normal map, albedo map, specular scaling and roughness maps, and an approximate HDR incoming lighting map. Then, as realistic shadowing and glossy reflection rendering is computationally expensive, we use these physically-based maps to predict diffuse shading and specular maps given the lighting as conditioning information. Finally, we reconstruct the outputs using our image formation model. Each intermediate image formation model component (and so network architecture) can be supervised explicitly for stability, with final end-to-end fine tuning.
We operate directly on linear HDR images, as specular illumination components are often clipped or saturated in LDR images. This allows more accurate specular reconstruction. Linearity also makes our decompose-edit-compose pipeline possible without introducing non-linear errors due to tone mapping, which eases later editing and compositing. For instance, such a decomposition allows relighting with plausible specular highlights, along with shading editing and editing of gloss and sharp specular highlights. In comparisons to baselines with simpler lighting and material models, and to pure relighting methods that do not decompose to intermediate maps, our method is better able to reproduce specularity and shading, and so provides more control in editing and more accurate relighting.
In short, our work argues that portrait editing can benefit from learning priors for decompositions through physically-based image formation models. Our contributions are:
• A realistic synthetic face image generation pipeline using publicly available face assets, creating a high-quality synthetic face dataset with specularity and self-occlusions under varying lighting conditions;
• A method to decompose an HDR image via a physically-based image formation model, allowing editing of properties like spatially-varying specular gloss.
Our work helps to shed light on how to accomplish accurate image decomposition without access to expensive light stage captures. We will release our source code and dataset for further research in the community.

Related work
We discuss classic and recent methods that address closely-related problems. We also provide an additional table of closely-related work (Table 1), which relates reflectance models, illumination models, geometry models, and model features, as well as whether code and data are available for each technique.
Intrinsic decomposition. These commonly assume classic monochromatic illumination (MI) [2] or Retinex constraints [17]. Li et al. [19] used statistics of skin reflectance and facial geometry as constraints in an optimization for intrinsic components. Recently, end-to-end learning approaches embed priors in neural networks via synthetic images [14,20]. Better results can be achieved with hybrid training of synthetic and real data [27] or with high-quality real images [21,32].
Skin reflectance models. Face appearance modeling is well-studied in computer graphics. One common approach is the Torrance-Sparrow specular BRDF, as used by Weyrich et al. [34] to develop an analytic spatially-varying face rendering model with measured skin data. Recent analysis works employ it. For example, based upon 3DMM face geometry, Smith et al. [28] build a statistical model for human face appearance including both diffuse and specular albedo. Subsurface scattering is an additional skin appearance component [22] that is computationally expensive to model; similar to most concurrent works, we do not assume this appearance factor in our model.
Face decomposition. With capture setups like light stages, recent approaches have trained deep neural networks to estimate physically-based reflectance from monocular images, usually in a supervised manner [36,27,5,21,18,32]. Sengupta et al. [27] assume Lambertian reflectance, while Nestmeyer et al. [21] and Wang et al. [32] predict specularity in addition, though not with a decomposed skin model. Yamaguchi et al. [36] and Lattas et al. [18] trained deep neural networks to infer high-quality geometry, diffuse, specular albedo, and displacement map from a single image but do not explicitly model illumination. In our work, we avoid the problem of geometry estimation and work only in screen space with normal maps. Finally, differentiable ray tracing can produce accurate reconstructions [10] with more realistic self-shadows and without large databases, though it is computationally expensive.
Lighting representation. Spherical harmonics (SH) capture low-frequency signals efficiently for fast rendering [25,1], and have been used at 2nd order to cheaply model the irradiance onto the face for diffuse shading [24]. Zhou et al. [40] use SH illumination to learn to relight a single input face image. For more accurate decompositions, Kanamori et al. [15] precompute light occlusion in the SH formulation directly for human body relighting.
Environment maps store sampled light and are often in high dynamic range (HDR) for image-based lighting [8]. Many face works choose this representation as it can sample high-frequency signals, though deep learning models often use smaller (32 × 16) maps. Yi et al. [38] trace specular highlights into the scene and obtain an environment map through a deconvolution determined by prior knowledge of face materials. Calian et al. [3] use faces as light probes to estimate HDR lighting from a single LDR photograph, learning priors through an autoencoder. Sun et al. [29] estimate high-frequency environment maps at the bottleneck for portrait image relighting. Nestmeyer et al. [21] assume directional lighting and model specularity as a non-diffuse 'residual' term in their image formation process.

Figure 1. Training dataset comparison. From left to right: SfSNet uses normals derived from 3DMM geometry, SH2-approximated LDR lighting, and diffuse reflectance [27]. DIPR uses SH2-approximated LDR lighting, Lambertian reflectance, and normals derived from a 3DMM fit to CelebA. To improve upon these, our dataset uses realistic captured FaceScape geometry [37], high-frequency HDRI environment maps, and Torrance-Sparrow SVBRDF reflectance with physically-measured parameters mapped from Weyrich et al. [34], leading to increased realism.
Our work takes the decomposition approach with HDR environment maps and a physically-based model of skin. On high-quality supervised data, we show that this can improve editing quality and capability over simpler decomposition and pure relighting approaches.

Dataset generation
High-quality data is important for overall model quality and generalization, but is expensive to acquire via light stages and so is often proprietary. As such, the research community has created synthetic databases for face decomposition and relighting [27,40] (Fig. 1). Our approach increases data realism; we will release scripts to generate our data for further research in face analysis and editing.
Renderer and Shading Model We generate our synthetic dataset in Blender [7] and use the physically-based path tracing renderer Cycles. Our synthetic faces are modeled with Blender's Principled BSDF, which is based on Disney's "PBR" shader, itself derived from the Torrance-Sparrow model [30,31]. The rendering integral for this diffuse and specular model is:

I(x) = ∫_Ω L(ω_i) f(x, ω_i, ω_o) (N · ω_i) dω_i,

where the BRDF combines diffuse and specular terms:

f(x, ω_i, ω_o) = α(x)/π + ρ_s f_s(ω_i, ω_o, m),

with:

f_s = D G F_r / (4 (N · ω_i)(N · ω_o)),

where G is the geometry term, D is the micro-facet distribution, and F_r is the reflective Fresnel term. We have a factor of 4 in the denominator instead of π in the original Torrance-Sparrow paper [30] as we use the GGX micro-facet distribution [31]. The free variables determining specular appearance are the surface normal N, lighting L_Ω, albedo α(x), specular scaling coefficient ρ_s, and roughness m.
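As a concrete illustration, the specular term described above can be evaluated pointwise as in the following plain-NumPy sketch. The specific choices here — α = m without Disney's roughness remapping, a Schlick-GGX geometry term, and a Schlick Fresnel with f0 = 0.04 — are our assumptions for illustration, not the renderer's exact shader:

```python
import numpy as np

def ggx_specular(n, v, l, m, rho_s, f0=0.04):
    """Cook-Torrance-style specular term rho_s * D*G*F / (4 (n.l)(n.v))
    with a GGX micro-facet distribution. n, v, l are unit 3-vectors
    (normal, view, light); m is roughness; rho_s is the scaling coeff."""
    h = (l + v) / np.linalg.norm(l + v)        # half vector
    n_dot_l = max(float(n @ l), 1e-6)
    n_dot_v = max(float(n @ v), 1e-6)
    n_dot_h = max(float(n @ h), 0.0)
    v_dot_h = max(float(v @ h), 0.0)
    a2 = m * m
    # GGX micro-facet distribution D
    d = a2 / (np.pi * (n_dot_h * n_dot_h * (a2 - 1.0) + 1.0) ** 2)
    # Smith geometry term G via the Schlick-GGX approximation
    k = m / 2.0
    g = (n_dot_l / (n_dot_l * (1 - k) + k)) * (n_dot_v / (n_dot_v * (1 - k) + k))
    # Schlick Fresnel F
    f = f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5
    return rho_s * d * g * f / (4.0 * n_dot_l * n_dot_v)
```

The factor of 4 in the final denominator matches the GGX formulation noted in the text; the response is strongest near the mirror direction and falls off with roughness m.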
Geometry and Albedo Our face geometry and diffuse albedo data comes from the large-scale 3D face dataset FaceScape [37], consisting of 18,760 detailed 3D face models and high resolution albedo and displacement maps, captured from 938 subjects each with 20 expressions.
Skin Reflectance We use skin reflectance statistics from the MERL/ETH Skin Reflectance Database [34], which provides per-face-region estimates. For each face in FaceScape's captured data, we find the closest matching face regions in the MERL/ETH dataset using the per-face-region diffuse albedo, and then sample specular roughness m and scaling coefficient ρ_s for the specular response. Rather than constant ρ and m for all face regions across all individuals [32], our approach uses spatially-varying specularity. We split the face into 10 regions and randomly sample the Torrance-Sparrow specular reflectance parameters per face region as in Weyrich et al. [34]. The face regions and variation of parameters are shown in Fig. 15 of their paper. We manually aligned the FaceScape geometry and albedo data to have the same 10 face regions.

Our renderings do not include subsurface scattering, as these numerical parameters are not provided in the MERL/ETH dataset. Rendering low-noise subsurface scattering with a path tracer is computationally expensive, taking more than 30 seconds per image in the Blender (GPU) and Mitsuba (CPU) renderers even for noisy outputs.

Figure 3. Our neural network architecture is composed of three blocks. Blue: First, a decomposition block uses a shared bottleneck to produce the constituent maps for shading. Yellow: Second, a diffuse shading branch uses lighting conditioning [32] in the decoder to produce a shading map (quotient image). Green: Third, a specular shading branch takes skin roughness and scaling maps and, again via lighting conditioning, creates the specular map. Finally, the image is linearly composited.
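The per-region sampling strategy can be sketched as below. The region names and (mean, std) statistics are placeholders for illustration — they are not the actual MERL/ETH measurements, and the paper uses 10 regions rather than the three shown:

```python
import numpy as np

# Hypothetical per-region (mean, std) statistics for the specular scaling
# coefficient rho_s and roughness m -- placeholders, NOT MERL/ETH values.
REGION_STATS = {
    "forehead": {"rho_s": (0.35, 0.05), "m": (0.30, 0.04)},
    "cheek":    {"rho_s": (0.30, 0.06), "m": (0.35, 0.05)},
    "nose":     {"rho_s": (0.45, 0.07), "m": (0.25, 0.03)},
}

def sample_region_params(region_mask, region_names, rng):
    """Fill per-pixel rho_s and m maps by sampling one value per face
    region, giving spatially-varying specularity as described above.
    region_mask holds an integer region label per pixel."""
    rho_map = np.zeros(region_mask.shape, dtype=np.float32)
    m_map = np.zeros(region_mask.shape, dtype=np.float32)
    for idx, name in enumerate(region_names):
        stats = REGION_STATS[name]
        rho = rng.normal(*stats["rho_s"])    # one draw per region
        m = rng.normal(*stats["m"])
        sel = region_mask == idx
        rho_map[sel] = np.clip(rho, 0.0, 1.0)
        m_map[sel] = np.clip(m, 0.01, 1.0)
    return rho_map, m_map
```

Sampling once per region (rather than per pixel) keeps the maps piecewise-constant, matching the per-face-region statistics of the database.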
Lighting representation We use a 32 × 16 × 3 resolution HDR image [21,29,32] rather than an SH2 approximation [27,40]. SH2 approximations cannot capture illumination effects like hard shadows from self-occlusion or accurate specular reflectance (Fig. 2 compares via path tracing). However, using even low-resolution HDR maps is a trade-off, as more parameters must be estimated than for SH2 or spherical Gaussian models, and so a larger neural network is required. This choice of environment lighting representation was also adopted in recent works [29,32].
Later, we will show the importance of higher-frequency illumination in the network's ability to model these complex effects (Figs. 9 and 5). Our equirectangular HDR environment maps are selected from the Laval Indoor HDR Dataset [11]. We choose the environment maps randomly, and replace only for very dark lighting conditions.
Output We render our data on an NVIDIA Quadro RTX 6000 GPU, taking ≈ 18 seconds per image. We export each component as a 512 × 512 image, in 32-bit high dynamic range where appropriate: normal I_N, albedo I_α, lighting l, scaling coefficient I_ρ, roughness I_m, as well as intermediate diffuse shading I_sh (sometimes called a quotient image), specular I_sp, albedo-modulated diffuse shading I_D, and final output I. For reproduction in the paper, all HDR images are tone mapped via the Reinhard operator [26].
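For reference, a minimal global Reinhard-style operator for figure reproduction might look as follows; the paper does not specify its exact Reinhard variant or display gamma, so both are assumptions here:

```python
import numpy as np

def reinhard(hdr):
    """Global Reinhard tone mapping L / (1 + L), applied per channel,
    followed by an assumed 2.2 display gamma. A sketch for reproducing
    HDR figures on LDR displays, not the paper's exact operator."""
    ldr = hdr / (1.0 + hdr)          # compresses [0, inf) into [0, 1)
    return np.clip(ldr, 0.0, 1.0) ** (1.0 / 2.2)
```

The operator is monotonic, so relative brightness ordering is preserved while arbitrarily large HDR values are compressed below 1.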

Decomposition architecture
Given a dataset of face images with generated supervision for a physically-based deep learning approach, we take inspiration from Sengupta et al. [27] and Nestmeyer et al. [21] and design a three-stage approach with decomposition, diffuse shading, and specular branches (Fig. 3).
1. Decomposition branch. This takes as input a single portrait image I and decomposes it into the diffuse albedo map (Î_α), surface normal map (Î_N), specular reflectance parameter maps Î_ρ and Î_m, and illumination (l). The decomposition branch must extract all relevant information from the face, and it is important that the features embedded in the bottleneck are not invariant to lighting, as otherwise it would be impossible to predict an environment map.

1a. Illumination block. Fig. 4 shows the detailed architecture of the illumination block. This block estimates a high-frequency 16 × 32 × 3 environment map from the bottleneck encoded from the input image. Taking inspiration from Hu et al. [13] and Sun et al. [29], our method decomposes 256 localized environment maps (referred to as the illumination basis in the figure) and 256 corresponding confidence maps. Then, these are combined in a weighted sum to form the estimated environment map. This network architecture was first proposed by Hu et al. [13] as a "confidence learning" approach for color constancy, and then adapted by Sun et al. [29] for low-resolution environment map prediction.
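The confidence-weighted combination at the end of the illumination block can be sketched as below. Array shapes and names are our own; the real block learns both the bases and the confidences end to end:

```python
import numpy as np

def combine_basis_maps(bases, confidences):
    """Combine localized environment-map bases into one estimate via a
    confidence-weighted sum, in the spirit of the confidence-learning
    illumination block described above.
    bases:       (K, 16, 32, 3) candidate environment maps.
    confidences: (K, 16, 32) non-negative per-pixel confidences."""
    # Normalize confidences across the K bases at each pixel.
    conf = confidences / (confidences.sum(axis=0, keepdims=True) + 1e-8)
    # Weighted sum over the basis dimension.
    return (bases * conf[..., None]).sum(axis=0)
```

With uniform confidences this reduces to a plain average of the bases; in practice the learned confidences let different bases dominate different regions of the sphere.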

2. Diffuse shading branch. This takes a normal map as input with illumination conditioning to produce the shading layer (Î_sh). It is built from a U-Net autoencoding architecture with skip connections, with additional lighting conditioning on the decoder. Inspired by Wang et al. [32], we observed that illumination is best fed as a feature-wise linear modulation at the up-convolution layers, analogous to AdaIN in StyleGAN [16], rather than as concatenation at the input stage or at the bottleneck stage [29].
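The feature-wise linear modulation idea can be sketched in plain NumPy as below. The layer shapes and the `(1 + gamma)` parameterization are our assumptions for illustration; the real conditioning layers are trained jointly with the U-Net:

```python
import numpy as np

class LightingFiLM:
    """Feature-wise linear modulation of a decoder feature map by a
    lighting code: feats * (1 + gamma) + beta, where gamma and beta come
    from a learned linear map of the lighting code. Weights here are
    small random placeholders standing in for trained parameters."""
    def __init__(self, light_dim, channels, rng):
        self.w = rng.standard_normal((light_dim, 2 * channels)) * 0.01
        self.b = np.zeros(2 * channels)

    def __call__(self, feats, light_code):
        # feats: (C, H, W); light_code: (light_dim,)
        gb = light_code @ self.w + self.b
        c = feats.shape[0]
        gamma, beta = gb[:c], gb[c:]
        return feats * (1.0 + gamma)[:, None, None] + beta[:, None, None]
```

With zero weights the layer is the identity, so the network can learn to ignore lighting where it is irrelevant; concatenation-based conditioning does not have this property as directly.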
3. Specular branch. This is also a U-Net with illumination conditioning in the decoder. It takes a normal map and the specular reflectance parameter maps Î_ρ and Î_m as input to produce the specular layer Î_sp.
Final output. We construct the final image Î simply as:

Î = Î_α ⊙ Î_sh + Î_sp.

Following Wang et al. [32], our final image is created from a linear combination of estimated shadings, making editing operations on the inferred maps easier and faster.
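Because the composite is linear, editing a single component is a pointwise operation. A minimal sketch (array names are our own) showing both composition and a gloss edit that scales only the specular layer:

```python
import numpy as np

def compose(albedo, shading, specular):
    """Final linear composite: albedo * shading + specular,
    matching the image formation model described above."""
    return albedo * shading + specular

def edit_gloss(albedo, shading, specular, gain):
    """Scale only the specular layer; the diffuse term
    albedo * shading is untouched by the edit."""
    return compose(albedo, shading, gain * specular)
```

Setting `gain = 0` removes gloss entirely while leaving the diffuse appearance intact, which is exactly the kind of selective edit a non-linear (e.g. tone-mapped) composite would not permit so cleanly.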

Training
Two-stage training Each branch is initially trained separately on 128 × 128 input images, then all three are combined and fine-tuned end to end for reconstruction. The intuition behind the two-stage training process is that, for practical reasons, it is more efficient to tune the hyperparameters of a large multi-decoder network this way than to train it end-to-end from scratch.
HDR space and data normalization Unlike previous works that operate on LDR images, we use HDR images, as the linearity of pixel intensities and the environment map is critical for accurate specular reconstruction and for avoiding artifacts like clipping.
Given that we are operating with HDR images, data normalization becomes critical to network training. Simple standardization fails to reconcile the large differences in distribution between the input image and each reflectance component, leading to unstable training. As such, following Weber et al. [33], we use normalization techniques (Alg. 1) that preserve the dynamic range prior to standardization and can be reversed after inference.
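One way to realize such a reversible normalization is to range-compress with a log before standardizing, keeping the statistics for inversion. This is a sketch of the idea only; Weber et al. [33] and Alg. 1 give the exact per-component routines:

```python
import numpy as np

def hdr_normalize(x, eps=1e-6):
    """Range-compress an HDR map with log1p, then standardize.
    Returns the normalized map plus the (mean, std) statistics
    needed to invert the transform after inference."""
    y = np.log1p(np.maximum(x, 0.0))       # compress dynamic range
    mu, sigma = y.mean(), y.std() + eps
    return (y - mu) / sigma, (mu, sigma)

def hdr_denormalize(z, stats):
    """Invert hdr_normalize, recovering linear HDR values."""
    mu, sigma = stats
    return np.expm1(z * sigma + mu)
```

The log step keeps bright specular peaks from dominating the statistics, while the stored `(mu, sigma)` make the pipeline lossless up to floating-point error.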
Losses Given our rich synthetic data, we penalize L_1 supervised losses on all components. For the HDR illumination (l), we penalize a weighted-log L_1 loss, weighted by the solid angle of each pixel over the sphere [33]. We set λ_ρ, λ_m = 1.0; λ_N, λ_α, λ_sh, λ_D, λ_sp, λ_I = 0.8; and λ_l = 0.1, where λ_i is the weight on the respective loss for component i. Please see our supplemental material for additional architectural layer and training loss details.

Algorithm 1. Data normalization routine. Notation: normal I_N, albedo I_α, illumination l, specular scaling coefficient map I_ρ, specular roughness map I_m, diffuse shading (quotient image) I_sh, specular shading I_sp, diffuse lit face I_d, and input image I. Input: set L of HDR environment maps.
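The solid-angle-weighted log loss on the environment map can be sketched as below. The exact log transform and normalization are our assumptions; only the weighting by per-pixel solid angle is stated in the text:

```python
import numpy as np

def solid_angle_weights(h, w):
    """Per-pixel solid angle of an h x w equirectangular map.
    Rows near the poles subtend less solid angle; the weights
    sum to (approximately) the full sphere, 4*pi."""
    theta = (np.arange(h) + 0.5) / h * np.pi          # polar angle per row
    row = np.sin(theta) * (np.pi / h) * (2.0 * np.pi / w)
    return np.repeat(row[:, None], w, axis=1)

def weighted_log_l1(pred, target, eps=1e-6):
    """Solid-angle-weighted L1 loss in log space for HDR
    environment maps (a sketch of the loss described above)."""
    w = solid_angle_weights(*pred.shape[:2])
    diff = np.abs(np.log(pred + eps) - np.log(target + eps))
    if diff.ndim == 3:                  # broadcast over color channels
        w = w[..., None]
    return (w * diff).sum() / w.sum()
```

Weighting by solid angle prevents the heavily oversampled polar rows of the equirectangular parameterization from dominating the loss.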

Experiments
Dataset We render 100 face identities each under 25 random (of 100) different illumination environments, producing 2,500 training samples. Then, we render a test set of 100 other face identities each under 10 random (of a set of 20) test illuminations, producing 1,000 test samples. For our qualitative results, we show examples from only ten authorized identities [37], none of which are in the training set or the test set used for numeric comparisons.
Baselines For decompositions, we compare to public baselines. We consider SfSNet from Sengupta et al. [27], which is a 2D single-image decomposition approach that assumes SH2 lighting and diffuse reflectance only. We retrain SfSNet on our more realistic image data. We also compare to AlbedoMM from Smith et al. [28], which is a geometric fitting approach based on 3DMM with diffuse and specular statistical model components. This cannot be retrained on our data. Finally, for relighting, we compare to DIPR from Zhou et al. [40], which uses an SH2 bottleneck to relight without decomposition.
Ablation-No ρ or m Fig. 10 demonstrates that estimating a specular map directly from normals, bypassing separate ρ and m maps, produces substantially less accurate specular results for our architecture.
Results-Quantitative evaluation In Table 3, we quantitatively compare our decomposition and reconstruction results with SfSNet using L1 (the loss that both methods penalize), mean-squared error, and perceptual LPIPS [39] metrics. Our method produces more accurate reconstructions overall, with equivalent albedo estimates and better shading estimates. Given SfSNet's assumptions, we also show results when only diffuse effects are in the input (column 'Diffuse') for reconstruction without specular effects.
Here, our method shows smaller gains over SfSNet. For specularity, we compare to AlbedoMM [28] in Table 2. We use their probabilistic fitting pipeline to estimate 3DMM parameters. The estimated specular maps are of lower quality, partly because of geometric fitting inaccuracies.
Results-Qualitative evaluation Albedo Estimation: Ours vs. SfSNet: In Figures 6 and 8, we qualitatively compare the diffuse albedo estimates from our decomposition branch with SfSNet's. In Fig. 6, we show that our network can predict consistent albedo for the same individual across different illumination conditions, while SfSNet is less able to do so despite having been trained on the same data. Without explicit specular handling, SfSNet tends to bake specular effects into the albedo layer, making them look more like the input images. Our approach does not do this as it explicitly reconstructs specular reflectance, which we will later show is important for editing and relighting tasks.
Diffuse Shading Estimation: Ours vs. SfSNet: In Figure 9, we compare our shading layer estimation to SfSNet on hand-picked illumination conditions where the light causes self-occlusions. Our network's use of high-frequency illumination lets it construct more complex diffuse shading with self-occlusion effects, which is important for capturing realistic illumination. SfSNet's SH2 lighting assumption prevents that model from capturing these effects.

Specular Estimation: Ours vs. AlbedoMM: In Figure 7, we compare our specular layer estimation to AlbedoMM, which is one of the only published specular models and attempts to solve a 3D fitting problem. We show our model's ability to capture specular effects under varying illumination from a 2D image.

Figure 8. Specular separation. With its explicit specular handling, our model does not bake specular effects into the albedo. Without explicitly modeling specular, SfSNet tends to bake this appearance into the albedo image, causing it to look closer to the reconstruction.

Applications
Relighting: Ours vs. DIPR vs. SfSNet. Relighting takes as input a portrait image and a target illumination; some approaches tackle this through decomposition and others attempt to more directly learn a relighting function [40]. We compare our results with DIPR [40] and SfSNet. Besides the limitation that both approaches use 2nd-order SH lighting, DIPR also assumes monochromatic lighting. Figure 5 shows relighting under various target illuminations. Both DIPR and SfSNet fail to successfully unbake illumination effects: DIPR does not explicitly model the reflectance components, and SfSNet has a Lambertian assumption. Our approach fares better.

Specular reflectance editing. Finally, we show editing of the specular maps as another application of our approach (Fig. 11). Because of our image formation model, we are able to selectively edit the desired component and preserve all components we chose not to alter.

Table 3. Decomposition quantitative evaluation performance by L1, MSE, and perceptual LPIPS [39] metrics. † SfSNet was re-trained on our data. With specular effects in the input images ('Reconstruction'), our approach is better than SfSNet. Without specular in the input ('Diffuse'), our reconstruction quality is slightly improved thanks to more accurate shading estimation, though albedo estimates are slightly better for SfSNet. Note: we show LPIPS on Albedo and Shading for completeness, though this perceptual metric may be less meaningful on these less natural images.

Figure 9. Modeling higher-frequency illumination than SH2 produces shading that is closer to the path-traced ground truth.

Additional Results
We show additional results of our decomposition approach and compare it with SfSNet [27] and ground truth in Figures 12 and 13 at the end of the paper. We also show an example on a real-world image taken from the FaceScape dataset (Fig. 15).

Limitations
The FaceScape [37] dataset contains mostly Asian faces of limited skin tone variation, which limits how well the priors learned from the data transfer to other skin tones. Rather than build a practical system that one might deploy in the real world, our work only attempts to show academically that better synthetic data and image formation can improve face decomposition. Further, capturing albedo maps from human subjects is complex, and some of the data we use still has baked illumination components from shadowing in fine geometric detail.
Overcoming domain gaps is still a challenge. Chandran and Winberg et al. [4] propose a technique to project ray-traced faces into StyleGAN's latent space. This lets synthetic renderings retain their generated skin surface details, and lets StyleGAN fill in missing details like eyes, inner mouth, hair, and background, all while respecting global scene illumination and camera pose. Such an approach allows supervised synthetic training to generalize to real-world data.
Another limitation is missing subsurface scattering effects, which depend on the incoming illumination and the variable diffusivity of skin itself. While Blender supports subsurface effects, and while it is possible to map Christensen-Burley's d parameter (approximate scattering distance) to the scattering coefficient σ′_s and absorption coefficient σ_a captured by Weyrich et al. [34], we do not use Christensen-Burley's [6] approximation of subsurface scattering because (a) there is a lack of captured priors from ETH/MERL, which makes realistic parameter selection difficult, and (b) it is computationally expensive to path trace with low noise. We show an example of the difference between images with and without subsurface scattering (Fig. 14). Effects are typically most visible in thin regions of skin, such as the ears or nostrils.
Due to path tracing, slight render noise is present in the training data. One side effect of using convolutional neural networks of limited capacity is that they learn this random high-frequency patterning very late (if at all), and as such our output maps are not noisy.
Finally, our work estimates face decompositions. Moving to the more general case of portraits requires modeling geometric occlusion from external objects (world, hair, trees, etc.) which we cannot deal with, and other complex optical effects from face accessories like glasses. Future work should investigate how to combine differentiable path tracing with face modeling to capture refraction effects.

Conclusion
We present a method to decompose a single face image into physically-based channels that are useful for editing applications. Our approach renders a more realistic dataset than previously available, and then uses supervised deep learning to encode priors that predict individual image formation components. We demonstrate that this approach is more successful than three recent methods with public codebases, particularly for specular reflections. Going forward, our work demonstrates the value of structuring deep learning frameworks around physically-based image formation models for more accurate reconstruction and editing.

Figure 10. Predicting a specular layer with ρ and roughness maps as input is beneficial. 'w/o maps' indicates the specular branch was trained with only the normal map as input, while 'w/ maps' indicates training with both ρ and roughness maps provided as input along with the normal map.

Declarations
Consent for publication: We commit to the License Agreement of the FaceScape Dataset; all faces shown in our paper are from the stated publishable list.
Availability of data and materials: Image samples, code, and results are released at https://github.com/brownvc/phaced.
Funding: We thank Brown University for support; Qian Zhang thanks the Andy van Dam PhD Fellowship. We declare no competing interests.