Learning conditional photometric stereo with high-resolution features

Photometric stereo aims to reconstruct 3D geometry by recovering the dense surface orientation of a 3D object from multiple images under differing illumination. Traditional methods normally adopt simplified reflectance models to make the surface orientation computable. However, the real reflectances of surfaces greatly limit the applicability of such methods to real-world objects. While deep neural networks have been employed to handle non-Lambertian surfaces, these methods are subject to blurring and errors, especially in high-frequency regions (such as crinkles and edges), caused by spectral bias: neural networks favor low-frequency representations and therefore exhibit a bias towards smooth functions. In this paper, therefore, we propose a self-learning conditional network with multi-scale features for photometric stereo, avoiding blurred reconstruction in such regions. Our explorations include: (i) a multi-scale feature fusion architecture, which maintains high-resolution representations and deep feature extraction simultaneously, and (ii) an improved gradient-motivated conditionally parameterized convolution (GM-CondConv) in our photometric stereo network, with different combinations of convolution kernels for varying surfaces. Extensive experiments on public benchmark datasets show that our calibrated photometric stereo method outperforms the state-of-the-art.


Introduction
The goal of photometric stereo is to recover the dense surface orientation of a 3D object from varying shading cues, with a fixed camera, by establishing the relationship between two-dimensional images and the object geometry [1]. The earliest photometric stereo algorithm reconstructed the surface normal based on the Lambertian assumption [2]. Unfortunately, real-world objects hardly ever have the property of Lambertian reflectance, and therefore robust methods are needed to deal with objects with more general reflectance properties [3]. Traditional photometric stereo methods mainly address this problem by treating non-Lambertian regions as outliers [4,5], or adopt bidirectional reflectance distribution functions (BRDFs) to model general reflectance [6,7]. However, these traditional models are only accurate for limited categories of materials and suffer from unstable optimization.
Recently, deep learning frameworks have shown powerful capabilities for various tasks [8-10]. In particular, researchers have made efforts to learn general reflectance models through deep neural networks to solve the problem of photometric stereo. DPSN [11] first addressed non-Lambertian photometric stereo using a deep fully-connected network, to learn the surface normal in a per-pixel manner. Later, a series of methods employed convolutional neural networks (CNNs) to better utilize adjacent information embedded in images, such as PS-FCN [12], SDPS-Net [13], Manifold-PSN [14], and IRPS [15]. However, these methods suffer from blurring, especially in high-frequency regions (e.g., crinkles and edges). This phenomenon is caused by spectral bias [16], in which neural networks favor low-frequency representations and therefore exhibit a bias towards smooth functions. Unfortunately, these regions are exactly those to which the human visual system pays attention and consequently should be reconstructed accurately.
Existing photometric stereo networks pass the input through high-to-low resolution subnetworks that are connected in series, and then raise the resolution; these procedures cause information loss and result in blurring. Furthermore, existing photometric stereo networks employ the same learning strategy in all surface regions. The patterns to be learned vary substantially from plain surfaces to high-frequency surfaces, and thus errors arise from using the same learning strategy everywhere. Therefore, it remains urgent yet challenging to develop a robust and efficient photometric stereo method that can avoid blurring and accurately reconstruct objects' surface orientation.
In this paper, we propose a conditional deep neural network with a high-resolution structure, called CHR-PSN, for estimating the surface normals of objects. In contrast to existing methods, our framework reduces the error and blurring, especially for surfaces with high-frequency details. Extensive experiments on public datasets show that CHR-PSN achieves state-of-the-art performance. Our contributions are as follows.
Firstly, inspired by the High-resolution Net [17] for human pose estimation, we are the first to employ a parallel network structure that maintains both deep features and high-resolution details of surface normals. We show that high-resolution information in extracted features is essential to the per-pixel surface normal estimation task, a point which has not been explored in learning-based or data-driven photometric stereo.
Secondly, we investigate an improved gradient-motivated conditionally parameterized convolution module (GM-CondConv) [18] in the regression stage of our network, where frequency information in surface representations is integrated into the routing function. We show that the GM-CondConv module can regress the surface normal with high-frequency details.

Background
The imaging model establishes the relationship between the surface normal $\mathbf{n} \in \mathbb{R}^3$ and visual observations $I$ in a per-pixel manner. By introducing the general BRDF $\rho$ of the object and illumination direction $\mathbf{l}$ with intensity $e$, photometric stereo recovers the surface orientation from a combination of multiple images with differing illumination directions, as follows:
$$ I_j = e_j \, \rho \, \max(\mathbf{n}^{\top} \mathbf{l}_j, 0) + \epsilon_j $$
where the subscript $j$ indexes the input, $\max(\mathbf{n}^{\top} \mathbf{l}_j, 0)$ accounts for attached shadows, and $\epsilon_j$ accounts for noise (such as inter-reflections).
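As a point of reference for the imaging model, the following sketch covers only the Lambertian special case (constant BRDF, unit light intensity, no noise, no shadows), in which three observations determine the albedo-scaled normal through a 3 × 3 linear system, as in the least squares baseline [2]. The function names, light directions, and albedo value are illustrative assumptions and are not taken from the paper.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def render_lambertian(normal, light, albedo=1.0, intensity=1.0):
    """I = e * rho * max(n^T l, 0) with a constant (Lambertian) rho."""
    shading = sum(n * l for n, l in zip(normal, light))
    return intensity * albedo * max(shading, 0.0)

def det3(M):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def solve3(A, b):
    """Solve the 3x3 system A x = b with Cramer's rule."""
    d = det3(A)
    x = []
    for k in range(3):
        Ak = [row[:] for row in A]
        for r in range(3):
            Ak[r][k] = b[r]          # replace column k by the right-hand side
        x.append(det3(Ak) / d)
    return x

# Hypothetical pixel: one normal observed under three known light directions.
true_normal = normalize([0.2, -0.1, 1.0])
lights = [normalize(l) for l in ([0.5, 0.0, 1.0], [-0.3, 0.4, 1.0], [0.0, -0.5, 1.0])]
observations = [render_lambertian(true_normal, l, albedo=0.7) for l in lights]

# With no shadows, stacking the lights row-wise gives L (rho * n) = I.
scaled_normal = solve3(lights, observations)
albedo = math.sqrt(sum(c * c for c in scaled_normal))
recovered_normal = [c / albedo for c in scaled_normal]
```

For general BRDFs the product of reflectance and shading is no longer linear in the normal, which is why the methods discussed below either reject non-Lambertian observations or learn the mapping.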
To extend photometric stereo to work with unknown general BRDF $\rho$ in practice, researchers have investigated different strategies. We divide them into non learning-based methods and deep learning-based methods.

Non learning-based methods
Generally, traditional photometric stereo techniques aim to solve the ill-posed problem of recovering the surface normal under unknown reflectance. Here, we briefly introduce these non learning-based photometric stereo techniques, divided into sophisticated reflectance methods and outlier rejection methods. More comprehensive surveys can be found in Refs. [19,20].
Sophisticated reflectance methods are applied to model and approximate non-Lambertian reflectance. In this direction, many models have been proposed to fit nonlinear analytic BRDFs, such as bivariate functions [21,22], the Ward reflectance model [23,24], the specular spike reflectance model [25,26], the Blinn-Phong reflectance model [27], and the Torrance-Sparrow reflectance model [28]. However, these sophisticated reflectance methods are generally useful only for limited categories of surfaces, as reflectance properties change significantly from material to material.
Outlier rejection methods treat non-Lambertian regions (such as specularities and cast shadows) as outliers that should be discarded. A range of outlier rejection based photometric stereo algorithms have been proposed, such as maximum-likelihood estimation [29], low rank approximation [4,5], a RANSAC method [30], and a maximum feasible subsystem method [31]. However, these methods assume outliers to be local and sparse, and cannot handle surfaces with broad and soft specularities.

Deep learning-based methods
Inspired by the powerful fitting ability of deep neural networks [32,33], deep learning-based methods have been introduced to solve the non-Lambertian photometric stereo problem. DPSN [11] first applied a fully-connected architecture to non-Lambertian photometric stereo in a per-pixel manner. Some works use an observation map, which rearranges observed per-pixel intensities according to the light direction, to recover surface normals, such as CNN-PS [34], LMPS [35], and SPLINE-Net [36]. PS-FCN [12] and SDPS-Net [13] employed fully-convolutional networks to learn the surface normal from input patches with neighborhood embedding. IRPS [15] further proposed an unsupervised learning framework that predicts surface normals by minimizing the loss of reconstructed images. However, existing networks pass the input through high-to-low resolution subnetworks connected in series, and then increase the resolution; these approaches cause blurring of predicted surface normals.
Recently, Attention-PSN [37] proposed an adaptive attention-weighted loss to improve the performance in various surface regions. Using the self-supervised weights of a detail-preserving gradient loss, the method achieves better reconstruction results in high-frequency surface regions. However, we argue that the detail-preserving gradient loss only constrains the high-frequency structure of the surface and does not improve the accuracy of the predicted normal, i.e., the gradient loss dilutes supervision of the normal. Furthermore, Attention-PSN only uses the adaptive loss function to improve the details but ignores the impact of unsuitable kernels and receptive fields in the convolutional layers, which is the essential cause of blurring in high-frequency regions.
Other reconstruction approaches also address the frequency problem. Mildenhall et al. [38] proposed a method for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function. This method represents high-frequency scene content, by using a positional encoding to map each input 5D coordinate into a higher dimensional space. Liu et al. [39] introduced a wavelet-based network to remove Moiré patterns, using the fact that high-frequency features may be highlighted in wavelet sub-bands.

Proposed method
In this section, we present our conditional deep photometric stereo network with high-resolution features. Our goal is to improve accuracy and reduce blurring of surface normal estimates. The architecture of the proposed CHR-PSN is shown in Fig. 1.

Feature extraction stage
As shown in Fig. 1, we first fuse the input images with their illumination direction in the feature fusion stage. For an object captured under $j$ illumination directions, we expand each direction $\mathbf{l}_j$ to form a 3-channel image having the same spatial resolution as the input image ($H \times W \times 3$), and concatenate it with the corresponding image $I_j$ to form $\Phi_j \in \mathbb{R}^{H \times W \times 6}$.
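The input construction can be illustrated with a short sketch: the light direction is broadcast to every pixel and appended to the RGB channels, giving six channels per pixel. Nested Python lists stand in for tensors here, and the function name and toy values are assumptions made for illustration.

```python
def fuse_image_and_light(image, light):
    """Append a 3-vector light direction to every pixel.

    image: H x W x 3 nested lists (RGB); light: list of 3 floats.
    Returns an H x W x 6 nested list, i.e. the image concatenated
    channel-wise with the light direction expanded to H x W x 3.
    """
    return [[list(pixel) + list(light) for pixel in row] for row in image]

# Toy 2 x 3 image and one (hypothetical) unit light direction.
H, W = 2, 3
image = [[[0.1 * (r + c), 0.2, 0.3] for c in range(W)] for r in range(H)]
light = [0.0, 0.6, 0.8]
phi = fuse_image_and_light(image, light)
```

Every pixel of `phi` carries the same three light channels, so a convolution sees the lighting alongside the intensities at each location.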
The feature extraction stage of our network can be seen as a $j$-fold multi-branch shared-weight feature extraction network, which can be expressed as
$$ \Psi^{FR}_j, \Psi^{HR}_j, \Psi^{QR}_j = F_{ext}(\Phi_j; \theta_{ext}) $$
where $F_{ext}$ is the multi-scale feature architecture with learnable parameters $\theta_{ext}$, inspired by the High-resolution Net [17]. We employ a parallel network structure for extracting three scales of features, which avoids having to recover high-resolution feature maps from low-resolution ones. Therefore, our feature extraction maintains both the deep features and high-resolution details of surface normals. As shown in Fig. 1, the down-sampling operations are executed through convolutional layers with stride 2 (double down-sampling) or 4 (twice double down-sampling), and the up-sampling operations are executed through bilinear up-sampling and 1 × 1 convolutional layers that adjust the number of feature channels to match the high-resolution feature. The fusion of high-to-low and low-to-high processes into the same-resolution features is executed through skip connections. Therefore, our feature extraction method outputs features at three different resolutions: full resolution ($\Psi^{FR}_j$), half resolution ($\Psi^{HR}_j$), and quarter resolution ($\Psi^{QR}_j$).
We also introduce an edge-preserving layer for each $I_j$ as follows:
$$ \Omega^{FR}_j = F_{edge}(I_j) $$
where $F_{edge}$ is the edge-preserving layer, calculated as the gradient of input image $I_j$. $\Omega^{FR}_j \in \mathbb{R}^{H \times W \times 3}$ is the output with high-frequency edge information, which is used in the improved CondConv module of the regression stage.
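The text states only that the edge-preserving layer is the gradient of the input image; the exact operator is not given. The sketch below assumes a central-difference gradient magnitude with clamped borders, applied to one channel at a time, purely to illustrate how such a layer responds to flat and edge regions. The function name and the choice of operator are assumptions.

```python
import math

def gradient_magnitude(channel):
    """Central-difference gradient magnitude of a 2D list (borders clamped)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = (channel[y][min(x + 1, w - 1)] - channel[y][max(x - 1, 0)]) / 2.0
            gy = (channel[min(y + 1, h - 1)][x] - channel[max(y - 1, 0)][x]) / 2.0
            out[y][x] = math.hypot(gx, gy)
    return out

# A constant patch has no edges; a vertical step produces a response at the step.
flat = [[0.5] * 4 for _ in range(4)]
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
flat_edges = gradient_magnitude(flat)
step_edges = gradient_magnitude(step)
```

Applying the function to each of the three colour channels would give an output with the same H × W × 3 shape as the input image.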

Fusion stage
In the fusion stage, we apply multi-scale max-pooling operations [12,37] to fuse the $j$ features into one, so our network can handle an arbitrary number of inputs while remaining trainable by backpropagation. We argue that max-pooling extracts the most salient information from all features, while average-pooling may smooth out useful features and be impacted by non-activated features. Here, the subscript $p$ indexes position in the feature:
$$ \Omega^{FR}_{max}(p) = \max_j \Omega^{FR}_j(p), \qquad \Psi^{s}_{max}(p) = \max_j \Psi^{s}_j(p), \quad s \in \{FR, HR, QR\} $$
where $\Omega^{FR}_{max}$, $\Psi^{FR}_{max}$, $\Psi^{HR}_{max}$, and $\Psi^{QR}_{max}$ are the fused features.
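A minimal sketch of the fusion step, assuming each per-image feature has been flattened to a list of the same length: the fused feature is the element-wise maximum over the inputs, so its size does not depend on how many images are supplied, and the result does not depend on their order. The function name and values are illustrative.

```python
def max_pool_fusion(features):
    """Element-wise maximum over a list of equally sized flat feature vectors."""
    return [max(values) for values in zip(*features)]

# Three hypothetical per-image features with four positions each.
features = [
    [0.1, 0.9, -0.2, 0.0],
    [0.4, 0.3, -0.5, 0.0],
    [0.2, 0.7, 0.6, 0.0],
]
fused_all = max_pool_fusion(features)             # uses all three inputs
fused_two = max_pool_fusion(features[:2])         # same code, fewer inputs
fused_permuted = max_pool_fusion(features[::-1])  # input order does not matter
```

The same operation would be applied independently at each of the three feature resolutions and to the edge features.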

Regression stage
The normal regression stage takes $\Omega^{FR}_{max}$, $\Psi^{FR}_{max}$, $\Psi^{HR}_{max}$, and $\Psi^{QR}_{max}$ as inputs and regresses the predicted surface normals $\hat{N}$, by $F_{reg}$ with learnable parameters $\theta_{reg}$, as follows:
$$ \hat{N} = F_{reg}(\Omega^{FR}_{max}, \Psi^{FR}_{max}, \Psi^{HR}_{max}, \Psi^{QR}_{max}; \theta_{reg}) $$
In the regression stage, we first employ transposed convolution operations to up-sample the low-resolution features $\Psi^{HR}_{max}$ and $\Psi^{QR}_{max}$ to the full resolution of $H \times W$ (two transposed convolutions and one regular convolution for $\Psi^{QR}_{max}$, one transposed convolution for $\Psi^{HR}_{max}$). As shown in Fig. 1, we employ concatenation to fuse the two up-sampled features and the full resolution feature, rather than the skip connections used in the feature extraction stage.
To better reconstruct details of objects and remove blurring in high-frequency regions, we propose an improved GM-CondConv module in the regression stage [18], motivated by the observation that previous methods use the same learning strategy for all surface regions, causing blurring and error. By parameterizing the convolutional kernel conditionally on the input, the network can give accurate estimates for both simple surface regions and high-frequency surface regions (crinkles, edges). In particular, we concatenate the high-frequency edge information $\Omega^{FR}_{max}$ with the previous layer feature $x$. We argue that the frequency information helps the routing function choose among the learned kernels, each of which is better suited to predicting a different kind of surface normal region. Therefore, the convolutional kernels in GM-CondConv are parameterized as
$$ \mathrm{Output}(x) = \sigma\big((\alpha_1 W_1 + \cdots + \alpha_n W_n) * x\big) $$
where $W_i$ is the $i$-th learned convolution kernel, each $\alpha_i = r_i(x, \Omega^{FR}_{max})$ is an example-dependent scalar weight computed using a routing function with learned parameters, $n$ is the number of weights ($n = 5$ in our default setting), and $\sigma$ is the Leaky ReLU activation function. Following CondConv [18], we compute example-dependent routing weights $\alpha_i = r_i(x, \Omega^{FR}_{max})$ from the layer input in three steps: global average pooling, a fully-connected layer, and sigmoid activation:
$$ r(x, \Omega^{FR}_{max}) = \mathrm{Sigmoid}\big(\mathrm{GlobalAveragePool}([x, \Omega^{FR}_{max}]) \, R\big) $$
where $[\cdot, \cdot]$ denotes channel-wise concatenation and $R$ is a matrix of learned routing weights mapping the pooled inputs to $n$ expert weights. We finally employ L2 normalization of the predictions, giving $\hat{N}$.
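The routing mechanism can be sketched in a few lines. The version below is a simplified illustration, not the authors' implementation: it assumes 1 × 1 kernels (so each expert is a small matrix), flattens the spatial dimensions, uses a Leaky ReLU slope of 0.1, and applies the mixed kernel to the feature x only. All names, shapes, and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaky_relu(v, slope=0.1):
    # The slope value is an assumption; the text does not state it.
    return v if v > 0 else slope * v

def global_average_pool(feature):
    """feature: list of channels, each a flat list of pixel values."""
    return [sum(channel) / len(channel) for channel in feature]

def routing_weights(x, edge, R):
    """alpha = sigmoid(GAP(concat(x, edge)) R); R has one row per pooled channel."""
    pooled = global_average_pool(x + edge)   # list '+' concatenates channels
    n_experts = len(R[0])
    return [sigmoid(sum(p * R[c][i] for c, p in enumerate(pooled)))
            for i in range(n_experts)]

def mix_kernels(alphas, experts):
    """Weighted sum of expert kernels (here 1x1 kernels: C_out x C_in matrices)."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [[sum(a * W[r][c] for a, W in zip(alphas, experts)) for c in range(cols)]
            for r in range(rows)]

def cond_conv_1x1(x, edge, R, experts):
    """Mix the experts once per example, then apply a single 1x1 convolution."""
    kernel = mix_kernels(routing_weights(x, edge, R), experts)
    n_pixels = len(x[0])
    return [[leaky_relu(sum(kernel[o][c] * x[c][p] for c in range(len(x))))
             for p in range(n_pixels)] for o in range(len(kernel))]

# Toy example: 2 feature channels, 1 edge channel, 4 pixels, 2 experts.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
experts = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, -1.0]]]
R = [[0.0, 0.0], [0.0, 0.0], [2.0, -2.0]]   # routing driven by the edge channel only
smooth_edge = [[0.0, 0.0, 0.0, 0.0]]
sharp_edge = [[1.0, 1.0, 1.0, 1.0]]
alphas_smooth = routing_weights(x, smooth_edge, R)
alphas_sharp = routing_weights(x, sharp_edge, R)
out_smooth = cond_conv_1x1(x, smooth_edge, R, experts)
```

In this toy setting the same feature x is processed by a different effective kernel depending on the edge input, which is the behaviour the gradient-motivated routing is meant to provide.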

Loss function and training procedure
Learning in our network is supervised by the angular error between the estimated and the ground-truth surface normals. We optimize network parameters $\theta_{ext}$ and $\theta_{reg}$ by minimizing the cosine similarity loss:
$$ L_{normal} = \frac{1}{HW} \sum_{p} \left(1 - \hat{N}_p \cdot N_p\right) $$
where $\hat{N}_p$ and $N_p$ denote the estimated and ground-truth normals respectively at pixel $p$. If the estimated normal $\hat{N}_p$ at pixel $p$ has similar orientation to the ground-truth $N_p$, then $\hat{N}_p \cdot N_p$ will be close to 1 and the loss $L_{normal}$ will approach 0. Our network is implemented in PyTorch [40] on an RTX 2080Ti GPU, and the Adam optimizer [41] is used with default settings, with the learning rate initially set to 0.001 and divided by 2 every 5 epochs. We train the model using a batch size of 32 for 40 epochs, with $j = 32$ for each sample in training, while our network can accept an arbitrary number of inputs $j$ in testing. Also, we set the resolution to $H = W = 32$ in training; an arbitrary resolution can be used in testing.
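The loss and the learning rate schedule described above can be written compactly. This is a plain-Python sketch rather than the PyTorch training code: normals are lists of unit 3-vectors, and the schedule assumes epochs are counted from 0 with the halving applied at epochs 5, 10, 15, and so on. Function names are illustrative.

```python
def cosine_loss(pred, target):
    """Mean of 1 - <n_hat, n> over pixels; inputs are lists of unit 3-vectors."""
    total = 0.0
    for n_hat, n in zip(pred, target):
        total += 1.0 - sum(a * b for a, b in zip(n_hat, n))
    return total / len(pred)

def learning_rate(epoch, base_lr=0.001, step=5, factor=0.5):
    """Step schedule: start at base_lr and halve it every `step` epochs."""
    return base_lr * factor ** (epoch // step)

up = [0.0, 0.0, 1.0]
down = [0.0, 0.0, -1.0]
side = [1.0, 0.0, 0.0]
loss_perfect = cosine_loss([up, up], [up, up])        # identical normals
loss_orthogonal = cosine_loss([side], [up])           # 90 degrees apart
loss_opposite = cosine_loss([down], [up])             # 180 degrees apart
schedule = [learning_rate(e) for e in range(40)]      # 40 training epochs
```

The loss ranges from 0 for perfectly aligned normals to 2 for opposite ones, and under the stated assumption the learning rate in the final epoch is 0.001 × 0.5^7.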

Training and validation datasets
We adopt two public synthetic datasets to train our network: the blobby shape dataset [42] and the sculpture shape dataset [43]. Following the setup in PS-FCN [12], we render these two shape datasets with the MERL dataset [44], which contains 100 different BRDFs of real-world materials, using the physically-based raytracer Mitsuba [45]. The rendered images have a resolution of 128 × 128. Image patches of size 32 × 32 are randomly cropped for data augmentation. This results in 85,212 samples in total, each sample containing 64 images with different illumination directions (random directions across the upper hemisphere). We split the samples into a training set (84,360 samples) and a validation set (852 samples).
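The cropping augmentation can be sketched as follows. The sketch assumes a single 2D array stored as nested lists and a caller-supplied random generator; in practice the same crop window would have to be applied to every image of a sample and to its ground-truth normal map so that they stay aligned. The function name and the seeded generator are assumptions.

```python
import random

def random_crop(image, size, rng):
    """Crop a size x size patch at a random location.

    image: H x W nested list; rng: a random.Random instance.
    Returns the patch and its (top, left) corner so the same window
    can be reused for the other images and the normal map.
    """
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - size)      # randint is inclusive at both ends
    left = rng.randint(0, w - size)
    patch = [row[left:left + size] for row in image[top:top + size]]
    return patch, (top, left)

# A 128 x 128 image whose values encode their own coordinates.
image = [[r * 128 + c for c in range(128)] for r in range(128)]
rng = random.Random(0)
patch, (top, left) = random_crop(image, 32, rng)

# The split sizes quoted in the text add up to the stated total.
n_train, n_val = 84360, 852
n_total = n_train + n_val
```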

Testing datasets
We use public non-Lambertian photometric stereo datasets to evaluate our method. First, we employ the DiLiGenT benchmark dataset [19]. It contains 10 objects of various shapes with complex materials. For each object, the dataset provides 96 images under different illumination directions, at a resolution of 612 × 512. Then, we employ the Light Stage Data Gallery dataset [46]. It contains six complex objects with higher resolution. Each object has up to 253 images under different illumination directions. Note that this dataset lacks the ground-truth surface normal. Therefore we qualitatively evaluate our method on it.

Experimental results
We present experiments and analysis in this section.

Metrics
To verify the quantitative performance of our method, we employ widely used metrics to measure accuracy. We adopt the mean angular error (MAE) in degrees to evaluate the accuracy of the estimated surface normal:
$$ \mathrm{MAE} = \frac{1}{T} \sum_{p=1}^{T} \arccos\left(\hat{N}_p \cdot N_p\right) $$
where $T$ is the number of evaluated pixels. We also measure the percentage (%) of pixels with angular error less than 20°, which is denoted by < err 20°. This metric better measures high-frequency error, as the normal error in high-frequency regions is larger.
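Both metrics follow directly from the per-pixel angle between the estimated and ground-truth normals. The sketch below assumes unit-length normals given as lists of 3-vectors; the function names and the four test normals are illustrative.

```python
import math

def angular_errors_deg(pred, target):
    """Per-pixel angle in degrees between unit normals."""
    errors = []
    for n_hat, n in zip(pred, target):
        dot = sum(a * b for a, b in zip(n_hat, n))
        dot = max(-1.0, min(1.0, dot))   # guard against rounding outside [-1, 1]
        errors.append(math.degrees(math.acos(dot)))
    return errors

def mean_angular_error(errors):
    return sum(errors) / len(errors)

def percent_below(errors, threshold_deg=20.0):
    """Percentage of pixels whose angular error is below the threshold."""
    return 100.0 * sum(e < threshold_deg for e in errors) / len(errors)

def tilted(angle_deg):
    """Unit normal tilted away from the z-axis by the given angle."""
    a = math.radians(angle_deg)
    return [math.sin(a), 0.0, math.cos(a)]

target = [[0.0, 0.0, 1.0]] * 4
pred = [tilted(0.0), tilted(10.0), tilted(30.0), tilted(90.0)]
errors = angular_errors_deg(pred, target)
mae = mean_angular_error(errors)
err20 = percent_below(errors)
```

In this toy case two of the four pixels fall under the 20° threshold, so the percentage is 50 even though the mean error is dominated by the single 90° outlier.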

Procedure
We performed quantitative ablation experiments on the validation set, reporting the average MAE of its 852 samples (tested with 32 images). Table 1 summarizes the results of the ablation experiments.
Our default method is marked as D0, with full resolution features + 1/2 resolution features + 1/4 resolution features in the high-resolution feature extraction stage [17], as well as fusion of high-frequency edge information $\Omega^{FR}_{max}$ and 5 weights in the GM-CondConv module of the regression stage. We first evaluate the effectiveness of multi-scale features: experiments D0, M1, M2, M3, and M4 combine different resolution features. For M1, M2, M3, and M4, we adjust the architecture of the feature extraction network, the corresponding multi-scale max-pooling fusion, and the number of concatenations in the regression stage, but keep the GM-CondConv module unchanged. Note that the 1/8 resolution feature in M4 has dimensions $\mathbb{R}^{\frac{1}{8}H \times \frac{1}{8}W \times 512}$. For the network without full resolution features, we down-sample at the beginning. We then evaluate the effectiveness of the improved GM-CondConv module (experiments D0, C5, C6, C7, and C8). We test the impact of fusing edge information, and the number of weights of the routing function in the GM-CondConv module. For C5, C6, C7, and C8, we only adjust the GM-CondConv module but keep the architecture of the high-resolution network unchanged. Finally, we evaluate different methods of fusing illumination directions (experiments D0 and L9).
[...] resolution features, while the 1/8 resolution features significantly increase the number of parameters and training time. This might be because such deep features contain less detail and more high-level semantic information, which is of little use for the per-pixel prediction task. Therefore, we select full resolution features + 1/2 resolution features + 1/4 resolution features in the high-resolution feature extraction stage [17].

Effectiveness of fusing high-frequency information in routing
Experiments D0 and C5 show the influence of fusing high-frequency edge information $\Omega^{FR}_{max}$ in the routing function of the GM-CondConv module. We can see that both the angular error and < err 20° on the validation set improve when edge information is taken into account. This might be explained by the fact that the improved routing function incorporates high-frequency information into the self-learned weights, which helps the GM-CondConv module estimate surface regions of different frequency (such as crinkles and planar parts). We also show a "Buddha" example in Fig. 2. The comparison between CondConv (C5) and GM-CondConv (D0) shows that using GM-CondConv improves the performance in high-frequency areas.

Choice of number of weights in GM-CondConv
Experiments D0, C6, and C7 vary the number of weights in GM-CondConv. Note that with one weight there is only one convolution kernel and no dynamic weight. These comparisons show the effectiveness of our improved GM-CondConv module. Also, compared with the default setting, adding further weights to GM-CondConv does not continue to improve accuracy. According to these experiments, our method performs best when 5 weights are used.

Effectiveness of illumination direction fusion methods
Experiments D0 and L9 show the influence of different fusion methods. Angular error and < err 20° for the validation set are best when using concatenation (our default). Prediction accuracy drops severely when the input image and the illumination direction are instead fused by addition. We argue that the network can hardly decouple features that have been numerically added back into image and illumination components.

Evaluation on 96 input images
We compare our method with both non learning-based methods and recent deep learning-based methods in terms of achieved MAE, on the DiLiGenT benchmark [19]. As non learning-based methods, we evaluate the least squares (baseline) method [2], together with the outlier rejection methods rank minimization [4] and matrix rank = 3 [5]. We also evaluate sophisticated reflectance methods, such as Multi-Ward models [23], bivariate BRDF [6], and a bi-polynomial method [47]. For deep learning-based methods, we compared our method to DPSN [11], IRPS [15], PS-FCN [12], and Attention-PSN [37] using 96 input images. Quantitative results are reported in Table 2. Figure 3 visualizes results for the four most accurate deep learning-based photometric stereo methods: Attention-PSN [37], PS-FCN [12], IRPS [15], and DPSN [11], as well as the baseline least squares method [2]. Figure 3 illustrates the performance of our method in high-frequency regions, such as the face of "Buddha" and the flower in "Pot2", and in cast-shadow regions, such as the shoulder of "Buddha" and the base of "Goblet". It can be seen that our method is more accurate in regions with cast shadows and crinkles. We also show details in an enlargement of part of "Buddha" in Fig. 2. We can see that the last three comparisons, which take high-frequency information into consideration, achieve much better accuracy on crinkles and edges. Specifically, our default settings (using improved GM-CondConv) result in reduced error in high-frequency areas, compared to using the CondConv module (without high-frequency information $\Omega^{FR}_{max}$ in the routing function, C5 in Table 1).
Fig. 2 An enlargement from "Buddha" from the DiLiGenT dataset [19]. Att.-PSN: Attention-PSN. CondConv represents using the original CondConv module [18] (ID = C5 in Table 1), while GM-CondConv represents our default model.

Limitations
Our CHR-PSN method does not achieve the best performance on some objects, such as "Ball" and "Bear". We also illustrate some failures in Fig. 4. For these objects, our method provides sub-optimal performance. Objects like "Ball" and "Bear" have smooth surface normals and approximately Lambertian reflectance. In these cases, we argue that the high-resolution feature extraction of our method and the GM-CondConv module are excessive. IRPS [15] performs very well on these objects because it introduces the reconstruction loss to learn the surface normal, and an approximately Lambertian surface with simple structure is beneficial to inverse rendering. However, we can see that our method still outperforms Attention-PSN and IRPS in non-Lambertian regions (such as the specularity of "Ball") and cast-shadow regions (such as the chin of "Bear").

Evaluation using fewer input images
We further evaluated our method against several methods with sparse inputs (10 input images). Our method employs max-pooling to handle an arbitrary number of input images, which is of practical use. For non learning-based methods, we evaluate the least squares baseline method [2], the bi-polynomial [47], and matrix rank = 3 [5]. For deep learning-based methods, we evaluate CNN-PS [34], SPLINE-Net [36], LMPS [35], and PS-FCN [12]. We summarize the comparisons in Table 3.
It can be seen that our method outperforms others on average MAE using the DiLiGenT dataset and achieves state-of-the-art accuracy on most objects. We also visualize the average MAE of the DiLiGenT dataset from sparse input (8) to dense input (96), as shown in Fig. 5. We compare our method to PS-FCN [12], which also uses max-pooling to handle differing numbers of input images with a single round of training. We can see that our method outperforms PS-FCN on all numbers of input images (both methods were trained with 32 input images).

Extension to uncalibrated photometric stereo
We next report the performance of our method in uncalibrated conditions. In practical applications, the illumination directions $\mathbf{l}_j$ are sometimes unknown. Our method can be easily extended to handle uncalibrated photometric stereo by removing the illumination direction from the input (so that $\Phi_j \in \mathbb{R}^{H \times W \times 3}$ contains only the RGB image). To verify the potential of our method, we trained the model without illumination directions (also using 32 images per sample) and tested it on the DiLiGenT benchmark [19] with 96 images. The results are reported in Table 4. We compare our method (uncalibrated) with several uncalibrated photometric stereo methods, such as entropy minimization [48], a self-calibrating method [49], reflectance symmetry [50], diffuse maxima [51], and UPS-FCN (for uncalibrated) [13]. Our method (uncalibrated) outperformed existing methods in terms of the average MAE, except for SDPS-Net [13]. SDPS-Net is specially designed for uncalibrated conditions (solely learning the illumination direction), while our method can be used in both calibrated and uncalibrated conditions.

Evaluation on the Light Stage Data Gallery dataset
We further qualitatively evaluated our method on a more complex dataset with general non-Lambertian materials. Figure 6 shows the results of our method (tested with a random sample of 150 of 253 total images) on objects "Kneeling", "Helmet", and "Standing". We show qualitative outcomes in this experiment, due to the absence of ground-truth surface normals. Due to limited GPU memory, we tested the Light Stage Data Gallery with 64 input images (calibrated illumination directions). As shown in Fig. 6, the estimated normal keeps the details without blurring, such as in the hair of Kneeling, and the screws of the Helmet. The predicted surface normal and 3D reconstruction convincingly reflect the shapes of the objects, with accurate detail. The belt of Kneeling further illustrates our performance on cast shadows. However, we also note that the predicted surface normal of the object Kneeling has some blurring and noise. We argue that the poor quality observed in Kneeling is due to high-frequency noise, which may affect the GM-CondConv module of our method.
Fig. 6 Qualitative results of our method on objects Kneeling, Helmet, and Standing. Yellow boxes: regions with high-frequency surfaces (such as crinkles). Red boxes: regions with cast shadows. Contrast is adjusted for ease of viewing. After predicting surface normals, 3D reconstructions are recovered by Ref. [52].

Conclusions
In this paper, we have proposed a conditional photometric stereo network with a high-resolution feature extraction architecture.
Compared to previous deep learning approaches which regress surface normals from a down-sampled feature map, we employ a multi-scale parallel architecture to enhance the details in predictions. Furthermore, we employ an improved GM-CondConv module in the regression stage which considers the frequency of surfaces. As a result, our method outperforms others in high-frequency regions such as crinkles and edges. Ablation experiments have shown that each of these components contributes to more accurate reconstruction.
Extensive quantitative and qualitative comparisons on the DiLiGenT benchmark and the Light Stage Data Gallery have shown that our method outperforms state-of-the-art methods.
Despite offering state-of-the-art performance, our method can be further improved. Firstly, our method provides sub-optimal results on some objects with very simple structure, in which cases the high-resolution feature extraction and GM-CondConv are excessive. Secondly, the training time of our method is longer than for other deep learning-based photometric stereo methods, due to our much bigger network architecture. In future work, we will refine the feature extractor architecture to improve accuracy and to predict surface normals faster.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095.