Unsupervised single-shot depth estimation using perceptual reconstruction

Real-time estimation of actual object depth is an essential module for various autonomous system tasks such as 3D reconstruction, scene understanding and condition assessment. During the last decade of machine learning, extensive deployment of deep learning methods to computer vision tasks has yielded approaches that succeed in achieving realistic depth synthesis out of a simple RGB modality. Most of these models are based on paired RGB-depth data and/or the availability of video sequences and stereo images. However, the lack of RGB-depth pairs, video sequences, or stereo images makes depth estimation a challenging task that needs to be explored in more detail. This study builds on recent advances in the field of generative neural networks in order to establish fully unsupervised single-shot depth estimation. Two generators for RGB-to-depth and depth-to-RGB transfer are implemented and simultaneously optimized using the Wasserstein-1 distance, a novel perceptual reconstruction term, and hand-crafted image filters. We comprehensively evaluate the models using a custom-generated industrial surface depth data set as well as the Texas 3D Face Recognition Database, the CelebAMask-HQ database of human portraits and the SURREAL dataset that records body depth. For each evaluation dataset, the proposed method shows a significant increase in depth accuracy compared to state-of-the-art single-image transfer methods.


Introduction
Real-time depth inference of a given object is an essential computer vision task which can be applied in various robotic tasks such as simultaneous localization and mapping [34,13,47] as well as autonomous quality inspection in industrial applications [2,31]. As the popularity of VR applications has continued to grow, instant depth estimation has also become an integral part of modeling complex 3D information out of single 2D images of human faces [4,27] or body parts [42,40,39]. Depth information about an object can be directly obtained from sensors for optical distance measurement. Time-of-Flight (ToF) cameras, LIDAR or stereo imaging systems are often used in practice and were also employed to generate paired RGB-depth data for some well-known depth databases [42,38,17,13,34,23,39]. Since these sensors are typically costly, time-consuming to operate and sensitive to external influences, their applicability to fast full-image depth generation on small on-site devices is limited. These limitations have motivated depth synthesis from a modality that is simpler to acquire, namely an RGB image. This development has initiated a completely new field of research in computer vision.
An important contribution was made by Eigen et al. [9], who proposed deep convolutional neural networks (DCNNs) for monocular depth synthesis of indoor and outdoor scenes. Basically, monocular single-image depth estimation out of RGB images can be seen as a modality transfer in which observed data of one modality is mapped to desired properties of another, potentially more complex, modality. Although DCNNs are a promising approach that succeeds on such transfer tasks, they commonly rely on large amounts of training data, whose generation and acquisition can be demanding. In the supervised setting in particular, DCNNs make use of paired training data during network parameter optimization, i.e., the network is provided with a single-view RGB image and the corresponding per-pixel depth [9,4,39,29]. Since large-scale dense depth profiles are not abundant in many applications, supervised approaches are not feasible for these objects. One possible way to remedy these shortcomings of supervised methods is to consider self-supervised approaches based on monocular video clips in which a supervisory depth counterpart is extracted from pose changes between adjacent frames. These models can be trained on RGB sequences in a self-supervised manner, where a depth network and a pose estimation network are simultaneously optimized via sophisticated view-synthesis losses [50,14,47,25]. Obviously, these methods require non-static scenes or a moving camera position (e.g., moving humans [25], autonomous driving [13]).
A very recent example of a scenario in which neither video sequences, stereo pairs nor paired data are available is the nondestructive evaluation of internal combustion engines for stationary power generation [2,31]. Within this application, surface depth information has to be extracted from RGB image data. With current standards, cylinder condition can be assessed from a depth profile on a micrometer scale of the measured area (cf. Figure 1). However, microscopic depth sensing of cylinder liner surface areas is a time-consuming and resource-intensive task which consists of disassembling the liner, removing it from the engine, cutting it into segments and measuring them with a highly expensive and stationary confocal microscope [2]. With a handheld microscope, however, single RGB records of the liner's inner surface can be generated from which depth profiles may be synthesized. Since depth data is generated on a quite small scale (1.9 × 1.9 mm²) and is comparatively highly resolved, it is hardly possible to generate RGB data with accurately aligned pixel positions. Consequently, a fully unsupervised approach is required for reasonable depth synthesis of this static scene.
Figure 1: Top: RGB measurements of the inner surface of three cylinder liners with a spatial range of 4.2 × 4.2 mm², recorded by a handheld microscope. Bottom: Depth profile of the same cylinder with a spatial range of 1.9 × 1.9 mm², measured with a confocal microscope. The pixels of the modalities are not aligned.
The main objective of this study is to propose a general method for depth estimation out of scenes for which neither paired data, video sequences, nor stereo pairs are available. Therefore, we consider the depth estimation problem as an intermodal transfer task of single images. Several recent advances in unpaired modality transfer are based on generative adversarial networks (GANs) [15], cycle-consistency [51] and probabilistic distance measures [3,16]. The method proposed in this paper builds on established model architectures and training strategies in deep learning which are beneficially combined for unpaired single-view depth synthesis. Introduction of a novel perceptual reconstruction term in combination with appropriate hand-crafted filters further improves accuracy and depth contours. The method is comprehensively tested on the aforementioned industrial application of surface depth estimation. Furthermore, the approach is applied to other, external datasets to create realistic scenarios where perfectly aligned RGB-depth data of single images is not available in practice. More precisely, we test the model on the Texas 3D Face Recognition database (Texas-3DFRD) [17], the Bosphorus-3DFA [38] and the CelebAMask-HQ [32] to show its plausibility for facial data in an unsupervised setting. The SURREAL dataset [40] is used to test performance on RGB-D videos of human bodies, where RGB and depth frames are not perfectly aligned. For every evaluation experiment, the depth accuracy of the proposed framework is compared to state-of-the-art methods in unsupervised single-image transfer. To be more precise, the methods used for comparison are standard cycleGAN [51], CUT [35], which uses contrastive learning for one-sided transfer, and gcGAN [11], which utilizes geometric constraints between modalities. For facial data, we additionally compare to Wu et al. [45], a very recent work in which, in addition to the depth profile, the albedo image, the illumination source and a symmetry confidence map are predicted in an unsupervised manner.

Contributions:
• This study finds a solution to the industrial problem of single-shot surface depth estimation where no paired data, no video sequences and no stereo pairs are available.
• In this work, depth estimation is considered as a single-image modality transfer; the proposed method shows superior performance over state-of-the-art works, quantitatively and qualitatively.
• Application to the completely different tasks of unsupervised face and human body depth synthesis indicates the universality of the approach.

Related Work
The following section summarizes the most important milestones in the development of generative adversarial networks and highlights important work on single-image depth estimation as well as depth synthesis via GANs. In the supplementary, background is provided on some 3D databases that have been critical to the development of deep learning-based models for depth estimation.

Generative Adversarial Networks
A standard GAN [15] consists of a generator network G : Z → X mapping from a low-dimensional latent space Z to image space X, where parameters of the generator are adapted so that the distribution of generated examples assimilates the distribution of a given data set. To be able to assess any similarity between arbitrary high-dimensional image distributions, a discriminator f : X → [0, 1] is trained simultaneously to distinguish between generator distribution and real data distribution. In a two-player min-max game, generator parameters are then updated to fool a steadily improving discriminator. Usage of the initially proposed discriminator approach can cause the vanishing gradient problem and does not provide any information on the real distance between the generator and the real distribution. This issue has been discussed thoroughly in [3], where the problem is bypassed by replacing the discriminator with a critic network that approximates the Wasserstein-1 distance [41] between the real distribution and the generator distribution.
While the quintessence of GANs is to draw synthetic instances following a given data distribution, cycle-consistent GANs [51] allow one-to-one mappings between two image domains X and Y. In essence, two generator networks are trained simultaneously to enable generation of synthetic instances for both image domains (e.g., synthesizing winter landscapes from summer scenes and vice versa). To ensure one-to-one correspondence, a cycle-consistency term is added to the two adversarial loss functionals. Although cycle-consistent GANs had initially been constructed for style transfer purposes, they were also very well received in the area of modality transfer in biomedical applications [18,21,33]. Since optimization and fine-tuning of GANs often turns out to be extremely demanding and time-intensive, much research has emphasized stabilization of the training process through the development of stable network architectures such as DCGAN [37] or PatchGAN [24].

Monocular Depth Estimation
Deep learning based methods achieve state-of-the-art results on depth synthesis tasks by training a DCNN on a large-scale and extensive data set [13,34]. Most RGB-based models are supervised, i.e., they require corresponding depth data that is pixel-wise aligned. One of the first DCNN approaches by Eigen et al. [9] included sequential deployment of a coarse-scale stack and a refinement module and was benchmarked on the KITTI [13] and the NYU Depth v2 data set [34]. Using an encoder-decoder structure in combination with an adversarial loss term helped to increase visual quality of the dense depth estimates [26]. Later methods also considered deep residual networks [30] or deep ordinal regression networks [10] in order to significantly increase performance on these data sets, where commonly considered performance measures are the root mean squared error (RMSE) or the δ1 accuracy [47]. Since a lot of research focused on further performance increase at the expense of model complexity and runtime, Wofk et al. [43] used a lightweight network architecture [22] and achieved comparable results.
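The two performance measures named above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper: the threshold 1.25 for the δ1 accuracy is the convention commonly used in the depth-estimation literature and is an assumption here, and the function names are hypothetical.

```python
import numpy as np

def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth depth."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < threshold.

    Assumes strictly positive depth values; threshold=1.25 (delta_1)
    is the usual convention, not a value stated in this text.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold))

# Small example: three of the four predictions are within a factor of 1.25.
gt = np.array([1.0, 2.0, 4.0, 10.0])
pred = np.array([1.1, 2.0, 6.0, 9.0])
print(rmse(pred, gt), delta_accuracy(pred, gt))
```

Note that the δ1 accuracy is a ratio-based measure and therefore only meaningful for positive (absolute) depth, whereas the RMSE also applies to relative depth profiles.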

Depth Estimation using GAN
Use of left-right consistency and a GAN architecture results in excellent unsupervised depth estimation based on stereo images [36,48]. In [28] and [49], a GAN has been trained to perform unpaired depth synthesis out of single monocular images. To this end, GANs were employed in the context of domain adaptation using an additional synthesized data set of the same application with paired samples. This approach may not be regarded as a fully unsupervised method and requires availability or construction of a synthetic dataset. Arslan and Seke [4] consider a conditional GAN (CGAN) [24] for solving single-image face depth synthesis. Nevertheless, CGANs rely on paired data since the adversarial part estimates the plausibility of an input-output pair. Another interesting approach was tried in [29], where indoor depth and segmentation were estimated simultaneously using cycle-consistent GANs. The cycle-consistency loss helped to maintain the characteristics of the RGB input during depth synthesis, while the simultaneous segmentation resolved the fading problem in which depth information is hidden by larger features. However, the proposed discriminator network and reconstruction term in the generator loss function are based on paired RGB and depth/segmentation data, which is not available for the aforementioned industrial application of surface depth synthesis.

Method
This section proposes an approach to monocular single-image depth synthesis with unpaired data and discusses the introduced framework and training strategy in detail.

Setting and GAN Architecture
The underlying structure of the proposed modality synthesis consists of two GANs linked by a reconstruction term. This work uses a critic network based on the Wasserstein-1 distance [41,3]. The Wasserstein-1 distance (earth mover's distance) between two distributions $P_1$ and $P_2$ is defined as
$W_1(P_1, P_2) := \inf_{J \in \mathcal{J}(P_1, P_2)} \mathbb{E}_{(x,y) \sim J} \|x - y\|,$
where the infimum is taken over the set $\mathcal{J}(P_1, P_2)$ of all joint probability distributions that have marginal distributions $P_1$ and $P_2$. Since the exact computation of the infimum is intractable, the Kantorovich-Rubinstein duality [41] is used, which leads to the empirical critic risk $R_{cri}$ (2), where $p$ denotes the influence of the gradient penalty. The goal of the RGB-to-depth generator $G_{\theta_Y}$ is to minimize this distance. Since only the first term of the functional in (2) depends on the generator weights $\theta_Y$, the adversarial empirical risk $R_{adv}$ (3) for generator $G_{\theta_Y}$ reduces to that term.
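For orientation, the duality and a gradient-penalized critic objective are commonly written as follows in the Wasserstein GAN literature. This is the textbook two-sided form and only an assumption about the shape of the risks used in this work; the exact penalty and sign conventions may differ. The interpolate $\hat{y}_i$ and the minibatch size $N$ are notation introduced here for illustration.

```latex
% Kantorovich-Rubinstein duality: supremum over all 1-Lipschitz functions f
W_1(P_1, P_2) = \sup_{\|f\|_{L} \le 1} \;
  \mathbb{E}_{x \sim P_1}[f(x)] - \mathbb{E}_{y \sim P_2}[f(y)]

% Gradient-penalized critic objective for the depth critic f_{\omega_Y}
% (assumed textbook form). \hat{y}_i is a random interpolate between a real
% depth sample y_i and a generated sample G_{\theta_Y}(x_i).
\frac{1}{N} \sum_{i=1}^{N} \Big[
  f_{\omega_Y}\big(G_{\theta_Y}(x_i)\big) - f_{\omega_Y}(y_i)
  + p \, \big( \|\nabla_{\hat{y}_i} f_{\omega_Y}(\hat{y}_i)\|_2 - 1 \big)^2
\Big]
```

In this form the penalty weighted by $p$ softly enforces the 1-Lipschitz constraint of the duality, and the generator weights $\theta_Y$ enter only through the term $f_{\omega_Y}(G_{\theta_Y}(x_i))$.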

Perceptual Reconstruction
In the context of depth synthesis, it is not sufficient to ensure that the output samples lie in the depth domain. Care must be taken that synthetic depth profiles do not become irrelevant to the input. A reconstruction constraint forces generator input and output to share the same spatial structure by taking into account the similarity between the input and the reconstruction of the synthesized depth profile. Obviously, calculation of a reconstruction error requires an opposite generator function $G_{\theta_X} : Y \to X$ to assimilate the real RGB distribution $P_X$ as well as the corresponding distance network $f_{\omega_X} : X \to \mathbb{R}$. Both have to be optimized simultaneously with the RGB-to-depth direction. The reconstruction error is commonly evaluated by assessing similarity between $x$ and $G_{\theta_X}(G_{\theta_Y}(x))$ as well as similarity between $y$ and $G_{\theta_Y}(G_{\theta_X}(y))$ for $x \in X$ and $y \in Y$.
In the setting of style transfer and cycle-consistent GANs [51], a pixelwise distance function on image space is considered, where the mean absolute error (MAE) or the mean squared error (MSE) are common choices. The use of a contrary generator $G_{\theta_X}$ can be viewed as a type of regularization since it prevents mode collapse, i.e., generator outputs remain dependent on the inputs. Deployment of the cycle-consistency approach [51], where reconstruction error is measured in image space, assumes no information loss during the modality transition. This corresponds to applications such as summer-to-winter landscape or photograph-to-Monet painting transition. Determining $G_{\theta_Y}$ and $G_{\theta_X}$ is an ill-posed problem since a single depth profile may be generated by an infinite number of distinct RGB images and vice versa [6]. For example, during RGB-to-depth transition of human faces, information on image brightness, light source or the subject's skin color is lost. As a consequence, the contrary depth-to-RGB generator needed for regularization has to synthesize the lost properties of the image. Both generators $G_{\theta_Y}$ and $G_{\theta_X}$ may be penalized if the skin color or the brightness of the reconstruction is changed, even though $G_{\theta_X}$ did exactly what we expected it to do, i.e., synthesize a face that is related to the input's depth profile.
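The penalty described above can be made concrete with a small NumPy sketch. The two generators here are hypothetical toy functions, not the networks of this work: the RGB-to-depth map discards global brightness and the depth-to-RGB map always renders at full brightness. Under these assumptions, a darker exposure of the same structure incurs a pixelwise cycle error although the structure is reconstructed perfectly.

```python
import numpy as np

def cycle_mae(x, y, g_xy, g_yx):
    """Pixelwise cycle-consistency error (MAE), summed over both directions."""
    x_rec = g_yx(g_xy(x))   # RGB -> depth -> RGB
    y_rec = g_xy(g_yx(y))   # depth -> RGB -> depth
    return float(np.mean(np.abs(x - x_rec)) + np.mean(np.abs(y - y_rec)))

# Hypothetical toy generators, for illustration only.
def toy_rgb_to_depth(x):
    gray = x.mean(axis=-1)
    return gray / gray.max()                    # global brightness is discarded

def toy_depth_to_rgb(y):
    return np.repeat(y[..., None], 3, axis=-1)  # always renders at full brightness

rng = np.random.default_rng(0)
shape = rng.random((8, 8))
shape /= shape.max()                # "depth" structure scaled to a maximum of 1
bright = toy_depth_to_rgb(shape)    # RGB image at full brightness
dark = 0.4 * bright                 # same structure, darker exposure

loss_bright = cycle_mae(bright, shape, toy_rgb_to_depth, toy_depth_to_rgb)
loss_dark = cycle_mae(dark, shape, toy_rgb_to_depth, toy_depth_to_rgb)
print(loss_bright, loss_dark)       # the dark input is penalized, the bright one is not
```

The nonzero error for the dark input stems purely from the lost brightness information, which is the motivation for measuring the reconstruction error on features instead of pixels.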
Adapting the idea of [8], we propose a perceptual reconstruction loss, i.e., instead of computing a reconstruction error in image space, we consider certain image features of the reconstruction. Typical perceptual similarity metrics extract features by propagating the images to be compared through an auxiliary network that is usually pretrained on a large image classification task [7,8,20]. Nevertheless, we expect our feature extractor to be perfectly tailored to our data and not determined by an additional network pretrained on a very general classification task [7] that may not even cover our type of data. Therefore, we enforce the reconstruction consistency on the image space by using the MAE loss on feature vectors extracted by $\phi_X(\cdot) := f^l_{\omega_X}(\cdot)$, which corresponds to the $l$-th layer of the RGB critic (cf. Algorithm 1). Analogously, we define the feature extractor on depth space by $\phi_Y(\cdot) := f^l_{\omega_Y}(\cdot)$, which corresponds to the $l$-th layer of the depth critic. Although we are aware that feature extractor weights are adjusted with each update of critic weights $\omega_X$, $\omega_Y$, we assume that, at least at a later stage of training, $\phi_X$ and $\phi_Y$ have learned good and stable features on the image and depth domain. This yields an empirical reconstruction risk that measures the MAE between the features of each input and the features of its reconstruction. In our implementation, we set $l := L - 2$ for a critic with $L$ layers, i.e., we use the second-to-last layer of the critic.
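A minimal NumPy sketch of this feature-space error follows. The critic here is a hypothetical stack of four toy layers chosen so that its first layer removes the global brightness offset; the actual critic is a trained network, and all names below are illustrative assumptions.

```python
import numpy as np

def critic_features(x, layers, l):
    """Propagate x through critic layers 0..l and return the activation phi(x)."""
    h = x
    for layer in layers[: l + 1]:
        h = layer(h)
    return h

def perceptual_mae(x, x_rec, layers, l):
    """MAE between critic features of the input and of its reconstruction."""
    diff = critic_features(x, layers, l) - critic_features(x_rec, layers, l)
    return float(np.mean(np.abs(diff)))

# Hypothetical toy critic with L = 4 layers (not the PatchGAN critic).
toy_critic = [
    lambda h: h - h.mean(),                              # removes global brightness
    lambda h: np.maximum(h, 0.0),                        # ReLU
    lambda h: h.reshape(4, 2, 4, 2).mean(axis=(1, 3)),   # 2x2 average pooling on 8x8
    lambda h: np.array([h.sum()]),                       # scalar critic score
]
l = len(toy_critic) - 2                                  # second-to-last layer

rng = np.random.default_rng(0)
x = rng.random((8, 8))
x_rec = x + 0.2                                          # reconstruction with shifted brightness

pixel_error = float(np.mean(np.abs(x - x_rec)))
feature_error = perceptual_mae(x, x_rec, toy_critic, l)
print(pixel_error, feature_error)
```

With this toy critic the brightness shift produces a pixelwise error of 0.2 but a feature error that vanishes up to floating-point precision, which mirrors the intended behavior: properties the critic features ignore are not penalized.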
A good reconstruction term must still be found for the start of training, when the critic features are not yet sufficiently reliable. At first, it is desirable to guide the framework to preserve structural similarity during RGB-to-depth and depth-to-RGB transition. Therefore, we propose to compare the input and its reconstruction in the image space while automatically removing the brightness, illumination and color of the RGB images beforehand. This can be ensured by applying the following steps:
1. Convert the image to grayscale by applying the function $g : [0, 255]^{d_1 \times d_2 \times 3} \to \mathbb{R}^{d_1 \times d_2}$, $x \mapsto \frac{0.299}{255} x_{(\cdot,\cdot,0)} + \frac{0.587}{255} x_{(\cdot,\cdot,1)} + \frac{0.114}{255} x_{(\cdot,\cdot,2)}$, where $(\cdot,\cdot,i)$ denotes the $i$-th color channel for $i = 0, 1, 2$.
2. Enhance the brightness of the grayscale image using an automated gamma correction based on the image brightness [5], i.e., take the grayscale image $x_{gr}$ to the power of $\Gamma(x_{gr}) := -0.3 \cdot 2.303 / \ln \bar{x}_{gr}$, where $\bar{x}_{gr}$ denotes the average of the gray values.
3. Convolve the enhanced image with a high-pass filter $h$ in order to dim the lighting source and color information (cf. Figure 3). The high-pass filter may be applied in the Fourier domain, i.e., the 2D Fourier transform is multiplied by a Gaussian high-pass filter matrix $H_\sigma$ defined by $(H_\sigma)_{i,j} := 1 - \exp\left(-\|(i,j) - (d_1/2, d_2/2)\|_2^2 / (2\sigma^2)\right)$ for $i = 1, \ldots, d_1$ and $j = 1, \ldots, d_2$.
In our implementation, $\sigma = 4$ yielded satisfactory results for all tasks. This yields the updated empirical reconstruction risk $R_{rec}$ (5), where $\psi(\cdot) := h * g(\cdot)^{\Gamma(g(\cdot))}$ and $\gamma$ is gradually increased from 0 to 1 during training to control feature extractor reliability. In the far right column of Figure 3, we may observe the strong effect of operator $\psi$. For the face sample, the face shape and the positions of the nose and the eyes are very clear; at the same time, the low image brightness and the exposure direction are resolved. The main edges of the cylinder liner surfaces are clearly identifiable, whereas the different brown levels and illumination inconsistencies of the input are no longer visible. Using the previously discussed risk functions $R_{cri}$ (2), $R_{adv}$ (3) and $R_{rec}$ (5), Algorithm 1 summarizes the proposed architecture for fully unsupervised single-view depth estimation. Implementation of the proposed framework is publicly available at https://github.com/anger-man/unsupervised-depth-estimation.
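The three filter steps that make up the operator ψ can be sketched in NumPy as follows. This is an illustrative reading of the steps, not the published implementation: the standard luma weights (0.299, 0.587, 0.114), the centering of the Gaussian on the shifted spectrum, the clipping constant `eps` and all function names are assumptions made here.

```python
import numpy as np

def to_gray(x):
    """Step 1: map an RGB image with values in [0, 255] to grayscale in [0, 1]."""
    x = np.asarray(x, dtype=np.float64)
    return (0.299 * x[..., 0] + 0.587 * x[..., 1] + 0.114 * x[..., 2]) / 255.0

def auto_gamma(x_gr, eps=1e-6):
    """Step 2: raise the image to the power Gamma = -0.3 * 2.303 / ln(mean)."""
    mean = np.clip(x_gr.mean(), eps, 1.0 - eps)   # keeps ln(mean) finite and negative
    gamma = -0.3 * 2.303 / np.log(mean)
    return x_gr ** gamma

def gaussian_highpass(x, sigma=4.0):
    """Step 3: multiply the centered 2D spectrum by 1 - exp(-D^2 / (2 sigma^2))."""
    d1, d2 = x.shape
    i = np.arange(d1)[:, None] - d1 // 2          # vertical distance to spectrum center
    j = np.arange(d2)[None, :] - d2 // 2          # horizontal distance to spectrum center
    H = 1.0 - np.exp(-(i**2 + j**2) / (2.0 * sigma**2))
    spectrum = np.fft.fftshift(np.fft.fft2(x))    # zero frequency moved to the center
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * H)))

def psi(x, sigma=4.0):
    """Grayscale conversion, automated gamma correction, then high-pass filtering."""
    return gaussian_highpass(auto_gamma(to_gray(x)), sigma)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(16, 16, 3)).astype(np.float64)
out = psi(img)
print(out.shape, abs(out.mean()))
```

Two properties are worth noting under these assumptions. Since 2.303 approximates ln 10, the gamma correction maps a uniformly gray image to a value of about 10^(-0.3) ≈ 0.5 regardless of its original brightness, and the high-pass filter is zero at the zero frequency, so the filtered image has zero mean. Together these remove global brightness while edges, which live in the higher frequencies, are retained.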
The choice of an appropriate architecture for the critic and the generator network is as critical as the loss function design of an unsupervised method. A decoder for the critic is built following the PatchGAN critic that was initially proposed in [24], with nearly $15.7 \times 10^6$ parameters. The PatchGAN architecture is empirically proven to perform quite stably over a variety of different generative tasks and is part of many state-of-the-art architectures for image generation [51,35,12]. The generator is a ResNet18 [19] with a depth-specific upsampling part taken from [14] ($19.8 \times 10^6$ parameters). Detailed information on critic and generator implementations is provided in the supplementary.

Experiments and Discussion
The framework proposed in Algorithm 1 is implemented with the publicly available TensorFlow framework [1]. The applications are inner surface depth estimation of cylinder liners, face depth estimation based on the Texas-3DFRD [17] and body depth synthesis using the SURREAL dataset [40]. In this section we benchmark the proposed framework on each dataset and separately present the results, followed by a discussion at the end. As discussed in the introduction, the methods used for comparison are a standard cycleGAN [51], gcGAN [12] and CUT [35]. For CUT we use the publicly available GitHub repository. For cycleGAN we remove the novel perceptual loss and handcrafted image filters from our method and replace them with the standard cycleGAN loss. For gcGAN we use the critic and generator implementations of our method, remove the contrary generator and employ up-down-flip as the geometric constraint. In our implementation, we set the number of generator updates $n_G$ to 10k, the minibatch size $b$ to 8 and the penalty term $p$ to 100. The number of critic iterations $n_f$ is initially set to 24 to ensure a good approximation of the Wasserstein-1 distance in the beginning. After 1000 generator updates, it is halved to speed up training. Furthermore, we set $\alpha_f$ to $5 \times 10^{-5}$ and $\alpha_G$ to $1 \times 10^{-4}$. The weight $\lambda_{rec}$ of the reconstruction term is found for each dataset and method individually by a parameter grid search.
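The training settings above can be collected in a small configuration object, together with the schedule for the number of critic iterations. This is a sketch under stated assumptions: the class and field names are hypothetical, generator steps are counted from zero, and $n_f$ is assumed to be halved exactly once.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    n_generator_updates: int = 10_000   # n_G
    batch_size: int = 8                 # b
    gradient_penalty: float = 100.0     # p
    critic_iters_initial: int = 24      # n_f at the start of training
    halve_after: int = 1_000            # generator updates before n_f is halved
    lr_critic: float = 5e-5             # alpha_f
    lr_generator: float = 1e-4          # alpha_G

def critic_iterations(step, cfg=TrainConfig()):
    """Critic updates performed before generator update `step` (0-based)."""
    if step < cfg.halve_after:
        return cfg.critic_iters_initial
    return cfg.critic_iters_initial // 2

cfg = TrainConfig()
total_critic_updates = sum(critic_iterations(s, cfg) for s in range(cfg.n_generator_updates))
print(total_critic_updates)
```

Under this reading, a full run performs 1000 · 24 + 9000 · 12 = 132,000 critic updates, compared with 240,000 if the initial value were kept throughout.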

Surface Depth
This study uses the same database initially proposed in [2] for depth estimation of inner cylinder liner surfaces of large internal combustion engines. Depth measurements cover a spatial region of 1.9 × 1.9 mm², have a dimension of approximately 4000 × 4000 pixels and are acquired using a resource-intensive logistic chain as discussed in the introduction. The profiles denote relative depth with respect to the core area of the surface on a µm scale. The RGB data is taken from the same cylinder surfaces with a simple handheld microscope. The RGB measurements cover a region of 4.2 × 4.2 mm² and have a resolution of nearly 1024 × 1024 pixels. Measurement positions are not registered to the depth data. 592 random samples are obtained from each image domain. The RGB and depth data are then augmented separately to nearly 7000 samples via random cropping, flipping and gamma correction [5]. To make computation feasible with an NVIDIA GeForce RTX 2080 GPU, each sample is resized to a dimension of 256 × 256 pixels. In order to assess the visual quality between two completely unaligned domains, we also generated depth profiles of 211 additional surface areas and registered them with great effort using shear transformations and a mutual information criterion. These evaluation samples are not included in the training database. During optimization, RGB images and depth profiles are scaled from [0, 255] to [−1, 1] and from [−5, 5] to [−1, 1], respectively, whereas evaluation metrics (RMSE and MAE) are calculated on the original depth scale in µm.
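The scaling convention of the last sentence can be sketched as follows: inputs are mapped affinely to [−1, 1] for training, and a prediction is mapped back to micrometers before the errors are computed. The helper names are hypothetical and the sketch only illustrates the arithmetic.

```python
import numpy as np

def rescale(a, src, dst=(-1.0, 1.0)):
    """Affinely map values from the interval src=(lo, hi) to dst=(lo, hi)."""
    a = np.asarray(a, dtype=np.float64)
    return (a - src[0]) / (src[1] - src[0]) * (dst[1] - dst[0]) + dst[0]

def errors_in_micrometers(pred_normalized, gt_um, depth_range=(-5.0, 5.0)):
    """Map a network output in [-1, 1] back to µm and return (RMSE, MAE)."""
    pred_um = rescale(pred_normalized, (-1.0, 1.0), depth_range)
    diff = pred_um - np.asarray(gt_um, dtype=np.float64)
    return float(np.sqrt(np.mean(diff**2))), float(np.mean(np.abs(diff)))

rgb_scaled = rescale([0, 127.5, 255], (0.0, 255.0))      # RGB values to [-1, 1]
depth_scaled = rescale([-5.0, 0.0, 5.0], (-5.0, 5.0))    # depth in µm to [-1, 1]
rmse_um, mae_um = errors_in_micrometers([0.1, -0.1], [0.0, 0.0])
print(rgb_scaled, depth_scaled, rmse_um, mae_um)
```

Because the depth range spans 10 µm and the normalized range spans 2, an error of 0.1 on the network scale corresponds to 0.5 µm on the original scale, which is why reporting on the original scale matters for interpreting the numbers.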

Face Depth
The Texas-3DFRD [17]      More experiments on unsupervised facial depth synthesis on the Bosphorus-3DFA [38], the CelebAMask-HQ [32] and qualitative comparison to Wu et al. [45] are presented in the supplementary.
Figure 6: From left to right: Face RGB input, ground truth and profiles predicted by our method, gcGAN, cycleGAN and CUT.

Body Depth
The SURREAL dataset [40] consists of nearly 68k video clips that show 145 different synthetic subjects performing various actions. The clips consist of 100 RGB frames with perfectly aligned depth profiles that denote real-world camera distance.
We use the same train/test split as Varol et al. [40], i.e., we remove nearly 12.5k clips and use the middle frame of each 100-frame clip for evaluation. From the remaining clips, 2500 clips are randomly selected for training. We choose 20 RGB and 20 depth frames per clip, ensuring that RGB and depth frames are disjoint in order to mimic an application without any accurately aligned RGB-depth pairs. This results in approximately 50k samples per modality. We strictly follow the preprocessing pipeline of Varol et al.
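The disjoint frame selection per clip can be sketched as below. This is one possible way to realize the described sampling, assuming frames are drawn uniformly at random without replacement; the function name and the random seed are illustrative.

```python
import numpy as np

def sample_disjoint_frames(n_frames=100, n_per_modality=20, rng=None):
    """Draw disjoint frame indices for the RGB and the depth modality of one clip."""
    rng = np.random.default_rng() if rng is None else rng
    picked = rng.choice(n_frames, size=2 * n_per_modality, replace=False)
    rgb_idx = np.sort(picked[:n_per_modality])
    depth_idx = np.sort(picked[n_per_modality:])
    return rgb_idx, depth_idx

rng = np.random.default_rng(0)
rgb_idx, depth_idx = sample_disjoint_frames(rng=rng)
samples_per_modality = 2500 * len(rgb_idx)   # 2500 training clips, 20 frames each
print(rgb_idx, depth_idx, samples_per_modality)
```

Drawing all 40 indices in a single call without replacement guarantees that no frame contributes both its RGB image and its depth profile, so no aligned pair can enter the training set.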

Discussion
Quantitative evaluation on unseen test data in Tables 1 to 3 confirms superiority of the proposed method compared to other state-of-the-art modality transfer methods. The CUT method in particular is not suitable for the depth estimation of planar surfaces and human bodies. The use of a novel perceptual reconstruction term in combination with handcrafted image filters overcomes the shortcomings of a standard cycle-consistency constraint as explained in Section 3.2 and improves depth accuracy significantly. Considering the industrial application, Figure 4 indicates that we have been able to synthesize realistic surface depth profiles with an RMSE of 0.751 µm compared to the registered ground truth. In Figure 6 we observe that predictions coming from our method seem most similar to the ground truth, while the results of cycleGAN and CUT do not correctly reproduce the contours of the input. In Figure 8 it can be seen that the CUT benchmark completely fails on the SURREAL dataset, which can possibly be attributed to the fact that here, in parallel to the depth estimation, the body must also be segmented.
Although the proposed method was initially motivated by cycleGAN [51], it is important to point out that replacement of the standard cycle-consistency term with perceptual losses and usage of appropriate hand-crafted filters in image space is a novel idea that overcomes significant shortcomings of the standard cycleGAN architecture in depth estimation, which are thoroughly discussed in the paper. For depth synthesis of surfaces, faces and human bodies, the RMSE decreases by about 9.8 %, 35.2 % and 12.1 %, respectively, compared to a standard cycleGAN. The proposed method has been mainly developed to find a solution to the problem of depth synthesis of planar cylinder liner surfaces. The results confirm that the framework not only succeeds on the cylinder surface task but also significantly improves performance in the field of face and whole body depth synthesis compared to state-of-the-art modality transfer methods.

Conclusion
This paper proposes a framework for fully unsupervised single-shot depth estimation from monocular RGB images based on the Wasserstein-1 distance, a novel perceptual reconstruction loss and handcrafted image filters. The model is comprehensively evaluated on differing depth synthesis tasks without using pairwise RGB and depth data during training.
The approach provides a reasonable solution for estimating the relative depth of cylinder liner surfaces when generation of paired data is technically not feasible. Moreover, the proposed algorithm also shows promising results when applied to the task of absolute depth estimation of human bodies and faces, thereby indicating that it may be generalized to other real-life tasks. However, one disadvantage of the perceptual reconstruction approach is that four neural networks must be fitted in parallel. Future work will therefore include the development of one-sided depth synthesis models in an unsupervised manner as well as the application of our approach to other modality transfer tasks.

A 3D Databases - An Overview
Single-shot depth estimation has become increasingly popular over the last decade of deep learning. The first deep learning solutions for depth synthesis were motivated by the development of autonomous driving and localization systems and therefore were initially designed to automatically determine the depth of indoor or outdoor scenes [9,50,28,36,14,48,29]. Deep convolutional neural networks, trained on large-scale and extensive data sets such as KITTI [13] or NYU Depth Dataset v2 [34], achieved state-of-the-art results. The outdoor video clips of the KITTI dataset can be used for various subtasks in computer vision such as optical flow, object detection, semantic segmentation and depth [47]. Each video sequence of the KITTI dataset consists of stereo image pairs with aligned depth images (LIDAR), which renders the database a common benchmark for unsupervised or self-supervised depth estimation tasks [50,36,14]. The NYU Depth Dataset v2 focuses on monocular sequences of indoor environments, where depth counterparts are obtained with a high quality RGB-D camera. Therefore, this dataset is considered a primary benchmark in supervised monocular depth estimation [9,29].
With the advent of virtual and augmented reality applications, single-image pose estimation and 3D reconstruction of human bodies or body parts received a great amount of attention in the research field of computer vision [25]. 3D information on human faces provides additional benefits for face recognition or detection systems [4]. The Texas-3DFRD [17] and the Bosphorus-3DFA [38] are known representatives of paired face RGB-depth data of high quality and include a variety of head poses and emotional expressions. Both databases provide facial landmarks for additional face expression analysis, but with approximately 100 different individuals each, the sets are rather small. A larger number of facial depth models can be derived from 3D synthetic data of human faces as in [27,44]. Extending the task to whole body depth estimation is challenging due to the fact that RGB-depth pairs of real individuals are not abundant in many datasets. A small dataset of 25 video clips for detailed human depth estimation is proposed in [39], while a depth dataset of 10 sequences recorded from different viewpoints is published in [42]. The Human3.6M dataset [23] contains high-resolution depth data from 11 individuals acting in varying scenarios. Varol et al. [40] propose using the approximately 68k video clips of synthetic humans in the large-scale SURREAL dataset for supervised training of human body depth and segmentation models.

B Network Details
In the following, k denotes the kernel size, s the stride, and channels the number of layer output channels. Input corresponds to the input of each layer. Network input and output are denoted by I and O, respectively, where for a generator network the output channel size equals 1 (RGB-to-depth) or 3 (depth-to-RGB).

C Facial Depth Estimation on Bosphorus-3DFA and CelebAMask-HQ
Section 4.2 demonstrates the plausibility of our proposed framework for fully unsupervised facial depth estimation using the small Texas-3DFRD [17]. In that data set, however, the shooting position of the portrayed faces is always constant: it consists exclusively of frontal views, the illumination direction is consistent and all images are individually cropped to the facial region. The goal of this section, in contrast, is to train a model that is capable of generating depth profiles from arbitrary portrait images that are at least sufficient for reasonable viewpoint augmentation. To accomplish this, we make use of the following two data sets: the Bosphorus Database for 3D Face Analysis (Bosphorus-3DFA) [38] and the CelebAMask-HQ [32] that records face portraits.
The Bosphorus-3DFA consists of 105 individuals, where for each person, in contrast to the Texas-3DFRD, varying poses, different head rotations and occlusions (e.g. eyeglasses, long hair) are available. Pixel-aligned depth samples represent absolute depth and are preprocessed to the range [0, 1]. Analogously to Section 4.2, we resize all RGB frames and depth profiles to a dimension of 256 × 256 and conduct data augmentation via random cropping. This results in 11k samples per modality. Although this database contains different positions and facial expressions, its decisive disadvantage is that all images were taken with constant lighting and with the same background (cf. Figure 9). Therefore, we add the CelebAMask-HQ to our experiment.
The CelebAMask-HQ is a large-scale facial portrait dataset with high-resolution face images of 30k celebrities selected from the CelebA dataset [46]. Each sample is provided with a segmentation mask of face attributes, and the database is therefore used to train and evaluate face analysis, face recognition and segmentation algorithms. In our opinion, this database is particularly well suited for depth prediction of arbitrary portraits, as it consists of RGB images with different exposures and different image backgrounds. Furthermore, all images are already cropped to a face-bounding box. We randomly select 10k RGB frames and resize them to a dimension of 256 × 256. The RGB images of the Bosphorus-3DFA and all samples of the CelebAMask-HQ are used as training data for the RGB domain, while the depth profiles of the Bosphorus-3DFA are used for the depth domain. We conduct unsupervised training of our proposed framework as described in Algorithm.

We qualitatively benchmark our proposed method against Wu et al. [45], who introduce a method for fully unsupervised 3D modeling from single images. More precisely, they propose a network that factors each input RGB image into depth, albedo, viewpoint and illumination. In order to disentangle these components without any supervision via paired data, stereo pairs or video sequences, Wu et al. exploit the fact that faces are, in principle, symmetric. Consequently, their method for image disentanglement can also be applied to other object categories, provided that these have a symmetrical structure. The work of Wu et al. is one of the few that has been developed specifically for 3D modeling in settings where neither supervision via paired RGB-depth data nor video sequences or stereo images are available. The method has been evaluated on several databases of cat and human faces, including the CelebA. For visual comparison we make use of the publicly available demo version² provided by the authors. We visually evaluate the success of the proposed unsupervised approach and present in Figure 10 synthesized 3D models that were created from RGB images of the Bosphorus-3DFA, the CelebAMask-HQ, and images in the wild.

Figure 2: Illustration of the proposed framework: The left part describes the domains in which the RGB-to-depth generator $G_{\theta_Y}$ and the reverse depth-to-RGB generator $G_{\theta_X}$ operate. Both generators are updated via the probabilistic Wasserstein-1 distance, estimated by $f_{\omega_Y}$ in the input and $f_{\omega_X}$ in the target domain. Perceptual similarity is compared between each generator input and its reconstruction. The right plot indicates that during inference, only $G_{\theta_Y}$ has to be deployed to synthesize new depth profiles. RGB images and ground truth depth images were taken from the Texas-3DFRD [17].
where the Lipschitz continuity of $f_{\omega_Y}$ can be enhanced via a gradient penalty [16]. Given training batches $y = \{y_n\}_{n=1}^{b}$, $y_n \overset{\mathrm{iid}}{\sim} P_Y$ and $x = \{x_n\}_{n=1}^{b}$, $x_n \overset{\mathrm{iid}}{\sim} P_X$, this yields the following empirical risk for critic $f_{\omega_Y}$: $R_{\mathrm{cri}}(\omega_Y, \theta_Y, p, y, x) := \frac{1}{b} \sum_{n=1}^{b} \dots$

Figure 3: The first column visualizes the RGB samples and the second column the grayscale versions. The third column contains the gamma-corrected counterparts, where the contrast in lower gray levels is enhanced for dark images in particular. The last column illustrates the application of the highpass filter.
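The three filter stages named in the caption of Figure 3 can be illustrated with a minimal pure-Python sketch. The luma weights, the exponent gamma = 0.5 and the box-blur-based highpass are assumptions chosen for illustration only; this excerpt does not specify the parameters or kernels actually used, and the function names are hypothetical.

```python
def to_gray(rgb):
    """Grayscale of an H x W image of (r, g, b) tuples in [0, 1] (assumed luma weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in rgb]


def gamma_correct(gray, gamma=0.5):
    """Power-law mapping; gamma < 1 stretches contrast in dark regions (assumed value)."""
    return [[v ** gamma for v in row] for row in gray]


def highpass(gray):
    """Subtract a 3x3 box blur (borders replicated) so that only fine detail remains.

    This kernel is an assumed stand-in for the highpass filter of the figure.
    """
    h, w = len(gray), len(gray[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), h - 1)  # clamp row index at the border
                    jj = min(max(j + dj, 0), w - 1)  # clamp column index at the border
                    acc += gray[ii][jj]
            row.append(gray[i][j] - acc / 9.0)
        out.append(row)
    return out
```

With these definitions, a constant image yields an all-zero highpass response, whereas isolated bright pixels and edges are retained, which matches the qualitative behaviour described for the last column.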
consists of 118 individuals and a variety of facial expressions and corresponding depth profiles are available for each of them. Depth pixels represent absolute depth and their values are in [0, 1] where 1 represents the near clipping plane while 0 denotes the background. We randomly select 16 individuals as evaluation data and use the remaining samples as training data. For unsupervised training, we randomly select 50 % of the training individuals for the input do-

Figure 4: From left to right: Surface RGB input, ground truth and profiles predicted by our method, gcGAN and cycleGAN.

Figure 5: An instant 3D model generated by our proposed framework provides valuable information on the liner surface condition.

Figure 7: An example of viewpoint augmentation using a 3D face model instantly generated by our proposed framework.
[40], cropping each frame to the human bounding box and resizing/padding images to a dimension of 256 × 256 pixels. In addition, for each image, we subtract the median of depth values to fit the depth images into the range ±0.4725 meters, where values less than or equal to −0.4725 denote background. During optimization, RGB images are scaled from [0, 255] to [−1, 1] and depth profiles are scaled from [−0.4725, 0.4725] to [−1, 1], whereas the evaluation metrics RMSE and MAE are computed on the original depth scale in meters.
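The scaling conventions above can be summarized in a minimal Python sketch. The function names are hypothetical, images are represented as flat lists of values for brevity, and the clipping step in `center_depth` is an assumption about how values are fitted into the stated range; the sketch is not taken from the original implementation.

```python
import math
from statistics import median

DEPTH_LIMIT = 0.4725  # meters; half-width of the depth range stated in the text


def center_depth(depth_m):
    """Subtract the median depth, then clip to [-DEPTH_LIMIT, DEPTH_LIMIT] (clipping assumed)."""
    med = median(depth_m)
    return [max(-DEPTH_LIMIT, min(DEPTH_LIMIT, d - med)) for d in depth_m]


def scale_rgb(values):
    """Map 8-bit intensities from [0, 255] to [-1, 1]."""
    return [v / 127.5 - 1.0 for v in values]


def scale_depth(depth_m):
    """Map centered depth from [-DEPTH_LIMIT, DEPTH_LIMIT] meters to [-1, 1]."""
    return [d / DEPTH_LIMIT for d in depth_m]


def unscale_depth(depth_scaled):
    """Map generator outputs in [-1, 1] back to meters before computing metrics."""
    return [d * DEPTH_LIMIT for d in depth_scaled]


def rmse(pred_m, target_m):
    """Root mean squared error in meters."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_m, target_m)) / len(target_m))


def mae(pred_m, target_m):
    """Mean absolute error in meters."""
    return sum(abs(p - t) for p, t in zip(pred_m, target_m)) / len(target_m)
```

In this sketch, a predicted profile would first be passed through `unscale_depth` so that `rmse` and `mae` are reported in meters rather than in the normalized range used during optimization.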

Figure 8: From left to right: Body RGB input, ground truth and profiles predicted by the proposed method, gcGAN, cycleGAN and CUT.

Figure 10: From left to right: RGB input, four snapshots of the synthesized 3D model generated by our method and four snapshots of the synthesized 3D model generated by Wu et al. [45].

Table 1: Unsupervised surface depth estimation. The reported metrics are RMSE and MAE between the ground truth and the synthesized depth, evaluated on unseen data (smaller is better).

Table 2: Unsupervised face depth estimation. The reported metrics are RMSE and MAE between the ground truth and the synthesized depth, evaluated on unseen data (smaller is better).

Table 3: Unsupervised body depth estimation. The reported metrics are RMSE and MAE between the ground truth and the synthesized depth, evaluated on unseen data (smaller is better).

Table 4: ResNet18 generator. The encoder is quite similar to the architecture illustrated in [19]. The decoder architecture is a slightly modified version of [14]. For upsampling, the nearest-neighbor method is used. Convolution layers followed by an instance normalization are denoted by convnorm.

Table 6: Residual block. A residual block (res-block) with kernel size k, stride s and channel size c is implemented as follows: