AOGAN: A generative adversarial network for screen space ambient occlusion

Ambient occlusion (AO) is a widely used real-time rendering technique which estimates how much ambient light reaches each visible scene surface. Recently, a number of learning-based AO approaches have been proposed, bringing a new angle to screen-space shading via a unified learning framework with competitive quality and speed. However, most such methods have high error for complex scenes or tend to ignore details. We propose an end-to-end generative adversarial network for producing realistic AO, and explore how perceptual loss in the generative model affects AO accuracy. We also describe an attention mechanism that improves the accuracy of details; its effectiveness is demonstrated on a wide variety of scenes.


Introduction
Ambient occlusion (AO) is a rendering technique widely used in real-time 3D applications. It simulates the low-frequency shadowing of ambient light caused by object occlusion, which effectively improves the realism of rendered scenes. At present, there are two main groups of real-time AO approaches for dynamic scenes: screen-space ambient occlusion (SSAO) [1][2][3][4] and learning-based methods [5][6][7], both of which operate on screen-space information stored in a G-buffer. The former are efficient but often produce inaccurate results due to their empirical sampling strategies. The latter have gained much attention for their generality, but tend to miss details due to deficiencies in network structure or loss function.
In a learning-based shading framework, a common idea is to train a neural network to map scene attributes to AO, as well as to other screen-space shading effects. With their expressive power and shared convolutional filters, U-Net [8] and fully convolutional networks (FCNs) [9] have become the mainstream network structures for AO evaluation. Although these networks are representative, their predictive quality can still be improved [10,11].
In this paper, AO evaluation is regarded as a generative learning task. For generative problems, deep learning-based methods can achieve high structural similarity (SSIM) [12] and peak signal-to-noise ratio (PSNR) [13]. Although pixel-wise errors in these works [7,14,15] are reduced, there are still noticeable differences in visual quality compared to the ground truth; for example, results often exhibit undesirable smoothness. To further optimize the model in feature space, perceptual metrics can be introduced alongside traditional image metrics to achieve higher quality results [16,17]. Since the emergence of generative adversarial networks (GANs) [18], many researchers [19] have found that adversarial loss helps obtain better perceptual quality. An additional way to improve the perceptual similarity between generated results and the ground truth is to use a pre-trained VGG-19 [20] network to evaluate a perceptual loss. Although perceptual loss is popular in image transformation tasks, existing learning-based AO approaches only use per-pixel metrics like PSNR and SSIM, which emphasize pixel-wise differences rather than perceptual ones.
In AO results, human vision is sensitive to regions containing object edges. Self-occlusion helps in understanding object shape, while mutual occlusion helps in understanding distances between objects. Introducing an attention mechanism has two advantages. On the one hand, edge regions can be better integrated with global information through a redistribution of weights, making evaluation of occlusion around complex geometry more accurate. On the other hand, for scenes with rich details, the learning process is more difficult and time-consuming; the attention mechanism can extract the most significant feature subset through dimensionality reduction. For this purpose, we adopt an attention mechanism to learn AO, which makes occlusion around geometric details more accurate.
In summary, our contributions are as follows:
• an end-to-end trainable generative adversarial network, AOGAN, which can generate high-quality AO,
• use of perceptual loss to improve the overall visual quality of AO generation in complex scenes, and
• an attention mechanism to make evaluation of occlusion around complex geometry more accurate.

Screen-space ambient occlusion
SSAO was first proposed to overcome the heavy computation and extra storage of offline AO algorithms. Such methods are popular in modern 3D applications, but often produce artifacts such as white highlights at object edges or dark halos around object contours, caused by large depth differences between adjacent pixels in screen space. SSAO+ [4] extends sampling to a hemisphere oriented along the surface normal, removing the white highlights at object edges but still producing dark halos. Horizon-based ambient occlusion (HBAO) [22] approximates AO by finding the free horizon angle along each direction in a jittered sample pattern, and integrates over all directions; however, special care must be taken to avoid banding and noise. Alchemy screen-space ambient obscurance (ASSAO) [23] uses a falloff function with user-specified parameters in the occlusion integral to approximate a more realistic result, but still does not directly deal with errors introduced by the screen-space approximation. Vardis et al. [24] adopted an importance sampling strategy, determining the key regions for sampling by joint optimization over multiple views. Jimenez et al. [25] proposed a new form of integration, GTAO, which speeds up computation and improves accuracy. These methods share a common trait: empirical sampling parameters or thresholds must be carefully chosen to avoid artifacts.

Deep learning for ambient occlusion
Recently, there has been a trend to use deep learning to solve rendering problems, as in neural network ambient occlusion (NNAO) [5] and DeepShading [7], which perform better than traditional algorithms with comparable rendering cost. In NNAO, the authors trained a multilayer perceptron (MLP) on depth–normal pairs and ground-truth AO. Nalbach et al. [7] showed that CNNs are superior to MLPs because the former have larger receptive fields. However, because the loss function they adopted was pixel-level SSIM or L2 loss, they did not pay enough attention to edges or gaps between small objects, leading to local inaccuracies in occlusion (Fig. 1). Zhang et al. [21] improved rendering performance following the same idea as in Ref. [7], and also contributed their own datasets. Although some scenes achieve a high average SSIM, differences from the ground truth can still be seen (Fig. 1(c)).

Generative adversarial networks
GANs [18] have proved to be powerful at reproducing complex distributions. They can be classified into two categories according to the input data distribution: unconditional or conditional. Unconditional GANs generate realistic data from samples of a simple distribution such as Gaussian noise; DC-GAN [26] is a good example, designed primarily to produce general natural images. In conditional GANs [27], inputs related to the objective function are provided to the network to control the generated output. Thus, conditional GANs are very popular in conversion tasks, such as image translation [28], super-resolution [19], and neural rendering [29]. We model the AO task as a conversion problem, to be solved with a conditional GAN. With proper loss functions, it can capture differences both in feature space and at the pixel level, providing better visual quality.

Perceptual metrics
In the training process, the loss function plays an important role. Theoretical and experimental analyses [30,31] have shown that high-level features extracted from a pre-trained image classification network capture perceptual information about input images. Specifically, in a pre-trained image classification network, the representation extracted from the hidden layers helps interpret the semantics of input images. Researchers [19,32] have found that image generation can be improved by using the Euclidean distance between high-level image features as a measure of perceptual similarity. In addition, works using perceptual metrics, such as image style transfer [30,33] and novel view synthesis [31], have also achieved convincing results.

Attention mechanism
In deep neural networks, the weights on connections can be regarded as resources, which are assumed to be evenly distributed at the beginning of training.
With an attention mechanism, they are reallocated according to the importance of objects: more important ones get more resources. Recently, attention modules have been adopted in many tasks [34][35][36][37]. Vaswani et al. [37] first proposed using such a module to extract global dependencies of the input, for machine translation. At the same time, Zhang et al. [38] introduced attention mechanisms into GANs to obtain better generators. Inspired by these works, we also adopt an attention module in the generator to improve AO quality in high-frequency regions. Experiments in Section 4 show the effectiveness of this approach.

Structure
The main structure of our proposed AOGAN is illustrated in Fig. 2. The inputs are normals and positions from the G-buffer. They are fed into the generative network G to produce an initial result, which then passes through the discriminant network D and the VGG network. The resulting differences are combined into the loss function, and G and D are updated by back-propagation.

Components
When training G, mean squared error (MSE) is commonly used as a loss function for AO problems.
We introduce a new loss function, which takes semantic feature errors into account as well as pixel errors. The loss function $L_G$ is the weighted sum of the adversarial loss, the perceptual loss, and the content loss:
$$L_G = \lambda_1 L_{\mathrm{adv}} + \lambda_2 L_{\mathrm{per}} + \lambda_3 L_{\mathrm{DSSIM}} \tag{1}$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyper-parameters which control the weight of each term.
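The following is a minimal sketch of Eq. (1), assuming the three component losses have already been computed elsewhere; the default weights use the values reported later for training (λ_1 = 0.001, λ_2 = 2, λ_3 = 1), and the function name is illustrative.

```python
# Minimal sketch of the joint generator objective in Eq. (1). The component
# losses are assumed to be computed elsewhere; default weights follow the
# values reported later for training (lambda1 = 0.001, lambda2 = 2, lambda3 = 1).
def generator_loss(adv_loss, per_loss, dssim_loss,
                   lam1=0.001, lam2=2.0, lam3=1.0):
    """Weighted sum of adversarial, perceptual, and content (DSSIM) losses."""
    return lam1 * adv_loss + lam2 * per_loss + lam3 * dssim_loss
```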

Adversarial loss
The generative network G maps an input x drawn from the training data distribution $p_x$ to an output $G(x; \theta_g)$ that should resemble real-world data, where $\theta_g$ denotes the generator parameters. As shown in Eq. (2), the discriminant network D is trained to maximize the probability of assigning the correct label both to real-world data and to $G(x; \theta_g)$, while the generative network G is trained to minimize $\log(1 - D(G(x)))$. The two networks are trained alternately by optimizing the value function $V(D, G)$:
$$\min_G \max_D V(D,G) = \mathbb{E}_{y \sim p_r}[\log D(y)] + \mathbb{E}_{x \sim p_x}[\log(1 - D(G(x)))] \tag{2}$$
To alleviate the mode collapse problem, we modify the original GAN loss function following WGAN [39]. The generator loss function becomes
$$\min_G \; -\mathbb{E}_{\hat{x} \sim p_g}[D(\hat{x})] \tag{3}$$
and the discriminator loss function is modified to
$$\min_D \; \mathbb{E}_{\hat{x} \sim p_g}[D(\hat{x})] - \mathbb{E}_{x \sim p_r}[D(x)] \tag{4}$$
where $x \sim p_r$ denotes the true data distribution and $\hat{x} \sim p_g$ the generated data distribution.
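A minimal TensorFlow sketch of such WGAN-style objectives is given below, assuming d_real and d_fake are the critic's scores for reference and generated AO; the function names are illustrative and not taken from the authors' code.

```python
import tensorflow as tf

def discriminator_loss(d_real, d_fake):
    # Eq. (4): the critic is trained to score real AO above generated AO.
    return tf.reduce_mean(d_fake) - tf.reduce_mean(d_real)

def adversarial_loss(d_fake):
    # Eq. (3): the generator is trained to raise the critic's score on its output.
    return -tf.reduce_mean(d_fake)
```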

Perceptual loss
As indicated above, pixel-wise losses and perceptual losses can be used simultaneously. Pixel-wise losses penalize differences in pixel space, but often produce blurry results [14,40]. Perceptual losses measure differences between high-level features of images extracted from a well-trained classifier. In order to reduce the distance between the generated result and the real output in feature space, we use the feature maps of the first several layers of a pre-trained VGG-19 [20] network, so as to obtain better perceptual quality. The perceptual loss is defined as
$$L_{\mathrm{per}} = \left\| \Phi_j(G(x)) - \Phi_j(y) \right\|_2^2 \tag{5}$$
where $\Phi$ denotes the intermediate feature output of the VGG network, $j$ denotes the layer, and $y$ is the ground-truth AO. Experiments show that using the layer 2_2 features achieves the best result.
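A minimal sketch of such a perceptual loss follows, assuming AO maps are single-channel images in [0, 1]. The choice of the Keras layer name "block2_conv2" as the layer-2_2 feature and the mean-squared reduction are assumptions made for illustration.

```python
import tensorflow as tf

# Frozen, ImageNet-pretrained VGG-19 truncated at an early feature layer.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(
    inputs=vgg.input, outputs=vgg.get_layer("block2_conv2").output)
feature_extractor.trainable = False

def perceptual_loss(generated_ao, reference_ao):
    # VGG-19 expects 3-channel inputs, so single-channel AO maps are tiled.
    g = tf.image.grayscale_to_rgb(generated_ao) * 255.0
    r = tf.image.grayscale_to_rgb(reference_ao) * 255.0
    phi_g = feature_extractor(tf.keras.applications.vgg19.preprocess_input(g))
    phi_r = feature_extractor(tf.keras.applications.vgg19.preprocess_input(r))
    return tf.reduce_mean(tf.square(phi_g - phi_r))
```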

Content loss
In order to further minimize the error between the estimated AO and the ground truth, we add an SSIM-based content loss to G. A larger SSIM weight can accelerate convergence of training, but causes excessive smoothness; to avoid this, we set its weight to 0.5:
$$L_{\mathrm{DSSIM}} = \frac{1 - \mathrm{SSIM}(G(x), y)}{2} \tag{6}$$
We use DSSIM because the other loss terms should be minimized, whereas SSIM should be maximized.

Figure 3 shows the generator structure, a U-Net with a down-sampling part and an up-sampling part. A typical down-sampling block is composed of four layers: a convolutional layer, a BN layer, a LeakyReLU activation layer (α = 0.2), and a pooling layer. An up-sampling block is composed of five layers: a deconvolutional layer, an attention layer, a convolutional layer, a BN layer, and a LeakyReLU activation layer. The attention layer, introduced in Section 3.4, connects corresponding up- and down-sampling parts to further improve the visual quality of the generated AO. A deconvolutional layer is used in the up-sampling block instead of an interpolation layer because it generalizes better and achieves better results at the cost of a little more computation. The discriminator network structure follows PatchGAN [28]: the image is convolved into N patches, which better represent local details, and the authenticity of the image is evaluated by averaging that of the patches.
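Below are two minimal TensorFlow sketches for this section: the DSSIM content loss of Eq. (6), and one down-sampling / up-sampling block pair as described above. Filter counts and kernel sizes are illustrative assumptions, and the attention-gated skip connection of the full model is replaced here by a plain concatenation (see the sketch in Section 3.4).

```python
import tensorflow as tf

def content_loss(generated_ao, reference_ao):
    # Eq. (6): SSIM converted into a distance (DSSIM) so it can be minimised.
    ssim = tf.image.ssim(generated_ao, reference_ao, max_val=1.0)
    return tf.reduce_mean((1.0 - ssim) / 2.0)

def down_block(x, filters):
    # Conv -> BN -> LeakyReLU(0.2) -> pooling; the pre-pool features form the skip.
    x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.LeakyReLU(alpha=0.2)(x)
    skip = x
    return tf.keras.layers.MaxPool2D()(x), skip

def up_block(x, skip, filters):
    # Deconv -> skip connection -> Conv -> BN -> LeakyReLU.
    x = tf.keras.layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
    x = tf.keras.layers.Concatenate()([x, skip])  # attention gate omitted in this sketch
    x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.LeakyReLU(alpha=0.2)(x)
```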

Attention mechanism
Our attention module (AM) is designed to improve the accuracy of the generated result; see Fig. 4. The output of the attention module at layer l is evaluated as
$$\alpha^{l} = \sigma_2\!\left(W_c^{\mathrm{T}}\, \sigma_1\!\left(W_x^{\mathrm{T}} x^{l} + W_g^{\mathrm{T}} g^{l}\right)\right), \qquad \hat{x}^{l} = \omega^{l}\, \alpha^{l} \odot x^{l} \tag{7}$$
where l stands for the l-th layer, $W_c^{\mathrm{T}}$, $W_x^{\mathrm{T}}$, and $W_g^{\mathrm{T}}$ denote convolution operations, $\sigma_1$ is the activation function, $\sigma_2$ is the sigmoid function, and $\omega^{l}$ is a per-layer weight. $\alpha^{l}$ and the feature map are multiplied element-wise, the purpose being to highlight relevant image regions and suppress irrelevant features. Adding this mechanism greatly improves the rendering of small and complex scenes, as will be shown in Fig. 6. The weight $\omega^{l}$ differs for each layer and is set to 1, 0.5, 0.25, 0.125, and 0.0625 respectively: with increasing network level, the size of the feature map decreases and the influence of each layer's weight increases, so the weights must be reduced.
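A minimal sketch of such an attention gate is given below, assuming the encoder skip features x_l and the decoder gating features g_l have matching spatial resolution; the 1×1 convolutions stand in for W_x, W_g, and W_c, and the choice of ReLU for σ_1 is an assumption.

```python
import tensorflow as tf

def attention_gate(x_l, g_l, inter_channels, omega_l):
    theta_x = tf.keras.layers.Conv2D(inter_channels, 1)(x_l)        # W_x^T x^l
    phi_g = tf.keras.layers.Conv2D(inter_channels, 1)(g_l)          # W_g^T g^l
    f = tf.keras.layers.Activation("relu")(theta_x + phi_g)         # sigma_1(...)
    alpha = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(f)   # sigma_2(W_c^T ...)
    # omega^l * alpha^l, applied element-wise to the feature map x^l.
    return omega_l * alpha * x_l
```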

Datasets and training
The dataset was provided by Nalbach et al. [7] and Zhang et al. [21], and consists of 105,000 pairs of deferred-shading G-buffer data for 27 scenes, with corresponding reference images. We used 90,000 pairs to train the network, 10,000 for validation, and the remaining 5000 for testing. Each pixel in the G-buffer contains a high-dimensional feature vector in which positions and normals may be correlated and partially redundant, but these redundant attributes incur almost no extra cost and help the network produce better AO. The output AO image is 512 × 512; it can be generated from complex scenes in just a few milliseconds.
We implemented all of our models using the TensorFlow deep learning framework. Training was conducted on a GTX Titan-X GPU. Since the model is fully convolutional and is trained on image patches, it can be applied to images of arbitrary size. During training, G and D were trained alternately, with G updated once for every five updates of D on average. We set the batch normalization decay to 0.9 and set λ_1 = 0.001, λ_2 = 2, λ_3 = 1 for L_G. Training used the Adam optimizer with a learning rate of 10^−4 for all models, a batch size of 8, and 30 epochs; these values empirically gave the best results on the validation set. Training an AOGAN network took 160 hours on average.
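A minimal sketch of one alternating training step under the settings above, assuming the generator, discriminator, and the loss functions sketched earlier are defined elsewhere; the update_generator flag models the schedule of roughly one G update per five D updates, and all names are illustrative.

```python
import tensorflow as tf

gen_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)
disc_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)

def train_step(gbuffer, reference_ao, update_generator):
    with tf.GradientTape(persistent=True) as tape:
        fake_ao = generator(gbuffer, training=True)
        d_real = discriminator(reference_ao, training=True)
        d_fake = discriminator(fake_ao, training=True)
        d_loss = discriminator_loss(d_real, d_fake)
        g_loss = (0.001 * adversarial_loss(d_fake)                  # lambda_1
                  + 2.0 * perceptual_loss(fake_ao, reference_ao)    # lambda_2
                  + 1.0 * content_loss(fake_ao, reference_ao))      # lambda_3
    disc_opt.apply_gradients(
        zip(tape.gradient(d_loss, discriminator.trainable_variables),
            discriminator.trainable_variables))
    if update_generator:  # G is updated once per several D updates
        gen_opt.apply_gradients(
            zip(tape.gradient(g_loss, generator.trainable_variables),
                generator.trainable_variables))
    del tape
```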

Comparison
To evaluate the performance of AOGAN, we conducted comparative experiments on our collected AO datasets. We calculated SSIM and PSNR between the generated AO and the ground truth to measure error. However, we believe that pixel-wise quantitative evaluation is not necessarily the most effective metric, as we want to generate AO that is perceptually consistent with the target rather than identical to the reference image pixel by pixel. We randomly selected multiple scenes, shown in Fig. 5: even when results have nearly identical SSIM scores with respect to the ground truth, there are obvious differences in visual quality. This difference stems from the loss function of the network, which only averages values over the image itself instead of capturing high-level semantics.
Thus, we also take perceptual loss as one of the criteria for quantitative evaluation. Quantitative comparative results are shown in Table 1; our approach achieves the best quality for several metrics. The experiments with IDs A2 and A3 in Table 1 show that although DeepShading (DS) has a higher SSIM value, HBAO has a better perceptual loss value than DS. HBAO considers some basic features of the occlusion environment in view space, and these features lead to lower perceptual loss; this also explains why some HBAO visualizations look better than those from DS. We randomly selected four different scenes and present visual results of multiple algorithms for comparison in Fig. 6. Our results are superior to those of previous algorithms both in overall visual effect and in occlusion details along object edges and at corners. We attribute this to the joint adversarial, perceptual, and content losses together with the attention module, which lead to more accurate generation and better image detail.

Overview
Next, we validate the contribution of each component of our approach through a thorough ablation study. To verify the effectiveness of our network architecture and loss function, we compare AOGAN with several ablated versions, for example:
• w/o gen+per: to assess the importance of perceptual quality in the network, we remove the discriminator and the VGG perceptual loss, leaving only the content loss.
We train these ablated models under the same training conditions and datasets. Quantitative results of the ablation studies are given in Table 2, and example outputs of the ablated methods are shown in Figs. 7 and 8.

Loss function
As we can see, our joint loss function achieves the best performance. The experiments for IDs B1 and B5 in Table 2 show that when the discriminator and VGG are removed, the SSIM value of B1 approximately reaches the level of the joint loss, but its perceptual loss value is the highest. One might therefore suspect that a CNN can generate high-quality AO without adversarial loss, and that the joint loss only slightly improves overall detail. However, Fig. 7(a) shows that adding the adversarial and perceptual losses significantly improves results in complex scenes, because the perceptual loss makes it easier to generate high-level semantic information. The SSIM value of B2 is lower because generator parameter updates are no longer dominated by the SSIM term, yet it still shows good visual results: see Fig. 7(c). In addition, to further study the importance of perceptual loss in the training process, we analyzed its effect by changing the value of λ_2; results are shown in Fig. 9. Making λ_2 smaller produces a smoother result, while making it larger produces darker occlusion (see Fig. 9(c)): in the joint loss function, changing λ_2 is equivalent to changing the relative proportion of λ_3, which leads to a larger difference in pixel space. L_per + L_DSSIM outperforms both L_per and L_DSSIM alone, which indicates that the two terms reinforce each other.

Network modules
By comparing IDs B1, B4, and B5, we see that our complete network achieves the best performance. The generator alone cannot converge quickly: as shown in Fig. 7(a), its results for most complex scenes are relatively rough, indicating that adversarial training accelerates convergence of the generator. In order to further explore the role of the attention module, three network models were trained under the same training conditions: DS, DS+AM, and DS_deep, where DS+AM adds a self-attention mechanism (AM) to the DeepShading network model and DS_deep deepens the network and increases its capacity using residual blocks. We applied the rendering framework proposed by Zhang et al. [21] to further improve the network rendering time from 12.5 ms to 3.5 ms. Comparative results are provided in Table 3.
IDs C1–C3 and C4–C5 show that both deepening the network and adding AM improve the accuracy of the model, but deepening the network hurts speed, making it unsuitable for real-time application. AM computes the covariance between each pixel and all other pixels to obtain a global estimate for each pixel; it thus achieves the effect of many additional convolutional layers at a small computational cost. As shown in Fig. 8, this module expresses the AO details of objects more clearly.
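The description above corresponds to a non-local, self-attention style operator. The following is a minimal sketch of that idea under assumed channel counts (channels must match the channel count of x); such a block is typically applied only to low-resolution feature maps, since the pixel-to-pixel weight matrix grows quadratically with image size.

```python
import tensorflow as tf

def self_attention(x, channels):
    # Query, key, and value projections via 1x1 convolutions.
    f = tf.keras.layers.Conv2D(channels // 8, 1)(x)
    g = tf.keras.layers.Conv2D(channels // 8, 1)(x)
    h = tf.keras.layers.Conv2D(channels, 1)(x)
    shape = tf.shape(x)
    f = tf.reshape(f, [shape[0], -1, channels // 8])
    g = tf.reshape(g, [shape[0], -1, channels // 8])
    h = tf.reshape(h, [shape[0], -1, channels])
    # Pixel-to-pixel affinities: each pixel is correlated with every other pixel.
    attn = tf.nn.softmax(tf.matmul(f, g, transpose_b=True), axis=-1)
    out = tf.reshape(tf.matmul(attn, h), shape)
    return x + out  # residual connection around the attention output
```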

Conclusions
In this paper, we have proposed a new GAN-based ambient occlusion method which generates accurate screen-space ambient occlusion for complex scenes. The improvement in quality does not come at the expense of running time: the network structure and parameter specifications are similar to those in previous work [7], so our running time is broadly comparable and achieves real-time performance. However, our work shares the limitation of all screen-space solutions that the G-buffers do not contain full information about the scene, which can cause artifacts at screen boundaries or where geometry has been culled. One way to overcome this is to use an extended G-buffer or a multi-layer depth buffer as input to approximate more accurate AO. The proposed method is also suitable for other screen-space shading effects, which we will consider in future work. Furthermore, we will explore enforcing temporal consistency of AO with a GAN to avoid flickering between frames.

Declaration of competing interest
The authors have no competing interests to declare that are relevant to the content of this article.