Cast Shadow Generation Using Generative Adversarial Networks

We propose a computer graphics pipeline for 3D rendered cast shadow generation using generative adversarial networks (GANs). This work is inspired by existing regression models as well as other convolutional neural networks, such as U-Net architectures, which can be geared to produce believable global illumination effects. Here, we use a semi-supervised GAN model comprising a PatchGAN and a conditional GAN, complemented by a U-Net structure. We adopted this structure because of its training ability and the quality of the results it produces. Unlike other forms of GANs, the chosen implementation utilises colour labels to generate visually coherent output. We carried out a series of experiments, using laboratory-generated image sets, to explore the extent to which colour labels can create the correct shadows for a variety of 3D shadowed and un-shadowed images. Once an optimised model is achieved, we apply high-resolution image mappings to enhance the quality of the final render. As a result, we have established that the chosen GAN model can produce believable outputs with correct cast shadows and plausible scores on the PSNR and SSIM similarity metrics.


Introduction
Shadow generation is a popular computer graphics topic. Depending on the level of realism required, the algorithms can be real-time, such as shadow mapping [23] and shadow projection [3], or precomputed, such as ray marching techniques, which are an expensive way to generate realistic shadows. Thus, pre-computing often becomes a compelling route to take, as in [28], [2] and [6]. Generative Adversarial Networks have been implemented widely to perform graphical tasks, as they require minimal to no human interaction, which gives GANs a great advantage over conventional deep learning methods. Examples include image-to-image translation with a single D, G semi-supervised model [7] and unsupervised dual learning [26].
We apply image-to-image translation to our own image set to generate correct cast shadows for 3D rendered images in a semi-supervised manner using colour labels. We then augment the output with a high-resolution image to enhance the overall quality. This approach can be useful in real-time scenarios, such as games and Augmented Reality applications, since recalling a pre-trained model is less costly in time and quality than real-time 3D rendering, which often sacrifices realism to enhance performance. Our approach eliminates the need to constantly render shadows within a 3D environment and only recalls a trained model at the image plane using colour maps, which can easily be generated in any 3D software. The model we use combines PatchGAN and conditional GAN because of their ability to compensate for missing training data and to tailor the output image to the desired task.
There are many benefits of applying GANs to computer graphics tasks. GANs can infer predictions from missing training data, which means a smaller training data set is needed compared to classical deep learning models. GANs can operate in multi-modal scenarios, where a single input can generalise to multiple acceptable answers. Also, the output images are sharper because of how GANs learn the cost function: it is based on a real-fake decision, whereas traditional deep learning models minimise the Euclidean distance by averaging all plausible outputs, which usually produces blurry results. Finally, GAN models do not require label annotations or classifications.
The rest of this paper is structured as follows. Section 1 reviews related work in terms of traditional shadow algorithms, machine learning and GANs, then in Sect. 2 we explain the construction of the generative model we used. In Sect. 3 we present our experiments from general GAN to cGAN and DCGAN and ending with Pix2pix. In Sect. 4 we discuss our results. Then, in Sects. 5 and 6 we present the conclusion and future work, respectively.

Generative Networks
Recent advancements in machine learning have benefited computer graphics applications immensely. In terms of 3D representation, modelling chairs of different styles from large public domain 3D CAD models was proposed by [1]. [25] applied a Deep Belief Network to create representations of volumetric shapes. Similarly, supervised learning can be used to generate chairs, tables, and cars using upconvolutional networks [5], [4]. Furthermore, the application of CNNs has extended into rendering techniques, such as fast global illumination rendering in a single scene [19] and image-based relighting from a small number of images [18]. Another application filters Monte Carlo noise using a non-linear regression model [8].
Deep convolutional inverse graphics network (DC-IGN) enabled producing variations of lighting and pose of the same object from a single image [10]. Such algorithms can provide full deep shading by training end to end to produce dense per-pixel output [16]. One recent method applies multiple algorithms to achieve real-time outputs; it is based on a recommendation system that learns the user's preference with the help of a CNN and then allows the user to make adjustments using a latent space variant system [30]. GANs have been implemented widely to perform graphical tasks. In cGANs, for instance, feeding a condition into both the D and G networks is essential to control the output images [14]. Another approach trains a domain discriminator to maintain relevancy between the input image and the generated image, transferring the input domain into a target domain at the semantic level and generating target images at the pixel level [27]. Semi-supervised and unsupervised models have been approached in various ways, such as trading mutual information between observed and predicted categorical class information [20]. This can also be done by enabling image translators to be trained from two sets of unlabelled images from two domains [26], or by translating both the image and its corresponding attributes while maintaining the permutation invariance property of the instance [15], or even by training a generator along with a mask generator that models the missing data distribution [11]. [12] proposed an image-to-image translation framework based on Coupled GANs that learns a joint distribution of images in different domains by using images from the marginal distributions in individual domains. Also, [13] uses an unsupervised model that performs image-to-image translation from few images. [21] made use of other factors such as structure and style.
Another method eliminated the need for image pairs by training a mapping G : X → Y until the distribution of G(X) is indistinguishable from the distribution of Y, as demonstrated in [29]. Another example discovers cross-domain relations [9]. Another approach forces the discriminator to produce class labels by predicting which of N + 1 classes the input belongs to during training [17]. Another model can generate 3D objects from a probabilistic space with volumetric CNNs and GANs [24].

Proposed Network Structure
The model we use in this paper is a TensorFlow port of the PyTorch image-to-image translation model presented by Isola et al. [7]. This approach can be generalised to any semi-supervised model. However, this model serves us better because of its two network types: PatchGAN, which allows better learning for interpreting missing data and partial data generation, and conditional GAN, which allows semi-supervised learning to facilitate control over the desired output images by using colour labels. It also replaces the traditional encoder-decoder generator with a U-Net structure with skip connections, which serves two purposes. First, it addresses the known drop model issue in the traditional GAN structure. Second, it helps transfer more features across the bottleneck, which reduces blurriness and yields larger, higher-quality images.
The objective of a GAN is to train two networks (see Fig. 1) to learn the correct mapping function through gradient descent, so as to produce outputs y that are believable to the human eye. Following [7], the conditional GAN here learns a mapping from an observed image x and random noise z to the output, such that
\[ G : \{x, z\} \rightarrow y \]
where x is an observed image, z is a random noise vector, and y is the output image. The generator G and the discriminator D operate on a "real" or "fake" basis. This is achieved by training both networks simultaneously with different objectives: G is trained to produce images that are as realistic as possible, while D is trained to distinguish which are fake. Thus the conditional GAN (cGAN) objective can be expressed as
\[ \mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] \]
The aim is for G to minimise this objective against the discriminator D, which aims to maximise it, such that
\[ G^{*} = \arg\min_{G} \max_{D} \mathcal{L}_{cGAN}(G, D) \]
For comparison, in an unconditional variant where the discriminator does not observe x, the objective becomes
\[ \mathcal{L}_{GAN}(G, D) = \mathbb{E}_{y}[\log D(y)] + \mathbb{E}_{x,z}[\log(1 - D(G(x, z)))] \]
Here the L1 distance is used instead of L2 to reduce blurring, such that
\[ \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert y - G(x, z) \rVert_{1}] \]
The final objective, in which λ weights the L1 term, becomes
\[ G^{*} = \arg\min_{G} \max_{D} \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G) \]
Both networks follow the convolution-BatchNorm-ReLU structure. However, the generator differs by following the general U-Net structure, and the discriminator is based on Markov random fields. The U-Net model allows better information flow across the network than the encoder-decoder model by adding skip connections over the bottleneck between layer i and layer n − i, where n is the total number of layers. Each skip connection concatenates all channels at layer i with those at layer n − i, thus producing sharper images.
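As a rough illustration of how the conditional adversarial term and the L1 term combine, the sketch below evaluates both on toy numbers using only the Python standard library. The function names, the toy inputs and the weight `lam = 100.0` are assumptions made for illustration (the weighting is not stated in this paper), and the outputs of D are assumed to be probabilities in (0, 1). It is not the training code used in our experiments.

```python
import math

def cgan_loss(d_real, d_fake):
    """Batch mean of log D(x, y) plus batch mean of log(1 - D(x, G(x, z))).

    d_real: discriminator probabilities for real (x, y) pairs.
    d_fake: discriminator probabilities for generated (x, G(x, z)) pairs.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def l1_loss(target, output):
    """Mean absolute difference between target y and output G(x, z)."""
    return sum(abs(t - o) for t, o in zip(target, output)) / len(target)

def generator_objective(d_fake, target, output, lam=100.0):
    """Quantity G tries to minimise: the fake term of the cGAN loss plus lam * L1.

    lam is a hypothetical weighting chosen for illustration only.
    """
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return fake_term + lam * l1_loss(target, output)

# Toy values: D is fairly confident on real pairs and unsure on fakes.
d_real = [0.9, 0.8]
d_fake = [0.4, 0.5]
target = [0.0, 0.5, 1.0]
output = [0.1, 0.5, 0.8]

print(cgan_loss(d_real, d_fake))
print(l1_loss(target, output))
print(generator_objective(d_fake, target, output))
```

D would be updated to increase `cgan_loss`, while G would be updated to decrease `generator_objective`; the L1 term pulls the output towards the ground truth pixel by pixel.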
For the discriminator D, an N × N PatchGAN is applied to minimise blurry results: the image is treated as small patches, each classified as real or fake, and the patch decisions are averaged to produce the overall output of D. The image is thus modelled as a Markov random field. Isola et al. [7] refer to this PatchGAN as a form of texture/style loss.
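The patch-wise decision can be pictured as follows. This is a minimal sketch rather than the network itself: a real PatchGAN produces its grid of scores with convolutions, whereas here a hypothetical `score_patch` function (simply the mean intensity of the patch) stands in for the learned classifier, and non-overlapping patches are assumed for brevity.

```python
def split_into_patches(image, n):
    """Yield non-overlapping n x n patches from a 2D list of pixel values."""
    height, width = len(image), len(image[0])
    for top in range(0, height - n + 1, n):
        for left in range(0, width - n + 1, n):
            yield [row[left:left + n] for row in image[top:top + n]]

def score_patch(patch):
    """Hypothetical stand-in for the learned classifier: mean intensity in [0, 1]."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)

def patch_discriminator(image, n):
    """Score every patch as real/fake, then average the scores into one output."""
    scores = [score_patch(p) for p in split_into_patches(image, n)]
    return sum(scores) / len(scores)

# 4 x 4 toy image, scored in 2 x 2 patches.
image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 1.0, 1.0],
    [0.5, 0.5, 1.0, 1.0],
]
print(patch_discriminator(image, 2))
```

Because each score depends only on its own patch, the discriminator penalises structure at the scale of a patch and assumes independence between pixels further apart, which is the Markov random field view described above.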
The optimisation process alternates gradient descent steps between D and G. G is trained to maximise log D(x, G(x, z)), and the objective is divided by 2 while optimising D to slow the rate at which D learns. Minibatch stochastic gradient descent with the Adam solver is applied, with a learning rate of 0.0002 and momentum parameters β1 = 0.5 and β2 = 0.999. This allows the discriminator to compare minibatch samples of both generated and real samples. At inference time, the G network runs with the same settings as in the training phase: dropout is applied at test time, and batch normalisation is applied to the test batch, with a batch size of 1. Finally, random jitter is applied by resizing the 256 × 256 input image to 286 × 286 and then randomly cropping it back to its original size of 256 × 256. Further manual processing in Photoshop is applied to enhance the quality of the output image, by mapping a higher-resolution render of the model over the output image, thus delivering a more realistic final image.
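The random jitter step can be sketched in a few lines. The example below is an illustration under stated assumptions, not the preprocessing code of the TensorFlow port: it works on a single-channel image stored as a nested list, uses nearest-neighbour resizing (the interpolation method is not specified in this paper), and the helper names are hypothetical.

```python
import random

def resize_nearest(image, new_h, new_w):
    """Resize a 2D list with nearest-neighbour sampling."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

def random_crop(image, size, rng):
    """Cut a size x size window at a random offset."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]

def random_jitter(image, enlarged=286, size=256, rng=random):
    """Enlarge the image, then randomly crop it back to its original size."""
    return random_crop(resize_nearest(image, enlarged, enlarged), size, rng)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[(r + c) % 256 for c in range(256)] for r in range(256)]
jittered = random_jitter(image, rng=rng)
print(len(jittered), len(jittered[0]))
```

In a paired setting, the same crop offsets would presumably need to be applied to the colour label and to its target image so that the two stay aligned.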

Experiments
Here we report our initial experiments for shadow generation, as well as minor shading functions to support it. Our approach is data driven; it focuses on adjusting the image set in every iteration to achieve the correct output. To that end, we manually created the conditions we intended to test. Our image set is created using Maya with the Arnold renderer. All of our experiments are conducted on an HP Pavilion laptop with a 2.60 GHz Intel Core i7 processor, 8 GB of RAM and an Nvidia GeForce GTX 960M graphics card.

Image-to-Image Approach
We start with the assumption that GANs can generate both soft and hard shadows on demand, using colour labels and a relatively small training image set. Our evaluation is based on both a real-fake basis and similarity metrics. Real-fake implies that the images can be clearly evaluated visually, since the network by design does not allow poor-quality images. The similarity metrics applied here are those proposed by [22], namely PSNR and SSIM. PSNR, the peak signal-to-noise ratio, is the ratio between the maximum achievable power of the signal and the power of the corrupting noise in the reconstructed image, and is scored between 1 and 100. SSIM, the structural similarity index, compares the structural information of the two images and is scored between 0 and 1. For our image-to-image translation approach, we started with a small image set of 100 random cube renders with different lighting intensities and views: 80 for training and 20 for testing and validation. The results showed correct translations, as seen in the top section of Fig. 2 (b). However, the output colours of the background sometimes differed from the target images. Also, some of the environments contained patches and tiling artefacts, as shown in the bottom section of Fig. 2, which is understandable given the small number of training images.
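For reference, the two metrics can be sketched as below. This is a simplified illustration and not the evaluation code behind our reported scores: images are small nested lists with 8-bit values, `global_ssim` computes SSIM over a single window covering the whole image (the usual formulation averages over local windows), and the constants 0.01 and 0.03 are the commonly used SSIM defaults. All function names are hypothetical.

```python
import math

def _flatten(image):
    """Turn a 2D list of pixel values into a flat list of floats."""
    return [float(v) for row in image for v in row]

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in decibels between two equally sized images."""
    ref, tst = _flatten(reference), _flatten(test)
    mse = sum((a - b) ** 2 for a, b in zip(ref, tst)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def global_ssim(reference, test, max_value=255.0):
    """Single-window SSIM over the whole image (a simplification)."""
    x, y = _flatten(reference), _flatten(test)
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * max_value) ** 2, (0.03 * max_value) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

ground_truth = [[0, 128], [255, 64]]
generated = [[0, 120], [250, 64]]
print(psnr(ground_truth, generated))
print(global_ssim(ground_truth, generated))
```

Higher values indicate closer agreement with the ground truth for both metrics; identical images give an infinite PSNR and an SSIM of 1.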
Next, we trained the model with Stonehenge images. This scene is also lit with a single light, but without the variable intensity. The camera rotates 360 degrees around the Y axis. The image set totals 500 images: 300 for training and 200 for testing and validation.
We started with the question: can we generate shadows for non-shadowed images that are not seen during training? We worked around it by designing the colour label to our specific need. In the validation step we fed in images with no shadows, as in Fig. 3 (c), paired with colour labels that contain the correct shadows (a). As the results in Fig. 3 (b) show, the model translated the correct shadow, with the appropriate amount of drop-off, to the input image. Next, we explored generating only accurate shadows (Fig. 4 (b)) for non-shadowed images (c). This is accomplished by constructing the colour map to contain only shadows (a), while training the network with fully shadowed images as previously, paired with shadowed labels. The results show accurate translation of shadow direction and intensity (see Fig. 4 (b) and (d)).
For the third and final set of experiments, we used two render setups of the Stanford Dragon, one for training and the second for testing and validation. In each setup the camera rotates 360 degrees around the dragon's Y axis, with a different step angle for training and testing. The camera is also elevated by 30 degrees about the X axis to show a more complex, elevated view of the dragon model rather than an orthographic one. The image set is composed of 1600 training images, 600 testing images and 800 validation images. The training set (shown in Table 1) is broken into multiple categories, each represented by 200 images, sometimes with overlapping features. These images help us understand how the input image/label affects the behaviour of the output images. For example, if we feed a standard colour map with a coloured image, will the model translate the colours across, or will the output be of standard colour? Answering this will better inform the training process for future applications. All networks are trained from scratch using our image sets, and the weights are initialised from a Gaussian distribution with mean 0 and standard deviation 0.02.
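The weight initialisation described above can be illustrated with a short standard-library sketch. The helper `init_weights`, the 64 × 64 shape and the fixed seed are assumptions made for the example; in practice the framework's own initialiser would be used on each convolutional kernel.

```python
import random

def init_weights(shape, mean=0.0, std=0.02, seed=None):
    """Return a 2D nested list filled with samples from N(mean, std^2)."""
    rng = random.Random(seed)
    rows, cols = shape
    return [[rng.gauss(mean, std) for _ in range(cols)] for _ in range(rows)]

# Hypothetical 64 x 64 weight matrix, seeded for repeatability.
weights = init_weights((64, 64), seed=0)
flat = [w for row in weights for w in row]
sample_mean = sum(flat) / len(flat)
sample_std = (sum((w - sample_mean) ** 2 for w in flat) / len(flat)) ** 0.5
print(sample_mean, sample_std)
```

The printed sample statistics should sit close to 0 and 0.02, confirming that the weights start small and centred, so that no layer dominates at the start of training.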

Testing and Validation
The test image set consisted of 600 images that are not seen in the training phase. During this phase, some of the features learnt from the training set are tested and the weights are adjusted accordingly.
For validation, a set of 800 images not seen in the training set was used. These are derived from the testing image set but have been modified heavily. The objective here is to test the ability to generate soft and hard shadows from unseen label images, as well as colour shadows, partial shadows, and shadows for images with no shadows. In some cases we have also taken the image set to extremes in order to generate shadows for images that have a different orientation and colour map from their original labels.
From here, the experiments progressed in three phases. First, we trained the model with a focus on standard labels to produce soft and hard shadows, using an image set of 1000 images, of which 400 are dedicated to standard colours and shadows. The remaining 600 images are an arbitrary collection of the cases mentioned in the training section (Table 1), with similarly arbitrary testing and validation image sets of 300 images each. In this experiment, the standard colours and shadows provide a comprehensive 360-degree view of the 3D model. Purposefully choosing arbitrary coloured samples of 10-25 images, we created 6 colour variations for both the images and their respective colour labels. Our first objective was to observe whether the model can generalise colour information into texture, meaning that it fills in the details of the image from what has been learnt from the 360-degree view and overlays colour information on top. Even though the general details can be seen on the models, there were heavy artefacts such as blurring and tiling in some of the coloured images with fewer training images.
With that knowledge in mind, a second experiment was carried out. We adjusted the image set to reduce the tiling effect, which is mainly due to the lack of sufficient training images for specific cases. Hence, the number of training images was increased to 200 per case, bringing the training set to 1600 images.
In the validation image set, we pushed the model to extreme cases, such as pairing images with different colour maps and different directions, as well as partial and non-shadowed images. This brought the validation set to 800 images. We assumed beforehand that we would get the same tiling effect as in the previous experiment in cases with different angles or different colours.

Discussion
The significant training time is one of the main challenges we face: training for our set of experiments took between 3 days and one week on our laptop, which is long for a training set of 1600 images. This is why our image set was limited to 256 × 256 pixels. For this work, we overcome the issue by augmenting the output image with a high-resolution render. Once trained, however, the model should, with some optimisation, be capable of real-time execution, although we have not tested this. The method has two major limitations. First, it is still a semi-supervised model that needs pairs of colour maps and target images. Second, the colour maps and image pairs are created manually, and the process is labour intensive. These issues should be considered for future work.
Our method performed well in almost all cases, with minimal to no errors and sharp image reproduction, especially when faced with direct translation tasks (Fig. 5) and colour variations (Fig. 6). Even with partial and non-shadowed images, the colours remained consistent and translated correctly across most outputs. This is promising given the relatively small training set (approximately 200 images per case) we used. By examining Fig. 7, we notice that the model generalised correctly in most cases even though the colour maps are non-synchronised. This means our method has the breadth to interpret inputs correctly even when the training set falls short. However, it tends to take a more liberal approach in translating colour differences between the label and the target, with a bias towards the colour map. This was also visible in partial images, non-shadowed images, and soft and hard shadows. The network struggled mostly when more than two parameters were changed. For example, a partial image combined with a non-shadowed model translates well, whereas a partial image combined with a non-synchronised shadow and position starts to show tiling in the output image. The model seems to struggle more with position switching than with any other change, especially when it is also paired with a non-synchronised colour map. This usually manifests as noise, blurring and tiling (see Fig. 7): the colours remain consistent and true to the training images, and the shadows are correct in shape and intensity, but they are produced with noise and tiling artefacts.
We conducted our quantitative assessment by applying the similarity metrics PSNR and SSIM [22], and the results confirm the previous observations. In Table 2, the lowest scores were in the categories with non-synchronised image pairs, such as categories 3 and 19, while the image pairs that were approximately present in both training and testing, categories 4 and 5, scored the highest. Overall performance leaned towards the higher end of the score range.
Fig. 5. The initial phase of the third set, which shows direct translation tasks. Our focus here is mainly the ability to generate believable soft and hard shadows, as well as inpainting for missing patches.
Table 2. This table shows how each category performed in both PSNR and SSIM similarity indices between output images and corresponding ground truth images.

Conclusion
This paper explored a framework based on conditional GANs using a pix2pix Tensorflow port to perform computer graphic functions, by instructing the network to successfully generate shadows for 3D rendered images given training images paired with conditional colour labels. To achieve this, a variety of image sets were created using an off-the-shelf 3D program.
The first set targeted soft and hard shadows under standard conditions and coloured labels and backgrounds, using 6 colour variations in the training set to test different variations, such as partial and non-shadowed images. The image set consisted of 1000 training images and 600 images for testing and validation. The results were plausible in most cases but showed clear blurring and tiling for coloured samples that did not have enough training images paired with them.
Next, we updated the image set to 3000 images, with 1600 training images, providing an equal number of training images for each of the 8 cases. We used 600 images for testing, and 800 for validation, which included more variations such as partial and non-shadowed images. In the validation set, the images included extreme cases such as non-sync pairing of position and colour. The results were believable in all cases, except the extreme cases, which resulted in tiling and blurring.
The results are promising for shadow generation, especially when the model is challenged to produce accurate partial shadows from the training image set. The model reliably produces successful outputs for images not seen during the training phase, except when they are paired with different colours or viewpoints. However, there are still challenges to resolve. For example, the model requires a relatively long time to train, and the output images still suffer from minor blurriness.

Future Work
This is only a proof of concept. The next logical step is to optimise the process by training the model to create highly detailed renders from lower poly-count models. This can also be tested with video-based models such as HDGAN, although such a model is expected to output flickering results owing to its learning nature and the current state of the art. Another direction of interest may be to automate the generation of colour maps from video or a live feed, such as the work in [29]. The main challenge, however, is the computational complexity, especially for higher-resolution training.