ShadowGAN: Shadow synthesis for virtual objects with conditional adversarial networks

We introduce ShadowGAN, a generative adversarial network (GAN) for synthesizing shadows for virtual objects inserted in images. Given a target image containing several existing objects with shadows, and an input source object with a specified insertion position, the network generates a realistic shadow for the source object. The shadow is synthesized by a generator; using the proposed local adversarial and global adversarial discriminators, the synthetic shadow’s appearance is locally realistic in shape, and globally consistent with other objects’ shadows in terms of shadow direction and area. To overcome the lack of training data, we produced training samples based on public 3D models and rendering technology. Experimental results from a user study show that the synthetic shadowed results look natural and authentic.


Introduction
Inserting virtual objects into scenes has a wide range of applications in visual media, from movies, advertisements, and entertainment to virtual reality. Consistency of shadows between the original scene and the inserted object contributes greatly to the naturalness of the results. If no prior scene knowledge is provided, it requires much labor and expertise to make the scene look as realistic as possible, in a tedious photo or video editing process. Even an experienced editor spends much effort to produce convincing results using commercial editing software such as Adobe Photoshop. The difficulties in this process stem from the lack of accurate estimates of illumination and scene geometry.
In this paper, we address the shadow synthesis problem for virtual objects inserted in an image. Shadow synthesis can be implemented using rendering techniques, which require substantial information about the scene, such as illumination, scene models, and a rendering framework. Other methods [1][2][3][4] synthesize shadows from approximately estimated illumination and reconstructed scene geometry. Such computations require either user interaction or precise tools, and are time-consuming.
We propose to solve this problem using a novel deep learning-based framework without explicit knowledge of scene geometry and illumination. We use a convolutional neural network to directly predict the shadow map for a virtually inserted object, given only the target scene image and the specified insertion position in the image domain. Specifically, we use a generative adversarial network (GAN) framework, where the generator G tries to produce outputs that cannot be distinguished from "real" results, while the local discriminator $D_L$ and global discriminator $D_G$ try to detect the generator's "fakes" from local and global perspectives, respectively. During training, the generator and discriminators compete until convergence. As a result, a real-valued, single-channel shadow map is predicted, from which the edited result with a synthetic shadow can be generated by a simple pixel-wise multiplication with the original image. ShadowGAN places few constraints on its input, and it is computationally efficient because only a single feed-forward pass through the network is needed.
Our method works for an image of a static scene. We assume scene surfaces to be made of Lambertian materials and we do not model specular reflection or inter-reflections between surfaces in the scene. Despite these assumptions, we can produce plausible results. To summarize, the contributions of our work are:
• A convolutional neural network, ShadowGAN, which can synthesize shadows for virtually inserted objects in target images.
• A local-global conditional adversarial scheme for both shape and direction supervision in shadow synthesis.
• A practical dataset for shadow synthesis network training, produced using rendering techniques and public 3D models.

Related work
In this section, we discuss related prior work, mainly on shadow synthesis, shadow detection and removal, and image-to-image translation using generative adversarial networks.

Shadow synthesis
In image editing, knowledge of illumination and scene geometry is essential to achieving realistic shadow synthesis results. Previous methods have been proposed to recover such information from input images or videos. Intrinsic image decomposition algorithms aim to separate a single image I into a pixel-wise product of an albedo or reflectance layer R and a shading layer S [5][6][7][8]. The reflectance layer reveals how the material reflects incident light, and the shading layer accounts for illumination effects due to geometry, shadows, and inter-reflections. However, approaches based on pixel-wise illumination and reflectance maps are not effective enough to support complex editing operations such as object insertion. For visually plausible results, shadows must be carefully computed, which requires an analysis of scene geometry and lighting configuration in 3D space. The problem of estimating illumination from images, or inverse lighting, has also been investigated. Refs. [9,10] recover the illumination distribution in a scene from the shadows cast by objects of known shape. Khan et al. [11] proposed editing object materials in a static image. Liu et al. [4] estimated illumination and scene geometry from video for various video applications. Ge et al. [12] proposed an object-aware image editing approach to obtain consistency in structure, color, and texture in a unified way.
Rendering virtual objects into real scenes has long been investigated. A survey is provided by Kronander et al. [13]. Various ways have been explored to solve the problems of illumination and geometry recovery. Debevec [14] proposed estimating scene radiance and global illumination using a mirrored ball to capture a high-dynamic range lighting environment, to support object insertion. Karsch et al. [1] developed an image composition system to render synthetic objects into legacy photographs. The scene structure and area light are provided by user interaction or a data-driven approach [2].
Briefly, previous methods for shadow synthesis either require user interaction and scene knowledge, or recover explicit representations of scene geometry and illumination. Our method, in contrast, is novel in synthesizing shadows using a convolutional neural network without any requirements about the scene or the inserted object model.

Shadow detection and removal
The opposite problem to shadow synthesis, i.e., shadow detection and removal, has been studied in the computer vision community [15][16][17][18][19][20]. Its goals are to separate the target image into lit and shadowed areas, and thence to remove the shadows. In early work, color [15,20], edge [18], or segmentation [19] cues were used to build high-level features for shadow description. Ma et al. [21] introduced appearance harmonization that makes the appearance of a deshadowed region compatible with the rest of the image. Recently, convolutional neural networks for shadow removal have been proposed [16,17]. In Ref. [17], the input image is decomposed into a shadow-free image and a shadow matte; the shadow matte is predicted using a convolutional neural network. Two stacked conditional GANs successively detect the shadow region and remove the shadow matte.
In the shadow removal problem, the objects casting shadows are commonly absent, while in the shadow synthesis problem, a virtually inserted object is present.

Image-to-image translation using generative adversarial networks
Goodfellow et al. [22] first introduced the concept of the generative adversarial network (GAN), consisting of two sub-networks: a generator (G) and a discriminator (D). G's task is to generate outputs to resemble the ground truth, while D tries to distinguish between fake and real inputs, i.e., between generated output and the ground truth. G and D work against each other, and the ideal outcome is for G to produce outputs that D cannot discriminate. Since its introduction, the GAN method has been widely applied to image-to-image translation problems, such as face image synthesis [23][24][25], image super resolution [26], and image completion [27,28]. Variations of the GAN architecture have also been developed, including conditional GAN [29,30], CycleGAN [31], StarGAN [24], etc. Isola et al. [30] proposed a GAN network that translates an image into another domain, such as from a sketch to a photo, from architectural maps to photos, from black-and-white to color photos, etc. Their approach used a U-net structure inside the generator, enabling earlier convolutions to be concatenated with later deconvolutional layers to pass down information about the input. In an image completion task [27], the contents of an arbitrary image region conditioned on its surroundings are generated by a convolutional neural network. Later, Iizuka et al. [28] proposed an image completion network with global and local discriminators. The addition of a local discriminator helps scrutinize the details of the completed image. Portenier et al. [32] developed the Faceshop system which supports interactive face editing with user provided sketch and color information as input conditions for the GAN architecture. Wei et al. [33] proposed to learn adaptive receptive fields instead of manually selecting dilated convolutional kernels.
Our proposed ShadowGAN is an adaptation of the GAN framework, which uses a local discriminator to guarantee shape correctness and a global discriminator to guarantee that the shadow's direction and area are compatible with other objects' shadows.

Approach
Our proposed ShadowGAN is trained on synthetic data, where static scene images are rendered using 3D models indexed by ShapeNet [34]. Given an input target scene image $I_t$ including original objects with shadows and a virtually inserted object without shadow, whose position in the scene is specified by a mask $m_s$, our goal is to predict a shadow map S, with which the output image $I_o$ with a synthetic shadow can be obtained by a simple pixel-wise product operation $I_o = I_t \odot S$. With the scene image $I_t$ and source object mask $m_s$ as inputs, the shadow map S is predicted using a generative network (see Fig. 3), where a reconstruction loss and two adversarial losses are used to guarantee the synthesis produces realistic output.
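As a minimal sketch of the pixel-wise product described above, the following NumPy code applies a single-channel shadow map to an RGB image. The function name `apply_shadow_map`, the [0, 1] value range, and the final clipping step are assumptions of this illustration, not details taken from the paper.

```python
import numpy as np

def apply_shadow_map(scene_rgb, shadow_map):
    """Pixel-wise product I_o = I_t * S.

    scene_rgb:  float array of shape (H, W, 3), values in [0, 1]  (I_t)
    shadow_map: float array of shape (H, W), per-pixel coefficient (S)
    """
    if scene_rgb.shape[:2] != shadow_map.shape:
        raise ValueError("shadow map must match the image height and width")
    # Broadcast the single-channel map over the three color channels.
    out = scene_rgb * shadow_map[..., np.newaxis]
    # Clipping is an assumption of this sketch, to keep values displayable.
    return np.clip(out, 0.0, 1.0)

# Toy example: a uniform gray image with a darker square in one corner.
scene = np.full((4, 4, 3), 0.8)
shadow = np.ones((4, 4))
shadow[2:, 2:] = 0.5
result = apply_shadow_map(scene, shadow)
```

Pixels where the map equals 1 are left unchanged, while pixels with smaller coefficients are darkened proportionally.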
As a supervised deep learning-based image synthesis method, ShadowGAN requires paired input and ground truth images as training data, where the input scene image $I_t$ contains N objects (N ≤ 3 is assumed in our work) with shadows and one virtually inserted source object without a shadow; its mask $m_s$ indicating the insertion region is also provided. The ground truth shadow map S has the same size as $I_t$. Each position p of S is associated with a real number, indicating that the output synthetic image color $I_o(p)$ can be obtained by multiplying the scene image color $I_t(p)$ by the coefficient S(p), under the assumption that ambient light is present in the scene.
Such data are impractical to collect in real life. Firstly, scenes in which a few objects have shadows and one object is fully lit do not occur in reality, and if the virtually inserted object is instead copied and pasted from another photo, the ground truth shadow map S cannot be generated efficiently and realistically. Secondly, a wide variety of illumination, scenes, and camera configurations are required for training data, which is both tedious and challenging for real-life photo capture.
Instead of using real-life photos, we use rendering technology to generate the training data. We render each target scene image $I_t$ with N objects placed on the ground with shadows and one object with its shadow turned off. The shadow map S is generated by rendering a second scene image $\bar{I}_t$ with all the shadows turned on, then dividing it by $I_t$: $S = \bar{I}_t / I_t$.
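The division step can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function name, the averaging over color channels, the guard against division by zero, and the clipping to [0, 1] are all choices made for this example.

```python
import numpy as np

def shadow_map_from_renders(all_shadows_rgb, scene_rgb, eps=1e-6):
    """Ratio of the render with all shadows on to the input scene I_t.

    Both inputs are float arrays of shape (H, W, 3).
    """
    # Average the color channels to get one coefficient per pixel
    # (an assumption of this sketch).
    num = all_shadows_rgb.mean(axis=-1)
    den = scene_rgb.mean(axis=-1)
    # Where the scene is black the ratio is undefined; default to 1 (no shadow).
    ratio = np.ones_like(den)
    np.divide(num, den, out=ratio, where=den > eps)
    return np.clip(ratio, 0.0, 1.0)

# Toy example: one pixel falls inside the inserted object's shadow.
scene = np.full((2, 2, 3), 0.8)
scene[0, 1] = 0.0                  # a black pixel, to exercise the guard
all_shadows = scene.copy()
all_shadows[1, 1] = 0.4            # half as bright once the shadow is on
S = shadow_map_from_renders(all_shadows, scene)
```

Because the original objects' shadows appear in both renders, they cancel in the ratio, so only the inserted object's shadow remains in S.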

Scenes
We use a subset of commonly seen 3D model categories such as can, printer, bed, etc. from a publicly available dataset, ShapeNet [34]. The object categories used for rendering are listed in Table 1. In total, 9265 objects were selected for rendering scenes.
Table 1 ShapeNet 3D model categories used to render the target scene
bus, coach, mug, printer, stove, bowl, dishwasher, can, machine, motorcar, bag, grip, suitcase, bathtub, bookshelf, cabinet, auto, car, mailbox, microwave, washer, tower
To render realistic ground planes, we downloaded textures from the Internet using keyword searches for, e.g., woollen, stone, tablecloth. A total of 110 textures were randomly chosen for rendering the plane. In each target scene image, up to four objects were randomly selected from the model collection, one of them being the virtually inserted object, and the rest being the original objects in the scene. We assume each of the x, y, z coordinates to be in the range [−1, 1]: the ground plane is set to

Camera
The camera position $P_c = (x_c, y_c, z_c)$ was randomly chosen in 3D space within the range:
$$3.5 \le \sqrt{x_c^2 + y_c^2 + z_c^2} \le 4.5, \qquad \pi/6 \le \arcsin\left(\frac{z_c}{\sqrt{x_c^2 + y_c^2 + z_c^2}}\right) \le \pi/3$$

Illumination
All scenes were illuminated by a single white point light with fixed intensity. The distance between the light and the center of the floor was randomly chosen in a limited range: the light position $P_l = (x_l, y_l, z_l)$ was randomly chosen in the following range:
$$3.5 \le \sqrt{x_l^2 + y_l^2 + z_l^2} \le 4.5, \qquad \pi/4 \le \arcsin\left(\frac{z_l}{\sqrt{x_l^2 + y_l^2 + z_l^2}}\right) \le \pi/3$$
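One way to draw camera and light positions satisfying such constraints is to sample in spherical coordinates. The sketch below makes several assumptions the text does not state: the bounds 3.5 and 4.5 are read as limits on the distance from the origin, the distance, elevation, and azimuth are each drawn uniformly, and the azimuth is unconstrained. The function name `sample_position` is hypothetical.

```python
import math
import random

def sample_position(r_min, r_max, elev_min, elev_max, rng=random):
    """Sample (x, y, z) whose distance from the origin lies in [r_min, r_max]
    and whose elevation angle arcsin(z / r) lies in [elev_min, elev_max]."""
    r = rng.uniform(r_min, r_max)             # distance from the origin
    elev = rng.uniform(elev_min, elev_max)    # angle above the ground plane
    azim = rng.uniform(0.0, 2.0 * math.pi)    # assumed unconstrained
    x = r * math.cos(elev) * math.cos(azim)
    y = r * math.cos(elev) * math.sin(azim)
    z = r * math.sin(elev)
    return x, y, z

rng = random.Random(0)  # seeded for reproducibility
camera = sample_position(3.5, 4.5, math.pi / 6, math.pi / 3, rng)
light = sample_position(3.5, 4.5, math.pi / 4, math.pi / 3, rng)
```

Sampling the angles directly guarantees the elevation constraint by construction, which avoids rejection sampling in Cartesian coordinates.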

Rendering
We used path tracing [35] to render the scenes, with 128 samples per pixel. To find the mask of the inserted object, we rendered it again with its material set to pure black, and then extracted its mask from the rendered image.

Training data
As a result, 12,400 training samples were generated, each comprising a scene image $I_t$, a source object mask $m_s$, and a ground truth shadow map S, rendered at a resolution of 256 × 256.

Approach
Our goal is to train a generator G that learns a mapping from inputs x to shadow maps y, where each training input $x_i$ comprises a scene image and a source object mask, and $y_i = S_i$ are the corresponding shadow maps. The key requirement for learning is that the generated shadow map G(x) should reconstruct the ground truth shadow map, while being indistinguishable from ground truth shadow map data $y \sim p_{\text{data}}(y)$. We introduce a local discriminator $D_L$ and a global discriminator $D_G$, which are trained to detect the generated shadow maps as "fakes" in terms of local shape and of global direction and area, respectively. Our objective thus contains a reconstruction loss $\mathcal{L}_{L1}$, a local adversarial loss $\mathcal{L}^{L}_{GAN}$, and a global adversarial loss $\mathcal{L}^{G}_{GAN}$.

Reconstruction loss
Reconstruction loss is commonly used in supervised image-to-image translation problems [28,30,36], to constrain the generated result to be similar to the ground truth in an $L_1$ or $L_2$ sense. Here we use an $L_1$ norm reconstruction loss to measure the error between the predicted shadow map G(x) and the ground truth shadow map y:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\left[\lVert y - G(x) \rVert_1\right] \quad (1)$$

Local adversarial loss
The local discriminator $D_L$ tries to distinguish the generated fake results G(x) from real samples y from local considerations, so it only looks at the region around the source object. Intuitively, the generated shadow G(x) for the source object should be as similar as possible to the ground truth sample y within a local region. We crop a square region centered at the source object, of side half the original image size, i.e., 128 × 128 pixels, and only pass the cropped region of the predicted shadow map C(G(x)) or ground truth shadow map C(y), with conditional input scene image and source object mask C(x), to the local discriminator. Here, C(·) is the cropping operator. The local adversarial loss is defined to be
$$\mathcal{L}^{L}_{GAN}(G, D_L) = \mathbb{E}_{x,y}\left[\log D_L(C(x), C(y))\right] + \mathbb{E}_{x}\left[\log\left(1 - D_L(C(x), C(G(x)))\right)\right] \quad (2)$$
G tries to minimize this objective against the local adversary $D_L$, which tries to maximize it. $D_L$ takes the cropped version of either conditional real samples x, y or generated fake samples x, G(x) as inputs. The discriminator determines whether the samples are real or fake.
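The cropping operator C(·) can be sketched as below. The text only says that the region is centered at the source object, so two details here are assumptions: the center is taken as the center of the mask's bounding box, and the window is shifted inward when the object lies near an image border. The function name `crop_around_mask` is hypothetical.

```python
import numpy as np

def crop_around_mask(tensor, mask, size=128):
    """Crop a size x size window centered on the mask, kept inside the image.

    tensor: array of shape (H, W) or (H, W, C)
    mask:   array of shape (H, W), nonzero where the inserted object is
    """
    h, w = mask.shape
    if size > h or size > w:
        raise ValueError("crop size exceeds the image size")
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")
    # Center of the mask's bounding box (one possible definition of "center").
    cy = (ys.min() + ys.max()) // 2
    cx = (xs.min() + xs.max()) // 2
    # Shift the window inward when the object sits near an image border.
    top = int(np.clip(cy - size // 2, 0, h - size))
    left = int(np.clip(cx - size // 2, 0, w - size))
    return tensor[top:top + size, left:left + size]

# Toy example: an object near the bottom-left corner of a 256 x 256 image.
mask = np.zeros((256, 256))
mask[200:240, 10:30] = 1.0
image = np.zeros((256, 256, 3))
cropped_image = crop_around_mask(image, mask)
cropped_mask = crop_around_mask(mask, mask)
```

The same window should be applied to the shadow map, the scene image, and the mask so that the local discriminator sees aligned inputs.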

Global adversarial loss
The global discriminator $D_G$ tries to distinguish the generated fake results G(x) from real samples y using a global view of the whole shadow map. In particular, the generated shadow G(x) for the source object should be compatible with other objects' shadows in the original scene in terms of direction and area. The global adversarial loss is defined to be
$$\mathcal{L}^{G}_{GAN}(G, D_G) = \mathbb{E}_{x,y}\left[\log D_G(x, y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D_G(x, G(x))\right)\right] \quad (3)$$
where G tries to minimize this objective against the global adversary $D_G$, which tries to maximize it. $D_G$ takes either conditioned real samples x, y or conditioned generated fake samples x, G(x) as inputs.
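To make the minimize/maximize relationship concrete, the sketch below evaluates an adversarial objective from discriminator outputs. It assumes the standard conditional GAN form, E[log D(x, y)] + E[log(1 − D(x, G(x)))], with D producing a probability of "real"; the function name and the clipping constant `eps` are choices made for this illustration.

```python
import numpy as np

def adversarial_objective(d_real, d_fake, eps=1e-12):
    """Value of E[log D(x, y)] + E[log(1 - D(x, G(x)))].

    d_real: discriminator probabilities for real pairs (x, y)
    d_fake: discriminator probabilities for generated pairs (x, G(x))
    """
    # Clip to avoid log(0) for saturated outputs.
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# A discriminator that separates real from fake well scores close to 0 ...
confident = adversarial_objective([0.9, 0.95], [0.1, 0.05])
# ... while one that is fully fooled outputs 0.5 everywhere.
fooled = adversarial_objective([0.5, 0.5], [0.5, 0.5])
```

The discriminator pushes this value up toward 0, while the generator pushes it down by making its outputs harder to tell from real samples; the same function applies to both the local and the global discriminator.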

Full objective
The overall objective is the weighted sum of the loss terms:
$$\mathcal{L}(G, D_L, D_G) = \lambda \mathcal{L}_{L1}(G) + \mathcal{L}^{L}_{GAN}(G, D_L) + \mathcal{L}^{G}_{GAN}(G, D_G) \quad (4)$$
where λ = 200 controls the relative importance of the objective terms. The goal is to determine:
$$G^{*} = \arg\min_{G} \max_{D_L, D_G} \mathcal{L}(G, D_L, D_G) \quad (5)$$
Figure 3 visualizes the conditional shadow map generator. The generator takes an input of size 256 × 256 with 4 channels; 3 are RGB channels from the target scene and 1 is the source object mask $m_s$. The output is a single-channel shadow map of size 256 × 256. We adopt the encoder-decoder architecture proposed by Isola et al. [30], where skip connections (U-net) are set up to concatenate the corresponding layers in encoder and decoder. The generator downsamples the input using strided convolutions, followed by intermediate layers of dilated convolutions [37] before upsampling using transposed convolutions. We use the ReLU activation function after each layer except for the output layer, which uses a tanh activation function. In total, the proposed editing network has 15 convolutional layers with up to 256 feature channels.
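Returning to the weighted objective at the start of this section, the following sketch shows how the three loss terms might be combined numerically. It assumes that λ scales the reconstruction term, as in Ref. [30]; the helper names and the toy values are hypothetical.

```python
import numpy as np

LAMBDA = 200.0  # weight stated in the text

def l1_loss(pred, target):
    """Mean absolute error between predicted and ground truth shadow maps."""
    return float(np.mean(np.abs(pred - target)))

def full_objective(pred, target, local_adv, global_adv, lam=LAMBDA):
    """Weighted sum of the reconstruction and the two adversarial terms."""
    # Assumption of this sketch: lam multiplies the reconstruction term.
    return lam * l1_loss(pred, target) + local_adv + global_adv

# Toy values: a prediction that is off by 0.01 at every pixel.
target = np.ones((8, 8))
pred = target - 0.01
total = full_objective(pred, target, local_adv=-1.2, global_adv=-1.3)
```

With a per-pixel error of 0.01 the weighted reconstruction term equals 2.0, which is comparable in magnitude to the adversarial terms; this illustrates why a large weight is useful when shadow map errors are numerically small.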

Discriminator networks
Following Iizuka et al. [28] and Portenier et al. [32], we use local and global discriminators as adversaries for generator training (see Fig. 4). The input to the global discriminator is a 256 × 256 × 5 tensor: a fake shadow map sample $S_f$ or a real shadow map sample $S_r$, the conditional input target scene image $I_t$, and the inserted object mask $m_s$. The local discriminator uses the same input tensor but works on a cropped region of size 128 × 128 centered around the inserted object position. Both discriminators are fully-convolutional networks, with the spatial tensor dimension gradually downsampled to 1 × 1. Feature channels increase up to 512 channels then decrease to 1. The outputs of the discriminators are predictions of whether the inputs are more like real samples or fake ones. We use leaky ReLU activation functions with slope set to 0.2 everywhere in the discriminators, except for the last layer, which uses a sigmoid activation function. Full network architectural details are provided in Tables 2 and 3.
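The 5-channel discriminator input can be assembled as in the sketch below. The channel order and the function names are assumptions of this illustration. The helper `halvings_to_one` additionally counts how many stride-2 steps would reduce a given side length to 1; whether every downsampling layer uses stride 2 is an assumption here, as the actual layer configuration is given in the paper's tables.

```python
import numpy as np

def discriminator_input(shadow_map, scene_rgb, mask):
    """Stack shadow map (1), scene image (3), and mask (1) into H x W x 5."""
    return np.concatenate(
        [shadow_map[..., np.newaxis], scene_rgb, mask[..., np.newaxis]],
        axis=-1,
    )

def halvings_to_one(side):
    """Number of stride-2 downsampling steps needed to reduce side to 1."""
    steps = 0
    while side > 1:
        side //= 2
        steps += 1
    return steps

shadow = np.ones((256, 256))
scene = np.zeros((256, 256, 3))
mask = np.zeros((256, 256))
global_in = discriminator_input(shadow, scene, mask)
local_in = global_in[64:192, 64:192]  # an example 128 x 128 window
```

Under the stride-2 assumption, the global input would need 8 halvings and the local input 7 to reach a 1 × 1 output.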

Optimization and parameters
To optimize the proposed ShadowGAN, we follow Ref. [30] in which gradient descent steps for D and G are alternately performed. We apply the Adam solver [38] with learning rate set to 0.0002, and momentum

Initial tests
We have tested ShadowGAN on rendered synthetic scenes from the test set. The test set was rendered using the same rendering strategy as for the training set, with randomly selected models, object positions and orientations, illumination, and camera configurations. Time for shadow synthesis was about 0.3 s for a 256 × 256 input image on a Titan 1080 Ti graphics card. A gallery of corresponding synthetic shadowed results is shown in Fig. 8. Figure 5 shows synthetic results with the same scene and illumination, but viewed from randomly selected viewpoints. It can be seen that even when observed from different viewpoints, the synthetic shadows are visually realistic. As a further test, Fig. 6 shows results with the same scene and illumination, but slightly different camera poses caused by camera rotation. It can be seen that the synthetic shadows are temporally consistent during the camera movement. ShadowGAN supports inserting virtual objects in sequence. Figure 7 shows an example of step-by-step object insertions with shadows synthesized using our method.
As ShadowGAN is the first deep learning-based shadow synthesis network, we next present an ablation study to demonstrate the benefits of the components of our system, followed by a user study to verify whether fake results from ShadowGAN are indistinguishable from real ones.

Ablation study
In order to evaluate the effectiveness of the components of the proposed method, we re-evaluated ShadowGAN with alternative loss functions: with only the reconstruction loss (denoted as $L_1$), with the reconstruction loss and the local adversarial loss (denoted as $L_1$ + Local), and with the reconstruction loss and the global adversarial loss (denoted as $L_1$ + Global). Representative visual results are shown in Fig. 9. The results indicate that, with some losses turned off, the variants $L_1$, $L_1$ + Local, and $L_1$ + Global do not generalize well to the test samples and fail to predict visually plausible shadows with correct shape, area, and direction.
We also evaluated an input variation, in which the input source object position was not explicitly provided by a mask $m_s$ either for the generator or for the discriminators. Figure 10 provides a visual comparison under input variations. The results indicate that the source object mask $m_s$ is essential for ShadowGAN to obtain good results.

User study
To further assess whether the synthetic shadows for virtually inserted objects are visually natural and authentic, we conducted a user study with the task of observing and determining whether the shadows from our synthetic results look real. We also showed real scenes to the subjects and asked them to determine whether the images were real. We collected 20 pairs of real and fake shadowed results from the test scenes; each pair shows the same scene. We invited 20 subjects without viewing or perception issues to observe and rate the images. Each subject observed a randomly selected image from each scene pair, either the synthetic result or the real shadowed image, and assessed whether the shadows in the image were real. We collected all votes from the subjects, and summarise the results of the user study in Table 4. Overall, 50.48% of our synthetic shadows were assessed to be real images. Even shadows in the real images were sometimes considered to be fake; only 57.14% were considered to be real. The summary indicates that the visual quality of synthetic results from ShadowGAN is close to that of rendered scenes.

Limitations and conclusions
ShadowGAN has limitations. Firstly, as discussed in Section 3.1, our training set and test set were produced using rendering technology on public 3D models rather than using real-life photos. As collecting real-life photos with some objects' shadows turned off is a challenging task, we leave collecting and testing on real-life photos to future work. Secondly, when testing our model, a scene with only one virtually inserted object is fed into the network. Synthesizing shadows for multiple objects at once is not supported by ShadowGAN. However, as we have shown in the experimental results, users may iteratively perform insertion operations, one object at a time. Thirdly, as this is pioneering work that uses a GAN to synthesize shadows for virtual objects, we only tested our model on 256 × 256 images (as did Ref. [30]).
In summary, we have presented a generative adversarial network, ShadowGAN, which can synthesize shadows for virtual objects in images. Shadows are predicted by a generator which during training competes against a local discriminator and a global discriminator. To our knowledge, this is the first shadow synthesis solution using a deep learning-based framework. It places few constraints on its input and is computationally efficient. For network training, we have produced a large set of rendered scenes using public 3D models in commonly seen object categories. We believe both the training data and ShadowGAN will benefit the computer graphics and virtual reality communities.