RoCGAN: Robust Conditional GAN

Conditional image generation lies at the heart of computer vision, and conditional generative adversarial networks (cGAN) have recently become the method of choice for this task, owing to their superior performance. The focus so far has largely been on performance improvement, with little effort in making cGANs more robust to noise. However, the regression (of the generator) might lead to arbitrarily large errors in the output, which makes cGANs unreliable for real-world applications. In this work, we introduce a novel conditional GAN model, called RoCGAN, which leverages structure in the target space of the model to address the issue. Specifically, we augment the generator with an unsupervised pathway, which promotes the outputs of the generator to span the target manifold, even in the presence of intense noise. We prove that RoCGAN shares similar theoretical properties with GAN and establish the merits of our model with both synthetic and real data. We perform a thorough experimental validation on large-scale datasets for natural scenes and faces and observe that our model outperforms existing cGAN architectures by a large margin. We also empirically demonstrate the performance of our approach in the face of two types of noise (adversarial and Bernoulli).

NVIDIA, Santa Clara, USA

field, e.g. in dense regression (Isola et al. 2017; Pathak et al. 2016; Ledig et al. 2017; Bousmalis et al. 2016; Liu et al. 2017; Miyato and Koyama 2018; Yu et al. 2018; Tulyakov et al. 2018). The major focus so far has been on improving the performance; we advocate instead that improving the generalization performance, e.g. as measured under intense noise and test-time perturbations, is a significant topic with a host of applications, e.g. facial analysis (Georgopoulos et al. 2018). If we aim to utilize cGAN or similar methods as a production technology, they need to have performance guarantees even under large amounts of noise. To that end, we study the robustness of conditional GAN under noise.
Conditional Generative Adversarial Networks consist of two modules, namely a generator and a discriminator. The role of the generator is to map the source signal, e.g. prior information in the form of an image or text, to the target signal. This mapping is completed in two steps: the source signal is embedded into a low-dimensional, latent subspace, which is then mapped to the target subspace. The generator is implemented with convolutional or fully connected layers, which are not invariant to (additive) noise. Thus, an input signal that includes (small) additive noise might be mapped arbitrarily off the target manifold (Vidal et al. 2017). In other words, cGAN do not constrain the output to lie in the target manifold, which makes them vulnerable to any input perturbation.
A notable line of research that tackles sensitivity to noise complements supervision with an unsupervised learning module. The unsupervised module forms a new pathway that is trained on either the same or different data samples. The unsupervised pathway enables the network to explore structure that is not present in the labelled training set, while implicitly constraining the output. The unsupervised module is only required during the training stage, i.e. it is removed during inference. In Rasmus et al. (2015) and Zhang et al. (2016) the authors augment the original bottom-up (encoder) network with an additional top-down (decoder) module. The autoencoder, i.e. the bottom-up and the top-down modules combined, forms an auxiliary task to the original classification. However, in contrast to the classification studied in Rasmus et al. (2015) and Zhang et al. (2016), in dense regression both bottom-up and top-down modules exist by default; therefore, the augmentation with an unsupervised module does not extend trivially.
Motivated by the combination of supervised and unsupervised modules, we propose a novel conditional GAN model which implicitly constrains the latent subspace. We coin this new model 'robust conditional GAN' (RoCGAN). The motivation behind RoCGAN is to take advantage of the structure in the target space of the model. We learn this structure with an unsupervised module which is included along with our supervised pathway. Specifically, we replace the original generator, i.e. encoder-decoder, with a two pathway module (Fig. 1). Similarly to the cGAN generator, the first pathway performs regression while the second is an autoencoder in the target domain (unsupervised pathway). The two pathways share a similar network structure, i.e. each one includes an encoder-decoder network. The weights of the two decoders are shared to force the latent representations of the two pathways to be semantically similar. Intuitively, this can be thought of as constraining the output of our dense regression to span the target subspace. The unsupervised pathway enables the utilization of all the samples in the target domain even in the absence of a corresponding input sample. During inference, the unsupervised pathway is no longer required, therefore the testing complexity remains the same as in cGAN.
In the following sections, we introduce our novel RoCGAN and study its theoretical and experimental properties (Sect. 2). We prove that RoCGAN shares theoretical properties with the original GAN, i.e. convergence and optimal discriminator (Sect. 2.5). An experiment with synthetic data is designed to visualize the target subspaces and assess our intuition (Sect. 2.6). We experimentally scrutinize the sensitivity of the hyper-parameters and evaluate our model in the face of intense noise (Sect. 3). Moreover, thorough experimentation with both images from natural scenes and human faces is conducted in different tasks to evaluate the model. The experimental results demonstrate that RoCGAN consistently outperforms the baseline cGAN in all cases.
Our contributions are summarized as follows:
- We introduce RoCGAN, which leverages structure of the target space and promotes robustness in conditional image generation and dense regression tasks.
- We scrutinize the model's performance under the effect of noise and adversarial perturbations. This robustness analysis had not previously been studied in the context of conditional GAN.
- A thorough experimental analysis for different tasks is conducted. We outline how RoCGAN performs with lateral connections from encoder to decoder. The source code is made freely available for the community.
Our preliminary work in Chrysos et al. (2019b) shares the same underlying idea; however, this version is significantly extended. First, all the experiments have been conducted from scratch based on the new Chainer (Tokui et al. 2015) implementation. The task of super-resolution is introduced in this version, while the noise and adversarial perturbations are categorized and extended, e.g. with the iterative attack case. Lastly, the manuscript is significantly modified; the experimental section is written from scratch, while other parts like the related work or the method section are extended substantially.
In this section, we introduce the related literature on conditional GAN and the lines of research related to our work.
Adversarial attacks (Yuan et al. 2017; Samangouei et al. 2018) are an emerging line of research that correlates with our goal. Adversarial attacks are mostly applied to classification tasks; the core idea is that perturbing input samples with a small amount of noise, often imperceptible to the human eye, can lead to severe classification errors. Adversarial attacks are an active field of study with diverse clustering of the methods (Kurakin et al. 2018), e.g. single/multi-step attack, targeted/non-targeted, white/black box. Several techniques 'defend' against adversarial perturbations. A recent example is the Fortified networks of Lamb et al. (2018), which use Denoising Autoencoders (Vincent et al. 2008) to ensure that the input samples do not fall off the target manifold. Kumar et al. (2017) estimate the tangent space to the target manifold and use that to insert invariances to the discriminator for classification purposes. Even though RoCGAN share similarities with those methods, the scope is different since (a) the output of our method is high-dimensional and (b) adversarial examples are not extended to dense regression.

Fig. 1 The mapping process of the generator of the baseline cGAN (a) and our model (b). a The source signal is embedded into a low-dimensional, latent subspace, which is then mapped to the target subspace. The lack of constraints might result in outcomes that are arbitrarily off the target manifold. b On the other hand, in RoCGAN, steps 1b and 2b learn an autoencoder in the target manifold and by sharing the weights of the decoder, we restrict the output of the regression (step 2a). All figures in this work are best viewed in color

Apart from the study of adversarial attacks, combining supervised and unsupervised learning has been used for enhancing the classification performance. In the Ladder network (Rasmus et al. 2015) the authors adapt the bottom-up network by adding a decoder and lateral connections between the encoder (original bottom-up network) and the decoder. During training they utilize the augmented network as two pathways: (i) labelled input samples are fed to the initial bottom-up module, (ii) input samples are corrupted with noise and fed to the encoder-decoder with the lateral connections. The latter pathway is an autoencoder; the idea is that it can strengthen the resilience of the network to samples outside the input manifold, while it improves the classification performance.
The effect of noise in the source or target distributions has been the topic of several works. Lehtinen et al. (2018) demonstrate that zero-mean noise in the target distribution does not deteriorate the training, while it might even lead to an improved generalization. The seminal AmbientGAN of Bora et al. (2018) introduces a method to learn from partial or noisy data. They use a measurement function f to simulate the corruption in the output of the generator; they prove that the generator will learn the clean target distribution. The differences with our work are twofold: (a) we do not have access to the corruption function, (b) we do have a prior signal to condition the generator. The works of Li et al. (2019) and Pajot et al. (2019) extend the AmbientGAN with additional cases.  and Thekumparampil et al. (2018) study cGAN when the labels are discrete, categorical distributions; they include a noise transition model to clean the noisy labels.  extend the idea to image-to-image translation, i.e. when in addition to the conditional source image, there is a categorical, noisy label. The two main differences from our work are that: (a) we do not have categorical labels, (b) we want to constrain the output of the generator to lie in the target space. A common difference between the aforementioned works and ours is that they do not assess the robustness in the face of adversarial perturbations. Gondim-Ribeiro et al. (2018) conduct a study with adversarial perturbations in auto-encoders and conclude that auto-encoders are well-equipped for such attacks. Kos et al. (2018) propose three adversarial attacks tailored for VAE (Kingma and Welling 2014) and VAE-GAN. Arnab et al. (2018) perform the first large-scale evaluation of adversarial attacks on semantic segmentation models.
Our core goal consists in constraining the model's output. Aside from deep learning approaches, such constraints in manifolds were typically tackled with component analysis. Canonical correlation analysis (Hotelling 1936) has been extensively used for finding common subspaces that maximally correlate the data (Panagakis et al. 2016). The recent work of Murdock et al. (2018) combines the expressiveness of neural networks with the theoretical guarantees of classic component analysis.

Conditional GAN
Conditional signal generation leverages a conditioning label, e.g. a prior shape (Tran et al. 2019) or an embedded representation (Mirza and Osindero 2014), to produce the target signal. In this work, we focus on the latter setting, i.e. we assume a dense regression task with the conditioning label being an image.
Conditional image generation is a popular task in computer vision, dominated by approaches similar to the original cGAN paper (Mirza and Osindero 2014). The improvements to the original cGAN can be divided into three categories: changes in (a) the architecture of the generator, (b) the architecture of the discriminator, (c) regularization and/or loss terms. The resulting cGAN architectures and their variants have successfully been applied to a host of different tasks, e.g. inpainting (Iizuka et al. 2017; Yu et al. 2018), super-resolution (Ledig et al. 2017). In this paper, our work focuses on improving any cGAN model; we refer the reader to more targeted works for a thorough review of specific applications, e.g. super-resolution (Agustsson and Timofte 2017) or inpainting (Wu et al. 2017).
The majority of the architectures in the generator follow the influential work of Isola et al. (2017), widely known as 'pix2pix', that includes lateral skip connections between the encoder and the decoder of the generator. Similarly to lateral connections, residual blocks are often utilized (Ledig et al. 2017;Chrysos et al. 2019a). An additional engineering improvement is to include multiscale generation introduced by Yang et al. (2017). Coarse-to-fine architectures often emerge by training more generators, e.g. in Huang et al. (2017) and Ma et al. (2017) they utilize one generator for the global structure and one for the fine-grained result.
The discriminator in Mirza and Osindero (2014) accepts a generated signal and the corresponding target signal. Isola et al. (2017) make two core modifications in the discriminator (applicable to image-to-image translations): (a) it accepts pairs of source/gt and source/model output images, (b) the discriminator extracts patches instead of the whole image. Miyato and Koyama (2018) replace the inputs to the discriminator with a dot product of the source/gt and source/model output images. In Iizuka et al. (2017), they include two discriminators, one for the global structure and one for the local patches (block inpainting task).
The goal of the aforementioned improvements is to improve the performance or stabilize the training; none of these techniques aims to make cGAN more robust to noise. Therefore, our work is orthogonal to all such architecture changes and can be combined with any of the aforementioned architectures.
On the other hand, adding regularization terms in the loss function can impose stronger supervision, thus restricting the output. A variety of additional loss terms have been proposed for regularizing cGAN. The feature matching loss (Salimans et al. 2016) was proposed for stabilizing the training of the discriminator; it measures the discrepancy of the representations (in some layer) of the discriminator. The motivation lies in matching the low-dimensional distributions created by the discriminator layers. Isola et al. (2017) propose a content loss (implemented as $\ell_1$ loss) for measuring the per pixel discrepancy of the generated versus the target signal. The perceptual loss is used in Ledig et al. (2017) and Johnson et al. (2016) instead of a per pixel loss. The perceptual loss denotes the difference between the representations of the target and the generated signal. Frequently, task-specific losses are utilized, such as identity preservation or symmetry loss in Huang et al. (2017).
The aforementioned regularization terms provide implicit supervision in the generator's output through similarity with the target signal. However, this supervision does not restrict the generated signals to lie in the target manifold.

Method
In this section, we elucidate our proposed RoCGAN. In the following paragraphs, we develop the problem statement (Sect. 2.1), we review the original conditional GAN model (Sect. 2.2), and introduce RoCGAN (Sect. 2.3). Subsequently, we study a special case of generators, i.e. the generators that include lateral skip connections from the encoder to the decoder, and we pose the modifications required (Sect. 2.4). In Sect. 2.5, we prove that RoCGAN share the same properties as the original GAN, and in Sect. 2.6 the intuition behind the model is assessed with synthetic data.

Problem Statement
The task of conditional signal generation is posed as generating signals given an input label s. We assume the label s ∈ S, where S is the domain of labels, follows a different distribution from the target signals y ∈ Y, where Y is the domain of target signals. Also, we frequently want to include some stochasticity in the mapping; we include a latent variable z ∈ Z where Z is a known distribution, e.g. Gaussian.
Mathematically, if G denotes the mapping we want to learn, then:

$$G : S \times Z \to Y, \qquad y = G(s, z) \tag{1}$$

To learn G, we assume we have access to a database of N pairs $D = \{(s^{(1)}, y^{(1)}), \ldots, (s^{(n)}, y^{(n)}), \ldots, (s^{(N)}, y^{(N)})\}$ with n ∈ [1, N]. In the following paragraphs we drop the index, i.e. we denote $s^{(n)}$ as s, to avoid cluttering the notation.
Conditional GAN, which we develop below, have dominated the literature for learning such mappings G. However, our interest lies in studying the case where, during inference, the source signal is s + f(s, G) instead of s, i.e. there is some unwanted noise in our source signal. We argue that studying such noise is of both theoretical and practical value for commercial applications.
Notation A bold letter represents a vector/tensor; a plain letter designates a scalar number. Unless explicitly mentioned otherwise, $\|\cdot\|$ denotes an $\ell_1$ norm. The symbols $\mathcal{L}_*$ define loss terms, while $\lambda_*$ denote regularization hyper-parameters optimized on the validation set. For a matrix M, diag(M) denotes its diagonal elements.

Conditional GAN
GAN consist of a generator and a discriminator module, commonly optimized with alternating gradient descent. The generator's goal is to model the target distribution $p_d$, while the discriminator's goal is to discern the samples synthesized by the generator from those of the target (ground-truth) distribution. More precisely, the generator samples z from a prior distribution $p_z$, e.g. uniform, and maps that to a sample; the discriminator D tries to distinguish between the synthesized sample and one sample from $p_d$.
The idea behind conditional GAN (cGAN) (Mirza and Osindero 2014) is to provide some additional labels to the generator. The generator G typically takes the form of an encoder-decoder network, where the encoder projects the label into a low-dimensional latent subspace and the decoder performs the opposite mapping, i.e. from low-dimensional to high-dimensional subspace. In other words, the generator performs the regression from the source to the target signal.
The core loss of cGAN is the adversarial loss, which determines the alternating role of the generator and the discriminator:

$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{s, y \sim p_d(s, y)}\big[\log D(y|s)\big] + \mathbb{E}_{s \sim p_d(s),\, z \sim p_z(z)}\big[\log\big(1 - D(G(s, z)|s)\big)\big] \tag{2}$$

The loss is optimized through the following min-max problem:

$$\min_{w_G} \max_{w_D} \mathcal{L}_{adv}(G, D)$$

where $w_G$, $w_D$ denote the generator's and the discriminator's parameters respectively. To simplify the notation, we drop the dependencies on the parameters and the noise z in the rest of the paper. In our experiments, we use a discriminator that is not conditioned on the input, i.e. D(y); we include a related ablation study in Sect. 3.4.3.
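As a rough illustration of Eq. 2, the sketch below estimates the adversarial loss from discriminator outputs on a batch. It assumes the unconditioned discriminator D(y) mentioned above; the function name `adversarial_loss` and the numerical guard `eps` are hypothetical and are not taken from the released implementation.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Batch estimate of L_adv = E[log D(y)] + E[log(1 - D(G(s, z)))].

    d_real: discriminator outputs in (0, 1) for target samples y.
    d_fake: discriminator outputs in (0, 1) for generated samples G(s, z).
    eps: hypothetical guard against log(0); not part of the original model.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# A discriminator that cannot separate the two sets outputs 0.5 everywhere.
undecided = adversarial_loss([0.5, 0.5], [0.5, 0.5])
# A confident and correct discriminator pushes the loss towards its maximum of 0.
confident = adversarial_loss([0.99, 0.98], [0.01, 0.02])
```

The discriminator ascends this quantity while the generator descends it, which is the alternating role described by the min-max problem.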
Aside from the adversarial loss, cGAN models include auxiliary losses, e.g. task-specific $\ell_1$ reconstruction or regularization terms for the discriminator. Those losses do not affect the core model nor its adaptation to RoCGAN; we symbolize with $\mathcal{L}_{cGAN}$ the total loss function.

RoCGAN
Our main goal is to improve robustness to noise in dense regression tasks. To that end, we introduce our model that leverages structure in the target space of the model to enhance the generator's regression. Our model shares the same structure as cGAN, i.e. it consists of a generator that performs the regression and a discriminator that separates the synthesized from the target signal. We achieve our goal by constructing a generator that includes two pathways.
The generator of RoCGAN includes two pathways instead of the single pathway of the original cGAN. The first pathway, referred to as the reg pathway henceforth, performs a similar regression as its counterpart in cGAN; it accepts a sample from the source domain and maps it to the target domain. We introduce an additional unsupervised pathway, named the AE pathway. The AE pathway works as an autoencoder in the target domain. Both pathways consist of similar encoder-decoder networks. By sharing the weights of their decoders, we promote the regression outputs to span the target manifold and not induce arbitrarily large errors. A schematic of the generator is illustrated in Fig. 2. The discriminator can remain the same as in cGAN: it accepts the reg pathway's output along with the corresponding target sample as input.
To simplify the notation below, the superscript 'AE' abbreviates modules of the AE pathway and 'G' modules of the reg pathway. We denote $G(s) = d^{(G)}(e^{(G)}(s))$ the output of the reg pathway and $G^{(AE)}(y) = d^{(AE)}(e^{(AE)}(y))$ the output of the AE pathway; e, d symbolize the encoder and decoder of a pathway respectively.
The unsupervised module (autoencoder in the target domain) contributes the following loss term:

$$\mathcal{L}_{AE} = \sum_{n=1}^{N} f_d^{AE}\big(y^{(n)}, G^{(AE)}(y^{(n)})\big) \tag{3}$$

where $f_d^{AE}$ denotes a function to measure the divergence. Despite sharing the weights of the decoders, we cannot ensure that the latent representations of the two pathways span the same subspace. To further reduce the distance of the two representations in the latent space, we introduce the latent loss term $\mathcal{L}_{lat}$. This term minimizes the distance between the encoders' outputs, i.e. the two representations are spatially close (in the subspace spanned by the encoders). The latent loss term is:

$$\mathcal{L}_{lat} = \sum_{n=1}^{N} f_d^{lat}\big(e^{(G)}(s^{(n)}), e^{(AE)}(y^{(n)})\big) \tag{4}$$

where $f_d^{lat}$ can be any divergence function. In practice, for both $\mathcal{L}_{lat}$ and $\mathcal{L}_{AE}$ we employ ordinary loss functions, e.g. $\ell_1$ or $\ell_2$ norms. As a future step we intend to replace the latent loss term $\mathcal{L}_{lat}$ with a kernel-based method (Gretton et al. 2007) or a learnable metric for matching the distributions (Ma et al. 2018).
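The final loss function of RoCGAN combines the loss terms of the original cGAN $\mathcal{L}_{cGAN}$ with the additional two terms for the AE pathway:

$$\mathcal{L}_{RoC} = \mathcal{L}_{cGAN} + \lambda_{ae} \cdot \mathcal{L}_{AE} + \lambda_{l} \cdot \mathcal{L}_{lat} \tag{5}$$

where $\lambda_{ae}$ and $\lambda_{l}$ are hyper-parameters that balance the loss terms.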
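To make the two-pathway construction concrete, here is a minimal sketch with toy linear encoders and a single decoder matrix shared by both pathways. The dimensions, the linear layers and all names (`W_enc_reg`, `W_enc_ae`, `W_dec`, `rocgan_generator_losses`, `generate`) are illustrative assumptions; the cGAN loss is passed in as a plain number, and the default weights follow the values λ_ae = 100, λ_l = 1 reported in the training details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 3-D source, 2-D latent, 4-D target.
W_enc_reg = rng.normal(size=(3, 2))  # e^(G): encoder of the reg pathway
W_enc_ae = rng.normal(size=(4, 2))   # e^(AE): encoder of the AE pathway
W_dec = rng.normal(size=(2, 4))      # decoder weights shared by both pathways

def l1(a, b):
    """Batch-summed l1 distance, used here for both divergence functions."""
    return float(np.abs(a - b).sum())

def rocgan_generator_losses(s, y, loss_cgan, lambda_ae=100.0, lambda_l=1.0):
    z_reg = s @ W_enc_reg   # latent code of the reg pathway
    z_ae = y @ W_enc_ae     # latent code of the AE pathway
    y_reg = z_reg @ W_dec   # G(s): regression output
    y_ae = z_ae @ W_dec     # G^(AE)(y): reconstruction through the same decoder
    loss_ae = l1(y, y_ae)
    loss_lat = l1(z_reg, z_ae)
    total = loss_cgan + lambda_ae * loss_ae + lambda_l * loss_lat
    return total, loss_ae, loss_lat, y_reg

def generate(s):
    """Inference uses only the reg pathway, so the cost matches plain cGAN."""
    return (s @ W_enc_reg) @ W_dec

s = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 4))
total, loss_ae, loss_lat, y_reg = rocgan_generator_losses(s, y, loss_cgan=0.5)
```

Because `W_dec` appears in both pathways, gradients from the autoencoding term also shape the decoder used for regression, which is the mechanism the text relies on.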

RoCGAN with Skip Connections
The need of memorizing all the information. The shortcut connection allows a low-level representation from an encoder layer to be propagated directly to a decoder layer without passing through the long path, i.e. the network without the lateral skip connections. An autoencoder (AE) with such a skip connection can achieve close to zero reconstruction error by simply propagating the representation through the shortcut. This shatters the signal in the long path (Rasmus et al. 2015), which is an unwanted behavior.
To ensure that the long path is trained, we explore a number of regularization methods. Our first approach, in our original work, was to include a regularization loss term. In this work, we propose an additional regularization technique for the skip case.
In the first approach, we implicitly tackle the issue by maximizing the variance captured by the longer path representations. We add a loss term that penalizes the correlations in the representations (of a layer) and thus implicitly encourage the representations to capture diverse and useful information. We implement the decov loss (Cogswell et al. 2016):

$$\mathcal{L}_{decov} = \frac{1}{2}\Big(\|C\|_F^2 - \|\mathrm{diag}(C)\|_2^2\Big) \tag{6}$$

where $\|\cdot\|_F$ is the Frobenius norm and C is the covariance matrix of the layer's representations. The loss is minimized when the covariance matrix is diagonal, i.e. it imposes a cost to minimize the covariance of hidden units without restricting the diagonal elements that include the variance of the hidden representations. A similar loss is explored by Valpola (2015), where the decorrelation loss is applied in every layer. Their loss term has stronger constraints: (i) it favors an identity covariance matrix and (ii) it penalizes the smaller eigenvalues of the covariance more. We have not explored this alternative loss term, as the decov loss worked in our case without the additional assumptions of Valpola (2015).
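The decov loss can be sketched in a few lines. The version below assumes the representations of one layer are flattened to a (batch, units) array and that the covariance is normalized by the batch size; the function name `decov_loss` is hypothetical.

```python
import numpy as np

def decov_loss(h):
    """DeCov loss: 0.5 * (||C||_F^2 - ||diag(C)||_2^2).

    h: array of shape (batch, units) holding one layer's representations.
    C is the covariance of the units, estimated over the batch.
    """
    h = np.asarray(h, dtype=float)
    centered = h - h.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / h.shape[0]
    return 0.5 * (np.sum(cov ** 2) - np.sum(np.diag(cov) ** 2))

# Two uncorrelated units: the covariance is the identity, so the loss is zero.
uncorrelated = decov_loss([[1, 1], [1, -1], [-1, 1], [-1, -1]])
# Two perfectly correlated units: the off-diagonal entries are penalized.
correlated = decov_loss([[1, 1], [-1, -1]])
```

Only the off-diagonal entries of C contribute, so the variance of each unit is left unconstrained, as stated above.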
In this work, we consider an alternative regularization technique. The approach is motivated by Rasmus et al. (2015), who include noise in the lateral skip connections. We include zero-mean Gaussian noise in the shortcut connection, i.e. the representation of the encoder is modified by some additive Gaussian noise when skipped to the decoder. In our experimentation, both approaches can lead to improved results; we prefer to use the latter in the experiments.
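A minimal sketch of the noisy shortcut is given below. The standard deviation `sigma` is a hypothetical hyper-parameter (its value is not stated here), and the function name `noisy_skip` is ours.

```python
import numpy as np

def noisy_skip(encoder_features, sigma, rng):
    """Add zero-mean Gaussian noise to the encoder features sent over the skip."""
    noise = rng.normal(loc=0.0, scale=sigma, size=encoder_features.shape)
    return encoder_features + noise

rng = np.random.default_rng(0)
features = np.ones((2, 8, 8, 16))  # hypothetical (batch, height, width, channels)
skipped = noisy_skip(features, sigma=0.1, rng=rng)
```

Since the shortcut no longer carries an exact copy of the encoder representation, the decoder cannot rely on it alone for reconstruction and the long path has to carry information as well.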

Theoretical Analysis
In the next few paragraphs, we prove that RoCGAN share the properties of the original GAN. Even though the derivations follow similar steps as in the original GAN, they are added to make the paper self-contained.
We derive the optimal discriminator and then compute the optimal value of $\mathcal{L}_{adv}(G, D)$.
Proposition 1 For a fixed generator G (reg pathway), the optimal discriminator is:

$$D^*(y|s) = \frac{p_d(s, y)}{p_d(s, y) + p_g(s, y)}$$

where $p_g$ is the model (generator) distribution.
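Proof Since the generator is fixed, the goal of the discriminator is to maximize $\mathcal{L}_{adv}$, where:

$$\mathcal{L}_{adv}(G, D) = \int\!\!\int \Big[ p_d(s, y) \log D(y|s) + p_g(s, y) \log\big(1 - D(y|s)\big) \Big]\, dy\, ds$$

To maximize $\mathcal{L}_{adv}$, we need to optimize the integrand above. We note that with respect to D the integrand has the form $f(y) = a \cdot \log(y) + b \cdot \log(1 - y)$. The function f, for a, b ∈ (0, 1) as in our case, obtains a global maximum at $\frac{a}{a+b}$, so:

$$\mathcal{L}_{adv}(G, D) \le \int\!\!\int \Big[ p_d(s, y) \log D^*(y|s) + p_g(s, y) \log\big(1 - D^*(y|s)\big) \Big]\, dy\, ds$$

thus $\mathcal{L}_{adv}$ obtains the maximum with $D^*$.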
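For completeness, the maximizer used in the proof follows from a short calculation. This is a standard argument, spelled out here as a supplementary step, and it assumes a, b > 0:

```latex
f'(y) = \frac{a}{y} - \frac{b}{1 - y} = 0
\;\Longrightarrow\;
y = \frac{a}{a + b},
\qquad
f''(y) = -\frac{a}{y^{2}} - \frac{b}{(1 - y)^{2}} < 0
\quad \text{for } y \in (0, 1).
```

Hence f is strictly concave on (0, 1) and the stationary point is its unique global maximum.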
Proposition 2 Given the optimal discriminator $D^*$, the global minimum of $\mathcal{L}_{adv}$ is reached if and only if $p_g = p_d$, i.e. when the model (generator) distribution matches the data distribution.
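Proof From Proposition 1, we have found the optimal discriminator as $D^*$, i.e. the $\arg\max_D \mathcal{L}_{adv}$. If we replace the optimal value we obtain:

$$\max_D \mathcal{L}_{adv}(G, D) = \int\!\!\int \Big[ p_d \log \frac{p_d}{p_d + p_g} + p_g \log \frac{p_g}{p_d + p_g} \Big]\, dy\, ds$$

We add and subtract log(2) from both terms, which after a few math operations provides:

$$\max_D \mathcal{L}_{adv}(G, D) = -2\log 2 + KL\Big(p_d \,\Big\|\, \frac{p_d + p_g}{2}\Big) + KL\Big(p_g \,\Big\|\, \frac{p_d + p_g}{2}\Big)$$

where in the last row KL symbolizes the Kullback-Leibler divergence. The latter can be rewritten more conveniently with the help of the Jensen-Shannon divergence (JSD) as

$$\max_D \mathcal{L}_{adv}(G, D) = -2\log 2 + 2 \cdot JSD(p_d \,\|\, p_g)$$

The Jensen-Shannon divergence is non-negative and obtains the zero value only if $p_d = p_g$. Equivalently, the last equation has a global minimum (under the constraint that the discriminator is optimal) when $p_d = p_g$.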
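The identity in the proof can be checked numerically on small discrete distributions. The sketch below is only a sanity check under the assumption of strictly positive probability vectors; the helper names `kl`, `jsd` and `adv_loss_at_optimal_d` are hypothetical.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence with the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adv_loss_at_optimal_d(p_d, p_g):
    """Adversarial loss evaluated at D* = p_d / (p_d + p_g)."""
    d_star = p_d / (p_d + p_g)
    return float(np.sum(p_d * np.log(d_star) + p_g * np.log(1.0 - d_star)))

p_d = np.array([0.1, 0.2, 0.3, 0.4])
p_g = np.array([0.4, 0.3, 0.2, 0.1])
lhs = adv_loss_at_optimal_d(p_d, p_g)
rhs = -2.0 * np.log(2.0) + 2.0 * jsd(p_d, p_g)
at_optimum = adv_loss_at_optimal_d(p_d, p_d)
```

Both sides agree, and the value drops to -2 log 2 exactly when the two distributions coincide.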

Experiment on Synthetic Data
We design an experiment on synthetic data to explore the differences between the original generator and our two pathway generator. Specifically, we design a network where each encoder/decoder consists of two fully connected layers, each layer followed by a ReLU. We optimize the generators only, to avoid adding extra learned parameters.
The inputs/outputs of this network span a low-dimensional space, which depends on two independent variables x, y ∈ [−1, 1]. We have experimented with several arbitrary functions in the input and output vectors and they perform in a similar way. We exhibit here the case with input vector $[x, y, e^{2x}]$ and output vector $[x + 2y + 4, e^{x} + 1, x + y + 3, x + 2]$. The reg pathway accepts the three inputs, projects them into a two-dimensional space and the decoder maps it to the target four-dimensional space.
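The synthetic pairs described above can be generated as in the following sketch. Uniform sampling of x and y is an assumption (the sampling distribution is not specified here), and the function name `sample_synthetic` is hypothetical.

```python
import numpy as np

def sample_synthetic(n, rng):
    """Draw n (source, target) pairs from the two latent variables x, y in [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)  # assumed uniform
    y = rng.uniform(-1.0, 1.0, size=n)  # assumed uniform
    source = np.stack([x, y, np.exp(2.0 * x)], axis=1)
    target = np.stack(
        [x + 2.0 * y + 4.0, np.exp(x) + 1.0, x + y + 3.0, x + 2.0], axis=1
    )
    return source, target

rng = np.random.default_rng(0)
source, target = sample_synthetic(6400, rng)
```

Both vectors are functions of the same two variables, so the source and the target each lie on a two-dimensional manifold embedded in three and four dimensions respectively.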
We train the baseline and the autoencoder modules separately and use their pre-trained weights to initialize the two pathway network. The loss function of the two pathway network consists of the $\mathcal{L}_{lat}$ (Eq. 4) and $\ell_2$ content losses in the two pathways. The networks are trained either till convergence or till 100,000 iterations (batch size 128) are completed.

Fig. 3 Qualitative results in the synthetic experiment of Sect. 2.6. Each plot corresponds to the respective manifolds in the output vector; the first and third depend on both x, y (xyz plot), while the rest on x (xz plot). The green color visualizes the target manifold, the red the baseline and the blue ours. Even though the two models include the same parameters during inference, the baseline does not approximate the target manifold as well as our method (Color figure online)

During testing, 6400 new points are sampled and the overlaid results are depicted in Fig. 3; the individual figures for each output can be found in the supplementary. The $\ell_1$ errors for the two cases are: 9843 for the baseline and 1520 for the two pathway generator. We notice that the two pathway generator approximates the target manifold better with the same number of parameters during inference.

Experiments
In the following paragraphs we first design and explain the noise models (Sect. 3.1), then review the implementation details (Sect. 3.2) and the experimental setup (Sect. 3.3). Subsequently, we conduct an ablation study and evaluate our model on real-world datasets, including natural scenes and human faces.

Noise Models
In this work, we explore two different types of noise with multiple variants tested in each type. Those two types are Bernoulli noise and adversarial noise.
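Bernoulli noise For an input s, the noise model is a Bernoulli function parameterized by a value v and a probability θ: each pixel of s is replaced by v with probability θ and left unchanged with probability 1 − θ.
To provide a practical example, assume that v = 0 and θ = 0.5; then an image s has approximately half of its pixels converted to black, which is known as sparse inpainting.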
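The pixel-wise Bernoulli corruption can be sketched as follows. The code assumes an image stored as a (height, width, channels) array and draws one Bernoulli variable per pixel, shared across channels; the function name `bernoulli_noise` is hypothetical.

```python
import numpy as np

def bernoulli_noise(s, theta, v=0.0, rng=None):
    """Replace each pixel of s by the value v with probability theta."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.asarray(s, dtype=float)
    # One draw per pixel; the trailing axis of size 1 broadcasts over channels.
    mask = rng.random(s.shape[:2] + (1,)) < theta
    return np.where(mask, v, s)

rng = np.random.default_rng(0)
image = np.ones((64, 64, 3))
noisy = bernoulli_noise(image, theta=0.5, v=0.0, rng=rng)
black_fraction = np.mean(noisy[..., 0] == 0.0)
```

With v = 0 and θ = 0.5 this reproduces the sparse inpainting setting of the example: roughly half of the pixels are set to black.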
Adversarial examples Apart from testing in the face of additional Bernoulli noise, we explore adversarial attacks in the context of dense regression. Recent works, e.g. Szegedy et al. (2014), Yuan et al. (2017), Samangouei et al. (2018) and Madry et al. (2018), explore the robustness of (deep) classifiers.
Contrary to the classification case, there has not been much investigation of adversarial attacks in the context of image-to-image translation or any other dense regression task. However, since an adversarial example perturbs the source signal, dense regression tasks can be vulnerable to such modifications. We conduct a thorough investigation of this phenomenon by attacking our model with three adversarial attacks for dense regression. We introduce the adversarial attacks in the following paragraphs.
The first and most ubiquitous attack is the fast gradient sign method (FGSM), introduced by Goodfellow et al. (2015). It is the simplest attack and the basis for several variants. In addition, Dou et al. (2018) mathematically prove the efficacy of this attack in the classification case. Let us define the auxiliary function:

$$u(s, y) = \nabla_{s} L(s, y)$$

with $L(s, y) = \|y - G(s)\|_1$. Then, each source signal s is modified as:

$$\tilde{s} = s + \eta$$

The perturbation η is defined as:

$$\eta = \epsilon \cdot \mathrm{sign}\big(u(s, y)\big)$$

with ε a hyper-parameter, y the target signal and L an appropriate loss function. However, to make the perturbation stronger, we iterate the gradient computation. The iterative FGSM (IFGSM) method of Dou et al. (2018) is:

$$\tilde{s}^{(k)} = \mathrm{Clip}\Big(\tilde{s}^{(k-1)} + \epsilon \cdot \mathrm{sign}\big(u(\tilde{s}^{(k-1)}, y)\big)\Big)$$

where k is the kth iteration, $\tilde{s}^{(0)} = s$ and the Clip function restricts the outputs to the source signal range.
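The following sketch applies (I)FGSM to a toy generator so that the gradient can be written in closed form. It assumes a linear generator G(s) = sW and a source range of [-1, 1]; both are illustrative assumptions, as are the names `l1_loss_and_grad` and `ifgsm`. A real attack would obtain the gradient by back-propagating through the trained generator.

```python
import numpy as np

def l1_loss_and_grad(s, y, W):
    """L(s, y) = ||y - s W||_1 and its (sub)gradient with respect to s."""
    residual = y - s @ W
    loss = float(np.abs(residual).sum())
    grad = -(W @ np.sign(residual))
    return loss, grad

def ifgsm(s, y, W, eps, steps=1, lo=-1.0, hi=1.0):
    """Iterative FGSM; steps=1 gives single-step FGSM followed by clipping."""
    s_adv = s.copy()
    for _ in range(steps):
        _, grad = l1_loss_and_grad(s_adv, y, W)
        # Step in the direction that increases the loss, then clip to the range.
        s_adv = np.clip(s_adv + eps * np.sign(grad), lo, hi)
    return s_adv

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))              # hypothetical linear generator
s = rng.uniform(-0.5, 0.5, size=3)       # clean source signal
y = rng.normal(size=4)                   # target signal
clean_loss, _ = l1_loss_and_grad(s, y, W)
s_fgsm = ifgsm(s, y, W, eps=0.1)
s_iter = ifgsm(s, y, W, eps=0.01, steps=10)
fgsm_loss, _ = l1_loss_and_grad(s_fgsm, y, W)
iter_loss, _ = l1_loss_and_grad(s_iter, y, W)
```

In the tuple notation used for the experiments, the two calls correspond to (k, ε) = (1, 0.1) and (10, 0.01).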
The second attack is the projected gradient descent method (PGD) of Madry et al. (2018). PGD is an iterative method which, given the source signal $\tilde{s}^{(0)} = s$, modifies it as:

$$\tilde{s}^{(k)} = \Pi\Big(\tilde{s}^{(k-1)} + \alpha \cdot \mathrm{sign}\big(\nabla_{s} L(\tilde{s}^{(k-1)}, y)\big)\Big)$$

where Π denotes the projection onto the set of allowed perturbations around s and α is the step size. Robustness to this attack typically implies robustness to all first order methods (Madry et al. 2018), making it a particularly interesting case study. The third adversarial method is the latent attack of Kos et al. (2018). The loss in this attack is computed in the latent space, i.e. the output of the encoder $e^{(G)}(s)$.
In the following sections, we model the (I)FGSM attacks with the tuple (k, ε), which denotes the total number of iterative steps and the value of the hyper-parameter ε respectively. In the Bernoulli case, we use three cases of v, i.e. v = 0 corresponding to black pixels, v = 1 corresponding to white pixels and channel-wise v = 0. We abbreviate the three cases with a triplet $(\theta_{v=0}, \theta_{v=1}, \theta_{v=0,\mathrm{channel}})$ denoting the θ probability in each case. For instance, the triplet (50, 0, 0) denotes Bernoulli noise with v = 0 and probability 50%. Unless explicitly mentioned otherwise, the default adversarial attack below is the IFGSM.

Conditional GAN model
where $\pi(\cdot)$ extracts the features from the penultimate layer of the discriminator. The final loss function for the cGAN is the following: where $\lambda_c$, $\lambda_\pi$ are hyper-parameters to balance the loss terms.
10 Referred to as projection loss in Chrysos et al. (2019a).
RoCGAN model To fairly compare against the aforementioned cGAN model, we make only the following three adaptations: (i) we duplicate the encoder/decoder (for the new AE pathway); (ii) we share the decoder's weights in the two pathways; (iii) we augment the loss function with the additional loss terms. We emphasize that this is only performed for experimental validation; in practice the encoder of the AE pathway can have a different structure or new task-specific loss terms can be introduced; we have made no effort to further optimize RoCGAN. We use the $\ell_1$ loss for both $L_{lat}$ and $L_{AE}$.
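The three adaptations can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the encoders and the decoder are random linear maps rather than convolutional networks, the latent loss is taken between the two encoders' outputs, the AE loss is the reconstruction error of the target, and the weights $\lambda_l = 1$, $\lambda_{ae} = 100$ are the values reported in the training details.

```python
import numpy as np

def l1(a, b):
    """Plain l1 distance between two arrays."""
    return np.abs(a - b).sum()

rng = np.random.default_rng(0)
dim, latent = 16, 4
enc_reg = rng.normal(size=(latent, dim))     # encoder of the regression pathway
enc_ae = rng.normal(size=(latent, dim))      # duplicated encoder for the AE pathway
dec_shared = rng.normal(size=(dim, latent))  # one decoder, shared by both pathways

def extra_losses(s, y, lambda_l=1.0, lambda_ae=100.0):
    """Additional terms added on top of the cGAN loss (toy version)."""
    z_reg = enc_reg @ s                   # latent code of the source (reg pathway)
    z_ae = enc_ae @ y                     # latent code of the target (AE pathway)
    loss_lat = l1(z_reg, z_ae)            # pulls the two latent codes together
    loss_ae = l1(dec_shared @ z_ae, y)    # autoencoder reconstruction of the target
    return lambda_l * loss_lat + lambda_ae * loss_ae, loss_lat, loss_ae

s = rng.normal(size=dim)   # synthetic source sample
y = rng.normal(size=dim)   # synthetic target sample
total, loss_lat, loss_ae = extra_losses(s, y)
```

Because `dec_shared` is used by both pathways, any gradient from the AE loss also updates the decoder that produces the regression output, which is the weight-sharing constraint in adaptation (ii).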
Training details A 'layer' refers to a block of three units: a convolutional unit with a 4 × 4 kernel, followed by Leaky ReLU and batch normalization (Ioffe and Szegedy 2015). The hyper-parameters introduced by our model are $\lambda_l = 1$, $\lambda_{ae} = 100$. The values of the common hyper-parameters, e.g. $\lambda_c$, $\lambda_\pi$, are the same for cGAN and RoCGAN. A mild data augmentation is used for training cGAN/RoCGAN: the training images are resized to 75 × 75 and random patches of 64 × 64 are fed into the network. Each training image is horizontally flipped with probability 0.5; no other augmentation is used. A constant learning rate of $2 \cdot 10^{-4}$ (same as in Isola et al. 2017) is used for $3 \cdot 10^{5}$ iterations with a batch size of 64. During training, we run validation every $10^{4}$ iterations and export the best model, which is used for testing. The discriminator consists of 3 convolutional layers followed by a fully-connected layer. The input to the discriminator is either the output of the generator or the respective target image, i.e. we do not condition the discriminator on the source image.
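The augmentation described above amounts to a random crop followed by a coin-flip mirror. A minimal sketch, assuming the image has already been resized to 75 × 75 and is stored as an (H, W, C) array; the function name `augment` is ours:

```python
import numpy as np

def augment(img, rng, crop=64):
    """Random crop x crop patch plus horizontal flip with probability 0.5."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop + 1)    # inclusive range of valid top-left corners
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]             # mirror along the width axis
    return patch

rng = np.random.default_rng(0)
resized = rng.uniform(size=(75, 75, 3))    # synthetic stand-in for a resized training image
patch = augment(resized, rng)
print(patch.shape)
```

In a paired setting the same crop offsets and flip decision would have to be applied to the source and the target image; the sketch handles a single array only.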
Our workhorse for testing is a network denoted '5layer', because each encoder and decoder consists of 5 layers. In the following experiments 'Baseline-5layer' represents the cGAN '5layer' case, while ours is indicated as 'Ours-5layer'. In the skip case, we add a skip connection from the output of the third layer of the encoder to the respective decoder layer; we add a '-skip' in the respective method name.
We train an adversarial autoencoder (AAE) (Makhzani et al. 2015) as an established method capable of learning compressed representations as an upper performance bound baseline. Each module of the AAE shares the same architecture as its cGAN counterpart, while the AAE is trained with images in the target space. The target images are used as the input to the AAE and its output, i.e. the reconstruction, is used for the evaluation. In our experimental setting, AAE can be thought of as an upper performance limit of RoCGAN/cGAN for a given capacity (number of parameters).
The task selected for our testing is super-resolution by 4×. That is, we downsample an image by a factor of 4, upsample it back with bilinear interpolation and use this interpolated image as the corrupted (source) image. In the supplementary, we include an experiment with sparse inpainting.
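The construction of the corrupted input can be sketched as follows. The downsampling kernel is not specified in the text, so block averaging is an assumption here; the bilinear upsampling uses pixel-centre sampling with clamped borders. The helper names `bilinear_resize` and `degrade` are hypothetical.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear resize of an (H, W, C) array using pixel-centre sampling."""
    in_h, in_w = img.shape[:2]
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, in_h - 1), np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]          # vertical interpolation weights
    wx = (xs - x0)[None, :, None]          # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def degrade(img, factor=4):
    """Downsample by `factor` (block averaging, an assumption) and upsample back bilinearly."""
    h, w, c = img.shape                    # h and w must be divisible by factor
    low = img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
    return bilinear_resize(low, h, w)

rng = np.random.default_rng(0)
target = rng.uniform(size=(64, 64, 3))     # synthetic stand-in for a target image
source = degrade(target)                   # blurry input fed to the generator
```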

Experimental Setup
Datasets In addition to validating our model on synthetic data, we utilize a variety of real-world datasets:
- MS-Celeb (Guo 2016) was introduced for large scale face recognition. It contains approximately 10 million facial images of 1 million celebrities. The dataset was collected semi-automatically and the noise was not manually removed from the training images. We export 3 million samples for training and use 100 thousand images for validation.
- CelebFaces attributes dataset (Celeb-A) (Liu et al. 2015) is a popular benchmark for large-scale face attribute classification. Each image is annotated with 40 binary attributes. Celeb-A is used in conjunction with MS-Celeb in this work, where the latter is used for training and the former is used for testing. All the 202,500 samples of Celeb-A are used for testing. This combination is the main focus of our experiments; specifically it is used in Sects. 3.4, 3.5, 3.6 and 3.8.
- 300 Videos in the Wild (300VW) (Shen et al. 2015) is a benchmark for face tracking; it includes a sparse set of points annotated per frame. It includes three categories of videos with increasing difficulty; in this work we use as testset the most challenging category (categ 3), which includes over 27,000 frames. We use 300VW in Sect. 3.7 for assessing the performance of RoCGAN on video datasets.
- ImageNet (Deng et al. 2009) is a large image database with 1000 different objects. On average, over five hundred images exist per object. In the experiment for natural scenes, we utilize the training set of ImageNet, which consists of 1.2 million images, and its testset, which includes 98 thousand images (Sect. 3.5).
The two categories of images, i.e. faces and natural scenes, are extensively used in computer vision and machine learning both for their commercial value and for their online availability. For the experiments with faces, MS-Celeb constitutes the training set, while for the natural scenes the training set is ImageNet.
Error metrics In the comparisons of RoCGAN against cGAN the following metrics are used:
- Structural similarity (SSIM) (Wang et al. 2004): A metric used to quantify the perceived quality of an image. We use it to compare every output image with respect to the reference (ground-truth) image; it takes values in [0, 1], with higher values indicating better quality.
- Frechet inception distance (FID) (Heusel et al. 2017): A measure of the quality of the generated images, frequently used with GANs. It extracts second order information from a pretrained classifier applied to the images. FID assumes that the two distributions $p_1$ and $p_2$ are multivariate Gaussian, i.e. $\mathcal{N}(\mu_1, C_1)$ and $\mathcal{N}(\mu_2, C_2)$. Then: $\mathrm{FID}(p_1, p_2) = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}(C_1 + C_2 - 2 (C_1 C_2)^{1/2})$. In our work $p_1$ is the distribution of the ground-truth images, while $p_2$ is the distribution of the generated images from each method. FID is lower bounded by 0, which is attained when $p_2$ matches $p_1$; a lower FID score translates to the distributions being 'closer'. We compute the FID score using the Inception network (in Chainer).
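Once the classifier features are available, the Frechet distance between the two Gaussians is a short computation. The sketch below is not the Chainer implementation used in the experiments; it assumes the features are already extracted (here replaced by synthetic arrays) and computes the trace of the matrix square root from the eigenvalues of $C_1 C_2$. The names `gaussian_stats` and `frechet_distance` are ours.

```python
import numpy as np

def gaussian_stats(features):
    """Mean and covariance of an (n_samples, n_features) array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, c1, mu2, c2):
    """||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2)) for two Gaussians."""
    diff = mu1 - mu2
    # Tr((C1 C2)^(1/2)) is the sum of the square roots of the eigenvalues of C1 C2,
    # which are real and non-negative when both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(c1 @ c2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(c1) + np.trace(c2) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
feats_real = rng.normal(size=(500, 8))            # synthetic stand-in for ground-truth features
feats_fake = rng.normal(loc=0.3, size=(500, 8))   # synthetic stand-in for generated features
fid = frechet_distance(*gaussian_stats(feats_real), *gaussian_stats(feats_fake))
print(fid)
```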

Ablation Study
In the following paragraphs we conduct an ablation study to assess RoCGAN in different cases; specifically we evaluate the sensitivity over a hyper-parameter range and different initialization options. We also summarize different options for loss functions and other architecture-related choices. Unless mentioned otherwise, the '5layer' network is used, the task is face super-resolution and SSIM is the reported metric in this part. The options selected in the ablation study are used in the following experiments and comparisons against cGAN.

Initialization of RoCGAN
We conduct an experiment to evaluate different initialization options for RoCGAN. The motivation for the different initializations is to assess the necessity of the pretrained models as used in Chrysos et al. (2019b). The options are:
- Random initialization for all modules.
- Initializing $e^{(AE)}$ to the pretrained weights of the respective AAE encoder and the remaining modules from the pretrained cGAN.
- Initializing only the unsupervised pathway from the respective pretrained generator of the AAE. The remaining modules are initialized randomly.
The results in Table 1 demonstrate that the initialization is not crucial for the final performance; however, the second option performs slightly worse. We postulate that the pretrained cGAN makes RoCGAN get stuck 'near' the cGAN optimum. In the remaining experiments, we use the third option, i.e. we initialize the unsupervised pathway from the respective AAE generator while the remaining modules are initialized randomly.
Table 1: The abbreviations 'RI', 'PRETR', 'AE' stand for the three options of (i) random initialization, (ii) pretrained models, (iii) only pretrained AE pathway. Note that the RI and AE initializations are equivalent in terms of SSIM, while PRETR is worse. Therefore, we can select either RI or AE for initializing RoCGAN.
Table 2: The final SSIM values do not vary much for $\lambda_l$ in a wide range, which indicates that our model is robust to the choice of $\lambda_l$.

Hyper-Parameter Range
Our model introduces two new loss terms, i.e. $L_{lat}$ and $L_{AE}$, that need to be validated. Below, we vary one hyper-parameter at a time, while we keep the rest at their selected values. During our experimentation, we observed that the optimal values of these hyper-parameters might differ per case/network; however, unless we mention it explicitly in an experiment, the hyper-parameters remain the same as stated above.
The search space for each term is decided from its theoretical properties and our intuition. In more detail, $\lambda_{ae}$ should have a value at most equal to $\lambda_c$. The latent loss encourages the two pathways' latent representations to be similar; however, since the final evaluation is performed in the pixel space, we postulate that a value smaller than $\lambda_c$ is appropriate.
In Table 2, different values for $\lambda_l$ are presented. The optimal values emerge in the interval [1, 10); however, even for the remaining choices the SSIM values are similar. In our experimentation, RoCGAN is more resilient to changes in $\lambda_l$ than to changes in other hyper-parameters.
Different values of $\lambda_{ae}$ are considered in Table 3. RoCGAN is robust to a wide range of values and both the visual and the quantitative results remain similar. In the following experiments we use $\lambda_{ae} = \lambda_c = 100$ because of the semantic similarity with the content loss; further improvements can be obtained by selecting the best validation values.
Table 3: The network remains robust for a wide range of values of the hyper-parameter $\lambda_{ae}$. The best performance is obtained for lower values of $\lambda_{ae}$, i.e. $\lambda_{ae} < 50$; however, in our evaluation we use $\lambda_{ae} = \lambda_c = 100$ for the semantic meaning. Further improvements might be obtained with one of the other values or with a more extensive search.
Table 4: 'Concat' abbreviates the concatenation of Isola et al. (2017), while 'Proj' abbreviates the projective discriminator of Miyato and Koyama (2018). All three discriminators result in similar performance, with the projective discriminator resulting in a marginal deterioration in the score. However, we believe that for larger networks there might indeed be a difference in performance.

Robustness on Discriminator Variants
Since the advent of cGAN, several discriminator architectures have been used. In the original paper, the discriminator accepts as input only the output of the generator or a sample from the target distribution. By contrast, Isola et al. (2017) propose to instead concatenate the source and the target images. Miyato and Koyama (2018) argue that instead of concatenation, the inner product of the source and the target image should be computed. We assess the robustness of RoCGAN under these different discriminators. As a reminder, we consider the discriminator of Mirza and Osindero (2014) as the default; to implement the variants of Isola et al. (2017) and Miyato and Koyama (2018), we do not change the number or the depth of the layers, but only perform the concatenation and the projection respectively.
In Table 4 the evaluation demonstrates that all three discriminators perform similarly. There is a marginal performance drop in the case of the projective discriminator, but this could be mitigated, for example, with a stronger generator. This experiment demonstrates that the proposed RoCGAN is not tied to a single discriminator, but rather can work with a number of discriminator architectures.
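The difference between the three discriminator variants lies only in how the source enters the score. The toy sketch below illustrates this with flattened signals and a single linear feature layer; the weights, dimensions and the embedding matrix `E` used for the projection are all hypothetical, and the projection is written as an inner product between an embedding of the source and the discriminator features of the target.

```python
import numpy as np

rng = np.random.default_rng(0)
d, f = 12, 6                              # flattened image size and feature size (hypothetical)
W_plain = rng.normal(size=(f, d))         # feature layer that sees the target only
W_concat = rng.normal(size=(f, 2 * d))    # feature layer that sees [source; target]
v = rng.normal(size=f)                    # final linear scoring layer
E = rng.normal(size=(f, d))               # embeds the source for the projection variant

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def d_default(y):
    """Default: the score depends on the (real or generated) target only."""
    return v @ leaky_relu(W_plain @ y)

def d_concat(s, y):
    """'Concat': source and target are concatenated before the first layer."""
    return v @ leaky_relu(W_concat @ np.concatenate([s, y]))

def d_proj(s, y):
    """'Proj': unconditional score plus inner product of source embedding and target features."""
    phi = leaky_relu(W_plain @ y)
    return v @ phi + (E @ s) @ phi

s, y = rng.normal(size=d), rng.normal(size=d)
print(d_default(y), d_concat(s, y), d_proj(s, y))
```

Only the input handling changes across the three functions, mirroring the statement that the number and depth of the layers are kept fixed.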

Other Training Options
We evaluate two more options for training our model: (a) whether the improvement can be obtained without batch normalization, (b) a different latent loss function ($\ell_2$). In Table 5 we add the two options along with the default options from above. The results indicate that (i) batch normalization does not seem to contribute to RoCGAN's performance in this network, (ii) our choice of $\ell_1$ can be replaced by another function with similar results. In the rest of the experiments, we use batch normalization and $\ell_1$ for $L_{lat}$.
Table 5: The two options (along with the 'Default') are (a) to use an $\ell_2$ loss for $L_{lat}$ and (b) to remove batch normalization from the generator pathways. In both cases the performance remains the same.
Table 6: In both the scenes and the faces datasets, RoCGAN verifies our intuition and outperforms the baseline.
Table 7: The task is face super-resolution and the results are similar to the networks without skip connections.

Testing on Static Images
Our first evaluation against the baseline cGAN is on testing without any additional noise (other than the implicit biases of the datasets). The task for both the faces and the scenes is super-resolution in the respective domain. The training images are from MS-Celeb and ImageNet respectively, while the testing images are from Celeb-A and the ImageNet testset. The numerical results in Table 6 indicate that in both cases and with both metrics, RoCGAN outperforms cGAN. We also experiment with the '5layer-skip' networks to assess the performance in the skip case. The results in Table 7 illustrate similar behavior to the previous case, i.e. our model outperforms the baseline.

Testing Under Additional Noise
We conduct a dedicated experiment to evaluate the resilience of the models to noise. The idea is to artificially corrupt the source signal $s$ with the noise models of Sect. 3.1, i.e. feed as input $s + f(s, G)$ for some corruption function $f$. We use the '5layer' networks in the face super-resolution task and corrupt their inputs with (a) adversarial and (b) Bernoulli noise.
Bernoulli noise As a reminder, the noise in this experiment is used exclusively during testing. All three cases of (1, 0, 0), (0, 1, 0), (0, 0, 1) are assessed, along with mixed cases. The quantitative results for Bernoulli noise are reported in Table 8. Our model is consistently better, with a relative performance gain of up to 9.9%. Indicative visual results are depicted in Fig. 4.
Adversarial noise The performance under the three different adversarial attacks is assessed. For IFGSM, we start with a small value of $\epsilon$, i.e. $\epsilon = 0.01$, and progressively increase either the steps or the hyper-parameter's value. As expected, the results in Table 9 highlight that increasing either the steps or $\epsilon$ deteriorates the performance of the networks. However, the performance of cGAN declines at a faster pace than that of our proposed RoCGAN. The relative performance difference (in SSIM) is 4.9% in the original testing, while it progressively grows up to 24.3% in the (1, 0.1) noise. The effect of the steps in IFGSM is further explored in Fig. 5. We fix $\epsilon = 0.01$ and study the evolution in performance as we vary the number of steps. Note that the curve of cGAN is much steeper than that of RoCGAN as the number of steps increases. Beyond 10 steps, the performance of cGAN drops below 0.5 and its output can essentially be considered noise. We perform the same experiment with the PGD attack; the effect of the increasing steps is visualized in Fig. 6. We note that after 10 steps there is a substantial difference between the two models. This difference is maintained and increased if we increase the steps to 30. We also compare the two models under the latent attack in Fig. 7. For 1 or 2 iterations of the latent attack, the curves are similar to the previous two; however, for more steps the curves become steeper than in the previous attacks, while the performance gap grows faster in this attack. The efficiency of the three attacks differs when it comes to the number of steps required, with the latent attack being the most successful. Remarkably though, all three attacks have similar effects on the two models, i.e. the performance gap increases as the number of steps increases. By implementing three adversarial attacks, we illustrate empirically that the proposed model is more robust in the face of noise than the baseline.
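The relative gains quoted in this section compare two mean SSIM values. A minimal sketch of the arithmetic, assuming the gain is normalized by the baseline score (the normalization is our assumption) and using made-up SSIM values purely for illustration:

```python
def relative_gain(ssim_ours, ssim_baseline):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (ssim_ours - ssim_baseline) / ssim_baseline

# Hypothetical mean SSIM values, not taken from the tables of the paper.
print(relative_gain(0.84, 0.80))
```

Under this definition the same absolute SSIM difference yields a larger relative gain when the baseline score is lower, which is one reason the gap widens under intense noise.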
To further analyze the differences between the two models, we create a histogram plot based on the SSIM values. The interval [0.5, 0.95] in which the SSIM values lie is divided into 20 bins, while the vertical axis depicts the frequency of each bin. A histogram with values concentrated to the right (towards 1) signifies superior performance. The histograms comparing the '5layer' cGAN/RoCGAN under IFGSM (adversarial noise) are plotted in Fig. 8 (the histograms for the Bernoulli noise are in Fig. 9). We note that there is an increasing difference between the original histogram (no noise) and the histograms for increasing steps of IFGSM, e.g. Fig. 8a versus Fig. 8d. The same difference is observed as $\epsilon$ increases; in the extreme case of $\epsilon = 0.1$ there is only minor overlap between the two methods. In Fig. 10
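The binning used for these histograms can be reproduced with a single NumPy call. The sketch below uses synthetic SSIM values as a stand-in for the per-image scores; the function name `ssim_histogram` is ours.

```python
import numpy as np

def ssim_histogram(ssim_values, lo=0.5, hi=0.95, bins=20):
    """Frequency of SSIM values in `bins` equal-width bins over [lo, hi].

    Values outside [lo, hi] are ignored by np.histogram.
    """
    counts, edges = np.histogram(ssim_values, bins=bins, range=(lo, hi))
    return counts, edges

rng = np.random.default_rng(0)
scores = rng.uniform(0.5, 0.95, size=1000)   # synthetic per-image SSIM scores
counts, edges = ssim_histogram(scores)
print(counts)
```

Plotting `counts` for both methods on shared bin edges makes the shift of mass towards higher SSIM directly comparable.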

Testing on a Video Sequence
Aside from the experiment with the static testset, we use the 300VW (Shen et al. 2015) video dataset to assess RoCGAN. The videos include non-linear corruptions, e.g. compression, blurriness, rapid motion; such corruptions make a video dataset a well-suited testbed for our evaluation.
Table 10 reports the results of the experiment. The performance of cGAN is slightly worse than in the related experiment on Celeb-A, while RoCGAN's performance remains similar to the static case. The difference in performance increases in the additional noise cases. The FID performance differs from the respective static experiment, since the mean and covariance for the empirical target distribution are extracted from Celeb-A in both cases. We provide a video of the results in https://youtu.be/RvoW4AYnzQU.

Cross-Noise Experiments
A reasonable question is whether data augmentation can be used to make the model robust. In our particular setup, we scrutinize this assumption below: we augment the training samples with noise and assess the testing performance. Specifically, we scrutinize the performance of cGAN/RoCGAN with cross-noise experiments, i.e. we train with one type of noise and test with a different type of noise. For a fair comparison with the aforementioned experiments, we keep the same architectures as above, i.e. the '5layer' network, while the task is face super-resolution.
The first experiment is conducted by training with Bernoulli noise, while adversarial perturbations (IFGSM) are used during testing. The Bernoulli noise (during training) is (5, 0, 0); the variants (10, 0, 0) and $(\theta, 0, 0)$ with $\theta$ uniformly sampled in each iteration from [0, 10] were tried but resulted in similar outcomes. The effect of IFGSM for different steps is plotted in Fig. 11; both models exhibit a small improvement with respect to their counterparts trained without noise in Sect. 3.6. Nevertheless, RoCGAN substantially outperforms the cGAN baseline in the face of increasing IFGSM steps.
An additional experiment is conducted with a completely new type of noise, Gaussian noise, i.e. a type of noise that has not been used previously in any of our models. Each training sample is perturbed with additive Gaussian noise. In every iteration a dense noise mask is sampled online from $\mathcal{N}(0, 10)$ (for pixels in the [0, 255] range). The perturbed input for each method is $s + \mathcal{N}(0, 10)$; see Fig. 12 for a visual illustration. The results when tested with adversarial noise (IFGSM) are visualized in Fig. 13, while the comparison with both Bernoulli and adversarial noise is reported in Table 11. The patterns of previous sections (e.g. Sect. 3.6) emerge under Bernoulli noise, i.e. the more intense the noise, the larger the performance gap. For instance, the original difference of 0.041 is converted into a difference of 0.069 with 1% white pixels; this intensifies to 0.073 under the (1, 1, 1) case. The performance of both methods improves when trained with Gaussian noise, under both Bernoulli and adversarial noise during testing. However, the performance gap between the baseline and our model remains similar when we increase the number of steps (IFGSM); see Fig. 13.
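The training-time perturbation can be sketched as follows. Two details are assumptions on our part: the second parameter of $\mathcal{N}(0, 10)$ is treated as the standard deviation, and the result is clipped back to [0, 255]. The function name `add_gaussian_noise` is hypothetical.

```python
import numpy as np

def add_gaussian_noise(img, sigma=10.0, rng=None):
    """Add a dense, freshly sampled Gaussian noise mask to an image in [0, 255]."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=img.shape)   # one independent draw per pixel value
    return np.clip(img + noise, 0.0, 255.0)          # clipping is an assumption

rng = np.random.default_rng(0)
source = np.full((64, 64, 3), 128.0)                 # synthetic mid-gray source image
noisy = add_gaussian_noise(source, rng=rng)          # a new mask would be drawn every iteration
```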

Conclusion
In this work we study the robustness of conditional GANs in the face of noise. Despite their notorious sensitivity to noise, the topic has so far been relatively understudied. In this paper, we introduced the robust conditional GAN (RoCGAN) model, a new conditional GAN capable of leveraging unsupervised data to learn better latent representations. RoCGAN modifies the generator into a two-pathway generator. The first pathway (reg pathway) performs the regression from the source to the target domain. The new, added pathway (AE pathway) is an autoencoder in the target domain. By adding weight sharing between the two decoders, we implicitly constrain the reg pathway to output signals that span the target manifold. We prove that our model shares similar convergence properties with generative adversarial networks. We demonstrated through large scale experiments on images, for both natural scenes and faces, that RoCGAN outperforms existing, state-of-the-art conditional GAN models, especially in the face of intense noise. Our model can be used with any form of data and has successfully been applied to sparse inpainting/denoising in Chrysos et al. (2019b) as well as super-resolution. We hope that our work can pave the way towards more robust conditional GANs.
It is noticeable that as the noise increases, the cGAN outputs deteriorate fast in contrast to the RoCGAN outputs. Notice the ample differences for intense noise; for instance, in columns (e) versus (f), where cGAN includes unnatural lines in all cases.
Table 10: The relative gain (of RoCGAN in SSIM) in the video sequence is 8% (original testset), while it grows up to 33% (intense noise).
Fig. 13: Performance of cGAN/RoCGAN (mean SSIM) when trained with Gaussian noise. Both models are more robust when trained with Gaussian noise; it requires 15 adversarial steps instead of 10 to achieve the same degradation. Nevertheless, the same pattern with increasing performance gap emerges in the Gaussian noise.
Going forward, we aim to study how to merge different types of noise and how to achieve foolproof robustness in a dense regression setting. Additionally, we aim to study how to combine the polynomial networks (Chrysos et al. 2020) with RoCGAN.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Table 11: The initial difference of 0.041 is converted into a difference of 0.069 with 1% white pixels; that is, RoCGAN increases the performance gap under unseen noise. The same trend is observed in the adversarial (IFGSM) noise.