1 Introduction

Image-to-image translation and, more generally, conditional image generation lie at the heart of computer vision. Conditional generative adversarial networks (cGAN) (Mirza and Osindero 2014) have become a dominant approach in the field, e.g. in dense regression (Isola et al. 2017; Pathak et al. 2016; Ledig et al. 2017; Bousmalis et al. 2016; Liu et al. 2017; Miyato and Koyama 2018; Yu et al. 2018; Tulyakov et al. 2018). The major focus so far has been on improving performance; we advocate instead that improving the generalization performance, e.g. as measured under intense noise and test-time perturbations, is a significant topic with a host of applications, e.g. facial analysis (Georgopoulos et al. 2018). If we aim to utilize cGAN or similar methods as a production technology, they need to have performance guarantees even under large amounts of noise. To that end, we study the robustness of conditional GAN under noise.

Conditional Generative Adversarial Networks consist of two modules, namely a generator and a discriminator. The role of the generator is to map the source signal, e.g. prior information in the form of an image or text, to the target signal. This mapping is completed in two steps: the source signal is embedded into a low-dimensional, latent subspace, which is then mapped to the target subspace. The generator is implemented with convolutional or fully connected layers, which are not invariant to (additive) noise. Thus, an input signal that includes (small) additive noise might be mapped arbitrarily off the target manifold (Vidal et al. 2017). In other words, cGAN do not constrain the output to lie in the target manifold, which makes them vulnerable to any input perturbation.

A notable line of research that tackles sensitivity to noise consists in complementing supervision with an unsupervised learning module. The unsupervised module forms a new pathway that is trained on either the same or different data samples. The unsupervised pathway enables the network to explore structure that is not present in the labelled training set, while implicitly constraining the output. The unsupervised module is only required during the training stage, i.e. it is removed during inference. In Rasmus et al. (2015) and Zhang et al. (2016) the authors augment the original bottom-up (encoder) network with an additional top-down (decoder) module. The autoencoder, i.e. the bottom-up and top-down modules combined, forms an auxiliary task to the original classification. However, in contrast to the classification setting studied in Rasmus et al. (2015) and Zhang et al. (2016), in dense regression both bottom-up and top-down modules exist by default; therefore, augmenting the network with an unsupervised module is not a trivial extension.

Motivated by the combination of supervised and unsupervised modules, we propose a novel conditional GAN model which implicitly constrains the latent subspace. We coin this new model ‘robust conditional GAN’ (RoCGAN). The motivation behind RoCGAN is to take advantage of the structure in the target space of the model. We learn this structure with an unsupervised module included alongside the supervised pathway. Specifically, we replace the original generator, i.e. encoder–decoder, with a two-pathway module (Fig. 1). Similarly to the cGAN generator, the first pathway performs regression, while the second is an autoencoder in the target domain (unsupervised pathway). The two pathways share a similar network structure, i.e. each one includes an encoder–decoder network. The weights of the two decoders are shared to force the latent representations of the two pathways to be semantically similar. Intuitively, this can be thought of as constraining the output of our dense regression to span the target subspace. The unsupervised pathway enables the utilization of all the samples in the target domain even in the absence of a corresponding input sample. During inference, the unsupervised pathway is no longer required; therefore, the testing complexity remains the same as in cGAN.
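To make the weight sharing concrete, the following is a minimal PyTorch-style sketch of such a two-pathway generator. The module names, layer counts and channel sizes here are illustrative assumptions, not our exact architecture (see Sect. 3.2 for the actual training details).

```python
import torch
import torch.nn as nn

class TwoPathwayGenerator(nn.Module):
    """Sketch of the RoCGAN generator: two encoders, one shared decoder."""
    def __init__(self, in_ch=3, out_ch=3, latent_ch=128):
        super().__init__()
        # reg pathway encoder: source domain -> latent subspace
        self.enc_reg = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, latent_ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        # AE pathway encoder: target domain -> latent subspace
        self.enc_ae = nn.Sequential(
            nn.Conv2d(out_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, latent_ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        # a single decoder instance, shared by both pathways
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, s, y=None):
        z_reg = self.enc_reg(s)
        out_reg = self.dec(z_reg)          # regression output
        if y is None:                      # inference: the AE pathway is dropped
            return out_reg
        z_ae = self.enc_ae(y)
        out_ae = self.dec(z_ae)            # reconstruction through the same decoder
        return out_reg, out_ae, z_reg, z_ae
```

Because `self.dec` is a single module used by both pathways, gradients from the AE reconstruction and from the regression update the same decoder weights; at inference only the reg pathway is executed.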

Fig. 1

The mapping process of the generator of the baseline cGAN (a) and our model (b). a The source signal is embedded into a low-dimensional, latent subspace, which is then mapped to the target subspace. The lack of constraints might result in outcomes that are arbitrarily off the target manifold. b On the other hand, in RoCGAN, steps 1b and 2b learn an autoencoder in the target manifold and by sharing the weights of the decoder, we restrict the output of the regression (step 2a). All figures in this work are best viewed in color

In the following sections, we introduce our novel RoCGAN and study their theoretical and experimental properties (Sect. 2). We prove that RoCGAN share similar theoretical properties with the original GAN, i.e. convergence and optimal discriminator (Sect. 2.5). An experiment with synthetic data is designed to visualize the target subspaces and assess our intuition (Sect. 2.6). We experimentally scrutinize the sensitivity of the hyper-parameters and evaluate our model in the face of intense noise (Sect. 3). Moreover, thorough experimentation with both images from natural scenes and human faces is conducted in different tasks to evaluate the model. The experimental results demonstrate that RoCGAN consistently outperform the baseline cGAN in all cases.

Our contributions are summarized as follows:

  • We introduce RoCGAN that leverage the structure of the target space and promote robustness in conditional image generation and dense regression tasks.

  • We scrutinize the model’s performance under the effect of noise and adversarial perturbations. Such a robustness analysis has not previously been conducted in the context of conditional GAN.

  • A thorough experimental analysis for different tasks is conducted. We outline how RoCGAN perform with lateral connections from the encoder to the decoder. The source code is made freely available to the community.

Our preliminary work in Chrysos et al. (2019b) shares the same underlying idea; however, this version is significantly extended. First, all the experiments have been re-conducted from scratch with the new Chainer (Tokui et al. 2015) implementation. The task of super-resolution is introduced in this version, while the noise and adversarial perturbations are categorized and extended, e.g. with the iterative attack case. Lastly, the manuscript is significantly revised; the experimental section is written from scratch, while other parts, such as the related work and method sections, are substantially extended.

In the remainder of this section, we review the related literature on conditional GAN and the lines of research related to our work.

Adversarial attacks (Szegedy et al. 2014; Yuan et al. 2017; Samangouei et al. 2018) are an emerging line of research that relates to our goal. Adversarial attacks are mostly applied to classification tasks; the core idea is that perturbing input samples with a small amount of noise, often imperceptible to the human eye, can lead to severe classification errors. Adversarial attacks are an active field of study with diverse taxonomies of methods (Kurakin et al. 2018), e.g. single/multi-step attacks, targeted/non-targeted, white/black box. Several techniques ‘defend’ against adversarial perturbations. A recent example is the Fortified networks of Lamb et al. (2018), which use Denoising Autoencoders (Vincent et al. 2008) to ensure that the input samples do not fall off the target manifold. Kumar et al. (2017) estimate the tangent space to the target manifold and use it to insert invariances into the discriminator for classification purposes. Even though RoCGAN share similarities with those methods, the scope is different since (a) the output of our method is high-dimensional and (b) adversarial examples have not been extended to dense regression.

Beyond the study of adversarial attacks, combining supervised and unsupervised learning has been used to enhance classification performance. In the Ladder network of Rasmus et al. (2015), the authors adapt the bottom-up network by adding a decoder and lateral connections between the encoder (the original bottom-up network) and the decoder. During training they utilize the augmented network as two pathways: (i) labelled input samples are fed to the initial bottom-up module, (ii) input samples are corrupted with noise and fed to the encoder–decoder with the lateral connections. The latter pathway is an autoencoder; the idea is that it can strengthen the resilience of the network to samples outside the input manifold, while also improving the classification performance.

The effect of noise in the source or target distributions has been the topic of several works. Lehtinen et al. (2018) demonstrate that zero-mean noise in the target distribution does not deteriorate the training, and might even lead to improved generalization. The seminal AmbientGAN of Bora et al. (2018) introduces a method to learn from partial or noisy data. They use a measurement function f to simulate the corruption in the output of the generator; they prove that the generator will learn the clean target distribution. The differences from our work are twofold: (a) we do not have access to the corruption function, and (b) we do have a prior signal to condition the generator. The works of Li et al. (2019) and Pajot et al. (2019) extend AmbientGAN to additional cases. Kaneko et al. (2019) and Thekumparampil et al. (2018) study cGAN when the labels are discrete, categorical distributions; they include a noise transition model to clean the noisy labels. Kaneko and Harada (2019) extend the idea to image-to-image translation, i.e. when, in addition to the conditional source image, there is a categorical, noisy label. The two main differences from our work are that: (a) we do not have categorical labels, and (b) we want to constrain the output of the generator to lie in the target space. A common difference between the aforementioned works and ours is that they do not assess robustness in the face of adversarial perturbations. Gondim-Ribeiro et al. (2018) conduct a study of adversarial perturbations on auto-encoders and conclude that auto-encoders are well-equipped for such attacks. Kos et al. (2018) propose three adversarial attacks tailored for VAE (Kingma and Welling 2014) and VAE-GAN. Arnab et al. (2018) perform the first large-scale evaluation of adversarial attacks on semantic segmentation models.

Our core goal consists in constraining the model’s output. Aside from deep learning approaches, such manifold constraints were typically tackled with component analysis. Canonical correlation analysis (Hotelling 1936) has been extensively used for finding common subspaces that maximally correlate the data (Panagakis et al. 2016). The recent work of Murdock et al. (2018) combines the expressiveness of neural networks with the theoretical guarantees of classic component analysis.

1.1 Conditional GAN

Conditional signal generation leverages a conditioning label, e.g. a prior shape (Tran et al. 2019) or an embedded representation (Mirza and Osindero 2014), to produce the target signal. In this work, we focus on the latter setting, i.e. we assume a dense regression task with the conditioning label being an image.

Conditional image generation is a popular task in computer vision, dominated by approaches similar to the original cGAN paper (Mirza and Osindero 2014). The improvements to the original cGAN can be divided into three categories: changes in (a) the architecture of the generator, (b) the architecture of the discriminator, and (c) the regularization and/or loss terms. The resulting cGAN architectures and their variants have successfully been applied to a host of different tasks, e.g. inpainting (Iizuka et al. 2017; Yu et al. 2018) and super-resolution (Ledig et al. 2017). In this paper, our work focuses on improving any cGAN model; we refer the reader to more targeted works for a thorough review of specific applications, e.g. super-resolution (Agustsson and Timofte 2017) or inpainting (Wu et al. 2017).

The majority of generator architectures follow the influential work of Isola et al. (2017), widely known as ‘pix2pix’, which includes lateral skip connections between the encoder and the decoder of the generator. Similarly to lateral connections, residual blocks are often utilized (Ledig et al. 2017; Chrysos et al. 2019a). An additional engineering improvement is multiscale generation, introduced by Yang et al. (2017). Coarse-to-fine architectures often emerge by training multiple generators, e.g. Huang et al. (2017) and Ma et al. (2017) utilize one generator for the global structure and one for the fine-grained result.

The discriminator in Mirza and Osindero (2014) accepts a generated signal and the corresponding target signal. Isola et al. (2017) make two core modifications to the discriminator (applicable to image-to-image translation): (a) it accepts pairs of source/gt and source/model output images, and (b) it classifies patches instead of the whole image. Miyato and Koyama (2018) replace the inputs to the discriminator with a dot product of the source/gt and source/model output images. Iizuka et al. (2017) include two discriminators, one for the global structure and one for the local patches (block inpainting task).

The goal of the aforementioned improvements is to improve the performance or stabilize the training; none of these techniques aims to make cGAN more robust to noise. Therefore, our work is orthogonal to all such architectural changes and can be combined with any of the aforementioned architectures.

On the other hand, adding regularization terms in the loss function can impose stronger supervision, thus restricting the output. A variety of additional loss terms have been proposed for regularizing cGAN. The feature matching loss (Salimans et al. 2016) was proposed for stabilizing the training of the discriminator; it measures the discrepancy between the discriminator representations (at some layer) of the generated and the target signals. The motivation lies in matching the low-dimensional distributions created by the discriminator layers. Isola et al. (2017) propose a content loss (implemented as an \(\ell _1\) loss) measuring the per-pixel discrepancy of the generated versus the target signal. The perceptual loss is used in Ledig et al. (2017) and Johnson et al. (2016) instead of a per-pixel loss; it denotes the difference between the representations of the target and the generated signal. Frequently, task-specific losses are utilized, such as the identity preservation or symmetry losses in Huang et al. (2017).

The aforementioned regularization terms provide implicit supervision in the generator’s output through similarity with the target signal. However, this supervision does not restrict the generated signals to lie in the target manifold.

2 Method

In this section, we elucidate our proposed RoCGAN. In the following paragraphs, we develop the problem statement (Sect. 2.1), review the original conditional GAN model (Sect. 2.2), and introduce RoCGAN (Sect. 2.3). Subsequently, we study a special case of generators, i.e. generators that include lateral skip connections from the encoder to the decoder, and present the modifications required for this case (Sect. 2.4). In Sect. 2.5, we prove that RoCGAN share the same properties as the original GAN (Goodfellow et al. 2014), and in Sect. 2.6 the intuition behind the model is assessed with synthetic data.

2.1 Problem Statement

The task of conditional signal generation is posed as generating signals given an input label \(\textit{\textbf{s}}\). We assume the label \(\textit{\textbf{s}} \in \textit{\textbf{S}}\), where \(\textit{\textbf{S}}\) is the domain of labels, follows a different distribution from the target signals \(\textit{\textbf{y}} \in \textit{\textbf{Y}}\), where \(\textit{\textbf{Y}}\) is the domain of target signals. Also, we frequently want to include some stochasticity in the mapping; we include a latent variable \(\textit{\textbf{z}} \in \textit{\textbf{Z}}\), where \(\textit{\textbf{Z}}\) is a known distribution, e.g. Gaussian.

Mathematically, if \(\textit{\textbf{G}}\) denotes the mapping we want to learn, then:

$$\begin{aligned} \textit{\textbf{G}}: \textit{\textbf{S}} \times \textit{\textbf{Z}} \xrightarrow {} \textit{\textbf{Y}} \end{aligned}$$
(1)

To learn \(\textit{\textbf{G}}\), we assume we have access to a database of N pairs \(D=\{(\textit{\textbf{s}}^{(1)}, \textit{\textbf{y}}^{(1)}), \ldots , (\textit{\textbf{s}}^{(n)}, \textit{\textbf{y}}^{(n)}), \ldots , (\textit{\textbf{s}}^{(N)}, \textit{\textbf{y}}^{(N)})\}\) with \(n \in [1, N]\). In the following paragraphs we drop the index, i.e. we denote \(\textit{\textbf{s}}^{(n)}\) as \(\textit{\textbf{s}}\), to avoid cluttering the notation.

Conditional GAN, which we develop below, have dominated the literature for learning such mappings \(\textit{\textbf{G}}\). However, our interest lies in studying the case where, during inference, the source signal is \(\textit{\textbf{s}} + \textit{\textbf{f}}(\textit{\textbf{s}}, \textit{\textbf{G}})\) instead of \(\textit{\textbf{s}}\), i.e. there is some unwanted noise in our source signal. We argue that robustness to such noise is of both theoretical and practical value, e.g. for commercial applications.

Notation A bold letter represents a vector/tensor; a plain letter designates a scalar number. Unless explicitly mentioned otherwise, \(||\cdot ||\) denotes the \(\ell _1\) norm. The symbols \({\mathcal {L}}_{*}\) denote loss terms, while \(\lambda _{*}\) denote regularization hyper-parameters optimized on the validation set. For a matrix \(\mathbf {M}\), \(\mathrm{diag}(\mathbf {M})\) denotes its diagonal elements.

2.2 Conditional GAN

GAN consist of a generator and a discriminator module, commonly optimized with alternating gradient descent. The generator’s goal is to model the target distribution \(p_{d}\), while the discriminator’s goal is to discern samples synthesized by the generator from samples of the target (ground-truth) distribution. More precisely, the generator samples \(\textit{\textbf{z}}\) from a prior distribution \(p_{\textit{\textbf{z}}}\), e.g. uniform, and maps that to a sample; the discriminator \(\textit{\textbf{D}}\) tries to distinguish between the synthesized sample and a sample from \(p_{d}\).

The idea behind conditional GAN (cGAN) (Mirza and Osindero 2014) is to provide some additional labels to the generator. The generator \(\textit{\textbf{G}}\) typically takes the form of an encoder–decoder network, where the encoder projects the label into a low-dimensional latent subspace and the decoder performs the opposite mapping, i.e. from low-dimensional to high-dimensional subspace. In other words, the generator performs the regression from the source to the target signal.

The core loss of cGAN is the adversarial loss, which determines the alternating role of the generator and the discriminator:

$$\begin{aligned} {\mathcal {L}}_{adv}= & {} {\mathbb {E}}_{\textit{\textbf{s}},\textit{\textbf{y}} \sim p_{d} (\textit{\textbf{s}},\textit{\textbf{y}})}[\log \textit{\textbf{D}}(\textit{\textbf{y}} | \textit{\textbf{s}})] \nonumber \\&+\,{\mathbb {E}}_{\textit{\textbf{s}} \sim p_{d}(\textit{\textbf{s}}), \textit{\textbf{z}} \sim p_{z} (\textit{\textbf{z}})}[\log (1-\textit{\textbf{D}}(\textit{\textbf{G}}(\textit{\textbf{s}},\textit{\textbf{z}}) | \textit{\textbf{s}}))] \end{aligned}$$
(2)

The loss is optimized through the following min-max problem:

$$\begin{aligned} \min _{\textit{\textbf{w}}_G} \max _{\textit{\textbf{w}}_D} {\mathcal {L}}_{adv}= & {} \min _{\textit{\textbf{w}}_G} \max _{\textit{\textbf{w}}_D} {\mathbb {E}}_{\textit{\textbf{s}}, \textit{\textbf{y}} \sim p_{d}(\textit{\textbf{s}},\textit{\textbf{y}})}[\log \textit{\textbf{D}}(\textit{\textbf{y}} | \textit{\textbf{s}}, \textit{\textbf{w}}_D)] \nonumber \\&+\,{\mathbb {E}}_{\textit{\textbf{s}} \sim p_{d}(\textit{\textbf{s}}), \textit{\textbf{z}} \sim p_{z} (\textit{\textbf{z}})}[\log (1-\textit{\textbf{D}}(\textit{\textbf{G}}(\textit{\textbf{s}},\textit{\textbf{z}}| \textit{\textbf{w}}_G) | \textit{\textbf{s}}, \textit{\textbf{w}}_D))]\nonumber \\ \end{aligned}$$

where \(\textit{\textbf{w}}_G, \textit{\textbf{w}}_D\) denote the generator’s and the discriminator’s parameters respectively. To simplify the notation, we drop the dependencies on the parameters and the noise \(\textit{\textbf{z}}\) in the rest of the paper. In our experiments, we use a discriminator that is not conditioned on the input, i.e. \(\textit{\textbf{D}}(\textit{\textbf{y}})\); we include a related ablation study in Sect. 3.4.3.

Aside from the adversarial loss, cGAN models include auxiliary losses, e.g. a task-specific \(\ell _1\) reconstruction loss or regularization terms for the discriminator. Those losses affect neither the core model nor its adaptation to RoCGAN; we denote the total loss function by \({\mathcal {L}}_{cGAN}\).
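For reference, a minimal sketch of Eq. (2) in code form follows. We assume the discriminator ends in a sigmoid so that its output is a probability, and we write the generator term in the common non-saturating form rather than the literal \(\log (1-\textit{\textbf{D}})\) of the min-max objective; these are implementation conventions, not part of the formulation above.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(D, G_out, y):
    """Sketch of Eq. (2): returns (discriminator loss, generator loss)."""
    real = D(y)                   # D applied to a real target sample
    fake = D(G_out.detach())      # detach: do not backprop into G on the D update
    d_loss = F.binary_cross_entropy(real, torch.ones_like(real)) + \
             F.binary_cross_entropy(fake, torch.zeros_like(fake))
    # non-saturating generator objective: push D(G(s)) towards 1
    g_loss = F.binary_cross_entropy(D(G_out), torch.ones_like(real))
    return d_loss, g_loss
```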

2.3 RoCGAN

Our main goal is to improve robustness to noise in dense regression tasks. To that end, we introduce our model that leverages structure in the target space of the model to enhance the generator’s regression. Our model shares the same structure as cGAN, i.e. it consists of a generator that performs the regression and a discriminator that separates the synthesized from the target signal. We achieve our goal by constructing a generator that includes two pathways.

The generator of RoCGAN includes two pathways instead of the single pathway of the original cGAN. The first pathway, referred to as the reg pathway henceforth, performs a similar regression as its counterpart in cGAN; it accepts a sample from the source domain and maps it to the target domain. We introduce an additional unsupervised pathway, named the AE pathway, which works as an autoencoder in the target domain. Both pathways consist of similar encoder–decoder networks. By sharing the weights of their decoders, we promote the regression outputs to span the target manifold and not induce arbitrarily large errors. A schematic of the generator is illustrated in Fig. 2. The discriminator can remain the same as in cGAN: it accepts the reg pathway’s output along with the corresponding target sample as input.

To simplify the notation below, the superscript ‘AE’ abbreviates modules of the AE pathway and ‘G’ modules of the reg pathway. We denote by \(\textit{\textbf{G}}(\textit{\textbf{s}}) = \textit{\textbf{d}}^{(G)}(\textit{\textbf{e}}^{(G)}(\textit{\textbf{s}}))\) the output of the reg pathway and by \(\textit{\textbf{G}}^{(AE)}(\textit{\textbf{y}}) = \textit{\textbf{d}}^{(AE)}(\textit{\textbf{e}}^{(AE)}(\textit{\textbf{y}}))\) the output of the AE pathway; \(\textit{\textbf{e}}, \textit{\textbf{d}}\) symbolize the encoder and the decoder of a pathway respectively.

Fig. 2

Schematic of the generator of a cGAN versus b our proposed RoCGAN. The single pathway of the original model is replaced with two pathways

The unsupervised module (autoencoder in the target domain) contributes the following loss term:

$$\begin{aligned} {\mathcal {L}}_{AE} = {\mathbb {E}}_{\textit{\textbf{y}} \sim p_{d}(\textit{\textbf{y}})} [f_d^{AE} (\textit{\textbf{y}}, \textit{\textbf{G}}^{(AE)} (\textit{\textbf{y}}))] \end{aligned}$$
(3)

where \(f_d^{AE}\) denotes a function measuring the divergence between the reconstruction and the target.

Despite sharing the weights of the decoders, we cannot ensure that the latent representations of the two pathways span the same subspace. To further reduce the distance between the two representations in the latent space, we introduce the latent loss term \({\mathcal {L}}_{lat}\). This term minimizes the distance between the encoders’ outputs, i.e. it encourages the two representations to be spatially close (in the subspace spanned by the encoders). The latent loss term is:

$$\begin{aligned} {\mathcal {L}}_{lat} = {\mathbb {E}}_{\textit{\textbf{s}},\textit{\textbf{y}} \sim p_{d} (\textit{\textbf{s}},\textit{\textbf{y}})} [f_d^{lat}(\textit{\textbf{e}}^{(G)} (\textit{\textbf{s}}), \textit{\textbf{e}}^{(AE)} (\textit{\textbf{y}}))] \end{aligned}$$
(4)

where \(f_d^{lat}\) can be any divergence function. In practice, for both \({\mathcal {L}}_{lat}\) and \({\mathcal {L}}_{AE}\) we employ ordinary loss functions, e.g. \(\ell _1\) or \(\ell _2\) norms. As a future step we intend to replace the latent loss term \({\mathcal {L}}_{lat}\) with a kernel-based method (Gretton et al. 2007) or a learnable metric for matching the distributions (Ma et al. 2018).

The final loss function of RoCGAN combines the loss terms of the original cGAN \({\mathcal {L}}_{cGAN}\) with the additional two terms for the AE pathway:

$$\begin{aligned} {\mathcal {L}}_{RoCGAN} = {\mathcal {L}}_{cGAN} +\lambda _{ae}\cdot {\mathcal {L}}_{AE} + \lambda _{l}\cdot {\mathcal {L}}_{lat} \end{aligned}$$
(5)
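For concreteness, a sketch of how the terms of Eqs. (3)–(5) combine during the generator update, under the \(\ell _1\) instantiation of the divergence functions that we use in practice. Here `l_cgan` stands for the baseline cGAN loss of Sect. 2.2 computed elsewhere, and the \(\lambda \) defaults are the values reported in Sect. 3.2.

```python
import torch.nn.functional as F

def rocgan_generator_loss(out_reg, out_ae, z_reg, z_ae, y, l_cgan,
                          lambda_ae=100.0, lambda_lat=1.0):
    """Eq. (5): cGAN loss + AE term (Eq. 3) + latent term (Eq. 4)."""
    l_ae = F.l1_loss(out_ae, y)      # Eq. (3): reconstruction in the target domain
    l_lat = F.l1_loss(z_reg, z_ae)   # Eq. (4): align the two latent representations
    return l_cgan + lambda_ae * l_ae + lambda_lat * l_lat
```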

2.4 RoCGAN with Skip Connections

The RoCGAN model of Sect. 2.3 describes a family of networks and not a predefined set of layers. A special case of RoCGAN emerges when skip connections from the encoder to the decoder are included. In this section, skip connections refer only to the case of lateral skip connections from the encoder to the decoder. We study below the modifications required for this case.

Skip connections are frequently used as they enable deeper layers to capture more abstract representations without having to memorize all the information. The shortcut connection allows a low-level representation from an encoder layer to be propagated directly to a decoder layer without passing through the long path, i.e. the network without the lateral skip connections. An autoencoder (AE) with such a skip connection can achieve close to zero reconstruction error by simply propagating the representation through the shortcut. This shatters the signal in the long path (Rasmus et al. 2015), which is an unwanted behavior.

To ensure that the long path is trained, we explore a number of regularization methods. Our first approach, introduced in our original work, was to include a regularization loss term. In this work, we propose an additional regularization technique for the skip case.

In the first approach, we implicitly tackle the issue by maximizing the variance captured by the longer path representations. We add a loss term that penalizes the correlations in the representations (of a layer) and thus implicitly encourages the representations to capture diverse and useful information. We implement the decov loss (Cogswell et al. 2016):

$$\begin{aligned} {\mathcal {L}}_{decov} = \frac{1}{2} \left( ||\textit{\textbf{C}}||_F^2 - ||\mathrm{diag}(\textit{\textbf{C}})||_2^2\right) \end{aligned}$$
(6)

where \(\textit{\textbf{C}}\) is the covariance matrix of the layer’s representations. The loss is minimized when the covariance matrix is diagonal, i.e. it imposes a cost that minimizes the covariance between hidden units without restricting the diagonal elements, which contain the variances of the hidden representations.
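A sketch of Eq. (6) for a mini-batch of flattened layer activations; as in Cogswell et al. (2016), the covariance is estimated over the batch dimension.

```python
import torch

def decov_loss(h):
    """Decov loss of Eq. (6); h is a (batch, features) activation matrix."""
    h = h - h.mean(dim=0, keepdim=True)           # center the activations
    c = (h.t() @ h) / h.shape[0]                  # batch covariance matrix C
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
    return 0.5 * off_diag                         # 0.5 * (||C||_F^2 - ||diag(C)||_2^2)
```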

A similar loss is explored by Valpola (2015), where the decorrelation loss is applied in every layer. Their loss term has stronger constraints: (i) it favors an identity covariance matrix, and (ii) it penalizes the smaller eigenvalues of the covariance more. We have not explored this alternative loss term, as the decov loss worked in our case without the additional assumptions of Valpola (2015).

In this work, we consider an alternative regularization technique, motivated by Rasmus et al. (2015) who include noise in the lateral skip connections. Specifically, we include zero-mean Gaussian noise in the shortcut connection, i.e. the encoder representation is modified by additive Gaussian noise when skipped to the decoder. In our experimentation, both approaches can lead to improved results; we prefer the latter in the experiments.
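The noisy-shortcut regularizer amounts to a one-line change in the forward pass. In the sketch below, the noise scale `sigma` is a hypothetical value; we do not list the exact standard deviation here.

```python
import torch

def noisy_skip(enc_feat, sigma=0.1, training=True):
    """Lateral shortcut with additive zero-mean Gaussian noise during training."""
    if training:
        return enc_feat + sigma * torch.randn_like(enc_feat)
    return enc_feat  # the shortcut is clean at inference time
```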

2.5 Theoretical Analysis

In the next few paragraphs, we prove that RoCGAN share the properties of the original GAN (Goodfellow et al. 2014). The derivations follow similar steps to those of the original GAN, but are included to make the paper self-contained. We derive the optimal discriminator and then compute the optimal value of \({\mathcal {L}}_{adv}(\textit{\textbf{G}}, \textit{\textbf{D}})\).

Proposition 1

For a fixed generator \(\textit{\textbf{G}}\) (reg pathway), the optimal discriminator is:

$$\begin{aligned} \textit{\textbf{D}}^{*} = \frac{p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}})}{p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}}) +p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}})} \end{aligned}$$
(7)

where \(p_{g}\) is the model (generator) distribution.

Proof

Since the generator is fixed, the goal of the discriminator is to maximize the \({\mathcal {L}}_{adv}\) where:

$$\begin{aligned} {\mathcal {L}}_{adv}(\textit{\textbf{G}}, \textit{\textbf{D}})= & {} \int _{\textit{\textbf{y}}} \int _{\textit{\textbf{s}}} p_{d} (\textit{\textbf{y}}, \textit{\textbf{s}}) \log \textit{\textbf{D}}(\textit{\textbf{y}} | \textit{\textbf{s}}) \textit{\textbf{dy}} \textit{\textbf{ds}} \nonumber \\&+\int _{\textit{\textbf{s}}} \int _{\textit{\textbf{z}}} p_{d}(\textit{\textbf{s}}) p_{z}(\textit{\textbf{z}}) \log (1-\textit{\textbf{D}}(\textit{\textbf{G}}(\textit{\textbf{s}},\textit{\textbf{z}}) | \textit{\textbf{s}})) \textit{\textbf{ds}} \textit{\textbf{dz}} \nonumber \\= & {} \int _{\textit{\textbf{y}}} \int _{\textit{\textbf{s}}} \left[ p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}}) \log \textit{\textbf{D}} (\textit{\textbf{y}} | \textit{\textbf{s}}) + p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}}) \log (1 - \textit{\textbf{D}}(\textit{\textbf{y}} | \textit{\textbf{s}})) \right] \textit{\textbf{dy}} \textit{\textbf{ds}} \end{aligned}$$
(8)

To maximize \({\mathcal {L}}_{adv}\), we need to optimize the integrand above. We note that, with respect to \(\textit{\textbf{D}}\), the integrand has the form \(f(y) = a \cdot \log (y) + b \cdot \log (1 - y)\). For \(a, b \in (0, 1)\), as in our case, \(f'(y) = \frac{a}{y} - \frac{b}{1 - y}\) vanishes at \(y = \frac{a}{a + b}\), and since \(f''(y) < 0\) everywhere, f obtains its global maximum at \(\frac{a}{a + b}\), so:

$$\begin{aligned} {\mathcal {L}}_{adv}(\textit{\textbf{G}}, \textit{\textbf{D}})\le & {} \int _{\textit{\textbf{y}}} \int _{\textit{\textbf{s}}} \left[ p_{d} (\textit{\textbf{s}}, \textit{\textbf{y}}) \log \textit{\textbf{D}}^{*}(\textit{\textbf{y}} | \textit{\textbf{s}}) + p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}}) \log (1 - \textit{\textbf{D}}^{*}(\textit{\textbf{y}} | \textit{\textbf{s}})) \right] \textit{\textbf{dy}} \textit{\textbf{ds}} \end{aligned}$$
(9)

with

$$\begin{aligned} \textit{\textbf{D}}^{*} = \frac{p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}})}{p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}}) +p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}})} \end{aligned}$$
(10)

thus \({\mathcal {L}}_{adv}\) obtains the maximum with \(\textit{\textbf{D}}^{*}\). \(\square \)

Proposition 2

Given the optimal discriminator \(\textit{\textbf{D}}^{*}\) the global minimum of \({\mathcal {L}}_{adv}\) is reached if and only if \(p_{g} = p_{d}\), i.e. when the model (generator) distribution matches the data distribution.

Proof

From Proposition 1, we have found the optimal discriminator \(\textit{\textbf{D}}^{*}\), i.e. the \(\mathrm{arg}\,\mathrm{max}_{\textit{\textbf{D}}} {\mathcal {L}}_{adv}\). Substituting this optimal value, we obtain:

$$\begin{aligned}&\max _{\textit{\textbf{D}}} {\mathcal {L}}_{adv}(\textit{\textbf{G}}, \textit{\textbf{D}})\nonumber \\&\quad = \int _{\textit{\textbf{y}}} \int _{\textit{\textbf{s}}} \left[ p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}}) \log \textit{\textbf{D}}^{*}(\textit{\textbf{y}} | \textit{\textbf{s}}) + p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}}) \log (1 - \textit{\textbf{D}}^{*}(\textit{\textbf{y}} | \textit{\textbf{s}})) \right] \textit{\textbf{dy}} \textit{\textbf{ds}} \nonumber \\&\quad =\int _{\textit{\textbf{y}}} \int _{\textit{\textbf{s}}} \left[ p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}}) \log \left( \frac{p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}})}{p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}}) + p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}})}\right) + p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}}) \log \left( 1 - \frac{p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}})}{p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}}) + p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}})}\right) \right] \textit{\textbf{dy}} \textit{\textbf{ds}} \nonumber \\&\quad =\int _{\textit{\textbf{y}}} \int _{\textit{\textbf{s}}} \left[ p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}}) \log \left( \frac{p_{d} (\textit{\textbf{s}}, \textit{\textbf{y}})}{p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}}) + p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}})}\right) + p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}}) \log \left( \frac{p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}})}{p_{d}(\textit{\textbf{s}}, \textit{\textbf{y}}) + p_{g}(\textit{\textbf{s}}, \textit{\textbf{y}})}\right) \right] \textit{\textbf{dy}} \textit{\textbf{ds}} \end{aligned}$$
(11)

We add and subtract \(\log (2)\) in both terms, which after a few algebraic manipulations yields:

$$\begin{aligned} \max _{\textit{\textbf{D}}} {\mathcal {L}}_{adv}(\textit{\textbf{G}}, \textit{\textbf{D}})= & {} -2\cdot \log (2) + KL\left( p_{d} || \frac{p_{d} + p_{g}}{2}\right) \nonumber \\&+\, KL\left( p_{g} || \frac{p_{d} + p_{g}}{2}\right) \end{aligned}$$

where KL symbolizes the Kullback–Leibler divergence. The expression can be rewritten more conveniently with the help of the Jensen–Shannon divergence (JSD) as

$$\begin{aligned} \max _{\textit{\textbf{D}}} {\mathcal {L}}_{adv}(\textit{\textbf{G}}, \textit{\textbf{D}}) = -\log (4) + 2\cdot JSD(p_{d} ||p_{g}) \end{aligned}$$
(12)

The Jensen–Shannon divergence is non-negative and obtains the zero value only if \(p_{d} = p_{g}\). Equivalently, the last equation has a global minimum (under the constraint that the discriminator is optimal) when \(p_{d} = p_{g}\). \(\square \)

Fig. 3

Qualitative results in the synthetic experiment of Sect. 2.6. Each plot corresponds to the respective manifold in the output vector; the first and third depend on both x, y (xyz plots), while the rest depend only on x (xz plots). The green color visualizes the target manifold, the red the baseline and the blue ours. Even though the two models include the same parameters during inference, the baseline does not approximate the target manifold as well as our method (Color figure online)

2.6 Experiment on Synthetic Data

We design an experiment on synthetic data to explore the differences between the original generator and our two-pathway generator. Specifically, we design a network where each encoder/decoder consists of two fully connected layers, each followed by a ReLU. We optimize the generators only, to avoid adding extra learned parameters.

The inputs/outputs of this network span a low-dimensional space, which depends on two independent variables \(x, y \in [-1, 1]\). We have experimented with several arbitrary functions in the input and output vectors and they behave in a similar way. We exhibit here the case with input vector \([x, y, e^{2x}]\) and output vector \([x + 2y + 4, e^x + 1, x + y + 3, x + 2]\). The encoder of the reg pathway accepts the three inputs and projects them into a two-dimensional space; the decoder maps that to the target four-dimensional space.
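For clarity, the synthetic pairs can be generated as in the sketch below; sampling x, y uniformly over \([-1, 1]\) is an assumption consistent with the stated domain, and the sample count matches the test-time value used further down.

```python
import numpy as np

def sample_synthetic(n=6400, seed=0):
    """Pairs with input [x, y, e^{2x}] and output [x+2y+4, e^x+1, x+y+3, x+2]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    y = rng.uniform(-1.0, 1.0, size=n)
    inputs = np.stack([x, y, np.exp(2 * x)], axis=1)          # shape (n, 3)
    targets = np.stack([x + 2 * y + 4, np.exp(x) + 1,
                        x + y + 3, x + 2], axis=1)            # shape (n, 4)
    return inputs, targets
```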

We train the baseline and the autoencoder modules separately and use their pre-trained weights to initialize the two pathway network. The loss function of the two pathway network consists of the \({\mathcal {L}}_{lat}\) (Eq. 4) and \(\ell _2\) content losses in the two pathways. The networks are trained either till convergence or till 100,000 iterations (batch size 128) are completed.

During testing, 6400 new points are sampled and the overlaid results are depicted in Fig. 3; the individual figures for each output can be found in the supplementary. The \(\ell _1\) errors for the two cases are: 9843 for the baseline and 1520 for the two pathway generator. We notice that the two pathway generator approximates the target manifold better with the same number of parameters during inference.

3 Experiments

In the following paragraphs we initially design and explain the noise models (Sect. 3.1), review the implementation details (Sect. 3.2) and describe the experimental setup (Sect. 3.3). Subsequently, we conduct an ablation study and evaluate our model on real-world datasets, including natural scenes and human faces.

3.1 Noise Models

In this work, we explore two different types of noise with multiple variants tested in each type. Those two types are Bernoulli noise and adversarial noise.

Bernoulli noise For an input \( \textit{\textbf{s}} \), the noise model is represented by a Bernoulli function \( \Phi _v(\textit{\textbf{s}}, \theta ) \). Specifically, we have

$$\begin{aligned} \Phi _v(\textit{\textbf{s}}, \theta )_{i,j} = {\left\{ \begin{array}{ll} v &{} \text{ with } \text{ probability } \theta \\ \textit{\textbf{s}}_{i,j} &{} \text{ with } \text{ probability } 1 - \theta \end{array}\right. } \end{aligned}$$
(13)

To provide a practical example, assume that \(v=0\) and \(\theta =0.5\), then an image \(\textit{\textbf{s}}\) has half of its pixels converted to black, which is known as sparse inpainting.
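A sketch of the corruption of Eq. (13) on an image array follows.

```python
import numpy as np

def bernoulli_corrupt(s, v=0.0, theta=0.5, seed=None):
    """Eq. (13): each element is replaced by v with probability theta."""
    rng = np.random.default_rng(seed)
    mask = rng.random(s.shape) < theta   # True where a pixel gets corrupted
    return np.where(mask, v, s)
```

For example, `bernoulli_corrupt(img, v=0.0, theta=0.5)` reproduces the sparse-inpainting example above; for a channel-wise variant, the mask would instead be drawn per spatial location and broadcast across channels.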

Adversarial examples Apart from testing in the face of additional Bernoulli noise, we explore adversarial attacks in the context of dense regression. Recent works, e.g. Szegedy et al. (2014), Yuan et al. (2017), Samangouei et al. (2018) and Madry et al. (2018), explore the robustness of (deep) classifiers.

Contrary to the classification case, there has not been much investigation of adversarial attacks in the context of image-to-image translation or other dense regression tasks. However, since an adversarial example perturbs the source signal, dense regression tasks can be vulnerable to such modifications. We conduct a thorough investigation of this phenomenon by attacking our model with three adversarial attacks for dense regression, which we introduce in the following paragraphs.

The first, and most ubiquitous, attack is the fast gradient sign method (FGSM), introduced by Goodfellow et al. (2015). It is the simplest attack and the basis for several variants. In addition, Dou et al. (2018) mathematically prove the efficacy of this attack in the classification case. Let us define the auxiliary function:

$$\begin{aligned} \textit{\textbf{u}}(\textit{\textbf{s}}) = \textit{\textbf{s}} + {\epsilon }\mathrm{sign}\left( \nabla _{\textit{\textbf{s}}} {\mathcal {L}} (\textit{\textbf{s}}, \textit{\textbf{y}}) \right) \end{aligned}$$
(14)

with \({\mathcal {L}}(\textit{\textbf{s}}, \textit{\textbf{y}}) = ||\textit{\textbf{y}} -\textit{\textbf{G}}(\textit{\textbf{s}})||_1\). Then, each source signal \(\textit{\textbf{s}}\) is modified as:

$$\begin{aligned} \tilde{\textit{\textbf{s}}} = \textit{\textbf{s}} + \varvec{\eta } \end{aligned}$$
(15)

The perturbation \(\varvec{\eta }\) is defined as:

$$\begin{aligned} \varvec{\eta } = \textit{\textbf{u}}(\textit{\textbf{s}}) - \textit{\textbf{s}} = {\epsilon }\,\mathrm{sign}\left( \nabla _{\textit{\textbf{s}}} {\mathcal {L}} (\textit{\textbf{s}}, \textit{\textbf{y}}) \right) \end{aligned}$$
(16)

i.e. \(\tilde{\textit{\textbf{s}}} = \textit{\textbf{u}}(\textit{\textbf{s}})\), with \({\epsilon }\) a hyper-parameter, \(\textit{\textbf{y}}\) the target signal and \({\mathcal {L}}\) an appropriate loss function.

However, to make the perturbation stronger, we iterate the gradient computation. The iterative FGSM (IFGSM) method of Dou et al. (2018) is:

$$\begin{aligned} \tilde{\textit{\textbf{s}}}_{(k)} = Clip\{ \textit{\textbf{u}}(\tilde{\textit{\textbf{s}}}_{(k-1)}) \} \end{aligned}$$
(17)

where k denotes the iteration index, \(\tilde{\textit{\textbf{s}}}_{(0)} = \textit{\textbf{s}}\), and the Clip function restricts the output to the valid range of the source signal.
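A sketch of the (I)FGSM attack of Eqs. (14)–(17) against a dense-regression generator; `steps=1` recovers plain FGSM. The [0, 1] pixel range assumed by the clipping is our assumption here.

```python
import torch

def ifgsm_attack(G, s, y, eps=0.01, steps=1):
    """(I)FGSM of Eqs. (14)-(17) with the l1 regression loss L(s,y)=||y-G(s)||_1."""
    s_adv = s.clone()
    for _ in range(steps):
        s_adv = s_adv.detach().requires_grad_(True)
        loss = (y - G(s_adv)).abs().sum()             # L(s, y)
        loss.backward()
        with torch.no_grad():
            s_adv = s_adv + eps * s_adv.grad.sign()   # u(.) of Eq. (14)
            s_adv = s_adv.clamp(0.0, 1.0)             # Clip to the signal range
    return s_adv.detach()
```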

The second attack selected is the projected gradient descent (PGD) method of Madry et al. (2018). PGD is an iterative method which, given the source signal \(\tilde{\textit{\textbf{s}}}_{(0)} = \textit{\textbf{s}}\), modifies it as:

$$\begin{aligned} {\tilde{\textit{\textbf{s}}}_{(k)} = Clip\{ \Pi _{\textit{\textbf{s}}} \textit{\textbf{u}}(\tilde{\textit{\textbf{s}}}_{(k-1)}) \}} \end{aligned}$$
(18)

Robustness to this attack typically implies robustness to all first order methods (Madry et al. 2018), making it a particularly interesting case study.
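PGD differs from IFGSM in the projection \(\Pi _{\textit{\textbf{s}}}\) applied at each step. As a sketch reusing `ifgsm_attack` from above, and assuming an \(\ell _\infty \) ball of hypothetical radius `eps_ball` around the clean input (the Clip of Eq. (18) happens inside `ifgsm_attack`):

```python
import torch

def pgd_attack(G, s, y, eps=0.01, eps_ball=0.05, steps=10):
    """PGD of Eq. (18): FGSM update, then projection Pi_s onto a ball around s."""
    s_adv = s.clone()
    for _ in range(steps):
        s_adv = ifgsm_attack(G, s_adv, y, eps=eps, steps=1)  # u(.) followed by Clip
        s_adv = torch.max(torch.min(s_adv, s + eps_ball),    # projection Pi_s
                          s - eps_ball)
    return s_adv
```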

The third adversarial method is the latent attack of Kos et al. (2018). The loss in this attack is computed in the latent space, i.e. the output of the encoder \(\textit{\textbf{e}}^{(G)}(\textit{\textbf{s}})\).

In the following sections, we denote the (I)FGSM attacks by the tuple \((k, {\epsilon })\), which declares the total number of iterative steps and the value of the \({\epsilon }\) hyper-parameter respectively. In the Bernoulli case, we use three cases of v, i.e. \(v=0\) corresponding to black pixels, \(v=1\) corresponding to white pixels, and channel-wise \(v=0\). We abbreviate the three cases with a triplet \((\theta _{v=0}, \theta _{v=1}, \theta _{v=0,channel})\) denoting the \(\theta \) probability in each case. For instance, the triplet (50, 0, 0) denotes Bernoulli noise with \(v=0\) and probability \(50\%\). Unless explicitly mentioned otherwise, the default adversarial attack below is IFGSM.

3.2 Implementation Details

Conditional GAN model Several cGAN models have been proposed (see Sect. 1.1). In our experiments, we employ a simple cGAN model based on the best (experimental) practices so far (Isola et al. 2017; Salimans et al. 2016; Zhu et al. 2017).

The works of Salimans et al. (2016) and Isola et al. (2017) demonstrate that auxiliary loss terms, i.e. feature matching and content losses, improve the final outcome; hence we consider those as part of the baseline cGAN. The feature matching loss is:

$$\begin{aligned} {\mathcal {L}}_{f} = {\mathbb {E}}_{\textit{\textbf{s}},\textit{\textbf{y}} \sim p_{d} (\textit{\textbf{s}},\textit{\textbf{y}})} ||\pi (\textit{\textbf{G}}(\textit{\textbf{s}})) - \pi (\textit{\textbf{y}})|| \end{aligned}$$
(19)

where \(\pi ()\) extracts the features from the penultimate layer of the discriminator.

The final loss function for the cGAN is the following:

$$\begin{aligned} {\mathcal {L}}_{cGAN} = {\mathcal {L}}_{adv} + \lambda _c \cdot \underbrace{{\mathbb {E}}_{\textit{\textbf{s}},\textit{\textbf{y}} \sim p_{d} (\textit{\textbf{s}},\textit{\textbf{y}})} [||\textit{\textbf{G}}(\textit{\textbf{s}}) - \textit{\textbf{y}}||]}_{content-loss} + \lambda _{\pi }\cdot {\mathcal {L}}_{f} \end{aligned}$$
(20)

where \(\lambda _c, \lambda _{\pi }\) are hyper-parameters to balance the loss terms.
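A sketch of Eqs. (19)–(20) for the generator update. Here `D_features` is a hypothetical helper returning the penultimate-layer activations \(\pi (\cdot )\) of the discriminator, `adv_loss` stands for the generator side of Eq. (2), \(\lambda _c = 100\) follows Sect. 3.4.2, and the \(\lambda _{\pi }\) default is an illustrative assumption.

```python
import torch.nn.functional as F

def cgan_generator_loss(G_s, y, D_features, adv_loss,
                        lambda_c=100.0, lambda_pi=10.0):
    """Eq. (20): adversarial + weighted content (l1) + feature matching terms."""
    l_content = F.l1_loss(G_s, y)                     # per-pixel content loss
    l_f = F.l1_loss(D_features(G_s), D_features(y))   # Eq. (19)
    return adv_loss + lambda_c * l_content + lambda_pi * l_f
```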

RoCGAN model To fairly compare against the aforementioned cGAN model, we make only the following three adaptations: (i) we duplicate the encoder/decoder (for the new AE pathway); (ii) we share the decoder’s weights in the two pathways; (iii) we augment the loss function with the additional loss terms. We emphasize that this is only performed for experimental validation; in practice, the encoder of the AE pathway can have a different structure, or new task-specific loss terms can be introduced; we have made no effort to further optimize RoCGAN. We use the \(\ell _1\) loss for both \({\mathcal {L}}_{lat}\) and \({\mathcal {L}}_{AE}\).

Training details A ‘layer’ refers to a block of three units: a convolutional unit with a \(4\times 4\) kernel size, followed by Leaky ReLU and batch normalization (Ioffe and Szegedy 2015). The hyper-parameters introduced by our model are: \(\lambda _{l} = 1\), \(\lambda _{ae} = 100\). The values of the common hyper-parameters, e.g. \(\lambda _{c}\), \(\lambda _{\pi }\), are the same between cGAN/RoCGAN. A mild data augmentation technique is utilized for training cGAN/RoCGAN: the training images are reshaped to \(75\times 75\) and random patches of \(64\times 64\) are fed into the network. Each training image is horizontally flipped with probability 0.5; no other augmentation is used. A constant learning rate of \(2\cdot 10^{-4}\) (same as in Isola et al. 2017) is used for \(3\cdot 10^5\) iterations with a batch size of 64. During training, we run validation every \(10^4\) iterations and export the best model, which is used for testing. The discriminator consists of 3 convolutional layers followed by a fully-connected layer. The input to the discriminator is either the output of the generator or the respective target image, i.e. we do not condition the discriminator on the source image.
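The ‘layer’ block just described, as a sketch; the stride, padding, and Leaky ReLU slope are illustrative assumptions for a standard \(4\times 4\) downsampling block.

```python
import torch.nn as nn

def layer_block(in_ch, out_ch):
    """One 'layer': 4x4 convolution -> Leaky ReLU -> batch normalization."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(0.2),
        nn.BatchNorm2d(out_ch))
```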

Our workhorse for testing is a network denoted ‘5layer’, because each encoder and decoder consists of 5 layers. In the following experiments ‘Baseline-5layer’ represents the cGAN ‘5layer’ case, while ours is indicated as ‘Ours-5layer’. In the skip case, we add a skip connection from the output of the third layer of the encoder to the respective decoder layer; we add a ‘-skip’ in the respective method name.

We train an adversarial autoencoder (AAE) (Makhzani et al. 2015), an established method for learning compressed representations, as an upper performance bound. Each module of the AAE shares the same architecture as its cGAN counterpart, while the AAE is trained with images in the target space. The target images are used as the input to the AAE and its output, i.e. the reconstruction, is used for the evaluation. In our experimental setting, the AAE can be thought of as an upper performance limit of RoCGAN/cGAN for a given capacity (number of parameters).

The task selected for our testing is super-resolution by \(4\times \). That is, we downsample an image by a factor of 4, upsample it with bilinear interpolation, and use this interpolated image as the corrupted input. In the supplementary, we include an experiment with sparse inpainting.
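A sketch of the \(4\times \) corruption pipeline; we use bilinear resizing for the downsampling step as well, which is an assumption since only the upsampling interpolation is specified above.

```python
import torch.nn.functional as F

def corrupt_for_sr(img, factor=4):
    """Downsample by `factor`, then bilinearly upsample back to the input size."""
    h, w = img.shape[-2:]
    small = F.interpolate(img, size=(h // factor, w // factor),
                          mode='bilinear', align_corners=False)
    return F.interpolate(small, size=(h, w),
                         mode='bilinear', align_corners=False)
```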

3.3 Experimental Setup

Datasets In addition to validating our model on synthetic data, we utilize a variety of real-world datasets:

  • MS-Celeb (Guo 2016) is introduced for large-scale face recognition. It contains approximately 10 million facial images from 1 million celebrities. The dataset was collected semi-automatically, and the noise was not manually removed from the training images. We export 3 million samples for training and use 100 thousand images for validation.

  • CelebFaces attributes dataset (Celeb-A) (Liu et al. 2015) constitutes a popular benchmark for large-scale face attribute classification. Each image is annotated with 40 binary attributes. Celeb-A is used in conjunction with MS-Celeb in this work, where the latter is used for training and the former for testing. All 202,500 samples of Celeb-A are used for testing. This combination is the main focus of our experiments; specifically, it is used in Sects. 3.4, 3.5, 3.6 and 3.8.

  • 300 Videos in the Wild (300VW) (Shen et al. 2015) is a benchmark for face tracking; it includes a sparse set of points annotated per frame. It includes three categories of videos with increasing difficulty; in this work we use as the test set the most challenging category (category 3), which includes over 27,000 frames. We use 300VW in Sect. 3.7 for assessing the performance of RoCGAN on video datasets.

  • ImageNet (Deng et al. 2009) is a large image database with 1000 different object classes; on average, over five hundred images exist per class. In the experiment on natural scenes, we utilize the training set of ImageNet, which consists of 1.2 million images, and its test set, which includes 98 thousand images (Sect. 3.5).

The two categories of images, i.e. faces and natural scenes, are extensively used in computer vision and machine learning, both for their commercial value and for their online availability. For the experiments with faces, MS-Celeb constitutes the training set; for the natural scenes, ImageNet does.

Error metrics In the comparisons of RoCGAN against cGAN the following metrics are used:

  • Structural similarity (SSIM) (Wang et al. 2004): A metric used to quantify the perceived quality of an image. We use it to compare every output image with the reference (ground-truth) image; it ranges in [0, 1], with higher values demonstrating better quality.

  • Frechet inception distance (FID) (Heusel et al. 2017): A measure of the quality of the generated images, frequently used with GAN. It extracts second order statistics from a pretrained classifier applied to the images. FID assumes that the two distributions \(p_1\) and \(p_2\) are multivariate Gaussian, i.e. \({\mathcal {N}}(\varvec{\mu }_1, \textit{\textbf{C}}_1)\) and \({\mathcal {N}}(\varvec{\mu }_2, \textit{\textbf{C}}_2)\). Then:

    $$\begin{aligned} FID(p_1, p_2) = ||\varvec{\mu }_1 - \varvec{\mu }_2||_2^2 + Tr(\textit{\textbf{C}}_1 +\textit{\textbf{C}}_2 - 2 (\textit{\textbf{C}}_1 \textit{\textbf{C}}_2)^{\frac{1}{2}}) \end{aligned}$$
    (21)

    In our work \(p_1\) is the distribution of the ground-truth images, while \(p_2\) is the distribution of the generated images from each method. FID is lower bounded by 0, which is attained when \(p_2\) matches \(p_1\); a lower FID score translates to the distributions being ‘closer’. We compute the FID score using the Inception network (in Chainer); a minimal computation sketch follows this list.
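A minimal sketch of Eq. (21) given two feature matrices extracted from the pretrained classifier; SciPy's `sqrtm` computes the matrix square root \((\textit{\textbf{C}}_1 \textit{\textbf{C}}_2)^{1/2}\).

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats1, feats2):
    """Eq. (21) on two (n_samples, dim) feature matrices."""
    mu1, mu2 = feats1.mean(axis=0), feats2.mean(axis=0)
    c1 = np.cov(feats1, rowvar=False)
    c2 = np.cov(feats2, rowvar=False)
    covmean = sqrtm(c1 @ c2).real     # drop negligible imaginary parts
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2.0 * covmean))
```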

3.4 Ablation Study

In the following paragraphs we conduct an ablation study to assess RoCGAN in different cases; specifically, we evaluate the sensitivity over a range of hyper-parameter values and under different initialization options. We also summarize different options for the loss functions and other architecture-related choices.

Unless mentioned otherwise, the ‘5layer’ network is used; the task selected is face super-resolution, and SSIM is reported as the metric in this part. The options selected in the ablation study are used in the subsequent experiments and comparisons against cGAN.

3.4.1 Initialization of RoCGAN

We conduct an experiment to evaluate different initialization options for RoCGAN. The motivation for the different initializations is to assess the necessity of the pretrained models as used in Chrysos et al. (2019b). The options are:

  • Random initialization for all modules.

  • Initializing \(\textit{\textbf{e}}^{(AE)}\) with the pretrained weights of the respective AAE encoder and the remaining modules from the pretrained cGAN.

  • Initializing only the unsupervised pathway from the respective pretrained AAE generator. The remaining modules are initialized randomly.

The results in Table 1 demonstrate that the initialization is not crucial for the final performance; however, the second option performs slightly worse. We postulate that the pretrained cGAN makes RoCGAN get stuck ‘near’ the cGAN optimum. In the remaining experiments, we use the third option, i.e. we initialize the unsupervised pathway from the respective AAE generator, while the remaining modules are initialized randomly.

Table 1 Quantitative results evaluating the different initialization options (Sect. 3.4.1)

3.4.2 Hyper-Parameter Range

Our model introduces two new loss terms, i.e. \({\mathcal {L}}_{lat}\) and \({\mathcal {L}}_{AE}\), whose hyper-parameters need to be validated. Below, we scrutinize one hyper-parameter at a time, while keeping the rest at their selected values. During our experimentation, we observed that the optimal values of these hyper-parameters might differ per case/network; however, unless mentioned explicitly in an experiment, the hyper-parameters remain as stated above.

The search space for each term is decided from its theoretical properties and our intuition. In more detail, \(\lambda _{ae}\) should have a value at most equal to \(\lambda _c\). The latent loss encourages the two pathways’ latent representations to be similar; however, since the final evaluation is performed in the pixel space, we postulate that a value smaller than \(\lambda _c\) is appropriate.

In Table 2, different values for \(\lambda _{l}\) are presented. The optimal values emerge in the interval [1, 10); however, even for the remaining choices the SSIM values are similar. In our experimentation, RoCGAN are more resilient to changes in \(\lambda _{l}\) than to changes in other hyper-parameters.

Table 2 Validation of \(\lambda _{l}\) hyper-parameter in the ‘5layer’ network

Different values of \(\lambda _{ae}\) are considered in Table 3. RoCGAN are robust to a wide range of values; both the visual and the quantitative results remain similar. In the following experiments we use \(\lambda _{ae} = \lambda _{c} = 100\) because of the semantic similarity with the content loss; further improvements can be obtained with the best validated values.

Table 3 Validation of \(\lambda _{ae}\) values (hyper-parameter choices) in the ‘5layer’ network

3.4.3 Robustness on Discriminator Variants

Since the advent of cGAN, several discriminator architectures have been used. In the original paper, the discriminator accepts as input only the output of the generator or a sample from the target distribution. By contrast, Isola et al. (2017) propose to concatenate the source and target images instead. Miyato and Koyama (2018) argue that instead of concatenation, the inner product of the source and the target image should be computed.

Table 4 Quantitative results on the discriminator variants (see Sect. 3.4.3)

We assess the robustness of RoCGAN under these different discriminators. As a reminder, we consider the discriminator of Mirza and Osindero (2014) as the default; to implement the variants of Isola et al. (2017) and Miyato and Koyama (2018), we do not change the number or depth of the layers, but only perform the respective concatenation or projection.

The evaluation in Table 4 demonstrates that all three discriminators perform similarly. There is a marginal performance drop in the case of the projective discriminator, but this could be mitigated, for example, with a stronger generator. This experiment demonstrates that the proposed RoCGAN is not tied to a single discriminator, but can work with a number of discriminator architectures.

3.4.4 Other Training Options

We evaluate two more options for training our model: (a) whether the improvement can be obtained without batch normalization, and (b) a different latent loss function (\(\ell _2\)). In Table 5 we report the two options along with the default options from above. The results indicate that (i) batch normalization does not seem to contribute to RoCGAN’s performance in this network, and (ii) our choice of \(\ell _1\) can be replaced by another function with similar results. In the rest of the experiments, we use batch normalization and the \(\ell _1\) loss for \({\mathcal {L}}_{lat}\).

3.5 Testing on Static Images

Our first evaluation against the baseline cGAN is on testing without any additional noise (other than the implicit biases of the datasets). The task for both the faces and the scenes is super-resolution in the respective domain. The training images are from MS-Celeb and ImageNet respectively, while the testing images are from Celeb-A and the ImageNet test set. The numerical results in Table 6 indicate that in both cases and with both metrics, RoCGAN outperform cGAN. We also experiment with the ‘5layer-skip’ networks to assess the performance in the skip case. The results in Table 7 illustrate similar behavior, i.e. our model outperforms the baseline.

Table 5 Quantitative results evaluating training options (Sect. 3.4.4)
Table 6 Quantitative comparison of cGAN/RoCGAN (Sect. 3.5)
Table 7 Quantitative comparison of cGAN/RoCGAN for the case of skip connections (Sect. 3.5)
Table 8 Quantitative evaluation of the ‘5layer’ network under Bernoulli noise (face super-resolution; Sect. 3.6)
Fig. 4

Visual results depicting Bernoulli noise. Similarly to Fig. 10, different samples are visualized per row. The corrupted images are visualized in the original size to make the additional noise more visible. The compared methods have to perform denoising in addition to the translation they are trained on

Table 9 Quantitative evaluation of the ‘5layer’ network under adversarial noise (face super-resolution; Sect. 3.6)

3.6 Testing Under Additional Noise

We conduct a dedicated experiment to evaluate the resilience of the models to noise. The idea is to artificially corrupt the source signal \(\textit{\textbf{s}}\) with the noise models of Sect. 3.1, i.e. feed as input \(\textit{\textbf{s}} + \textit{\textbf{f}}(\textit{\textbf{s}}, \textit{\textbf{G}})\) for some corruption function \(\textit{\textbf{f}}\).

We use the ‘5layer’ networks in the face super-resolution task and corrupt them with (a) adversarial and (b) Bernoulli noise.

Bernoulli noise As a reminder, the noise in this experiment is used exclusively during testing. All three cases of (1, 0, 0), (0, 1, 0), (0, 0, 1) are assessed, along with mixed cases. The quantitative results for Bernoulli noise are reported in Table 8. Our model is consistently better, with a relative performance gain of up to 9.9%. Indicative visual results are depicted in Fig. 4.

Adversarial noise The performance under the three different adversarial attacks is assessed. For IFGSM, we initially start with a small value of \({\epsilon }\), i.e. \({\epsilon }=0.01\), and progressively increase either the number of steps or the hyper-parameter’s value. As expected, the results in Table 9 highlight that increasing values of either the steps or \({\epsilon }\) deteriorate the performance of the networks. However, the performance of cGAN declines at a faster pace than that of our proposed RoCGAN. The relative performance difference (in SSIM) is \(4.9\%\) in the original testing, while it progressively grows up to \(24.3\%\) in the (1, 0.1) noise. The effect of the steps in IFGSM is further explored in Fig. 5: we fix \({\epsilon }=0.01\) and study the evolution in performance as we vary the number of steps. Note that the curve of cGAN is much steeper than that of RoCGAN as the number of steps increases. Beyond 10 steps, the performance of cGAN drops below 0.5 and its output can essentially be considered noise.

We perform the same experiment with the PGD attack; the effect of the increasing steps is visualized in Fig. 6. We note that after 10 steps there is a substantial difference between the two models. This difference is maintained and even increases if we increase the steps to 30. We also compare the two models under the latent attack in Fig. 7. For 1 or 2 iterations of the latent attack, the curves are similar to the previous two; however, for more steps the curves become steeper than in the previous attacks, while the performance gap grows faster. The efficiency of the three attacks differs in the number of steps required, with the latent attack being the most successful. Remarkably though, all three attacks have similar effects on the two models, i.e. the performance gap increases as the number of steps increases. By implementing three adversarial attacks, we empirically illustrate that the proposed model is more robust to noise than the baseline.
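As a reference, a minimal sketch of the IFGSM loop follows; reading \({\epsilon }\) as the per-step size and clipping to a valid pixel range are our assumptions:

```python
import torch

def ifgsm(generator, s, target, eps=0.01, steps=10):
    """Iterative FGSM on a conditional generator: each step ascends the sign
    of the input gradient of the regression loss (inputs assumed in [0, 1])."""
    s_adv = s.clone().detach()
    for _ in range(steps):
        s_adv.requires_grad_(True)
        loss = torch.nn.functional.l1_loss(generator(s_adv), target)
        grad, = torch.autograd.grad(loss, s_adv)
        s_adv = (s_adv.detach() + eps * grad.sign()).clamp(0.0, 1.0)
    return s_adv
```

PGD differs mainly in its random initialization and in projecting the accumulated perturbation back onto the \({\epsilon }\)-ball, while the latent attack targets the latent representation rather than the output; both can be obtained with small modifications of this loop.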

Fig. 5

Performance of cGAN/RoCGAN with respect to the number of steps in the IFGSM noise (mean SSIM on the left, FID score on the right). We emphasize that a higher SSIM (or a lower FID) indicates better performance. The number of steps varies from 1 to 10, while the highlighted region denotes the variance (left). The cGAN model exhibits a steeper curve than RoCGAN

Fig. 6

Performance of cGAN/RoCGAN with respect to the number of steps in the PGD attack (mean SSIM on the left, FID score on the right). The performance drop of the cGAN model is steeper than that of RoCGAN

Fig. 7

Performance of cGAN/RoCGAN with respect to the number of steps in the latent attack (mean SSIM on the left, FID score on the right). The performance drop of the cGAN model is steeper than that of RoCGAN; however, notice that this attack is more successful

Fig. 8

Histogram plots for the SSIM under adversarial noise (Sect. 3.6). The two distributions differ already in the original testing; the difference increases dramatically for more intense noise

Fig. 9

Histogram plots for the SSIM under Bernoulli noise (Sect. 3.6)

To further analyze the differences between the two models, we create histogram plots of the SSIM values. The interval [0.5, 0.95], in which the SSIM values lie, is divided into 20 bins, while the vertical axis depicts the frequency of each bin. A histogram with values concentrated to the right (towards 1) signifies superior performance. The histograms comparing the ‘5layer’ cGAN/RoCGAN under IFGSM (adversarial noise) are plotted in Fig. 8 (the respective histograms for Bernoulli noise are in Fig. 9). We note an increasing difference between the original histogram (no noise) and those for increasing steps of IFGSM, e.g. Fig. 8a versus Fig. 8d. The same difference is observed as \({\epsilon }\) increases; in the extreme case of \({\epsilon }=0.1\) there is only minor overlap between the two methods. In Fig. 10, qualitative results demonstrating the adversarial noise are depicted.
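A sketch of how such a plot can be produced (the overlaid style and the per-image SSIM arrays are our assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def ssim_histograms(ssim_cgan, ssim_rocgan, lo=0.5, hi=0.95, n_bins=20):
    """Overlaid frequency histograms of per-image SSIM: 20 bins over [0.5, 0.95]."""
    bins = np.linspace(lo, hi, n_bins + 1)
    plt.hist(ssim_cgan, bins=bins, alpha=0.5, label="cGAN")
    plt.hist(ssim_rocgan, bins=bins, alpha=0.5, label="RoCGAN")
    plt.xlabel("SSIM")
    plt.ylabel("frequency")
    plt.legend()
    plt.show()
```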

3.7 Testing on a Video Sequence

Aside from the experiment with the static testset, we use the 300VW (Shen et al. 2015) video dataset to assess RoCGAN. The videos include non-linear corruptions, e.g. compression, blurriness, rapid motion; such corruptions make a video dataset an ideal testbed for our evaluation.

In Table 10 we report the results of the experiment (Footnote 14). The performance of cGAN is slightly worse than in the related experiment on Celeb-A, while RoCGAN’s performance remains similar to the static case. The difference in performance increases in the additional noise cases. The FID values differ from the respective static experiment, since the mean and covariance for the empirical target distribution are extracted from Celeb-A in both cases. We provide a video of the results at https://youtu.be/RvoW4AYnzQU.
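For reference, the FID above is the Fréchet distance between two Gaussians fitted to Inception features; a minimal sketch follows, where (mu_t, sigma_t) would be the Celeb-A target statistics mentioned above:

```python
import numpy as np
from scipy import linalg

def fid(mu_g, sigma_g, mu_t, sigma_t):
    """Frechet distance between N(mu_g, sigma_g) and N(mu_t, sigma_t)."""
    covmean = linalg.sqrtm(sigma_g @ sigma_t)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_g - mu_t) ** 2)
                 + np.trace(sigma_g + sigma_t - 2.0 * covmean))
```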

Fig. 10

Visual results for testing with adversarial noise (IFGSM). The columns correspond to a the target images, b the original corrupted (i.e. downsampled) images, c, d the cGAN/RoCGAN outputs for the no-noise case (i.e. for the images of b), e–h pairs of cGAN/RoCGAN outputs with adversarial noise (see Sect. 3.1 for the encoding). It is noticeable that as the noise increases, the cGAN outputs deteriorate fast in contrast to their RoCGAN counterparts. Notice the ample differences for intense noise; for instance, in columns (e) versus (f), where the cGAN outputs include unnatural lines in all cases

Table 10 Quantitative results for the video sequence testing (Sect. 3.7)

3.8 Cross-Noise Experiments

A reasonable question is whether data augmentation can be used to make the model robust. We examine this assumption below: we augment the training samples with noise and assess the testing performance. Specifically, we scrutinize the performance of cGAN/RoCGAN with cross-noise experiments, i.e. we train with one type of noise and test with a different type of noise. For a fair comparison with the aforementioned experiments, we keep the same architectures as above, i.e. the ‘5layer’ network, while the task is face super-resolution.

In the first experiment we train with Bernoulli noise, while during testing adversarial perturbations (IFGSM) are used. The Bernoulli noise (during training) is (5, 0, 0); the variants (10, 0, 0) and \((\theta , 0, 0)\) with \(\theta \) uniformly sampled in each iteration from [0, 10] were tried but resulted in similar outcomes. The effect of IFGSM for different numbers of steps is plotted in Fig. 11; both models exhibit a small improvement with respect to their counterparts trained without noise in Sect. 3.6. Nevertheless, RoCGAN substantially outperform the cGAN baseline as the number of IFGSM steps increases.

An additional experiment is conducted with a completely new type of noise, Gaussian noise, i.e. a type of noise that has not been used previously in any of our models. Each training sample is perturbed with additive Gaussian noise. In every iteration a dense noise mask is sampled online from \({\mathcal {N}}(0, 10)\) (for pixels in the [0, 255] range). The perturbed input for each method is \(\textit{\textbf{s}} + {\mathcal {N}} (0, 10)\); see Fig. 12 for a visual illustration. The results under adversarial noise (IFGSM) at test time are visualized in Fig. 13, while the comparison under both Bernoulli and adversarial noise is reported in Table 11. The patterns of the previous sections (e.g. Sect. 3.6) emerge under Bernoulli noise, i.e. the more intense the noise, the larger the performance gap. For instance, the original difference of 0.041 is converted into a difference of 0.069 with \(1\%\) white pixels; this intensifies to 0.073 under the (1, 1, 1) case. The performance of both methods improves when trained with Gaussian noise, under both Bernoulli and adversarial noise during testing. However, the performance gap between the baseline and our model remains similar when we increase the number of steps (IFGSM); see Fig. 13.
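A minimal sketch of this online augmentation follows; reading the second argument of \({\mathcal {N}}(0, 10)\) as the standard deviation and rescaling it for inputs in [0, 1] are our assumptions:

```python
import torch

def gaussian_augment(s, std=10.0 / 255.0):
    """Online Gaussian augmentation: a dense noise mask is sampled in every
    iteration and added to the (already downsampled) source image.
    std corresponds to 10 intensity levels in the [0, 255] range (assumption)."""
    return s + std * torch.randn_like(s)
```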

Fig. 11

Performance of cGAN/RoCGAN (mean SSIM) trained with Bernoulli noise. The x-axis depicts an increasing number of iterations of IFGSM from 1 to 10. The highlighted region in each curve denotes the variance

Fig. 12

Visual example of training with Gaussian noise (see Sect. 3.8). The ground-truth image is downsampled for the ‘Corr’ version; Gaussian noise (‘GNoise’) is sampled and added to the corrupted image; the ‘Corr+GNoise’ image constitutes the training input for each method

Fig. 13

Performance of cGAN/RoCGAN (mean SSIM) when trained with Gaussian noise. Both models are more robust when trained with Gaussian noise; 15 adversarial steps instead of 10 are required to achieve the same degradation. Nevertheless, the same pattern of an increasing performance gap emerges in the Gaussian noise case

4 Conclusion

In this work we study the robustness of conditional GANs in the face of noise. Despite their notorious sensitivity to noise, the topic has so far remained relatively under-studied. We introduce the robust conditional GAN (RoCGAN) model, a new conditional GAN capable of leveraging unsupervised data to learn better latent representations. RoCGAN modify the generator into a two-pathway module. The first pathway (reg pathway) performs the regression from the source to the target domain. The new, added pathway (AE pathway) is an autoencoder in the target domain. By sharing the weights of the two decoders, we implicitly constrain the reg pathway to output signals that span the target manifold. We prove that our model shares similar convergence properties with generative adversarial networks. We demonstrate through large-scale experiments on images, for both natural scenes and faces, that RoCGAN outperform existing, state-of-the-art conditional GAN models, especially in the face of intense noise. Our model can be used with any form of data and has successfully been applied to sparse inpainting/denoising in Chrysos et al. (2019b) as well as super-resolution. We hope that our work can pave the way towards more robust conditional GANs. Going forward, we aim to study how to merge different types of noise and how to achieve foolproof robustness in a dense regression setting. Additionally, we aim to study how to combine polynomial networks (Chrysos et al. 2020) with RoCGAN.

Table 11 Quantitative evaluation (mean SSIM) of the ‘5layer’ network when trained with Gaussian noise (Sect. 3.8)