Lipschitz constrained GANs via boundedness and continuity

One of the challenges in the study of generative adversarial networks (GANs) is the difficulty of controlling their performance. The Lipschitz constraint is essential for guaranteeing training stability in GANs. Although heuristic methods such as weight clipping, gradient penalty and spectral normalization have been proposed to enforce the Lipschitz constraint, it remains difficult to achieve a solution that is both practically effective and theoretically proven to satisfy the constraint. In this paper, we introduce the boundedness and continuity (BC) conditions to enforce the Lipschitz constraint on the discriminator functions of GANs. We prove theoretically that GANs with discriminators meeting the BC conditions satisfy the Lipschitz constraint. We present a practically very effective implementation of a GAN based on a convolutional neural network (CNN) by forcing the CNN to satisfy the BC conditions (BC-GAN). We show that, compared to recent techniques including gradient penalty and spectral normalization, BC-GANs not only perform better but also have lower computational complexity.

bounds [19], enable Wasserstein distance estimation [6], and also alleviate the training difficulty in GANs. Thus, a number of works have advocated the Lipschitz constraint. To be specific, weight clipping was first introduced to enforce the Lipschitz constraint [2]. However, it has been found that weight clipping may lead to the capacity underuse problem, where training favors a discriminator that uses only a few features [6]. To overcome the weakness of weight clipping, regularization terms like gradient penalty are added to the loss function to enforce the Lipschitz constraint on D [6,12,15]. More recently, Miyato et al. [13] introduced spectral normalization to control the Lipschitz constant of D by normalizing the weight matrix of each layer, which is regarded as an improvement on orthonormal regularization [18]. Using gradient penalty or spectral normalization can stabilize the training and gain improved performance. However, it has been found that gradient penalty suffers from the problem of not being able to regularize the function at points outside the support of the current generative distribution [13]. In addition, spectral normalization has been found to suffer from the problem of gradient norm attenuation [1,10], i.e., a layer with a Lipschitz bound of 1 can reduce the norm of the gradient during backpropagation, and each step of backprop gradually attenuates the gradient norm, resulting in a much smaller Jacobian for the network's function than is theoretically allowed. Also, as we will show in Sects. 3 and 4.3, these new methods have the capacity underuse problem (see Proposition 1 and Fig. 1). Therefore, despite recent progress, it remains challenging to achieve practical success as well as provably satisfying a Lipschitz constraint.
In this paper, we introduce the boundedness and continuity (BC) conditions to enforce the Lipschitz constraint and introduce a CNN-based implementation of GANs with discriminators satisfying the BC conditions. We make the following contributions: (a) We prove that SN-GANs, one of the latest GAN training algorithms that uses spectral normalization, prevent the discriminator functions from obtaining the optimal solution when applying Wasserstein distance as the loss metric, even though the Lipschitz constraint is satisfied. (b) We present the BC conditions to enforce the Lipschitz constraint on GAN discriminator functions and introduce a CNN-based implementation of GANs enforcing the BC conditions (BC-GANs). We show that the performance of BC-GANs is competitive with state-of-the-art algorithms such as SN-GAN and WGAN-GP while having lower computational complexity.
2 Related work

Generative adversarial networks (GANs)
Generative adversarial networks (GANs) are a special generative model that learns a generator G to capture the data distribution via an adversarial process. Specifically, a discriminator D is introduced to distinguish the generated images from the real ones, while the generator G is updated to confuse the discriminator. The adversarial process is formulated as a minimax game:

$$\min_G \max_D V(G, D)$$

where the min over G and the max over D are taken over the sets of generator and discriminator functions, respectively. V(G, D) evaluates the difference between the two distributions $q_x$ and $q_g$, where $q_x$ is the data distribution and $q_g$ is the generated distribution. The conventional form of V(G, D) is given by the Kullback-Leibler (KL) divergence:

$$V(G, D) = \mathbb{E}_{x \sim q_x}[\log D(x)] + \mathbb{E}_{x \sim q_g}[\log(1 - D(x))]$$

Methods to enforce Lipschitz constraint
Applying the KL divergence as the implementation of V(G, D) can lead to training difficulties, e.g., the vanishing gradient problem. Thus, numerous methods have been introduced to address this problem by enforcing the Lipschitz constraint, including weight clipping [2], gradient penalty [4] and spectral normalization [13].
Weight clipping was introduced by Wasserstein GAN (WGAN) [2], which used Wasserstein distance to measure the differences between real and fake distributions instead of KL divergence.
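As a minimal sketch (in NumPy, not the original WGAN code), weight clipping simply truncates every parameter element to a fixed interval $[-c, c]$ after each discriminator update; the threshold c = 0.01 and the toy layer shapes below are illustrative assumptions:

```python
import numpy as np

def clip_weights(weights, c=0.01):
    """Truncate every element of every weight matrix to [-c, c],
    as WGAN-style weight clipping does after each discriminator update."""
    return [np.clip(W, -c, c) for W in weights]

# Hypothetical two-layer discriminator parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(4, 8)),
           rng.normal(scale=0.1, size=(8, 1))]
clipped = clip_weights(weights, c=0.01)
assert all(np.abs(W).max() <= 0.01 for W in clipped)
```

Because the constraint is applied element-wise, the resulting Lipschitz bound depends on both c and the layer sizes, which is one reason the method is hard to tune.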

$$W(P_r, P_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)]$$

where $W(P_r, P_g)$ represents the Wasserstein distance, and $P_r$ and $P_g$ are the real and fake distributions, respectively. Weight clipping enforces the Lipschitz constraint by truncating each element of the weight matrices. The Wasserstein distance shows superiority over the KL divergence because it effectively avoids the vanishing gradient problem brought by the KL divergence. In contrast to weight clipping, gradient penalty [6] penalizes the gradient at sample points to enforce the Lipschitz constraint:

$$L_D = \mathbb{E}_{x \sim P_g}[f(x)] - \mathbb{E}_{x \sim P_r}[f(x)] + \alpha\, \mathbb{E}_{\hat{x}}\big[(\|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1)^2\big]$$

where $L_D$ is the loss objective for the discriminator and $\alpha$ is a hyperparameter. Spectral normalization is a weight normalization method, which controls the Lipschitz constant of the discriminator function by literally constraining the spectral norm of each layer. The implementation of spectral normalization can be expressed as:

$$\bar{W}_{SN}(W) = W / \sigma(W)$$

where $W$ represents the weight matrix of a network layer, $\sigma(W)$ is the spectral norm of $W$, which equals the largest singular value of $W$, and $\bar{W}_{SN}(W)$ represents the normalized weight matrix. To a certain extent, spectral normalization has succeeded in facilitating stable training and improving performance.
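A hedged NumPy sketch of this normalization, estimating $\sigma(W)$ by power iteration in the spirit of SN-GAN (the iteration count and the toy matrix with known largest singular value 3 are illustrative assumptions, not the paper's code):

```python
import numpy as np

def spectral_normalize(W, n_iter=100):
    """Estimate the spectral norm sigma(W) (largest singular value of W)
    by power iteration, then return W / sigma(W)."""
    u = np.ones(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # Rayleigh-quotient estimate of the top singular value
    return W / sigma

# Toy weight matrix whose largest singular value is 3.
W = np.array([[3.0, 0.0],
              [0.0, 1.0]])
W_sn = spectral_normalize(W)
assert np.isclose(np.linalg.norm(W_sn, 2), 1.0)  # spectral norm is now ~1
```

In practice SN-GAN amortizes this by running a single power-iteration step per training update and reusing the vector u across steps.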

Existing problems
Although heuristic methods have been proposed to enforce the Lipschitz constraint, it is still difficult to achieve a solution that is both practically effective and theoretically proven to satisfy the Lipschitz constraint. To be specific, weight clipping was proven to be unsatisfactory in [4], and it can lead to the capacity underuse problem, where training favors a discriminator that uses only a few features [6]. In addition, gradient penalty suffers from the obvious problem of not being able to regularize the function at points outside the support of the current generative distribution.
In fact, the generative distribution and its support gradually change in the course of training, and this can destabilize the effect of the regularization itself [13]. Moreover, it has been found that spectral normalization suffers from the gradient norm attenuation problem [1,10]. Furthermore, we have found that applying spectral normalization prevents the discriminator functions from obtaining the optimal solutions when using Wasserstein distance as the loss metric. To explain this problem, we present Proposition 1.
Let $P_r$ and $P_g$ be the distributions of real images and generated images in X, a compact metric space. The discriminator function f is constructed based on a neural network of the following form with input x:

$$f(x, \theta) = W^{L+1} a_L\big(W^L a_{L-1}(\cdots a_1(W^1 x)\cdots)\big)$$

where $\theta := \{W^1, W^2, \ldots, W^{L+1}\}$ is the learning parameter set and $a_l$ is an element-wise nonlinear activation function. Spectral normalization is applied on f to guarantee the Lipschitz constraint.
Proposition 1 When using Wasserstein distance as the loss metric of f, the optimal solution to f is unreachable.

Enforcing boundedness and continuity in CNN-based GANs
Finding a proper way to enforce the Lipschitz constraint remains an open problem. Motivated by this, we search for a better way to enforce the Lipschitz constraint.

BC Conditions
The purpose is to find the discriminator from the set of k-Lipschitz continuous functions [7], which obey the following condition:

$$|f(x_1) - f(x_2)| \le k \|x_1 - x_2\|, \quad \forall x_1, x_2 \in X \qquad (6)$$

Equation (6) is referred to as Lipschitz continuity or the Lipschitz constraint. If the discriminator function f satisfies the following conditions, it is guaranteed to meet the condition of Eq. (6): (a) Boundedness: f is a bounded function.
(b) Continuity: f is a continuous function, and the number of points where f is continuous but not differentiable is finite. Besides, if f is differentiable at point x, its derivative is finite.
Conditions (a) and (b) are referred to as the boundedness and continuity (BC) conditions. A discriminator satisfying the BC conditions is referred to as a bounded discriminator, and a GAN model with the BC conditions enforced is referred to as BC-GAN. The following Theorems 1 and 2 guarantee that meeting the BC conditions is sufficient to enforce the Lipschitz constraint of Eq. (6) (see the proofs in ''Appendix'').

Theorem 1 Let $\mathcal{W}$ be the set of all $f: X \to \mathbb{R}$, where f is a continuous function. In addition, the number of points where f is continuous but not differentiable is finite. Besides, if f is differentiable at point x, its derivative is finite. Then, f in $\mathcal{W}$ satisfies the Lipschitz constraint.
Theorem 2 Let $P_r$ and $P_g$ be the distributions of real images and generated images in X, a compact metric space. Let $\mathcal{X}$ be the set of all $f: X \to \mathbb{R}$, where f is a continuous and bounded function. In addition, the number of points where f is continuous but not differentiable is finite. Besides, if f is differentiable at point x, its derivative is finite. The set $\mathcal{X}$ can be expressed as:

$$\mathcal{X} = \{f : \|f\| \le m,\ f \text{ continuous}\}$$

where m represents the bound. Then, there must exist a k such that $k \cdot W(P_r, P_g)$ is computable:

$$\sup_{f \in \mathcal{X}} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] = k \cdot W(P_r, P_g)$$

where $W(P_r, P_g)$ represents the Wasserstein distance between $P_r$ and $P_g$ [5,23].
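As a quick numerical illustration of Theorem 1 (a sketch, not part of the proof): tanh is continuous and its derivative $1 - \tanh^2(x)$ is bounded by 1, so every difference quotient stays below k = 1:

```python
import numpy as np

# tanh has a finite derivative everywhere (at most 1), so by Theorem 1
# it is k-Lipschitz with k = 1: every difference quotient
# |f(x1) - f(x2)| / |x1 - x2| is bounded by 1.
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(-5.0, 5.0, size=(2, 10_000))
mask = x1 != x2  # avoid division by zero on coincident samples
quotients = (np.abs(np.tanh(x1[mask]) - np.tanh(x2[mask]))
             / np.abs(x1[mask] - x2[mask]))
assert quotients.max() <= 1.0 + 1e-9
```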
According to Theorems 1 and 2, the BC conditions are sufficient to enforce the Lipschitz constraint. Furthermore, $k \cdot W(P_r, P_g)$ is bounded and computable and can be obtained as:

$$k \cdot W(P_r, P_g) = \sup_{f \in \mathcal{X}} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{z \sim p(z)}[f(G(z))]$$

Then, $k \cdot W(P_r, P_g)$ can be applied as a new loss metric to guide the training of D. Logically, the new objective for D is:

$$\max_f \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{z \sim p(z)}[f(G(z))] \qquad (10)$$

Theorem 3 in [2] tells us that

$$\nabla_\theta W(P_r, P_g) = -\mathbb{E}_{z \sim p(z)}[\nabla_\theta f(G_\theta(z))] \qquad (11)$$

where $\theta$ denotes the parameters of G. Equation (11) indicates that using gradient descent to update the parameters of G is a principled method to train the network of G. Finally, the new objective for G can be obtained:

$$\min_G -\mathbb{E}_{z \sim p(z)}[f(G(z))]$$

Implementation of BC conditions
In this paper, we introduce a simple but efficient implementation of the BC conditions. When applying the BC conditions to D, the training of D can be equivalently regarded as a constrained optimization process. Then, Eq. (10) can be updated as:

$$\max_f \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{z \sim p(z)}[f(G(z))], \quad \text{s.t. } \|f\| \le m$$

In this paper, the discriminator function f is implemented by a deep neural network, which applies a series of convolutional and nonlinear operations. Both convolutional and nonlinear functions are continuous, which means that D is a continuous function. Moreover, the gradients of the output of D with respect to the input are always finite. As a result, condition (b) is satisfied naturally. To guarantee condition (a), the Lagrange multiplier method can be applied; the objective of D can then be written as:

$$L_D = \mathbb{E}_{z \sim p(z)}[f(G(z))] - \mathbb{E}_{x \sim p(x)}[f(x)] + \beta\, \mathbb{E}\big[\max(\|f(x)\| - m,\ 0)\big] \qquad (14)$$

where $\beta$ is the hyperparameter and m represents the bound. The term $\max(\|f(x)\| - m, 0)$ plays the role of forcing D to be a bounded function, while $\mathbb{E}_{z \sim p(z)}[f(G(z))] - \mathbb{E}_{x \sim p(x)}[f(x)]$ is used to determine $k \cdot W(P_r, P_g)$. The procedure of training the BC-GAN is described in Algorithm 1.
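The discriminator objective can be sketched as follows (a NumPy sketch on raw discriminator outputs, not the paper's training code; the sample values are illustrative assumptions):

```python
import numpy as np

def bc_discriminator_loss(f_real, f_fake, m=0.5, beta=2.0):
    """Sketch of the BC-GAN discriminator objective: the Wasserstein-style
    term E[f(G(z))] - E[f(x)] plus a Lagrange-style penalty
    beta * E[max(|f| - m, 0)] that pushes the outputs into [-m, m]."""
    wasserstein_term = f_fake.mean() - f_real.mean()
    outputs = np.concatenate([f_real, f_fake])
    bound_penalty = np.maximum(np.abs(outputs) - m, 0.0).mean()
    return wasserstein_term + beta * bound_penalty

f_real = np.array([0.40, 0.50, 0.45])    # outputs inside the bound m = 0.5
f_fake = np.array([-0.50, -0.40, -0.45])

# Inside the bound the penalty is inactive, so beta has no effect.
assert (bc_discriminator_loss(f_real, f_fake, beta=2.0)
        == bc_discriminator_loss(f_real, f_fake, beta=0.0))
# Once outputs exceed the bound, the penalty raises the loss.
assert (bc_discriminator_loss(10 * f_real, 10 * f_fake, beta=2.0)
        > bc_discriminator_loss(10 * f_real, 10 * f_fake, beta=0.0))
```

Note the penalty is a soft constraint: unlike clipping, it leaves f unconstrained wherever $|f| \le m$ already holds.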

Validity
In order to verify the validity of the proposed BC conditions, we use synthetic datasets as presented in [15] to test the discriminator's performance. Specifically, discriminators are trained to distinguish the fake distribution from the real one. The toy setups take the fake distribution $P_g$ to be the real distribution $P_r$ plus unit-variance Gaussian noise. Theoretically, a discriminator with good performance is more likely to learn the high moments of the data distributions and model the real distribution. Figure 1 illustrates the value surfaces of the discriminator. It is clearly seen that the discriminator enforced by the BC conditions performs well at discriminating the real samples from the fake ones, demonstrating the validity of the proposed method.

Comparison with spectral normalization and gradient penalty
Gradient penalty, spectral normalization and our proposed method are inspired by different motivations to enforce the Lipschitz constraint on D. Therefore, they differ in implementation and in principle. The first difference is the manner of implementation: gradient penalty and our method operate on the loss function directly, while spectral normalization constrains the weight matrices instead of the loss metric. Secondly, they differ in principle. For BC-GAN, $k \cdot W(P_r, P_g)$ is applied to evaluate the difference between the fake and real distributions instead of $W(P_r, P_g)$, which is used in WGAN-GP and WGAN. Moreover, WGAN-GP and SN-GAN strictly constrain the Lipschitz constant to be 1 or a known constant, while BC-GAN eases this restriction: k is an unknown scalar parameter which has no influence on the training of the network. Therefore, $k \cdot W(P_r, P_g)$ can be employed as a new loss metric to guide the training of D.
To visualize the differences, we again use the synthetic datasets to test the discriminators' performance. Figure 1 illustrates the value surfaces of the discriminators. It is obvious that discriminators trained with gradient penalty as well as spectral normalization have pathological value surfaces even when optimization has completed; they have failed to capture the high moments of the data distributions and instead model very simple approximations to the optimal functions. In contrast, BC-GANs have successfully learned the higher moments of the data distributions, and the discriminator can distinguish the real distribution from the fake one much better.

Convergence measure
One advantage of using the Wasserstein distance as the metric over the KL divergence is the meaningful loss. The Wasserstein distance $W(P_r, P_g)$ has the property of convergence [6]: if it stops decreasing, the training of the network can be terminated. This property is useful, as one does not have to stare at the generated samples to figure out the failure modes. To obtain a convergence measure in the proposed BC-GAN, a corresponding indicator $I_{GD}$ of the training stage is introduced in Eq. (15). To prove that the proposed indicator $I_{GD}$ is capable of measuring convergence, Theorem 3 is introduced.
Theorem 3 Let $P_r$ and $P_g$ be the distributions of real and generated images, x be an image in the support of $P_r$ or $P_g$, and f the discriminator function, bounded by the BC conditions. $I_{GD}$ in Eq. (15) is proportional to $W(P_r, P_g)$.

Experimental setup
In order to assess the performance of BC-GAN, image generation experiments are conducted on the CIFAR-10 [20], STL-10 [8] and CELEBA [25] datasets. Two widely used GAN architectures, the standard CNN and the ResNet-based CNN [6], are applied for the image generation task. For architecture details, please see ''Appendix''. Equations (12) and (14) are used as the loss metrics of D and G, respectively. $I_{GD}$ in Eq. (15) plays the role of measuring convergence. m and $\beta$ in Eq. (14) are set to 0.5 and 2, respectively. For optimization, Adam [9] is utilized in all the experiments with $\alpha = 0.0002$, $\beta_1 = 0$, $\beta_2 = 0.9$. D updates 5 times per G update. To keep the settings identical to previous GANs, we set the batch size to 64. The inception score [17] and Fréchet inception distance [8] are utilized for quantitative assessment of the generated examples.
Although the inception score and Fréchet inception distance are widely used as evaluation metrics for GANs, Barratt [3] suggests that evaluating and comparing generative models should be done more systematically and carefully, because the inception score may not correlate well with image quality. Recently, a new method to evaluate generative models, called skill rating, was proposed in [14]. Skill rating evaluates models by carrying out tournaments between the discriminators and generators. For better evaluation, results assessed by skill rating are also presented.

Results on image generation
Image generation tasks are carried out on the CIFAR-10 and STL-10 datasets. Based on the ResNet-based CNN architecture, we obtain average inception scores of 8.40 and 9.15 for image generation on CIFAR-10 and STL-10, respectively. We compare our algorithm against multiple benchmark methods. In Table 1, we show the inception score and Fréchet inception distance of different methods with their corresponding optimal settings on the CIFAR-10 and STL-10 datasets. As illustrated in Table 1, BC-GAN has performance comparable with the state-of-the-art GANs. We also conduct image generation on the CELEBA [25] dataset. Examples of generated images are shown in Figs. 2 and 3.
Skill rating [14] was recently introduced to judge a GAN model by matches between G and D. To determine the outcome of a match, D judges two batches: one batch of samples from G and one batch of real data. Every sample x that is not judged correctly by D (i.e., D(x) > 0.5 for generated data or D(x) < 0.5 for real data) counts as a win for G and is used to compute its win rate. The win rate tests the performance of D and G dynamically during training and judges whether D or G dominates while the other stops updating. If D dominates and G stops updating, the win rate for G decreases dramatically. We make some modifications because we use the Wasserstein distance to measure the difference between fake and real data instead of a probability. As a result, we show the loss of D instead of the win rate in Fig. 4. When D from a later iteration is used to distinguish images generated in an early iteration from real images, it outputs a large loss, meaning that D can easily distinguish the generated (fake) images from real images. Conversely, images generated in a later iteration can easily fool D from an early iteration. Therefore, the training is healthy, and the performance of both D and G is continuously improved during training.
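The win-rate bookkeeping described above can be sketched as follows (a hypothetical helper, not the code of [14]; the 0.5 threshold follows the description in the text):

```python
import numpy as np

def generator_win_rate(d_on_fake, d_on_real, threshold=0.5):
    """Skill-rating style match: every sample that D misjudges
    (D(x) > 0.5 on fake data, or D(x) < 0.5 on real data)
    counts as a win for G."""
    wins = (d_on_fake > threshold).sum() + (d_on_real < threshold).sum()
    return wins / (len(d_on_fake) + len(d_on_real))

d_on_fake = np.array([0.7, 0.2, 0.6, 0.4])  # two fakes fooled D
d_on_real = np.array([0.8, 0.3, 0.9, 0.6])  # one real sample misjudged
assert generator_win_rate(d_on_fake, d_on_real) == 3 / 8
```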
When applying the KL divergence as the loss metric of D, the training of GANs suffers from the vanishing gradient problem, i.e., zero gradients would backpropagate to G, and the training would completely stop. As a comparison, Fig. 4 shows a healthy training during the entire run, further indicating the effectiveness of BC-GANs.

Bound m
The parameter m in Eq. (14) represents the bound of D, and it actually controls the gradient $\partial L_D / \partial x$, where $L_D$ is the loss of D, x is the image and $\partial L_D / \partial x$ is the gradient backpropagated from D to G, which affects the training of G and further influences the model performance.
The explanation is as follows. The discriminator f is a bounded function. Given enough iterations, $\mathbb{E}_{x \sim P_r}[f(x)]$ would always converge to m and $\mathbb{E}_{x \sim P_g}[f(x)]$ would converge to $-m$.
Considering that f satisfies the k-Lipschitz constraint, the following condition holds:

$$\left\| \frac{\partial L_D}{\partial x} \right\| \le k$$

k determines the upper bound of the gradient backpropagated from D to G and depends directly on D.
Increasing m raises the upper bound of the gradients $\partial L_D / \partial x$. This is verified by the experiment shown in Fig. 5a. Moreover, the gradients are used to guide the training of the generator and naturally affect the performance of the model. Increasing m from 0.5 to 2 leads to decreased performance (the inception score drops from 8.40 to 7.56). Therefore, properly controlling the gradient is important for improving the performance of GAN models, and the bound m provides such a mechanism. m is recommended to be set to 0.5.

We also monitor the variation of the gradient in WGAN-GP and SN-GAN. The behavior of the gradient varies across models. The gradient penalty term in WGAN-GP forces the gradient of the output of D with respect to the input to be a fixed number; therefore, as shown in Fig. 5b, the gradient is around 1 throughout the training process. For SN-GAN and our BC-GAN in Fig. 5c, the variation of the gradient is similar: as training goes on, the gradient tends to increase until convergence is reached. The difference is that the amplitude of the gradient in SN-GAN is larger than in BC-GAN. As mentioned above, the amplitude of the gradient affects the training of the generator. However, SN-GAN provides no mechanism for controlling the gradient, while the bound m in BC-GAN plays exactly that role. Thus, at least in this respect, BC-GAN offers better performance control than SN-GAN.

Meaningful training stage indicator I GD
We introduce a new indicator $I_{GD}$ for monitoring the training stage. Figure 6a shows the correlation of $-I_{GD}$ with the inception score during the training process. Because $I_{GD}$ decreases with iteration, we use $-I_{GD}$ instead. As we can see, $-I_{GD}$ has a positive correlation with the inception score. As it is easier to visualize the correlation between $I_{GD}$ and image quality in higher-resolution images, we perform the image generation task on the CELEBA [25] dataset and show the variation of $I_{GD}$ with iterations in Fig. 6b. It is clearly seen that $I_{GD}$ correlates well with image quality during the training process.

Training time
It is worth noting that BC-GAN is computationally efficient. We list the computation time for 100 generator updates in Fig. 7. WGAN-GP requires more computation time because it needs to calculate the gradient of the gradient norm $\|\nabla_x D\|_2$, which requires one whole extra round of forward and backward propagation. Spectral normalization needs to calculate the largest singular value of the matrix in each layer. Worse still, for gradient penalty and spectral normalization, the extra computational cost increases with the number of layers. As for BC-GAN, there is no extra matrix operation or gradient calculation in backpropagation. As a result, it has a lower computational cost.

Concluding remarks
In this paper, we have introduced a new generative adversarial network training technique called BC-GAN, which utilizes a bounded discriminator to enforce the Lipschitz constraint. In addition to providing the theoretical background, we have also presented practical implementation procedures for training BC-GAN. Experiments on synthetic as well as real data show that the new BC-GAN performs better and has lower computational complexity than recent techniques such as spectral normalization GAN (SN-GAN) and Wasserstein GAN with gradient penalty (WGAN-GP). We have also introduced a new training convergence measure which correlates directly with the image quality of the generator output and can be conveniently used to monitor training progress and to decide when training is completed.

Appendix 1: Proofs

Proposition 1 When using Wasserstein distance as the loss metric of f, the optimal solution to f is unreachable.
Proof Corollary 1 in [6] has proven that the optimal discriminator $f^*$ has gradient norm 1 almost everywhere under $P_r$ and $P_g$ when using Wasserstein distance as the loss metric.
Suppose x can be expressed as $[x_1, x_2, \ldots, x_n]$, and $W^T W$ has eigenvalues $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_n]$ in decreasing order. The eigenvectors of $W^T W$ can be expressed as $V = [v_1, v_2, \ldots, v_n]$. Then, we have:

$$\|Wx\|^2 = x^T W^T W x$$

Supposing the transformation $Vx = [y_1, y_2, \ldots, y_n]$, and using the relationship $V^T V = I$, we have

$$\|Wx\|^2 = \sum_{i=1}^{n} \lambda_i y_i^2 \le \lambda_1 \sum_{i=1}^{n} y_i^2 = \lambda_1 \|x\|^2$$

When spectral normalization is applied, $\lambda_1$ is normalized to 1. As a result:

$$\|Wx\| \le \|x\|$$

We can see that applying spectral normalization guarantees that W satisfies the Lipschitz constraint. The discriminator function f is implemented by a convolutional neural network, which is a combination of convolutional and nonlinear operations (Eq. (18)). Therefore, the following inequality is applied to bound $\|f\|_{Lip}$ [13]:

$$\|f\|_{Lip} \le \prod_{l=1}^{L+1} \sigma(W^l) \qquad (23)$$

where $\sigma(W)$ is the spectral norm of W.
When applying Wasserstein distance as the loss metric, Corollary 1 in [6] has proven that the optimal solution for the Lipschitz-constrained discriminator has gradient norm 1 almost everywhere under $P_r$ and $P_g$, which means $\|f\|_{Lip}$ needs to reach the upper bound of 1. However, if $\|f\|_{Lip}$ in Eq. (23) is to attain the upper bound of 1, the discriminator function must become a linear function, whereas the discriminator is implemented as a combination of convolutional and nonlinear operations. Taking the ReLU function as a representative nonlinear operation $a_l$: $a_l(x) = x$ for $x > 0$ and $a_l(x) = 0$ for $x \le 0$. In other words, $\|a_l(x)\| = \|x\|$ for $x > 0$, and $\|a_l(x)\| = 0 < \|x\|$ for $x \le 0$. If the discriminator function is to attain the upper bound of the Lipschitz constraint, all the nonlinear operations need to reach their upper bound as well: $\|a_l(x)\| = \|x\|$. Then all the nonlinear functions become linear, and the discriminator function turns into a linear function. Obviously, a linear discriminator is not the optimal solution. Therefore, in the presence of nonlinear operations, applying spectral normalization prevents the discriminator function from reaching the optimal solution when Wasserstein distance is the loss metric.
Theorem 1 Let $\mathcal{W}$ be the set of all $f: X \to \mathbb{R}$, where f is a continuous function. In addition, the number of points where f is continuous but not differentiable is finite. Besides, if f is differentiable at point x, its derivative is finite. Then, f in $\mathcal{W}$ satisfies the Lipschitz constraint.
Proof (i) Consider the case where f is differentiable. According to Lagrange's mean value theorem, for any $x_1, x_2$ there exists $\xi$ between them such that

$$f(x_1) - f(x_2) = f'(\xi)(x_1 - x_2)$$

Because the derivative of f is finite,

$$|f'(\xi)| \le k$$

where k is finite. Moreover, we have:

$$|f(x_1) - f(x_2)| \le k |x_1 - x_2|$$

Then, f satisfies the Lipschitz constraint.
(ii) Consider the case where f is not everywhere differentiable. Since f is continuous, there must be at least one point $x_0$ at which f is continuous but not differentiable. We consider only one such point; for multiple points, the conclusion is the same. For any $x_1$ and $x_2$ with $x_1, x_2 < x_0$ or $x_1, x_2 > x_0$, f satisfies

$$|f(x_1) - f(x_2)| \le k |x_1 - x_2|$$

because f is continuous and differentiable on $[x_1, x_2]$. For $x_1$ and $x_2$ with $x_1 < x_0 < x_2$, we have

$$|f(x_1) - f(x_2)| \le |f(x_1) - f(x_0)| + |f(x_0) - f(x_2)|$$

Because f is continuous on $[x_1, x_0]$ and $[x_0, x_2]$, and differentiable on $(x_1, x_0)$ and $(x_0, x_2)$, we can obtain:

$$|f(x_1) - f(x_0)| \le k_1 |x_1 - x_0|, \quad |f(x_0) - f(x_2)| \le k_2 |x_0 - x_2|$$

Then, we have:

$$|f(x_1) - f(x_2)| \le k (|x_1 - x_0| + |x_0 - x_2|)$$

where $k = \max(k_1, k_2)$. Considering that $x_1 < x_0 < x_2$, we have:

$$|x_1 - x_0| + |x_0 - x_2| = |x_1 - x_2|$$

As we can see, even though f is not differentiable at $x_0$, for any $x_1$ and $x_2$, f still satisfies:

$$|f(x_1) - f(x_2)| \le k |x_1 - x_2|$$

To sum up, f always satisfies the Lipschitz constraint under the given conditions.
Theorem 2 Let $P_r$ and $P_g$ be the distributions of real images and generated images in X, a compact metric space. Let $\mathcal{X}$ be the set of all $f: X \to \mathbb{R}$, where f is a continuous and bounded function. In addition, the number of points where f is continuous but not differentiable is finite. Besides, if f is differentiable at point x, its derivative is finite. The set $\mathcal{X}$ can be expressed as:

$$\mathcal{X} = \{f : \|f\| \le m,\ f \text{ continuous}\}$$

where m represents the bound. Then, there must exist a k such that $k \cdot W(P_r, P_g)$ is computable:

$$\sup_{f \in \mathcal{X}} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] = k \cdot W(P_r, P_g)$$

where $W(P_r, P_g)$ represents the Wasserstein distance [5,23] between $P_r$ and $P_g$.
Proof According to Theorem 1, for f in $\mathcal{X}$, there exists a k satisfying Eq. (32). Then, $\mathcal{X}$ is the set containing all the k-Lipschitz constrained functions f. The Kantorovich-Rubinstein duality [5,23] tells us that the supremum over all the functions in $\mathcal{X}$ is $k \cdot W(P_r, P_g)$. As a result, we obtain Eq. (34). To guarantee the boundedness and computability of $k \cdot W(P_r, P_g)$, f is required to be a bounded function.
This is because, even though k in Theorem 1 is a finite number, it can be arbitrarily large ($k \to \infty$), leading to the incomputability of $k \cdot W(P_r, P_g)$. Enforcing f to be a bounded function ensures the boundedness and computability of $k \cdot W(P_r, P_g)$.

Theorem 3 Let $P_r$ and $P_g$ be the distributions of real and generated images, x be an image in the support of $P_r$ or $P_g$, and f the discriminator function, bounded by the BC conditions. $I_{GD}$ in Eq. (15) is proportional to $W(P_r, P_g)$.
Proof f is bounded by the BC conditions. Given enough iterations, $\mathbb{E}_{x \sim P_r}[f(x)]$ would always converge to m and $\mathbb{E}_{x \sim P_g}[f(x)]$ would converge to $-m$; as a result, the difference of the two expectations will always converge to 2m. It is clear that $W(P_r, P_g)$ is proportional to $\mathbb{E}[\|x_r - x_g\|]$, because both evaluate the difference between $P_r$ and $P_g$. Then, we can use the following term GD to estimate $W(P_r, P_g)$:

$$GD = \frac{\big\|\mathbb{E}_{x_r \sim P_r}[f(x_r)] - \mathbb{E}_{x_g \sim P_g}[f(x_g)]\big\|}{\mathbb{E}[\|x_r - x_g\|]}$$

where $x_r$, $x_g$ are a real image and a generated image, respectively. As expressed above, the numerator $\|\mathbb{E}_{x_r \sim P_r}[f(x_r)] - \mathbb{E}_{x_g \sim P_g}[f(x_g)]\|$ always converges to 2m, while the denominator is proportional to $W(P_r, P_g)$. Therefore, GD is inversely related to $W(P_r, P_g)$, and the reciprocal of GD can be used to roughly estimate $W(P_r, P_g)$. □

According to Lagrange's mean value theorem,

$$f(x_r) - f(x_g) = \nabla_x f(x)\,(x_r - x_g)$$

where $x \in [x_g, x_r]$. For convenience of calculation, x is taken as $x = \alpha \cdot x_r + (1 - \alpha) \cdot x_g$, with $\alpha \in [0, 1]$. Then, $\|\nabla_x f(x)\|_2$ is inversely related to $W(P_r, P_g)$. Finally, $I_{GD}$ is proportional to $W(P_r, P_g)$.

Appendix 2: Architecture
The discriminator in the toy model is listed in Table 2. Standard CNN architectures for CIFAR-10 and STL-10 are listed in Tables 3 and 4. ResNet-based CNN architectures for CIFAR-10 and STL-10 are listed in Tables 5 and 6. Architectures for image generation on the CELEBA dataset are listed in Tables 7 and 8.

Fig. 1
Fig. 1 Value surfaces of the discriminators trained to optimality on toy datasets. The yellow dots are data points; the lines are the value surfaces of the discriminators. Left column: spectral normalization. Middle column: gradient penalty. Right column: the proposed method. The upper, middle and lower rows are trained on 8-Gaussian,

Fig. 4 Matches between D and G. Wasserstein distance is utilized to indicate the results instead of the win rate. With a larger value of the Wasserstein distance, D is more likely to distinguish the real images from the fake ones. A lower value of the Wasserstein distance indicates that G is more likely to fool D

Table 1
IS and FID of unsupervised image generation on CIFAR-10 and STL-10. IS is the inception score, and FID represents the Fréchet inception distance. For IS, higher is better, while lower is better for FID

Table 2
Discriminator in the toy model. Input points: $x \in \mathbb{R}^2$

Table 8
Discriminator architecture for image generation on the CELEBA dataset. Input RGB image: $x \in \mathbb{R}^{128 \times 128 \times 3}$