Evaluation Metrics for Conditional Image Generation

We present two new metrics for evaluating generative models in the class-conditional image generation setting. These metrics are obtained by generalizing the two most popular unconditional metrics: the Inception Score (IS) and the Fréchet Inception Distance (FID). A theoretical analysis shows the motivation behind each proposed metric and links the novel metrics to their unconditional counterparts. The link takes the form of a product in the case of IS or an upper bound in the FID case. We provide an extensive empirical evaluation, comparing the metrics to their unconditional variants and to other metrics, and utilize them to analyze existing generative models, thus providing additional insights about their performance, from unlearned classes to mode collapse.


Introduction
Unconditional image generation models have seen rapid improvement both in terms of generation quality and diversity. These generative models are successful if the generated images are indistinguishable from real images sampled from the training distribution. This property can be evaluated in many different ways; the most popular are the Inception Score (IS), which considers the output of a pretrained classifier, and the Fréchet Inception Distance (FID), which measures the distance between the distributions of extracted features of the real and the generated data.
While unconditional generative models take as input a random vector, conditional generation allows one to control the class or other properties of the synthesized image. In this work, we consider class-conditioned models, introduced in [17], where the user specifies the desired class of the generated image. Employing unconditional metrics, such as IS and FID, in order to evaluate conditional image generation fails to take into account whether the generated images satisfy the required condition. On the other hand, classification metrics, such as accuracy and precision, currently used to evaluate conditional generation, have no regard for image quality and diversity.
One may opt to combine the unconditional generation and the classification metrics to produce a valuable measurement for conditional generation. However, this suffers from a few problems. First, the two components are of different scales and the trade-off between them is unclear. Second, they do not capture changes in variance within the distribution of each class. To illustrate this, consider Fig. 1, which depicts two different distributions. Distribution 'A' has a zero mean and a standard deviation 1.0 for the first class and 3.0 for the second class independently in each axis. Distribution 'B' has the same mean and a standard deviation 1.5 for the first class and 2.5 for the second class in each axis. The FID and the classification error (classifier shown in green) are zero, despite the in-class distributions being different.
In order to provide valuable metrics to evaluate and compare conditional models, we present two metrics, called Conditional Inception Score (CIS) and Conditional Fréchet Inception Distance (CFID). The metrics contain two components each: (i) the within-class component (WCIS/WCFID) measures the quality and diversity for each of the conditional classes in the generated data. In other words, it measures the ability to replicate the distribution of each class in the true samples; (ii) the between-class component (BCIS/BCFID) measures how close the representation of classes in the generated distribution is to the representation in the real data distribution. See Fig. 2 for an illustration.
In contrast to the combined FID and classifier, our WCFID and BCFID components of the FID are both larger than zero for the example in Fig. 1, successfully capturing the differences between the distributions.
Our analysis shows direct links between the novel conditional metrics and their unconditional counterparts. The (unconditional) Inception Score can be decomposed into a product of BCIS and WCIS. We further show that, due to the bounded range of the metrics, this translates to a trade-off between BCIS and WCIS, and that each one of them forms a tight lower bound on the IS. In the analysis of the FID score, we show that the sum of WCFID and BCFID forms a tight upper bound on the FID.
After analyzing the metrics, we performed various experiments to ground the theoretical claims and to highlight the role of the new metrics in evaluating conditional generation models. First, a set of simulations was conducted, in which we performed label noising, image noising, and simulated mode collapse. Under all conditions, our methods came out as the most sensitive to the applied augmentations. We then evaluated several pretrained models of popular architectures on various datasets and training schemes using the proposed scores and identified significant insights that were detected by our metrics. Our metrics were found to be a decisive factor to determine the generation performance in each dataset.

Related Work
Generative Models Generative models, and in particular Generative Adversarial Networks [6], aim to generate realistic looking images from a target distribution, while capturing the diversity of images. Advances in the loss and architecture allowed for improved quality and diversity of generation. For instance, [1,7] attempt to minimize the Wasserstein distance between the generated and real distributions. This allowed for improved variability in generation, in particular by reducing mode collapse. On the architectural side, Progressive GANs [12], StyleGAN [13] and StyleGANv2 [14] introduced advanced architectures and training methods, allowing for further improvements.
Fig. 1 A case against measuring success by relying on FID combined with a classification score. Shown are two distributions, each with two classes, in a two-dimensional feature space. The green circle acts as the classifier. For both distributions, the overall mean and variance are equal, therefore, the FID is zero. The classification error is also zero, and the two distributions are, therefore, indistinguishable by this score as well. However, within the classes we see a shift in variance in the second distribution compared to the first. Our proposed WCFID and BCFID metrics both show values above zero and, therefore, detect the difference between the distributions.

Conditional Generation
In conditional generation, control over the generation is provided, e.g., by class-conditioning [17,4,24], a given text [27,26], requiring specific semantic features [11], or finding analogs to images from another distribution [10,28]. The recent state of the art in class-conditional generation, which is the BigGAN method [3], can learn conditional representation on ImageNet [21] with high quality and diversity.
To train these models, several changes have been proposed to the unconditional method. CGAN [17] injects the conditional component to the discriminator along with the image, ACGAN [20] added an auxiliary classifier tasked to accurately predict the conditioned label, SGAN [19] modified the discriminator output to detect real classes while treating fake images as an additional class. A special unsupervised setting was proposed by InfoGAN [4], where the condition is unlabelled in the real data and the model constructs a disentangled representation by maximizing the mutual information between the conditioned variable and the observation.
Evaluation Metrics To evaluate different models in terms of high quality generation and diversity, evaluation metrics were proposed for the unconditional generation setting. The Inception Score (IS) [22] uses the predictions of a pretrained classifier, InceptionV3 [25], to assess: 1. Quality: whether the conditional probability of a generated sample G(z) over the labels y, is highly predictable (low entropy) and 2. Diversity: whether the marginal probability of labels over all generated samples is highly diverse (high entropy). The Fréchet Inception Distance (FID) [8], was proposed as an alternative to the IS, by considering the distribution of features of real data and generated data. FID models these distributions as multivariate Gaussian distributions and measures the distance between them. FID was shown to be sensitive to mode collapse and more robust to noise than IS. Additional metrics, such as Perceptual Path Length (PPL) [13] and Kernel Inception Distance (KID) [2] were also introduced. Still, the IS and FID are the most widely accepted metrics for image generation.
Nevertheless, these measures are designed for the unconditional setting, and for class-conditional, they do not assess the level at which the categorical condition manifests itself in the generated data. In this work, we extend the IS and FID to the class-conditional setting, showing their relation to their unconditional counterparts and demonstrate the usefulness of these metrics in the conditional setting.
Some recent attempts have been made to assess conditional generation with modified versions of the IS and FID. A Conditional Inception Score for the Image-to-Image translation task (loosely similar to our BCIS component) was proposed by Huang et al. [9], and Miyato et al. [18] were, to our knowledge, the first to measure the intra-FID (equal to our WCFID component). Both of these contributions lacked thorough experiments and comparison with other methods and presented no theoretical analysis. Nevertheless, these previous attempts to establish conditional metrics confirm the need for conditional evaluation scores and point to the possible ingredients of such solutions. Our work contributes to both approaches by providing justification and explanation, and also by improving upon them to build more unified metrics.

Problem Setup
We consider a distribution of real samples:
$$c \sim D_C, \quad x \sim D^c_R \tag{1}$$
where $D_C$ is a distribution over a set of $K > 1$ classes and $D^c_R$ is the conditional distribution of a sample $x \in \mathbb{R}^n$ taken from the class $c$. The algorithm is provided with a dataset of i.i.d. labeled samples that were sampled from the generative process in Eq. 1. In addition, the distribution $D_C$ of classes and its corresponding probability density function $p(c)$ are known or assumed to be uniform. The distribution of real samples $x$ marginalized over $c \sim D_C$ is denoted by $D_R$ and the corresponding probability density function by $p_R(x)$.
In conditional generation, the algorithm learns a generative model $G$ that tries to generate samples that are similar to the real samples in $D_R$. The generative model takes a random seed $z \sim D_Z$, $z \in \mathbb{R}^d$, and a class $c \sim D_C$ as inputs and returns a generated sample $G(z, c)$. Here, $D_Z$ is a pre-defined distribution over a latent space $\mathbb{R}^d$ of dimension $d < n$, where $n$ is the dimensionality of the samples. Typically $D_Z$ is the standard normal distribution. We denote by $D_G$ the distribution of generated samples.
In conditional generation, we are interested in two aspects of the generation. Images generated from the same conditioning class $c$ should be of the same class in $D_R$, and different latent vectors $z$ should cover the range of each class.
The category discovery setting is a special case of conditional generation, proposed in [4], where the algorithm is provided with a set of unlabelled samples. The algorithm is still aware of the existence of the partition of the data into classes that are distributed according to $D_C$. The goal of the algorithm is to generate samples that are similar to the real samples and also have them clustered in a proper manner into $K$ clusters.

Inception Score
The Inception Score (IS) is a method for measuring the realism of a generative model's outputs. For a given generative model $G$, a latent vector $z \sim D_Z$ and a random class $c \sim D_C$, we apply a pretrained classifier on the generated image $x = G(z, c)$ to obtain a distribution over the labels, which is denoted by $p_G(y|x)$. We denote the corresponding random variables by $Z$, $C$, $X = G(Z, C)$ and $Y \sim p_G(y|X)$, the distribution of $X$ by $D_G$ and the probability density function of $x$ by $p_G(x)$. Images that contain meaningful objects should have a conditional distribution $p_G(y|x)$ of low entropy. Furthermore, we expect the model to generate varied images, so the marginal $p_G(y) = \mathbb{E}_{x \sim D_G}[p_G(y|x)]$ should have a high entropy. The Inception Score is computed as:
$$\mathrm{IS}(G) := \exp\left(\mathbb{E}_{x \sim D_G}\left[D_{KL}\left(p_G(y|x) \,\|\, p_G(y)\right)\right]\right)$$
where $D_{KL}(p \,\|\, q)$ is the KL-divergence between two probability density functions. A high score indicates both a high variety in data and that the images are meaningful. The Inception Score can also be formulated using the mutual information between the generated samples and the class labels:
$$\mathrm{IS}(G) = \exp\left(I(X; Y)\right)$$
where $I(X; Y)$ is the mutual information between $X$ and $Y$. As can be seen, by maximizing the IS, one maximizes the mutual information between $X$ and $Y$. However, this equation indicates that the IS is not sufficient to evaluate generative models in the conditional generation setting, since the score does not take the conditioned class into account. Due to the properties of the mutual information, it can be seen that for a domain with $K$ classes, the score is within the range $[1, K]$.
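As a minimal sketch of the definition above, the score can be estimated from a table of classifier outputs, one row of class probabilities per generated image. The function name `inception_score` and the list-of-rows input format are assumptions made for illustration; the classifier itself is not part of the sketch.

```python
import math

def inception_score(probs):
    """Estimate exp(E_x[KL(p(y|x) || p(y))]) from rows of class probabilities.

    probs: list of rows, each row being p(y|x) for one generated image.
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal p(y): average of the per-image predictions.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    total_kl = 0.0
    for row in probs:
        # Terms with p = 0 contribute 0 to the KL-divergence.
        total_kl += sum(p * math.log(p / q) for p, q in zip(row, marginal) if p > 0)
    return math.exp(total_kl / n)
```

Confident predictions spread evenly over K classes give the maximal score K, while identical predictions for every image give the minimal score 1.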

Fréchet Inception Distance
The Fréchet distance $d^2(D_1, D_2)$ between two distributions $D_1$, $D_2$ is defined by:
$$d^2(D_1, D_2) := \min_{X, Y} \mathbb{E}\left[\|X - Y\|^2\right]$$
where the minimization is taken over all random variables $X$ and $Y$ having marginal distributions $D_1$ and $D_2$, respectively. In general, the Fréchet distance is intractable, due to its minimization over the set of arbitrary random variables. Fortunately, as shown by [5], for the special case of multivariate normal distributions $D_1$ and $D_2$, the distance takes the form:
$$d^2(D_1, D_2) = \|\mu_1 - \mu_2\|^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\right)$$
where $\mu_i$ and $\Sigma_i$ are the mean and covariance matrix of $D_i$. The first term measures the distance between the centers of the two distributions. The second term:
$$\mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\right)$$
defines a metric on the space of all covariance matrices of order $n$. For two given distributions $D_R$ of real samples and $D_G$ of the generated data, the FID score [8] computes the Fréchet distance between the real data distribution and generated data distribution using a given feature extractor $f$, under the assumption that the extracted features follow a multivariate normal distribution:
$$\mathrm{FID}(D_R, D_G) := \|\mu_R - \mu_G\|^2 + \mathrm{Tr}\left(\Sigma_R + \Sigma_G - 2(\Sigma_R \Sigma_G)^{1/2}\right)$$
where $\mu_R$, $\Sigma_R$ and $\mu_G$, $\Sigma_G$ are the centers and covariance matrices of the distributions $f \circ D_R$ and $f \circ D_G$, respectively. For evaluation, the mean vectors and covariance matrices are approximated through sampling from the distribution.
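The closed form for Gaussians can be sketched in a few lines of NumPy. The function names are illustrative assumptions. The trace of the matrix square root is computed here from the eigenvalues of the product of the two covariance matrices, which is one of several possible numerical routes and not necessarily the one used in the experiments.

```python
import numpy as np

def trace_sqrt_product(cov_a, cov_b):
    """Tr((cov_a cov_b)^(1/2)), using the eigenvalues of the product.

    For positive semi-definite inputs these eigenvalues are real and
    non-negative; tiny negative or imaginary parts are numerical noise.
    """
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    return float(np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None))))

def frechet_distance(mu_a, cov_a, mu_b, cov_b):
    """Squared Fréchet distance between two multivariate normal distributions."""
    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    cov_term = float(np.trace(cov_a) + np.trace(cov_b)) - 2.0 * trace_sqrt_product(cov_a, cov_b)
    return mean_term + cov_term

def fid_from_features(feats_real, feats_gen):
    """FID estimate from two (num_samples, feature_dim) arrays of extracted features."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False, bias=True))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False, bias=True))
    return frechet_distance(mu_r, cov_r, mu_g, cov_g)
```

For example, with $\mu_1 = (0, 0)$, $\Sigma_1 = I$, $\mu_2 = (3, 4)$ and $\Sigma_2 = 4I$, the mean term is 25 and the covariance term is $2 + 8 - 2 \cdot 4 = 2$, giving a distance of 27.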

Method
In this section, we introduce the class-conditioned extensions of the Inception Score and FID.

Conditional Inception Score
The conditional analysis of the Inception Score addresses both aspects of conditional generation: the need to create realistic and diverse images, and the need to have each generated image match its condition. We define two scores: the between-class (BCIS) and the within-class (WCIS).
BCIS evaluates the IS on the class averages. It is a measurement of the mutual information between the conditioned classes and the real classes. The prediction probabilities for all the samples in each conditioned class are averaged to produce the average prediction probability of the entire class, then the IS is computed on these averages.
The BCIS is defined in the following manner:
$$\mathrm{BCIS}(G) := \exp\left(\mathbb{E}_{c \sim D_C}\left[D_{KL}\left(p_G(y|c) \,\|\, p_G(y)\right)\right]\right)$$
where
$$p_G(y|c) := \mathbb{E}_{z \sim D_Z}\left[p_G\left(y \mid G(z, c)\right)\right]$$
is the average prediction over the samples of the conditioned class $c$.
WCIS evaluates the IS within each category. It measures, within each conditioned class, the mutual information between the generated samples and their predicted (real) classes. The final score is the geometric average of the scores over all the classes, which is equivalent to the exponent of the arithmetic average of the mutual information over all the classes. To define this measure, we define two random variables $X_c := (X|C = c)$ and $Y_c := (Y|C = c)$, which are the random variables $X$ and $Y$ conditioned on the class being $c$.
The WCIS is defined as:
$$\mathrm{WCIS}(G) := \exp\left(\mathbb{E}_{c \sim D_C}\left[I(X_c; Y_c)\right]\right)$$
where the mutual information is computed as follows:
$$I(X_c; Y_c) = \mathbb{E}_{x \sim D^c_G}\left[D_{KL}\left(p_G(y|x) \,\|\, p_G(y|c)\right)\right]$$
where $D^c_G$ is the distribution of $X_c$. In general, we wish the BCIS to be as high as possible and the WCIS to be as low as possible. High BCIS indicates a distinct class representation for each conditioned class and a wide coverage across the conditioned classes, which is a desired property. High WCIS indicates a wide coverage of real classes within the conditioned classes, which is an undesired property, since each conditioned class should represent only a single real class. In this way, one obtains consistent prediction within each class and has high variability between classes.
The following theorem presents the compositional relationship between IS and the proposed conditional measures.
Theorem 1 Let $C \sim D_C$ and $Z \sim D_Z$ be two independent random variables. Let $X = G(Z, C)$ for a continuous generator function $G$ and let $Y$ be a discrete random variable distributed by $p(y|X)$. Then,
$$\mathrm{IS}(G) = \mathrm{BCIS}(G) \cdot \mathrm{WCIS}(G)$$
The proof is provided in the appendix.
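The product form can be motivated by a short calculation. The following is only a sketch and not the appendix proof; it assumes that BCIS is the exponent of the expected KL-divergence between the class-averaged prediction $p_G(y|c)$ and the marginal $p_G(y)$, and that WCIS is the exponent of the expected within-class mutual information $I(X_c; Y_c)$.

```latex
\begin{align*}
\log \mathrm{IS}(G)
  &= \mathbb{E}_{c \sim D_C}\, \mathbb{E}_{x \sim D^c_G}
     \Big[ \sum_y p_G(y|x) \log \frac{p_G(y|x)}{p_G(y)} \Big] \\
  &= \mathbb{E}_{c \sim D_C}\, \mathbb{E}_{x \sim D^c_G}
     \Big[ \sum_y p_G(y|x) \log \frac{p_G(y|x)}{p_G(y|c)} \Big]
   + \mathbb{E}_{c \sim D_C}
     \Big[ \sum_y p_G(y|c) \log \frac{p_G(y|c)}{p_G(y)} \Big] \\
  &= \log \mathrm{WCIS}(G) + \log \mathrm{BCIS}(G)
\end{align*}
```

The second line splits the logarithm by inserting $p_G(y|c)$ and uses $\mathbb{E}_{x \sim D^c_G}[p_G(y|x)] = p_G(y|c)$ in the second summand. Exponentiating both sides gives the product.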
By definition, as with the IS, both BCIS and WCIS lie within $[1, K]$. Since the IS, which is their product, lies in the same interval, the theorem asserts that there is a tension between the BCIS and WCIS measures: they cannot both be large at the same time. In addition, since both components are at least 1, the theorem shows that each of them provides a lower bound on the IS, and the bound is tight when the other component is equal to 1. The final realization is that the IS can be very high even when the BCIS component is low, simply by having a high WCIS. This gap between IS and BCIS indicates bad conditional representation, which is overlooked by the unconditional evaluation.
On these grounds, we propose the BCIS and WCIS together as the conditional alternative to the IS. Each metric shows a different property of the generated data and, as shown in the theorem, the IS is readily obtained by multiplying the conditional components.
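The two components and their product relation can be checked numerically with a small sketch. It assumes the definitions used in this section (BCIS from the class-averaged predictions, WCIS from the within-class KL-divergences) and weights each class by its share of the samples, which is uniform when every class has the same number of samples. The function name and the dictionary input format are illustrative.

```python
import math

def _kl(p, q):
    # KL-divergence between two discrete distributions; zero-probability terms contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def _mean_rows(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def conditional_inception_scores(probs_by_class):
    """Return (IS, BCIS, WCIS) from classifier outputs grouped by conditioned class.

    probs_by_class: dict mapping a conditioned class c to a list of rows p(y|x),
    one row per image generated for that class.
    """
    total = sum(len(rows) for rows in probs_by_class.values())
    class_avgs = {c: _mean_rows(rows) for c, rows in probs_by_class.items()}
    all_rows = [row for rows in probs_by_class.values() for row in rows]
    marginal = _mean_rows(all_rows)
    log_is = sum(_kl(row, marginal) for row in all_rows) / total
    log_bcis = 0.0
    log_wcis = 0.0
    for c, rows in probs_by_class.items():
        weight = len(rows) / total  # empirical p(c)
        log_bcis += weight * _kl(class_avgs[c], marginal)
        log_wcis += weight * sum(_kl(row, class_avgs[c]) for row in rows) / len(rows)
    return math.exp(log_is), math.exp(log_bcis), math.exp(log_wcis)
```

For a generator whose two conditioned classes are each classified with full confidence as a distinct real class, the sketch returns BCIS = 2, WCIS = 1 and IS = 2, matching the product form.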

Conditional Fréchet Inception Distance
For conditional FID, we want to measure the distance between different distributions, according to the feature vector f (x), produced by the pre-trained feature extractor f on a sample x. Analogous to the conditional IS metrics, we measure the between-class distance between averages of conditioned class features and averages of real class features, as well as the average within-class distance for each matching pair of real and conditioned classes.
BCFID measures the FID between the distribution of the average feature vector of conditioned classes in the generated data and the distribution of the average feature vector of real classes in the real data. It evaluates the coverage of the conditioned classes over the real classes.
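For each distribution specifier $E \in \{R, G\}$, we estimate the per-class mean $\mu^E_c$, the mean of means $\mu^E_B$, and the covariance of the per-class means $\Sigma^E_B$:
$$\mu^E_c := \mathbb{E}_{x \sim D^c_E}\left[f(x)\right], \qquad \mu^E_B := \mathbb{E}_{c \sim D_C}\left[\mu^E_c\right], \qquad \Sigma^E_B := \mathbb{E}_{c \sim D_C}\left[(\mu^E_c - \mu^E_B)(\mu^E_c - \mu^E_B)^\top\right]$$
The BCFID is defined as:
$$\mathrm{BCFID}(D_R, D_G) := \|\mu^R_B - \mu^G_B\|^2 + \mathrm{Tr}\left(\Sigma^R_B + \Sigma^G_B - 2(\Sigma^R_B \Sigma^G_B)^{1/2}\right)$$
WCFID measures the FID between the distribution of the generated data and the real data within each one of the classes. It evaluates how similar each conditioned class is to its respective real class. The total score is the mean FID within the classes.
For each distribution specifier $E \in \{R, G\}$, the within-class covariance matrices are defined as:
$$\Sigma^E_{W,c} := \mathbb{E}_{x \sim D^c_E}\left[(f(x) - \mu^E_c)(f(x) - \mu^E_c)^\top\right] \tag{17}$$
The WCFID is defined as:
$$\mathrm{WCFID}(D_R, D_G) := \mathbb{E}_{c \sim D_C}\left[\|\mu^R_c - \mu^G_c\|^2 + \mathrm{Tr}\left(\Sigma^R_{W,c} + \Sigma^G_{W,c} - 2(\Sigma^R_{W,c} \Sigma^G_{W,c})^{1/2}\right)\right]$$
Note that we compare between matching pairs of conditioned and real classes. When a mapping between conditioned and real classes exists, e.g., in conditional GANs, this is straightforward. When there is no such mapping, i.e., in the class discovery case, such as when employing the InfoGAN method, a mapping needs to be created. For example, this can be done by using a classifier to get the prediction probabilities for the generated images, averaging the probabilities for each conditioned class, and applying the Hungarian algorithm to the average probabilities.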
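The matching step for the class discovery case can be illustrated with a small sketch. Note that it swaps the Hungarian algorithm for an exhaustive search over permutations, which finds the same optimal assignment but is only practical for a small number of classes; for larger problems a dedicated solver such as SciPy's `linear_sum_assignment` would be the natural replacement. The function name and input layout are assumptions.

```python
from itertools import permutations

def match_conditioned_to_real(avg_probs):
    """Assign each conditioned class to a distinct real class.

    avg_probs[i][j] is the classifier probability of real class j, averaged
    over the images generated for conditioned class i. The returned tuple m
    maximizes the total averaged probability, with m[i] being the real class
    assigned to conditioned class i.
    """
    k = len(avg_probs)
    return max(
        permutations(range(k)),
        key=lambda perm: sum(avg_probs[i][perm[i]] for i in range(k)),
    )
```

With the resulting mapping, the per-class statistics of conditioned class i are compared against those of real class m[i] when computing the within-class score.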
In general, the desire is to minimize both components, since each computes a different aspect of the distance between the real and the generated distributions.
The following theorem ties the FID and the conditional FID components.
Theorem 2 Let $D_R$ and $D_G$ be the distributions of real and generated samples. Then,
$$\mathrm{FID}(D_R, D_G) \leq \mathrm{BCFID}(D_R, D_G) + \mathrm{WCFID}(D_R, D_G)$$
and the bound is tight under certain conditions. The proof is provided in the appendix.
By this theorem, in conditional generation, FID gives an optimistic evaluation to the model that ignores bad cases. A good unconditional score can be obtained even though there is a considerable friction between the real and generated distributions in terms of conditional generation. This friction can occur either by bad representation of classes (high BCFID) or unmatching diversity within classes (high WCFID). For this reason, we propose the BCFID and WCFID as the conditional alternative to the FID. In addition to providing two meaningful scores that are similarly scaled, an upper bound to the FID can be computed by adding the two components.
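A minimal NumPy sketch of both components, together with the unconditional FID for comparison, is given below. It assumes uniform class weights, equal class sets in the real and generated data, population (biased) covariance estimates, and an eigenvalue-based evaluation of the trace term; all function names are illustrative.

```python
import numpy as np

def trace_sqrt_product(cov_a, cov_b):
    # Tr((cov_a cov_b)^(1/2)) from the eigenvalues of the product.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    return float(np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None))))

def frechet(mu_a, cov_a, mu_b, cov_b):
    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    cov_term = float(np.trace(cov_a) + np.trace(cov_b)) - 2.0 * trace_sqrt_product(cov_a, cov_b)
    return mean_term + cov_term

def moments(x):
    # Mean vector and population covariance matrix of the rows of x.
    return x.mean(axis=0), np.atleast_2d(np.cov(x, rowvar=False, bias=True))

def conditional_fid(feats_real, labels_real, feats_gen, labels_gen):
    """Return (FID, BCFID, WCFID) for features with matching class labels."""
    classes = np.unique(labels_real)
    per_class_r = [moments(feats_real[labels_real == c]) for c in classes]
    per_class_g = [moments(feats_gen[labels_gen == c]) for c in classes]
    # Between-class: Fréchet distance between the distributions of class means.
    means_r = np.stack([m for m, _ in per_class_r])
    means_g = np.stack([m for m, _ in per_class_g])
    bcfid = frechet(*moments(means_r), *moments(means_g))
    # Within-class: average Fréchet distance over matching class pairs.
    wcfid = float(np.mean([
        frechet(mu_r, cov_r, mu_g, cov_g)
        for (mu_r, cov_r), (mu_g, cov_g) in zip(per_class_r, per_class_g)
    ]))
    fid = frechet(*moments(feats_real), *moments(feats_gen))
    return fid, bcfid, wcfid
```

On balanced data, the returned FID should not exceed the sum of the two conditional components, up to numerical error.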

Experiments
Our experiments employ three datasets: MNIST [16], CIFAR10 [15], and ImageNet [21]. We first consider controlled simulations on MNIST to show the behavior of our metrics compared to existing unconditional metrics. Three cases are considered: noisy labels, noisy images, and mode collapse within classes. We then consider our metrics on a variety of well-established generative models and draw visual insights for the reported metric scores. Finally, a user study was held to compare the numeric results to human perception.
Evaluation procedure When evaluating the models, we use an equal number of randomly sampled real and generated samples for each class. For MNIST and CIFAR10, the test set was used as real samples, with 1000 samples from each class. For ImageNet, 50 validation samples for each class were used, for a total of 50,000 validation samples.
To obtain the scores of the 'Real Data' in Tab. 1, 2 (i.e., the score obtained not from generating but from the training data itself, which serves as an unofficial upper bound of the performance), an equal number of samples were taken from the train set. For instance, for MNIST, 10,000 samples were taken from the train data (1000 for each class). These same samples were also used for the three synthetic noise and mode collapse experiments, where they undergo various augmentations.
For each dataset, we applied a pretrained classifier, to give class probabilities for calculating the Inception Scores, and as a feature extractor, to calculate the FID scores. For ImageNet, we used the InceptionV3 [25] architecture, as used in the original formulation of the IS [22] and FID [8]. For CIFAR10, we used the VGG-16 [23] architecture, and for MNIST, a classifier with two convolutional blocks and two fully connected layers. The test accuracy is 99.06% for MNIST, 85.20% for CIFAR10 and 77.45% for ImageNet.
The activations of the last hidden layer (a.k.a the penultimate layer) were employed as the extracted features f (x). The feature dimension is 128 for MNIST, 512 for CIFAR10 and 2048 for ImageNet. For ImageNet, since the number of samples used compared to the feature dimension is small, the rank of the estimated covariance matrix in Eq. 17 is much smaller than its dimension, which causes an inaccurate estimation of the WCFID. Instead, for 50 independent trials, we randomly select 50 features from the 2048 feature vector, compute the FID score using these features, and finally average the FID scores of all trials to get a final FID score.
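The feature-subsampling procedure described above can be sketched as follows. The defaults of 50 trials and 50 features follow the text; sampling the feature indices without replacement, the fixed seed, and the helper names are assumptions of this sketch.

```python
import numpy as np

def _frechet_from_features(feats_a, feats_b):
    # Fréchet distance between Gaussians fitted to two feature arrays.
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False, bias=True))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False, bias=True))
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = float(np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None))))
    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    return mean_term + float(np.trace(cov_a) + np.trace(cov_b)) - 2.0 * trace_sqrt

def subsampled_fid(feats_real, feats_gen, n_trials=50, n_features=50, seed=0):
    """Average the FID over random subsets of the feature dimensions."""
    rng = np.random.default_rng(seed)
    dim = feats_real.shape[1]
    scores = []
    for _ in range(n_trials):
        # The same random subset of feature indices is used for both sets.
        idx = rng.choice(dim, size=n_features, replace=False)
        scores.append(_frechet_from_features(feats_real[:, idx], feats_gen[:, idx]))
    return float(np.mean(scores))
```

Restricting each trial to a number of features no larger than the number of samples per class is what makes a full-rank covariance estimate possible in the first place.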
Note that since the classifier and feature extractor differ between datasets, model scores should be compared per dataset, and not between datasets.

Synthetic Experiments
Label noising Label noising is the process of assigning random labels to some of the images, instead of their ground truth labels. This process simulates different levels of adherence to the conditional input. To maintain an equal number of images per conditioned class, instead of simply re-selecting a random class, we performed a random permutation of a subset of the images proportionally to a parameter p ∈ [0, 1]. When p = 0 no noising was applied and when p = 1 all image labels were randomly permuted. Fig. 3 shows how label noising simulates decline in conditional generation performance. In Fig. 3(a) each row of each subfigure represents a conditioned class, the red images highlight when the conditional generation fails. When setting p = 0, all images are correctly generated on their conditional input and as p increases, more images are incorrectly generated.
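The label noising procedure can be sketched as a permutation applied to a random subset of the labels, which keeps the number of images per conditioned class unchanged. The function name, the rounding of the subset size, and the seed handling are assumptions of this sketch.

```python
import random

def noise_labels(labels, p, seed=0):
    """Permute the labels of a random fraction p of the samples.

    Only the selected labels are shuffled among themselves, so the number
    of samples per class is preserved.
    """
    rng = random.Random(seed)
    n = len(labels)
    chosen = rng.sample(range(n), round(p * n))
    shuffled = [labels[i] for i in chosen]
    rng.shuffle(shuffled)
    noisy = list(labels)
    for i, label in zip(chosen, shuffled):
        noisy[i] = label
    return noisy
```

With p = 0 the labels are returned unchanged, and with p = 1 the whole label list is permuted.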
In Fig. 3(b) and (c), the IS and FID metrics and our proposed conditional variants are presented under the effect of label noising. The plots depict a number of interesting behaviors. First, the unconditional IS and FID remain constant across the experiment. That is because these metrics do not consider any conditional requirements from the generated images, and the unconditional performance has remained the same. Second, label noising has a dramatic effect on the conditional IS and FID metrics. The BCIS, which evaluates both the consistency of each condition in the target classes and the coverage of the target classes, falls immediately due to the declining consistency in the conditioned images.
The WCIS, on the other hand, which measures inconsistency, shows a rapid increase that compensates for the decline of the BCIS score. All conditional components of the FID increase, since the label noise inflicts a shift in the distribution within each class and on the class averages.
Image noising We applied four types of noise on the images and compared the effect on the scores. The noise was applied with increasing magnitude p ∈ [0, 1]. We applied Gaussian noise with mean 0 and variance p, salt & pepper noise with probability p per pixel, and random pixel permutation with probability p. Fig. 4 shows the IS and FID with the conditional scores. For IS, the BCIS declines more rapidly than the IS, making it more sensitive to image quality. This is matched with an increase of the WCIS, which defines the gap between BCIS and IS. The rising WCIS supports the IS, which gives a false sense of generation quality, best seen during pixel permutation. For FID, the conditional metrics have the same trend as the unconditional one, with the gap between the FID and the conditional FID sum increasing with the level of noise.
Mode collapse Mode collapse occurs when the model fails to generalize on the distribution of the target dataset and collapses to represent only a portion of the distribution. It is a common failure of generative models, which occurs when the model G generates similar images for many different initial priors z. In the conditional setting, the collapse can be more specific and occur only within a specific class. Fig. 5 shows how the unconditional and conditional FID metrics react to the collapse. Panel (a) shows a single-class collapse, in which the diversity in that class gradually declines at each step. Panel (b) shows all of the classes fully collapsing, one class at each step. Our metric is more sensitive to mode collapse, both when it occurs in a single class and when it occurs in multiple classes. No evaluation of the unconditional and conditional IS was performed in this setting, since neither can detect mode collapse.

Model Comparison
We next evaluate the performance of various pretrained conditional GAN models on different datasets. In Tab. 1, for CIFAR10 and MNIST, we consider CGAN [17], SGAN [19], InfoGAN [4] and ACGAN [20]. Note that for SGAN, the generator is not class conditioned, and so we modified the generator to accept both noise and class label as input, and the adversarial loss was applied on the conditioned class.
For conditional generation, there are four extreme cases: (i) good unconditional and good conditional generation, (ii) bad unconditional and bad conditional generation, (iii) good unconditional and bad conditional generation, (iv) bad unconditional and good conditional generation. We argue that the fourth scenario is impossible, since the conditional generation metrics always present a more critical evaluation (i.e., a lower bound in IS and an upper bound in FID) than the unconditional metric. Therefore, bad unconditional generation always leads to bad conditional generation as well. Cases (i) and (ii) are the more trivial cases, where the model is either good or bad on both tasks. Case (iii) describes a scenario where the unconditional generation is good but the conditional requirement is not met. We will now inspect each model and identify which scenario it falls under.
The analysis is done by looking at the results in Tab. 1. Additionally, Figs. 6 and 7 show examples of the generation of CGAN, InfoGAN and SGAN. ACGAN lies under case (i). In CIFAR10, it consistently had the best
Table 1 Unconditional and conditional metrics on CIFAR10 and MNIST for different conditional GANs. ↓ indicates that a lower value is better and ↑ otherwise.

InfoGAN also highlights the problem of using accuracy as a measure. The accuracy received for the generated images is higher than that of CGAN even though our conditional metrics say otherwise. By inspecting the generated images, we conclude that CGAN performed better and thus our metrics depict the conditional performance of the models better. For MNIST, InfoGAN and SGAN lie either under case (ii) or (iii), depending on how good one considers their unconditional scores. Nevertheless, the performance decrease of InfoGAN and SGAN relative to CGAN becomes much more noticeable when looking at the conditional metrics, where a clear drop in performance on all scores is noticeable.
To see how the metrics translate to human perception, we performed a user study on CGAN, InfoGAN and SGAN for both MNIST and CIFAR10. The user study was performed on 20 participants with knowledge in this field. The participants were not aware of the purpose of the study and did not know which model they were evaluating. The participants were asked to grade the 'quality', 'diversity' and 'class relation' of the generated images between 1 (low) and 10 (high), for each model separately. The results in Tab. 1 show that CGAN got higher scores on both datasets. This is aligned with the results of the conditional metrics in our experiments.

BigGAN In-Depth Analysis
BigGAN is a state of the art image generation model on the ImageNet dataset. In this section we evaluate BigGAN with our metrics and use them to perform an in-depth analysis of BigGAN's conditional generation capabilities.
BigGAN's performance on the various metrics can be seen in Tab. 2. Note that for FID, the score is different than in the original paper since we normalized the score by the size of the feature vector. The results show that BigGAN's performance is very close to the performance on real data, in both the unconditional and conditional metrics.
A closer inspection shows a variance in generation quality of the model for the different classes. Fig. 8(a) shows that not all classes have the same WCFID and that some classes are better represented than others. Knowing which classes are better represented can serve as a useful insight for fine-tuning a trained model to concentrate on the worst represented classes, or for comparing various trained generative models. Fig. 9 shows the 10 best and worst classes represented in terms of WCFID. The WCFID metric has a strong correlation with the quality of the class. The images from classes with the best scores are of high quality and resemble the real images. The images for classes with the worst scores do not resemble their target class. In addition, as evident from the experiment, several classes (for example, 'digital clock') have a high WCFID that is due to mode collapse.
To validate that the accuracy score cannot deliver these insights, we present in Fig. 8(b) the accuracy for each class, sorted according to their WCFID (same order as (a)). Similarly to WCFID, not all classes have the same score. However, we can observe that the accuracy scores for each class are only partly (inversely) correlated with the performance in WCFID.
In order to try and understand the difference between the scores in WCFID and accuracy, Fig. 10 shows the best and worst classes in terms of accuracy. Some classes were placed in the top 10 in both metrics (WCFID and accuracy), but others were not equally ranked. When looking at the worst ranked classes, we notice that the low rank in accuracy does not always correlate with a low quality or diversity. For example, 'notebook' and 'monitor' were both ranked at the bottom when considering the accuracy, but looked not as bad as the worst classes in WCFID. We observe that these classes were ranked low not because they were poorly generated, but because it is hard to tell them apart.

Conclusions
We presented two new evaluation procedures for class-conditional image generation based on well established metrics for unconditional generation. The proposed metrics are supported by theoretical analysis and a number of experiments. Our metrics are beneficial in comparing trained models and gaining significant insights when developing models.
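Theorem 2 Let $D_R$ and $D_G$ be the distributions of real and generated samples. Then,
$$\mathrm{FID}(D_R, D_G) \leq \mathrm{BCFID}(D_R, D_G) + \mathrm{WCFID}(D_R, D_G)$$
and the bound is tight.
Proof First, we recall the definitions of the FID and BCFID measures:
$$\mathrm{FID}(D_R, D_G) := \|\mu_R - \mu_G\|^2 + \mathrm{Tr}\left(\Sigma_R + \Sigma_G - 2(\Sigma_R \Sigma_G)^{1/2}\right)$$
and
$$\mathrm{BCFID}(D_R, D_G) := \|\mu^R_B - \mu^G_B\|^2 + \mathrm{Tr}\left(\Sigma^R_B + \Sigma^G_B - 2(\Sigma^R_B \Sigma^G_B)^{1/2}\right)$$
We notice that $\mu^R_B = \mu_R$ and $\mu^G_B = \mu_G$. Hence, the only difference between the two quantities arises from the second terms. Next, we would like to develop the formulation of $\Sigma^E_{W,c}$ for $E \in \{R, G\}$:
$$\Sigma^E_{W,c} = \mathbb{E}_{x \sim D^c_E}\left[f(x) f(x)^\top\right] - \mu^E_c (\mu^E_c)^\top$$
In particular,
$$\mathbb{E}_{c \sim D_C}\left[\Sigma^E_{W,c}\right] = \mathbb{E}_{x \sim D_E}\left[f(x) f(x)^\top\right] - \mathbb{E}_{c \sim D_C}\left[\mu^E_c (\mu^E_c)^\top\right]$$
Therefore, we summarize:
$$\Sigma_E = \Sigma^E_B + \mathbb{E}_{c \sim D_C}\left[\Sigma^E_{W,c}\right]$$
Now we can say the following: Finally, we would like to demonstrate the tightness of the bound, aside from the trivial case of all or some of the covariance matrices being 0 and $\mu^R_c = \mu^G_c$ for all $c$. Consider a case where all of the matrices $\Sigma^E_B$ and $\Sigma^E_{W,c}$ are simultaneously diagonalizable, i.e., there exists an invertible matrix $P$, such that: where $\Lambda^E_B$ and $\Lambda^E_{W,c}$ are the diagonal matrices of the eigenvalues of $\Sigma^E_B$ and $\Sigma^E_{W,c}$, respectively. Since all matrices are diagonal, we can rewrite M as follows: where $\sigma^E_{m,d}$ denotes the $d$-th element on the diagonal of the matrix $\Lambda^E_m$ ($m$ is a specifier of the form $(W, c)$ or $B$). In addition, $k$ is the output dimension of $f$.
In addition, assume that (i) for each $d \in [k]$ there is only one member of $\{\sigma^R_{W,c,d}\}_{c=1}^{K} \cup \{\sigma^R_{B,d}\}$ that is nonzero, and the same for $\{\sigma^G_{W,c,d}\}_{c=1}^{K} \cup \{\sigma^G_{B,d}\}$, and (ii) it has the same index. For example, for $d = 3$, only $\sigma^R_{W,2,3}$ is nonzero among the variances of the real data, and the equivalent $\sigma^G_{W,2,3}$ is the only nonzero variance in the generated data. Finally, if we also have $\mu^R_c = \mu^G_c$ for all $c$, then we get M = 0.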