Quantifying quality of class-conditional generative models in time series domain

Despite recent breakthroughs in the domain of implicit generative models, evaluating these models remains challenging. With no single metric to assess overall performance, various existing metrics only offer partial information. This issue is further compounded for unintuitive data types such as time series, where manual inspection is infeasible. This deficiency hinders the confident application of modern implicit generative models on time series data. To alleviate this problem, we propose two new metrics, the InceptionTime Score (ITS) and the Fréchet InceptionTime Distance (FITD), to assess the quality of class-conditional generative models on time series data. We conduct extensive experiments on 80 different datasets to study the discriminative capabilities of the proposed metrics alongside two existing evaluation metrics: Train on Synthetic Test on Real (TSTR) and Train on Real Test on Synthetic (TRTS). Our evaluations reveal that the proposed evaluation metrics, ITS and FITD, in combination with TSTR can accurately assess class-conditional generative model performance and detect common issues in implicit generative models. Our findings suggest that the proposed evaluation framework can be a valuable tool for confidently applying modern implicit generative models in time series analysis.


Introduction
In recent years, implicit generative models have gained immense popularity due to the emergence of Generative Adversarial Networks (GANs) [1]. With the astounding success of generative models in various domains such as image, video, music, and speech, it becomes imperative to quantify their performance. So far, various qualitative and quantitative assessment methods [2,3] have been proposed to evaluate these models' performance and make comparisons between generative models possible. For intuitive data types, qualitative approaches such as human judgment can measure the performance of a generative model. Among the quantitative methods, the Inception Score (IS) and the Fréchet Inception Distance (FID) have become the standard assessment methods in the image domain. Unfortunately, there is no consensual and reliable standard for evaluating generative models in the time-series domain. This deficiency impedes developing and applying deep generative models in the time-series domain and makes comparing the few existing models impossible.
Inspired by IS and FID from the image domain, this study introduces the InceptionTime Score (ITS) and the Fréchet InceptionTime Distance (FITD) to assess generative models in the time-series domain. In doing so, we investigate whether the above-mentioned image-domain standards can be transferred to the time-series domain. In the literature, several attempts to assess generative models have been proposed; the most notable are TSTR and TRTS, introduced by [4]. These constitute a diametrically opposed approach compared to our FITD and ITS scores and are therefore used within our experiments to examine and verify the capabilities of our newly introduced assessment metrics.
Formally, let P_data denote the data distribution and P_model the distribution that our generative model has learned. Ideally, we expect samples from P_model to follow P_data and to cover its mode space. These properties should be detectable by an assessment metric for it to be reliable. We designed an extensive experimental setting that includes 80 datasets from the UCR archive to investigate the quality of the samples drawn from P_model as well as its capability to reproduce the mode space. We also include TSTR and TRTS in the whole experimental pipeline so that researchers interested in conditional GANs for time series can understand the behavior of the presented assessment metrics more intuitively and gain confidence in their efficacy.

Fig. 1: The proposed evaluation pipeline for FITD and ITS. The samples and fake labels are passed through a pre-trained InceptionTime classifier; the output of the penultimate layer is used to calculate FITD, and the SoftMax output is used to calculate ITS.

Related Work
The effectiveness of generative models is normally assessed by gauging the gain in performance on a downstream task. This methodology of evaluation holds independent of the modality, i.e., image, audio, time series, etc. Haradal et al. [5] used the improvement on a classification task to measure the quality of their generative model. Wiese et al. [6] described the performance of their generative model in the finance domain using statistical properties of the data that are most relevant for their target domain. Another popular method is evaluating a generative model based on its performance on a surrogate task, such as supervised classification. Esteban et al. [4] evaluated their generative model based on TSTR (Train on Synthetic, Test on Real) and TRTS (Train on Real, Test on Synthetic). TRTS is computed by training a classifier on real data and testing it on synthetic data; similarly, TSTR is computed by training a classifier on synthetic data and testing it on real data. Smith et al. [7] employed a similar method for quantifying the performance of TSGAN. Furthermore, the authors defined a 1D FID by training a simple classifier separately on each dataset and using this network for FID calculation. However, the 1D FID was not aligned with the visual impression of the generated samples in some cases. The authors of T-CGAN [8] reported the performance of their model based on TSTR only, using AUROC instead of accuracy. While TSTR and TRTS can provide an indirect assessment of a generative model, they rely heavily on the choice of classifier. Furthermore, they cannot reflect the diversity of the generated samples [2].

Quantitative Assessment for Deep Generative Models on the Time-Series Domain

This section introduces the different methods available for assessing generative models, with a focus on the time-series domain. All these methods employ a classifier in their pipeline, either for calculating the score or for extracting features from the input. To have comparable results across studies, it is crucial to use the same testbed. For instance, in the image domain, an Inception network [9] pre-trained on the ImageNet dataset [10] is employed for computing the assessment metrics. Therefore, in this study, we propose to adopt InceptionTime [11] for determining our evaluation metrics. InceptionTime is a CNN-based time-series classifier that has achieved impressive accuracy on the time-series classification task. We employ the same network structure across all datasets; however, due to the high variance between the dynamics of different time-series datasets, it is not viable to use a single pre-trained network across datasets. Hence, the InceptionTime network is trained separately for each dataset. We adopt the same network structure and training pipeline as provided by the InceptionTime authors in the project git repository. An overview of our evaluation pipeline is shown graphically in Fig. 1.
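To make the pipeline in Fig. 1 concrete, the following minimal sketch shows how the two quantities it produces can be obtained from a trained classifier. It assumes a tf.keras InceptionTime model (as in the authors' repository) whose last layer is the softmax output and whose second-to-last layer produces the feature vector; the variable names (classifier, x) are illustrative only.

```python
import tensorflow as tf

def build_feature_extractor(classifier: tf.keras.Model) -> tf.keras.Model:
    """Expose the penultimate-layer activations of a trained classifier.

    The softmax output of `classifier` itself feeds the ITS computation,
    while the penultimate-layer features feed the FITD computation.
    """
    return tf.keras.Model(inputs=classifier.input,
                          outputs=classifier.layers[-2].output)

# Illustrative usage with a trained InceptionTime model `classifier`
# and a batch of (real or generated) series `x`:
#   feature_extractor = build_feature_extractor(classifier)
#   features = feature_extractor.predict(x)  # (n_samples, feature_dim), for FITD
#   probs    = classifier.predict(x)         # (n_samples, n_classes),  for ITS
```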

InceptionTime Score (ITS)
Inspired by the IS for assessing generative models in the image domain, we propose the InceptionTime Score (ITS) as an evaluation metric for the quality of synthetic data in the time-series domain. Given x as the set of synthetic time-series samples and y as their corresponding labels, we expect high-quality generated data to have a low-entropy conditional label distribution p(y | x). This is to be compared with the marginal label distribution p(y), whose entropy is expected to be high for diverse samples. Thus, in the ideal case, the shapes of p(y | x) and p(y) are opposite: narrow vs. uniform. The score should reflect this property and be higher the more the conditional label distribution and the marginal distribution differ. This is achieved by exponentiating their expected KL divergence:

ITS = exp( E_x [ D_KL( p(y | x) ‖ p(y) ) ] )        (1)

By definition, ITS is a positively oriented metric. Its lowest value is 1.0, and its upper bound is the number of classes in the dataset. To acquire the label distribution of synthetic time-series data, we employ a pre-trained InceptionTime network.
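As a reference for Eq. (1), a minimal NumPy sketch of the ITS computation is given below. It takes the softmax outputs of the pre-trained InceptionTime classifier on the generated samples; the function name and the numerical-stability constant are our own choices.

```python
import numpy as np

def inception_time_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """ITS = exp( mean_x KL( p(y|x) || p(y) ) ).

    probs: array of shape (n_samples, n_classes) holding p(y|x) for every
    generated sample, i.e. the classifier's softmax outputs.
    The result lies between 1.0 and the number of classes.
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal label distribution p(y)
    kl_per_sample = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl_per_sample.mean()))
```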

Fréchet InceptionTime Distance (FITD)
ITS relies solely on the statistics of the generated samples and ignores the real samples. Hence, it assigns a high score to a model whose samples are classified sharply and cover the classes diversely, regardless of whether the generated samples follow the target distribution. To address this problem in the image domain, Heusel et al. [12] proposed the Fréchet Inception Distance (FID). To exploit FID on time-series data, we define the Fréchet InceptionTime Distance (FITD). We extract the feature vectors of the real and the generated samples from the penultimate layer of a pre-trained InceptionTime classifier. We assume each of these sets of feature vectors follows a continuous multivariate Gaussian. Subsequently, we calculate the Fréchet distance (also known as the Wasserstein-2 distance) between these two Gaussians, i.e.,

FITD = ‖µ_r − µ_g‖² + Tr( Σ_r + Σ_g − 2 (Σ_r Σ_g)^(1/2) ),
where (µ_r, Σ_r) and (µ_g, Σ_g) are the means and covariance matrices of the features of the real data and the generated data, respectively. A lower FITD indicates a smaller distance between the generated distribution and the real distribution, and its minimum value is zero. FITD is a robust and efficient metric; however, its assumption of a multivariate Gaussian distribution in feature space does not always hold.
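A compact implementation sketch of the FITD formula above, using NumPy and SciPy on the penultimate-layer feature matrices, could look as follows; the handling of numerical noise in the matrix square root follows common FID implementations and is our addition.

```python
import numpy as np
from scipy import linalg

def frechet_inception_time_distance(feat_real: np.ndarray,
                                    feat_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to real and generated features.

    feat_real, feat_gen: arrays of shape (n_samples, feature_dim) taken from the
    penultimate layer of the pre-trained InceptionTime classifier.
    """
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)

    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```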

Assessment Based on Classification Accuracy
We can use a classifier to explicitly benefit from labeled data when assessing class-conditional generative models. The core idea is that if a generative model generates realistic data samples, these samples should perform well in downstream tasks. In this case, a classifier can be trained on real data and tested on synthetic data, reporting classification accuracy. This paper refers to this method as TRTS (Train on Real, Test on Synthetic). TRTS implies that if the distribution learned by the generative model, P_model, matches the data distribution P_data, then a discriminative model trained on samples from P_data can accurately classify generated samples from P_model. TRTS outputs low accuracy if generated samples fall outside of P_data. However, if P_model ⊂ P_data, then TRTS might assign a high accuracy, neglecting the fact that the mode space is only partially covered by P_model.
Another classifier-based method is to train a model on synthetic data and test it on real data. We refer to this method as TSTR (Train on Synthetic, Test on Real). Like TRTS, TSTR argues that if P_model ≈ P_data, then a classifier trained on generated samples can score high accuracy while classifying real samples. Unlike TRTS, TSTR can detect the situation where P_model only partially covers P_data; however, it cannot reflect the existence of synthetic samples that do not follow P_data. In other words, TSTR provides high accuracy even if P_data ⊂ P_model. This latter case is more intuitively known as an overparametrized model.
In this study, we employed the InceptionTime model as the classifier for calculating TRTS and TSTR.
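For completeness, a schematic sketch of the two classifier-based scores is given below. The paper computes them with InceptionTime; here make_classifier is a placeholder for any factory that returns a fresh, untrained classifier with a scikit-learn-style fit/predict interface.

```python
from sklearn.metrics import accuracy_score

def trts(make_classifier, x_real, y_real, x_syn, y_syn) -> float:
    """Train on Real, Test on Synthetic."""
    clf = make_classifier()
    clf.fit(x_real, y_real)
    return accuracy_score(y_syn, clf.predict(x_syn))

def tstr(make_classifier, x_real, y_real, x_syn, y_syn) -> float:
    """Train on Synthetic, Test on Real."""
    clf = make_classifier()
    clf.fit(x_syn, y_syn)
    return accuracy_score(y_real, clf.predict(x_real))
```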

Evaluation Data - UCR Time-Series Classification Archive
The UCR archive [13] is a collection of 128 univariate time-series datasets designed for the classification task. It thus enables us to perform our experiments on a broad spread of datasets with various properties across different domains. Furthermore, the InceptionTime model has demonstrated impressive classification performance on the UCR archive. As discussed above, we need highly classifiable and diverse features to calculate FITD and ITS precisely. Therefore, for our experimental setting, we select the subset of datasets from the UCR archive on which the InceptionTime model achieves at least 80% accuracy, resulting in 80 datasets. Appendix A lists the names of these datasets, their properties, and the accuracy achieved by the InceptionTime model.

Experiments and Results
To investigate the discriminative ability of ITS, FITD, TRTS, and TSTR in the time-series domain, we first design scenarios that replicate common problems of generative models, namely:
• Decline in Quality,
• Mode Drop, and
• Mode Collapse.
Subsequently, we apply our assessment methods and study how they indicate these problems.

Experimental Evaluation Score
In our experiments, we train InceptionTime on the train set of each dataset and calculate our scores on the respective test set to obtain the base score (score_base) for that dataset. Since the test set is drawn from the data distribution, we consider score_base the best score we can acquire empirically on each dataset. We denote the score of generated samples as score_gen. Finally, we define rel(score) = score_base − score_gen as the score of generated samples relative to the base score. We expect rel(ITS) ≥ 0, rel(TRTS) ≥ 0, rel(TSTR) ≥ 0, and rel(FITD) ≤ 0 in all cases. In other words, we do not expect a better score than the base score.

Experiment 1 - Decline in Quality
An assessment method should express the quality of the generated samples quantitatively. For this experiment, we add a noise signal to the samples in the test set to simulate a decline in quality. The noise is sampled from a Gaussian distribution with µ = 0, and σ is selected from an equally spaced grid of values in [0, 5]. The standard deviation indicates the noise strength and thus the amount of corruption in the original data. We expect the assessment scores to worsen as the standard deviation increases. Figure 2 presents this experiment's results on four datasets (the remaining visualizations are presented in Appendix B).
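The corruption procedure can be sketched as follows. The number of grid points and the fixed random seed are illustrative assumptions; x_test stands for the test set of one dataset and evaluate for the scoring functions introduced above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed only for reproducibility of the sketch

def corrupt(x: np.ndarray, sigma: float) -> np.ndarray:
    """Add zero-mean Gaussian noise of standard deviation sigma to every series."""
    return x + rng.normal(loc=0.0, scale=sigma, size=x.shape)

# Sweep the noise strength over an equally spaced grid in [0, 5]
# and recompute all scores at every step.
for sigma in np.linspace(0.0, 5.0, num=11):  # grid resolution is an assumption
    x_noisy = corrupt(x_test, sigma)
    # scores[sigma] = evaluate(x_noisy, y_test)  # ITS, FITD, TRTS, TSTR
```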
FITD: The FITD response behaves differently from the others. Since FITD does not have an upper bound, it keeps increasing as more corruption is introduced into the data. The other scores converge to their lower bound at some noise strength (σ = ∆) and cannot indicate the increasing noise strength when σ > ∆.
TRTS and ITS: The behaviors of TRTS and ITS are very similar. Both ITS and TRTS use the InceptionTime model trained on the train set as the backbone of their computation. Once σ > ∆, the classifier fails to classify the samples, and its predictions are no better than random guessing. TRTS converges to the random-guess accuracy, which depends on the number of classes in the dataset, and ITS converges to 1.0.
TSTR: The TSTR response has more variance than TRTS. The reason is that the TRTS classifier is trained on the train set of real data, which does not change during the experiment, while the TSTR classifier is trained on the synthetic data, so a new model is trained for each value of σ.
The value of ∆ depends on the scale of the data: the larger the data scale, the stronger the noise required to corrupt the data. For instance, our scores seem unable to detect the presence of noise in the Chinatown dataset in Figure 2. However, Figure 3 reveals that this dataset has a large scale, ranging approximately between [0, 2000]. Therefore, a much larger σ is needed to corrupt the data meaningfully.

Experiment 2 - Mode Drop
Mode drop happens when the generative model ignores some modes of the real data while generating artificial samples. This can be due to a lack of model capacity or inadequate optimization [14]. We design three experiment scenarios to evaluate the capabilities of ITS and FITD in recognizing mode drop in the time-series domain.

Single Mode Drop
In the first experiment, we remove all the samples belonging to one class from the test set to simulate the mode drop scenario. We calculate all scores for the mode drop caused by removing each class in turn. Hence, for a dataset with N classes, we obtain N values for each score. Figures 4 and 5 illustrate the rel(score) responses of our scores on all datasets.
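A minimal sketch of this class-removal procedure is given below; the same helper also covers the extreme and successive variants described in the following subsections. Variable names (x_test, y_test, evaluate) are placeholders.

```python
import numpy as np

def drop_classes(x: np.ndarray, y: np.ndarray, dropped) -> tuple:
    """Remove every sample whose label is in `dropped` (simulated mode drop)."""
    keep = ~np.isin(y, list(dropped))
    return x[keep], y[keep]

classes = np.unique(y_test)

# Single mode drop: remove one class at a time and re-score the remainder.
for c in classes:
    x_drop, y_drop = drop_classes(x_test, y_test, {c})
    # scores[c] = evaluate(x_drop, y_drop)

# Extreme mode drop: keep only class c, i.e. drop all the other classes.
# Successive mode dropping: enlarge `dropped` by one class at every step.
```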
FITD: The change in FITD depends on the degree to which the removed class affects the properties of the assumed Gaussian distribution in latent space. In most datasets, dropping a single class did not change the Gaussian distribution's properties in latent space significantly. Thus, FITD reflects a single mode drop poorly. On the other hand, on a few datasets, the FITD response shows high variance, indicating that at least one class's samples significantly impact the mean and covariance matrix of the points in latent space. Since the feature vectors are generated by a non-linear transformation into a high-dimensional space, it is impossible to interpret the FITD response given only the samples in data space.
ITS: The ITS response is mostly positive but has great variance. When we remove a class, we change the diversity of the labels. Therefore, we expect H(P(y | x)) to remain unaffected while H(P(y)) decreases due to the reduction in diversity. Dropping each class affects H(P(y)) differently, which results in high variance between responses. If the label distribution is close to uniform, dropping any class will decrease H(P(y)) similarly. In contrast, if the label distribution is heavily unbalanced, then dropping a major class can increase H(P(y)). That is why we can observe an improvement in ITS after a mode drop in some rare cases.

Extreme Mode Drop
In the second case, we simulate the extreme case of mode drop, where we keep only one of the classes in the test set. We follow the same approach as in the previous experiment but retain only one class at a time. Therefore, for a dataset with N classes, we obtain N values for each score. Figures 6 and 7 portray the results; to make the comparison easier across all datasets, the cube root of the FITD response is plotted.

FITD: In the case of FITD, the extreme mode drop drastically changes the properties of the assumed Gaussian in latent space, and we can see this shift in the FITD response. Additionally, this change is more prominent for datasets with a large number of classes.
ITS: If we assume error-free classification, then after dropping all modes except one, ITS = 1, since H(P(y)) = 0 and H(P(y | x)) = 0. Hence, rel(ITS) = ITS_base − 1, which equals N − 1 when ITS_base attains its maximum value of N, where N is the number of classes. In practice, and accounting for classification error, we still observe that the ITS response is close to this theoretical expectation.
TRTS: Similar to the previous experiment, the TRTS response cannot highlight the extreme mode drop since P_model ⊂ P_data.
TSTR: TSTR detects the extreme mode drop in all datasets. Furthermore, as the number of classes increases, the divergence from TSTR_base grows. Please note that TSTR_base is low for datasets with N > 20, since we trained the base model identically for all datasets regardless of the number of classes.

Successive Mode Dropping
In our final mode drop experiment, we fill the gap between the first and second scenarios: we drop the modes one by one and inspect the response of our assessment methods. Figure 8 demonstrates the scores on four datasets (the remaining visualizations are presented in Appendix C). The results are consistent with the previous experiments.
FITD: FITD is less sensitive when only a few classes are dropped. However, once the number of dropped classes crosses a certain threshold, FITD increases sharply. Seemingly, the properties of the assumed Gaussian distribution are quite robust against removing a few samples from the test set. However, once we remove the samples belonging to most classes, the distribution begins to change dramatically with every additional class we drop from the test set.
ITS and TSTR: ITS and TSTR decrease linearly with the number of dropped classes.
TRTS: TRTS does not change with successive mode drops.

Experiment 3 - Mode Collapse

Mode collapse occurs when the generative model produces samples with very limited diversity within the modes it covers. To simulate it, we represent each class by a single averaged sample computed from the test set, so the "generated" set contains exactly one sample per class. Figures 9 and 10 summarize the performance of our scores relative to their base score in detecting this simulated mode collapse.
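A sketch of this collapse simulation, as we describe it above, is given below; the helper simply replaces every class by the mean of its test samples, and the variable names are placeholders.

```python
import numpy as np

def collapse_to_class_means(x: np.ndarray, y: np.ndarray):
    """Simulated mode collapse: one averaged sample per class."""
    classes = np.unique(y)
    x_avg = np.stack([x[y == c].mean(axis=0) for c in classes])
    return x_avg, classes  # "generated" samples and their labels
```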
ITS: In the presence of a perfect classifier, ITS should reach its maximum, since then H(P(y | x)) = 0 in (1) and we have maximum diversity among the labels, hence H(P(y)) = log N, where N indicates the number of classes. However, the averaged sample might not accurately represent a class's samples, so there is a high chance of misclassification. Since our generated set is small and limited to a single averaged sample per class, any misclassification significantly changes ITS from its expected value. Therefore, we can observe in Figure 9 that ITS even appears to improve on some datasets.

FITD: FITD responds correctly to mode collapse on most datasets, but the strength of its response is inconsistent across datasets. Again, interpreting the FITD response depends on how the samples are mapped into latent space. If the averaged samples can replicate the Gaussian distribution properties of the test set, we obtain an FITD close to FITD_base; otherwise, FITD diverges from its base score.

TRTS: TRTS displays hit-and-miss behavior. If the averaged samples represent the original samples of the dataset well, then they are classified correctly, and TRTS cannot detect the mode collapse. Otherwise, the misclassification of the averaged samples reflects the mode collapse problem.

TSTR: TSTR can detect mode collapse in most datasets. When mode collapse happens, the diversity of the generated samples decreases, so it is difficult for a classifier to accurately learn the probability distribution of a class given only samples from the mode of the distribution. Thus, we expect a high classification error when the classifier is evaluated on the real data, due to the limited generalization capacity of the model. The TSTR behavior illustrated in Figure 10 aligns with these expectations.

Conclusion and Final Remarks
With the new advancements on the deep neural network front, generative models are on the rise; however, their application in the time-series domain has been hindered by the lack of a standard assessment method. In this work, we aimed to alleviate this problem by introducing a framework that transfers two widely used evaluation metrics from the image domain, namely IS and FID, to time series. We employed the InceptionTime classifier as the backbone of our framework and introduced ITS and FITD for quantifying the performance of generative models in the time-series domain. We conducted various experiments on 80 datasets to investigate the capabilities of ITS and FITD in detecting common problems of generative models and compared their discriminative abilities with TRTS and TSTR, two commonly used assessment methods for class-conditional generative models. Table 1 summarizes the capabilities of these metrics in detecting three problems that generative models commonly face.

Table 1: The summary of the scores' capabilities in detecting common problems of generative models.

Furthermore, our main findings on each metric are summarized as follows:
• ITS responds correctly to all the studied problems on most of the datasets; however, its behavior is most consistent in detecting the mode drop problem. Furthermore, H(P(y)) seems to be the most defining component of the ITS response in detecting the studied problems.
• FITD behavior depends heavily on how the samples are mapped into latent space. Since the transformation to latent space is complex and non-linear, the interpretation of the FITD response is not straightforward. Additionally, since FITD does not have an upper bound, it can quantify the quality of generated samples better than the other metrics.
• TRTS performance is disappointing compared to the others. In the presence of the other metrics, it is unnecessary to compute TRTS for investigating the studied problems.
• TSTR shines when the generative model has learned only a subset of the real distribution. Therefore, it is the most reliable metric for detecting mode drop and mode collapse.
This work can be extended by adapting recent advances in generative model assessment from the image domain [3] to the time-series domain. Another potential direction is to extend the list of studied problems or to investigate other aspects of the evaluation metrics, such as computation time or sample efficiency.

Appendix B Extra Visualizations for the Decline in Quality Experiment
Figures B2 to B8 visualize the responses of the studied metrics in the decline-in-quality experiment for all datasets in the UCR archive.


Fig. 2: Changes in the scores when data quality is declined by introducing noise into the data progressively.

Fig. 3: The comparison between original data and data after adding noise with σ = 5 from the Chinatown dataset. Due to the large scale of the data, the introduction of noise with σ = 5 does not change the data significantly enough to cause a response in our scores.

Fig. 4: Relative ITS and FITD scores when one mode is dropped from a dataset.

Fig. 5: Relative TRTS and TSTR scores when one mode is dropped from a dataset.

Fig. 6: Relative ITS and FITD scores for the extreme mode drop scenario.

Fig. 7: Relative TRTS and TSTR scores for the extreme mode drop scenario.

Fig. 8: Changes in the scores when modes are removed one by one.

Fig. B2: Changes in the studied metrics when data quality is declined by introducing noise into the data progressively.

Fig. B3: Changes in the studied metrics when data quality is declined by introducing noise into the data progressively.

Fig. B4: Changes in the studied metrics when data quality is declined by introducing noise into the data progressively.

Fig. B5: Changes in the studied metrics when data quality is declined by introducing noise into the data progressively.

Fig. B6: Changes in the studied metrics when data quality is declined by introducing noise into the data progressively.

Fig. B7: Changes in the studied metrics when data quality is declined by introducing noise into the data progressively.

Fig. C9: Changes in the studied metrics when the modes are removed one by one from a dataset.

Fig. C10: Changes in the studied metrics when the modes are removed one by one from a dataset.