Interpretable tabular data generation

Generative adversarial network (GAN) models have been successfully applied to a wide range of machine learning tasks, and tabular data generation is no exception. Notably, several state-of-the-art tabular data generation models, such as CTGAN, TableGAN and MedGAN, are based on GANs. Even though these models achieve strong performance when trained on a range of datasets, there remains substantial room (and desire) for improvement, and the existing methods have weaknesses beyond raw performance. First, current methods focus only on the performance of the model, with limited emphasis on the interpretation of the model. Secondly, current models operate on raw features only, and hence fail to exploit any prior knowledge of explicit feature interactions that could be utilized during the data generation process. To alleviate these two limitations, in this work we propose a novel tabular data generation model: Generative Adversarial Network modelling inspired from Naive Bayes and Logistic Regression's relationship (GANBLR), which not only addresses the interpretation limitation of existing tabular GAN-based models but also provides the capability to handle explicit feature interactions.
Through extensive evaluations on a wide range of datasets, we demonstrate GANBLR's superior performance as well as its better interpretability (explanation of feature importance in the synthetic generation process) compared to existing state-of-the-art tabular data generation models.


Introduction
Correspondence: Nayyar Zaidi, nayyar.zaidi@deakin.edu.au. Affiliations: 1. School of IT, Deakin University, Melbourne, VIC, Australia; 2. College of Computer Science, Xi'an Shiyou University, Xi'an, China.

Nowadays, the Generative adversarial network (GAN) model and its variants are widely utilized in domains ranging from Computer Vision [1] and Data Privacy [2] to Medicine [3], owing to their excellent performance compared to other methods. Typical GAN models [4] consist of two components: the Generator learns to produce a synthetic output from input noise, whereas the Discriminator learns to distinguish the generator's synthetic data from the real data. These two components interact with each other in a game-theoretic manner, such that as training continues, the generator learns to produce better and better synthetic data, and the discriminator learns to distinguish more accurately between the synthetic and the real data. Typically, the generator and the discriminator are modelled as deep Artificial Neural Networks (ANN) with convolutional and dense layers. Though their application to synthetic data generation for computer vision tasks is groundbreaking [5][6][7], their application to tabular data generation has been marred with challenges. The main challenge stems from the fact that no explicit structure is available among the input data features which can be exploited by convolutional layers [8], leaving the responsibility to the dense layers to engineer features that capture the correlations among features. Note that tabular datasets mainly consist of categorical and numeric features, and how to seamlessly handle these different feature types is not trivial. Regardless of these challenges, the application of GANs to tabular data generation has seen promising results, with models like CTGAN [9], TableGAN [8] and MedGAN [3] leading to state-of-the-art (SOTA) results.
For example, CTGAN generates tabular data via a conditional GAN-like strategy, where the categorical features are treated as a condition, while Gaussian Mixture Model estimation is used for the numeric features. It utilizes the Wasserstein Distance with gradient penalty to generate the synthetic data. TableGAN, on the other hand, uses convolutional layers in both the generator and the discriminator, and introduces an information-loss-based objective function.
We believe that tabular data generation (especially with GAN models) is still in its early days, and there is strong demand for more accurate and far more interpretable data generation models. In this work, we will demonstrate that the GAN strategy is indeed effective, but that (instead of relying on an ANN and its variants) a fundamentally different approach to its generator and discriminator components can lead to much better results. Before we discuss our formulation, let us discuss some limitations of existing GAN-based tabular data generation models:
• First, effective modelling of feature interactions is critical in many machine learning tasks, and data generation is no exception [10,11]. Of course, the generators and discriminators in vanilla GAN models can capture feature interactions, but these implicit feature interactions will not be interpretable [11]. Also, there is no guarantee that any particular interaction is captured by the model. For example, we may have prior knowledge that the feature Salary is highly correlated with the feature Age, but the generator in a vanilla GAN may or may not capture this interaction, since it is likely that the model finds some other, more useful interactions. This lack of explicit feature interaction modelling is one of the main factors impacting the performance of synthetic data generation. One solution is to craft the feature interactions manually, but this would be tedious and time-consuming.
• Secondly, related to the first issue, the use of ANNs and their variants in existing GAN-based tabular data generation models leads to a generation process that is hard to interpret. For tabular data generation, since the goal is to improve the performance of machine learning tasks (e.g. classification, regression, etc.), the demand for interpretability is far more crucial.
It can be seen that these two limitations of existing vanilla GANs mainly stem from the fact that their generator component is carved out of a deep ANN (or its variants). Can we utilize a different model as the generator to address these limitations? Answering this question has been the main motivation of this work. We propose a radically different formulation, which departs from existing vanilla GAN-based data generation: instead of using a deep ANN (or its variants) as the generator (and the discriminator), we utilize a Bayesian Network (BN) model. A BN is a directed acyclic graphical model in which the training process incorporates learning the structure as well as its associated parameters. By specifying (or learning) the structure, one can explicitly incorporate feature interactions. Since its parameters correspond to actual probabilities, the model is interpretable. It can be seen that a BN has desirable data generation properties which can address the above-mentioned limitations of existing tabular GAN models. But how can one use a BN in the GAN formulation? To answer this question, let us dive deeper into BNs.
Typical BN models work with tabular data and maximize the log-likelihood (LL), $\sum_{i=1}^{m} \log P(y_i, \mathbf{x}_i)$, where m is the number of data points, $\mathbf{x}$ is a vector of independent features with discretized values, and y is the dependent target feature. Note, the numeric features of $\mathbf{x}$ are discretized prior to calculating the probabilities in BN models. A BN is an example of a generative model, as one can use it to generate (sample) data once the model is learned. Of course, one can calculate $P(y_i|\mathbf{x}_i)$ in a BN to obtain a classifier. However, the predictive performance of BN classifiers is generally not as good as that of models which directly optimize $P(y_i|\mathbf{x}_i)$, also known as discriminative models. Interestingly, generative models such as BNs can be trained by optimizing a discriminative objective function such as the conditional log-likelihood (CLL), $\sum_{i=1}^{m} \log P(y_i|\mathbf{x}_i)$ [12]. For example, a popular example of a BN is the naive Bayes (NB) classifier, whose discriminative equivalent is the well-known Logistic Regression (LR) model, which of course optimizes the CLL. NB and LR are well-known examples of generative-discriminative equivalent models; in general, one can train any BN by optimizing the CLL, leading to a respective generative-discriminative equivalence. Following the notation in [13], we denote a BN trained by optimizing the CLL as BN_d, and under this notation, we have LR ≡ NB_d [14]. Typically, since structure learning of a BN is time-consuming, we can resort to simple restricted models such as Tree-Augmented Naive Bayes (TAN) or K-Dependence Bayesian estimators (KDB), in which structure learning takes only one or two passes through the data. One can then train TAN or KDB by optimizing a discriminative objective function, i.e. the CLL, leading to the TAN_d or KDB_d formulations.
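To make the LL/CLL distinction concrete, the following sketch (pure Python, with a hypothetical toy discretized dataset) computes both objectives for a naive Bayes model; the Laplace-smoothing helper and the data are illustrative assumptions, not from the paper.

```python
import math
from collections import Counter

# Toy discretized dataset: each row is (x1, x2, y).
data = [(0, 1, 0), (1, 1, 0), (0, 0, 1), (1, 0, 1), (0, 0, 1)]

ys = [y for _, _, y in data]
p_y = {y: c / len(data) for y, c in Counter(ys).items()}

def p_x_given_y(feat_idx, value, y, alpha=1.0, card=2):
    """Laplace-smoothed conditional probability P(x_feat = value | y)."""
    num = sum(1 for row in data if row[feat_idx] == value and row[2] == y)
    den = sum(1 for row in data if row[2] == y)
    return (num + alpha) / (den + alpha * card)

def joint(x1, x2, y):
    """Naive Bayes joint: P(y) * P(x1|y) * P(x2|y)."""
    return p_y[y] * p_x_given_y(0, x1, y) * p_x_given_y(1, x2, y)

# Generative objective: LL = sum_i log P(y_i, x_i).
LL = sum(math.log(joint(x1, x2, y)) for x1, x2, y in data)
# Discriminative objective: CLL = sum_i log P(y_i | x_i).
CLL = sum(
    math.log(joint(x1, x2, y) / sum(joint(x1, x2, yp) for yp in p_y))
    for x1, x2, y in data
)
```

Since $P(y|\mathbf{x}) \ge P(y, \mathbf{x})$ for every instance, the CLL always upper-bounds the LL for the same parameters; the two objectives nevertheless have different maximizers in general, which is exactly the generative-discriminative gap discussed above.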
Although one can sample from a standard BN, the samples might not be of good quality, since when optimizing P(y, x) there is no guarantee that P(y|x) will be well-calibrated [15]. One solution is to directly sample from BN_d, which optimizes P(y|x). However, the issue here is that the parameters are not constrained to be actual probabilities. 1 One solution is to constrain the weights to be actual probabilities during the training of the discriminative objective function. Such weight constraining has been explored for LR and KDB in [13]. Following the naming conventions of existing work in this area, we denote a BN which is learned discriminatively, but with weights constrained so that they conform to actual probabilities, as BN_e.
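One common way to keep discriminatively trained weights as actual probabilities, sketched below, is to store unconstrained parameters and map each conditional-probability table through a softmax; this is an illustrative technique and not necessarily the exact constraining scheme of [13].

```python
import math

def softmax(logits):
    """Map unconstrained logits to a valid probability vector (sums to 1)."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Free parameters, updated by gradient descent on a discriminative loss...
free_params = [0.3, -1.2, 2.0]
# ...while the model always exposes them as actual probabilities.
probs = softmax(free_params)
```

Because the softmax output is a probability vector at every optimization step, gradients can flow through the free parameters while the constraints of a BN_e hold throughout training.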
We claim that BN_e is the perfect model to be used as a generator in GAN models. 2 In particular, it optimizes a discriminative objective function (and hence can be learned end-to-end with the back-propagation algorithm), its weights are actual probabilities so one can sample from it, and importantly, one can interpret the model based on the learned weights. One drawback of using a BN in GAN models is that it can only process, and hence generate, categorical attributes. Note, it has been recently shown that, contrary to popular belief, discretization can lead to models that have superior performance compared to their numeric counterparts [16]. Therefore, discretizing the numeric attributes and then sampling using BN_e can be an effective alternative to directly generating the numeric attributes.
In our proposed formulation of GAN models, we use typical Bayesian Network models as the generator (BN_e, e.g. NB_e, TAN_e, KDB_e) and their discriminative counterparts as the discriminator (BN_d, e.g. NB_d, TAN_d, KDB_d, etc.). We name this new formulation Generative Adversarial Network modelling inspired from Naive Bayes and Logistic Regression's relationship (GANBLR). However, the NB and LR terms in the GANBLR acronym are only figurative, representing the broad generative and discriminative learning paradigms. In practice, one can use any generative model as the generator and its corresponding discriminative model as the discriminator.
Let us summarize the contributions of this work:
• We propose a novel model for tabular data generation, namely GANBLR, which uses BN_e as the generator and BN_d as the discriminator. The generator in GANBLR cooperates with the discriminator to form an adversarial structure that improves the quality of the synthetic tabular data. Note, GANBLR is limited to producing datasets with categorical attributes only.
• Even though BN_e has been studied in the context of classification, this is the first work to study its effectiveness as a data generation technique.
• We compare GANBLR to existing SOTA GAN models on 15 public tabular datasets. The results demonstrate that GANBLR not only outperforms them in terms of machine learning utility 3 and statistical similarity (measured with Jensen-Shannon Divergence and Wasserstein Distance), but also provides better interpretability.
The rest of the paper is organized as follows. We discuss related work in Sect. 2. The details of GANBLR are provided in Sect. 3. We provide an extensive experimental analysis in Sect. 4. We conclude in Sect. 5 with pointers to future work.

Related work
In this section, we will start by discussing the existing GAN-based models for tabular synthetic data generation. Later, we will discuss discriminative training of Bayesian Network models.

Tabular data generation-GAN models
The current research on the application of GAN models for the tabular synthetic data generation has taken two directions. The first direction utilizes the vanilla GAN structure, whereas the second direction utilizes the conditional models based on conditional GAN structure. In the following, let us discuss these two directions in detail.

Vanilla GAN-based tabular generation
This stream of research is based on the foundational work of [4], in which random noise (generated from a predetermined distribution) is used as the input to the generator. The generator uses this input to approximate the real data distribution with an encoder and decoder model; its output is the generated synthetic data. The discriminator uses this generated synthetic data together with the real data to train a classifier that distinguishes the synthetic data from the real data. There are four notable works that utilize this vanilla GAN strategy, which we discuss in the following. Note, we use the term vanilla GAN to refer to the model of [4]; of course, almost all works utilizing GANs are variants of the framework proposed in [4].
The MedGAN [3] model is one of the earliest works to use an auto-encoder architecture as the structure of the generator. The model can generate both the categorical and numerical features needed to produce authentic medical electronic health records. Training in the MedGAN model utilizes mini-batch averaging to address the mode collapse problem. Additionally, batch normalization with shortcut connections is utilized.
The CrGAN model [17] is designed to generate Passenger Name Record (PNR) data for the aviation industry. PNR data contains a passenger's personal details such as name, date of birth, the reserved trip information, the flight information, the payment details, etc. Note, PNR data can consist of both categorical and numerical features, which can have missing values. For generating features with missing values, use of the normal vanilla GAN structure can be challenging. To address this issue, the CrGAN model proposes categorical feature embedding as well as a Cross-Net architecture. It is shown that the CrGAN model can generate high-quality PNR data.
A convolutional neural network is utilized as the generator of TableGAN [8], and an information-loss-based objective is used. It is shown that TableGAN not only ensures high machine learning utility, but also claims to preserve the privacy of the data. Note, PATEGAN [2] is another notable work, similar to TableGAN, that is designed to prevent privacy attacks. 4
The above-mentioned vanilla GAN-based tabular generation models are effective in different domains, but they still face two limitations. Firstly, these models have been designed and proposed for binary-class datasets in specialized domains, and therefore, their suitability and generalization to other domains (e.g. to multi-class datasets) is not clear. Secondly, the above models are not capable of generating synthetic data with a guaranteed value of one of the features. This second limitation makes these models unsuitable for generating imbalanced machine learning datasets, e.g. in fraud detection or anomaly detection.

Conditional GAN-based tabular generation
Conditional GAN-based tabular data generation models make use of a conditional vector to specify the particular feature value or class label to be generated. Notable work in this stream of research is CW-GAN [18], which has been shown to obtain better results than competing methods on credit data generation. There are three different loss functions in the CW-GAN model. The first is the Wasserstein Distance loss, calculated between the synthetic and the real data. The second is the gradient penalty, which regularizes the model complexity of the discriminator. The last is the auxiliary classifier loss, which encourages the generator to generate synthetic data belonging to the specified class. The current SOTA model for tabular data generation, CTGAN [9], falls in this research stream as well. CTGAN leverages mode-specific normalization and a training-by-sampling process to generate better-quality synthetic datasets with both categorical and numeric features. Inspired by the Gaussian Mixture Model, mode-specific normalization first computes the modes of a numeric feature. After this, the mean and standard deviation of each mode are captured, and the numerical feature values are normalized with the associated mean and standard deviation. The resulting normalized values are concatenated with the categorical features to form the input of the CTGAN model. The training-by-sampling strategy is the key component of CTGAN's generator: it ensures that instances from a minority class have a similar chance of being sampled as instances from the majority class.
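As an illustration of the idea behind mode-specific normalization, the following sketch (a hypothetical helper; real CTGAN fits a variational Gaussian mixture rather than the simple 1-D k-means used here) assigns each numeric value to a mode and standardizes it within that mode.

```python
import numpy as np

def mode_specific_normalize(values, n_modes=2, iters=20):
    """Assign each value to a mode (plain 1-D k-means as a stand-in for a
    Gaussian mixture), then z-score each value within its own mode.
    Returns (normalized values, mode index per value)."""
    values = np.asarray(values, dtype=float)
    centers = np.linspace(values.min(), values.max(), n_modes)
    for _ in range(iters):
        # Assign every value to the nearest mode centre, then refit centres.
        assign = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for m in range(n_modes):
            if np.any(assign == m):
                centers[m] = values[assign == m].mean()
    normalized = np.empty_like(values)
    for m in range(n_modes):
        mask = assign == m
        if mask.any():
            std = values[mask].std() or 1.0  # guard against zero variance
            normalized[mask] = (values[mask] - values[mask].mean()) / std
    return normalized, assign

# Bimodal toy feature, e.g. salaries clustered around 30 and 90 (thousands).
vals = [29, 31, 30, 88, 92, 90]
norm, modes = mode_specific_normalize(vals)
```

A single global z-score would leave such a bimodal feature bimodal; normalizing per mode instead produces values centred around zero within each mode, plus a discrete mode indicator, which is easier for the generator to model.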
Although CW-GAN and CTGAN can generate synthetic data with a particular feature value, the drawbacks of existing GAN-based methods for tabular data generation that we mentioned in Sect. 1 remain outstanding. Firstly, the models are not interpretable, i.e. the synthetic data generation process does not allow practitioners to determine why a generated data point belongs to a particular class. Secondly, the input feature interactions cannot be modelled directly during training. Lastly, the performance of these methods on a wide range of datasets, ranging from small to large, binary to multi-class, etc., is still to be systematically studied.

Discriminative Bayesian network
Let us discuss discriminative Bayesian Networks in this section. In standard Bayesian networks, one can learn feature interactions as part of the structure learning, under either the restricted [19] or the unrestricted mode. In unrestricted mode, we learn the structure of the network from the data and do not limit the number of parents each attribute can take. This process can be computationally intensive and hence time-consuming. The alternative is the restricted mode, where we make use of count statistics such as Mutual Information and limit the number of parents each attribute can take, leading to models such as TAN and KDB [20]. The KDB model can learn the structure in just one or two passes through the data, and is, therefore, very computationally efficient. The second phase of learning in a Bayesian Network is the learning of the parameters, and this depends on what objective function is optimized. As discussed in Sect. 1, traditionally, Bayesian networks are optimized with the log-likelihood, a generative objective function leading to a closed-form solution. However, one can optimize the conditional log-likelihood, a discriminative objective function, leading to the formulations of the discriminative class-conditional Bayesian Network (BN_d) and the extended class-conditional Bayesian Network (BN_e) [21], optimized via an iterative optimization algorithm. Since the goal of a Bayesian Network has been to estimate probabilities of the form P(y|x), there is some debate over whether the parameters of a discriminative Bayesian Network should be actual probabilities or not. This is the main difference between the two formulations BN_d and BN_e: the parameters are not constrained in the former, but are constrained in the latter to be actual probabilities. Of course, the main benefit of using a discriminative Bayesian Network (BN_e) is interpretability, as well as the capability to incorporate higher-order feature interactions. For example, if the parameters associated with higher-order interactions are constrained to be actual probabilities, the model offers excellent interpretability.
[Figure labels: the real dataset used by the generator; the interactions modelled by the Generator model, where y denotes the class value and π_{x_i} denotes the values of the features which are parents of feature i; the random noise input.]

Method
Let us start by presenting some preliminary work to be used as a foundation, as well as the notation used throughout this paper. Later, we delve into our proposed algorithm, GANBLR, and discuss its learning algorithm as well as its variant.

Preliminary and notations
We denote the generative model (generator) as G and the discriminative model (discriminator) as D. The real (or original) dataset is denoted as $\mathcal{D}_{data} = [X_g^k, Y_g]$, with a total of m instances, each having n independent features with k-order feature interactions present among them. Here $X_g^k = [\mathbf{x}_1, \ldots, \mathbf{x}_m]$ with $\mathbf{x}_i \in \mathbb{R}^n$, and $Y_g = [y_1, y_2, \ldots, y_m]$ with $y_i \in \mathbb{R}^1$, constituting the corresponding class labels. The data ($\mathcal{D}_{data}$) has a maximum level of interaction present among its features, which is denoted by k here 5 . Of course, for a generator to produce samples effectively, it must be able to model these k-order interactions present in the data. If we are using a BN_e model as the generator, we can easily specify k; however, if a traditional deep ANN is used, modelling interactions of order k is more of a trial-and-error practice of determining the right breadth and depth of the generator network. The subscript g in the notation makes it explicit that the dataset is to be processed by the generator model G.
We denote the real data distribution as P data (X g , Y g ) or P data (·) from which a sample D data is generated.
In the GAN formulation, G is trained to approximate the real data distribution P_data(·) from some random (noise) input. We denote the random input data as Z = [z_1, ..., z_m], and the distribution generating Z as P_Z(Z) or P_Z(·).
The synthetic dataset is denoted as $S_{data} = [\bar{X}_g^k, \bar{Y}_g]$. Here, $\bar{\mathbf{x}}_i \in \mathbb{R}^n$ and $\bar{y}_i \in \mathbb{R}^1$. Again, the superscript k denotes that the synthetic data should have the same order of feature interactions as the original dataset.
As we know, in the GAN formulation the generator generates synthetic data from the noise; in our notation, we express this as S_data ∼ G(Z). The discriminator model D is trained to discriminate between D_data and S_data. To do this, an auxiliary label Y_d = 1 or Y_d = 0 is appended to D_data and S_data respectively, specifying whether a sample belongs to the original or the synthetic data. Formally, the objective function of tabular GAN models leads to solving the min-max adversarial game, which in our notation is expressed as:

$$\min_G \max_D \; \mathbb{E}_{\mathbf{x} \sim P_{data}(\cdot)}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim P_Z(\cdot)}[\log(1 - D(G(\mathbf{z})))]. \quad (1)$$

It can be seen that G(Z) generates the synthetic dataset samples S_data, and D tries to map the synthetic data to a scalar value representing the probability of it being real or not.
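As a numeric illustration of this min-max game, the toy sketch below (hypothetical discriminator outputs, not any model's actual scores) evaluates the value function for two mini-batches: a confident discriminator attains a higher value than one the generator has fooled.

```python
import math

def gan_value(d_real, d_synth):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], estimated over two
    mini-batches of discriminator outputs (probabilities of 'real')."""
    term_real = sum(math.log(p) for p in d_real) / len(d_real)
    term_synth = sum(math.log(1 - p) for p in d_synth) / len(d_synth)
    return term_real + term_synth

# D scores real rows high and synthetic rows low: high value for D.
confident = gan_value(d_real=[0.9, 0.8], d_synth=[0.1, 0.2])
# G has fooled D into outputting 0.5 everywhere: the value drops.
fooled = gan_value(d_real=[0.5, 0.5], d_synth=[0.5, 0.5])
```

The discriminator ascends this value while the generator descends it, which is the adversarial dynamic all the tabular GAN models above share.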
In the following, we will discuss how to use BN_e as the generator and BN_d as the discriminator, leading to our GANBLR formulation. A list of all the symbols used in this work is given in Table 1.

GANBLR-components
The generator in our proposed formulation deviates from the vanilla GAN, as it has two roles to play:
• Its first role is to learn the parameters of BN_e. In doing so, it learns the weights by optimizing the discriminative objective function, while fulfilling the probability constraints on the weights. We denote the generator in this training role as G̃. Note, the input to G̃ is the original data D_data, i.e. we can write G̃(D_data).
• The second role of the generator is to sample data after the discriminative Bayesian Network BN_e is trained. Since the optimized parameters are conditional probabilities, we can now use BN_e in generative mode to sample the synthetic data samples S_data. We denote this sampling role as Ḡ. The input to this role of the generator is null, i.e. we can write Ḡ(·). Note, this formulation deviates from existing tabular GAN models, as our generator does not generate from a random noise distribution.
The two roles of the generator in GANBLR work seamlessly in an overall adversarial training framework: first, the generator operates in the training role (G̃), optimizing its weights discriminatively under some constraints, and then shifts to the sampling role (Ḡ) for synthetic dataset generation. For the sake of simplicity, we will use the notation G for the generator in cases where its role is clear from the context. The discriminator D in GANBLR is again a Bayesian Network, BN_d (which is trained discriminatively, but with no constraints on the weights). It learns to distinguish between D_data and S_data. The loss from D is back-propagated to the generator G for the improvement of the synthetic data generation. Let us delve into the details of each component of GANBLR.

Generator G
Let us establish the form of the generator first. In GANBLR, we recommend using the restricted Bayesian network model KDB. Although any form of Bayesian network can be used in the GANBLR framework, restricted Bayesian networks have some advantages. First, the structure can be configured easily by specifying the hyper-parameter, i.e. the number of parents (k). Therefore, GANBLR only focuses on parameter learning given the structure. Secondly, since a BN with immoral nodes 6 can lead to non-convex problems according to [22], using a restricted Bayesian network decreases the chances of obtaining immoral nodes. For example, with k < 2, we do not have the problem of immoral nodes. However, with k ≥ 2, discriminative training of a BN can lead to non-convex optimization. 7 The KDB model uses mutual information and conditional mutual information to learn the structure. A typical feature interaction in a KDB model includes the feature itself, the target feature, and the feature's parent(s). As discussed earlier, we wish to train KDB discriminatively with some constraints fulfilled, leading to the KDB_e formulation. However, we will use the term BN_e instead (for the sake of generality).
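The count statistics underlying KDB's structure learning can be sketched as follows (a hypothetical pure-Python helper): mutual information I(X; Y) is estimated from empirical counts, and features with higher scores are ranked earlier, each feature then taking up to k of its better-ranked predecessors as parents.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in nats, from paired samples."""
    n = len(xs)
    p_x, p_y = Counter(xs), Counter(ys)
    p_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), c in p_xy.items()
    )

# Toy data: feature A perfectly predicts y, feature B is independent noise,
# so A is ranked first and would be preferred as a parent.
y = [0, 0, 1, 1]
A = [0, 0, 1, 1]
B = [0, 1, 0, 1]
```

In the full KDB procedure the analogous conditional mutual information I(X_i; X_j | Y) then decides which of the higher-ranked features become feature i's parents; this snippet only shows the unconditional statistic.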
The generator in GANBLR optimizes the following two objective functions:
• Maximizing G̃(D_data), which is the conditional log-likelihood of the form log P(Y_g|X_g^k), and
• Minimizing the loss log(1 − D(Ḡ(·))), i.e. log(1 − D(S_data)).
Instead of minimizing log(1 − D(S_data)), we can equivalently maximize −log(1 − D(S_data)), which leads to the objective function for the generator G:

$$\max_{\theta_g} \; \log P(Y_g|X_g^k) - \log(1 - D(S_{data})). \quad (2)$$

Note, just like in vanilla GAN models, the −log(1 − D(S_data)) part of the objective function is not involved while training the parameters of G, as the discriminator D is fixed during G's optimization. Let us focus on the generator parameters (θ_g) that are to be learned in Eq. 2. For this, we write log P(Y_g|X_g^k), for a single instance (x, y), as:

$$\log P(y|\mathbf{x}) = \log \theta_y + \sum_{i=1}^{n} \log \theta_{x_i|y,\pi_{x_i}} - \log Z. \quad (3)$$

Here, θ_y denotes the weight associated with the class (it can be considered the class-prior or the intercept term), x_i denotes the feature value of the i-th feature, and π_{x_i} denotes the set of feature values of those features which are feature i's parents. Note, y denotes the class value, and the class is also a parent of each feature. Since the Bayesian Network structure leads to conditional probabilities, our notation is symbolic, as we represent a weight in our network as θ_{x_i|y,π_{x_i}}. In practice, GANBLR has a parameter θ associated with each interaction. The quantity $Z = \sum_{y'} \theta_{y'} \prod_{i=1}^{n} \theta_{x_i|y',\pi_{x_i}}$ is the normalization term that makes sure P(Y_g|X_g^k) lies between 0 and 1. Notably, GANBLR enforces constraints on θ_g, making sure that:

$$\sum_{j=1}^{|X_i|} \theta_{x_j|y,\pi_{x_i}} = 1, \quad (4)$$

where |X_i| represents the cardinality of feature i, and x_j represents its j-th feature value. Additionally, the following constraint is satisfied:

$$\sum_{y} \theta_y = 1. \quad (5)$$

Once the BN_e weights are trained, the second role of the generator, Ḡ, that is, to generate the data, begins. One can generate the synthetic data S_data by using forward sampling [23]. One can set the size (m) of the synthetic dataset in forward sampling, whereas the feature interaction order k is expected to be the same as that of generator G's input, i.e. D_data. The sampling process of generator G can be expressed as:

$$S_{data} \sim \bar{G}(\cdot). \quad (6)$$
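For the simplest case k = 0 (naive Bayes structure), the sampling role Ḡ(·) reduces to ancestral (forward) sampling: draw y from θ_y, then each feature from θ_{x_i|y}. The sketch below uses hypothetical parameter tables as stand-ins for trained BN_e weights.

```python
import random

theta_y = {0: 0.4, 1: 0.6}                       # class prior, sums to 1
theta_x_given_y = [                               # one table per feature
    {0: {"a": 0.9, "b": 0.1}, 1: {"a": 0.2, "b": 0.8}},
    {0: {"u": 0.5, "v": 0.5}, 1: {"u": 0.7, "v": 0.3}},
]

def draw(dist, rng):
    """Draw one value from a {value: probability} table."""
    r, acc = rng.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point round-off

def forward_sample(m, seed=0):
    """Generate m synthetic rows (x_1, x_2, y) by ancestral sampling."""
    rng = random.Random(seed)
    rows = []
    for _ in range(m):
        y = draw(theta_y, rng)                    # sample the class first
        x = tuple(draw(table[y], rng) for table in theta_x_given_y)
        rows.append((*x, y))
    return rows

synthetic = forward_sample(1000)
```

Because every θ is an actual probability (constraints 4 and 5), the sampled rows follow the distribution the generator encodes, which is exactly why the weight constraints of BN_e matter for generation. For k > 0, each feature would additionally condition on the sampled values of its parent features.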

Discriminator D
The discriminator D determines the quality of the synthetic data S_data during training, and then, of course, back-propagates the error to the generator G. Note, as we discussed, the generator G in GANBLR receives the loss from the discriminator D to adjust its weights θ_g. The input for the discriminator D is formed from [D_data, Y_d = 1] and the synthetic data [S_data, Y_d = 0]. In GANBLR, the discriminator D is again a Bayesian Network model (BN_d) trained to optimize the CLL, with the hyper-parameter k the same as that of the generator's Bayesian Network (BN_e). The training of discriminator D aims to maximize:

$$\max_{\theta_d} \; \mathbb{E}_{\mathbf{x} \sim D_{data}}[\log D(\mathbf{x})] + \mathbb{E}_{\bar{\mathbf{x}} \sim S_{data}}[\log(1 - D(\bar{\mathbf{x}}))]. \quad (7)$$

The complete algorithm of GANBLR is provided in Algorithm 1, and we depict its architecture in Fig. 1. Over a total of Q iterations (epochs), the input to GANBLR is used to train G̃ while fixing the discriminator D; afterwards, the discriminator D is trained to discriminate between the synthetic and real datasets.
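For k = 0, the discriminator BN_d is equivalent to logistic regression over the auxiliary labels. The sketch below (numeric Gaussian blobs standing in for D_data and S_data, and plain gradient ascent on the CLL; an illustration, not the paper's TensorFlow implementation) shows such a discriminator learning to separate real from synthetic rows.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(1.0, 0.5, size=(100, 2))    # stand-in for D_data, Y_d = 1
synth = rng.normal(-1.0, 0.5, size=(100, 2))  # stand-in for S_data, Y_d = 0
X = np.vstack([real, synth])
Yd = np.concatenate([np.ones(100), np.zeros(100)])

# Plain gradient ascent on the conditional log-likelihood of Y_d.
w, b = np.zeros(2), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # P(Y_d = 1 | x)
    w += 0.05 * (X.T @ (Yd - p)) / len(X)
    b += 0.05 * float(np.sum(Yd - p)) / len(X)

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
accuracy = float(np.mean((p > 0.5) == (Yd == 1)))
```

In GANBLR, the probabilities 1 − P(Y_d = 1 | x̄) assigned to the synthetic rows are what gets fed back to the generator as its adversarial loss term.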

GANBLR-no adversarial learning
It can be seen from Algorithm 1 that GANBLR can still fulfil its goal of synthesizing data without the adversarial learning component (i.e. D). In practice, we can remove D from GANBLR, leading to a variant configuration that we call GANBLR-nAL. However, we argue that having an adversarial learning component leads to a much better data generation model, as we will discuss in Sect. 4.5, where we compare the performance of GANBLR with that of GANBLR-nAL. Nonetheless, the GANBLR-nAL algorithm is provided in Algorithm 2.

GANBLR-summary
Let us briefly discuss two salient features of GANBLR. From Algorithm 1, it can be seen that GANBLR can generate the synthetic dataset S_data using Eq. 6. As we mentioned in Sect. 2, a desirable property of tabular generation models is generating synthetic data with a particular feature value (e.g. to resolve the imbalanced dataset limitation). Of course, GANBLR's generator can simply sample a synthetic dataset with the specified feature value via rejection sampling [24], which can be effective for imbalanced data generation. Additionally, the learned parameters in GANBLR are actual probabilities, which can be used to interpret the generation process.

Algorithm 2: Algorithm GANBLR-nAL
Input: X_g^k, Y_g
Output: Synthetic data X̄_g^k, Ȳ_g
1 for iteration q ∈ Q in training do
2   Sample m instances (X_g^k, Y_g) ∼ P_data(·)
3   Obtain θ_g by optimizing G̃ via Eq. 2 with gradient descent
4   Generate S_data via Eq. 6
5 return S_data ≡ X̄_g^k, Ȳ_g
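The rejection-sampling idea for class-conditional generation can be sketched as follows (with a hypothetical imbalanced generator standing in for Ḡ(·)): draw rows and keep only those whose class matches the requested value.

```python
import random

def sample_row(rng):
    """Stand-in for Ḡ(·): returns (x, y) with an imbalanced class prior."""
    y = 1 if rng.random() < 0.1 else 0    # minority class y = 1
    x = rng.randint(0, 4)
    return x, y

def rejection_sample(target_y, m, seed=0):
    """Keep drawing rows until m rows with the requested class are collected."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < m:
        x, y = sample_row(rng)
        if y == target_y:                 # reject non-matching rows
            rows.append((x, y))
    return rows

minority = rejection_sample(target_y=1, m=50)
```

The expected number of draws grows with the rarity of the requested value (about m / P(y = target) draws), so this simple scheme is practical for oversampling minority classes but wasteful for extremely rare feature values.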

Experiment and analysis
Let us discuss the effectiveness of GANBLR for synthetic data generation in this section. We consider 15 commonly used datasets to compare GANBLR's performance with three SOTA GAN models for tabular data generation. Specifically, we will evaluate the effectiveness of GANBLR in terms of:
• Machine learning utility, which reflects the quality of the synthetic data.
• Statistical similarity, which measures the statistical similarity between the synthetic and the real data.
• Interpretability, which shows the interpretable capability of GANBLR.
Both the machine learning utility and the statistical similarity are standard measures to determine the quality of data generation algorithms [9]. Moreover, we perform various ablation studies to study:
• The effect of GANBLR's hyper-parameter k, and
• The effectiveness of the adversarial component of GANBLR, by comparing GANBLR with GANBLR-nAL.
We will also study the efficacy of GANBLR by comparing its performance on two synthetic datasets. The best results are highlighted in bold font in our experiments.

Datasets
We use 15 commonly used classification datasets and 2 synthetic datasets. Of the 15 datasets, 13 are from the UCI dataset repository and 2 are from Kaggle, namely Credit and Loan. The two synthetic datasets are generated based on the Poker-hand dataset. All these datasets have a specific dependent variable and a set of independent features. Among them, 5 are large datasets with more than 50K instances (denoted Large), 5 are medium-sized with between 15K and 50K instances (denoted Medium), and the remaining 5 have fewer than 15K instances (denoted Small). The details of the datasets are summarized in Table 2.

Baselines and evaluation metric
We compare GANBLR with CTGAN, TableGAN and MedGAN. All baseline methods are trained for 150 epochs on the 5 Large datasets, and 100 epochs on the Medium and Small datasets. Each experiment is repeated 3 times with 2-fold cross-validation, and averaged results are reported. As Table 2 shows, most datasets have more than 2 classes; hence, we report accuracy (instead of the widely used AUC measure).
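The evaluation protocol above (3 repetitions of 2-fold cross-validation, reporting averaged accuracy) can be sketched with scikit-learn; this is an illustrative sketch rather than the paper's exact harness, and any classifier instance can be passed in.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

def repeated_cv_accuracy(clf, X, y, repeats=3, folds=2):
    """Average accuracy over `repeats` independent runs of
    `folds`-fold cross-validation, each with a different shuffle."""
    scores = []
    for r in range(repeats):
        kf = KFold(n_splits=folds, shuffle=True, random_state=r)
        scores.extend(cross_val_score(clf, X, y, cv=kf, scoring="accuracy"))
    return float(np.mean(scores))
```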

Configuration and running environment
The parameter k in the experiments is set to 0; that is, the Bayesian network in the generator is naive Bayes, and the discriminator is logistic regression. In Sect. 4.5, we study the effect of varying the value of k. GANBLR is coded in Python 3.7 with the TensorFlow 2.5 framework, and run on an 8-core Intel i8 CPU machine with 32 GB of RAM.

Machine learning utility
Machine learning utility refers to the accuracy obtained from a machine learning model [9]. In the common scenario, both the training data and the testing data are real; however, to evaluate data generator models, the training data is synthetic and the testing data is real. More precisely, we use the following two settings to assess machine learning utility:
• TSTR: Training on Synthetic data, Testing on Real data; accuracy is reported.
• TRTR: Training on Real data, Testing on Real data; accuracy is reported.
To obtain the TSTR and TRTR performance of GANBLR and the competing baseline methods, we use four commonly used machine learning classification algorithms. Note that TSTR is used to report the quality of the synthetic data generated by the proposed GANBLR and the baseline methods, whereas TRTR only indicates the ideal machine learning utility. Figure 2 illustrates the evaluation process for machine learning utility. For TSTR, we first split the real datasets into real training and real testing sets; the real training sets are used as the input for training GANBLR and its baselines. Once training is complete, synthetic datasets are generated. These synthetic training datasets are then used to train the above-mentioned classification algorithms, which are evaluated on the real testing datasets. The TSTR result not only shows the realistic machine learning utility of all compared methods, but also answers the question: "Can synthetic data be used as a substitute for real data without significantly impacting the performance of machine learning tasks?". Ideally, the higher the TSTR accuracy (high machine learning utility), the better the data generation algorithm. In contrast, TRTR trains the classification algorithms on the real training datasets and evaluates them on the real testing datasets; it is included to highlight the ideal machine learning utility.
Note, we are interested in data generation methods which have higher values of TSTR.
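A minimal sketch of the TSTR/TRTR protocol, assuming scikit-learn; clf_factory is a hypothetical stand-in for any of the four classifiers used in the paper, and the decision tree default is purely illustrative.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def tstr_trtr(X_real, y_real, X_syn, y_syn,
              clf_factory=DecisionTreeClassifier, seed=0):
    """Return (TSTR, TRTR) accuracies for one dataset and one classifier."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_real, y_real, test_size=0.5, random_state=seed)
    # TRTR: train on real, test on real -- the ideal utility reference
    trtr = accuracy_score(y_te, clf_factory().fit(X_tr, y_tr).predict(X_te))
    # TSTR: train on synthetic, test on real -- the quality measure
    tstr = accuracy_score(y_te, clf_factory().fit(X_syn, y_syn).predict(X_te))
    return tstr, trtr
```

The closer TSTR comes to TRTR, the better the synthetic data substitutes for the real training data.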

Statistical similarity
Two metrics are used to quantitatively measure the statistical similarity between the real datasets and the synthetic datasets generated by GANBLR and its baseline methods:
• Jensen-Shannon Divergence (JSD). The JSD quantifies the difference between the probability mass distributions of each categorical feature in the real and synthetic datasets; it is bounded between 0 and 1.
• Wasserstein Distance (WD). Similarly, WD captures the earth-mover's distance between the real and synthetic features.
Note that we use distance as a proxy for similarity; therefore, the lower the distance, the higher the similarity. We are after data generation methods that lead to lower values of JSD and WD.
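Both metrics can be sketched per feature with SciPy. This assumes integer-encoded categorical features; note that SciPy's jensenshannon returns the JS distance, so it is squared here to obtain the divergence bounded in [0, 1].

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def similarity_scores(real_col, syn_col):
    """Per-feature JSD (on category frequencies) and Wasserstein distance
    between one real column and the corresponding synthetic column."""
    cats = np.union1d(real_col, syn_col)
    p = np.array([(real_col == c).mean() for c in cats])  # real frequencies
    q = np.array([(syn_col == c).mean() for c in cats])   # synthetic frequencies
    jsd = jensenshannon(p, q, base=2) ** 2  # divergence in [0, 1]
    wd = wasserstein_distance(real_col, syn_col)
    return jsd, wd
```

Averaging these per-feature scores over all features (and datasets) yields the aggregate numbers reported in the tables.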

Results analysis on synthetic datasets
In this subsection, two synthetic datasets are used to evaluate the effectiveness of GANBLR in modelling high-order feature interactions within the data. A higher accuracy on these datasets indicates that a model has a superior capability to capture higher-order interactions. The two synthesized datasets are based on the Poker-hand dataset, in which each instance represents a poker hand consisting of five cards; there are 4 suits, each with 13 ranks. Clearly, a model must handle order-5 interactions to learn to distinguish a particular type of hand. The first synthesized dataset, labelled Synthetic1, is a four-hand version of the original Poker-hand dataset: each hand consists of 4 cards, with the same 4 suits and 13 ranks as the original. This four-hand variant requires an order-4 model to capture the interactions needed to distinguish each hand.
The second synthetic dataset, Synthetic2, is a six-hand version of the original Poker-hand dataset: each hand consists of 6 cards, with the same number of suits and ranks as the original Poker-hand dataset. Again, this six-hand version requires an order-6 model to capture the interactions needed to distinguish each hand.
To synthesize these two datasets, we designed the following procedure:
• First, decide the version of the synthetic Poker-hand, i.e. either Synthetic1 or Synthetic2.
• Identify the rules of each class from Poker-hand. For example, Full house is not available in the four-hand synthetic Poker-hand.
• Uniformly sample the cards for each hand while enforcing the rules of each class.
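A simplified sketch of this sampling procedure, with a deliberately reduced labelling rule (flush / pair / nothing) standing in for the full set of Poker-hand class rules used to build the actual datasets:

```python
import itertools
import random

def sample_hand(n_cards=4, rng=random):
    """Uniformly sample n_cards distinct (suit, rank) cards
    from a 4-suit x 13-rank deck."""
    deck = list(itertools.product(range(4), range(13)))
    return rng.sample(deck, n_cards)

def label_hand(hand):
    """Toy labelling rule: 2 = flush (all one suit),
    1 = at least one pair of ranks, 0 = nothing."""
    suits = [s for s, _ in hand]
    ranks = [r for _, r in hand]
    if len(set(suits)) == 1:
        return 2
    if len(set(ranks)) < len(ranks):
        return 1
    return 0
```

For Synthetic2 one would call sample_hand(6); the real procedure samples conditionally on the target class so that every class is represented.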

Synthetic1: Four-hand Poker-hand
In this part of the experiment, we compare the performance of GANBLR with the current SOTA, CTGAN. Note that the purpose of this experiment is to evaluate how well GANBLR performs on data requiring order-4 feature interactions. Table 3 shows the machine learning utility (TSTR) of GANBLR with different k values. It can be seen that GANBLR has better accuracy than CTGAN on the Synthetic1 dataset, achieving an accuracy of 85.61% on XGBT with k = 2, and an accuracy of 83.20% on RF with k = 2. The results suggest that the BN_e in GANBLR has a strong capability to capture higher-order feature interactions in the Synthetic1 dataset. Note that the advantage of GANBLR over CTGAN grows with larger values of k. Table 4 shows the statistical similarity of GANBLR with varying values of k. It can be seen that GANBLR has better statistical similarity than CTGAN. The results also reveal that, as the feature interaction level (k) increases, the capability of BN_e to capture higher-order interactions enables GANBLR to produce higher-quality datasets.

Synthetic2: Six-hand Poker-hand
Let us now compare the performance of GANBLR with CTGAN on the six-hand Poker-hand dataset. Again, we evaluate performance in terms of machine learning utility and statistical similarity. Table 5 shows the machine learning utility of GANBLR for different values of k. It can be seen that GANBLR always performs better than CTGAN in terms of machine learning utility. In particular, when the value of k is increased from 0 to 1, the performance of GANBLR with XGBT is substantially boosted from an accuracy of 53.11% to 72.39%. This result is extremely encouraging and indicates that GANBLR can effectively leverage the higher-order feature interactions in the six-hand Poker-hand dataset. The statistical similarity on the six-hand Poker-hand dataset is evaluated to assess the quality of the generated synthetic dataset. Table 6 presents the statistical similarity of GANBLR with different k values. As with the statistical similarity results on the four-hand Poker-hand dataset, GANBLR outperforms CTGAN in terms of both JSD and WD distances, and the best results, as expected, are achieved with higher values of k.

Table 7 provides the averaged machine learning utility results, in terms of TSTR and TRTR, on the Large, Medium and Small datasets. It is clear that GANBLR outperforms all other baseline methods in terms of TSTR. Particularly on the Small and Medium datasets, GANBLR performs significantly better than the other baselines. It is interesting to see that GANBLR's TSTR performance is close to its TRTR performance, while none of the other baseline methods achieve TSTR performance close to TRTR. Similar findings can be drawn from Table 9, which provides detailed TSTR performance for all datasets. These results are extremely encouraging, as they demonstrate that the synthetic data generated by GANBLR is far more useful for machine learning tasks than that of any other existing SOTA data generation algorithm.
While Table 7 provides averaged results, let us look at the distribution of accuracies across the different datasets. In Fig. 3, we plot box plots of the TSTR accuracy of GANBLR and its baseline methods for the 4 machine learning algorithms. Again, we break the results down by Large, Medium and Small datasets, and plot TRTR for comparison. It can be seen that, regardless of dataset size, GANBLR significantly outperforms all the baseline methods. In particular, for Small and Medium datasets, the performance of GANBLR is extremely impressive, as the box plots of GANBLR (red) closely match those of TRTR (orange).

Statistical similarity
To obtain the JSD results for a dataset, each feature in the synthetic dataset is measured against the same feature in the real dataset in terms of JSD. We repeat this process for all features and all datasets, and report the averaged results in Table 8, which, as discussed, can be seen as a measure of statistical similarity between the synthetic and original datasets. It can be seen from Table 8 that GANBLR stands out when compared to the other baseline methods: if it is not the best, it is always the second best. In particular, on Small datasets, GANBLR has smaller JSD and WD values than all competing baselines, highlighting that it produces datasets of superior quality. On Medium datasets, GANBLR's JSD performance is similar to CTGAN's (second best), while its WD performance is similar to TableGAN's (again second best). On Large datasets, GANBLR has the best performance in terms of WD, though it marginally loses to CTGAN in terms of JSD.
Delving into why GANBLR has sub-optimal results on Medium datasets, we believe this could be due to GANBLR's generator, a Bayesian network, generating some feature values that are rarely seen in the real dataset. Clearly, statistical similarity evaluation based on JSD and WD does not give credit for generating data that is absent from the real dataset but useful for the classification task.
We conjecture that another reason could be the discriminative training of BN_e, which aims to produce features that enhance the discriminative power of a BN and, therefore, can produce slightly different datasets (Table 9). We further ran a t-test to assess the significance of the similarity results in Table 8. It can be seen from Table 10 that the statistical similarity results are significant between GANBLR and all other baselines.
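The significance test can be sketched with SciPy's paired t-test over per-dataset similarity scores; the values in the test below are hypothetical placeholders, not the actual numbers from Table 8.

```python
from scipy.stats import ttest_rel

def paired_significance(scores_a, scores_b, alpha=0.05):
    """Paired t-test over per-dataset scores (e.g. JSD) of two generators.
    Returns the t statistic, the p value, and whether p < alpha."""
    t, p = ttest_rel(scores_a, scores_b)
    return t, p, p < alpha
```

A paired test is appropriate here because both generators are evaluated on the same set of datasets.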

Interpretation analysis
Let us study the interpretable capability of our proposed model. We believe that good interpretation in any tabular data generator should:
• Provide local interpretability, with a clear understanding of why a synthetic data point belongs to the generated synthetic label at any time during training. For example, given a generated synthetic data instance, the probability of each possible synthetic label should be provided.
• Provide global interpretability on how the features generally impact synthetic label generation. For example, which feature has the largest impact on the synthetic label.

Local interpretability
As discussed in Sect. 3, BN_e is a discriminatively trained Bayesian network. Therefore, the learned parameters (θ) are actually conditional probabilities of the form P(x | y, Π(x)), where Π(x) denotes the function that returns the parents of feature x. One nice property of probabilities is that they are highly interpretable, and having access to these parameters during discriminative training gives GANBLR the capability for interpretable learning. For example, during training, one can interpret the importance of features in determining the value of the class, based on the posterior probability P(y | x, Π(x)). In particular, one can determine the importance of a feature or feature-set for class y as:

I_y(x, Π(x)) ∝ P(y | x, Π(x)).

That is, the importance I_y(x, Π(x)) is proportional to the conditional probability P(y | x, Π(x)), which can be obtained by performing inference on BN_e, the Bayesian network actually learned by the generator G. Note that when k = 0, the BN_e from generator G is naive Bayes and the features have no interactions, so Π(x) = ∅, i.e. the parent set is empty; in this case we have I_y(x) ∝ P(y | x). To investigate the interpretation capability of GANBLR, we used the CAR dataset for multi-class classification with GANBLR at k = 0 and k = 1. Table 11 shows the feature-set importance for 3 instances with k = 0 and 2 instances with k = 1; these 5 instances were randomly selected from the CAR synthetic dataset to demonstrate local interpretability. To assess the credibility of GANBLR's local interpretability, we employed the popular method LIME, which explains why an instance's features lead to a certain class prediction. Notably, when k = 0, the features of an instance have no interactions, and therefore Π(x) is empty. For Instance 1 (Table 11), the score of each feature and class is listed in the table; note that the true class is y = 0, which is also established by GANBLR, as the probability of class y = 0 is the highest (shown in bold).
Moreover, for Instance 1, the features Safety = 0 and Persons = 1 contribute the most to this decision (probabilities shown in bold). The LIME results in Fig. 4 for Instance 1 agree, showing that Safety = 0 and Persons = 1 are the top two contributors. Similarly, for Instance 2 and Instance 3, the top contributors in Table 11 are the same as those identified by LIME in Fig. 4, which demonstrates that GANBLR produces results similar to those of LIME. Note that for Instance 2 and Instance 3, the true classes are 3 and 2, respectively, which are also established by GANBLR, as can be seen from the bold probabilities in the table.
Instances 4 and 5 in Table 11 depict the case with k = 1. Here, Maint = 2, Buying = 3 represents an (x, Π(x)) pair, i.e. x = Maint and Π(x) = Buying; note that for x = Doors, Π(x) = ∅. For Instance 4, GANBLR makes the decision y = 0, and the features Persons = 1 and Safety = 0 contribute the most to this decision. Interestingly, the same finding is observed when running LIME on Instance 4, i.e. the highest probability is assigned to y = 0, and the features Persons = 1 and Safety = 0 contribute the most. For Instance 5, GANBLR's decision of class y = 3 is based on Persons = 3 and Safety = 3; the LIME experiment yields the same finding on the feature contributions.
Based on the above comparisons of the interpretability of GANBLR and LIME, we found a consistent pattern of agreement between them, which indicates that the local interpretability of GANBLR is highly reliable even during the training phase and is equivalent to that of LIME, which interprets the model only after training.
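For the k = 0 (naive Bayes) case, the importance computation I_y(x) ∝ P(y | x) can be sketched as follows. The parameter layout (a log-prior vector plus per-feature conditional log-probability tables) is an assumed representation of the learned θ for illustration, not GANBLR's actual data structures.

```python
import numpy as np

def local_importance(log_prior, log_cond, instance):
    """Per-feature contribution to P(y | x) under naive Bayes (k = 0).

    log_prior: array (n_classes,) of log P(y)
    log_cond:  list of arrays, one per feature, each (n_classes, n_values),
               holding log P(x_i = v | y) -- the learned theta parameters
    instance:  list of observed feature values
    Returns (posterior over classes, per-feature per-class contributions).
    """
    log_post = log_prior.copy()
    contrib = np.zeros((len(instance), len(log_prior)))
    for i, v in enumerate(instance):
        contrib[i] = log_cond[i][:, v]  # this feature's evidence for each class
        log_post += contrib[i]
    post = np.exp(log_post - log_post.max())  # normalize in a stable way
    post /= post.sum()
    return post, contrib
```

The features with the largest contributions for the winning class correspond to the bold entries discussed for Table 11.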

Global interpretability
In Fig. 5, the global interpretability of GANBLR at different training stages is illustrated by plotting the weights learned by the generator in GANBLR. A darker colour means a bigger impact, and a lighter colour a smaller impact, on the corresponding class. It can be seen that features impact different class labels differently during the training phase, as shown at epoch = 1, epoch = 50 and epoch = 100. The features Persons and Safety have the largest impact on synthetic label 0; the features Maint, Persons and Safety have the largest impact on synthetic label 1; and the features Persons, Lug_boot and Safety have the largest impact on synthetic label 2. However, the feature Safety has far more impact on synthetic label 3 than the other 5 features; Safety is therefore the most important factor in deciding whether a car has high value, which is the meaning of class = 3. Again, the purpose of this analysis is to demonstrate GANBLR's global interpretable capability.
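Ranking features per class from the learned generator weights, as one would read off Fig. 5, can be sketched as follows; weights is a hypothetical (n_features, n_classes) matrix standing in for the generator's learned parameters.

```python
import numpy as np

def rank_features(weights, feature_names):
    """For each class, rank features by the magnitude of their learned weight.
    weights: (n_features, n_classes) array; returns {class: [feature, ...]}."""
    order = np.argsort(-np.abs(weights), axis=0)  # descending by |weight|
    return {c: [feature_names[i] for i in order[:, c]]
            for c in range(weights.shape[1])}
```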

Ablation analysis
To illustrate the impact of hyper-parameter k and of the discriminative component on GANBLR, we conduct ablation studies by changing the configuration of GANBLR as follows:
• GANBLR-nAL: As discussed in Sect. 3.4, GANBLR-nAL does not include the discriminator, and the generator has a slightly simplified objective function, i.e. it is trained solely by maximizing log(P(Y_g | X_g^k)), as shown in Algorithm 2.
• k = 0: The Bayesian network generator in GANBLR has k = 0, i.e. we use a naive Bayes model, so no feature interactions are modelled.
• k = 1: The generator is a Bayesian network with k = 1, i.e. order-1 feature interactions are modelled.
• k = 2: The generator is a Bayesian network with k = 2, i.e. order-2 feature interactions are modelled.
We compare the performance of GANBLR and GANBLR-nAL using the same strategy we used to compare GANBLR with the other competing baselines. It can be seen from Table 12 that GANBLR has better average performance than GANBLR-nAL, especially on Large datasets, for various values of k, demonstrating its superior machine learning utility. GANBLR is better than GANBLR-nAL in all settings except k = 1 on Medium and k = 2 on Small datasets. This highlights the significance of the adversarial component in the GANBLR formulation. Nonetheless, the comparison also highlights the usefulness of GANBLR-nAL as an effective sampling method that does not employ game-theoretic adversarial learning. Table 12 also presents the impact of various values of k on GANBLR. As expected, higher values of k lead to better results on Large and Medium datasets. However, on Small datasets, k = 1 generally leads to superior performance. An obvious reason is that k = 2 might over-fit on the small datasets, with the traditional bias-variance trade-off coming into effect. Interestingly, this study reveals the suitability of GANBLR's hyper-parameter settings for datasets of various sizes.
For statistical similarity, Table 13 shows that GANBLR achieves a much smaller JSD distance between the generated synthetic data and the real data than GANBLR-nAL. In particular, on Large datasets with k = 2, GANBLR can generate high-quality synthetic data with strong similarity. Similar results hold for the WD distance, where GANBLR generally performs better than GANBLR-nAL (Table 14). These findings indicate that the adversarial component in GANBLR helps significantly in generating higher-quality synthetic data, especially with higher values of k.

Conclusion
In this work, we presented a novel technique to generate tabular data utilizing the GAN strategy. Our proposed GANBLR framework relies on discriminatively trained Bayesian networks as both the generator and the discriminator, and learns by optimizing a game-theoretic objective function. We showed that GANBLR not only advances the existing SOTA GAN-based models but also provides a framework with excellent interpretability during training. We evaluated the data generation performance of GANBLR by comparing it against several SOTA baselines, analysing its performance in terms of machine learning utility as well as statistical similarity. The results show that the synthetic datasets generated by GANBLR achieve the best machine learning utility, with statistical similarity comparable to SOTA methods. These remarkable results demonstrate GANBLR's potential for a wide range of tabular data generation and augmentation applications in sectors such as banking, insurance and health. We highlight some future directions:
• We have constrained ourselves to Bayesian networks with k ≤ 2 in this work; we are interested in how GANBLR's performance varies as the value of k is increased further.
• We have focused mainly on restricted BN models, i.e. KDB models, in the current GANBLR formulation; we are keen to study the model with unrestricted Bayesian network models.
• Enhancing GANBLR to generate numerical attributes is another direction we are exploring.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.