Abstract
Performing data augmentation for mixed datasets remains an open challenge. We propose an adaptation of the Mixed Deep Gaussian Mixture Model (MDGMM) to generate such complex data. The MDGMM explicitly handles the different data types and learns a continuous latent representation of the data that captures their dependence structure and can be exploited to conduct data augmentation. We test the ability of our method to simulate crossings of variables that were rarely observed or unobserved during training. Its performance is compared on the UCI Adult dataset with that of recent competitors relying on Generative Adversarial Networks, Random Forests, Classification And Regression Trees, or Bayesian networks.
Funded by the Research Chair DIALog under the aegis of the Risk Foundation, an initiative by CNP Assurances.
References
1. Buuren, S.V., Brand, J.P., Groothuis-Oudshoorn, C.G., Rubin, D.B.: Fully conditional specification in multivariate imputation. J. Stat. Comput. Simul. 76(12), 1049–1064 (2006)
2. Cagnone, S., Viroli, C.: A factor mixture model for analyzing heterogeneity and cognitive structure of dementia. AStA Advances in Statistical Analysis 98(1), 1–20 (2013). https://doi.org/10.1007/s10182-012-0206-5
3. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (2002)
4. Engelmann, J., Lessmann, S.: Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning. Expert Syst. Appl. 174, 114582 (2021)
5. Feldman, J., Kowal, D.: A Bayesian framework for generation of fully synthetic mixed datasets (2021)
6. Fuchs, R., Pommeret, D., Viroli, C.: Mixed deep Gaussian mixture model: a clustering model for mixed datasets. In: Advances in Data Analysis and Classification, pp. 1–23 (2021)
7. Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.: Retrieval augmented language model pre-training. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 3929–3938. PMLR, 13–18 Jul 2020
8. He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IJCNN 2008, pp. 1322–1328 (2008)
9. Hu, J., Reiter, J.P., Wang, Q., et al.: Dirichlet process mixture models for modeling and generating synthetic versions of nested categorical data. Bayesian Anal. 13(1), 183–200 (2018)
10. Kamthe, S., Assefa, S., Deisenroth, M.: Copula flows for synthetic data generation (2021)
11. Kohavi, R.: Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD 1996, pp. 202–207. AAAI Press (1996)
12. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Statist. 22(1), 79–86 (1951)
13. Lee, S.S.: Noisy replication in skewed binary classification. Comput. Stat. Data Anal. 34(2), 165–191 (2000)
14. Liu, Y., et al.: Wasserstein GAN-based small-sample augmentation for new-generation artificial intelligence: a case study of cancer-staging data in biology. Engineering (2019)
15. Lucic, M., Kurach, K., Michalski, M., Bousquet, O., Gelly, S.: Are GANs created equal? A large-scale study. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 698–707. NIPS 2018, Curran Associates Inc., Red Hook (2018)
16. Menardi, G., Torelli, N.: Training and assessing classification rules with imbalanced data. Data Min. Knowl. Disc. 28(1), 92–122 (2012). https://doi.org/10.1007/s10618-012-0295-5
17. Moreno-Barea, F.J., Jerez, J.M., Franco, L.: Improving classification accuracy using data augmentation on small data sets. Expert Syst. Appl. 161, 113696 (2020)
18. Moustaki, I.: A general class of latent variable models for ordinal manifest variables with covariate effects on the manifest and latent variables. Br. J. Math. Stat. Psychol. 56(2), 337–357 (2003)
19. Moustaki, I., Knott, M.: Generalized latent trait models. Psychometrika 65(3), 391–411 (2000)
20. Murray, J.S., Reiter, J.P.: Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. J. Am. Stat. Assoc. 111(516), 1466–1479 (2016)
21. Nowok, B., Raab, G.M., Dibben, C.: Synthpop: bespoke creation of synthetic data in R. J. Stat. Soft. 74(11), 1–26 (2016). https://doi.org/10.18637/jss.v074.i11
22. Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11(10), 1071–1083 (2018)
23. Ping, H., Stoyanovich, J., Howe, B.: DataSynthesizer: privacy-preserving synthetic datasets. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, SSDBM 2017, Association for Computing Machinery, New York (2017)
24. Richardson, E., Weiss, Y.: On GANs and GMMs. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018)
25. Sun, Y., Cuesta-Infante, A., Veeramachaneni, K.: Learning vine copula models for synthetic data generation. In: AAAI (2019)
26. Viroli, C., McLachlan, G.J.: Deep Gaussian mixture models. Stat. Comput. 29(1), 43–51 (2019)
27. Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: NeurIPS (2019)
Appendix
A Datasets Details
According to the UCI documentation, the variables of the Adult dataset are:
- Income: binary (>50K, <=50K).
- Age: continuous.
- Workclass: categorical (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked).
- Fnlwgt: continuous.
- Education-num: ordinal.
- Marital-status: categorical (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse).
- Occupation: categorical (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces).
- Relationship: categorical (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried).
- Race: categorical (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black).
- Sex: binary (Female, Male).
- Capital-gain: ordinal.
- Capital-loss: ordinal.
- Hours-per-week: continuous.
- Native-country: categorical (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc.), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands).
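For readers who want to reuse this typing in code, the listing above can be written down as a simple mapping from variable name to data type. This is only an illustrative sketch: the dictionary name `ADULT_VARIABLE_TYPES` and the helper `columns_of_type` are hypothetical, and how MIAMI itself expects the types to be declared is not stated here.

```python
# Variable types of the Adult dataset, transcribed from the listing above.
# The names ADULT_VARIABLE_TYPES and columns_of_type are hypothetical helpers,
# not part of any published MIAMI interface.
ADULT_VARIABLE_TYPES = {
    "Income": "binary",
    "Age": "continuous",
    "Workclass": "categorical",
    "Fnlwgt": "continuous",
    "Education-num": "ordinal",
    "Marital-status": "categorical",
    "Occupation": "categorical",
    "Relationship": "categorical",
    "Race": "categorical",
    "Sex": "binary",
    "Capital-gain": "ordinal",
    "Capital-loss": "ordinal",
    "Hours-per-week": "continuous",
    "Native-country": "categorical",
}


def columns_of_type(kind):
    """Return the variable names declared with the given type, in listing order."""
    return [name for name, t in ADULT_VARIABLE_TYPES.items() if t == kind]


if __name__ == "__main__":
    for kind in ("continuous", "ordinal", "binary", "categorical"):
        print(kind, columns_of_type(kind))
```

Grouping the columns this way makes it easy to route each block of variables to a type-specific metric or preprocessing step.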
B Evaluation Metrics Details
Our overall criterion between a test dataset and a generated dataset is the association distance obtained as follows:
where p denotes the number of variables (14 in the Adult dataset), \(P=(p^2-p)/2\), and \(M_{ij}(test)\) (resp. \(M_{ij}(gen)\)) is the (i, j)th entry of the test (resp. generated) Association Matrix.
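A distance of this kind could be computed along the following lines. The sketch takes two precomputed association matrices and compares their P off-diagonal pairs. It rests on two assumptions that are not confirmed by the text: the matrices are symmetric, so each pair (i, j) with i < j is counted once, and the differences are aggregated as a mean of absolute values. The exact aggregation used in the paper may differ, and the function name `association_distance` is hypothetical.

```python
import numpy as np


def association_distance(m_test, m_gen):
    """Compare two p x p association matrices over their P = (p^2 - p) / 2 pairs.

    ASSUMPTION: the mean of absolute differences is an illustrative choice;
    the exact formula is not shown in the text.
    """
    m_test = np.asarray(m_test, dtype=float)
    m_gen = np.asarray(m_gen, dtype=float)
    if m_test.ndim != 2 or m_test.shape[0] != m_test.shape[1]:
        raise ValueError("m_test must be a square matrix")
    if m_test.shape != m_gen.shape:
        raise ValueError("both matrices must have the same shape")
    p = m_test.shape[0]
    if p < 2:
        raise ValueError("at least two variables are needed")
    n_pairs = (p * p - p) // 2             # P in the text
    rows, cols = np.triu_indices(p, k=1)   # each pair (i, j) with i < j, once
    diffs = np.abs(m_test[rows, cols] - m_gen[rows, cols])
    return float(diffs.sum() / n_pairs)


if __name__ == "__main__":
    a = np.eye(3)
    b = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
    print(association_distance(a, b))
```

With p = 14 variables, as in the Adult dataset, the comparison runs over P = 91 pairs.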
To measure the similarity between the dependence structures of the vectors formed by the three continuous variables (Age, Fnlwgt, and Hours-per-week), we used the multivariate Kullback-Leibler divergence [12].
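The text does not say which estimator of the multivariate divergence was used. As a rough illustration only, the sketch below swaps in a deliberately simple technique: it fits a Gaussian to each sample and evaluates the closed-form Kullback-Leibler divergence between the two fitted Gaussians. This is an assumption made for the example, since variables such as Age or Fnlwgt are not Gaussian, and the function name `gaussian_kl` is hypothetical.

```python
import numpy as np


def gaussian_kl(x_test, x_gen):
    """KL(N_test || N_gen) between Gaussians fitted to two samples.

    ASSUMPTION: Gaussian fits are a simplification chosen for this sketch;
    the estimator used in the paper is not specified in the text.
    Each input has shape (n_samples, n_variables).
    """
    x_test = np.asarray(x_test, dtype=float)
    x_gen = np.asarray(x_gen, dtype=float)
    mu0, mu1 = x_test.mean(axis=0), x_gen.mean(axis=0)
    s0 = np.atleast_2d(np.cov(x_test, rowvar=False))
    s1 = np.atleast_2d(np.cov(x_gen, rowvar=False))
    k = mu0.shape[0]
    s1_inv = np.linalg.inv(s1)
    diff = mu1 - mu0
    # slogdet is numerically safer than log(det(...)) for badly scaled data
    _, logdet0 = np.linalg.slogdet(s0)
    _, logdet1 = np.linalg.slogdet(s1)
    return 0.5 * (np.trace(s1_inv @ s0) + diff @ s1_inv @ diff - k
                  + logdet1 - logdet0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    test = rng.normal(size=(500, 3))
    gen = test + 1.0  # same covariance, shifted mean
    print(gaussian_kl(test, test), gaussian_kl(test, gen))
```

Standardising the three variables beforehand is advisable in practice, because Fnlwgt lives on a much larger scale than Age or Hours-per-week.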
For qualitative data, we chose the mean absolute error (MAE) between proportions. More precisely, for the kth intersection of modalities we consider
\[ \mathrm{MAE}(k) = \left| p_k(test) - p_k(gen) \right|, \]
where \(p_k(test)\) (resp. \(p_k(gen)\)) stands for the kth test (resp. generated) proportion. The final MAE is the mean of the MAE(k) over all possible intersections.
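The computation can be sketched as follows for a set of qualitative variables. Each row is a tuple of modalities, one per variable, so that a distinct tuple is one intersection of modalities. The sketch assumes that "all the possibilities" means every intersection observed in either dataset, rather than the full Cartesian product of modalities. The function names are hypothetical.

```python
from collections import Counter


def crossing_proportions(rows):
    """Proportion of each observed intersection of modalities."""
    if not rows:
        raise ValueError("rows must not be empty")
    counts = Counter(tuple(r) for r in rows)
    n = len(rows)
    return {crossing: c / n for crossing, c in counts.items()}


def mae_between_proportions(test_rows, gen_rows):
    """Mean over intersections k of |p_k(test) - p_k(gen)|.

    ASSUMPTION: the mean runs over intersections seen in either dataset;
    an intersection absent from one dataset has proportion 0 there.
    """
    p_test = crossing_proportions(test_rows)
    p_gen = crossing_proportions(gen_rows)
    crossings = set(p_test) | set(p_gen)
    errors = [abs(p_test.get(k, 0.0) - p_gen.get(k, 0.0)) for k in crossings]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    test = [("Female", "Private"), ("Female", "State-gov"),
            ("Male", "Private"), ("Male", "Private")]
    gen = [("Female", "Private"), ("Male", "Private"),
           ("Male", "Private"), ("Male", "State-gov")]
    print(mae_between_proportions(test, gen))
```

Averaging over the full Cartesian product instead would add zero-error terms for intersections absent from both datasets and therefore lower the reported value.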
C Additional Results: Unobserved Marginal Density Reconstruction
When it comes to reconstructing univariate densities, MIAMI generates well-identified unimodal densities, in contrast to DataSynthesizer, which generates flat densities, and SynthPop-CART, which generates multimodal densities (Fig. 8). More precisely, Figs. 8 and 9 show the observed density versus the generated one for Age in the Bivariate and Trivariate Unbalanced designs. MIAMI recovers the correct distribution even though it was not observed in the training set, which indicates that it captures the dependence structure of such a partially unobserved variable well. CTGAN and SynthPop-RF also seem to work well for both designs, whereas DataSynthesizer shows a larger variance. This illustration shows that the methods cannot be clearly ranked by looking at the marginal distributions alone: only a criterion such as the association distance can account for a more complex multivariate dependence structure.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Fuchs, R., Pommeret, D., Stocksieker, S. (2022). MIAMI: MIxed Data Augmentation MIxture. In: Gervasi, O., Murgante, B., Hendrix, E.M.T., Taniar, D., Apduhan, B.O. (eds) Computational Science and Its Applications – ICCSA 2022. ICCSA 2022. Lecture Notes in Computer Science, vol 13375. Springer, Cham. https://doi.org/10.1007/978-3-031-10522-7_9
DOI: https://doi.org/10.1007/978-3-031-10522-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10521-0
Online ISBN: 978-3-031-10522-7
eBook Packages: Computer Science (R0)