Abstract
This article introduces a Bayesian sampling approach for estimating mixture copulas with discrete margins, and applies the resulting models to class-imbalance problems in data science through oversampling. The methodology makes it possible to learn from, and sample from, data sets in which discrete and continuous features coexist. By contrast, the classic SMOTE algorithm does not naturally account for the discreteness of features, and classic random oversampling simply replicates existing points, which gives the classifier no new information and makes it prone to overfitting. Copula methods generate new points that preserve the correlation structure learned from the training set, which reduces overfitting. Following the introduction of the methodology, experiments on synthetic and real data are presented. The results support the validity of the approach when compared with benchmark methods.
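To make the idea of copula-based oversampling concrete, the following is a minimal sketch and not the Bayesian mixture copula estimator of this article. It swaps in a single Gaussian copula fitted by normal scores of ranks, with empirical margins, and the function name `gaussian_copula_oversample` is hypothetical. It assumes every column of the minority-class matrix takes at least two distinct values, and it treats ties in discrete columns crudely through average ranks.

```python
import numpy as np
from scipy.stats import norm, rankdata


def gaussian_copula_oversample(X, n_new, seed=0):
    """Draw n_new synthetic rows from a Gaussian copula fitted to X.

    Simplified illustration: one Gaussian copula, empirical margins.
    X holds the minority-class samples, one row per observation.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # Pseudo-observations in (0, 1); tied values share an average rank.
    U = np.apply_along_axis(rankdata, 0, X) / (n + 1)

    # Correlation of the normal scores approximates the copula correlation.
    R = np.corrcoef(norm.ppf(U), rowvar=False)

    # Sample correlated latent normals and map them back to uniforms.
    Z_new = rng.multivariate_normal(np.zeros(d), R, size=n_new)
    U_new = norm.cdf(Z_new)

    # Invert each empirical margin, so every synthetic value is an observed
    # value of its column and discrete columns stay on their support.
    X_new = np.empty((n_new, d))
    for j in range(d):
        col = np.sort(X[:, j])
        idx = np.clip(np.ceil(U_new[:, j] * n).astype(int) - 1, 0, n - 1)
        X_new[:, j] = col[idx]
    return X_new


# Usage: one continuous feature and one correlated discrete feature.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
k = np.clip(np.round(x + rng.normal(scale=0.5, size=200)), -2, 2)
synthetic = gaussian_copula_oversample(np.column_stack([x, k]), n_new=50)
```

Unlike random oversampling, the synthetic rows are new combinations of feature values rather than copies of existing rows, while the discrete column remains restricted to the levels seen in training.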
Data availability
The experimental data set used for the current study is available in the KEEL repository: https://sci2s.ugr.es/keel/datasets.php.
Ethics declarations
Conflict of interest
None declared.
About this article
Cite this article
Liu, Y., Xie, D., Edwards, D.A. et al. Mixture copulas with discrete margins and their application to imbalanced data. J. Korean Stat. Soc. 52, 878–900 (2023). https://doi.org/10.1007/s42952-023-00226-3
DOI: https://doi.org/10.1007/s42952-023-00226-3