Skip to main content
Log in

Mixture copulas with discrete margins and their application to imbalanced data

  • Research Article
  • Published:
Journal of the Korean Statistical Society Aims and scope Submit manuscript

Abstract

This article introduces the approach of using Bayesian sampling to estimate the mixture copula with discrete margins, we further apply our models to solve the class imbalanced problems in data science by oversampling. The methodology makes it possible to learn and sample from the data set with the discrete and continuous features exists simultaneously. On the other hand, the discreetness of factors in a data set are not naturally considered for the classic SMOTE algorithm and classic random oversampling is simply performed by generating the already existing points, which do not give any new information to the classifiers and is easy to overfit. Copula methods enable us to generate new points with the correlation structure memorized by learning from the training set. Hence, the overfitting problems are reduced. Experiments with synthetic and real data are done in the article following the introduction of the methodology. The outcomes shows the validity of the approach when compared with the benchmark methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Algorithm 1
Algorithm 2
Fig. 1

Similar content being viewed by others

Data availability

The experimental data set used for the current study is available in the KEEL repository: https://sci2s.ugr.es/keel/datasets.php.

References

  • Alam, T. M., Shaukat, K., Hameed, I. A., Luo, S., Sarwar, M. U., Shabbir, S., Li, J., & Khushi, M. (2020). An investigation of credit card default prediction in the imbalanced datasets. IEEE Access, 8, 201173–201198.

    Google Scholar 

  • Alcalá-Fdez, J., Sanchez, L., Garcia, S., del Jesus, M. J., Ventura, S., Garrell, J. M., Otero, J., Romero, C., Bacardit, J., Rivas, V. M., et al. (2009). Keel: a software tool to assess evolutionary algorithms for data mining problems. Soft Computing, 13(3), 307–318.

    Google Scholar 

  • Arakelian, V., & Karlis, D. (2014). Clustering dependencies via mixtures of copulas. Communications in Statistics-Simulation and Computation, 43(7), 1644–1661.

    MathSciNet  MATH  Google Scholar 

  • Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12(2), 171–178.

    MathSciNet  MATH  Google Scholar 

  • Azzalini, A. (2013). The skew-normal and related families (Vol. 3). Cambridge University Press.

    MATH  Google Scholar 

  • Azzalini, A., & Valle, A. D. (1996). The multivariate skew-normal distribution. Biometrika, 83(4), 715–726.

    MathSciNet  MATH  Google Scholar 

  • Cai, Z., & Wang, X. (2014). Selection of mixed copula model via penalized likelihood. Journal of the American Statistical Association, 109(506), 788–801.

    MathSciNet  MATH  Google Scholar 

  • Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321–357.

    MATH  Google Scholar 

  • Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1), 1–6.

    Google Scholar 

  • Deligiannidis, G., & Doucet, A. (2018). The correlated pseudomarginal method. Journal of the Royal Statistical Society Series B (Statistical Methodology), 80(5), 839–870.

    MathSciNet  MATH  Google Scholar 

  • Faugeras, O. P. (2017). Inference for copula modeling of discrete data: a cautionary tale and some facts. Dependence Modeling, 5(1), 121–132.

    MathSciNet  MATH  Google Scholar 

  • Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., & Herrera, F. (2018). Learning from imbalanced data sets (Vol. 10). Springer.

    Google Scholar 

  • Fotouhi, S., Asadi, S., & Kattan, M. W. (2019). A comprehensive data level analysis for cancer diagnosis on imbalanced data. Journal of Biomedical Informatics, 90, 103089.

    Google Scholar 

  • Geenens, G. (2020). Copula modeling for discrete random vectors. Dependence Modeling, 8(1), 417–440.

    MathSciNet  MATH  Google Scholar 

  • Genest, C., & Rivest, L.-P. (1993). Statistical inference procedures for bivariate Archimedean copulas. Journal of the American Statistical Association, 88(423), 1034–1043.

    MathSciNet  MATH  Google Scholar 

  • Gunawan, D., Tran, M.-N., Suzuki, K., Dick, J., & Kohn, R. (2019). Computationally efficient Bayesian estimation of high-dimensional Archimedean copulas with discrete and mixed margins. Statistics and Computing, 29(5), 933–946.

    MathSciNet  MATH  Google Scholar 

  • Gupta, S., & Gupta, M. K. (2022). A comprehensive data-level investigation of cancer diagnosis on imbalanced data. Computational Intelligence, 38(1), 156–186.

    MathSciNet  Google Scholar 

  • He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.

    Google Scholar 

  • Holzmann, H., Munk, A., & Gneiting, T. (2006). Identifiability of finite mixtures of elliptical distributions. Scandinavian Journal of Statistics, 33(4), 753–763.

    MathSciNet  MATH  Google Scholar 

  • Hu, L. (2006). Dependence patterns across financial markets: a mixed copula approach. Applied Financial Economics, 16(10), 717–729.

    Google Scholar 

  • Huard, D., Evin, G., & Favre, A.-C. (2006). Bayesian copula selection. Computational Statistics & Data Analysis, 51(2), 809–822.

    MathSciNet  MATH  Google Scholar 

  • Jiryaie, F., Withanage, N., Wu, B., & De Leon, A. (2016). Gaussian copula distributions for mixed data, with application in discrimination. Journal of Statistical Computation and Simulation, 86(9), 1643–1659.

    MathSciNet  MATH  Google Scholar 

  • Joe, H. & Xu, J. J. (1996). The estimation method of inference functions for margins for multivariate models. Technical Report No. 166, pp 1–21

  • Kosmidis, I., & Karlis, D. (2016). Model-based clustering using copulas with applications. Statistics and Computing, 26, 1079–1099.

    MathSciNet  MATH  Google Scholar 

  • Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4), 221–232.

    Google Scholar 

  • Liu, Y., Ao, X., Qin, Z., Chi, J., Feng, J., Yang, H., & He, Q. (2021). Pick and choose: a gnn-based imbalanced learning approach for fraud detection. Proceedings of the Web Conference, 2021, 3168–3177.

    Google Scholar 

  • Liu, Y., Xie, D., & Yu, S. (2023). Bayesian mixture copula estimation and selection with applications. Analytics, 2(2), 530–545.

    Google Scholar 

  • Loaiza-Maya, R., & Smith, M. S. (2019). Variational Bayes estimation of discrete-margined copula models with application to time series. Journal of Computational and Graphical Statistics, 28(3), 523–539.

    MathSciNet  MATH  Google Scholar 

  • MacKenzie, D., & Spears, T. (2014). ‘A device for being able to book P &L’: the organizational embedding of the Gaussian copula. Social Studies of Science, 44(3), 418–440.

    Google Scholar 

  • Mazo, G., & Averyanov, Y. (2019). Constraining kernel estimators in semiparametric copula mixture models. Computational Statistics & Data Analysis, 138, 170–189.

    MathSciNet  MATH  Google Scholar 

  • McLachlan, G. J., Lee, S. X., & Rathnayake, S. I. (2019). Finite mixture models. Annual Review of Statistics and its Application, 6, 355–378.

    MathSciNet  Google Scholar 

  • McNeil, A. J., Frey, R., Embrechts, P., et al. (2015). Quantitative risk management: concepts. Economics Books

  • Meyer, C. (2013). The bivariate normal copula. Communications in Statistics-Theory and Methods, 42(13), 2402–2422.

    MathSciNet  MATH  Google Scholar 

  • Mohammed, R., Rawashdeh, J., & Abdullah, M. (2020). Machine learning with oversampling and undersampling techniques: overview study and experimental results. In: 2020 11th international conference on information and communication systems (ICICS), pp 243–248. IEEE

  • Nasr, B. R. & Remillard, B. N. (2023). Identifiability and inference for copula-based semiparametric models for random vectors with arbitrary marginal distributions. arXiv preprint arXiv:2301.13408.

  • Otiniano, C., Rathie, P., & Ozelim, L. (2015). On the identifiability of finite mixture of skew-normal and skew-t distributions. Statistics & Probability Letters, 106, 103–108.

    MathSciNet  MATH  Google Scholar 

  • Panagiotelis, A., Czado, C., & Joe, H. (2012). Pair copula constructions for multivariate discrete data. Journal of the American Statistical Association, 107(499), 1063–1072.

    MathSciNet  MATH  Google Scholar 

  • Pitt, M., Chan, D., & Kohn, R. (2006). Efficient Bayesian inference for Gaussian copula regression models. Biometrika, 93(3), 537–554.

    MathSciNet  MATH  Google Scholar 

  • Provost, F. (2000). Machine learning from imbalanced data sets 101. In: Proceedings of the AAAI’2000 workshop on imbalanced data sets. AAAI Press

  • Renard, B., & Lang, M. (2007). Use of a Gaussian copula for multivariate extreme value analysis: some case studies in hydrology. Advances in Water Resources, 30(4), 897–912.

    Google Scholar 

  • Rousseau, J., & Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5), 689–710.

    MathSciNet  MATH  Google Scholar 

  • Sklar, M. (1959). Fonctions de repartition an dimensions et leurs marges. Annales de l’ISUP, 8, 229–231.

    MATH  Google Scholar 

  • Smith, M. S. (2011). Bayesian approaches to copula modelling. arXiv preprint arXiv:1112.4204.

  • Smith, M. S., Gan, Q., & Kohn, R. J. (2012). Modelling dependence using skew t copulas: Bayesian inference and applications. Journal of Applied Econometrics, 27(3), 500–522.

    MathSciNet  Google Scholar 

  • Smith, M. S., & Khaled, M. A. (2012). Estimation of copula models with discrete margins via Bayesian data augmentation. Journal of the American Statistical Association, 107(497), 290–303.

    MathSciNet  MATH  Google Scholar 

  • Teicher, H. (1961). Identifiability of mixtures. The Annals of Mathematical statistics, 32(1), 244–248.

    MathSciNet  MATH  Google Scholar 

  • Teicher, H. (1963). Identifiability of finite mixtures. The Annals of Mathematical statistics, 34(4), 1265–1269.

    MathSciNet  MATH  Google Scholar 

  • Wang, B. X. & Japkowicz, N. (2004). Imbalanced data set learning with synthetic samples. In: Proc. IRIS machine learning workshop, volume 19, p 435

  • Wang, X. (2008). Selection of mixed copulas and finite mixture models with applications in finance. PhD thesis, The University of North Carolina at Charlotte

  • Wei, Z., Kim, S., Choi, B., & Kim, D. (2019). Multivariate skew normal copula for asymmetric dependence: estimation and application. International Journal of Information Technology & Decision Making, 18(01), 365–387.

    Google Scholar 

  • Wu, J., Wang, X., & Walker, S. G. (2014). Bayesian nonparametric inference for a multivariate copula function. Methodology and Computing in Applied Probability, 16(3), 747–763.

    MathSciNet  MATH  Google Scholar 

  • Xue, Y., Li, G., Li, Z., Wang, P., Gong, H., & Kong, F. (2022). Intelligent prediction of rockburst based on copula-mc oversampling architecture. Bulletin of Engineering Geology and the Environment, 81(5), 1–14.

    Google Scholar 

  • Xue-Kun Song, P. (2000). Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27(2), 305–320.

    MathSciNet  MATH  Google Scholar 

  • Yakowitz, S. J., & Spragins, J. D. (1968). On the identifiability of finite mixtures. The Annals of Mathematical Statistics, 39(1), 209–214.

    MathSciNet  MATH  Google Scholar 

  • Zhu, Q., Wang, S., Chen, Z., He, Y., & Xu, Y. (2019). A virtual sample generation method based on kernel density estimation and copula function for imbalanced classification. In 2019 IEEE 8th Data Driven Control and Learning Systems Conference (DDCLS), pp. 969–975. IEEE.

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dejun Xie.

Ethics declarations

Conflict of interest

None declared.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Liu, Y., Xie, D., Edwards, D.A. et al. Mixture copulas with discrete margins and their application to imbalanced data. J. Korean Stat. Soc. 52, 878–900 (2023). https://doi.org/10.1007/s42952-023-00226-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s42952-023-00226-3

Keywords

Navigation