Mixture copulas with discrete margins and their application to imbalanced data

Liu, Yujian; Xie, Dejun; Edwards, David A.; Yu, Siyi

doi:10.1007/s42952-023-00226-3

Research Article
Published: 09 September 2023

Journal of the Korean Statistical Society, Volume 52, pages 878–900 (2023)

Yujian Liu¹,², Dejun Xie¹, David A. Edwards³ & Siyi Yu²

Abstract

This article introduces a Bayesian sampling approach for estimating mixture copulas with discrete margins and applies the resulting models to class imbalance problems in data science through oversampling. The methodology makes it possible to learn from, and sample from, data sets in which discrete and continuous features exist simultaneously. By contrast, the classic SMOTE algorithm does not naturally account for the discreteness of features, and classic random oversampling simply replicates existing points, which gives the classifier no new information and makes it prone to overfitting. Copula methods generate new points that preserve the correlation structure learned from the training set, which reduces the overfitting problem. After introducing the methodology, the article reports experiments on synthetic and real data. The results support the validity of the approach when compared with the benchmark methods.
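
As a rough illustration of the difference between replicating existing points and sampling new ones from a learned dependence structure, the following Python sketch oversamples a small minority class with one continuous and one discrete feature. It is not the article's method: it assumes a single Gaussian copula fitted from rank-based normal scores rather than a Bayesian mixture copula, and the toy data, the function names (`mid_ranks`, `normal_scores`, `copula_oversample` and the quantile helpers) and the use of empirical quantiles for the margins are all hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal, used for Phi and its inverse


def mid_ranks(values):
    """1-based ranks; tied values (common for discrete data) share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def normal_scores(values):
    """Map each observation to a normal score via its scaled mid-rank in (0, 1)."""
    n = len(values)
    return [STD_NORMAL.inv_cdf(r / (n + 1)) for r in mid_ranks(values)]


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def quantile_continuous(sorted_vals, u):
    """Interpolated empirical quantile, so new continuous values can appear."""
    pos = u * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def quantile_discrete(sorted_vals, u):
    """Step-function empirical quantile, so samples stay on the observed support."""
    idx = min(int(u * len(sorted_vals)), len(sorted_vals) - 1)
    return sorted_vals[idx]


def copula_oversample(cont, disc, n_new, rng):
    """Draw n_new (continuous, discrete) pairs from a fitted Gaussian copula (assumption)."""
    rho = pearson(normal_scores(cont), normal_scores(disc))
    scale = max(0.0, 1.0 - rho * rho) ** 0.5
    cont_sorted, disc_sorted = sorted(cont), sorted(disc)
    samples = []
    for _ in range(n_new):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + scale * rng.gauss(0.0, 1.0)  # correlated normal pair
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        samples.append((quantile_continuous(cont_sorted, u1),
                        quantile_discrete(disc_sorted, u2)))
    return samples


# Hypothetical minority-class data: one continuous and one discrete feature.
cont = [1.2, 2.3, 2.9, 3.1, 4.0, 4.8, 5.5, 6.1]
disc = [0, 0, 1, 1, 1, 2, 2, 3]
new_points = copula_oversample(cont, disc, 50, random.Random(0))
print(new_points[:3])
```

Unlike random oversampling, the continuous coordinate of a generated point need not coincide with any training value, while the discrete coordinate is always one of the observed levels. A mixture copula would replace the single correlation parameter with several weighted components.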

Data availability

The experimental data set used for the current study is available in the KEEL repository: https://sci2s.ugr.es/keel/datasets.php.

Author information

Authors and Affiliations

School of Mathematics and Physics, Xi’an Jiaotong-Liverpool University, Suzhou, China
Yujian Liu & Dejun Xie
School of Economics and Management, Shanghai University of Sport, Shanghai, China
Yujian Liu & Siyi Yu
Department of Mathematical Sciences, University of Delaware, Newark, DE, USA
David A. Edwards

Corresponding author

Correspondence to Dejun Xie.

Ethics declarations

Conflict of interest

None declared.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, Y., Xie, D., Edwards, D.A. et al. Mixture copulas with discrete margins and their application to imbalanced data. J. Korean Stat. Soc. 52, 878–900 (2023). https://doi.org/10.1007/s42952-023-00226-3

Received: 06 March 2023
Accepted: 28 August 2023
Published: 09 September 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s42952-023-00226-3
