Exploring Maximum Tree Depth and Random Undersampling in Ensemble Trees to Optimize the Classification of Imbalanced Big Data

Hancock, John T.; Khoshgoftaar, Taghi M.

doi:10.1007/s42979-023-01880-4

Exploring Maximum Tree Depth and Random Undersampling in Ensemble Trees to Optimize the Classification of Imbalanced Big Data

Original Research
Published: 22 June 2023

Volume 4, article number 462, (2023)
Cite this article

SN Computer Science Aims and scope Submit manuscript

89 Accesses
1 Citation
Explore all metrics

Abstract

We present findings from experiments in Medicare fraud detection, that are the result of research on two new, publicly available datasets. In this research, we employ popular, open-source Machine Learning algorithms to identify fraudulent healthcare providers in Medicare insurance claims data. As far as we know, we are the first to publish a study that includes datasets compiled from the latest Medicare Part B and Medicare Part D data. The datasets became available in 2021, and are the largest such datasets that we know of. We report details on two important findings. The first finding is that increased maximum tree depth is associated with the best performance in terms of area under the receiver-operating characteristic curve (AUC) for both datasets. The second finding, which is an important counterbalance to the first finding, is that one may utilize random undersampling (RUS) to reduce the size of the training data and simultaneously achieve similar or better AUC scores.To the best of our knowledge, our study is novel in reporting the importance of maximum tree depth for classifying imbalanced Big Data. Moreover, this work is unique in demonstrating that one may employ RUS to mitigate the increased resource consumption of higher maximum tree depth.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Evaluating classifier performance with highly imbalanced Big Data

Article Open access 11 April 2023

Data reduction techniques for highly imbalanced medicare Big Data

Article Open access 03 January 2024

Severely imbalanced Big Data challenges: investigating data sampling approaches

Article Open access 30 November 2019

Notes

References

The Centers for Medicare and Medicaid Services: Medicare Physician & Other Practitioners: by provider and service. 2021. https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service. Accessed 9 May 2022.
The Centers for Medicare and Medicaid Services: Medicare Part D Prescribers: by provider and drug. 2021. https://data.cms.gov/provider-summary-by-type-of-service/medicare-part-d-prescribers/medicare-part-d-prescribers-by-provider-and-drug. Accessed 18 Feb 2022.
Centers for Medicare and Medicaid Services: 2019 Estimated Improper Payment Rates for Centers for Medicare & Medicaid Services (CMS) Programs. 2019. https://www.cms.gov/newsroom/fact-sheets/2019-estimated-improper-payment-rates-centers-medicare-medicaid-services-cms-programs. Accessed 1 Mar 2022.
Civil Division, U.S. Department of Justice: Fraud Statistics, Overview. 2020. https://www.justice.gov/opa/press-release/file/1354316/download. Accessed 18 Jan 2022.
Bekkar M, Djemaa HK, Alitouche TA. Evaluation measures for models assessment over imbalanced data sets. J Inf Eng Appl. 2013;3(10):27–38.
Google Scholar
Hancock J, Khoshgoftaar TM. Optimizing ensemble trees for big data healthcare fraud detection. In: 2022 IEEE 23rd international conference on information reuse and integration for data science (IRI); 2022. IEEE. p. 243–49
Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. Catboost: unbiased boosting with categorical features. Adv Neural Inf Process Syst. 2018;31:1–11.
Google Scholar
Chen T, Guestrin C. Xgboost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining-KDD ’16; 2016.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Article MATH Google Scholar
Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63(1):3–42.
Article MATH Google Scholar
Bauder RA, Khoshgoftaar TM, Hasanin T. Data sampling approaches with severely imbalanced big data for medicare fraud detection. In: 2018 IEEE 30th international conference on tools with artificial intelligence (ICTAI); 2018. IEEE. p. 137–42
The Centers for Medicare and Medicaid Services: Medicare Durable Medical Equipment, Devices & Supplies: by Referring Provider and Service. 2021. https://data.cms.gov/provider-summary-by-type-of-service/medicare-durable-medical-equipment-devices-supplies/medicare-durable-medical-equipment-devices-supplies-by-referring-provider-and-service. Accessed 18 Jan 2022
Le Cessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C (Appl Stat). 1992;41(1):191–201.
MATH Google Scholar
Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S. Mllib: machine learning in apache spark. J Mach Learn Res. 2016;17(1):1235–41.
MathSciNet MATH Google Scholar
Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ. Apache spark: a unified engine for big data processing. Commun ACM. 2016;59(11):56–65.
Article Google Scholar
Han H, Wang W-Y, Mao B-H. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: International conference on intelligent computing; 2005. Springer. p. 878–887
Lin W, Wu Z, Lin L, Wen A, Li J. An ensemble random forest algorithm for insurance big data analysis. IEEE Access. 2017;5:16568–75.
Article Google Scholar
Del Río S, López V, Benítez JM, Herrera F. On the use of mapreduce for imbalanced big data using random forest. Inf Sci. 2014;285:112–37.
Article Google Scholar
Herrera VM, Khoshgoftaar TM, Villanustre F, Furht B. Random forest implementation and optimization for big data analytics on lexisnexis’5s high performance computing cluster platform. J Big Data. 2019;6(1):1–36.
Article Google Scholar
Genuer R, Poggi J-M, Tuleau-Malot C, Villa-Vialaneix N. Random forests for big data. Big Data Res. 2017;9:28–46.
Article Google Scholar
Fauzan MA, Murfi H. The accuracy of xgboost for insurance claim prediction. Int J Adv Soft Comput Appl. 2018;10(2):159–71.
Google Scholar
Li H, Cao Y, Li S, Zhao J, Sun Y. Xgboost model and its application to personal credit evaluation. IEEE Intell Syst. 2020;35(3):52–61.
Article Google Scholar
XingFen W, Xiangbin Y, Yangchun M. Research on user consumption behavior prediction based on improved xgboost algorithm. In: 2018 IEEE international conference on big data (Big Data); 2018. IEEE. p. 4169–175.
Johnson JM, Khoshgoftaar TM. Deep learning and data sampling with imbalanced big data. In: 2019 IEEE 20th international conference on information reuse and integration for data science (IRI); 2019. IEEE. p. 175–83.
LEIE: Office of Inspector General Leie Downloadable Databases. [Online]. https://oig.hhs.gov/exclusions/index.asp. Accessed 12 Apr 2022
Herland M, Khoshgoftaar TM, Bauder RA. Big data fraud detection using multiple medicare data sources. J Big Data. 2018;5(1):1–21.
Article Google Scholar
The Centers for Medicare and Medicaid Services: Medicare Physician & Other Practitioners: by Provider and Service Data Dictionary. 2021. https://data.cms.gov/resources/medicare-physician-other-practitioners-by-provider-and-service-data-dictionary. Accessed 28 Jan 2022.
The Centers for Medicare and Medicaid Services: Medicare Part D Prescribers: by provider and drug data dictionary. 2021. https://data.cms.gov/resources/medicare-part-d-prescribers-by-provider-and-drug-data-dictionary. Accessed 4 May 2022.
Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.
Article MATH Google Scholar
Hancock J, Khoshgoftaar TM. Performance of catboost and xgboost in medicare fraud detection. In: 2020 19th IEEE international conference on machine learning and applications (ICMLA); 2020. IEEE. p. 572–79.
Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189–232.
Article MathSciNet MATH Google Scholar
Hancock JT, Khoshgoftaar TM. Catboost for big data: an interdisciplinary review. J Big Data. 2020;7(1):1–45.
Article Google Scholar
Van Rossum G, Drake F. Python 3 reference manual createspace. Scotts Valley; 2009.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.
MathSciNet MATH Google Scholar
Johnson JM, Khoshgoftaar TM. Hcpcs2vec: Healthcare procedure embeddings for medicare fraud prediction. In: 2020 IEEE 6th international conference on collaboration and internet computing (CIC); 2020. IEEE. p. 145–52.
Hancock JT, Khoshgoftaar TM. Survey on categorical data for neural networks. J Big Data. 2020;7(1):1–41.
Article Google Scholar
Parameters. Yandex Corporation. https://catboost.ai/en/docs/references/training-parameters/common. Accessed 09 July 2022
XGBoost Parameters. XGBoost Developers. https://xgboost.readthedocs.io/en/stable/parameter.html. Accessed 09 July 2022.
Hancock JT, Khoshgoftaar TM. Hyperparameter tuning for medicare fraud detection in big data. SN Comput Sci. 2022;3(6):1–13.
Article Google Scholar

Download references

Acknowledgements

We would like to thank the Data Mining and Machine Learning Research Group at Florida Atlantic University for their assistance in preparing this manuscript.

Author information

Authors and Affiliations

Department of Electrical Engineering and Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL, 33431, USA
John T. Hancock III & Taghi M. Khoshgoftaar

Authors

John T. Hancock III
View author publications
You can also search for this author in PubMed Google Scholar
Taghi M. Khoshgoftaar
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to John T. Hancock III.

Ethics declarations

Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Recent Trends on AI for HealthCare” guest edited by Lydia Bouzar-Benlabiod.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Hancock, J.T., Khoshgoftaar, T.M. Exploring Maximum Tree Depth and Random Undersampling in Ensemble Trees to Optimize the Classification of Imbalanced Big Data. SN COMPUT. SCI. 4, 462 (2023). https://doi.org/10.1007/s42979-023-01880-4

Download citation

Received: 22 December 2022
Accepted: 08 May 2023
Published: 22 June 2023
DOI: https://doi.org/10.1007/s42979-023-01880-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Exploring Maximum Tree Depth and Random Undersampling in Ensemble Trees to Optimize the Classification of Imbalanced Big Data

Abstract

Access this article

Similar content being viewed by others

Evaluating classifier performance with highly imbalanced Big Data

Data reduction techniques for highly imbalanced medicare Big Data

Severely imbalanced Big Data challenges: investigating data sampling approaches

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of Interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Exploring Maximum Tree Depth and Random Undersampling in Ensemble Trees to Optimize the Classification of Imbalanced Big Data

Abstract

Access this article

Similar content being viewed by others

Evaluating classifier performance with highly imbalanced Big Data

Data reduction techniques for highly imbalanced medicare Big Data

Severely imbalanced Big Data challenges: investigating data sampling approaches

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of Interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation