Abstract
We present findings from Medicare fraud detection experiments conducted on two new, publicly available datasets. In this research, we employ popular, open-source machine learning algorithms to identify fraudulent healthcare providers in Medicare insurance claims data. As far as we know, we are the first to publish a study that includes datasets compiled from the latest Medicare Part B and Medicare Part D data. The datasets became available in 2021 and are the largest such datasets that we know of. We report details on two important findings. The first finding is that increased maximum tree depth is associated with the best performance, in terms of area under the receiver operating characteristic curve (AUC), for both datasets. The second finding, which is an important counterbalance to the first, is that one may use random undersampling (RUS) to reduce the size of the training data and simultaneously achieve similar or better AUC scores. To the best of our knowledge, our study is novel in reporting the importance of maximum tree depth for classifying imbalanced Big Data. Moreover, this work is unique in demonstrating that one may employ RUS to mitigate the increased resource consumption of higher maximum tree depth.
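To make the two quantities named in the abstract concrete, the following is a minimal sketch, using only the Python standard library, of random undersampling and of AUC computed as a ranking statistic. It is not the authors' pipeline: the function names `random_undersample` and `auc`, the `minority_fraction` parameter, the 1:1 target class ratio, and the synthetic one-feature data are all assumptions made for illustration. The study itself trains ensemble tree learners, which this sketch does not attempt to reproduce.

```python
import random


def random_undersample(rows, labels, minority_fraction=0.5, seed=0):
    """Keep every positive (minority) row and a random subset of negatives.

    minority_fraction is the desired share of positives in the result;
    0.5 yields a 1:1 class ratio. This value is illustrative only.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Number of negatives needed so positives make up minority_fraction.
    n_neg = round(len(pos) * (1 - minority_fraction) / minority_fraction)
    n_neg = min(n_neg, len(neg))
    keep = pos + rng.sample(neg, n_neg)
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative (ties count one half). Quadratic time, fine for a toy example."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic, heavily imbalanced toy data: 1,000 rows, 1% positive.
data_rng = random.Random(42)
labels = [1 if i < 10 else 0 for i in range(1000)]
rows = [[data_rng.gauss(2.0 if y else 0.0, 1.0)] for y in labels]

sampled_rows, sampled_labels = random_undersample(rows, labels)
print(len(sampled_rows), sum(sampled_labels))  # 20 rows, 10 of them positive

# Score each original row by its single feature and measure ranking quality.
print(round(auc(labels, [r[0] for r in rows]), 3))
```

In this sketch the undersampled set is 2% of the original size while retaining every positive example, which is the mechanism by which RUS can offset the extra training cost of deeper trees. Evaluation should still be done on data at the original class ratio, as the AUC call above does.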
Acknowledgements
We would like to thank the Data Mining and Machine Learning Research Group at Florida Atlantic University for their assistance in preparing this manuscript.
Ethics declarations
Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
This article is part of the topical collection “Recent Trends on AI for HealthCare” guest edited by Lydia Bouzar-Benlabiod.
About this article
Cite this article
Hancock, J.T., Khoshgoftaar, T.M. Exploring Maximum Tree Depth and Random Undersampling in Ensemble Trees to Optimize the Classification of Imbalanced Big Data. SN COMPUT. SCI. 4, 462 (2023). https://doi.org/10.1007/s42979-023-01880-4
DOI: https://doi.org/10.1007/s42979-023-01880-4