Abstract
Hyperparameter tuning is the collection of techniques used to find the best values for the settings we supply to machine learning algorithms; put another way, hyperparameters are the values a learning algorithm does not optimize itself. When researching Big Data, we face the dilemma of whether hyperparameter tuning with the maximum possible amount of data is worthwhile, since tuning may consume far more resources than a single experiment with default hyperparameter values: each combination of algorithm settings requires an additional experiment. Here, we show that hyperparameter tuning with all available data is beneficial within the scope of our experiments. We conduct experiments in Medicare fraud detection with three Big Data Medicare insurance claims datasets. For each dataset, we obtain better performance from LightGBM and CatBoost classifiers with tuned hyperparameters. Because some features of the data are high-cardinality categorical features, we also have the opportunity to compare different encoding techniques in our experiments. We find that, across the different encoding techniques, hyperparameter tuning improves the performance of both LightGBM and CatBoost.
Acknowledgements
The authors would like to express their gratitude to the reviewers at the Data Mining and Machine Learning Laboratory of Florida Atlantic University for their help in preparing this study.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
This article is part of the topical collection “Innovative AI in Medical Applications” guest edited by Lydia Bouzar-Benlabiod, Stuart H. Rubin and Edwige Pissaloux.
Cite this article
Hancock, J.T., Khoshgoftaar, T.M. Hyperparameter Tuning for Medicare Fraud Detection in Big Data. SN COMPUT. SCI. 3, 440 (2022). https://doi.org/10.1007/s42979-022-01348-x