
Hyperparameter Tuning for Medicare Fraud Detection in Big Data

  • Original Research
  • Published in SN Computer Science

Abstract

Hyperparameter tuning is the collection of techniques used to discover optimal values for the settings we supply to machine learning algorithms; put another way, hyperparameters are values the algorithm does not optimize for itself. When researching Big Data, we face the dilemma of whether hyperparameter tuning with the maximum possible amount of data is worthwhile, because tuning may consume far more resources than a single experiment with default hyperparameter values: each combination of settings requires an additional experiment. Here, we show that hyperparameter tuning with all available data is beneficial within the scope of our experiments. We conduct experiments in Medicare fraud detection with three Big Data Medicare insurance claims datasets. For each dataset, we obtain better performance from the LightGBM and CatBoost classifiers with tuned hyperparameters. Since some features of the data are high-cardinality categorical features, we also have the opportunity to compare different encoding techniques in our experiments. We find that, across the encoding techniques, hyperparameter tuning provides an improvement in the performance of both LightGBM and CatBoost.



Acknowledgements

The authors would like to express their gratitude to the reviewers at the Data Mining and Machine Learning Laboratory of Florida Atlantic University for their help in preparing this study.

Author information


Corresponding author

Correspondence to John T. Hancock.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Innovative AI in Medical Applications” guest edited by Lydia Bouzar-Benlabiod, Stuart H. Rubin and Edwige Pissaloux.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Hancock, J.T., Khoshgoftaar, T.M. Hyperparameter Tuning for Medicare Fraud Detection in Big Data. SN COMPUT. SCI. 3, 440 (2022). https://doi.org/10.1007/s42979-022-01348-x
