
Gradient Boosted Decision Tree Algorithms for Medicare Fraud Detection

  • Original Research
  • Published in SN Computer Science

Abstract

Employing machine learning algorithms to identify health insurance fraud is an application of artificial intelligence in healthcare. Insurance fraud spuriously inflates the cost of healthcare and can therefore limit, or even deny, patients necessary care and treatment. We use Medicare claims data as input to various algorithms to gauge their performance in fraud detection. The claims data contain categorical features, some of which have thousands of possible values. To the best of our knowledge, this is the first study to use CatBoost and LightGBM to encode categorical data for Medicare fraud detection. We show that CatBoost attains better performance in the task of Medicare fraud detection than other algorithms, attaining a mean AUC of 0.77452. At a 99% confidence level (p value 0), our analysis shows that this result is significantly better than the mean AUC of 0.76132 that LightGBM yields. A second contribution is to show that when we include an additional categorical feature (healthcare provider state), CatBoost yields a mean AUC of 0.88245, which is also significantly better than the mean AUC of 0.85137 that LightGBM yields. Our empirical evidence indicates that CatBoost is a better alternative to other classifiers for Medicare fraud detection, especially when incorporating categorical features.
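The encoding the abstract highlights is CatBoost's native handling of high-cardinality categorical features, which is based on ordered target statistics: each row is encoded using only the target values of rows that precede it in a random permutation, so the encoding of a row never leaks its own label. The following is a minimal NumPy sketch of that idea, not CatBoost's actual implementation; the function name, the prior of 0.5, and the simple add-one smoothing are illustrative assumptions.

```python
import numpy as np

def ordered_target_stats(categories, targets, prior=0.5, seed=0):
    """Encode a categorical column with ordered target statistics.

    Each row's encoded value is a smoothed running mean of the target,
    computed only over rows that appear earlier in a random permutation,
    which prevents target leakage for high-cardinality features.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of rows seen so far, per category
    encoded = np.empty(len(categories), dtype=float)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + prior) / (n + 1)  # smoothed prefix mean
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

# Usage: a category whose rows are mostly fraudulent (target 1) receives
# encodings near 1, while a mostly legitimate category stays near 0,
# yielding a single numeric feature a gradient boosted tree can split on.
enc = ordered_target_stats(['a'] * 50 + ['b'] * 50, [1] * 50 + [0] * 50)
```

The first row of each category has no history, so it falls back to the smoothed prior; later rows converge toward the category's true target rate. This is why the technique scales to features with thousands of values, such as provider identifiers, where one-hot encoding would be impractical.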



Acknowledgements

The authors would like to express their gratitude to the reviewers at the Data Mining and Machine Learning Laboratory of Florida Atlantic University for the help in preparing this study.

Author information

Correspondence to John T. Hancock.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Artificial Intelligence for HealthCare” guest-edited by Lydia Bouzar-Benlabiod, Stuart H. Rubin and Edwige Pissaloux.


About this article


Cite this article

Hancock, J.T., Khoshgoftaar, T.M. Gradient Boosted Decision Tree Algorithms for Medicare Fraud Detection. SN COMPUT. SCI. 2, 268 (2021). https://doi.org/10.1007/s42979-021-00655-z

