
Gradient Boosted Decision Tree Algorithms for Medicare Fraud Detection

  • Original Research
  • Published in SN Computer Science

Abstract

Employing machine learning algorithms to identify health insurance fraud is an application of artificial intelligence in healthcare. Insurance fraud spuriously inflates the cost of healthcare and can therefore limit, or even deny, patients necessary care and treatment. We use Medicare claims data as input to various algorithms to gauge their performance in fraud detection. The claims data contain categorical features, some of which have thousands of possible values. To the best of our knowledge, this is the first study to use CatBoost and LightGBM to encode categorical data for Medicare fraud detection. We show that CatBoost attains better performance in the task of Medicare fraud detection than other algorithms, attaining a mean AUC of 0.77452. At a 99% confidence level (p value 0), our analysis shows that this result is significantly better than the mean AUC of 0.76132 that LightGBM yields. A second contribution is to show that when we include an additional categorical feature (healthcare provider state), CatBoost yields a mean AUC of 0.88245, which is also significantly better than the mean AUC of 0.85137 that LightGBM yields. Our empirical evidence indicates that CatBoost is a better alternative to other classifiers for Medicare fraud detection, especially when incorporating categorical features.
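The encoding the abstract highlights is CatBoost's native handling of high-cardinality categorical features, which is based on ordered target statistics: each row is encoded using only the target values of rows that precede it in a random permutation, so the encoding of a row never leaks its own label. The following is a minimal NumPy sketch of that idea, not CatBoost's actual implementation; the function name, the prior of 0.5, and the simple add-one smoothing are illustrative assumptions.

```python
import numpy as np

def ordered_target_stats(categories, targets, prior=0.5, seed=0):
    """Encode a categorical column with ordered target statistics.

    Each row's encoded value is a smoothed running mean of the target,
    computed only over rows that appear earlier in a random permutation,
    which prevents target leakage for high-cardinality features.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of rows seen so far, per category
    encoded = np.empty(len(categories), dtype=float)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + prior) / (n + 1)  # smoothed prefix mean
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

# Usage: a category whose rows are mostly fraudulent (target 1) receives
# encodings near 1, while a mostly legitimate category stays near 0,
# yielding a single numeric feature a gradient boosted tree can split on.
enc = ordered_target_stats(['a'] * 50 + ['b'] * 50, [1] * 50 + [0] * 50)
```

The first row of each category has no history, so it falls back to the smoothed prior; later rows converge toward the category's true target rate. This is why the technique scales to features with thousands of values, such as provider identifiers, where one-hot encoding would be impractical.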



Acknowledgements

The authors would like to express their gratitude to the reviewers at the Data Mining and Machine Learning Laboratory of Florida Atlantic University for the help in preparing this study.

Author information

Correspondence to John T. Hancock.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Artificial Intelligence for HealthCare” guest-edited by Lydia Bouzar-Benlabiod, Stuart H. Rubin and Edwige Pissaloux.


About this article


Cite this article

Hancock, J.T., Khoshgoftaar, T.M. Gradient Boosted Decision Tree Algorithms for Medicare Fraud Detection. SN COMPUT. SCI. 2, 268 (2021). https://doi.org/10.1007/s42979-021-00655-z

