The effects of varying class distribution on learner behavior for Medicare fraud detection with imbalanced big data

  • Richard A. Bauder
  • Taghi M. Khoshgoftaar


Healthcare in the United States is a critical aspect of most people’s lives, particularly for the aging demographic. This growing elderly population continues to demand more cost-effective healthcare programs. Medicare is a vital program serving the needs of the elderly in the United States. The growing number of Medicare beneficiaries, along with the enormous volume of money in the healthcare industry, increases both the appeal and the risk of fraud. In this paper, we focus on the detection of Medicare Part B provider fraud, which involves fraudulent activities, such as patient abuse or neglect and billing for services not rendered, perpetrated by providers and other entities who have been excluded from participating in Federal healthcare programs. We discuss Part B data processing and describe a unique process for mapping fraud labels to known fraudulent providers. The labeled big dataset is highly imbalanced, with a very limited number of fraud instances. To combat this class imbalance, we generate seven class distributions and assess the behavior and fraud detection performance of six different machine learning methods. Our results show that RF100 using a 90:10 class distribution is the best learner, with an AUC of 0.87302. Moreover, learner behavior with the 50:50 balanced class distribution is similar to that with more imbalanced distributions, which keep more of the original data. Based on the performance and significance testing results, we posit that retaining more of the majority class information leads to better Medicare Part B fraud detection performance than the balanced datasets across the majority of learners.
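As a rough illustration of how a class distribution such as 90:10 or 50:50 can be produced by random undersampling, the following minimal Python sketch keeps every fraud instance and randomly discards non-fraud instances until the requested ratio is reached. It is not the authors' implementation: the function name `random_undersample`, the 0/1 label encoding, the seed, and the toy label counts are all assumptions made for this example, and only the two ratios named in the abstract are shown.

```python
import random


def random_undersample(labels, majority_pct, seed=0):
    """Return sorted indices that keep every minority (fraud = 1) instance
    and a random subset of majority (non-fraud = 0) instances, so that the
    majority class makes up about `majority_pct` percent of the result."""
    if not 0 < majority_pct < 100:
        raise ValueError("majority_pct must be strictly between 0 and 100")
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Number of majority instances needed to reach the requested ratio.
    target = round(len(minority) * majority_pct / (100 - majority_pct))
    # Undersampling only removes instances; it never creates new ones.
    target = min(target, len(majority))
    rng = random.Random(seed)
    kept = rng.sample(majority, target)
    return sorted(minority + kept)


# Hypothetical toy data: 20 fraud labels among 10,000 providers.
labels = [1] * 20 + [0] * 9980
for pct in (90, 50):  # the 90:10 and 50:50 distributions named in the abstract
    idx = random_undersample(labels, pct)
    n_fraud = sum(labels[i] for i in idx)
    print(f"{pct}:{100 - pct} -> {len(idx) - n_fraud} non-fraud, {n_fraud} fraud")
```

With these toy counts, the 90:10 setting retains 180 non-fraud instances against 20 fraud instances, while 50:50 retains only 20. This makes concrete how much majority class information the balanced distribution discards.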


Medicare fraud, Class imbalance, Random undersampling, Big data


Authors' contributions

The authors would like to thank the Editor-in-Chief and the two reviewers for their insightful evaluation of and constructive feedback on this paper, as well as the members of the Data Mining and Machine Learning Laboratory, Florida Atlantic University, for their assistance in the review process. We acknowledge partial support by the NSF (CNS-1427536). Opinions, findings, conclusions, or recommendations in this paper are the authors’ and do not reflect the views of the NSF. All authors read and approved the final manuscript.

Competing interests

All authors declare that they have no competing interests.

Ethics approval and consent to participate

The article does not contain any studies with human participants or animals performed by any of the authors.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. College of Engineering & Computer Science, Florida Atlantic University, Boca Raton, USA
