Abstract
The rising elderly population continues to demand more cost-effective healthcare programs. In particular, Medicare is a vital program serving the needs of the elderly in the United States. The growing number of people enrolled in healthcare programs such as Medicare, along with the enormous volume of money in the healthcare industry, increases both the appeal and the risk of fraudulent activities. Of the many possible factors behind the rising cost of healthcare, fraud is a major contributor, but its impact can be lessened through the use of fraud detection methods. In this paper, we assess possible illegitimate activities by looking at the amounts paid to providers for services rendered to patients. We propose a novel method for fraud detection that focuses on discovering outliers in Medicare payment data using multiple predictors as model inputs. Our multivariate outlier detection approach is twofold: (1) create a Multivariate Adaptive Regression Splines model to produce studentized residuals, and (2) use the residuals as input into a general univariate outlier detection model, based on full Bayesian inference, using probabilistic programming. Using this approach, we are able to incorporate multiple variables to detect outliers with a model that provides probability distributions, with credible intervals, rather than just point values, as with most common outlier detection methods. Additionally, these credible intervals further enhance confidence that the detected outliers should in fact be considered outlying values, and thus possibly fraudulent activities. Our results show that the successful detection of these possibly fraudulent activities can provide effective and meaningful results for further investigation within various medical specialties.
Acknowledgements
The authors would like to thank the anonymous reviewers and the Associate Editor for the constructive evaluation of this paper and also the various members of the Data Mining and Machine Learning Laboratory, Florida Atlantic University, for the assistance with the reviews. Also, we acknowledge partial support by the NSF (CNS-1427536). Opinions, findings, conclusions, or recommendations in this paper are the authors’ and do not reflect the views of the NSF.
Ethics declarations
Conflict of interest
Richard A. Bauder declares that he has no conflict of interest. Taghi M. Khoshgoftaar declares that he has no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Appendices
Appendix 1: Additional information on Bayesian inference and probabilistic programming
The use of probabilistic programming, as a general inference technique, enables us to easily create a model of some event or characteristic, such as the detection of outliers, and to perform probabilistic reasoning: making predictions, inferring causes from past events, and learning from the past to improve predictions. At its core, probabilistic programming is probabilistic reasoning in which the probabilistic model is expressed using a programming language. This is, in part, because the probabilistic program is interpreted as a distribution, about which one can then use tools to ask questions. Additionally, these modeling languages incorporate random events as primitives and, as mentioned, their runtime environment handles inference. There are other representation-type languages available, such as Bayesian Belief Networks and hidden Markov models. However, these methods are simply simulations, which are not machine learning. A probabilistic program, by contrast, is akin to a simulation that you can both run and analyze.
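The idea of a model written as a program, with random draws as primitives and a separate inference step, can be illustrated with a minimal Python sketch. This is not the Stan model used in this paper; the function names (`simulate`, `posterior_mean`), the Normal(0, 10) prior, and the unit observation noise are all assumptions chosen for illustration, and inference is done by simple likelihood weighting rather than by Stan's sampler.

```python
import math
import random

PRIOR_MEAN, PRIOR_SD = 0.0, 10.0   # assumed prior on the unknown mean
NOISE_SD = 1.0                     # assumed observation noise

def simulate(rng):
    """Run the program forward: draw an unknown mean, then one observation."""
    mean = rng.gauss(PRIOR_MEAN, PRIOR_SD)   # random primitive: the prior
    observation = rng.gauss(mean, NOISE_SD)  # random primitive: the data
    return mean, observation

def likelihood(observed, mean):
    """Unnormalized Normal(mean, NOISE_SD) density evaluated at `observed`."""
    z = (observed - mean) / NOISE_SD
    return math.exp(-0.5 * z * z)

def posterior_mean(observed, n_draws=20000, seed=0):
    """Analyze the program: estimate E[mean | observed] by likelihood weighting."""
    rng = random.Random(seed)
    weighted_sum = 0.0
    weight_total = 0.0
    for _ in range(n_draws):
        mean = rng.gauss(PRIOR_MEAN, PRIOR_SD)   # propose from the prior
        w = likelihood(observed, mean)           # weight by fit to the data
        weighted_sum += w * mean
        weight_total += w
    return weighted_sum / weight_total

estimate = posterior_mean(observed=3.0)
```

Here `simulate` is the "run" side of the program and `posterior_mean` is the "analyze" side: the same model definition supports both forward simulation and questions about the unknown mean given an observation.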
Bayesian inference updates the posterior probabilities, or beliefs, based on any prior assumptions and the evidence. The updated beliefs may not necessarily agree with our prior assumptions: as in the real world, the evidence tends either to bolster or to overtake any prior assumptions we may have had concerning some event. When we have little to no prior knowledge, we can assume non-informative, or vague, prior beliefs. We also have the ability to incorporate stronger beliefs and assumptions based on prior knowledge, such as expert opinions, thus improving the model fit. Other benefits of Bayesian methods include more interpretable results, the incorporation of subjective inputs such as medical knowledge, and the quantification of uncertainties. Furthermore, Bayesian techniques provide credible intervals for the different parameters in the model. The credible intervals show that a value or parameter has, for instance, an 80% or 95% probability of being within the interval bands (note these are the default intervals exported by Stan). This is much easier to interpret than the traditional confidence intervals, which indicate that if an experiment is repeated many times, the intervals constructed in this way will contain the true value 80% or 95% of the time. Additionally, as with other means to assess confidence in results, the credible intervals around the results are reliable only given that the model is true.
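The effect of vague versus informative priors, and the reading of 80% and 95% credible intervals, can be sketched with a conjugate normal model in which the posterior is available in closed form. This is an illustrative assumption, not the model used in this paper: the data values, the known noise standard deviation, and the two priors below are hypothetical.

```python
from statistics import NormalDist, fmean

def normal_posterior(data, prior_mean, prior_sd, noise_sd):
    """Posterior for the mean of normal data with known noise_sd (conjugate update)."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = len(data) / noise_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * fmean(data))
    return NormalDist(post_mean, post_var ** 0.5)

def credible_interval(posterior, level):
    """Central interval holding `level` posterior probability (e.g. 0.80 or 0.95)."""
    tail = (1.0 - level) / 2.0
    return posterior.inv_cdf(tail), posterior.inv_cdf(1.0 - tail)

# Hypothetical observations and noise level, for illustration only.
data = [9.8, 10.4, 10.1, 9.6, 10.3, 10.0]

# Vague prior: the evidence dominates and the posterior sits near the sample mean.
vague = normal_posterior(data, prior_mean=0.0, prior_sd=100.0, noise_sd=0.5)

# Informative prior: the posterior is pulled toward the prior mean of 12.
informed = normal_posterior(data, prior_mean=12.0, prior_sd=0.2, noise_sd=0.5)

interval_80 = credible_interval(vague, 0.80)
interval_95 = credible_interval(vague, 0.95)
```

Under the assumed model, `interval_95` is the range that contains the unknown mean with 95% posterior probability, which is the direct reading of a credible interval described above.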
Appendix 2: Probability model code and details
Below is our general outlier detection Stan probability model, with comments, showing inputs, unknown variables, distributions, and generated outputs. Note that the Stan language is mostly vectorized, but there are instances where vectorization is not yet implemented. For instance, in the Stan probability model code, the prior distributions are vectorized functions whereas the likelihood distribution is not, hence the need for a for loop. Even so, both methods result in their requisite distributions.
In the case of the prior distributions given by \(mean\_value \sim normal(100, 100);\) and \(stdev\_value \sim normal(100, 100);\) in the Stan model code, both the mean and the standard deviation are assigned normal prior distributions with mean 100 and standard deviation 100. The value of 100 is not set in stone, but is more a starting point for the prior distributions that feed the Student's t likelihood. These static parameters are typically chosen by the researcher as assumptions on the prior distributions, or from prior knowledge of the data.
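To make the role of these priors concrete, the following Python sketch evaluates an unnormalized log posterior that combines the two normal(100, 100) priors with a Student's t likelihood accumulated in a for loop. It is not the Stan model itself. The names `mean_value` and `stdev_value` follow the text; the degrees of freedom `nu`, the restriction of `stdev_value` to positive values, and the example residuals are assumptions made for illustration.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

def student_t_logpdf(x, nu, mu, sigma):
    """Log density of a location-scale Student's t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1.0) / 2.0 * math.log1p(z * z / nu))

def log_posterior(mean_value, stdev_value, residuals, nu=4.0):
    """Unnormalized log posterior; nu=4 is an assumed value, not from the paper."""
    if stdev_value <= 0.0:          # assumed constraint: a scale must be positive
        return -math.inf
    # Priors as described in the text: normal(100, 100) on both parameters.
    lp = normal_logpdf(mean_value, 100.0, 100.0)
    lp += normal_logpdf(stdev_value, 100.0, 100.0)
    # Likelihood: one Student's t term per observation, added in a loop.
    for y in residuals:
        lp += student_t_logpdf(y, nu, mean_value, stdev_value)
    return lp

# Hypothetical studentized residuals with one large value.
residuals = [-0.4, 0.1, 0.3, -0.2, 0.5, 6.0]
near = log_posterior(0.1, 1.0, residuals)
far = log_posterior(50.0, 1.0, residuals)
```

With priors this wide, the likelihood terms dominate: parameter values near the bulk of the residuals score far higher than values near the prior mean of 100, which is consistent with treating 100 as a starting point rather than a strong belief.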
Cite this article
Bauder, R.A., Khoshgoftaar, T.M. Multivariate outlier detection in medicare claims payments applying probabilistic programming methods. Health Serv Outcomes Res Method 17, 256–289 (2017). https://doi.org/10.1007/s10742-017-0172-1
DOI: https://doi.org/10.1007/s10742-017-0172-1