Multivariate outlier detection in medicare claims payments applying probabilistic programming methods

Bauder, Richard A.; Khoshgoftaar, Taghi M.

doi:10.1007/s10742-017-0172-1

Published: 20 June 2017

Volume 17, pages 256–289, (2017)

Health Services and Outcomes Research Methodology

Richard A. Bauder¹ &
Taghi M. Khoshgoftaar¹

Abstract

The rising elderly population continues to demand more cost-effective healthcare programs. In particular, Medicare is a vital program serving the needs of the elderly in the United States. The growing number of people enrolled in healthcare programs such as Medicare, along with the enormous volume of money in the healthcare industry, increases the appeal for, and risk of, fraudulent activities. Of the many possible factors behind the rising cost of healthcare, fraud is a major contributor, but its impact can be lessened through the use of fraud detection methods. In this paper, we assess possible illegitimate activities by looking at the amounts paid to providers for services rendered to patients. We propose a novel method for fraud detection that focuses on discovering outliers in Medicare payment data using multiple predictors as model inputs. Our multivariate outlier detection approach is twofold: (1) create a Multivariate Adaptive Regression Splines model to produce studentized residuals, and (2) use the residuals as input into a general univariate outlier detection model, based on full Bayesian inference, using probabilistic programming. Using this approach, we are able to incorporate multiple variables to detect outliers with a model that provides probability distributions, with credible intervals, rather than just point values, as with most common outlier detection methods. Additionally, these credible intervals further enhance confidence that the detected outliers should in fact be considered outlying values, and thus possibly fraudulent activities. Our results show that the successful detection of these possibly fraudulent activities can provide effective and meaningful results for further investigation within various medical specialties.

Acknowledgements

The authors would like to thank the anonymous reviewers and the Associate Editor for the constructive evaluation of this paper and also the various members of the Data Mining and Machine Learning Laboratory, Florida Atlantic University, for the assistance with the reviews. Also, we acknowledge partial support by the NSF (CNS-1427536). Opinions, findings, conclusions, or recommendations in this paper are the authors’ and do not reflect the views of the NSF.

Author information

Authors and Affiliations

Florida Atlantic University, Boca Raton, FL, USA
Richard A. Bauder & Taghi M. Khoshgoftaar

Corresponding author

Correspondence to Richard A. Bauder.

Ethics declarations

Conflict of interest

Richard A. Bauder declares that he has no conflict of interest. Taghi M. Khoshgoftaar declares that he has no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Appendices

Appendix 1: Additional information on Bayesian inference and probabilistic programming

The use of probabilistic programming, as a general inference technique, enables us to easily create a model of some event or characteristic, such as the detection of outliers, and to perform probabilistic reasoning: making predictions, inferring causes from past events, and learning from the past to improve predictions. At its core, probabilistic programming is probabilistic reasoning in which the probabilistic model is expressed using a programming language. This works, in part, because the probabilistic program is interpreted as a distribution, and tools can then be used to ask questions about that distribution. Additionally, these modeling languages incorporate random events as primitives and, as mentioned, their runtime environment handles inference. There are other representation-type languages available, such as Bayesian Belief Networks and hidden Markov models. However, these methods are simply simulations, which is not in itself machine learning. A probabilistic program, by contrast, is akin to a simulation that can be both run and analyzed.
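
As a rough illustration of a program that can be both run and analyzed, the following Python sketch defines a toy generative program built from random primitives and then conditions it on an observed payment by rejection sampling. It is not the model used in this paper: the function names, the 5% outlier rate, and the payment centres and spread are all hypothetical values chosen only for the example, and a real probabilistic programming system such as Stan uses far more efficient inference.

```python
import random

def payment_program(rng):
    """One forward run of a toy generative program (all numbers hypothetical)."""
    is_outlier = rng.random() < 0.05           # random primitive: Bernoulli(0.05)
    centre = 300.0 if is_outlier else 100.0    # outlying payments centre on a larger amount
    payment = rng.gauss(centre, 30.0)          # random primitive: Normal(centre, 30)
    return is_outlier, payment

def posterior_outlier_prob(observed, tol, n_runs, seed=0):
    """Estimate P(outlier | payment near `observed`) by rejection sampling.

    Runs the program forward n_runs times and keeps only the runs whose
    simulated payment falls within `tol` of the observed payment.
    """
    rng = random.Random(seed)
    runs = (payment_program(rng) for _ in range(n_runs))
    kept = [flag for flag, pay in runs if abs(pay - observed) <= tol]
    return sum(kept) / len(kept) if kept else float("nan")

if __name__ == "__main__":
    print(posterior_outlier_prob(observed=280.0, tol=15.0, n_runs=100_000))
    print(posterior_outlier_prob(observed=100.0, tol=15.0, n_runs=100_000))
```

Calling `payment_program` is the "run" part; `posterior_outlier_prob` is the "analyze" part, asking a question about the distribution the program defines.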

Bayesian inference updates the posterior probabilities, or beliefs, based on the prior assumptions and the evidence. The updated beliefs need not agree with the prior assumptions: as in the real world, the evidence may bolster, or overtake, any prior assumptions we had concerning some event. When little to no prior knowledge is available, we can assume non-informative, or vague, prior beliefs. We also have the ability to incorporate stronger beliefs and assumptions based on prior knowledge, such as expert opinions, which can improve the model fit. Other benefits of Bayesian methods include more interpretable results, the incorporation of subjective inputs such as medical knowledge, and the quantification of uncertainties. Furthermore, Bayesian techniques provide credible intervals for the different parameters in the model. A credible interval states that a value or parameter has, for instance, an 80% or 95% probability of lying within the interval bounds (note these are the default intervals exported by Stan). This is much easier to interpret than a traditional confidence interval, which indicates that if an experiment were repeated many times, 80% or 95% of the intervals constructed in this way would contain the true value. Additionally, as with other means of assessing confidence in results, the credible intervals are reliable only given that the model is true.
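
To make the credible interval concrete, the sketch below computes equal-tailed 80% and 95% intervals from a set of posterior draws. It is a minimal Python illustration under stated assumptions: the draws are simulated from a normal distribution as a stand-in for real MCMC output, and the helper name `credible_interval` is hypothetical rather than part of Stan or any library.

```python
import random

def credible_interval(draws, mass):
    """Equal-tailed interval holding roughly `mass` of the posterior draws."""
    ordered = sorted(draws)
    tail = (1.0 - mass) / 2.0
    last = len(ordered) - 1
    lo_idx = int(round(tail * last))           # index of the lower quantile
    hi_idx = int(round((1.0 - tail) * last))   # index of the upper quantile
    return ordered[lo_idx], ordered[hi_idx]

# Stand-in for posterior draws of a parameter (assumed, not real sampler output).
rng = random.Random(42)
draws = [rng.gauss(100.0, 10.0) for _ in range(20_000)]

ci80 = credible_interval(draws, 0.80)
ci95 = credible_interval(draws, 0.95)

if __name__ == "__main__":
    print("80% credible interval:", ci80)
    print("95% credible interval:", ci95)
```

Under the assumed model, the parameter lies inside `ci95` with posterior probability of about 0.95, which is the direct reading described above.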

Appendix 2: Probability model code and details

Below is our general outlier detection Stan probability model, with comments, showing inputs, unknown variables, distributions, and generated outputs. Note that the Stan language is mostly vectorized, but there are instances where vectorization is not yet implemented. For instance, in the Stan probability model code, the prior distributions are vectorized functions whereas the likelihood distribution is not, hence the need for a for loop. Even so, both forms result in their requisite distributions.

In the case of the prior distributions given by $mean\_value \sim normal(100, 100);$ and $stdev\_value \sim normal(100, 100);$ in the Stan model code, both the mean and the standard deviation are assigned normal priors with mean 100 and standard deviation 100. The value of 100 is not set in stone; it is a starting point for the priors on the parameters of the Student's t likelihood. These static parameters are typically chosen by the researcher to reflect assumptions about the prior distributions or prior knowledge of the data.
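
The way these priors combine with a Student's t likelihood can be sketched as an unnormalized log posterior. The Python below is an illustration only and is not the Stan model from this appendix: the function names are hypothetical, the degrees of freedom `nu` is fixed at an assumed value of 4, and the standard deviation is assumed to be constrained to be positive. The explicit loop over observations mirrors the non-vectorized likelihood described above.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def student_t_logpdf(x, nu, mu, sigma):
    """Log density of a location-scale Student's t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1.0) / 2.0 * math.log1p(z * z / nu))

def log_posterior(mean_value, stdev_value, y, nu=4.0):
    """Unnormalized log posterior for the data y (nu = 4 is an assumption)."""
    if stdev_value <= 0.0:                      # assumed positivity constraint
        return -math.inf
    lp = normal_logpdf(mean_value, 100.0, 100.0)     # prior: normal(100, 100)
    lp += normal_logpdf(stdev_value, 100.0, 100.0)   # prior: normal(100, 100)
    for y_i in y:                               # likelihood, one term per observation
        lp += student_t_logpdf(y_i, nu, mean_value, stdev_value)
    return lp

if __name__ == "__main__":
    residuals = [95.0, 100.0, 105.0, 98.0, 102.0]    # made-up example data
    print(log_posterior(100.0, 10.0, residuals))
    print(log_posterior(500.0, 10.0, residuals))
```

A sampler would explore `mean_value` and `stdev_value` in proportion to the exponential of this quantity; changing the two 100s changes only the prior terms, which is why they serve as a starting point rather than a fixed choice.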

About this article

Cite this article

Bauder, R.A., Khoshgoftaar, T.M. Multivariate outlier detection in medicare claims payments applying probabilistic programming methods. Health Serv Outcomes Res Method 17, 256–289 (2017). https://doi.org/10.1007/s10742-017-0172-1

Received: 05 October 2016
Revised: 29 May 2017
Accepted: 12 June 2017
Published: 20 June 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s10742-017-0172-1
