Medical Provider Embeddings for Healthcare Fraud Detection

  • Original Research
  • SN Computer Science

Abstract

Advances in data mining and machine learning continue to transform the healthcare industry and provide value to medical professionals and patients. In this study, we address the problem of encoding medical provider types and present four techniques for learning dense, semantic embeddings that capture provider specialty similarities. The first two methods (GloVe and Med-W2V) use pre-trained word embeddings to convert provider specialty descriptions to phrase embeddings. Next, HcpcsVec and RxVec embeddings are constructed from publicly available big data using specialty-procedure and specialty-drug occurrence matrices, respectively. We evaluate the learned provider type embeddings on two real-world Medicare fraud classification problems using logistic regression (LR), random forest (RF), gradient boosted tree (GBT), and multilayer perceptron (MLP) learners. Through repetition, statistical analysis, and feature importance measures, we confirm that semantic embeddings for provider types significantly improve fraud classification results. Finally, t-SNE visualizations are used to show that the learned provider type embeddings capture meaningful specialty characteristics and provider type similarities. Our primary contributions are two novel methods for encoding medical specialties using procedure-level statistics and the evaluation of four encoding techniques on two large-scale healthcare fraud classification tasks. Since all data sources are publicly available, these encoding techniques can be readily adopted and applied in future machine learning applications in the healthcare industry.
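
To make the two families of encodings in the abstract concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how word vectors are combined into a phrase embedding or how the occurrence matrices are turned into dense vectors, so this sketch assumes simple averaging for the former and stops at row-normalized occurrence profiles for the latter. All names, vectors, procedure labels, and counts (`TOY_WORD_VECTORS`, `TOY_RECORDS`, `phrase_embedding`, `occurrence_rows`, `cosine`) are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

# Hypothetical 3-dimensional stand-ins for pre-trained word vectors
# (real GloVe or Med-W2V vectors would be loaded from disk instead).
TOY_WORD_VECTORS = {
    "internal": [0.2, 0.1, 0.0],
    "medicine": [0.9, 0.1, 0.3],
    "family": [0.1, 0.8, 0.1],
    "practice": [0.7, 0.3, 0.2],
}

# Hypothetical (specialty, procedure label, count) records; the labels and
# counts are made up and do not come from any CMS data set.
TOY_RECORDS = [
    ("Cardiology", "proc_ecg", 120),
    ("Cardiology", "proc_office_visit", 80),
    ("Internal Medicine", "proc_office_visit", 150),
    ("Internal Medicine", "proc_ecg", 50),
    ("Dermatology", "proc_skin_biopsy", 90),
    ("Dermatology", "proc_office_visit", 10),
]


def phrase_embedding(description, word_vectors):
    """Average the vectors of the known words in a specialty description.

    Averaging is an assumption of this sketch; unknown words are skipped.
    """
    tokens = [t for t in description.lower().split() if t in word_vectors]
    if not tokens:
        raise ValueError(f"no known words in {description!r}")
    dim = len(word_vectors[tokens[0]])
    return [sum(word_vectors[t][i] for t in tokens) / len(tokens)
            for i in range(dim)]


def occurrence_rows(records):
    """Build one row-normalized occurrence vector per specialty.

    Returns the sorted list of codes (the column order) and a dict mapping
    each specialty to its row of relative frequencies.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for specialty, code, n in records:
        counts[specialty][code] += n
    codes = sorted({code for _, code, _ in records})
    rows = {}
    for specialty, by_code in counts.items():
        total = sum(by_code.values())
        rows[specialty] = [
            (by_code.get(c, 0.0) / total) if total else 0.0 for c in codes
        ]
    return codes, rows


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(phrase_embedding("Internal Medicine", TOY_WORD_VECTORS))
    codes, rows = occurrence_rows(TOY_RECORDS)
    print(codes)
    print(cosine(rows["Cardiology"], rows["Internal Medicine"]))
    print(cosine(rows["Cardiology"], rows["Dermatology"]))
```

With these toy counts, the Cardiology row is closer to Internal Medicine than to Dermatology under cosine similarity, which is the kind of specialty similarity the abstract says the learned embeddings capture. A dense embedding would additionally require a dimensionality reduction or learning step that this sketch leaves out.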

Data Availability

All data analysed during this study are referenced in this published article.

Code Availability

All software packages used during this study are open-source and are referenced in this published article.

Acknowledgements

The authors would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University.

Funding

Not applicable.

Author information

Contributions

JMJ performed the literature review, executed the experiment design, and drafted the manuscript. TMK worked with JMJ to develop the article’s framework and focus. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Justin M. Johnson.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Artificial Intelligence for HealthCare” guest-edited by Lydia Bouzar-Benlabiod, Stuart H. Rubin and Edwige Pissaloux.

About this article

Cite this article

Johnson, J.M., Khoshgoftaar, T.M. Medical Provider Embeddings for Healthcare Fraud Detection. SN COMPUT. SCI. 2, 276 (2021). https://doi.org/10.1007/s42979-021-00656-y

  • DOI: https://doi.org/10.1007/s42979-021-00656-y
