Medical Provider Embeddings for Healthcare Fraud Detection

Johnson, Justin M.; Khoshgoftaar, Taghi M.

doi:10.1007/s42979-021-00656-y

Original Research
Published: 15 May 2021

SN Computer Science, Volume 2, article number 276 (2021)

Abstract

Advances in data mining and machine learning continue to transform the healthcare industry and provide value to medical professionals and patients. In this study, we address the problem of encoding medical provider types and present four techniques for learning dense, semantic embeddings that capture provider specialty similarities. The first two methods (GloVe and Med-W2V) use pre-trained word embeddings to convert provider specialty descriptions to phrase embeddings. Next, HcpcsVec and RxVec embeddings are constructed from publicly available big data using specialty-procedure and specialty-drug occurrence matrices, respectively. We evaluate the learned provider type embeddings on two real-world Medicare fraud classification problems using logistic regression (LR), random forest (RF), gradient boosted tree (GBT), and multilayer perceptron (MLP) learners. Through repetition, statistical analysis, and feature importance measures, we confirm that semantic embeddings for provider types significantly improve fraud classification results. Finally, t-SNE visualizations are used to show that the learned provider type embeddings capture meaningful specialty characteristics and provider type similarities. Our primary contributions are two novel methods for encoding medical specialties using procedure-level statistics and the evaluation of four encoding techniques on two large-scale healthcare fraud classification tasks. Since all data sources are publicly available, these encoding techniques can be readily adopted and applied in future machine learning applications in the healthcare industry.
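
The two families of encodings named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how word vectors are combined into phrase embeddings or how the occurrence matrices are reduced to dense vectors, so the averaging step, the row normalization, and the truncated SVD below are assumptions chosen for illustration. The function names (`phrase_embedding`, `occurrence_embeddings`, `cosine`) and the toy word vectors and counts are hypothetical.

```python
import numpy as np


def phrase_embedding(description, word_vectors, dim):
    """Average the vectors of the known words in a specialty description.

    Assumption: simple averaging; the paper may combine words differently.
    """
    tokens = description.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known words: fall back to the zero vector.
        return np.zeros(dim)
    return np.mean(known, axis=0)


def occurrence_embeddings(counts, dim):
    """Turn a specialty-by-procedure (or specialty-by-drug) count matrix
    into dense vectors.

    Assumption: rows are normalized to relative frequencies and reduced
    with a truncated SVD; this stands in for whatever the paper uses.
    """
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    freqs = np.divide(counts, row_sums,
                      out=np.zeros_like(counts), where=row_sums > 0)
    u, s, _ = np.linalg.svd(freqs, full_matrices=False)
    # Keep the top `dim` components, scaled by their singular values.
    return u[:, :dim] * s[:dim]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy stand-in for pre-trained word vectors (hypothetical values).
toy_vectors = {
    "family": np.array([1.0, 0.0]),
    "practice": np.array([0.0, 1.0]),
}
family_practice = phrase_embedding("Family Practice", toy_vectors, dim=2)

# Toy counts: rows are specialties, columns are procedure codes.
toy_counts = [
    [90, 10, 0, 0],   # specialty A
    [80, 20, 0, 0],   # specialty B, bills the same procedures as A
    [0, 0, 50, 50],   # specialty C, bills disjoint procedures
]
specialty_vectors = occurrence_embeddings(toy_counts, dim=2)
similar = cosine(specialty_vectors[0], specialty_vectors[1])
dissimilar = cosine(specialty_vectors[0], specialty_vectors[2])
```

In this toy setup, specialties that bill the same procedures end up with nearly parallel vectors, while a specialty with disjoint procedures ends up orthogonal to them, which is the kind of similarity structure the abstract describes.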

Data Availability

All data analysed during this study are referenced in this published article.

Code Availability

All software packages used during this study are open-source and are referenced in this published article.

References

Medicare Provider Utilization and Payment Data. Centers for Medicare & Medicaid Services. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/index 2020, Accessed 15 Feb 2020.
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado G.S, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X. TensorFlow: large-scale machine learning on heterogeneous systems. http://tensorflow.org/ 2015, Accessed 15 Feb 2020.
Aronson A, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc JAMIA. 2010;17:229–36. https://doi.org/10.1136/jamia.2009.002733.
Arora S, Liang Y, Ma T. A simple but tough-to-beat baseline for sentence embeddings. In: ICLR; 2017.
Bauder RA, Khoshgoftaar TM. A novel method for fraudulent medicare claims detection from expected payment deviations (application paper). In: 2016 IEEE 17th international conference on information reuse and integration (IRI); 2016. p. 11–19. https://doi.org/10.1109/IRI.2016.11.
Bauder RA, Khoshgoftaar TM, Richter A, Herland M. Predicting medical provider specialties to detect anomalous insurance claims. In: 2016 IEEE 28th international conference on tools with artificial intelligence (ICTAI); 2016. p. 784–790. https://doi.org/10.1109/ICTAI.2016.0123.
Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Trans Assoc Comput Linguist. 2017;5:135–46. https://doi.org/10.1162/tacl_a_00051.
Branting L.K, Reeder F, Gold J, Champney T. Graph analytics for healthcare fraud risk estimation. In: 2016 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM); 2016. p. 845–851. https://doi.org/10.1109/ASONAM.2016.7752336.
Centers For Medicare & Medicaid Services: Hcpcs general information. https://www.cms.gov/Medicare/Coding/MedHCPCSGenInfo/index.html 2018, Accessed 15 Feb 2020.
Centers for Medicare & Medicaid Services: medicare enrollment dashboard. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Dashboard/Medicare-Enrollment/Enrollment%20Dashboard.html 2019, Accessed 15 Feb 2020.
Centers For Medicare & Medicaid Services: medicare provider utilization and payment data. https://www.cms.gov/research-statistics-data-and-systems/statistics-trends-and-reports/medicare-provider-charge-data 2019, Accessed 15 Feb 2020.
Centers For Medicare & Medicaid Services: medicare provider utilization and payment data: physician and other supplier. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier 2020, Accessed 15 Feb 2020.
Centers For Medicare & Medicaid Services: medicare provider utilization and payment data: part d prescriber. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Part-D-Prescriber 2020, Accessed 15 Feb 2020.
Centers For Medicare & Medicaid Services: trustees report & trust funds. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/ReportsTrustFunds/index.html 2020, Accessed 15 Feb 2020.
Chandola V, Sukumar SR, Schryver JC. Knowledge discovery from massive healthcare claims data. In: KDD; 2013.
Chen L. Curse of dimensionality. Boston: Springer; 2009. p. 545–6. https://doi.org/10.1007/978-0-387-39940-9_133.
Choi E, Bahadori M.T, Song L, Stewart W.F, Sun J. Gram: graph-based attention model for healthcare representation learning. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’17, p. 787–795. Association for Computing Machinery, New York, NY, USA; 2017. https://doi.org/10.1145/3097983.3098126.
Choi E, Bahadori T, Searles E, Coffey C, Thompson M, Bost J, Tejedor-Sojo J, Sun J. Multi-layer representation learning for medical concepts. In: 22nd ACM SIGKDD international conference; 2016. p. 1495–1504. https://doi.org/10.1145/2939672.2939823.
Choi Y, Chiu CYI, Sontag DA. Learning low-dimensional representations of medical concepts. AMIA Summits Transl Sci Proc. 2016;2016:41–50.
Chollet F, et al. Keras. https://keras.io (2015), Accessed 15 Feb 2020.
Healthcare Cost and Utilization Project (HCUP). Clinical classifications software (CCS) for ICD-9-CM. www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp 2017, Accessed 15 Feb 2020.
Das A, Ganguly D, Garain U. Named entity recognition with word embeddings and wikipedia categories for a low-resource language. ACM Trans Asian Lowresour Lang Inf Process. 2017. https://doi.org/10.1145/3015467.
De Vine L, Zuccon G, Koopman B, Sitbon L, Bruza P. Medical semantic similarity with a neural language model. In: Proceedings of the 23rd ACM international conference on conference on information and knowledge management, CIKM ’14, p. 1819–1822. Association for Computing Machinery, New York, NY, USA; 2014. https://doi.org/10.1145/2661829.2661974.
Devlin J, Chang M.W, Lee K, Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT; 2019.
Ferdous M, Debnath J, Chakraborty N.R. Machine learning algorithms in healthcare: a literature survey. In: 2020 11th International conference on computing, communication and networking technologies (ICCCNT); 2020. p. 1–6. https://doi.org/10.1109/ICCCNT49239.2020.9225642.
Fursov I, Zaytsev A, Khasyanov R, Spindler M, Burnaev E. Sequence embeddings help to identify fraudulent cases in healthcare insurance. ArXiv abs/1910.03072. 2019.
Gudivada A, Tabrizi N. A literature review on machine learning based medical information retrieval systems. In: 2018 IEEE symposium series on computational intelligence (SSCI); 2018. p. 250–257. https://doi.org/10.1109/SSCI.2018.8628846.
Hafiz AM, Bhat GM. A survey of deep learning techniques for medical diagnosis. In: Tuba M, Akashe S, Joshi A, editors. Information and communication technology for sustainable development. Singapore: Springer; 2020. p. 161–70.
Hancock JT, Khoshgoftaar TM. Survey on categorical data for neural networks. J Big Data. 2020;7(1):28. https://doi.org/10.1186/s40537-020-00305-w.
Herland M, Bauder RA, Khoshgoftaar TM. Medical provider specialty predictions for the detection of anomalous medicare insurance claims. In: 2017 IEEE international conference on information reuse and integration (IRI); 2017. p. 579–588. https://doi.org/10.1109/IRI.2017.29.
Herland M, Khoshgoftaar TM, Bauder RA. Big data fraud detection using multiple medicare data sources. J Big Data. 2018;5(1):29. https://doi.org/10.1186/s40537-018-0138-3.
Herland M, Khoshgoftaar TM, Wald R. A review of data mining using big data in health informatics. J Big Data. 2014;1(1):2. https://doi.org/10.1186/2196-1115-1-2.
Huang K, Altosaar J, Ranganath R. Clinicalbert: modeling clinical notes and predicting hospital readmission. ArXiv abs/1904.05342. 2019.
Jeyaraj PR, Nadar ERS. Smart-monitor: patient monitoring system for IoT-based healthcare system using deep learning. IETE J Res. 2019. https://doi.org/10.1080/03772063.2019.1649215.
Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S, Wang Y, Dong Q, Shen H, Wang Y. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol. 2017;2(4):230–43. https://doi.org/10.1136/svn-2017-000101.
Johnson JM, Khoshgoftaar TM. Deep learning and data sampling with imbalanced big data. In: 2019 IEEE 20th international conference on information reuse and integration for data science (IRI); 2019. p. 175–183.
Johnson JM, Khoshgoftaar TM. Deep learning and thresholding with class-imbalanced big data. In: 2019 18th IEEE international conference on machine learning and applications (ICMLA); 2019. p. 755–762. https://doi.org/10.1109/ICMLA.2019.00134.
Johnson JM, Khoshgoftaar TM. Medicare fraud detection using neural networks. J Big Data. 2019;6(1):63. https://doi.org/10.1186/s40537-019-0225-0.
Johnson JM, Khoshgoftaar TM. The effects of data sampling with deep learning and highly imbalanced big data. Inf Syst Front. 2020;22(5):1113–31. https://doi.org/10.1007/s10796-020-10022-7.
Johnson JM, Khoshgoftaar TM. Hcpcs2vec: healthcare procedure embeddings for medicare fraud prediction. In: 2020 IEEE 6th international conference on collaboration and internet computing (CIC); 2020.
Johnson JM, Khoshgoftaar TM. Semantic embeddings for medical providers and fraud detection. In: 2020 IEEE 21st international conference on information reuse and integration for data science (IRI); 2020. p. 224–230. https://doi.org/10.1109/IRI49571.2020.00039.
Johnson JM, Khoshgoftaar TM. Thresholding strategies for deep learning with highly imbalanced big data. Singapore: Springer; 2021. p. 199–227. https://doi.org/10.1007/978-981-15-6759-9_9.
Kalyan KS, Sangeetha S. Secnlp: a survey of embeddings in clinical natural language processing. J Biomed Inform. 2020;101:103323. https://doi.org/10.1016/j.jbi.2019.103323.
Khattak FK, Jeblee S, Pou-Prom C, Abdalla M, Meaney C, Rudzicz F. A survey of word embeddings for clinical text. J Biomed Inform X. 2019;4:100057. https://doi.org/10.1016/j.yjbinx.2019.100057. http://www.sciencedirect.com/science/article/pii/S2590177X19300563.
Ko J, Chalfin H, Trock B, Feng Z, Humphreys E, Park SW, Carter B, Frick DK, Han M. Variability in medicare utilization and payment among urologists. Urology. 2015. https://doi.org/10.1016/j.urology.2014.11.054.
Scientific Linux. About. https://www.scientificlinux.org/about/ (2014), Accessed 15 Jan 2020.
Ma F, You Q, Xiao H, Chitta R, Zhou J, Gao J. Kame: knowledge-based attention model for diagnosis prediction in healthcare. In: Proceedings of the 27th ACM international conference on information and knowledge management, CIKM ’18, p. 743–752. Association for Computing Machinery, New York, NY, USA; 2018. https://doi.org/10.1145/3269206.3271701.
Maas A, Daly R.E, Pham P.T, Huang D, Ng A.Y, Potts C. Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies; 2011. p. 142–150. Accessed 15 Feb 2020.
Maaten LVD, Hinton G. Visualizing data using t-sne. J Mach Learn Res. 2008;9:2579–605.
Mikolov T, Chen K, Corrado GS, Dean J. Efficient estimation of word representations in vector space. CoRR abs/1301.3781. 2013.
Morris L. Combating fraud in health care: an essential component of any cost containment strategy. Health Aff. 2009;28:1351–6. https://doi.org/10.1377/hlthaff.28.5.1351.
National Plan & Provider Enumeration System: Nppes npi registry. https://npiregistry.cms.hhs.gov/registry/ 2020, Accessed 15 Feb 2020.
Office of Inspector General: Leie downloadable databases. https://oig.hhs.gov/exclusions/exclusions_list.asp (2019).
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
Pennington J, Socher R, Manning CD. Glove: global vectors for word representation. In: Empirical methods in natural language processing (EMNLP); 2014. p. 1532–1543.
Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L. Deep contextualized word representations. In: Proceedings of the 2018 conference of the North american chapter of the association for computational linguistics: human language technologies, vol. 1 (long papers), p. 2227–2237. Association for Computational Linguistics, New Orleans, Louisiana; 2018. https://doi.org/10.18653/v1/N18-1202.
Pianykh OS, Guitron S, Parke D, Zhang C, Pandharipande P, Brink J, Rosenthal D. Improving healthcare operations management with machine learning. Nat Mach Intell. 2020;2(5):266–73. https://doi.org/10.1038/s42256-020-0176-3.
Provost F, Fawcett T. Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In: Proceedings of the third international conference on knowledge discovery and data mining; 1999. p. 43–48.
Pyysalo S, Ginter F, Moen H, Salakoski T, Ananiadou S. Distributional semantics resources for biomedical text processing. In: Proceedings of languages in biology and medicine; 2013.
Rajpurkar P, Zhang J, Lopyrev K, Liang P. Squad: 100,000+ questions for machine comprehension of text. In: EMNLP; 2016.
Raunak V, Gupta V, Metze F. Effective dimensionality reduction for word embeddings. In: Proceedings of the 4th workshop on representation learning for NLP (RepL4NLP-2019), p. 235–243. Association for Computational Linguistics, Florence, Italy; 2019. https://doi.org/10.18653/v1/W19-4328.
Sahlgren M. The distributional hypothesis. Ital J Linguist. 2008;20:33–54.
Shailaja K, Seetharamulu B, Jabbar M. Machine learning in healthcare: a review. In: 2018 Second international conference on electronics, communication and aerospace technology (ICECA), IEEE; 2018. p. 910–914.
Si Y, Wang J, Xu H, Roberts K. Enhancing clinical concept extraction with contextual embeddings. J Am Med Inform Assoc JAMIA. 2019. https://doi.org/10.1093/jamia/ocz096.
Song L, Cheong C.W, Yin K, Cheung W.K, Fung B.C.M, Poon J. Medical concept embedding with multiple ontological representations. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence, IJCAI-19, p. 4613–4619. International Joint Conferences on Artificial Intelligence Organization; 2019. https://doi.org/10.24963/ijcai.2019/641.
Sun J, Chen X, Zhang Z, Lai S, Zhao B, Liu H, Wang S, Huan W, Zhao R, Ng MTA, Zheng Y. Forecasting the long-term trend of covid-19 epidemic using a dynamic model. Sci Rep. 2020;10(1):21122. https://doi.org/10.1038/s41598-020-78084-w.
Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949;5(2):99–114.
U.S. Government, U.S. Centers for Medicare & Medicaid Services: the official U.S. government site for medicare. https://www.medicare.gov/. Accessed 15 Feb 2020.
Villarroel M, Reisner A, Clifford G, Lehman LW, Moody G, Heldt T, Kyaw T, Moody B, Mark R. Multiparameter intelligent monitoring in intensive care ii (mimic-ii): a public-access intensive care unit database. Crit Care Med. 2011;39:952–60. https://doi.org/10.1097/CCM.0b013e31820a92c6.
Wang M, Zhang Q, Lam S, Cai J, Yang R. A review on application of deep learning algorithms in external beam radiotherapy automated treatment planning. Front Oncol. 2020;10:2177. https://doi.org/10.3389/fonc.2020.580919.
Witten IH, Frank E, Hall MA, Pal CJ. Data mining, fourth edition: practical machine learning tools and techniques. 4th ed. San Francisco: Morgan Kaufmann Publishers Inc.; 2016.
Zou WY, Socher R, Cer D, Manning CD. Bilingual word embeddings for phrase-based machine translation. In: Proceedings of the 2013 conference on empirical methods in natural language processing; 2013. p. 1393–1398.

Acknowledgements

The authors would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University.

Funding

Not applicable.

Author information

Authors and Affiliations

College of Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, 33431, USA
Justin M. Johnson & Taghi M. Khoshgoftaar

Contributions

JMJ performed the literature review, executed the experiment design, and drafted the manuscript. TMK worked with JMJ to develop the article’s framework and focus. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Justin M. Johnson.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Artificial Intelligence for HealthCare” guest-edited by Lydia Bouzar-Benlabiod, Stuart H. Rubin and Edwige Pissaloux.

About this article

Cite this article

Johnson, J.M., Khoshgoftaar, T.M. Medical Provider Embeddings for Healthcare Fraud Detection. SN COMPUT. SCI. 2, 276 (2021). https://doi.org/10.1007/s42979-021-00656-y

Received: 27 December 2020
Accepted: 19 April 2021
Published: 15 May 2021
DOI: https://doi.org/10.1007/s42979-021-00656-y
