Abstract
Advances in data mining and machine learning continue to transform the healthcare industry and provide value to medical professionals and patients. In this study, we address the problem of encoding medical provider types and present four techniques for learning dense, semantic embeddings that capture similarities between provider specialties. The first two methods (GloVe and Med-W2V) use pre-trained word embeddings to convert provider specialty descriptions to phrase embeddings. The other two, HcpcsVec and RxVec, are constructed from publicly available big data using specialty-procedure and specialty-drug occurrence matrices, respectively. We evaluate the learned provider type embeddings on two real-world Medicare fraud classification problems using logistic regression (LR), random forest (RF), gradient boosted tree (GBT), and multilayer perceptron (MLP) learners. Through repeated experiments, statistical analysis, and feature importance measures, we confirm that semantic embeddings for provider types significantly improve fraud classification results. Finally, t-SNE visualizations show that the learned provider type embeddings capture meaningful specialty characteristics and provider type similarities. Our primary contributions are two novel methods for encoding medical specialties using procedure-level statistics and the evaluation of four encoding techniques on two large-scale healthcare fraud classification tasks. Since all data sources are publicly available, these encoding techniques can be readily adopted and applied in future machine learning applications in the healthcare industry.
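To make the two families of encodings concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not state how word vectors are combined or how the occurrence matrices are reduced, so the sketch assumes simple averaging for phrase embeddings and a row-normalized count matrix reduced with a truncated SVD. All function names, procedure codes, and counts are hypothetical.

```python
import numpy as np


def phrase_embedding(description, word_vectors, dim):
    """Average the vectors of the known words in a specialty description.

    Assumption: plain averaging; the paper may weight or combine words differently.
    """
    tokens = description.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


def occurrence_embeddings(records, dim):
    """Embed specialties from a list of (specialty, code, count) tuples.

    Assumption: counts are row-normalized and reduced with a truncated SVD.
    """
    specialties = sorted({s for s, _, _ in records})
    codes = sorted({c for _, c, _ in records})
    row = {s: i for i, s in enumerate(specialties)}
    col = {c: j for j, c in enumerate(codes)}

    # Specialty-by-code occurrence matrix.
    counts = np.zeros((len(specialties), len(codes)))
    for s, c, n in records:
        counts[row[s], col[c]] += n

    # Turn each row into a frequency distribution over codes.
    totals = counts.sum(axis=1, keepdims=True)
    freqs = counts / np.where(totals == 0, 1, totals)

    # Keep the top singular directions as the dense embedding.
    u, singular_values, _ = np.linalg.svd(freqs, full_matrices=False)
    k = min(dim, len(singular_values))
    vectors = u[:, :k] * singular_values[:k]
    return dict(zip(specialties, vectors))


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Hypothetical toy inputs, for illustration only.
toy_word_vectors = {
    "family": np.array([1.0, 0.0]),
    "practice": np.array([0.0, 1.0]),
}
toy_records = [
    ("Cardiology", "93000", 90),
    ("Cardiology", "99213", 10),
    ("Internal Medicine", "93000", 20),
    ("Internal Medicine", "99213", 80),
    ("Dermatology", "11100", 95),
    ("Dermatology", "99213", 5),
]

family_practice = phrase_embedding("Family Practice", toy_word_vectors, dim=2)
specialty_vectors = occurrence_embeddings(toy_records, dim=3)
```

In this toy data, specialties that bill overlapping codes end up closer in cosine similarity than specialties that share almost none, which is the kind of similarity a one-hot encoding of provider type cannot express.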
Data Availability
All data analysed during this study are referenced in this published article.
Code Availability
All software packages used during this study are open-source and are referenced in this published article.
Acknowledgements
The authors would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University.
Funding
Not applicable.
Author information
Contributions
JMJ performed the literature review, executed the experiment design, and drafted the manuscript. TMK worked with JMJ to develop the article’s framework and focus. All authors have read and approved the final manuscript.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Artificial Intelligence for HealthCare” guest-edited by Lydia Bouzar-Benlabiod, Stuart H. Rubin and Edwige Pissaloux.
About this article
Cite this article
Johnson, J.M., Khoshgoftaar, T.M. Medical Provider Embeddings for Healthcare Fraud Detection. SN COMPUT. SCI. 2, 276 (2021). https://doi.org/10.1007/s42979-021-00656-y