Analysis of Primary Care Provider Electronic Health Record Notes for Discussions of Prediabetes Using Natural Language Processing Methods


Studies of how primary care providers (PCPs) manage patients with prediabetes, based on survey data1 and structured data from electronic health records (EHRs),2 suggest that these patients are not receiving evidence-based care. We developed and validated a natural language processing (NLP) tool that identifies prediabetes discussions in unstructured EHR notes, and we describe the content of those discussions.


In phases 1 and 2, we included adults without diabetes who had an in-person office visit at one of 19 primary care clinics at an academic medical center and at least one HbA1c of 5.7–6.4% between 7/1/2016 and 12/31/2018. We based the initial keyword search strategy on the authors’ clinical experience (Table 1). In phase 1, we identified and extracted PCP encounter notes matching ≥ 1 keyword from two clinics. We then identified additional keywords through chart review of a random sample of notes that contained no keyword but belonged to patients meeting the inclusion/exclusion criteria. The Supplement provides additional details.
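The keyword screening step described above can be sketched as a simple case-insensitive pattern match. This is only an illustrative sketch: the keyword list, the function names, and the note format (a mapping from note ID to text) are assumptions, not the study's actual implementation, and the real keywords are those in Table 1.

```python
import re

# Hypothetical keyword list for illustration only; the study's actual
# keywords appear in its Table 1 and are not reproduced here.
KEYWORDS = [
    "prediabetes",
    "pre-diabetes",
    "impaired fasting glucose",
    "borderline diabetes",
]

# One case-insensitive pattern; \b keeps matches on word boundaries.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matched_keywords(note_text):
    """Return the sorted, lowercased keywords found in one note."""
    return sorted({m.group(0).lower() for m in PATTERN.finditer(note_text)})


def filter_notes(notes):
    """Keep notes matching >= 1 keyword; `notes` maps note ID -> text."""
    kept = {}
    for note_id, text in notes.items():
        found = matched_keywords(text)
        if found:
            kept[note_id] = found
    return kept
```

Notes that match no keyword are dropped by `filter_notes`; in the study, a random sample of such notes was reviewed by hand to find keywords the initial list had missed.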

Table 1 Keywords Included in Search Strategy and Frequency of Keywords Matching to Clinical Discussion About Prediabetes

In phase 2, using data from the 17 other clinics, we extracted the first PCP visit note following lab results indicating prediabetes (n = 1095 encounters) and applied the updated keyword search strategy, which retained the notes matching ≥ 1 keyword (n = 391 encounters). Two reviewers (E.T. and J.L.S.) manually annotated these notes to determine whether they contained clinical discussions of prediabetes. We then trained machine learning-based NLP classifiers to replicate the human annotation.3 To reduce overfitting and classification bias and to support internal validity, we applied 10-fold cross-validation, which rotates the notes used for training and testing.4 We selected logistic regression and convolutional neural networks based on performance. We evaluated classification results using sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
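A minimal sketch of 10-fold cross-validation and the four evaluation metrics follows, using only the Python standard library. Everything here is an assumption for illustration: `train_fn` is a hypothetical callable that takes training texts and labels and returns a prediction function, and pooling the confusion counts across folds is one common choice, since the text does not say whether results were pooled or averaged per fold.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and split them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = discussion present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Compute the four metrics; raises ZeroDivisionError if a denominator is 0."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def cross_validate(texts, labels, train_fn, k=10, seed=0):
    """Train on k-1 folds, predict the held-out fold, and pool the counts."""
    totals = [0, 0, 0, 0]
    for held_out in k_fold_indices(len(texts), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(texts)) if i not in held]
        # train_fn is hypothetical: it returns a function mapping text -> 0 or 1.
        predict = train_fn(
            [texts[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        preds = [predict(texts[i]) for i in held_out]
        counts = confusion_counts([labels[i] for i in held_out], preds)
        totals = [a + b for a, b in zip(totals, counts)]
    return classification_metrics(*totals)
```

Because every note is held out exactly once, each prediction that enters the pooled counts comes from a model that never saw that note during training.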

Two reviewers (E.T. and J.L.S.) reviewed each phase 2 note to characterize the prediabetes discussions: (1) labs ordered/reviewed (HbA1c or fasting glucose), (2) lifestyle counseling, (3) diabetes prevention program (DPP) discussion/referral, (4) nutrition discussion/referral, and (5) metformin discussion or ordering/continuation. For each type of discussion, we calculated the proportion of patients with that discussion among all patients with a documented discussion about prediabetes (Stata, version 15). This study was approved by the Johns Hopkins IRB.
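The proportion calculation above can be sketched as follows. The study used Stata; this Python version is only an illustration, and the category labels and input format (one set of labels per patient with a documented prediabetes discussion) are assumptions.

```python
from collections import Counter

# Hypothetical short labels for the five discussion types listed above.
CATEGORIES = ("labs", "lifestyle_counseling", "dpp", "nutrition", "metformin")


def discussion_proportions(annotations):
    """Return the share of patients with each discussion type.

    `annotations` holds one set of category labels per patient who had a
    documented prediabetes discussion, so its length is the denominator.
    """
    n = len(annotations)
    if n == 0:
        raise ValueError("no patients with a documented discussion")
    counts = Counter(cat for labels in annotations for cat in set(labels))
    return {cat: counts[cat] / n for cat in CATEGORIES}
```

A patient can contribute to several categories at once, so the proportions need not sum to 1.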


In phase 2, 322 of 391 encounter notes matching ≥ 1 keyword had documentation of prediabetes discussions. The NLP and machine learning classifiers performed close to human annotation. Logistic regression achieved a sensitivity of 0.961, specificity of 0.923, PPV of 0.967, and NPV of 0.907. Convolutional neural networks achieved a sensitivity of 0.979, specificity of 0.956, PPV of 0.979, and NPV of 0.956. Table 1 lists the keywords and how often each matched a clinical discussion about prediabetes.
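As a quick arithmetic check on the counts reported in the text, the share of first-visit notes that matched at least one keyword, and the share of keyword-matched notes annotated as containing a prediabetes discussion, work out to:

```latex
\[
\frac{391}{1095} \approx 0.357,
\qquad
\frac{322}{391} \approx 0.824
\]
```

The second ratio can be read as the positive predictive value of the keyword search alone, before any machine learning classification; this interpretation is ours and is not stated in the text.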

PCPs commonly provided lifestyle counseling (78%), reviewed current labs (67%), and ordered follow-up labs (60%) (Table 2). PCPs infrequently discussed seeing a nutritionist or referred patients to one (3%). No notes documented a DPP discussion or referral. Metformin was discussed, ordered, or continued in < 2% of patients.

Table 2 PCP Management of Prediabetes Documented in Clinical Encounters (N = 322 Encounters)


We developed and validated the first NLP tool that identifies clinical discussions about prediabetes from unstructured EHR data. PCPs underutilize the prediabetes diagnosis code (structured data); only 13% of patients were given a diagnosis code in a large EHR study.2 Therefore, using structured EHR data is insufficient to identify visits where prediabetes is addressed. Few studies have used NLP methods to identify discussions about chronic conditions5 or lifestyle counseling.6

Consistent with prior findings, PCPs most commonly addressed prediabetes through lifestyle change counseling.1 However, PCPs infrequently referred to nutrition or DPPs. Although this institution has a DPP, it is community-based and not integrated into the clinical setting, which may explain the lack of DPP referrals. PCPs commonly reviewed and ordered follow-up labs, although prior work suggests low completion rates.2

Our results come from one academic medical center and from a small sample of a larger population of patients with prediabetes. Our descriptive outcomes are based on provider visit documentation, and providers may not have documented every detail of their verbal discussions. We did not capture discussions that may have occurred through EHR patient-provider messaging or telephone documentation.

In conclusion, we developed and validated an NLP tool that identifies clinical discussions about prediabetes in the EHR. As diabetes prevention efforts expand, this novel tool may help track PCP practices that are not captured as discrete tasks in structured data.


  1. Tseng E, Greer RC, O’Rourke P, Yeh HC, McGuire MM, Albright AL, et al. National Survey of Primary Care Physicians’ Knowledge, Practices, and Perceptions of Prediabetes. J Gen Intern Med. 2019;34(11):2475-81.

  2. Schmittdiel JA, Adams SR, Segal J, Griffin MR, Roumie CL, Ohnsorg K, et al. Novel use and utility of integrated electronic health records to assess rates of prediabetes recognition and treatment: brief report from an integrated electronic health records pilot study. Diabetes Care. 2014;37(2):565-8.

  3. Deng L, Liu Y, editors. Deep Learning in Natural Language Processing. Singapore: Springer; 2018.

  4. Martin JH, Jurafsky D. Speech and Language Processing, 2nd Edition. United Kingdom: Pearson Prentice Hall; 2008.

  5. Anzaldi LJ, Davison A, Boyd CM, Leff B, Kharrazi H. Comparing clinician descriptions of frailty and geriatric syndromes using electronic health records: a retrospective cohort study. BMC Geriatr. 2017;17(1):248.

  6. Hazlehurst BL, Lawrence JM, Donahoo WT, Sherwood NE, Kurtz SE, Xu S, et al. Automating assessment of lifestyle counseling in electronic health records. Am J Prev Med. 2014;46(5):457-64.


This work was supported by the Johns Hopkins Institute for Clinical and Translational Research Core Coins Award 2018. E.T. was supported by the National Institute of Diabetes and Digestive and Kidney Diseases [K23DK118205-01A1]. J.L.S. was supported by the National Heart, Lung, and Blood Institute [5T32HL007180, PI: Hill-Briggs].

Author information

Corresponding author

Correspondence to Eva Tseng MD.

Ethics declarations

Conflict of Interest

Dr. Maruthur is the co-inventor of a virtual diabetes prevention program. Under a license agreement between Johns Hopkins HealthCare Solutions and the Johns Hopkins University, Dr. Maruthur and the University are entitled to royalty distributions related to this technology. This arrangement has been reviewed and approved by the Johns Hopkins University in accordance with its conflict of interest policies. This technology is not described in this study.

Supplementary Information


(DOCX 13 kb)

About this article

Cite this article

Tseng, E., Schwartz, J.L., Rouhizadeh, M. et al. Analysis of Primary Care Provider Electronic Health Record Notes for Discussions of Prediabetes Using Natural Language Processing Methods. J GEN INTERN MED (2021).
