Studies examining primary care providers’ (PCPs’) practices in managing patients with prediabetes using survey data1 and structured data from electronic health records (EHRs)2 suggest that patients with prediabetes are not receiving evidence-based care. We developed and validated a natural language processing (NLP) tool that analyzes unstructured data in EHR notes to identify prediabetes discussions, and we then characterized those discussions.


In phases 1 and 2, we included adults without diabetes with an in-person office visit at a primary care clinic (n = 19) at an academic medical center and at least one HbA1c 5.7–6.4% between 7/1/2016 and 12/31/2018. We based the initial keyword search strategy on the authors’ clinical experience (Table 1). In phase 1, we identified and extracted PCP encounter notes matching ≥ 1 keyword from two clinics. Through random chart review of notes for patients meeting the inclusion/exclusion criteria but not containing any keyword, we identified additional keywords. The Supplement provides additional details.
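The keyword screening step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual implementation: the keyword list is hypothetical (the real list is given in Table 1), and the function names `matched_keywords` and `filter_notes` are invented for this example.

```python
import re

# HYPOTHETICAL keywords for illustration only; the study's list is in Table 1.
KEYWORDS = [
    "prediabetes",
    "pre-diabetes",
    "impaired fasting glucose",
    "borderline diabetes",
]

# One case-insensitive pattern with word boundaries. Longer keywords are
# listed first so that they win when alternatives share a prefix.
PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def matched_keywords(note_text):
    """Return the sorted, lowercased set of keywords found in one note."""
    return sorted({m.group(0).lower() for m in PATTERN.finditer(note_text)})


def filter_notes(notes):
    """Keep notes matching >= 1 keyword; `notes` maps note ID -> note text."""
    hits = {}
    for note_id, text in notes.items():
        found = matched_keywords(text)
        if found:
            hits[note_id] = found
    return hits
```

Notes that pass the filter would then go to manual annotation, while a random sample of the notes that do not pass could be reviewed for missed keywords, mirroring the chart review described above.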

Table 1 Keywords Included in Search Strategy and Frequency of Keywords Matching to Clinical Discussion About Prediabetes

In phase 2, using data from 17 other clinics, we extracted the first PCP visit note following lab results indicating prediabetes (n = 1095 encounters) and applied the updated keyword search strategy (n = 391 encounters). Two reviewers (E.T. and J.L.S.) manually annotated the notes to determine whether they contained clinical discussions of prediabetes. We applied NLP techniques using machine learning to replicate human annotation.3 To reduce overfitting and classification bias and to support internal validation, we used 10-fold cross-validation, shuffling the notes and rotating which fold served as the test set.4 We selected logistic regression and bi-directional recurrent neural networks based on performance. We evaluated classification results using sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
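To make the evaluation procedure concrete, the sketch below shows one way to build shuffled 10-fold splits and to compute the four reported metrics from human labels and model predictions. It is an assumption-laden illustration rather than the authors' pipeline: the classifier itself is omitted, and the helper names `k_fold_indices` and `diagnostic_metrics` are hypothetical.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle indices once, then yield (train, test) index lists per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def diagnostic_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV, and NPV.

    Labels are binary: 1 = note contains a prediabetes discussion
    (per human annotation or model prediction), 0 = it does not.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined when a class or prediction type is absent.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

In each fold, a model would be fit on the `train` indices and scored on the `test` indices, so every annotated note is used for testing exactly once. How the per-fold results were combined in the study is not stated here, so any averaging scheme would be a further assumption.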

Two reviewers (E.T. and R.L.S.) reviewed each note from phase 2 to describe the prediabetes discussions: (1) labs ordered/reviewed (HbA1c or fasting glucose), (2) lifestyle counseling, (3) diabetes prevention program (DPP) discussion/referral, (4) nutrition discussion/referral, and (5) metformin discussion or ordering/continuation. We calculated proportions using the number of patients with a documented discussion about prediabetes as the denominator and the number of patients who had each type of discussion listed above as the numerator (Stata, version 15). This study was approved by the Johns Hopkins IRB.
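The proportion calculation described above is simple enough to sketch directly. The study used Stata; the Python version below is only an illustrative equivalent, and the category labels and the function name `discussion_proportions` are hypothetical.

```python
from collections import Counter

# HYPOTHETICAL labels for the five discussion types listed above.
CATEGORIES = [
    "labs",
    "lifestyle_counseling",
    "dpp",
    "nutrition",
    "metformin",
]


def discussion_proportions(annotations):
    """Return the share of annotated records with each discussion type.

    `annotations` holds one set of category labels per record that had a
    documented prediabetes discussion; its length is the denominator.
    """
    n = len(annotations)
    if n == 0:
        raise ValueError("no annotated records to summarize")
    counts = Counter(cat for labels in annotations for cat in set(labels))
    return {cat: counts[cat] / n for cat in CATEGORIES}
```

Because a single record can carry several labels, the proportions are not expected to sum to 1.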


In phase 2, 322 of 391 encounter notes (82%) matching ≥ 1 keyword had documentation of prediabetes discussions. NLP and machine learning classification results were close to human performance. Logistic regression models achieved a sensitivity of 0.961, specificity of 0.923, PPV of 0.967, and NPV of 0.907. Convolutional neural networks achieved a sensitivity of 0.979, specificity of 0.956, PPV of 0.979, and NPV of 0.956. Table 1 describes the keywords in our sample.

PCPs commonly provided lifestyle counseling (78%), reviewed current labs (67%), and ordered follow-up labs (60%) (Table 2). PCPs discussed or referred patients to a nutritionist infrequently (3%). There were no discussions or referrals to a DPP. Metformin was discussed, ordered, or continued in < 2% of patients.

Table 2 PCP Management of Prediabetes Documented in Clinical Encounters (N= 322 Encounters)


We developed and validated the first NLP tool that identifies clinical discussions about prediabetes from unstructured EHR data. PCPs underutilize the prediabetes diagnosis code (structured data); in a large EHR study, only 13% of patients were given a diagnosis code.2 Structured EHR data alone are therefore insufficient to identify visits where prediabetes is addressed. Few studies have used NLP methods to identify discussions about chronic conditions5 or lifestyle counseling.6

Consistent with prior findings, PCPs most commonly addressed prediabetes through lifestyle change counseling.1 However, PCPs infrequently referred to nutrition or DPPs. Although this institution has a DPP, it is community-based and not integrated into the clinical setting, which may explain the lack of DPP referrals. PCPs commonly reviewed and ordered follow-up labs, although prior work suggests low completion rates.2

Our results are from one academic medical center and from a small sample of a larger population of patients with prediabetes. Our descriptive outcomes are based on provider visit documentation, and providers may not have documented all the details of their verbal discussions. We did not capture discussions that may have occurred through EHR patient-provider messaging or telephone documentation.

In conclusion, we developed and validated an NLP tool that identifies clinical discussions about prediabetes in the EHR. As diabetes prevention efforts expand, this novel tool may help track PCP practices that are not captured by identifiable tasks in structured data.