Can Patients with Dementia Be Identified in Primary Care Electronic Medical Records Using Natural Language Processing?


Dementia and mild cognitive impairment can be underrecognized in primary care practice and research. Free-text fields in electronic medical records (EMRs) are a rich source of information which might support increased detection and enable a better understanding of populations at risk of dementia. We used natural language processing (NLP) to identify dementia-related features in EMRs and compared the performance of supervised machine learning models to classify patients with dementia. We assembled a cohort of primary care patients aged 66+ years in Ontario, Canada, from EMR notes collected until December 2016: 526 with dementia and 44,148 without dementia. We identified dementia-related features by applying published lists, clinician input, and NLP with word embeddings to free-text progress and consult notes and organized features into thematic groups. Using machine learning models, we compared the performance of features to detect dementia, overall and during time periods relative to dementia case ascertainment in health administrative databases. Over 900 dementia-related features were identified and grouped into eight themes (including symptoms, social, function, cognition). Using notes from all time periods, LASSO had the best performance (F1 score: 77.2%, sensitivity: 71.5%, specificity: 99.8%). Model performance was poor when notes written before case ascertainment were included (F1 score: 14.4%, sensitivity: 8.3%, specificity: 99.9%) but improved as later notes were added. While similar models may eventually improve recognition of cognitive issues and dementia in primary care EMRs, our findings suggest that further research is needed to identify which additional EMR components might be useful to promote early detection of dementia.
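
The reported metrics can be related to one another with a little arithmetic. The sketch below is illustrative only and is not the authors' analytic code: it assumes the F1 score is the usual harmonic mean of precision and sensitivity for the dementia class, and the function names `confusion_metrics` and `implied_precision` are hypothetical. Under that assumption, the all-periods LASSO figures (F1 77.2%, sensitivity 71.5%) imply a precision of roughly 84%, a value the abstract does not itself report and which is approximate because the inputs are rounded.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true cases that were flagged
    specificity = tn / (tn + fp)   # share of non-cases correctly left unflagged
    precision = tp / (tp + fp)     # share of flagged patients who are true cases
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def implied_precision(f1: float, sensitivity: float) -> float:
    """Solve F1 = 2PR / (P + R) for precision P, given F1 and sensitivity R."""
    denom = 2 * sensitivity - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than twice the sensitivity")
    return f1 * sensitivity / denom


if __name__ == "__main__":
    # Figures taken from the abstract (all time periods, LASSO).
    p = implied_precision(f1=0.772, sensitivity=0.715)
    print(f"Implied precision: {p:.3f}")
```

Because cases are rare in this cohort (526 of 44,674 patients), specificity stays near 100% even when many cases are missed, which is why F1 and sensitivity are the more informative figures here.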

Data Availability

The dataset from this study is held securely in coded form at ICES. While legal data sharing agreements between ICES and data providers (e.g., healthcare organizations and government) prohibit ICES from making the dataset publicly available, access may be granted to those who meet pre-specified criteria for confidential access, available at (email:

Code Availability

The full dataset creation plan and underlying analytic code are available from the authors upon request, understanding that the computer programs may rely upon coding templates or macros that are unique to ICES and are therefore either inaccessible or may require modification.



This study was supported by ICES, which is funded by an annual grant from the Ontario Ministry of Health (MOH) and the Ministry of Long-Term Care (MLTC). This document used data adapted from the Statistics Canada Postal CodeOM Conversion File, which is based on data licensed from Canada Post Corporation, and/or data adapted from the Ontario Ministry of Health Postal Code Conversion File, which contains data copied under license from ©Canada Post Corporation and Statistics Canada. Parts of this material are based on data and information compiled and provided by CIHI and the Ontario Ministry of Health. We thank IQVIA Solutions Canada Inc. for the use of their Drug Information File.


This study was supported by the Ontario Neurodegenerative Disease Research Initiative (ONDRI) through the Ontario Brain Institute, an independent non-profit corporation, funded partially by the Ontario government. MA is funded by the Canadian Institutes of Health Research Vanier Scholarship Program. DAH is funded by an Alzheimer Society of Canada Research Program Doctoral Award.

Author information


Corresponding author

Correspondence to Susan E. Bronskill.

Ethics declarations


The analyses, conclusions, opinions and statements expressed herein are solely those of the authors and do not reflect those of the funding or data sources; no endorsement is intended or should be inferred.

Conflict of Interest

The authors declare no competing interests.

About this article

Cite this article

Maclagan, L.C., Abdalla, M., Harris, D.A. et al. Can Patients with Dementia Be Identified in Primary Care Electronic Medical Records Using Natural Language Processing?. J Healthc Inform Res 7, 42–58 (2023).


Keywords

  • Electronic health records
  • Dementia
  • Primary health care
  • Artificial intelligence
  • Natural language processing