Mining clinical text for stroke prediction

  • Elham Sedghi
  • Jens H. Weber
  • Alex Thomo
  • Maximilian Bibok
  • Andrew M. W. Penn
Original Article


One of the main problems in treating stroke patients is accurate and timely triage and assessment. Not all stroke events have direct, severe consequences. Full strokes are often preceded by transient ischemic attacks (TIA), or mini strokes, which exhibit signs and symptoms similar to those of less concerning health events, e.g., migraines. In this paper, we present natural language processing techniques for processing a large collection of narrative medical descriptions and extracting features that can subsequently be used for automatic classification by data mining algorithms. We reviewed 5658 cases and analyzed the chief complaint and history of patient illness reported at the Stroke Rapid Assessment Unit (SRAU) at Victoria General Hospital (VGH). Data were collected by neurologists and stroke nurses between 2008 and 2013. Based on a clinician-supplied list of important sign and symptom terms, we translated narrative medical text into well-codified sentences, achieving high agreement with a human expert. Data mining algorithms were then applied to the codified data, yielding not only prediction models but also importance weights for the codified terms. We provide an extensive experimental evaluation of several classifiers, trained on past data to predict new cases. Notably, we achieved a sensitivity of about 84 % and a specificity of 64 % using support vector machines (SVM). The top terms identified by the data mining algorithms were responsible for most of the prediction quality; therefore, they can be used to build a questionnaire-like online application that serves as a first-line screening tool in triage for distinguishing stroke/TIA from mimics and helps triage staff decide between further treatment and discharging the patient.
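The two ideas the abstract leans on, codifying narrative text against a term list and scoring a classifier by sensitivity and specificity, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the term list, the negation cues, the three-token look-back window, and the function names are hypothetical and are not taken from the study, whose actual term list and negation handling are not given in this section.

```python
import re

# Hypothetical term list and negation cues, for illustration only;
# the clinician-supplied list used in the study is not reproduced here.
TERMS = ["weakness", "numbness", "headache", "vertigo"]
NEGATION_CUES = {"no", "not", "denies", "without"}
WINDOW = 3  # assumed look-back: number of preceding tokens scanned for a cue


def codify(narrative):
    """Map free text to {term: 1 (present), -1 (negated), 0 (not mentioned)}.

    If a term occurs more than once, the last occurrence decides its value.
    """
    tokens = re.findall(r"[a-z']+", narrative.lower())
    features = {term: 0 for term in TERMS}
    for i, token in enumerate(tokens):
        if token in features:
            preceding = tokens[max(0, i - WINDOW):i]
            negated = any(cue in preceding for cue in NEGATION_CUES)
            features[token] = -1 if negated else 1
    return features


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Labels: 1 = stroke/TIA, 0 = mimic. Assumes both classes occur in y_true,
    otherwise one of the denominators is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    print(codify("Patient reports left arm weakness, denies headache or vertigo."))
    print(sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```

Sensitivity is the fraction of true stroke/TIA cases the classifier flags, and specificity is the fraction of mimics it correctly clears. In a triage screening setting, a missed stroke/TIA is generally the costlier error, which is consistent with the higher sensitivity than specificity reported above.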


Keywords: Support Vector Machine; Receiver Operating Characteristic Curve; Natural Language Processing; Negative Word; Data Mining Algorithm



The authors would like to acknowledge Kristine Votova, Ph.D., the project manager for the SpecTRA Research Project, and the Island Health clinical research team at the Stroke Rapid Assessment Unit for their support. Funding for the natural experiment in stroke care and the large-scale personalized medicine for mass spectrometry in rapid TIA triage comes from the Canadian Institutes of Health Research (2009–2012) and Genome Canada/BC (2013–2017).



Copyright information

© Springer-Verlag Wien 2015

Authors and Affiliations

  • Elham Sedghi (1)
  • Jens H. Weber (1)
  • Alex Thomo (1)
  • Maximilian Bibok (2)
  • Andrew M. W. Penn (2)

  1. Department of Computer Science, University of Victoria, Victoria, Canada
  2. SpecTRA Research Project, Vancouver Island Health Authority, Victoria, Canada
