The freetext matching algorithm: a computer program to extract diagnoses and causes of death from unstructured text in electronic health records
Electronic health records are invaluable for medical research, but much information is stored as free text rather than in a coded form. For example, in the UK General Practice Research Database (GPRD), causes of death and test results are sometimes recorded only in free text. Free text can be difficult to use for research if it requires time-consuming manual review. Our aim was to develop an automated method for extracting coded information from free text in electronic patient records.
We reviewed the electronic patient records in GPRD of a random sample of 3310 patients who died in 2001, to identify the cause of death. We developed a computer program called the Freetext Matching Algorithm (FMA) to map diagnoses in text to the Read Clinical Terminology. The program uses lookup tables of synonyms and phrase patterns to identify diagnoses, dates and selected test results. We tested it on two random samples of free text from GPRD (1000 texts associated with death in 2001, and 1000 general texts from cases and controls in a coronary artery disease study), comparing the output to the U.S. National Library of Medicine’s MetaMap program and the gold standard of manual review.
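The lookup-table approach described above can be illustrated with a minimal sketch. It is not the FMA implementation. The synonym table, the placeholder codes (which are not real Read codes), the date pattern, and the function name `extract` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical synonym table: phrase -> placeholder code.
# These are NOT real Read codes; a real table would be far larger.
SYNONYMS = {
    "myocardial infarction": "CODE_MI",
    "heart attack": "CODE_MI",
    "mi": "CODE_MI",
    "lung cancer": "CODE_LUNG_CA",
    "carcinoma of lung": "CODE_LUNG_CA",
}

# Hypothetical phrase pattern for dates written as dd/mm/yyyy.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def extract(text):
    """Return (sorted list of codes, list of date strings) found in text."""
    lowered = text.lower()
    codes = set()
    for phrase, code in SYNONYMS.items():
        # Word boundaries stop short synonyms such as "mi" from
        # matching inside longer words such as "admitted".
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            codes.add(code)
    dates = DATE_PATTERN.findall(text)
    return sorted(codes), dates


print(extract("Died 12/03/2001 of heart attack; known carcinoma of lung"))
```

Several synonyms map to the same code, so different wordings of one diagnosis collapse to a single coded concept, which is the point of mapping text to a terminology.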
Among 3310 patients registered in the GPRD who died in 2001, the cause of death was recorded in coded form in 38.1% of patients, and in the free text alone in 19.4%. On the 1000 texts associated with death, FMA coded 683 of the 735 positive diagnoses, with precision (positive predictive value) 98.4% (95% confidence interval (CI) 97.2, 99.2) and recall (sensitivity) 92.9% (95% CI 90.8, 94.7). On the general sample, FMA detected 346 of the 447 positive diagnoses, with precision 91.5% (95% CI 88.3, 94.1) and recall 77.4% (95% CI 73.2, 81.2), which was similar to MetaMap.
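The recall figures above follow directly from the reported counts, as the sketch below shows. The text does not say how the confidence intervals were calculated, so the Wilson score interval used here is an assumption, and its limits may differ slightly from the published ones. Precision is not recomputed because the number of false positive diagnoses is not given here.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (assumed method, see lead-in)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Counts taken from the results: diagnoses found / positive diagnoses.
samples = [("death texts", 683, 735), ("general texts", 346, 447)]

for label, found, positives in samples:
    recall = found / positives
    low, high = wilson_interval(found, positives)
    print(f"{label}: recall {recall:.1%}, 95% CI ({low:.1%}, {high:.1%})")
```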
We have developed an algorithm to extract coded information from free text in GP records with good precision. It may facilitate research using free text in electronic patient records, particularly for extracting the cause of death.
BMC Medical Informatics and Decision Making
- Online Date: August 2012
- Publisher: BioMed Central