The freetext matching algorithm: a computer program to extract diagnoses and causes of death from unstructured text in electronic health records
Electronic health records are invaluable for medical research, but much information is stored as free text rather than in a coded form. For example, in the UK General Practice Research Database (GPRD), causes of death and test results are sometimes recorded only in free text. Free text can be difficult to use for research if it requires time-consuming manual review. Our aim was to develop an automated method for extracting coded information from free text in electronic patient records.
We reviewed the electronic patient records in GPRD of a random sample of 3310 patients who died in 2001, to identify the cause of death. We developed a computer program called the Freetext Matching Algorithm (FMA) to map diagnoses in text to the Read Clinical Terminology. The program uses lookup tables of synonyms and phrase patterns to identify diagnoses, dates and selected test results. We tested it on two random samples of free text from GPRD (1000 texts associated with death in 2001, and 1000 general texts from cases and controls in a coronary artery disease study), comparing the output to the U.S. National Library of Medicine’s MetaMap program and the gold standard of manual review.
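The lookup-table approach described above can be illustrated with a minimal Python sketch. This is not the FMA implementation: the synonym table, the placeholder codes (which are not real Read codes), the negation cues, and the function names `tokenize` and `match_diagnoses` are all hypothetical, chosen only to show how phrase lookup with longest-match preference might work.

```python
import re

# Hypothetical synonym table: phrase -> placeholder code.
# These are illustrative labels, NOT real Read Clinical Terminology codes.
SYNONYMS = {
    "myocardial infarction": "TERM_MI",
    "heart attack": "TERM_MI",
    "bronchopneumonia": "TERM_BRONCHOPNEUMONIA",
    "ca lung": "TERM_LUNG_CANCER",
}

# Hypothetical negation cues, checked in a short window before each match.
# Negation handling is an assumption of this sketch, not a described FMA rule.
NEGATION_CUES = ("no", "not", "without", "excluded")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_diagnoses(text, max_phrase_len=3, window=3):
    """Return (code, matched phrase, negated) for each table hit in the text."""
    tokens = tokenize(text)
    results = []
    i = 0
    while i < len(tokens):
        matched = False
        # Prefer the longest phrase that starts at position i.
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in SYNONYMS:
                preceding = tokens[max(0, i - window):i]
                negated = any(cue in preceding for cue in NEGATION_CUES)
                results.append((SYNONYMS[phrase], phrase, negated))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return results


if __name__ == "__main__":
    for hit in match_diagnoses("Died of bronchopneumonia. No evidence of heart attack."):
        print(hit)
```

A real system would need a far larger table, spelling correction, and handling of dates and test results, none of which this sketch attempts.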
Among 3310 patients registered in the GPRD who died in 2001, the cause of death was recorded in coded form in 38.1% of patients, and in the free text alone in 19.4%. On the 1000 texts associated with death, FMA coded 683 of the 735 positive diagnoses, with precision (positive predictive value) 98.4% (95% confidence interval (CI) 97.2, 99.2) and recall (sensitivity) 92.9% (95% CI 90.8, 94.7). On the general sample, FMA detected 346 of the 447 positive diagnoses, with precision 91.5% (95% CI 88.3, 94.1) and recall 77.4% (95% CI 73.2, 81.2); this performance was similar to that of MetaMap.
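The recall figures follow directly from the reported counts, as the short Python sketch below shows. The Wilson score interval used here is an assumption: the text does not state which confidence interval method was used, so the bounds may differ slightly from the published ones. The function names `recall` and `wilson_interval` are illustrative.

```python
import math


def recall(detected, total_positive):
    """Recall (sensitivity): detected positives over all true positives."""
    return detected / total_positive


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    Assumption: the paper's CI method is not stated; Wilson is one common choice.
    """
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Counts taken from the results above.
    samples = [("death texts", 683, 735), ("general texts", 346, 447)]
    for label, detected, total in samples:
        low, high = wilson_interval(detected, total)
        print(f"{label}: recall {100 * recall(detected, total):.1f}% "
              f"(approx. 95% CI {100 * low:.1f}, {100 * high:.1f})")
```

Precision cannot be recomputed the same way, because the number of false positives is not given in the text.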
We have developed an algorithm to extract coded information from free text in GP records with good precision. It may facilitate research using free text in electronic patient records, particularly for extracting the cause of death.
BMC Medical Informatics and Decision Making
- Online Date: August 2012
- Publisher: BioMed Central