Biomedical Literature Mining, pp 269-286
Mining the Electronic Health Record for Disease Knowledge
Abstract
The growing amount and availability of electronic health record (EHR) data present enhanced opportunities for discovering new knowledge about diseases. In the past decade, there has been an increasing number of data and text mining studies focused on the identification of disease associations (e.g., disease–disease, disease–drug, and disease–gene) in structured and unstructured EHR data. This chapter presents a knowledge discovery framework for mining the EHR for disease knowledge and describes each step for data selection, preprocessing, transformation, data mining, and interpretation/validation. Topics including natural language processing, standards, and data privacy and security are also discussed in the context of this framework.
Key words
Electronic health record, Knowledge discovery in databases, Data mining, Text mining, Natural language processing, Data warehouse, Data privacy and security, Standards
Notes
Acknowledgment
The example clinical note in Fig. 2d was obtained with permission from MTSamples (http://www.mtsamples.com). This work was supported in part by the National Library of Medicine of the National Institutes of Health under award number R01LM011364. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.