An Interoperable Similarity-based Cohort Identification Method Using the OMOP Common Data Model Version 5.0

  • Shreya Chakrabarti
  • Anando Sen
  • Vojtech Huser
  • Gregory W. Hruby
  • Alexander Rusanov
  • David J. Albers
  • Chunhua Weng
Research Article


Cohort identification for clinical studies tends to be laborious, time-consuming, and expensive. Developing automated or semi-automated methods for cohort identification is one of the “holy grails” of biomedical informatics. We propose a high-throughput similarity-based cohort identification algorithm that applies numerical abstractions to electronic health record (EHR) data. We implement this algorithm using the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), so that sites using this standardized EHR data representation can adopt the algorithm with minimal local implementation effort. We validate its performance on a retrospective cohort identification task for six clinical trials conducted at the Columbia University Medical Center. Our algorithm achieves an average area under the curve (AUC) of 0.966 and an average Precision at 5 of 0.983. This interoperable method promises efficient cohort identification in EHR databases. We discuss suitable applications of our method, describe its limitations, and propose future work.


Cohort identification; Electronic health records (EHR); Observational Medical Outcomes Partnership (OMOP); Similarity-based; Phenotype; Case-based reasoning (CBR)



This project was funded by the National Library of Medicine (Grant R01LM009886; PI: Weng) and the National Center for Advancing Translational Sciences (Grant UL1TR001873; PI: Reilly).

Authors’ contributions

Chunhua Weng designed and supervised the research and edited the manuscript substantially. Shreya Chakrabarti performed data extraction, data analysis, writing, and MATLAB software development for the algorithm. Vojtech Huser developed R software for the OMOP CDM-based data extraction. Anando Sen and David J. Albers contributed to the methods and the manuscript editing. Gregory W. Hruby and Alexander Rusanov contributed to the development of the conceptual framework of this research.

Compliance with ethical standards

Conflict of interest




Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Biomedical Informatics, Columbia University, New York, USA
  2. National Institutes of Health, National Library of Medicine, Bethesda, USA
  3. Department of Anesthesiology, Columbia University, New York, USA
