
Managing Data Quality for a Drug Safety Surveillance System



The objective of this study is to present a data quality assurance program for disparate data sources loaded into a Common Data Model, and to highlight the data quality issues identified and the resolutions implemented.


The Observational Medical Outcomes Partnership is conducting methodological research to develop a system to monitor drug safety. Standard processes and tools are needed to ensure continuous data quality across a network of disparate databases, and to ensure that extract-transform-load (ETL) procedures maintain data integrity. Currently, there is no consensus or standard approach for evaluating the quality of the source data or of the ETL procedures.
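As an illustration of what checking that an ETL step maintains data integrity could look like, the following is a minimal sketch that reconciles values before and after a load. The function name `reconcile`, the record keys, and the sample quantities are hypothetical and are not taken from the study; the sketch only shows the general idea of comparing source and loaded values.

```python
from typing import Dict, List, Tuple


def reconcile(source: Dict[str, float], loaded: Dict[str, float],
              tolerance: float = 1e-9) -> List[Tuple[str, str]]:
    """Compare values keyed by record id before and after an ETL step."""
    issues = []
    for key, src_value in source.items():
        if key not in loaded:
            # Record was dropped somewhere between extract and load.
            issues.append((key, "missing after load"))
        elif abs(loaded[key] - src_value) > tolerance:
            # Value was altered, for example by unintended rounding.
            issues.append((key, "value changed during load"))
    for key in loaded:
        if key not in source:
            # Record appeared in the target without a source counterpart.
            issues.append((key, "not present in source"))
    return issues


# Hypothetical drug quantities keyed by an illustrative prescription id.
source_quantity = {"rx1": 30.0, "rx2": 2.5, "rx3": 90.0}
loaded_quantity = {"rx1": 30.0, "rx2": 3.0}
issues = reconcile(source_quantity, loaded_quantity)
```

Here a rounded quantity and a dropped record are both flagged. In practice such a comparison would probably run on aggregates or samples rather than on every record.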


We propose a framework for a comprehensive process to ensure data quality throughout the steps used to process and analyze the data. The approach used to manage data anomalies includes: (1) characterization of data sources; (2) detection of data anomalies; (3) determining the cause of data anomalies; and (4) remediation.


Data anomalies included incomplete raw datasets (no race or year of birth recorded) and implausible data (year of birth exceeding the current year, observation period end date preceding the start date, and suspicious data frequencies and proportions outside the normal range). Examples of errors found in the ETL process were zip codes incorrectly loaded, drug quantities rounded, drug exposure length incorrectly calculated, and condition length incorrectly programmed.
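The completeness and plausibility anomalies listed above lend themselves to simple rule-based checks. The sketch below is an assumption about how such rules might be expressed, not the tooling used in the study; the `PersonRecord` fields, the `find_anomalies` function, and the sample records are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class PersonRecord:
    # Hypothetical, simplified record; field names are illustrative only.
    person_id: int
    year_of_birth: Optional[int]
    race: Optional[str]
    observation_start: date
    observation_end: date


def find_anomalies(rec: PersonRecord, current_year: int) -> List[str]:
    """Return a label for every rule the record violates."""
    problems = []
    if rec.year_of_birth is None:
        problems.append("missing year of birth")
    elif rec.year_of_birth > current_year:
        problems.append("year of birth exceeds current year")
    if rec.race is None:
        problems.append("missing race")
    if rec.observation_end < rec.observation_start:
        problems.append("observation period ends before it starts")
    return problems


# Two made-up records: one clean, one violating three rules.
records = [
    PersonRecord(1, 1970, "category_a", date(2005, 1, 1), date(2009, 12, 31)),
    PersonRecord(2, 2031, None, date(2008, 6, 1), date(2007, 6, 1)),
]
report = {r.person_id: find_anomalies(r, current_year=2012) for r in records}
```

Passing `current_year` explicitly keeps the check reproducible when it is rerun against the same data extract at a later date.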


Complete and reliable observational data are difficult to obtain, and data quality assurance needs to be continuous because the data are regularly updated; consequently, processes to assess data quality should be ongoing and transparent.






The Observational Medical Outcomes Partnership is funded by the Foundation for the National Institutes of Health (FNIH) through generous contributions from the following: Abbott, Amgen Inc., AstraZeneca, Bayer Healthcare Pharmaceuticals, Inc., Biogen Idec, Bristol-Myers Squibb, Eli Lilly & Company, GlaxoSmithKline, Janssen Research and Development, Lundbeck, Inc., Merck & Co., Inc., Novartis Pharmaceuticals Corporation, Pfizer Inc, Pharmaceutical Research and Manufacturers of America (PhRMA), Roche, Sanofi-aventis, Schering-Plough Corporation, and Takeda. Drs. Schuemie, Stang, and Ryan are employees of Janssen Research and Development. Dr. Schuemie received a fellowship from the Office of Medical Policy, Center for Drug Evaluation and Research, Food and Drug Administration. Dr. Reich is an employee of AstraZeneca. Drs. Schuemie, Madigan, and Hartzema have previously received funding from FNIH. J. Marc Overhage and Emily Welebob have no conflicts of interest to declare.

The authors thank and acknowledge the contributions of the OMOP Distributed Research Partners in phases of this research, who were supported by a grant from FNIH. Assistance with writing and manuscript preparation was provided by Ken Scholz, PhD, with financial support from FNIH.

This article was published in a supplement sponsored by the Foundation for the National Institutes of Health (FNIH). The supplement was guest edited by Stephen J.W. Evans. It was peer reviewed by Olaf H. Klungel, who received a small honorarium to cover out-of-pocket expenses. S.J.W.E. has received travel funding from the FNIH to travel to the OMOP symposium and received a fee from FNIH for the review of a protocol for OMOP. O.H.K. has received funding for the IMI-PROTECT project from the Innovative Medicines Initiative Joint Undertaking under Grant Agreement no 115004, resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007–2013) and EFPIA companies’ in-kind contribution.

Author information



Corresponding author

Correspondence to Abraham G. Hartzema.

Additional information

The OMOP research used data from Truven Health Analytics (formerly the Health Business of Thomson Reuters), including the MarketScan® Research Databases: MarketScan Lab Supplemental (MSLR, 1.2 m persons), MarketScan Medicare Supplemental Beneficiaries (MDCR, 4.6 m persons), MarketScan Multi-State Medicaid (MDCD, 10.8 m persons), and MarketScan Commercial Claims and Encounters (CCAE, 46.5 m persons). Data were also provided by the Quintiles® Practice Research Database (formerly General Electric’s Electronic Health Record database, 11.2 m persons). GE is an electronic health record database, while the other four databases contain administrative claims data.


About this article

Cite this article

Hartzema, A.G., Reich, C.G., Ryan, P.B. et al. Managing Data Quality for a Drug Safety Surveillance System. Drug Saf 36, 49–58 (2013).



  • Target Concept
  • Data Anomaly
  • Common Data Model
  • Observational Medical Outcome Partnership
  • Data Quality Assurance