Despite the success of hospital-based patient safety efforts, progress to improve the safety of primary care has lagged.1–4 A recent Institute of Medicine (IOM) report “Improving Diagnosis in Health Care”5 highlights the safety implications of diagnostic errors, which are one of the most common types of medical errors in primary care.6–13 These errors are estimated to affect about one in 20 US adults in outpatient settings annually14 and are the leading basis for ambulatory malpractice claims.7,15 Diagnostic errors have remained under-studied in patient safety research,12,16 partly because they are difficult to measure.17–20 Measurement of diagnostic errors often depends heavily on detailed retrospective review of patients’ medical records. Clinicians do not always agree on the presence or absence of error, and details about the clinical situation are often absent when making judgments in hindsight.21,22 Additionally, diagnoses often require additional testing or consultations for confirmation and evolve over time.23 Not surprisingly, studies consistently demonstrate low inter-physician agreement, or accuracy, on medical record reviews for diagnostic errors.24–30

National initiatives such as maintenance of certification and physician quality reporting systems have placed an increasing emphasis on ambulatory quality and safety. The IOM report on improving diagnosis5 also recommends a comprehensive and rigorous methodology to measure diagnostic errors to advance the science in this area and reduce their burden.22,31–34 In our previous work, we used judgments from multiple physician-raters to determine diagnostic error in selected primary care visit-records.9,35–37 We defined diagnostic errors as missed opportunities to make a correct or timely diagnosis based on the available evidence, regardless of patient harm.35 We considered diagnostic errors to have occurred when at least two independent physician reviewers confirmed their presence. While reviewers used a structured data collection instrument to help them evaluate the records, they relied on subjective assessments to make judgments. Despite extensive training and calibration efforts, the reviewers only reached fair agreement.36 To facilitate better measurement through medical record reviews, we developed a new structured instrument consisting of objective criteria to improve the accuracy of assessing diagnostic errors.


Study Design

After institutional review board approval, we gathered questions from several previously used instruments for diagnostic error measurement10,16,35 and used an operational definition of diagnostic error37 to develop an initial draft of the instrument. We iteratively refined our instrument through pilot medical record reviews and multidisciplinary input, and tested the accuracy of the final instrument by conducting reviews of a sample of patients with and without diagnostic errors.

Study Setting

The study site was a large urban VA facility with 35 full-time primary care providers (PCPs), including physicians, physician assistants, and nurse practitioners, providing comprehensive care to approximately 50,000 patients. It had an integrated and well-established electronic health record (EHR), and large clinic networks through which it provided longitudinal care to ethnically and socioeconomically diverse patients from rural and urban areas. Most PCPs were physicians, some of whom supervised residents, and visits included scheduled follow-up visits and “drop-in” unscheduled visits.

Instrument Development

We developed a 12-item rating instrument (the Safer Dx Instrument) for the purpose of determining the presence or absence of diagnostic error for a specific episode of care. Our team consisted of five practicing clinicians (three of whom were also diagnostic error and/or quality improvement experts), a psychometrician, and a cognitive psychologist. We first sought existing content from instruments previously used in research on diagnostic error measurement.10,16,35 We then adapted some items from these previous instruments and added items to address important aspects of the diagnostic process such as history-taking, physical examination, test ordering, and test interpretation. All of the questions were intended to identify missed opportunities in diagnosis using criteria developed through several previous studies.9,35,36 We relied heavily on three clinical criteria found to be useful in our previous work to determine the presence or absence of diagnostic errors, i.e., case analysis reveals evidence of missed opportunity to make a correct or timely diagnosis; missed opportunity was framed within the context of an “evolving” diagnostic process; and opportunity could be missed by the provider, care team, system, and/or patient (see online Supplementary Appendix for details on criteria and instrument development).37

The final version of the Safer Dx Instrument consisted of 11 questions regarding the appropriateness of the diagnostic process and one summary question regarding the overall impression of diagnostic error (Table 1). Items were scored from 1 (strongly agree an error occurred) to 6 (strongly disagree that an error occurred), with the exception of three items (items 6, 9, and 10) that were reverse scored. Items were rated on a six-point Likert scale in order to allow for “gray areas” in the determination of diagnostic error (i.e., we did not want to force someone to say “absolutely an error” vs. “absolutely not an error,” but instead select response options that were less definite). However, to directly compare the overall impression of diagnostic error in item 12 to a previous sample of patients with and without diagnostic errors, item 12 (the main outcome) was dichotomized, such that 1 to 3 represented diagnostic error and 4 to 6 represented absence of diagnostic error (alternate ways to dichotomize are included in the online Appendix Table).
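The scoring rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and we assume that reverse scoring on a six-point scale means replacing a rating r with 7 − r, which is the conventional approach but is not spelled out in the text.

```python
# Illustrative sketch only; all names here are hypothetical, not from the study.

# Per the text, items 6, 9, and 10 are reverse scored.
REVERSE_SCORED_ITEMS = {6, 9, 10}


def reverse_score(rating: int) -> int:
    """Flip a 1-6 rating so that 1 <-> 6, 2 <-> 5, 3 <-> 4 (assumed convention)."""
    if not 1 <= rating <= 6:
        raise ValueError("rating must be between 1 and 6")
    return 7 - rating


def align_ratings(raw: dict[int, int]) -> dict[int, int]:
    """Return ratings oriented so that lower values always point toward error."""
    aligned = {}
    for item, rating in raw.items():
        if not 1 <= rating <= 6:
            raise ValueError(f"item {item}: rating must be between 1 and 6")
        aligned[item] = reverse_score(rating) if item in REVERSE_SCORED_ITEMS else rating
    return aligned


def dichotomize_item12(rating: int) -> bool:
    """Main outcome: ratings 1-3 count as error (True), 4-6 as no error (False)."""
    if not 1 <= rating <= 6:
        raise ValueError("rating must be between 1 and 6")
    return rating <= 3
```

For example, `align_ratings({1: 2, 6: 2})` leaves item 1 at 2 and flips item 6 to 5, and `dichotomize_item12(3)` classifies the case as an error.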

Table 1. The Safer Dx Instrument: Items for Determining Presence or Absence of Diagnostic Error in a Primary Care Encounter

Two physicians on our multidisciplinary team (AA and CD) pilot tested the instrument and provided feedback, which was used in team meetings for further refinement. The instrument was further refined through an iterative process of reviews by five additional practicing physicians outside of this team to ensure content and face validity. This type of approach is consistent with standard survey item development practices.38 Details on pilot testing are provided in the online Appendix. The chart reviewer, an actively practicing board-certified primary care physician (AA) with experience in EHR and patient safety projects, was trained extensively on record reviews.


We tested the Safer Dx Instrument using a cohort of 389 patients with and without diagnostic errors (n = 129 and n = 260, respectively) from the VA site in our prior study.35 At this VA study site, 1300 records had been selected for review: 886 using a “trigger” algorithm to identify patients with possible diagnostic errors based on unexpected hospitalizations and return visits, and 414 as “trigger negative” controls. After exclusion of false positives with no or minimal information available for error assessment, 1169 records remained and were reviewed in detail by at least two independent raters to determine the presence or absence of diagnostic errors. Patients were mostly male (93.8 %); 56.8 % were White and 39 % were Black. The cases represented a heterogeneous group of common medical conditions seen in the primary care setting and were independent of cases used to develop and pilot-test the earlier draft of the instrument.


The physician-reviewer, who was blinded to the diagnostic error outcome, reviewed medical records from all 389 patients and completed the Safer Dx Instrument for each. Clinical details about care processes at an index primary care visit and subsequent visits were determined through detailed reviews of the EHR. The reviewer evaluated EHR data up to 1 year after the index visit to help determine the clinical context. A second reviewer (board certified in internal medicine, but otherwise with similar familiarity with EHRs) independently assessed a random sample of 30 records from the testing data set (ten with and 20 without errors).

Statistical Analysis

We calculated the Safer Dx Instrument’s overall sensitivity, specificity, positive predictive value, and negative predictive value by comparing the main, dichotomized outcome from item 12 (1–3 = error, 4–6 = no error as determined by the single physician using the instrument) to results obtained in the previous study.35 Accuracy was defined as physician agreement with presence or absence of diagnostic errors as compared to our previous study results for all 389 cases.35
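As a minimal sketch of how these measures follow from paired classifications, the Python function below tallies a two-by-two table and derives each statistic. The function name and the toy data in the usage note are hypothetical and are not study data; the reference standard is assumed to be the error status from the previous study.

```python
# Illustrative sketch only; names and any example data are hypothetical.

def screening_metrics(reference: list[bool], instrument: list[bool]) -> dict[str, float]:
    """Compare instrument calls against a reference standard (True = error)."""
    if len(reference) != len(instrument):
        raise ValueError("both lists must have one entry per case")

    pairs = list(zip(reference, instrument))
    tp = sum(1 for ref, ins in pairs if ref and ins)          # error, flagged
    fn = sum(1 for ref, ins in pairs if ref and not ins)      # error, missed
    fp = sum(1 for ref, ins in pairs if not ref and ins)      # no error, flagged
    tn = sum(1 for ref, ins in pairs if not ref and not ins)  # no error, not flagged

    def ratio(numerator: int, denominator: int) -> float:
        # Undefined proportions (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }
```

With a made-up set of eight cases, `screening_metrics([True, True, True, False, False, False, False, False], [True, True, False, True, False, False, False, False])` gives a sensitivity of 2/3, a specificity of 4/5, and an accuracy of 0.75.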

Additionally, we examined whether any of the 11 diagnostic process items were related to the main outcome (i.e., the rater’s overall impression of diagnostic error) by computing both Spearman correlation coefficients (using the six-point scaled outcome) and Pearson correlation coefficients (using the dichotomized outcome). All items that were significantly correlated with the main outcome were entered into a factor analysis with varimax rotation to identify any higher-order dimensions represented by clusters of items. We kept dimensions with eigenvalues over Kaiser’s criterion of 1 and assessed the internal consistency of the resulting dimensions using Cronbach’s alpha.
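For readers unfamiliar with Cronbach’s alpha, the sketch below computes it from the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items in a dimension. It uses only the Python standard library; the function name is hypothetical and the code is not the authors’ analysis script.

```python
# Illustrative sketch only; not the analysis code used in the study.
from statistics import variance


def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """item_scores holds one list per item, each with one rating per case."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are needed")
    n_cases = len(item_scores[0])
    if any(len(scores) != n_cases for scores in item_scores):
        raise ValueError("every item needs a rating for every case")

    # Total score per case, summing across the items of the dimension.
    totals = [sum(case_ratings) for case_ratings in zip(*item_scores)]
    total_variance = variance(totals)  # sample variance (n - 1 denominator)
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    sum_item_variances = sum(variance(scores) for scores in item_scores)
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)
```

Two items that always move together, such as `[[1, 2, 3, 4], [1, 2, 3, 4]]`, yield an alpha of 1.0, while items that agree less closely yield lower values.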

Finally, we developed a score based on all of the instrument items to predict whether cases assessed via Safer Dx Instrument were determined to be errors in our previous study. We thus performed a logistic regression using summed scores from the dimensions obtained in the factor analysis above, as well as individual items not included in the dimensions, to predict whether each case was an error or not. Using the obtained regression equation, we compared scores obtained in the error cases and the non-error cases. This would allow users to create potential cut-off scores, signaling lower or higher likelihood of diagnostic error. Users would have the flexibility to personalize these cutoff scores depending on how inclusive and conservative they wanted to be.


Of 389 patient records, use of the instrument identified 117 as diagnostic errors as compared to 129 from our previous sample. The dichotomized score on Safer Dx Instrument’s main outcome of interest (presence or absence of diagnostic errors, i.e., 1–3 = error, 4–6 = no error), was associated with an overall accuracy of 84 %, sensitivity of 71 %, specificity of 90 %, negative predictive value of 86 %, and positive predictive value of 78 % for detecting diagnostic errors. Alternate splits of the six-point scale can be seen in the online Appendix Table.

Items 1–11 were all significantly correlated with item 12 (global impression of diagnostic error; see Spearman and Pearson correlation analyses, Table 2). The Kaiser-Meyer-Olkin measure verified the sampling adequacy for the factor analysis, KMO = 0.87. Three dimensions had eigenvalues over Kaiser’s criterion of 1 and in combination explained over 76 % of the variance. As such, three domains were kept. The first domain (initial diagnostic assessment) included questions 1, 2, 5–7, 9, and 10; the second domain (performance and interpretation of diagnostic tests) included questions 3 and 8; and the third domain (patient factors) included questions 4 and 11. Cronbach’s alpha coefficients associated with these groups were 0.93, 0.92, and 0.38, respectively, suggesting that the first and second domains had excellent internal consistency and reliability, while the third domain had poor internal consistency.

Table 2. Correlations Between the 11 Diagnostic Process Instrument Items and the Safer Dx Instrument Outcome (Diagnostic Error vs. No Error) in 389 Cases

To create an overall score for the instrument that could predict the likelihood that a reviewed case involved a diagnostic error, we summed scores from each item within a dimension to create factor scores. However, because of the poor internal consistency of the third domain (questions 4 and 11), we retained these two items as individual items and did not treat them as forming a specific factor in the scoring system. Factor scores and items 4 and 11 were then entered into a multivariate logistic regression with error versus no error as the predicted outcome (as determined from the previous study). The summed factors and two individual items significantly predicted presence of diagnostic error in the previous study: F(4, 383) = 117, p < 0.001, R² = 0.55. Using the obtained formula, where Error Score = 0.395 + (ΣFactor 1 items × 0.03) + (ΣFactor 2 items × 0.003) + (Item 4 × −0.005) + (Item 11 × 0.05), we created a figure showing the frequency of different scores in error versus no error cases. As shown in Fig. 1, lower scores are more associated with errors and higher scores are less associated with errors. Cutoff scores can be created to distinguish between diagnostic error and non-error cases and can also be used to create different risk groups, such as high, moderate, and low risk of diagnostic error. These cutoff scores could be personalized depending on a user’s desired trade-off between positive predictive and negative predictive value, as well as between sensitivity and specificity. For example, in the future, a practice or an institution might decide to use a cutoff score of ≤ 1.50 to indicate the presence of diagnostic error and a score of ≥ 1.90 to indicate its absence. The advantage of using scoring systems such as this one is that practices or institutions might be able to use scores to categorize patients into high risk, moderate risk, and low risk for diagnostic errors in order to flag cases in need of further review and analysis.
An ROC curve for Safer Dx Instrument’s performance characteristics is shown in Fig. 2.
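The scoring formula and the example cutoffs given above can be expressed as a short Python sketch. The coefficients, item groupings, and the 1.50 and 1.90 cutoffs come from the text; the function names and band labels are hypothetical, and we assume that the ratings passed in are already oriented so that lower values point toward error (that is, any reverse scoring has been applied beforehand).

```python
# Illustrative sketch only; function names and band labels are hypothetical.

FACTOR1_ITEMS = (1, 2, 5, 6, 7, 9, 10)  # initial diagnostic assessment
FACTOR2_ITEMS = (3, 8)                  # performance and interpretation of tests


def error_score(ratings: dict[int, int]) -> float:
    """Apply the reported formula to ratings for items 1-11 (each rated 1-6)."""
    factor1_sum = sum(ratings[item] for item in FACTOR1_ITEMS)
    factor2_sum = sum(ratings[item] for item in FACTOR2_ITEMS)
    return (
        0.395
        + 0.03 * factor1_sum
        + 0.003 * factor2_sum
        - 0.005 * ratings[4]
        + 0.05 * ratings[11]
    )


def risk_band(score: float, error_cutoff: float = 1.50, no_error_cutoff: float = 1.90) -> str:
    """Map a score to a band using the example cutoffs; lower scores suggest error."""
    if score <= error_cutoff:
        return "likely error"
    if score >= no_error_cutoff:
        return "likely no error"
    return "indeterminate"
```

Under these assumptions, a case rated 1 on every item scores 0.656 and falls in the "likely error" band, while a case rated 6 on every item scores 1.961 and falls in the "likely no error" band. Scores between the two cutoffs would be left for further review, and users could pass different cutoffs to suit their own tolerance for false positives and false negatives.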

Figure 1. Relationship between diagnostic error status and scores obtained using the Safer Dx Instrument scoring system.

Figure 2. ROC curve for the Safer Dx Instrument’s performance characteristics.

The second independent review of the 30 randomly selected records showed 73.3 % agreement with the previous study sample, 83.3 % agreement with the current sample, and 86.7 % agreement with either the previous study sample or the current sample.


Novel approaches are needed to address the challenges of measuring diagnostic error in primary care settings.17 In response to this need, we developed the Safer Dx Instrument to measure diagnostic errors and tested its accuracy to help detect their presence or absence via record reviews. Using a sample of previously confirmed cases, we found that the Safer Dx Instrument had a reasonably high accuracy and predictive value to detect presence or absence of diagnostic error. The Safer Dx Instrument is a first step in standardizing the measurement of diagnostic processes in the primary care setting through record review and could help providers and/or healthcare facilities detect potential diagnostic errors for further review using a single reviewer. The instrument’s items clustered into two important diagnostic process domains with face validity (initial diagnostic assessment and performance and interpretation of diagnostic tests). A third, potentially important domain (patient factors) was discovered but had poor internal consistency; therefore, future work should explore developing additional items to measure patient factors.

Without measuring diagnostic performance, we are largely in the dark about an important task performed by primary care physicians.39 There are no standardized tools or strategies to facilitate measurement of diagnostic performance in the complex and vulnerable primary care setting. The Safer Dx Instrument can be used to guide a comprehensive assessment of the patient’s diagnostic experience through a detailed examination of all aspects of the patient’s medical record, including patient history, physical examination, interpretation of diagnostic tests, ordering of additional testing or referrals, generating a differential diagnosis and initial medical assessment, and evaluating the initial diagnosis or related complications. The instrument’s 11 items thus address a wide spectrum of diagnostic process breakdowns that have been described in primary care.10,16

The Safer Dx Instrument would likely be most effective when used in combination with trigger algorithms to select a “high-risk” cohort of medical records36 to review versus reviewing random or non-selected records. A trigger and review strategy could provide an effective screen for diagnostic errors in primary care settings, and could be followed by a secondary review of selected records by one or more physicians to confirm errors and/or to initiate further analysis. Currently, there are no such methods being used in primary care. Although this technique cannot identify all errors, it will be a useful start to enhance learning and feedback about diagnostic safety in primary care settings. Because of reduced reliance on subjectivity, this instrument could also improve agreement on diagnostic errors.

In addition to being used retrospectively to identify cases at highest need for secondary review, the instrument could be used for learning and feedback on what aspects of the diagnostic process broke down. This exercise could lead to a more intensive analysis of diagnoses at a practice level and raise awareness of diagnostic safety issues in the primary care setting. As the recent IOM report also notes,5 measurement of diagnostic errors is essential to create the necessary policy and practice initiatives to improve safety in this area.40

Our study has several limitations. We focused solely on primary care patients and relied on an integrated and comprehensive EHR review to evaluate clinical details about visits, tests, procedures, and referrals. These details might not be available in other primary care practices that are not integrated with other health care settings. However, this is likely to change over time, as several national initiatives are addressing improved integration and data exchange for primary care records. We used an existing data set and a specific trigger algorithm to identify most cases, which may have contributed to a selection bias toward patients with return visits who might be at greater risk for error. However, as there are currently no practical methods available to find diagnostic errors in primary care, any new tools first need rigorous testing. Error determination was dependent on accurate record-keeping and could be confounded by documentation-related limitations and hindsight bias.30,41 Measuring an evolving diagnostic process fraught with uncertainty is challenging.23 Individual reviewers would also vary in their tolerance of ambiguity and their perspectives regarding utilization of diagnostic testing. The use of the instrument involves some amount of individual judgment, even though we tried to minimize this. However, the instrument guides a reviewer through most concepts that need to be considered while analyzing the diagnostic process for problems within a clinical encounter. Moreover, our strategy of having a single clinician screen records for subsequent detailed review by an additional team of clinicians would likely be more feasible and acceptable to others. We also acknowledge that agreement between our two reviewers was not perfect, but believe it is a start for measuring something so important yet quite abstract (this concept is also acknowledged in the recent IOM report).
The instrument might perform differently in different populations and different disease conditions and thus, testing will be required in other settings. Additional scientific understanding in the future will likely make this instrument better.

In conclusion, we tested a new instrument and found it to have a high degree of accuracy and predictive value for measuring diagnostic errors in primary care settings. This instrument could be useful to identify high-risk cases for further study and quality improvement. With further testing in additional clinical settings, the Safer Dx Instrument could be used to enhance knowledge on improving diagnostic safety in primary care settings.