Health care delivery is rarely observed directly and systematically.1, 2 Clinicians vary in how effectively they practice, depending on how well they listen to patients and ask key questions; current quality measures do not capture these variations.3, 4

In past research, we used “unannounced standardized patients” (USPs) to measure clinician performance during direct observation. USPs are actors trained to present to clinicians incognito as patients, portraying standardized scripts that facilitate controlled comparisons among practices and providers.

Using USPs, we have studied the impact of physician inattention to patient psychosocial issues relevant to care planning, termed “patient contextual factors.”5,6,7 In a study of 400 USP visits, internists who overlooked clues that patients’ clinical problems were related to contextual factors were more likely to order unnecessary tests and therapies, with a median excess cost of $231 per visit.8 USP projects have also found discrepancies between the medical record and what actually occurred during the visit. Others employing USPs have reported that nearly a third of recorded physical exam maneuvers do not occur.9, 10 These lapses are not discoverable without direct observation.

We have advocated for the widespread implementation of USPs as a performance assessment and improvement strategy.1, 11, 12 USPs unmask widespread and serious deficits in care quality that are otherwise invisible and that, if remediated, would reduce overuse and misuse of medical services.13, 14 One approach to remediation (“audit and feedback”) is to give providers feedback based on USP assessments of their performance.15 The goal is to positively modify the way they care for real patients.

The purpose of this study was to use USP visits and feedback to assess and improve the quality of primary care delivered and documented, and to measure the impact on cost of care in actual patients.


Impact on Quality of Care

Setting and Cases

We partnered with Horizon BCBS of New Jersey to define target areas aligned with Horizon’s value-based program, which incentivizes practices based on care metrics related to the target areas. Four USP case scripts were designed around diabetes and idiopathic low back pain, with opportunities for providers to address medication adherence (diabetes), opioid use (low back pain), depression screening (all cases, with positive screens in two cases), smoking cessation (two cases), and reluctance to engage in recommended cancer screenings (three cases). Measures were of ordinarily unmeasurable physician behaviors, e.g., whether they followed CDC guidelines for promoting smoking cessation or simply told the patient they should “quit.” Appendix 1 shows the conditions and expected behaviors associated with each case script.

USPs were recruited from the standardized patient and professional theater communities in the local area. They received eight hours of in-person training prior to their first visit. Script development and training methods were similar to those used in past research.7

We recruited New Jersey primary care providers (physicians and nurse practitioners) in practices enrolled in Horizon’s value-based program. Providers were told that they would receive visits from four USPs during the 18 months of the project (without remuneration), and provided informed consent. Physician participants could receive MOC part IV credit from the American Board of Internal Medicine or American Board of Family Medicine. Practices identified a staff confederate to assist in scheduling visits and ensuring that USPs would be seen (despite lack of insurance), to transmit copies of provider notes to the researchers following USP visits, and to “white out” electronic medical records of USP visits to prevent their inclusion in billing or practice quality reporting metrics. Horizon staff did not have access to identified provider- or practice-level data.


Providers were visited by four USPs, each playing one of the case scripts, in a counterbalanced order; the visits were divided into two phases of two visits each, and providers were randomized to case orders. After each visit, once we had obtained the provider’s note, we emailed the provider to inform them they had seen a USP and to ask whether they had believed they were seeing a real patient or a USP. Following the first two visits to all providers within a practice, each provider was given reports of their performance and their practice’s aggregate performance, and participated in a coaching phone call with a physician investigator (SW) with input from a quality improvement specialist from the American College of Physicians. This process was repeated after the two following visits. Reports included, for each case, visualizations of the proportion of times the practice or individual provider performed each expected care behavior (as compared with all practices and providers in the study), visualizations of documentation fidelity (how often visit tasks were performed or not vs. documented or not), and, in the practice reports, CAHPS clinician and group survey measures and narrative comments from the USPs, with the provider not identified.


USPs completed a checklist of guideline-based diagnostic and treatment provider behaviors. USPs also covertly audio recorded the visits. The practice confederate provided the provider’s encounter note and any intake paperwork used for the visit. We coded visit recordings and provider notes for case-specific performance indicators (e.g., was depression screening correctly performed (audio), and did the physician document the task (note)?). We considered information on the intake paperwork as if it had been provided by the patient in the encounter, and we considered a depression screen to have been performed if the screen appeared in the paperwork. Our fidelity measure included four categories for each indicator: heard on audio and documented in note (correct), heard on audio but not documented (undocumented), not heard on audio and not documented (unperformed), and not heard on audio but documented (fictitious).
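The four fidelity categories amount to a two-by-two classification of each indicator. The following is a minimal sketch of that mapping, not the study's coding software; the function name and the example indicator names are hypothetical.

```python
from collections import Counter


def fidelity_category(heard: bool, documented: bool) -> str:
    """Map one performance indicator to a documentation-fidelity category."""
    if heard and documented:
        return "correct"
    if heard:
        return "undocumented"   # performed but absent from the note
    if documented:
        return "fictitious"     # in the note but not heard on audio
    return "unperformed"        # neither heard nor documented


# Hypothetical coded indicators for one visit:
# name -> (heard on audio, documented in note)
visit = {
    "depression_screen": (True, True),
    "smoking_cessation_counseling": (True, False),
    "cancer_screening_discussion": (False, False),
    "review_of_systems": (False, True),
}

categories = {name: fidelity_category(h, d) for name, (h, d) in visit.items()}
counts = Counter(categories.values())
```

Tallying `counts` across visits would give the kind of per-category rates compared between pre- and post-feedback visits.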

Data Analyses

To examine the association between visit time (pre- vs. post-feedback) and quality-of-care behaviors, we fitted a mixed-effects logistic regression model to the USP checklist items, with fixed effects of visit time and USP case, and random effects of practice, provider, and checklist item to control for clustering of items within cases, cases within providers, and providers within practices. We also examined whether physician suspicion that they had seen a USP was associated with performance. To examine the association between visit time and documentation fidelity, we fitted a similar mixed-effects multinomial logistic regression model. We sought a sample size of 60 providers overall based on a priori 80% power to detect an improvement from a baseline rate of 40% to a post-intervention rate of 65% in a given performance indicator, based on previous studies.
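The text does not state how the sample size was derived. As a rough illustration only, one conventional normal-approximation formula for comparing two proportions can be sketched as below. The two-sided alpha of 0.05 is an assumption, and the formula ignores the clustering and within-provider pairing that the mixed-effects models address.

```python
from math import ceil
from statistics import NormalDist


def n_per_condition(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate observations needed per condition to detect p1 vs. p2.

    Uses the unpooled-variance normal approximation:
    n = (z_{1-alpha/2} + z_{power})^2 * [p1(1-p1) + p2(1-p2)] / (p2 - p1)^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value (assumed alpha)
    z_beta = z.inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# Baseline rate of 40%, post-intervention rate of 65%, 80% power
n = n_per_condition(0.40, 0.65)
print(n)  # 59
```

Under these assumptions the approximation gives 59 observations per condition, in the neighborhood of the 60 providers sought, although the authors' actual calculation may have differed.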

Role of the Funding Source

Support for this project was provided by a grant from the Robert Wood Johnson Foundation to the American College of Physicians and the Institute for Practice and Provider Performance Improvement, Inc. The funding agreement ensured the independence of the investigators in the design, conduct, and analysis of the study.

Impact on Cost Patterns

Comparison Practices

For billing purposes, practices are associated with taxpayer identification numbers (TINs); several co-owned practices could share one TIN. Prior to claims data collection, we matched each study practice TIN to a comparison practice TIN that was in the same county and also enrolled in the Horizon value-based program, using propensity score matching16 based on practice size (members enrolled) and activity (two different 5-level ordinal measures of aggregate claims billed to Horizon) in 2016. A total of 117 TINs were available for matching; the matched group was well balanced on size (SMD = −0.02, p = 0.96) and better balanced on activity measures than the population (SMD = 0.17 and −0.20).
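Balance after matching is summarized here by the standardized mean difference (SMD). The text does not say which SMD variant was used; the sketch below assumes the common pooled-standard-deviation form, and the enrollment figures are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(study, comparison):
    """SMD = (mean_study - mean_comparison) / pooled SD.

    Pooled SD is the square root of the average of the two sample variances
    (an assumed convention; other variants exist).
    """
    pooled_sd = sqrt((variance(study) + variance(comparison)) / 2)
    return (mean(study) - mean(comparison)) / pooled_sd


# Hypothetical members enrolled per TIN in each matched group
study_size = [1200, 800, 1500, 950]
comparison_size = [1150, 900, 1400, 1000]

smd = standardized_mean_difference(study_size, comparison_size)
```

Values of SMD near zero indicate that the matched groups have similar means on the covariate relative to its spread.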

Horizon provided costs of actual claims paid for inpatient, outpatient, and professional services (primary or specialty care) in three 3-month periods: the three months prior to the project (incurred October–December 2017 and paid October 2017–March 2018), the three months beginning 9 months later (after feedback, incurred July–September 2018 and paid July–December 2018), and the three months beginning 12 months later (incurred October–December 2018 and paid October 2018–March 2019). The data covered all visits by (actual) patients attributed to study and comparison practices who had received at least one service associated with a focal condition (diabetes, opioid use, depression, back pain, smoking, or cancer screening) during at least one visit. We considered a visit in the per-visit claims data to be associated with a condition if it included any claim with any ICD-10 code associated with the condition (see Appendix 2 for ICD-10 codes by condition), but we included claims for all care that occurred in the same visit. Claims data did not include prescription claims, capitation claims, or patient-paid portions of charges (i.e., co-pays and deductibles) and were identified at the TIN level. For diabetes and low back pain, Horizon also provided patient-level costs in each period using Horizon’s internal attribution algorithm, which aggregates only those claims related to the condition, and computes total disease-related cost of the claims for members receiving care for the disease.
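The visit-level rule (any claim with any associated ICD-10 code flags the visit, while the visit's cost includes all of its claims) can be sketched as follows. The code prefixes shown are illustrative examples only, not the study's Appendix 2 lists, and the data layout and function names are hypothetical.

```python
# Illustrative ICD-10 prefixes only; the study's actual lists are in its Appendix 2.
CONDITION_PREFIXES = {
    "diabetes": ("E11",),         # type 2 diabetes mellitus
    "low_back_pain": ("M54.5",),  # low back pain
}


def conditions_for_visit(claims):
    """Return the focal conditions a visit is associated with.

    `claims` is a list of dicts, each with an 'icd10' list of codes.
    A single matching code on any claim is enough to flag the visit.
    """
    found = set()
    for claim in claims:
        for code in claim["icd10"]:
            for condition, prefixes in CONDITION_PREFIXES.items():
                if code.startswith(prefixes):
                    found.add(condition)
    return found


def visit_cost(claims):
    """Total paid amount for the visit, including claims unrelated to the condition."""
    return sum(claim["paid"] for claim in claims)


# Hypothetical visit: one diabetes claim plus one hypertension claim
visit = [
    {"icd10": ["E11.9"], "paid": 85.0},
    {"icd10": ["I10"], "paid": 40.0},
]
flags = conditions_for_visit(visit)
total = visit_cost(visit)
```

In this example the visit is flagged for diabetes, and its cost counts both claims, mirroring the rule that all care in the same visit is included.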

Claim “costs” represent costs to Horizon, not societal costs, but provide a numeraire that does not differ between intervention and control practice groups. Not all claims had an associated cost to Horizon; we refer to those that do as “positive claims.”

Data Analyses

We fitted two-part mixed-model regressions for each focal condition to per-visit claims costs: we modeled the probability of having a positive claim with a logistic model and the cost of positive claims with a gamma model. In each model, we include a fixed effect of practice type (study vs. comparison), a fixed effect of time period (pre-feedback, 9 months later, and 12 months later), the interaction between practice type and time period, and a random effect of practice TIN to account for clustering of claims in TINs. We examined the difference-in-difference of predicted per-visit costs between comparison and study practices at the 9-month and 12-month periods (each compared with the pre-feedback cost as a baseline). We fitted similar two-part mixed regressions to patient-attributed costs for diabetes and for low back pain. We hypothesized significant practice type × time interactions, with the direction depending on the nature of the focal condition.
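In a two-part model, the predicted cost per visit is the probability of a positive claim multiplied by the expected cost given a positive claim, and the difference-in-differences contrasts the change over time in the two groups. The sketch below shows only that arithmetic, not the model fitting; the predicted values are hypothetical, and the sign convention (study change minus comparison change) is an assumption.

```python
def expected_cost(p_positive: float, mean_positive_cost: float) -> float:
    """Two-part prediction: P(claim > 0) * E[cost | claim > 0]."""
    return p_positive * mean_positive_cost


def difference_in_differences(study_pre, study_post, comp_pre, comp_post):
    """Change in study practices minus change in comparison practices.

    Negative values mean costs rose less (or fell more) in study practices.
    """
    return (study_post - study_pre) - (comp_post - comp_pre)


# Hypothetical predictions from the logistic part and the gamma part
predicted = {
    ("study", "pre"): expected_cost(0.85, 100.0),
    ("study", "post"): expected_cost(0.86, 150.0),
    ("comparison", "pre"): expected_cost(0.85, 100.0),
    ("comparison", "post"): expected_cost(0.86, 160.0),
}

did = difference_in_differences(
    predicted[("study", "pre")],
    predicted[("study", "post")],
    predicted[("comparison", "pre")],
    predicted[("comparison", "post")],
)
```

With these made-up numbers, costs rise in both groups but by less in the study group, so the difference-in-differences is negative.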

Table 1 Raw Proportions and Adjusted Odds Ratios for Quality of Care Pre- and Post-feedback

For privacy reasons, Horizon’s per-visit claims could not identify whether different visits were by the same patients, so we treated each visit as an independent patient. The patient-level cost data do not require this assumption. We analyzed claims in raw dollars in the period they were incurred (thus, main effects of time include any changes in reimbursement rates over the study duration). One author (AS) conducted analyses using R 3.6 (ref. 17) with the lme4 (ref. 18) and mgcv (ref. 19) packages and Julia 1.1.1 (ref. 20) with the MixedModels (ref. 21) package. The project was approved by Advarra IRB.


Quality of Care Measured by USPs

Twenty-two study practices (representing 14 TINs) agreed to participate; eight were solo practices, 10 were group practices with 2–3 providers, and 4 were group practices with more providers (range 4–14). Of the 64 total providers at these practices, 59 agreed to participate.

USPs made 217 visits in total between May 2017 and March 2019. Four practices had only pre-feedback visits (because the practice closed, participating providers died or moved, or actors were unable to schedule post-feedback visits in a timely fashion); the remaining 18 had pre-feedback and post-feedback visits. Of the providers, 48 received all four visits, 4 received 3 visits, 6 received 2 pre-feedback visits only, and one received a single pre-feedback visit. Providers suspected they had seen a USP in 19 visits, believed they had seen a real patient in 73 visits, and did not reply to the email in the remaining visits.

Care Provided

Providers performed expected care based on USP checklists in 46% of items in pre-feedback visits and 56% in post-feedback visits (OR = 1.53, 95% CI 1.29–1.83, p < 0.001). Significant improvement areas included smoking cessation, managing chronic low back pain without opioids, and depression screening (Table 1). Belief that they had seen a USP vs. a real patient vs. not responding was not associated with provider performance (χ²(2) = 1.09, p = 0.58).
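For orientation, the unadjusted odds ratio implied by the raw proportions can be computed directly. This is a simple arithmetic sketch, not the mixed-effects model; the reported OR of 1.53 is adjusted for case and for clustering, so it need not equal the raw value.

```python
def odds(p: float) -> float:
    """Convert a proportion to odds."""
    return p / (1 - p)


def odds_ratio(p_post: float, p_pre: float) -> float:
    """Unadjusted odds ratio of post- vs. pre-feedback performance."""
    return odds(p_post) / odds(p_pre)


raw_or = odds_ratio(0.56, 0.46)
print(round(raw_or, 2))  # 1.49
```

The raw value of about 1.49 is close to, but not the same as, the adjusted estimate.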

Documentation Fidelity

Post-feedback visits had lower rates of unperformed behaviors (OR = 0.67, 95% CI 0.57 to 0.78, p < 0.001) and of fictitious documentation (OR = 0.74, 95% CI 0.62 to 0.90, p = 0.002) than pre-feedback visits. Unperformed behaviors increased for cancer screening and decreased for depression and smoking cessation. Rates of undocumented behaviors did not differ overall by post- vs. pre-feedback timing (OR = 0.89, 95% CI 0.74 to 1.06, p = 0.19), but were higher for medication adherence in diabetes and lower for review of systems. Figure 1 presents overall unadjusted rates, and Table 2 presents adjusted odds ratios for groups of items.

Figure 1

Observed frequency of targeted behaviors recorded by USPs (“Heard”) vs. entered in the medical record (“Documented”) in pre- and post-feedback visits. Error bars indicate the upper limits of the 95% confidence intervals around the proportions.

Table 2 Changes in Documentation Fidelity from Pre- to Post-feedback overall and for groups of items

Cost for Care in Actual Patients

Overall, per-visit claims were more frequent and considerably more expensive in post-feedback than pre-feedback periods, regardless of study group. Per-visit claims related to study focal conditions totaled $17,104,906 (235,644 claims) pre-feedback (Oct–Dec 2017), $48,787,157 (316,116 claims) nine months after feedback (Jul–Sep 2018), and $48,112,140 (307,277 claims) twelve months after feedback (Oct–Dec 2018).

Overall Patterns in Claims

The rate of positive claims for focal conditions was 86.8% overall and did not differ significantly between study and comparison practice groups at any time period for any focal condition except cancer screening. The number of patients with $0 per-patient-attributable costs similarly did not differ between groups at any time period.

Table 3 presents gamma regression coefficients and standard errors for per-visit office-based claims for each of the focal conditions, as well as per-visit difference-in-difference costs (pre-post change in the study TINs vs. the comparison TINs); Table 4 presents the same for per-patient-attributable costs for diabetes and low back pain. Figure 2 shows the predicted average claim per visit by groups and time periods for each condition, and for all other claims (“non-focal”).

Table 3 Gamma Regression Coefficients (Per-Visit Data for Visits with Positive Claims) and Difference-in-Differences of Claims Costs

Figure 2

Predicted average per-visit claims by group, time period, and focal condition.

In all conditions, there was no significant difference in claim dollars at baseline between study and comparison group practices, and average claim dollars significantly increased over time periods in both groups. Among non-focal conditions, claims also did not differ significantly by group.

Table 4 Gamma Regression Coefficients (Patient-Level Data) and Difference-in-Differences in Per-Patient Costs


Diabetes

Study practices showed significantly lower rates of increase in office-based diabetes claims than the comparison practices; this was reflected in both per-visit and per-patient costs.

Low Back Pain

Study practices showed significantly higher rates of increase in office-based low back pain claims than the comparison practices at 9 months (per-visit and per-patient) and 12 months (per-patient only).

Cancer Screening

In cancer screening, the likelihood of a positive claim did not differ significantly between study and comparison groups at the pre-feedback time period (adjusted OR = 0.85, 95% CI 0.72–1.02), but the study group was significantly more likely to have positive claims than the comparison group at both the 9-month and 12-month periods (adjusted OR = 1.36, 95% CI 1.18–1.56 in each period compared with baseline for study vs. comparison group). Study practices also showed significantly higher rates of increase in office-based screening claims than the comparison practices at 9 (but not 12) months. Taking together the greater frequency and the greater amount of positive claims over time, study practices showed significantly higher rates of increase in office-based cancer screening claims than comparison practices.


Depression

Study practices showed significantly lower rates of increase in office-based depression claims than the comparison practices.

Smoking Cessation

Study practices showed significantly higher rates of increase in office-based smoking cessation claims than the comparison practices.


In this study, we employed unannounced standardized patients (USPs) to directly observe physician behaviors in managing common ambulatory conditions and compared the findings with what the physicians documented in the medical record. We discovered previously undetectable deficits in clinician performance and documentation; after we provided feedback and worked with practices to develop quality improvement strategies, quality of care and documentation improved.

We also observed significant and expected changes to claims costs in the care of real patients among practices that participated in the intervention relative to the comparison group. Specifically, actual claims costs increased less among real patients with diabetes and depression, where physicians adopted recommended behaviors that reduced costs when caring for USPs, and claims costs increased more among real patients receiving care for smoking addiction, low back pain, and preventive cancer care, where physicians adopted recommended behaviors that increased costs.

We did not have access to prescription claims. Although the observed changes in costs in the care of real patients were consistent with what we observed in the care of the USPs, we cannot confirm that they were due to the same desired increases or decreases in prescribing. For instance, although physical therapy claims went up as expected, we cannot confirm that opioid prescribing went down. In one area, cancer prevention, however, the improvements we saw in the care of USPs matched those exhibited in the care of real patients.

Two observations from the claims analysis were unexpected. First, claims increased substantially overall between the pre- and post-feedback time periods. The increase may reflect changes in reimbursement policy or shifts in membership over the insurance year; our comparison group helps us control for this. Second, for low back pain, the per-visit claims costs increased more in the study group than in the comparison group at 9 months, but not at 12 months, while the per-patient costs increased more in the study group than in the comparison group at both periods. This may reflect a decreasing impact of our intervention over time.

Because practices elected to participate in the project, differences seen in the study practices may not represent the effect that would be observed in all practices. The matched (a priori) comparison group and use of common time periods mitigate this limitation, but if the study practices differ substantially from the typical practice, our matching could have led to similarly atypical comparison practices.

USPs provide a penetrating assessment of clinician performance because they assess actual clinician behaviors rather than what is recorded in the medical record. They enable highly personalized feedback that providers can utilize to improve their care and potentially enhance value-based care. In our project, the marginal cost of an additional two USP visits to a provider (not including overhead) with feedback and coaching was approximately $700, a fraction of the potential cost savings associated with improved patient care for conditions where higher quality reduces office-based care (e.g., diabetes medication adherence). For conditions where improving quality increases short-term costs (e.g., preventive care), payers with sufficiently long-term horizons may realize cost savings from future improved health. In short, direct observation of care identifies deficits in practice and documentation, and can improve both, sometimes with predictable concomitant cost savings.