Impact on Quality of Care
Setting and Cases
We partnered with Horizon BCBS of New Jersey to define target areas aligned with Horizon’s value-based program, which incentivizes practices based on care metrics related to the target areas. Four USP case scripts were designed around diabetes and idiopathic low back pain, with opportunities for providers to address medication adherence (diabetes), opioid use (low back pain), depression screening (all cases, with positive screens in two cases), smoking cessation (two cases), and reluctance to engage in recommended cancer screenings (three cases). The measures captured physician behaviors that are ordinarily unmeasurable, e.g., whether the provider followed CDC guidelines for promoting smoking cessation or simply told the patient to “quit.” Appendix 1 shows the conditions and expected behaviors associated with each case script.
USPs were recruited from the standardized patient and professional theater communities in the local area. They received eight hours of in-person training prior to their first visit. Script development and training methods were similar to those used in past research [7].
We recruited New Jersey primary care providers (physicians and nurse practitioners) in practices enrolled in Horizon’s value-based program. Providers were told that they would receive visits from four USPs during the 18 months of the project (without remuneration), and provided informed consent. Physician participants could receive MOC part IV credit from the American Board of Internal Medicine or American Board of Family Medicine. Practices identified a staff confederate to assist in scheduling visits and ensuring that USPs would be seen (despite lack of insurance), to transmit copies of provider notes to the researchers following USP visits, and to “white out” electronic medical records of USP visits to prevent their inclusion in billing or practice quality reporting metrics. Horizon staff did not have access to identified provider- or practice-level data.
Design
Each provider was visited by four USPs, each portraying one of the case scripts, in a counterbalanced order; the visits were divided into two phases of two visits each, and providers were randomized to case orders. After each visit, once the provider’s note had been obtained, the provider was emailed to inform them that they had seen a USP and to ask whether they had believed they were seeing a real patient or a USP. Following the first two visits to all providers within a practice, each provider was given reports of their performance and their practice’s aggregate performance, and participated in a coaching phone call with a physician investigator (SW) with input from a quality improvement specialist from the American College of Physicians. This process was repeated after the two following visits. Reports included, for each case, visualizations of the proportion of times the practice or individual provider performed each expected care behavior (as compared with all practices and providers in the study), visualizations of documentation fidelity (how often visit tasks were performed or not vs. documented or not), and, in the practice reports, CAHPS clinician and group survey measures and narrative comments from the USPs, with the provider not identified.
Measures
USPs completed a checklist of guideline-based diagnostic and treatment provider behaviors. USPs also covertly audio recorded the visits. The practice confederate provided the provider’s encounter note and any intake paperwork used for the visit. We coded visit recordings and provider notes for case-specific performance indicators (e.g., was depression screening correctly performed (audio), and did the physician document the task (note)?). We considered information on the intake paperwork as if it had been provided by the patient in the encounter, and we considered a depression screen to have been performed if the screen appeared in the paperwork. Our fidelity measure included four categories for each indicator: heard on audio and documented in note (correct), heard on audio but not documented (undocumented), not heard on audio and not documented in note (unperformed), and not heard on audio but documented (fictitious).
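As an illustration, the mapping from the audio and note indicators to these four categories can be written as a short function; the sketch below is in R, and the variable and function names are hypothetical rather than those used in the study.

```r
# Minimal sketch: assign one of the four fidelity categories to each indicator.
# `heard` = behavior heard on the visit audio recording;
# `documented` = behavior appears in the provider's note or intake paperwork.
fidelity_category <- function(heard, documented) {
  ifelse(heard & documented, "correct",
    ifelse(heard & !documented, "undocumented",
      ifelse(!heard & documented, "fictitious", "unperformed")))
}

# Example: four indicators from a single (hypothetical) visit
fidelity_category(heard      = c(TRUE, TRUE, FALSE, FALSE),
                  documented = c(TRUE, FALSE, TRUE, FALSE))
#> [1] "correct"      "undocumented" "fictitious"   "unperformed"
```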
Data Analyses
To examine the association between visit time (pre- vs. post-feedback) and quality-of-care behaviors, we fitted a mixed-effects logistic regression model to the USP checklist items, with fixed effects of visit time and USP case, and random effects of practice, provider, and checklist item to control for clustering of items in cases in providers in practices. We also examined whether physician suspicion that they had seen a USP was associated with performance. To examine the association between visit time and documentation fidelity, we fitted a similar mixed-effects multinomial logistic regression model. We sought a sample size of 60 providers overall, based on a priori 80% power to detect an improvement from a baseline rate of 40% (estimated from previous studies) to a post-intervention rate of 65% in a given performance indicator.
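A minimal sketch of the quality-of-care model is shown below, assuming it was fit with lme4::glmer (the study reports using lme4 and mgcv; the exact specification, data frame, and column names here are hypothetical). The multinomial fidelity model is not shown; it follows the same fixed- and random-effects structure with the four-category fidelity outcome.

```r
library(lme4)

# Hypothetical long-format data: one row per checklist item per USP visit, with
# performed (0/1), phase (pre- vs. post-feedback), case, practice, provider, item.
m_quality <- glmer(
  performed ~ phase + case +
    (1 | practice/provider) +   # providers nested within practices
    (1 | item),                 # checklist items
  data   = checklist_long,
  family = binomial
)
summary(m_quality)   # exp(coefficient for phase) gives the adjusted odds ratio
```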
Role of the Funding Source
Support for this project was provided by a grant from the Robert Wood Johnson Foundation to the American College of Physicians and the Institute for Practice and Provider Performance Improvement, Inc. The funding agreement ensured the independence of the investigators in the design, conduct, and analysis of the study.
Impact on Cost Patterns
Comparison Practices
For billing purposes, practices are associated with taxpayer identification numbers (TINs); several co-owned practices could share one TIN. Prior to claims data collection, we matched each study practice TIN to a comparison practice TIN in the same county and also enrolled in the Horizon value-based program, using propensity score matching [16] based on practice size (members enrolled) and activity (two different 5-level ordinal measures of aggregate claims billed to Horizon) in 2016. A total of 117 TINs were available for matching; the matched group was well balanced on size (SMD = −0.02, p = 0.96) and better balanced on activity measures than the population (SMD = 0.17 and −0.20).
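The matching step could be carried out, for example, with the MatchIt package; the study does not state which software was used for matching, so the sketch below is only illustrative and the data frame and variable names are hypothetical.

```r
library(MatchIt)

# Hypothetical TIN-level data: one row per TIN with the 2016 covariates
# (members_enrolled plus two 5-level ordinal activity measures), the county,
# and an indicator for study (vs. candidate comparison) practices.
match_out <- matchit(
  study_practice ~ members_enrolled + activity_measure_1 + activity_measure_2,
  data     = tin_data,
  method   = "nearest",    # 1:1 nearest-neighbor matching on the propensity score
  distance = "glm",        # propensity score from a logistic regression
  exact    = ~ county      # require matched pairs to be in the same county
)
summary(match_out)         # standardized mean differences before and after matching
matched_tins <- match.data(match_out)
```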
Horizon provided costs of actual claims paid for inpatient, outpatient, and professional services (primary or specialty care) for all visits by (actual) patients attributed to study and comparison practices who had received at least one service associated with a focal condition (diabetes, opioid use, depression, back pain, smoking, or cancer screening) during at least one visit. Costs were provided for three periods: the three months prior to the project (incurred October–December 2017 and paid October 2017–March 2018), the three months beginning 9 months later (after feedback; incurred July–September 2018 and paid July–December 2018), and the three months beginning 12 months later (incurred October–December 2018 and paid October 2018–March 2019). We considered a visit in the per-visit claims data to be associated with a condition if it included any claim with any ICD-10 code associated with the condition (see Appendix 2 for ICD-10 codes by condition), but included claims for all care that occurred in the same visit. Claims data did not include prescription claims, capitation claims, or patient-paid portions of charges (i.e., co-pays and deductibles) and were identified at the TIN level. For diabetes and low back pain, Horizon also provided patient-level costs in each period using Horizon’s internal attribution algorithm, which aggregates only those claims related to the condition and computes the total disease-related cost for members receiving care for the disease.
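The visit-level inclusion rule can be illustrated with a short sketch; the claim-level data layout and the ICD-10 lookup below are hypothetical (the actual codes are listed in Appendix 2).

```r
# Hypothetical claim-level data: one row per claim, with visit_id, icd10, and cost.
# icd10_by_condition is a named list of ICD-10 code vectors (per Appendix 2).
claims_for_condition <- function(claims, condition_codes) {
  # visits with at least one claim carrying a code for the condition
  flagged_visits <- unique(claims$visit_id[claims$icd10 %in% condition_codes])
  # keep ALL claims from flagged visits, not only the condition-coded claims
  claims[claims$visit_id %in% flagged_visits, ]
}

diabetes_claims <- claims_for_condition(claims, icd10_by_condition[["diabetes"]])
```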
Claim “costs” represent costs to Horizon, not societal costs, but provide a numeraire that does not differ between intervention and control practice groups. Not all claims had an associated cost to Horizon; we refer to those that did as “positive claims.”
Data Analyses
We fitted two-part mixed-model regressions for each focal condition to per-visit claims costs: we modeled the probability of having a positive claim with a logistic model and the cost of positive claims with a gamma model. In each model, we included a fixed effect of practice type (study vs. comparison), a fixed effect of time period (pre-feedback, 9 months later, and 12 months later), the interaction between practice type and time period, and a random effect of practice TIN to account for clustering of claims in TINs. We examined the difference-in-differences of predicted per-visit costs between comparison and study practices at the 9-month and 12-month periods (each compared with the pre-feedback cost as a baseline). We fitted similar two-part mixed regressions to patient-attributed costs for diabetes and for low back pain. We hypothesized significant practice type × time interactions, with the direction depending on the nature of the focal condition.
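A sketch of the two-part model, assuming it was fit with lme4::glmer (the study used lme4 and mgcv; the exact specification may differ, and the data frame and variable names below are hypothetical):

```r
library(lme4)

# Hypothetical visit-level data for one focal condition: cost (claims cost to
# Horizon), practice_type (study vs. comparison), period (pre, 9 mo, 12 mo), tin.
visits$positive <- as.integer(visits$cost > 0)

# Part 1: probability that a visit has a positive claim
m_any <- glmer(positive ~ practice_type * period + (1 | tin),
               data = visits, family = binomial)

# Part 2: cost among visits with positive claims (gamma model with log link)
m_pos <- glmer(cost ~ practice_type * period + (1 | tin),
               data = subset(visits, positive == 1),
               family = Gamma(link = "log"))

# Expected per-visit cost = Pr(positive claim) * E[cost | positive claim];
# the difference-in-differences compares the change from the pre-feedback
# baseline at 9 and 12 months between study and comparison practices.
newdat <- expand.grid(practice_type = c("study", "comparison"),
                      period        = c("pre", "9mo", "12mo"))
newdat$expected_cost <-
  predict(m_any, newdata = newdat, re.form = NA, type = "response") *
  predict(m_pos, newdata = newdat, re.form = NA, type = "response")
```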
For privacy reasons, Horizon’s per-visit claims could not identify whether different visits were by the same patients, so we treated each visit as an independent patient. The patient-level cost data do not require this assumption. We analyzed claims in raw dollars in the period they were incurred (thus, main effects of time include any changes in reimbursement rates over the study duration). One author (AS) conducted analyses using R 3.6 [17] with the lme4 [18] and mgcv [19] packages and Julia 1.1.1 [20] with the MixedModels [21] package. The project was approved by Advarra IRB.