Teaching Commentary on “A Primary Care-Based Multidisciplinary Readmission Prevention Program”: Essential Aspects of Comparability and Context in Practice-Based Program Evaluation

In their paper evaluating a primary care-based multidisciplinary readmission prevention program, Cavanaugh and colleagues describe the quality improvement processes they underwent to develop, iteratively refine, implement and then evaluate a program designed to improve post-discharge planning and care transitions, thereby reducing readmissions.1 Using a retrospective cohort study design, they then describe the results of their program evaluation comparing patients who have been discharged from their hospital and exposed to a local innovation—a hospitalization follow-up clinic—to a local comparison group matched on readmission risk and admission timing. In unadjusted analyses, program participants had significantly fewer readmissions at 30 and 90 days, with non-significant trends for lower emergency department visits, shorter median time to first post-discharge follow-up visit, and higher achievement of hospital follow-up within 30 days.1 When they adjusted for baseline differences between their intervention and comparison groups, only the hazard of a 90-day readmission was significantly lower in the intervention group. 
 
This paper provides an excellent example of the potential for and challenges of identifying an appropriate and comparable comparison group in non-experimental study designs, as well as the value of what Jubelt and colleagues described as “thick descriptions” of program content and processes of implementation, in context such that actions and behaviors become clear to outsiders.2 In this commentary, I lay out the implications of both comparability and context for evaluating quality improvement (QI) programs, with an eye to their contribution to implementation and spread of local innovations. 
 
Comparability of Comparisons: Enhancing Internal Validity 
While even randomization does not always confer comparability, the utility and internal validity of non-experimental evaluation designs often rise or fall on the basis of the comparability of the comparison group(s) used.3 The purpose of a comparison group is to offer evidence of what would have happened in the absence of the intervention to the extent possible, enabling estimation of an incremental advantage attributable to program implementation.4 Therefore, the type and manner in which selection criteria are applied, the availability of data and measures, and the covariate adjustments that evaluators deploy to increase the comparability of intervention vs. comparison groups become essential design elements.5 Otherwise, potential threats to internal validity begin to mount, undermining confidence in the study’s findings.6 
 
Cavanaugh and colleagues1 compared two groups of patients that had been discharged from their hospital: (1) the intervention group, members of whom had been exposed to a hospitalization follow-up clinic; and (2) a comparison group, derived using the same inclusion/exclusion criteria, matched on a general readmission risk stratification score, and whose index discharge was within a month of the intervention patient to whom they were matched. To their credit, Cavanaugh et al.'s approach reduced many common validity threats by applying comparable criteria and similar time trajectories (e.g., within a month of index discharges) for both groups over the same 5-month time period and in the same facility. 
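
The matching logic described above can be sketched in a few lines of Python. This is a minimal illustration only, not the evaluators' actual procedure: the `Patient` record, the `match_comparisons` function, the 30-day reading of "within a month," and the choice of 1:1 matching without replacement are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Patient:
    """Hypothetical record; field names are illustrative, not from the study."""
    patient_id: str
    risk_stratum: str      # e.g., "low", "moderate", or "high"
    index_discharge: date


def match_comparisons(
    intervention: List[Patient],
    pool: List[Patient],
    window_days: int = 30,   # assumption: "within a month" read as 30 days
    seed: int = 0,
) -> Dict[str, Optional[Patient]]:
    """Randomly pick one comparison patient per intervention patient.

    A candidate qualifies if it shares the intervention patient's risk
    stratum and was discharged within `window_days` of that patient's
    index discharge. Each comparison patient is used at most once
    (assumption: matching without replacement).
    """
    rng = random.Random(seed)
    available = list(pool)
    matches: Dict[str, Optional[Patient]] = {}
    for case in intervention:
        candidates = [
            p for p in available
            if p.risk_stratum == case.risk_stratum
            and abs((p.index_discharge - case.index_discharge).days) <= window_days
        ]
        if not candidates:
            # No eligible comparison patient; leave this case unmatched.
            matches[case.patient_id] = None
            continue
        chosen = rng.choice(candidates)
        available.remove(chosen)
        matches[case.patient_id] = chosen
    return matches
```

Note that a sketch like this balances only the variables it matches on. It does nothing about the factors that drive referral or attendance, which is why baseline differences can persist after matching.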
 
At the same time, each of these strengths may be moderated by one or more challenges (Table 1). For example, use of the same facility means that patients in both intervention and comparison groups shared the same area and organizational environments, which reduces the need to adjust for contextual factors. However, implementing a program in the same site simultaneously increases the potential for contamination, as patients, providers and staff with experience in the intervention may co-mingle or otherwise influence referrals, participation or behavior. Single-site studies also limit external validity, but all studies have to start somewhere, and a rush to multi-site research without solid pilot evidence would be unwarranted, not to mention prohibitively expensive. The authors also applied the same inclusion/exclusion criteria, increasing the comparability of intervention and comparison patients. Yet only patients with established primary care providers were eligible, among other exclusions, limiting knowledge about how the program might be deployed or tailored for other types of patients. And while comparison group patients were randomly sampled from the pool of eligible patients, the factors driving program referral (let alone attendance/participation) produce inherent selection effects.7 Matching comparison group patients to individual intervention patients based on the risk of readmission using a locally standardized risk classification scheme also represented a methodological strength. The evaluators’ approach to risk stratification and matching yielded intervention and comparison group members of low, moderate and high risk in equal proportions, seeking comparability in a fundamental and inherent driver of readmissions. At the same time, despite best efforts, the intervention group ultimately included significantly more patients with chronic obstructive pulmonary disease (COPD) or asthma (46% vs. 24%) and fewer with cirrhosis (4% vs. 22%) or depression (22% vs. 44%).1 As is commonly done, the authors adjusted for these and other baseline differences in their final analyses, as they focused on better understanding the program’s independent effects.7 The extent to which the baseline differences continued to be associated with program participation after statistical adjustment is not reported; understanding whether these differences had meaningful implications for their findings may have been a useful adjunct, especially for patients with mental health conditions.
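
One common way to gauge the size of baseline imbalances like these, independent of sample size, is the standardized difference for proportions. The sketch below applies it to the three percentages quoted above. It is offered as an illustration only: neither the original paper nor this commentary reports this statistic, the function name is hypothetical, and the frequently cited cutoff of about 0.1 in absolute value is a rule of thumb rather than a formal test.

```python
import math


def standardized_difference(p_intervention: float, p_comparison: float) -> float:
    """Standardized difference between two proportions.

    Divides the raw difference by the square root of the average of the
    two binomial variances, so the result does not depend on group size.
    """
    pooled_variance = (
        p_intervention * (1 - p_intervention)
        + p_comparison * (1 - p_comparison)
    ) / 2
    if pooled_variance == 0:
        # Both proportions are identical at 0 or 1; no measurable imbalance.
        return 0.0
    return (p_intervention - p_comparison) / math.sqrt(pooled_variance)


# Baseline prevalences quoted in the text (intervention, comparison).
baseline = {
    "COPD or asthma": (0.46, 0.24),
    "cirrhosis": (0.04, 0.22),
    "depression": (0.22, 0.44),
}

for condition, (p_int, p_cmp) in baseline.items():
    d = standardized_difference(p_int, p_cmp)
    print(f"{condition}: standardized difference = {d:+.2f}")
```

All three values exceed 0.1 in absolute value, which is consistent with the imbalances the authors flagged. The statistic describes unadjusted imbalance only; it does not show whether the differences still mattered after covariate adjustment.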
 
 
 
Table 1. Strengths and Challenges Associated with Comparison Group Methods in Cavanaugh et al.1
 
 
 
That said, the reality is that for every strength, any number of challenges is always present. Criteria that seek to homogenize patients into comparable, highly-controlled groups to improve internal validity often limit our ability to translate research and QI evidence into routine clinical practice.8,9 Therefore, achieving comparability of intervention and comparison groups in program evaluation requires constant balancing of the demands of both internal and external validity.


Published online March 14, 2014

Implications of Context for Considering External Validity, Implementation and Spread
A central tenet of external validity is the ability to generalize beyond the circumstances of a particular study to other patients, providers and practices.10 However, local innovations, like the case presented here, are just that: local. They reflect the environment within which providers practice and patients experience care, and within which quality improvement and evidence-based programs may or may not take hold and work.11 As a result, area and organizational contexts have substantial influences on quality of care, the design and deployment of QI interventions, and, ultimately, the implementation of evidence-based practice and policy.12
Cavanaugh and colleagues1 again provide a strong example by describing the local context within which they developed, tested and evaluated their follow-up program for individuals at risk for hospital readmission. They describe their organizational context as a large, academic internal medicine practice based at the University of North Carolina (UNC) at Chapel Hill. The practice is further distinguished by a 15-year history of QI activities, reflected by apparent in-house QI capabilities (e.g., use of root cause analysis, process mapping, small tests of change, run charts) and their referenced peer-reviewed publications, as well as National Committee for Quality Assurance (NCQA) recognition as a Level 3 patient-centered medical home (PCMH). This latter recognition reflects the highest level a practice can achieve on standards of access/continuity, population management, planning and care management, self-management support/community resources, tracking/coordinating care, and performance measurement/improvement.13 Not surprising for a [...]
This description provides an extremely useful backdrop for understanding context and its implications for external validity, which in turn informs how readily other practices might be able to adopt (or need to adapt) the hospitalization follow-up clinic as a model. First, to what other types of practices might the evaluation results generalize?
A major advantage of the study is its reliance on practice-based research. As Lawrence Green notes, "If we want more evidence-based practice, we need more practice-based evidence."14 While this study was not part of a practice-based research network (PBRN), the practice under study appears to have operated under similar precepts: use of the practice as a "laboratory" for testing healthcare innovations, with engagement of frontline providers and staff in the framing and design of potential solutions to problems with which they have experience, benefiting from elements of QI and participatory action research.15,16 Because it is an academic practice, however, its results may be difficult to generalize to non-academic settings, where the majority of Americans obtain their health care. The multi-level partnerships between the UNC practice and UNC Hospitals may also not be easily reproducible elsewhere, yet such partnerships are essential to implementation and spread of evidence-based practice.17,18 Electronic health record (EHR) implementation is also very uneven in the U.S., with significant organizational determinants of variation in use and quality of decision support and data mining.
Adoption decisions by other organizations may be predicated on the extent to which being an enriched, academic practice actually served as a core attribute of the intervention and its implementation. While not defined as such in the paper, in "realistic evaluations," interventions are conceptualized and evaluated as being inextricably linked to their contexts, such that "mechanisms [of change] may be more or less effective in producing their intended outcomes, depending on their interaction with various contextual factors."19,20 For this innovation to have impacts beyond its relatively narrow context, readers may therefore need to consider how they might have to adapt their practice and/or the innovation to fit their own organizational milieu.21,22