INTRODUCTION

Background

A primary objective of public reports on healthcare quality is to provide comparative information that patients can use to make well-informed choices among providers and health plans.1 One critical challenge facing report designers is that this information typically consists of a large array of quality measures unfamiliar to most patients, as well as data that can be statistically complex and hard to decipher. To ease the cognitive task of interpreting and integrating many different pieces of information, report designers often employ multiple strategies, including the use of symbols to convey relative performance, the use of plain language to make measures understandable and salient, and the use of summary, or roll-up, measures that combine multiple, often disparate indicators of healthcare quality into a single score.1–9

At the highest level of aggregation, a roll-up quality measure can combine indicators of patient experience, clinical processes, and patient outcomes. One prominent example of this kind of measure is the Centers for Medicare & Medicaid Services (CMS) Five-Star Quality Rating System, which assigns overall ratings of one to five stars to providers and health plans based on their performance across various domains of quality.10 A roll-up measure could also represent an organization’s performance across multiple indicators of a single dimension of quality.11 For example, for medical groups, the California Office of the Patient Advocate reports a score for “Patients Rate Overall Experience” that combines five patient experience measures.
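As a concrete illustration, a roll-up score might be formed as a weighted average of its component measures. The sketch below is a minimal example under an assumed equal-weighting scheme; actual reporting programs, such as the CMS star ratings, define their own weighting and rounding rules.

```python
# Minimal sketch of forming a roll-up score from component measures.
# Equal weighting is an illustrative assumption, not a documented method.

def roll_up(component_scores, weights=None):
    """Combine component scores (e.g., 1-5 star ratings) into one score."""
    if weights is None:
        weights = [1.0] * len(component_scores)
    return sum(w * s for w, s in zip(weights, component_scores)) / sum(weights)

# Five hypothetical patient experience measures for one medical group:
print(roll_up([4, 5, 3, 4, 4]))  # -> 4.0
```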

The use of roll-up measures to reduce the cognitive burden imposed by quality reports is predicated on three assumptions: (1) well-constructed roll-up measures can accurately summarize performance across different dimensions of care,12 (2) roll-up measures can capture enough performance variation for patients to choose wisely among alternatives, and (3) weighing fewer attributes (e.g., quality measures) results in better decisions.6, 13

Not all stakeholder groups (including patients, clinicians, hospitals, health plans, and developers of healthcare quality reports) view the provision of roll-up scores in healthcare quality reports for patients as a positive strategy. Critics commonly express concerns about the potential of roll-ups to negatively affect decision making by obscuring important nuances of quality, about the sometimes-hidden assumptions regarding the relative importance of component measures, and about the lack of evidence-based methods for calculating scores.11,14–16 Additionally, some report developers raise concerns about whether display strategies align with patients’ preferences.11,14–20

Objective

The public reporting community is divided on the use of roll-ups, but little is known about patients’ understanding and use of such scores when choosing clinicians.11 To address that gap, we conducted an experiment to determine whether presenting quality scores at different levels of aggregation affects patients’ clinician choices. A realistic, interactive website presented participants with information about the quality of a set of fictional clinicians. This information included scores for an array of quality measures as well as patients’ comments about clinicians. The site allowed participants to explore the performance of different clinicians and make a hypothetical selection. We used this experimental platform to explore how the choice of clinician varies based on exposure to roll-up scores, disaggregated drill-down scores (i.e., the component measures used to create a roll-up measure), or both. The third arm was included because several report developers participating in a previous study indicated that they would most likely present both roll-up and drill-down scores rather than just one.11

Because of the complexity and volume of healthcare quality data, we anticipated that organizing and presenting data in ways that are easier for participants to understand (i.e., roll-up scores) would result in a lower cognitive load, leading to an easier decision process and better clinician choices. Specifically, we hypothesized that relative to those presented with drill-down scores only, participants presented with roll-up scores would exert less effort in the choice process, be more likely to choose the best-performing clinician (according to quantified performance metrics), and make fewer preference-incongruent choices. We entertained two competing hypotheses for participants seeing both roll-up and drill-down scores. On the one hand, because those who saw both roll-up and drill-down scores could view quality data at whatever level of aggregation aligned with their personal information needs and preferences, they might behave similarly to those in the roll-up arm. On the other hand, participants receiving both roll-up and drill-down scores might have too much information to process, putting them at risk of confusion and poorer choices.

Finally, we explored whether the effects of presenting roll-up scores differed across sicker and healthier participants. Because sicker participants have more experience interacting with clinicians, they might value, and be better able to use, detailed information about clinician performance than healthier participants do. Alternatively, sicker participants who are taxed by their poor health may find detailed information especially overwhelming. If so, the availability of drill-down data might be more detrimental to the choices of sicker participants than to those of healthier participants.

METHODS

Data were collected from May to July 2015 as part of a larger study of clinician choice.21 We focus here on three experimental arms that varied the types of scores provided to participants. This study was approved by the first author’s institutional review board. All participants underwent an informed consent procedure.

Participants

Participants were drawn from the GfK probability-based Internet Knowledge Panel, which is designed to be representative of the US population.22 Fifty-two percent of the panelists invited to participate in the study accepted, resulting in 550 participants across the three arms.

Design and Procedure

Participants completed an initial survey that included questions about their real-life healthcare experiences and how they choose a clinician. After a week, 85% of the respondents who completed the initial survey returned to the study and were directed to an experimental website called SelectMD that had information about a fictitious set of 12 clinicians (see Fig. 1). After being randomly assigned to one of three experimental arms, participants were asked to use the information on the website to select the clinician they thought would be best for their healthcare needs, treating the choice as carefully as if it were a real one (for more on the site, see a related methodological report21 or the online appendix). After selecting a clinician, participants completed a second survey about the choice process. The SelectMD website collected data on participants’ website use while they were engaged in the choice task. The analyses presented in this paper combine objective measures of website usage and participants’ survey responses. Most data were based on self-reports or behavior within the study. Some socio-demographic variables (gender, age, education, and race) were maintained by GfK as part of panel administration and provided to us.

Figure 1. Sample screenshot from the SelectMD experimental website.

The SelectMD website allowed participants to see three categories of quantitative performance metrics: patient experience survey scores, clinical quality scores (i.e., indicators of the extent to which the care the clinician delivered aligned with best practices), and patient safety scores (i.e., indicators of the clinician’s adoption of protocols that enhance patient safety). Scores were presented as one-to-five star ratings. In addition, all participants could view comments about clinicians and their staff; comments did not differ across the experimental arms. These comments mimicked actual patient comments about clinicians; the content was drawn from real comments found on clinician quality websites and comments elicited in a companion study.23

The three experimental arms varied the presentation of performance scores. In the drill-down arm, participants saw scores for four measures in each of the three categories, for a total of 12 quality scores (e.g., for patient experience, scores indicated how well clinicians communicate with their patients, patients’ ability to get timely appointments and information, the courtesy and helpfulness of office staff, and how well clinicians attend to their patients’ mental or emotional health). In the roll-up arm, participants saw only rolled-up scores for each of the three categories of metrics (patient experience, clinical quality, and patient safety); they did not see the individual measures combined to form each roll-up score. In the drill-down plus roll-up arm, participants could see roll-up scores for each category as well as the individual measures used to form the roll-up scores.
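The display logic of the three arms can be summarized schematically. The sketch below is illustrative only; the measure names are hypothetical stand-ins rather than the exact labels used on SelectMD.

```python
# Sketch of which scores each experimental arm could see.
# Measure names are hypothetical stand-ins for the 12 component measures.
CATEGORIES = {
    "patient_experience": ["communication", "timely_access",
                           "office_staff_courtesy", "mental_health_attention"],
    "clinical_quality": ["cq_measure_1", "cq_measure_2",
                         "cq_measure_3", "cq_measure_4"],
    "patient_safety": ["ps_measure_1", "ps_measure_2",
                       "ps_measure_3", "ps_measure_4"],
}

def scores_shown(arm):
    """Return the scores visible per category in a given arm."""
    if arm == "drill_down":               # 12 component measures only
        return dict(CATEGORIES)
    if arm == "roll_up":                  # 3 category roll-up scores only
        return {cat: ["roll_up_score"] for cat in CATEGORIES}
    if arm == "drill_down_plus_roll_up":  # roll-ups plus their components
        return {cat: ["roll_up_score"] + measures
                for cat, measures in CATEGORIES.items()}
    raise ValueError(f"unknown arm: {arm}")
```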

Main Measures

Decision Quality

We measured decision quality in two ways. First, some presented clinicians were objectively worse than others, in the sense that they were rated no better on any dimension (i.e., the three categories of metrics) and worse on at least one; such an alternative is known as a “dominated” option.24 We assessed whether a participant chose one of these suboptimal clinicians.
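A minimal sketch of this dominance check, using hypothetical clinicians and star ratings:

```python
# Flag "dominated" clinicians: those rated no better on any of the three
# category ratings than some other clinician and strictly worse on at
# least one. Names and ratings below are hypothetical.

def dominates(a, b):
    """True if ratings tuple a dominates ratings tuple b."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def dominated_options(ratings):
    """ratings: dict mapping clinician -> (experience, quality, safety)."""
    return {b for b in ratings
            if any(dominates(ratings[a], ratings[b])
                   for a in ratings if a != b)}

ratings = {"Dr. A": (5, 4, 4), "Dr. B": (4, 4, 3), "Dr. C": (3, 5, 5)}
print(dominated_options(ratings))  # {'Dr. B'}: Dr. A is at least as good on
                                   # every dimension and better on two
```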

Second, we constructed a measure, labeled preference incongruence, that captured the level of disagreement between how important a participant said each of the dimensions was in a preferred clinician and the characteristics of the selected clinician. For each of the three dimensions reported on the website, participants rated whether each dimension “matters a lot,” “matters some,” or “does not matter much.” Participants were labeled as having made a preference-incongruent choice if they identified one of the performance dimensions as being among their top characteristics of a preferred clinician (i.e., they reported that the dimension “matters a lot”) but did not actually select a clinician on the website who was highly rated on that dimension (i.e., had a five-star rating). For example, a participant who indicated that safety matters a lot and chose a clinician with three stars on safety would have made a preference-incongruent choice on that dimension. This preference incongruence measure was constructed using the preferences expressed on the post-choice survey to allow for learning from the website.25
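The flag can be expressed compactly; the dimension labels and ratings below are illustrative:

```python
# Sketch of the preference-incongruence flag as described: a choice is
# incongruent if any dimension the participant said "matters a lot" was
# not rated five stars for the chosen clinician.

def preference_incongruent(stated_importance, chosen_ratings):
    """stated_importance: dict dimension -> "matters a lot" / "matters some"
       / "does not matter much"; chosen_ratings: dimension -> stars (1-5)."""
    return any(importance == "matters a lot" and chosen_ratings[dim] < 5
               for dim, importance in stated_importance.items())

importance = {"experience": "matters some",
              "quality": "matters a lot",
              "safety": "matters a lot"}
print(preference_incongruent(importance,
                             {"experience": 5, "quality": 5, "safety": 3}))
# True: safety "matters a lot" but the chosen clinician has only 3 stars
```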

Level of Effort

To determine level of effort, we used two objective measures derived from tracking data collected while participants used the SelectMD website: the length of time spent on the website and the number of concrete actions taken (e.g., clicking on items, hovering over a pop-up element).
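As an illustration, both measures could be computed from a timestamped event log along the following lines; the event schema shown is an assumption, since only the resulting measures are reported here.

```python
# Hedged sketch: deriving the two effort measures from a clickstream log.
# The event format (timestamp plus action type) is assumed for illustration.
from datetime import datetime

ACTION_TYPES = {"click", "hover"}  # concrete actions counted as effort

def effort_measures(events):
    """events: list of (iso_timestamp, action_type) tuples for one participant."""
    times = [datetime.fromisoformat(t) for t, _ in events]
    minutes_on_site = (max(times) - min(times)).total_seconds() / 60
    n_actions = sum(1 for _, a in events if a in ACTION_TYPES)
    return minutes_on_site, n_actions

log = [("2015-06-01T10:00:00", "page_view"),
       ("2015-06-01T10:02:30", "click"),
       ("2015-06-01T10:05:10", "hover"),
       ("2015-06-01T10:06:45", "click")]
print(effort_measures(log))  # (6.75, 3)
```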

Assessing Health Status

As part of the initial survey, participants were asked if they had ever been treated for a serious or life-threatening condition. They were also asked if they had a long-term medical condition that required medical monitoring or treatment. Participants who responded “yes” to either question were classified as “sicker”; all others were coded as “healthier.”

Statistical Analysis

We examined sample characteristics and tested the success of random assignment across the three arms. Then, we tested for pairwise differences among the three arms on each of the dependent measures, using chi-square tests of independence and independent-sample t tests. Finally, we broke the sample into healthier and sicker subsamples, again testing for differences across arms on each of the dependent measures.
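For illustration, one pairwise comparison could be computed as follows (using scipy); all values are hypothetical and are not the study’s data.

```python
# Illustrative pairwise tests mirroring the analysis described above;
# all numbers are hypothetical, not the study's data.
from scipy import stats

# Chi-square test of independence: suboptimal choice (yes/no) by arm.
table = [[60, 120],   # hypothetical drill-down arm: suboptimal, optimal
         [40, 140]]   # hypothetical roll-up arm
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Independent-samples t test: number of website actions by arm.
actions_drill_down = [12, 18, 9, 30, 14]
actions_roll_up = [8, 11, 7, 15, 10]
t_stat, p_t = stats.ttest_ind(actions_drill_down, actions_roll_up)

print(f"chi2={chi2:.2f}, p={p_chi2:.3f}; t={t_stat:.2f}, p={p_t:.3f}")
```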

RESULTS

As shown in Table 1, participants did not differ across experimental arms in terms of age, race/ethnicity, education, chronic disease, self-reported health status, or self-reports of having visited a doctor quality website, indicating that randomization to experimental arm was effective. Because there were no significant differences in patient characteristics across arms, these characteristics were not included as covariates in the models presented here. Compared to the general US population, our sample skewed older, whiter, and more educated. Our sample was comparable to the US population in terms of the proportion experiencing chronic health conditions but reported slightly poorer health.

Table 1 Sample Characteristics by Experimental Arm

Effect of Provision of Roll-up Scores on Level of Effort

Participants who saw drill-down scores (mean = 14.9, SD = 19.0; Table 2) or drill-down scores and roll-up scores together (mean = 19.2, SD = 28.5) took significantly more actions while navigating the site relative to those who saw roll-up scores (mean = 10.5, SD = 9.5).

Table 2 Effects of Quality Data Presentation on Decision Quality

Effect of Provision of Roll-up Scores on Whether the Best Clinician Is Chosen

A significantly greater proportion of participants in the drill-down arm chose a suboptimal clinician (36.3%, Table 2) relative to participants in the roll-up (23.4%) and drill-down plus roll-up (25.6%) arms.

Participants in the drill-down arm made more preference-incongruent choices (51.2%; Table 2) than did participants in the roll-up (45.6%) and drill-down plus roll-up (47.5%) arms.

Thus, on both measures of decision quality, more participants in the drill-down arm made poor quality choices than did participants in the other arms.

Differences Based on Health Status

We examined the level of effort and decision quality among healthier and sicker participants (Table 3). For neither group did time spent deliberating vary across the arms, and both groups took more actions in the drill-down plus roll-up arm than in the roll-up arm (healthier participants also took more actions in the drill-down plus roll-up arm than in the drill-down arm).

Table 3 Effects of Quality Data Presentation on Decision Quality for Healthier and Sicker Participants

The effect of roll-up versus drill-down presentation on decision quality was somewhat stronger for sicker participants than for healthier ones. Specifically, among sicker participants, a greater proportion chose a suboptimal clinician or made preference-incongruent choices when viewing drill-down scores relative to those who saw roll-up scores (or drill-down plus roll-up scores, in the case of preference incongruence). Among healthier participants, more participants in the drill-down arm selected a suboptimal clinician than did those in the drill-down plus roll-up arm, and those who viewed only drill-down scores made more preference-incongruent choices relative to those who saw roll-up scores.

DISCUSSION

In summary, participants in this study who saw quality information only in the form of drill-down scores tended to engage in more effortful consideration of the data they were provided (as indicated by taking more actions on the website) but made worse clinician choices than did those who had access to roll-up scores. This was true whether decision quality was measured by preference incongruence or the selection of dominated alternatives. Adding drill-down scores to roll-up scores increased effort but did not appear to harm decision quality. These findings suggest that it is advantageous to include roll-up scores in reports on healthcare quality with or without accompanying drill-down information.

The differences between healthier and sicker patients in their use of quality metrics may be useful for report developers. Because sicker participants likely have more frequent encounters with clinicians than do healthier participants, the process of choosing a clinician is especially salient for them. Information presentation appeared to affect the amount of effort healthier participants exerted, yet had relatively little effect on their decision quality. Conversely, while sicker participants also differed in effort across experimental arms (although less so than healthier participants), the differences in decision quality were more profound, with those seeing only drill-down information performing the worst. This finding suggests that providing roll-up information (with or without accompanying drill-down information) may be particularly important for those most involved with the healthcare system. As posited earlier, sicker patients may be taxed by their poor health and benefit from the reduced cognitive load provided by roll-up scores. However, this explanation cannot be tested here, and further studies should determine the robustness of this difference.

The study is subject to several limitations. First, while the website was designed to be as realistic as possible, participants engaged in a hypothetical clinician choice, not a real one. Despite the hypothetical nature of the task, participants spent about 6 to 7.5 minutes choosing a clinician, suggesting that they engaged with the task. Though the study was not designed to estimate rates of poor decision making in the general population, the rates of suboptimal choices in this study (23% to 36%, depending on arm) are lower than those found in other studies of real-world health decision making.30,31 For example, from 2007 to 2010, traditional Medicare was a dominated option relative to Medicare Advantage, yet fewer than a quarter of beneficiaries selected Medicare Advantage.30 Second, though we attempted to capture the most relevant constructs related to physician choice, unmeasured factors could have affected responses. Randomization to experimental arms should have minimized the chance that unmeasured constructs varied systematically across arms. Third, the study focused on the choice of clinician only. We did not test other healthcare choice scenarios (e.g., selecting hospitals or health plans), but we would not expect the effect of providing roll-up scores to differ substantially.

This study increases knowledge of how patients’ clinician choices are influenced by the level of aggregation and presentation of healthcare quality data. Some questions about roll-up scores remain unaddressed, including how clinicians or health plans respond to roll-up measures or use them to guide quality improvement initiatives32,33 and how best to calculate roll-up scores.

CONCLUSIONS

The results of this study suggest the value of presenting roll-up scores in healthcare quality reports for patients. Developers of such reports may want to consider summarizing performance, where possible, both to reduce cognitive load and to improve decision quality. Because participants in the roll-up-only and drill-down plus roll-up arms did not significantly differ in their likelihood of making errors, it appears that providing drill-down scores does not necessarily hurt decision making. Instead, it seems that roll-up scores (whether presented alone or as a complement to drill-down scores) can potentially improve decision making. Previous research suggests that report sponsors and national organizations involved in public reporting may be more comfortable offering both roll-up and drill-down measures to accommodate individuals with different information needs.11 Though the reporting of roll-up scores continues to pose practical issues (e.g., the development of scientific methods for calculating fair and reliable scores11), this study is the first to empirically demonstrate that the provision of roll-up scores can increase the proportion of patients choosing better-performing clinicians.