Key Points for Decision-Makers

As the use of patient-reported outcome measures (PROMs) becomes more widespread in healthcare, including their integration into clinical workflows, it is essential that all patients, including those with low health literacy, can understand and complete PROMs.

This study shows that the comprehensibility of PROMs can be improved, especially by including clear instructions for patients and paying attention to the number of questions and answer options.

More attention to comprehensibility is needed when developing or selecting PROMs for implementation in initiatives supporting the decision-making process or healthcare evaluation. The Pharos Checklist for Questionnaires in Healthcare (PCQH) is a valuable tool but needs further development and validation for these purposes.

1 Introduction

Patient-reported outcomes (PROs) are becoming increasingly important in clinical trials as well as in daily routine healthcare and are considered essential in value-based healthcare [1,2,3]. Various national health authorities now recommend routine collection of PROs from patients receiving medical specialist care [2, 4,5,6]. Studies show that patient-level PRO data can enhance patient engagement, shared decision-making and personalized treatment [7,8,9], while aggregated data across care providers can be used in quality improvement through benchmarking and shared learning [10,11,12]. Additionally, aggregated PRO data serves to provide real-world evidence on treatment effectiveness and safety [13, 14] and ensures accountability to payers and the public [15]. Therefore, patient-reported outcome measures (PROMs) are increasingly included in core outcome data collection initiatives [16,17,18]. To achieve the potential benefits of using PROMs in all patients and to obtain high-quality aggregated data, it is important that every patient can participate, particularly when PROMs are embedded in the clinical workflow. However, PROM implementation studies show challenges regarding the comprehensibility of PROMs (i.e., the degree to which the PROM is correctly understood by patients) [19,20,21,22].

Problems with comprehensibility are of particular concern in people with low (health) literacy, which is a worldwide problem [23,24,25]. For instance, in the Netherlands, one in four people have insufficient or limited health literacy skills [26, 27]. They face difficulties in obtaining, understanding, appraising, and using health information when making health-related decisions [28], for example, when filling in PROMs. A part of this population also has low basic literacy skills [29]. Health literacy is recognized as a major determinant of health and socioeconomic health disparities by the World Health Organization [30] and lower health literacy is associated with poorer health outcomes and increased mortality [25, 31].

To ensure that every patient, regardless of their health literacy skills, can participate in and benefit from discussing their own PRO data during consultations, and to obtain PRO data that represent the entire population for quality assessment purposes, it is crucial that everyone can understand PROMs [32, 33]. Consequently, various well-defined frameworks to support PROM development [e.g., the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) [34]] and selection [e.g., the International Consortium for Health Outcomes Measurement (ICHOM) [35], the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) [36], the PROM-cycle [37], and the International Society for Quality of Life Research (ISOQOL) [38]] emphasize the importance of comprehensibility [16, 17, 22–40]. For instance, the ISPOR PRO Good Research Practices Task Force advises using cognitive interviews to ensure that respondents understand how to complete the PROM, the meaning of the questions, and how to use the response scales [41, 42]. In addition, according to the COSMIN guideline and the PROM-cycle, the comprehensibility of PROMs affects content validity, which is often considered the most important psychometric property of a PROM [43]. It is therefore recommended to use qualitative methods that involve patients, such as cognitive interviews, to assess the comprehensibility of PROMs [43]. However, a scoping review by Wiering et al. shows that patients were involved in testing for comprehensibility in only 51% of developed PROMs [44]. Other studies show, for example, that the readability level of PROMs is higher than recommended [45,46,47]. This may result in lower completion rates or hinder accurate completion, compromising PROM validity and excluding people with low (health) literacy [32, 33, 45].

Little is known, however, about how comprehensibility varies between PROMs, as qualitative assessment methods make it difficult to compare comprehensibility across PROMs or populations. Moreover, when a large number of PROMs needs to be evaluated for comprehensibility, for example during PROM selection, conducting qualitative studies on multiple PROMs is often impractical, as it is time-consuming and requires specific expertise.

The aim of this study was to evaluate the comprehensibility of the PROMs included in the 35 core outcome sets that were developed as part of the Dutch Outcome-Based Healthcare Program (2018–2023), which was a national initiative to stimulate the collection of routine patient-reported and clinical outcomes in daily medical specialist care. A core outcome set, further referred to as “outcome set,” is an agreed standardized collection of important outcomes and how they should be measured in a specific area of health or health care [48]. The program was initiated as part of the national policy agenda for medical specialist care agreed upon by all relevant umbrella organizations and conducted under the auspices of the Dutch Ministry of Health, Welfare and Sport [2, 49].

2 Methods

2.1 Dutch Outcome-Based Healthcare Program

In the Outcome-Based Healthcare Program, disease/condition-specific working groups, consisting of mandated representatives of the umbrella organizations of all stakeholders involved in Dutch medical specialist care (including patient organizations), developed 33 outcome sets to support shared decision-making and healthcare quality evaluation [50]. The process was facilitated by the Dutch National Health Care Institute, which provided methodological and organizational support. The program was tasked with developing 36 outcome sets for diseases and conditions representing a considerable part of the Dutch national disease burden [50, 51]; due to practical considerations, 33 sets were eventually developed (Online Resource 1). In addition, two sets of Generic PROMs (for adults and children) were developed, consisting of PROs that reflect common areas of disease impact for all patients in medical specialist care [49, 52, 53]. In this study on the comprehensibility of PROMs, outcome sets were excluded if they: (1) contained no PROMs, (2) were developed for children, or (3) were not (yet) approved by all participating umbrella organizations.

2.2 Pharos Checklist for Questionnaires in Healthcare

Pharos, the Dutch Centre of Expertise on Health Disparities, developed a checklist to evaluate and improve the comprehensibility and accessibility of questionnaires for adults with low health literacy: the "Pharos Checklist for Questionnaires in Healthcare" (PCQH) [54]. Types of healthcare questionnaires that can be assessed with the PCQH include, for example, PROMs, patient-reported experience measures (PREMs), and questionnaires for scientific studies or nationwide surveys. Pharos is committed to reducing avoidable health disparities due to socio-economic conditions [55] and is frequently involved in the development and adaptation of health questionnaires. To do this, Pharos employs a standardized qualitative methodology [56] based on think-aloud principles and the use of probing questions for rewording and testing with persons with low literacy levels [57, 58]. The PCQH was developed based on the experience gained through this work, substantiated by scientific publications [32, 59–64].

The PCQH consists of four components: (1) comprehensibility, (2) accessibility, (3) layout, and (4) validation. In our study we focus on comprehensibility, which we define as the ease with which the intended patient population of a PROM can understand, interpret, and accurately respond to the questions or items presented. In the PCQH, comprehensibility comprises the following eight domains: (1) language level, (2) presence of a brief and clear instruction, (3) number of questions, (4) number of answer options, and the use of (5) questions in active voice, (6) medical terms/abbreviations, (7) concrete questions, and (8) statements. Each domain is rated using a color-coded ordinal or nominal scale, where "green" denotes "optimal" adherence to the domain criteria, "orange" denotes "acceptable" adherence, and "red" denotes "significant inadequacies" requiring substantial revisions for enhanced clarity and accessibility.

Since the PCQH was originally intended to support the development of new, or adaptation of existing, complete questionnaires in healthcare (i.e., evaluation of complete PROMs rather than rating multiple PROM scales), several changes were made to the criteria used for rating some of the eight domains to better align with the goals of this study. A detailed overview of the domain-specific definitions and ratings, as well as the changes made to the PCQH, is presented in Table 1. We evaluated comprehensibility at the scale level of PROMs, meaning that for multidimensional PROMs the PCQH was applied to each individual scale. For example, the Michigan Hand Outcomes Questionnaire consists of six scales that measure multiple PROs [65]; only the activities of daily living scale was included in the outcome set for "hand and thumb base osteoarthritis" and assessed for comprehensibility in this study. Therefore, the number of questions allowed for a green or orange rating was reduced relative to the original version of the PCQH, which is intended to be applied to a complete questionnaire. Threshold values for the acceptable length of individual scales were derived from the balance between minimizing patient burden and maintaining scale reliability. In this regard, we established that a maximum of six questions would receive a green rating, as this represents the minimum number of questions that still achieves acceptable internal consistency reliability (Cronbach's alpha = 0.72) at the group level, assuming relatively low inter-item correlations of 0.3 [66]. Furthermore, for a green rating in the domain "questions in active voice," the original version of the PCQH tolerates an absolute number of questions with the unfavorable property; applying this threshold disproportionately penalizes longer questionnaires. To ensure a fair and balanced evaluation, we adjusted the criteria so that ratings in this domain consider the proportion of questions in active voice rather than their absolute number. To enable more objective rating, the criteria used to rate the domains "concrete questions" and "statements" were adapted from nominal in the original version of the PCQH to ratio in this study. Again, to avoid penalizing longer PROM scales, percentages are used instead of absolute numbers.
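For reference, the cited reliability threshold can be reproduced with standardized Cronbach's alpha (the Spearman–Brown prophecy formula applied to the mean inter-item correlation $\bar{r}$), using $k = 6$ questions and $\bar{r} = 0.3$:

$$\alpha = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}} = \frac{6 \times 0.3}{1 + 5 \times 0.3} = \frac{1.8}{2.5} = 0.72$$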

Table 1 Eight comprehensibility domains and their ratings

2.3 Analyses

All eight raters were methodologists with expertise in PROMs, employed by the Outcome-Based Healthcare Program, and were trained by Pharos (H.v.B. and G.B.) to apply the PCQH. For calibration, the methodologists assessed the comprehensibility of the eight scales of the SF-36 [67]. Differences and similarities were discussed to reach a consistent interpretation of each domain.

The methodologists were divided into fixed pairs and PROM scales were assigned to these pairs. When a PROM consisted of multiple scales, all included scales of that PROM were assigned to the same pair. Subsequently, each PROM scale was independently assessed by both members of a pair. An online tool (Klinkende Taal) was used to assess the (Dutch) language level of a text, based on a computer algorithm that combines a preexisting list of difficult words with linguistic and syntactic analysis [68]. Differences in assessment were discussed in a consensus meeting within each pair, and if no conclusion was reached, a third assessor was consulted. As a result, each pair provided one final assessment per PROM scale on the eight comprehensibility domains of the PCQH. All descriptive results regarding the comprehensibility of the PROM scales in this study are based on these final ratings.

The interrater agreement between the two methodologists' assessments of comprehensibility, prior to the consensus meeting, was evaluated for each domain by examining the absolute percentage of agreement and its corresponding 95% confidence interval. For the four domains measured on a ratio scale (number of questions, active voice in questions, concrete questions, statements), the intraclass correlation coefficient (ICC) was determined, based on a two-way analysis of variance (ANOVA) model for single measurements and absolute agreement (ICC type 2,1 according to Shrout and Fleiss) [69]. For the interpretation of the ICC, an ICC > 0.90 was assumed to indicate good agreement. For the four domains measured at an ordinal level (all other domains), the kappa statistic was calculated. The interpretation of the kappa values follows the classification of Landis and Koch, where a kappa value of 0.81–1 is considered "almost perfect agreement" and a value of 0.61–0.80 is considered "substantial agreement" [70]. Lower values indicate insufficient agreement between assessments.
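As an illustration, the sketch below shows how these agreement statistics could be computed. The data layout, helper names, and library choices (pandas, pingouin, scikit-learn, statsmodels) are assumptions for illustration only; the paper does not state which software was used.

```python
# Sketch of the interrater agreement analyses described above. The data
# layout, function names, and library choices are illustrative assumptions;
# the paper does not report which software was used.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.proportion import proportion_confint

def absolute_agreement(r1: pd.Series, r2: pd.Series):
    """Absolute percentage of agreement between two raters, with a 95% CI."""
    n_agree = int((r1 == r2).sum())
    n = len(r1)
    low, high = proportion_confint(n_agree, n, alpha=0.05)
    return n_agree / n, (low, high)

def icc_2_1(r1: pd.Series, r2: pd.Series) -> float:
    """ICC(2,1): two-way random effects, single measurement, absolute agreement."""
    n = len(r1)
    long = pd.DataFrame({
        "scale": list(range(n)) * 2,          # one target per PROM scale
        "rater": ["A"] * n + ["B"] * n,
        "rating": pd.concat([r1, r2], ignore_index=True),
    })
    icc = pg.intraclass_corr(data=long, targets="scale", raters="rater",
                             ratings="rating")
    return float(icc.loc[icc["Type"] == "ICC2", "ICC"].iloc[0])

# Toy example: a ratio-scale domain (number of questions) and a
# color-rated ordinal domain (instruction), pre-consensus ratings.
rater_a = pd.DataFrame({"n_questions": [6, 10, 3, 12],
                        "instruction": ["green", "red", "orange", "red"]})
rater_b = pd.DataFrame({"n_questions": [6, 10, 4, 12],
                        "instruction": ["green", "orange", "orange", "red"]})

print(absolute_agreement(rater_a["instruction"], rater_b["instruction"]))
print(icc_2_1(rater_a["n_questions"], rater_b["n_questions"]))
print(cohen_kappa_score(rater_a["instruction"], rater_b["instruction"]))
```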

3 Results

3.1 Outcome Sets and PROMs Selected for Analyses

A total of 6 of the 35 outcome sets (33 disease/condition-specific sets and 2 Generic PROM sets) developed in the Outcome-Based Healthcare Program were excluded from the study. The cataract set and macular degeneration set did not contain PROMs, and the pancreatic cancer set and renal cell carcinoma set were not approved (yet) by the boards of all participating umbrella organizations. In addition, the asthma in children set and the generic PROM set for children were excluded as the focus of this study is on PROMs for adults. The remaining 29 outcome sets included a total of 157 PROM scales. All 157 PROM scales were independently assessed as described in the “Methods” section. A third assessor was consulted for six PROM scales. Online Resource 2 contains the comprehensibility profiles of each of the 157 PROM scales.

3.2 Agreement Between Assessors

Agreement between assessors was predominantly high, with ICC/kappa values in most domains compatible with good agreement according to the specified cutoff values (Table 2). Absolute agreement was lower for most ratio variables (active voice in questions, concrete questions, statements). This can be explained by the wider range of possible values compared with ordinal variables, which makes exact agreement between assessors less likely. The negative kappa value for the domain medical terms/abbreviations indicates lower agreement than expected by chance, despite high absolute agreement. Inspection of the data revealed that this was caused by an uneven distribution of assessments across categories, with nearly all assessments falling into the green category. The instructions domain is the only domain in which a low degree of agreement cannot be explained by anything other than genuine assessment discrepancies; these were due to varying interpretations of which section of a PROM should be identified as the instruction.

Table 2 Agreement between assessors

3.3 Comprehensibility Results per PCQH Domain

Next, we address the results per domain based on the final ratings among the pairs of assessors (Fig. 1).

Fig. 1 Results per domain

Almost all PROM scales (91%) were judged to be written at A1, A2, or B1 language level according to the Common European Framework of Reference for Languages (CEFR), which constitutes optimal adherence to the criteria for language level (Fig. 1) [71]. No PROM scale was written at language level C1 (red). We observed that, in general, the language level of the PROMs' instructions was rated as more difficult (68% of the instructions' language level ratings were green) than that of the questions (91% of the questions' language level ratings were green) (data not shown in Fig. 1).

Only 48% of PROM scales included a general instruction that was directed at the patient and contained the subject/purpose of the PROM scale as well as a fill-in instruction. For a slightly smaller proportion (39%), the rating was red; this category includes the PROM scales that lacked any form of instruction (15% of all PROM scales).

With respect to the domain “number of questions,” we found that PROM scales with one question were most common, followed by PROM scales with ten questions (Fig. 2). More than half (54%, green rating) of the PROM scales contained six or fewer questions (Figs. 1 and 2). Only 13% of PROM scales contained 11 or more questions (red rating), with a maximum of 66 questions. Approximately half of all PROM scales (54%) contained a maximum of four answer options and/or used a numeric rating scale (NRS) ranging from 0 or 1 to 10. For almost one in five PROM scales (18%), the number of answer options was more than five or the PROM scale contained an open question.

Fig. 2 Domain "number of questions" in PROM scales

Questions in a PROM scale were either all formulated actively (87%) or predominantly formulated passively (8%). This pattern was also seen in the concrete questions domain (72% exclusively concrete questions versus 11% predominantly non-concrete questions).

Abbreviations were not used at all in the assessed PROM scales, and the use of medical terms was almost absent (98%). In four PROM scales, medical terms were used with (orange rating) or without (red rating) an explanation in lay terms.

Most PROM scales consisted of interrogative sentences only (green, 72%). Although the red category theoretically also includes PROMs consisting of a combination of interrogative sentences and statements (≤ 25% interrogative sentences), all PROM scales rated red on this domain (25%) consisted of statements only.

Table 3 shows the PROM scales that had a green rating on all eight domains of the comprehensibility component of the PCQH.

Table 3 Most comprehensible PROM scales

4 Discussion

The current study provides an extensive evaluation of the comprehensibility of 157 widely used and previously validated PROM scales included in 28 disease/condition-specific outcome sets and one generic PROM set that were developed as part of the National Outcome-Based Healthcare Program in the Netherlands.

A total of 18 of the 157 PROM scales (11%) had a green rating on all eight domains of the comprehensibility component of the PCQH and should be easy to use for all patients. Most PROM scales are at an appropriate language level and largely free from medical terms and passively formulated questions. However, approximately half of all PROMs lacked clear instructions; in fact, in 15% of all cases instructions were completely absent. Moreover, we found that individual PROM scales regularly comprised more than ten questions and more than five response options per question. These factors reduce comprehensibility, may bias the outcomes obtained, and may preclude the participation of patients with lower (health) literacy skills in PROM initiatives.

A strength of this study is that a large number of PROM scales was assessed, covering a wide range of diseases and conditions across Dutch medical specialist care. This suggests that the evaluated PROM scales constitute a representative sample of those currently in use. Furthermore, the analysis presented is based on final ratings by trained PhD-level raters with expertise in outcome assessment and PROMs. The interrater reliability of the individual ratings was high for most domains. An exception is the instruction domain; further examination showed that this was explained by differences in interpretation of which part of a PROM to mark as an instruction. For example, some assessors only included instructions at the beginning of a PROM, while others also included instructions provided with individual questions. During the consensus meetings, these interpretations were aligned, resulting in a single interpretation (i.e., instructions at the beginning of a PROM) for the final rating of this domain.

A potential limitation of this study is that the computer algorithm embedded in the "Klinkende Taal" online tool for assessing language level [68] resulted in relatively little differentiation between PROMs. This tool evaluates the (Dutch) language level of a text based on a list of difficult words and a linguistic and syntactic analysis [68] and was used to assess the language level of the PROM scales in a standardized way. To obtain an impression of the agreement between the tool and expert judgment, five carefully selected PROM scales were assessed both with "Klinkende Taal" and by an expert from Pharos (G.B.). In this small sample, the assessed language levels showed little difference: one PROM scale was judged to have a more difficult language level by the expert from Pharos, one a less difficult language level, and three were judged equal. Another limitation is that the PCQH does not take into account whether qualitative methods were applied when developing the PROM scale. A final limitation is that the complexity of questions also influences comprehensibility, and this is not captured by the domains of the PCQH. Previous studies show that PROMs with complex questions may take longer to complete on average. For instance, Van der Willik et al. (2019) found that the average time to complete the 66-item Dialysis Symptom Index (DSI) was 5.4 min [standard deviation (SD) 1.6] versus 7.5 min (SD 1.8) for the more complex 21 symptom items from the Integrated Palliative care Outcome Scale–Renal version (IPOS-Renal) [72].

The results of our study suggest that more attention to comprehensibility is needed in the development and appraisal of PROMs. Qualitative methods and pilot tests are currently the recommended approach to assess the comprehensibility of PROMs [43], and many of the more recent and well-known PROM scales evaluated in the present study, such as those derived from the PROMIS item banks or the SF-36 version 2, were developed through extensive qualitative and quantitative approaches to ensure comprehensibility [73, 74]. Although improvements in preliminary versions of the respective PROMs through the application of these methods are well documented, our results show that this has not resulted in measures that meet all eight comprehensibility domains examined in our study. This finding suggests that use of the PCQH in PROM development might complement established methods for appraising comprehensibility and ultimately lead to new PROMs that are easier to use for the intended patient populations. In addition, the fact that the domain scores in our study are reliable across raters suggests that the PCQH could also be used to compare comprehensibility between different PROMs, which is difficult to do with qualitative approaches. Comparing the comprehensibility of PROMs may be useful in systematic reviews or to inform PROM selection in addition to the appraisal of psychometric properties. Online Resource 2 contains comprehensibility profiles per PROM scale and can be used as a practical tool during a PROM selection process to gain insight into and compare the comprehensibility of PROMs.

In this study, the PCQH was used for the first time as a tool for large-scale evaluation of PROM comprehensibility, as it was originally developed to assess and subsequently improve the comprehensibility of individual healthcare questionnaires. Although we are able to present some conclusions on the comprehensibility of the assessed PROM scales, more research is needed to evaluate the comprehensiveness and validity of the PCQH. This research should focus on the domains and the cutoff values for the domain ratings, which might be considered arbitrary or subjective without further evidence to support them. For some domain ratings, however, such evidence does exist. For example, the literature shows that patients have trouble distinguishing between more than five Likert/rating scale response options and prefer four to five, which aligns with the PCQH's classification of four response options as optimal, five as acceptable, and more than five as problematic [73, 75]. Furthermore, there is experimental evidence that adding clear instructions substantially reduces the missing item rate and increases the overall response rate [78]. Finally, the number of questions is frequently used as a proxy measure of patient burden in validation studies [76, 77].

When using the PCQH to support PROM selection, it might be beneficial to have an overall evaluation measure of comprehensibility, for example, a graded scoring system in which numerical values (e.g., 0, 1, 2) are first assigned to the domain ratings and then summed to provide an overall comprehensibility rating of a PROM scale (see the sketch below). Future research should also address whether all domains need to be equally represented in such a measure.
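A minimal sketch of such a graded scoring system follows, assuming an equal-weight mapping of green = 2, orange = 1, red = 0 across the eight domains; both the point values and the equal weighting are hypothetical illustrations, not part of the validated PCQH.

```python
# Hypothetical sketch of the graded scoring system discussed above. The
# mapping (green=2, orange=1, red=0) and equal domain weights are
# assumptions, not part of the validated PCQH.
from typing import Dict

POINTS = {"green": 2, "orange": 1, "red": 0}

def comprehensibility_score(ratings: Dict[str, str]) -> int:
    """Sum graded points over the eight PCQH comprehensibility domains."""
    if len(ratings) != 8:
        raise ValueError("Expected ratings for all eight PCQH domains")
    return sum(POINTS[color] for color in ratings.values())

# Example: final color ratings for one PROM scale.
example = {
    "language_level": "green",
    "instruction": "orange",
    "number_of_questions": "green",
    "answer_options": "green",
    "active_voice": "green",
    "medical_terms": "green",
    "concrete_questions": "orange",
    "statements": "red",
}
print(comprehensibility_score(example))  # 12 out of a maximum of 16
```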

Finally, the PCQH consists of three additional components not covered in this study: accessibility, layout, and validation [54]. To improve the suitability of PROMs for all patients, we recommend considering these components as well. In addition, depending on the target group of the questionnaire, it might be necessary to offer PROMs in multiple languages.

In conclusion, we have provided an extensive evaluation of 157 well-known and widely used PROM scales on eight relevant domains of comprehensibility. Our results provide actionable insights for improving the comprehensibility of PROMs, such as including clear instructions for patients and paying attention to the number of questions and answer options. The use of PROMs is increasingly widespread, including their integration into clinical workflows. For such applications, it is essential that all patients, including those with low health literacy, can understand and complete their PROMs, ultimately facilitating the delivery of person-centered and more effective healthcare. Therefore, more attention to comprehensibility is needed when developing or selecting PROMs for implementation in such initiatives.