Background

The European Organization for Research and Treatment of Cancer Core Quality of Life Questionnaire (EORTC QLQ-C30) is a 30-item questionnaire and 28 out of 30 items are scored on a 4-point Likert response scale: 1 = not at all, 2 = a little, 3 = quite a bit, and 4 = very much [1]. The German equivalents have been translated as 1 = überhaupt nicht, 2 = wenig, 3 = mäßig, and 4 = sehr [2]. Ideally, multi-item Likert scales should be interval scaled, which assumes equidistance between response options.

Research suggests that the German wording of the EORTC QLQ-C30 response scale, particularly the term mäßig for response category 3 (which in English is supposed to stand for quite a bit), may not be optimal [3, 4]. Based on these findings, we conducted three studies involving students, cancer patients, and adult control subjects (total number of participants N = 334) to investigate the intensity rating of the critical term mäßig relative to intensity ratings of other terms that seemed to be more appropriate for response category 3, such as einigermaßen, überwiegend or ziemlich. The task of the research participants was to rate each term on a 0–100 linear intensity scale (with the anchors 0 = überhaupt nicht [not at all] and 100 = sehr [very much]). The currently used term mäßig yielded an average intensity rating of 42 and was thus rated substantially lower than the ideal value of 67 (difference − 25). In contrast, ziemlich turned out to be the best choice for response category 3, with a mean intensity rating of 71, and it was among the top three terms for response option “3” in each study (see Additional file 1: Appendix Table S1).
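The ideal value of 67 follows from the equidistance assumption: four labels spread evenly over a 0–100 scale sit at 0, 33.3, 66.7, and 100. A minimal Python sketch of that arithmetic, using the mean ratings reported above (the function and variable names are illustrative and not part of the original studies):

```python
def equidistant_positions(n_options, low=0.0, high=100.0):
    """Return evenly spaced positions for n_options labels between low and high."""
    step = (high - low) / (n_options - 1)
    return [low + i * step for i in range(n_options)]

positions = equidistant_positions(4)
ideal_option_3 = round(positions[2])  # 66.67 rounds to 67

# Mean intensity ratings as reported in the text
mean_ratings = {"mäßig": 42, "ziemlich": 71}

# Signed deviation of each term from the ideal position of response option 3
deviations = {term: rating - ideal_option_3 for term, rating in mean_ratings.items()}
```

Under these assumptions, mäßig falls 25 points below the ideal position, whereas ziemlich lies 4 points above it.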

Research undertaken by Schwarz and Strack [5, 6] showed that response scales influence respondents’ answers to questions. For example, respondents consistently reported higher frequencies for certain response options on scales with high rather than low frequency response alternatives [5]. Following this logic, we assumed that changes in the current German response format of the EORTC QLQ-C30 items would lead to changes in reported symptom and functioning scores. If mäßig is semantically very close to wenig (in English a little), it does not constitute a reasonable response alternative for patients with a moderate/considerable health problem. They might then tend to skip mäßig and turn to the next higher response alternative sehr (in English very much). Thus, we hypothesized that the current German response scale of the EORTC QLQ-C30 (mäßig version) leads to higher symptom scores and lower functioning scores than an optimized version (ziemlich version) with a category-label 3 that is equidistant between response categories 2 and 4. This effect should be particularly pronounced in patients with considerable health problems. The present pair of studies was designed to test this hypothesis.

Methods

Study 1

Study design and sample size rationale

This study involved patients with different types of cancer. It was a randomized cross-over-design study allowing for within-subject comparisons of the current and updated questionnaire versions. Patients were randomized either to a paper-based or a tablet-based version of the EORTC QLQ-C30 (see Additional file 1: Appendix Figure S1). A commonly accepted rule of thumb recommends a ratio of 10–15 respondents per item [7]. Given that the EORTC QLQ-C30 comprises 30 items, a sample size of 300–450 respondents was considered adequate. Data were collected between April 2016 and September 2018 at 7 study sites in Germany, Austria, and Switzerland.
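The sample size range stated above is simply the item count multiplied by the rule-of-thumb ratio. As a brief illustration (names are hypothetical):

```python
N_ITEMS = 30                     # items in the EORTC QLQ-C30
RESPONDENTS_PER_ITEM = (10, 15)  # rule-of-thumb range cited in the text

def sample_size_range(n_items, per_item_range):
    """Return the (minimum, maximum) target sample size for a questionnaire."""
    low, high = per_item_range
    return n_items * low, n_items * high

target_low, target_high = sample_size_range(N_ITEMS, RESPONDENTS_PER_ITEM)
```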

Ethical considerations

The study was approved by the Ethical Committee of the University of Regensburg (reference number 14-101-0209) and by local ethical committees of the other study sites. The study was registered on the German Registry for Clinical Studies (reference number DRKS00012759), which is part of the WHO Trial Registration Data Set.

Inclusion and exclusion criteria

Inclusion criteria were histologically confirmed diagnosis of cancer, mentally and physically fit to complete a questionnaire, able to understand German, 18 years of age or above (no upper age limit), and informed consent. Patients who were mentally and physically unfit to complete a questionnaire or denied informed consent were excluded.

Procedure

Patients were approached by a researcher and subsequently informed about the study. After providing written informed consent, patients were randomly assigned to the paper-based or computer-based assessment. The paper version involved the standard two-page EORTC QLQ-C30 questionnaire, in which the response options are numbered from 1 to 4 for each item of the questionnaire, with the appropriate labels appearing at the top of each section. In the electronic version [8], each item is presented separately on screen together with the response options. Regardless of paper version or electronic version, patients were randomly assigned to fill in the questionnaire using conventional German response options (i.e., überhaupt nicht, wenig, mäßig, sehr) of the EORTC QLQ-C30 version 3.0 or using the optimized version in which mäßig was replaced by ziemlich. Patients filled in the questionnaire again at a later point in time, whereby the alternate response option version was presented, and continued with either paper-based or computer-based assessment depending on the assigned study arm. Additionally, patients rated on two anchor variables whether their health/QoL had improved, worsened, or remained unchanged between both assessments, to ensure that within-patient differences between EORTC QLQ-C30 versions could be attributed to the questionnaire versions and not to real changes in health/QoL.

Study 2

Study design and sample size rationale

The data were collected in 13 European countries, the USA and Canada in the context of an international project to generate European general population norm data for the EORTC QLQ-C30 questionnaire [9, 10]. Sample size per country was based on the following rationale: stratification by sex and age groups (18–39, 40–49, 50–59, 60–69, 70 + years), with a target sample size of each sex x age x country subgroup of n = 100, leading to an anticipated sample size of n = 1000/country. This sampling design was considered sufficient to investigate differential item functioning (DIF) using logistic regression analysis which was at the core of the original study [10]. Data collection was performed by GfK SE (www.gfk.com), a panel research company specialized in representative multinational online surveys. Panel members register voluntarily and generally participate when contacted, resulting in response rates between 75 and 90% [9]. Data were collected in March/April 2017. German respondents were randomly assigned either to the conventional EORTC QLQ-C30 questionnaire version 3.0 (response option 3 = mäßig, n = 1006) or the optimized version (response option 3 = ziemlich, n = 1027).

Ethical considerations

The multinational survey conformed to the common ethical standards by obtaining informed consent from all participants before collecting data completely anonymously. Any identification of the respondents through the authors is impossible. The study thus complies with the EU General Data Protection Regulations as well as with the professional standards of the European Pharmaceutical Market Research Association (EphMRA), which GfK SE is a member of.

Inclusion and exclusion criteria for the present analyses

Respondents were eligible if they provided informed consent. Since these were all registered panel members, all persons contacted were able to read and understand German sufficiently well, and they also had access to a computer, as data collection was done electronically. For the present analyses, only data from respondents in Germany were used.

Procedure

Subjects were contacted by the survey company GfK SE. Samples were stratified with an equal number of men and women, and 5 pre-defined age categories, i.e., 18–39 years, 40–49, 50–59, 60–69, and 70 years and above, resulting in n = 200 per age/sex stratum. As part of the online panel, respondents were asked to complete the 30 items of the EORTC QLQ-C30 [10]. Comparable to study 1, each item was presented separately on screen.
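As a rough consistency check of the stratification described above, the following sketch enumerates the age/sex strata and compares the resulting target with the group sizes reported for the two questionnaire versions. The variable names are illustrative; only the numbers are taken from the text.

```python
from itertools import product

SEXES = ["female", "male"]
AGE_GROUPS = ["18-39", "40-49", "50-59", "60-69", "70+"]
TARGET_PER_STRATUM = 200  # per age/sex stratum, as stated in the text

# Every combination of sex and age group forms one stratum
strata = list(product(SEXES, AGE_GROUPS))
target_total = len(strata) * TARGET_PER_STRATUM

# Group sizes reported for the two questionnaire versions
observed = {"mäßig": 1006, "ziemlich": 1027}
observed_total = sum(observed.values())
```

Ten strata of 200 respondents give a target of 2000, which is close to the 2033 German respondents actually analyzed.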

Statistical analyses

EORTC QLQ-C30 scales were computed according to the EORTC Scoring Manual [11]. In a first step, all scales were linearly transformed (0–100), so that for the five functioning scales, higher scores represent higher functioning and for the nine symptom scales, higher scores represent higher symptom burden. In a second step, a summary score was calculated, consisting of 13 out of the 15 scales, excluding financial difficulties and global health status/quality-of-life. For this summary score, the symptom scales were reversed, so that 0 represents lowest and 100 highest QoL [12].
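The two scoring steps can be sketched in Python. The linear transformation below follows the standard form given in the EORTC Scoring Manual for the items with a 1–4 response range; the function names and the simplified handling (no missing-data rules, no global health status scale) are assumptions for illustration and do not reproduce the SPSS syntax used in the studies.

```python
from statistics import mean

ITEM_RANGE = 3  # 4-point items: maximum response (4) minus minimum response (1)

def raw_score(item_responses):
    """Mean of the item responses (each 1-4) that make up one scale."""
    return mean(item_responses)

def functioning_score(item_responses):
    """0-100 score where higher means better functioning."""
    return (1 - (raw_score(item_responses) - 1) / ITEM_RANGE) * 100

def symptom_score(item_responses):
    """0-100 score where higher means greater symptom burden."""
    return (raw_score(item_responses) - 1) / ITEM_RANGE * 100

def summary_score(functioning_scores, symptom_scores):
    """Mean of the scales after reversing symptom scales (100 = best QoL).

    Expects the 5 functioning scales and the 8 symptom scales that
    remain once financial difficulties is excluded.
    """
    reversed_symptoms = [100 - s for s in symptom_scores]
    return mean(list(functioning_scores) + reversed_symptoms)
```

For example, a two-item symptom scale answered with 2 and 3 has a raw score of 2.5 and a transformed score of 50. Because every item feeds into the raw score, a shift of a single response from option 4 to option 3 lowers that transformed score, which is how a change of label can propagate to the scale level.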

We employed the following strategy in using and interpreting scale results: we first had a look at the statistically significant difference (p value < 0.05) in the summary score. If a significant difference was obtained, we inspected significant differences with regard to the 14 single symptom or functioning scales. This strategy was chosen in order to address multiplicity issues. To determine clinically meaningful differences we used the conservative 5 point criterion (small difference) [13].
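This gatekeeping strategy can be expressed as a short sketch. The function names are hypothetical, and treating a difference of exactly 5 points as meeting the criterion is an assumption made here for illustration.

```python
ALPHA = 0.05
SMALL_DIFFERENCE = 5.0  # score points, the conservative criterion cited in the text

def inspect_single_scales(summary_p, scale_results, alpha=ALPHA):
    """Return significant single scales, but only if the summary score is significant.

    scale_results maps a scale name to a (mean_difference, p_value) tuple.
    """
    if summary_p >= alpha:
        return {}  # gate closed: single scales are not interpreted
    return {
        name: (diff, p)
        for name, (diff, p) in scale_results.items()
        if p < alpha
    }

def is_clinically_meaningful(mean_difference, threshold=SMALL_DIFFERENCE):
    """True if the absolute difference reaches the threshold in score points."""
    return abs(mean_difference) >= threshold
```

The single scales are thus examined only after the summary score has passed the significance gate, which limits the number of tests that are interpreted.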

The core analyses related to differences between the conventional EORTC QLQ-C30 version (mäßig) and the optimized EORTC QLQ-C30 version (ziemlich) and included univariable analyses of the unadjusted means (t tests) as well as multivariable analyses. More specifically, two separate analyses were conducted on the cancer patient sample: between-subject and within-subject comparisons. For between-subject comparisons, responses to both questionnaire versions of the first assessment were compared using analyses of covariance (ANCOVAs) adjusted for sex, age, mode of administration (MOA, paper vs. electronic), and health burden. Health burden was defined by the EORTC QLQ-C30 scale global quality-of-life: < 50 (worse QoL) vs. ≥ 50 (better QoL) [14, 15].

For within-subject comparisons, mixed linear models were used: subject as random factor, questionnaire version as repeated factor and the following set of fixed factors: questionnaire version, MOA, order of questionnaire versions, sex, age, and health burden. The mixed linear models included only patients who reported no changes in QoL and health between both assessments on the two anchor questions.

In the German population sample, differences between the two EORTC QLQ-C30 versions were assessed using ANCOVAs adjusted for sex, age, and health burden.

Parametric methods were used for all analyses due to their robustness to violations of normality, which occasionally occur with QoL data [16].

Furthermore, according to classical test theory, the basic psychometric performance (internal consistency [17] as well as convergent and discriminant validity [18, 19]) of both EORTC QLQ-C30 versions was explored (see Additional file 1: Appendix Basic psychometric properties and Table S2).

Statistical analyses were carried out using SPSS 25. Statistical tests were two-sided and were done at the 0.05 significance level. Descriptive statistics included the following: frequencies (n), percentages (%), means (m), standard deviations (sd), 95% confidence intervals (CI), medians (med), interquartile ranges (IQR).

Results

Study 1

In total, 467 patients were recruited. Seventeen patients were excluded from analyses due to the following reasons: physically or mentally unfit (n = 10), declined participation during first assessment (n = 5), and study data were overwritten due to technical issues (n = 2). Thus, data of 450 patients (median age = 63 years, 46% females) were available (Table 1). A second assessment could be obtained in 404/450 patients (90%), which is a high completion rate for a second assessment [20]. The median gap between the two assessment points was 4 days (IQR = 2/7) (Additional file 1: Appendix Figure S1). Accidentally, four patients responded twice to the same questionnaire version and had to be excluded from test–retest analyses.

Table 1 Study 1: patient characteristics

In the first step, we analyzed differences in EORTC QLQ-C30 scores between patients who received either the mäßig or ziemlich version at the first assessment. As shown in Table 2, the unadjusted analysis showed no significant differences in the summary score between the two questionnaire versions (m = 70.1, sd = 19.9 vs. m = 73.0, sd = 18.6; p = 0.116). Multivariable analyses adjusted for age, sex, MOA, and health burden showed a mean difference of − 4.5 (95% CI − 7.8 to − 1.3) in the summary score (p = 0.006), such that the mäßig-version yielded lower scores (poorer QoL) than the ziemlich-version (Table 3). Mean differences for all 14 scale scores were in the expected direction (i.e., higher symptoms and lower functioning in the mäßig- than in the ziemlich-version), with four showing a statistically significant difference (p values < 0.05) (Table 3), and all were > 5 score points.

Table 2 Comparisons between QLQ-C30 versions: univariable analyses (unadjusted)
Table 3 Comparisons between QLQ-C30 versions: multivariable analyses (adjusted)

When taking a closer look at patients with considerable health burden (global QoL < 50 points, n = 144), the differences between the mäßig and ziemlich versions became particularly pronounced, i.e., the mean difference in the summary score was − 6.8 (95% CI − 12.2 to − 1.4, p = 0.013), whereas it was only − 2.3 (95% CI − 5.9 to 1.4, p = 0.226) in patients with lower/no health burden (global QoL ≥ 50 score points, n = 306; Table 3). In addition, four of the 14 single scales of patients with higher health burden yielded statistically significant differences. The four single scales as well as the total score were > 5 score points.

The next step was a within-group comparison in patients who did not indicate a change in their health and QoL between assessments (n = 229). Univariable analyses showed a lower summary score in the mäßig (m = 75.1, sd = 18.3) than in the ziemlich version (m = 77.4, sd = 16.8; p < 0.001, Table 2). Furthermore, we observed corresponding statistically significant mean differences in four of the 14 single scales (p values < 0.05); however, none was > 5 points.

In multivariable analyses (Table 3), we again found a larger difference in the summary score between both versions in the group of patients with considerable health burden (− 4.8, 95% CI − 6.9 to − 2.8, p < 0.001, global QoL < 50, n = 57) compared to patients with lower/no health burden (− 1.4, 95% CI − 2.6 to − 0.2, p = 0.022, global QoL ≥ 50, n = 172). Furthermore, 7 out of 14 scale differences in the higher health burden group were statistically significant and all differences exceeded the 5 score point criterion.

In addition to the comparison of the two EORTC QLQ-C30 versions, the study design further allows for the comparison between paper-based and computer-based assessment of the questionnaire. Subgroup analyses revealed that differences between the two versions were more pronounced in the computer-based version than in the paper-based version (Table 3). However, the 5 score point criterion was only exceeded in the between-group comparison within the computer-based assessment.

Study 2

Participants in study 2 comprised a representative sample of the German general population surveyed in the context of a large-scale international online norm data survey [9]. As shown in Table 4, the median age was 54 years, 50% were female and most participants (58%) reported at least one disease.

Table 4 Study 2: sample characteristics

As shown in Table 2, the unadjusted analysis showed a significantly higher summary score for the optimized EORTC QLQ-C30 version compared with the conventional EORTC QLQ-C30 version (m = 83.6, sd = 15.9 vs. m = 82.0, sd = 17.7; p = 0.038). Multivariable analyses adjusted for age, sex, and health burden yielded even stronger effects: the mean difference of the summary score was − 3.1 (95% CI − 4.6 to − 1.5; p < 0.001, Table 3), and 9 out of 14 single scales showed statistically significant differences, i.e., p values < 0.05. None of the observed differences reached 5 points or more (Table 3).

When taking a closer look at respondents with considerable health burden (n = 370, global QoL < 50) versus those with lower/no health burden (n = 1663, global QoL ≥ 50), the difference in the summary score between both versions was more pronounced in the high burden group (− 4.5, 95% CI − 7.3 to − 1.7, p = 0.002) than in the low burden group (− 1.6, 95% CI − 2.9 to − 0.3, p = 0.016, Table 3). In the higher health burden group, 8 out of 14 differences in single scales were statistically significant, and 7 of these differences exceeded the 5 point criterion.

Significant and minimally important differences between conventional and optimized EORTC QLQ-C30 versions are summarized in Additional file 1: Appendix Table S3.

Choice of response options in the mäßig and in the ziemlich questionnaire versions (Studies 1 and 2)

We collapsed the total number of responses for response options 1, 2, 3 and 4 across the 27 items that made up the summary score and compared their distributions between the questionnaire versions in studies 1 and 2 (Fig. 1 and Table 5).

Fig. 1

Frequencies of chosen response option—German population and cancer patients’ first assessment. The EORTC QLQ-C30 questionnaire was presented in two versions. The conventional questionnaire used mäßig and the optimized version used ziemlich as response option 3 (quite a bit) of the 4-point Likert scale. Responses to each response option (1–4) are presented for the total sample and are further separated for (1) subjects with QoL < 50 and QoL ≥ 50 as well as for (2) questionnaire version with response option mäßig and questionnaire version with response option ziemlich. German population: A total of N = 54,891 responses were given from N = 2033 respondents to items 1–27 (no missing responses). Cancer patients: At first assessment, a total of N = 12,089 responses were given from N = 450 patients to items 1–27 (missing responses n = 61 [0.5%])
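The response totals given in the caption of Fig. 1 follow directly from the number of respondents, the 27 pooled items, and the number of missing responses. A small sketch of that check (function and variable names are illustrative):

```python
ITEMS_IN_SUMMARY = 27  # items 1-27, pooled across response options

def expected_responses(n_respondents, n_missing=0, n_items=ITEMS_IN_SUMMARY):
    """Number of item responses available after subtracting missing ones."""
    return n_respondents * n_items - n_missing

# German population sample: no missing responses
population_responses = expected_responses(2033)

# Cancer patients at first assessment: 61 missing responses
patient_responses = expected_responses(450, n_missing=61)
missing_share = 61 / (450 * ITEMS_IN_SUMMARY) * 100  # in percent
```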

Table 5 Study 1: changes in frequencies of chosen response option—cancer patients

Looking at Study 2 and analyzing responses (N = 54,891) of all respondents (N = 2033) (Fig. 1), it appeared that frequencies in response option 1 (überhaupt nicht [not at all]) were practically identical in the mäßig and ziemlich versions (61.9% and 62.3%, respectively). However, the introduction of the term ziemlich modified the meaning of the entire scale and consequently the choice of the remaining response options 2, 3, and 4. Firstly, as expected, response option 4 (very much) was used more frequently in the mäßig version than in the ziemlich version (5.1% versus 3.2%). Secondly, the difference between the percentages of responses for options 2 and 3 was 12.9% in the mäßig version and 15.9% in the ziemlich version.

These two effects were particularly pronounced in respondents with a poor general health status (global QoL < 50). While 19% of these respondents chose the highest response option 4 (very much) in the mäßig condition, only 11.3% chose this response option in the ziemlich condition. Furthermore, in the ziemlich version, response options 2 and 3 were more distinct (6.0% difference) than in the mäßig condition, which showed a 2.4% difference.

Comparable effects were obtained in the first assessment sample of study 1 (Fig. 1).

Further analyses included cancer patients who answered both versions consecutively and reported no health changes between the two assessments (n = 229, Table 5). While there was a high overlap of 83.6% in choosing response option 1 (not at all) across the two versions, overlap for the other 3 response options was considerably lower, i.e., 60.1%, 45.2%, 44.6%, respectively.

That is, 39.3% of respondents who chose response option 4 (very much) in the mäßig-version switched to option 3 (= quite a bit) in the ziemlich-version (Table 5). This effect was particularly pronounced in patients with good health (QoL ≥ 50) who switched in 43.5% of the cases, whereas this percentage was only 37.0% in patients with higher health burden (QoL < 50) (data not shown).

Discussion

Based on the observation that response options are not equidistant in the German version of the EORTC QLQ-C30, the main aim of this research was to test the hypothesis that the current German response option 3 is suboptimal and may bias results towards the worse end of the scale, i.e., worse/lower functioning and higher symptoms.

As hypothesized, the main finding of the present studies is that the optimized EORTC QLQ-C30 version yielded slightly lower symptom and higher functioning scores. The magnitude of mean differences in adjusted multivariable analyses was 4.5 (cancer patients, between-group comparison, n = 450), 3.1 (cancer patients, within-group comparison, n = 229), and 3.1 (German reference sample, n = 2033). This effect became particularly pronounced when we had a closer look at respondents with a high health burden: 6.8, 4.8, and 4.5 mean difference in score points, respectively. These values are at the lower end of Osoba’s widely cited 5–10 point difference criterion for minimal important clinical changes on the EORTC QoL scales [13]. These effects were obtained not only for the summary scale, but also for many of the single scales of the EORTC QLQ-C30. The scale that showed the highest proportion of significant differences was physical functioning, followed by appetite loss, role functioning, emotional functioning, and fatigue.

This effect can be interpreted through a psychological theory which posits that scaling labels are of informational value for respondents, guiding them to understand the question and to elicit the most “appropriate” answer in a given context [5, 6]. In the mäßig-version, mäßig (response option 3) is semantically very close to response option 2 (wenig = a little), but considerably farther from response option 4 (sehr = very much). Therefore, respondents may have difficulty differentiating between wenig and mäßig and may be inclined to choose sehr (very much), particularly when they suffer from an impaired health status. Introducing ziemlich changed the entire response environment, as it lies more equally balanced between response options 2 (a little) and 4 (very much). Thus, the response options have a clearer meaning, now rendering ziemlich (quite a bit) a worthwhile option in the case of health problems and making sehr (very much) less attractive.

This interpretation is in line with the pooled frequencies of each of the four response options across 27 questionnaire items. We saw that the differences in frequencies between mäßig and wenig are less pronounced than between wenig and ziemlich. Furthermore, for respondents with high health burden, sehr (very much) was regularly an appropriate response option in the mäßig-version, and much more so than in the ziemlich-version where ziemlich was still considered an adequate reflection of their perceived health status.

Furthermore, we investigated potential differences between the paper-based and the computer-based assessment. In the computer-based assessment each item is presented individually on the screen together with the response labels, whereas in the paper version the response labels are shown only at the very beginning of the questionnaire. There is reason to believe that these differences in the presentation format may amplify the wording effect, and that this effect becomes more pronounced in the computer-based assessment. We found some indication for this sort of amplification, but it was not as strong or as consistent as one might expect.

Adopting a broader perspective outside the peculiarities of response labels in specific language versions (in this case German), the implications of this project are twofold.

Firstly, this project is a good example of how quality assurance can be done in the field of patient-reported outcomes instruments. To date, only a few examples have been published in this area. Quality assurance projects have focused on paper-based versus electronic assessment (particularly migration of the former to the latter) [21], translation and linguistic validation [2], or compliance with regulators’ (FDA, EMA) perspectives on outcome assessment [22, 23]. We are not aware of any other study that systematically called into question existing response options and made a head-to-head comparison between two questionnaire versions.

Secondly, this project is also a timely reminder that psychological processes play a crucial role in QoL assessment. QoL research is preoccupied with psychometrics, statistical models, and technical details, at the expense of analyzing the dynamics underlying the interplay between the responder and the questionnaire. In order to understand and interpret answers to questionnaires correctly, a thorough analysis of the cognitive and emotional underpinnings is essential. Ultimately, questionnaires are communication tools that are of value only if the questionnaire developer, the sender (i.e., the patient) and the receiver (i.e., the researcher or clinician) of the information are on the same page.

Limitations of the study may relate to the use of the EORTC QLQ-C30 summary score. It can reasonably be argued that this summary score is composed of many diverse QoL aspects, rendering it difficult to interpret and thus meaningless for use in clinical studies. In fact, many clinical studies are based on well-defined hypotheses and therefore focus on specific QoL scales or side effects. A strength of the summary score comes into play when a hypothesis with regard to a specific scale is not at the core: it avoids problems connected with exploratory multiple statistical testing of numerous QoL scales (“p-hacking”). This property motivated the creation of the summary score in the first place and this was also a reason why we made use of it. We expected to see differences between the two questionnaire versions without being able to specify beforehand which of the available 14 single scales would show the hypothesized effects. Therefore, our analysis strategy was to look at the summary scale first and to explore the single scales further only in case of a significant effect. It should be noted that the EORTC Quality of Life Group is in the process of exploring the potential of the summary score and is about to prepare a guideline on its use.

A further limitation of the present analyses lies in their exclusive use of methods of classical test theory. We acknowledge the conceptual and statistical superiority of item response theory (IRT), which is used by the EORTC Quality of Life Group particularly in the construction of item sets for computer adaptive testing [24]. To obtain reliable results, IRT analyses require larger sample sizes than were available here. Additional studies focusing on the measurement properties of the updated questionnaire including a wider range of methodological approaches are desirable.

Conclusion

Our starting point was that the German translation of the quite a bit response category was not located at the right place according to the assumption of equidistance. This pair of studies tested a revised response option and confirmed that the revised version solves the problem; it should therefore be used in the future.