Introduction

Colorectal cancer (CRC) is among the most prevalent cancers worldwide [1]. CRC and its treatment can have a large impact on health-related quality of life (HRQOL) [2]. It is important to assess HRQOL in clinical trials to investigate the impact of a treatment on HRQOL, and in clinical practice to detect and monitor symptoms and offer optimal care [3,4,5]. A frequently used patient-reported outcome measure (PROM) to evaluate HRQOL in cancer patients is the 30-item European Organization for Research and Treatment of Cancer (EORTC) Quality of Life Core Questionnaire (EORTC QLQ-C30) [6, 7] and its tumor-specific questionnaire modules [8]. In 1999, the 38-item module for CRC patients was developed (EORTC QLQ-CR38) [9], and in 2007, the module was revised and shortened to 29 items (EORTC QLQ-CR29) [10]. The initial validation study of the QLQ-CR29 in an international sample of CRC patients [11] showed that it had good internal consistency (α > 0.70) in all but one subscale, was acceptably reliable (intraclass correlation coefficient (ICC) > 0.68 for subscales and > 0.55 for single items), was able to discriminate between known groups (patients with and without stoma, Karnofsky performance score < 80 and > 80, and with curative and palliative treatment), and had good divergent validity, i.e., low correlation with QLQ-C30 items [11].

Two systematic reviews on the measurement properties of the QLQ-CR29 were published in 2015 and 2016 [12, 13]. Wong et al. performed a systematic review of various disease-specific and generic HRQOL PROMs, and included two studies on the QLQ-CR29. They recommended the QLQ-CR38 to assess HRQOL in CRC patients, because it had the most positive ratings on the measurement properties according to their quality assessment criteria [12]. Ganesh et al. performed a systematic review of three CRC-specific PROMs (Functional Assessment of Cancer Therapy-Colorectal (FACT-C), QLQ-CR38, and QLQ-CR29), and included three studies on the QLQ-CR29. They concluded that the choice among these three instruments depends on the context and the research aim [13].

Since these reviews, several new validation studies of the QLQ-CR29 have been published. Therefore, the aim of the present study was to perform a systematic review of the measurement properties of the QLQ-CR29 as investigated in validation studies up to 2019, according to the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) criteria [14, 15], and to investigate whether the initial positive results regarding the measurement properties of the QLQ-CR29 are confirmed.

Materials and methods

EORTC QLQ-CR29

The EORTC QLQ-CR29 is a tumor-specific HRQOL questionnaire module for CRC patients, designed to complement the EORTC QLQ-C30 [6, 10]. The QLQ-CR29 has five functional and 18 symptom scales. It contains four subscales (urinary frequency (UF), blood and mucus in stool (BMS), stool frequency (SF), and body image (BI)) and 19 single items (urinary incontinence, dysuria, abdominal pain, buttock pain, bloating, dry mouth, hair loss, taste, anxiety, weight, flatulence, fecal incontinence, sore skin, embarrassment, stoma care problems, sexual interest (men), impotence, sexual interest (women), and dyspareunia) [11]. Patients are asked to indicate their symptoms during the past week(s). Item scores are linearly transformed to scales ranging from 0 to 100. Higher scores represent better functioning on the functional scales and a higher level of symptoms on the symptom scales [10, 11].
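As an illustration of this scoring procedure, the minimal sketch below follows the general EORTC scoring-manual convention (raw score = mean of the item responses on a 1–4 scale, followed by a linear transformation to 0–100); the item groupings and ranges for specific QLQ-CR29 scales should be taken from the official scoring manual, and the function and example values below are hypothetical.

```python
def scale_score(item_responses, item_range=3, functional=False):
    """Transform EORTC-style item responses (scored 1-4) to a 0-100 scale score.

    Sketch of the general EORTC scoring-manual convention: the raw score is the
    mean of the item responses; higher transformed scores mean better functioning
    on functional scales and more symptoms on symptom scales.
    """
    raw_score = sum(item_responses) / len(item_responses)
    if functional:
        return (1 - (raw_score - 1) / item_range) * 100
    return ((raw_score - 1) / item_range) * 100


# Hypothetical two-item symptom subscale answered 1 ("not at all") and 2 ("a little")
print(round(scale_score([1, 2]), 1))                   # 16.7 -> low symptom level
print(round(scale_score([1, 2], functional=True), 1))  # 83.3 if scored as a functional scale
```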

Literature search

The literature search was part of a larger systematic review (Prospero ID 42017057237) [16], investigating the validity of 39 PROMs measuring HRQOL of cancer patients included in the eHealth self-management application “Oncokompas” [17,18,19,20,21]. We performed a systematic search of Embase, Medline, PsycINFO, and Web of Science to identify studies investigating the measurement properties of these 39 PROMs, including the QLQ-CR29. The search terms were the PROM’s name, combined with search terms for cancer and a precise filter for measurement properties [22]. The full search terms can be found in the Appendix. The literature search was performed in June 2016 and updated in January 2019 to identify recent studies on the QLQ-CR29 specifically. Reference lists of included studies were checked manually for additional articles.

Inclusion and exclusion criteria

Inclusion criteria were as follows: reporting original data from cancer patients on at least one measurement property of the QLQ-CR29, as defined in the COSMIN taxonomy [15, 23, 24]—structural validity (the degree to which the scores of a PROM are an adequate reflection of the dimensionality of the construct to be measured), internal consistency (the degree of interrelatedness among items), reliability (the extent to which scores of patients who have not changed are the same for repeated measures on different occasions), measurement error (the error of a patient’s score that is not attributed to true changes in the construct to be measured), construct validity (hypothesis testing; including known-group comparison and convergent and divergent validity [the degree to which the scores of a PROM are consistent with hypotheses with regard to differences between relevant groups and relationships to scores of other instruments, respectively]), cross-cultural validity (the degree to which the performance of the items on a translated or culturally adapted PROM is an adequate reflection of the performance of the items of the original version of the PROM), and responsiveness (the ability of a PROM to detect change over time in the construct to be measured) [24]. Exclusion criteria were as follows: unavailability of a full-text manuscript, conference proceedings, and non-English publications. Titles and abstracts, and subsequently eligible full texts, were screened independently by two of four raters (KN, FJ, AH, NH). Disagreements were discussed until consensus was reached.

Data extraction

For each reported measurement property defined by the COSMIN taxonomy [25], data were extracted by two of the four extractors independently (KN, FJ, AH, NH). This included type of measurement property, its outcome, and information on methodology. Disagreements were discussed until consensus was reached.

Data synthesis

For the data synthesis, we followed the three steps of the COSMIN guideline for systematic reviews of PROMs [26]. In step 1, we rated the methodological quality of the studies per reported measurement property as either “excellent,” “good,” “fair,” or “poor.” A total score was obtained by taking the lowest rating on any of the methodological aspects (“worst score counts”), according to the original COSMIN checklist [14]. In step 2, we rated the results per measurement property by applying the COSMIN criteria for good measurement properties [26]. Results of the individual studies were rated as “sufficient,” “insufficient,” or “indeterminate” per measurement property, according to predefined criteria. Ratings from the individual studies were then qualitatively summarized into an overall rating per measurement property. Inconsistencies between studies were explored. If an explanation was found (e.g., poor methodological quality), this was taken into account in the overall rating; if no explanation was found, the overall rating was summarized as “inconsistent.” In step 3, we graded the quality of the evidence for the measurement properties, following a modified GRADE approach [26]. The overall quality of the evidence was rated as “high,” “moderate,” “low,” or “very low,” taking into account risk of bias, inconsistency of study results, imprecision, and indirectness. When a measurement property was rated as “indeterminate” in step 2, the quality of evidence could not be graded, as there was no evidence to grade. The evaluation of measurement properties was performed by two raters independently (AH, NH). Disagreements were discussed until consensus was reached.
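To make the rating rule in step 1 concrete, the sketch below (with hypothetical ratings) takes the lowest rating across the methodological aspects of a single study, as prescribed by the original COSMIN checklist.

```python
# Ordinal COSMIN quality ratings, from lowest to highest
RATING_ORDER = ["poor", "fair", "good", "excellent"]

def overall_quality(aspect_ratings):
    """Overall methodological quality of a study for one measurement property:
    the lowest rating on any methodological aspect ("worst score counts")."""
    return min(aspect_ratings, key=RATING_ORDER.index)

# Hypothetical example: a single "poor" aspect determines the overall rating
print(overall_quality(["excellent", "good", "poor"]))  # -> "poor"
```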

Results

In the initial search, 980 nonduplicate abstracts were identified for all 39 PROMs, of which 31 were relevant for the QLQ-CR29. The search update resulted in 27 additional nonduplicate abstracts regarding the QLQ-CR29. In total, 55 abstracts were screened, of which 30 were excluded. Thirteen studies were excluded during full-text screening. One study was excluded during data extraction [27], because data were not presented for the QLQ-CR29 separately. The flow diagram of the literature search and selection process is shown in Supplementary Fig. 1. Study characteristics of the 11 included studies [11, 28,29,30,31,32,33,34,35,36,37] are shown in Table 1. These studies reported on structural validity (9 studies), internal consistency (10 studies), reliability (6 studies), construct validity (known-group comparison [10 studies], convergent [8 studies], and divergent [2 studies] validity), and responsiveness (2 studies), but did not report on measurement error or cross-cultural validity. However, measurement error could be calculated for four studies.

Table 1 Study characteristics of the included studies

Structural validity

Nine studies investigated structural validity (Table 2). Methodological quality of eight studies was rated as “poor” [11, 28, 30,31,32, 34,35,36], because they performed multitrait item scaling (MIS) instead of exploratory/confirmatory factor analysis (EFA/CFA). One study was rated as “fair” [37], because it performed a principal component analysis (PCA). The studies reporting MIS were consistent in their findings, showing no inconsistent items regarding convergent and discriminant validity within the subscales. However, since MIS is an indirect test of structural validity, no conclusions can be drawn on the basis of these studies. In the study that used PCA, three of the four original subscales (UF, BMS, and BI) were replicated. The two-item original SF subscale was merged with four single items about bowel or stoma problems into a new subscale, “defecation/stoma problems” (DSP) [37]. Because the proportion of variance explained by the factors was not reported, the PCA could not be interpreted properly; structural validity was therefore rated as “indeterminate,” and there is no evidence for or against the unidimensionality of the subscales.
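For reference, the proportion of variance explained by the components, which was not reported for this PCA, is readily available from standard software; the sketch below uses scikit-learn on hypothetical item-level data purely to show what such a report would contain.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical responses (1-4) of 200 patients to 10 questionnaire items
items = rng.integers(1, 5, size=(200, 10)).astype(float)

pca = PCA(n_components=4)  # four components, mirroring the four original subscales
pca.fit(items)
print(pca.explained_variance_ratio_)        # proportion of variance explained per component
print(pca.explained_variance_ratio_.sum())  # total variance accounted for by the retained components
```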

Table 2 Structural validity of the EORTC QLQ-CR29

Internal consistency

Ten studies investigated internal consistency (Supplementary Table 1). Methodological quality of nine studies was rated as “poor” [11, 28, 30,31,32,33,34,35,36], because evidence for the unidimensionality of the subscales was not provided, and therefore the value of Cronbach’s α could not be interpreted properly [38]. One study was rated as “fair” [37], because it did not report how missing items were handled. This study reported good internal consistency for the BI subscale (α = 0.80) and for the newly established DSP subscale (α = 0.84; see “Structural validity”). The UF subscale had adequate internal consistency (α = 0.71), and for the original SF subscale two values were presented: for patients with a stoma (α = 0.72) and without a stoma (α = 0.68). The BMS subscale had low internal consistency (α = 0.56) [37]. The studies of poor quality showed mostly adequate Cronbach’s α values, except for the BMS subscale. Based on these findings, the evidence on internal consistency was rated “sufficient,” because > 75% of the values were good for the original subscales, assuming these subscales are unidimensional, which the PCA did not demonstrate (see “Structural validity”). Quality of evidence was graded as “low,” because only one study of fair quality was found.
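For completeness, Cronbach’s α can be computed directly from item-level data with the standard formula; the sketch below does so on hypothetical responses and, as noted above, is only interpretable if the subscale is unidimensional.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of five patients to a two-item subscale (1-4 scale)
subscale_items = [[1, 2], [2, 2], [3, 4], [2, 3], [4, 4]]
print(round(cronbach_alpha(subscale_items), 2))  # -> 0.93
```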

Reliability

Six studies investigated test–retest reliability (Table 3). Methodological quality of two studies was rated as “poor” [36, 37], because of the small sample size. Four studies were rated as “fair” [11, 30, 32, 35], because they did not report how missing items were handled and/or had a moderate sample size. Two of these studies [11, 32] provided an overall ICC value for all subscales/items with exceptions (e.g., “ICC for all subscales was > 0.66, except for BI”), and thereby provided too little information to interpret the ICC at the subscale/item level. Low correlations (< 0.70) were reported in the two remaining studies for the UF subscale, and in one of these two studies for multiple single items [30, 35]. Based on these findings, the evidence on reliability was rated as “insufficient,” because of multiple unacceptable ICC values across studies. Quality of evidence was graded as “moderate,” because only studies of fair and poor quality were found.
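As an illustration of how a subscale-level ICC can be obtained from test–retest data, the sketch below computes a two-way random-effects, absolute-agreement ICC (ICC(2,1)) on hypothetical scores; this formulation is one common choice, and the ICC model actually used should of course follow the original analyses.

```python
import numpy as np

def icc_agreement(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement,
    for an (n_subjects, n_occasions) array of test-retest scale scores."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()   # between occasions
    ss_error = ((scores - grand) ** 2).sum() - ss_rows - ss_cols
    msr, msc = ss_rows / (n - 1), ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical test-retest scores (0-100) of six patients on one subscale
test_retest = [[33, 33], [67, 58], [0, 8], [50, 50], [83, 92], [25, 33]]
print(round(icc_agreement(test_retest), 2))  # -> 0.97
```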

Table 3 Test–retest reliability (correlation coefficients) of the QLQ-CR29

Measurement error

None of the studies reported on measurement error. However, standard error of measurement (SEM) and smallest detectable change (SDC) could be calculated for four studies reporting on test–retest reliability [30, 35,36,37], using the ICCs and standard deviations of the subscales and single items (Supplementary Table 2). Methodological quality of two studies was rated as “fair” [30, 35], because of the moderate sample size. Two studies were rated as “poor” [36, 37], because of the small sample size. SDC scores ranged between 9.41 and 54.21, representing 9–54% of the scale of the QLQ-CR29 (0–100). However, because the minimal important change (MIC) was not reported, measurement error could not be interpreted. Based on these findings, the evidence on measurement error was rated as “indeterminate.”
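The calculation referred to here can be reproduced with the commonly used definitions SEM = SD × √(1 − ICC) and SDC = 1.96 × √2 × SEM; the sketch below shows the arithmetic on hypothetical values, assuming these definitions.

```python
import math

def sem_sdc(sd, icc):
    """Standard error of measurement and smallest detectable change (individual level),
    using the common definitions SEM = SD * sqrt(1 - ICC) and SDC = 1.96 * sqrt(2) * SEM."""
    sem = sd * math.sqrt(1 - icc)
    sdc = 1.96 * math.sqrt(2) * sem
    return sem, sdc

# Hypothetical example: a subscale with SD = 25 points (0-100 scale) and ICC = 0.80
sem, sdc = sem_sdc(sd=25, icc=0.80)
print(round(sem, 1), round(sdc, 1))  # SEM ~ 11.2, SDC ~ 31.0 points
```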

Construct validity (hypothesis testing)

Known-group comparison

Ten studies performed a known-group comparison, i.e., a comparison of subgroups based on sociodemographic and/or clinical variables for which differences in QLQ-CR29 scores would be expected (Table 4). Methodological quality of eight studies was rated as “poor” [11, 30,31,32,33,34,35, 37], because they did not formulate a priori hypotheses about expected differences between groups. Two studies were rated as “fair” [28, 36], because it was not described how missing items were handled. The studies of fair quality showed multiple differences in subscales and items between known groups, but confirmed less than 75% of their hypotheses, leading to an “insufficient” rating. The studies of poor quality found multiple differences between groups (e.g., a difference in taste for the stoma vs. no stoma group), but careful interpretation is warranted because no hypotheses were formulated, leading to an “indeterminate” rating.
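As an illustration of how a priori hypotheses for known-group comparisons could be tested against the 75% criterion, the sketch below compares hypothetical symptom scores of patients with and without a stoma using a Mann-Whitney U test; the group labels, data, and set of hypotheses are invented.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Hypothetical 0-100 symptom scores: the a priori hypothesis is that patients
# with a stoma report more embarrassment than patients without a stoma.
stoma = rng.normal(55, 20, 40).clip(0, 100)
no_stoma = rng.normal(35, 20, 40).clip(0, 100)

stat, p = mannwhitneyu(stoma, no_stoma, alternative="greater")
confirmed = p < 0.05
print(f"hypothesis confirmed: {confirmed} (p = {p:.3f})")

# COSMIN-style summary: the result is rated "sufficient" only if at least
# 75% of all a priori formulated hypotheses are confirmed.
hypotheses_confirmed = [confirmed, True, True, False]  # hypothetical set of results
print(sum(hypotheses_confirmed) / len(hypotheses_confirmed) >= 0.75)
```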

Table 4 Known-group comparison of the QLQ-CR29

Convergent validity

Eight studies investigated convergent validity (Table 5). Methodological quality of seven studies was rated as “poor” [28, 30,31,32,33, 35, 37], because a priori hypotheses about expected correlations were not reported and/or information on the measurement properties of the comparator instrument was not provided. One study was rated as “good” [29]. In this study, the comparator instrument was the low anterior resection syndrome (LARS) score (measuring bowel dysfunction after sphincter-preserving surgery among rectal cancer patients [29, 39]). All five a priori formulated hypotheses were confirmed, leading to a “sufficient” rating. In the studies of poor quality, the comparator instrument was the QLQ-C30, the core questionnaire of the EORTC QLQ questionnaires [6]. Most studies showed that the functional subscales of the QLQ-CR29 correlated positively with the functional scales and negatively with the symptom scales of the QLQ-C30, and that most QLQ-CR29 symptom scales correlated positively with the symptom scales and negatively with the functional scales of the QLQ-C30. As no a priori hypotheses were reported in most of these studies, the results are difficult to interpret. While correlations between some scales make theoretical sense (e.g., functional scales: QLQ-C30 emotional functioning and QLQ-CR29 anxiety), many scales are likely unrelated (e.g., symptom scales: QLQ-C30 insomnia and QLQ-CR29 hair loss). Due to the diversity of subscale constructs, the results of these studies were rated as “indeterminate.”
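To illustrate what testing an a priori hypothesis about convergent validity could look like, the sketch below computes a Spearman correlation between hypothetical scores on a QLQ-CR29 item and a related QLQ-C30 scale and compares it with a prespecified direction and magnitude; the threshold of −0.50 is only an example, not a criterion used in the included studies.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
# Hypothetical paired scale scores (0-100) for 100 patients
c30_emotional_functioning = rng.uniform(0, 100, 100)
cr29_anxiety = 100 - c30_emotional_functioning + rng.normal(0, 15, 100)

rho, p = spearmanr(c30_emotional_functioning, cr29_anxiety)
# A priori hypothesis (example): a negative correlation of at least moderate size (rho <= -0.50)
print(f"rho = {rho:.2f}, hypothesis confirmed: {rho <= -0.50}")
```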

Table 5 Convergent validity of the QLQ-CR29

Divergent validity

Two studies investigated divergent validity. Methodological quality was rated as “poor” [11, 28], because a priori hypotheses about expected correlations were not reported, and information on the measurement properties of the comparator instrument was not provided. In both studies, the comparator instrument was the QLQ-C30 [6]. One study reported correlations between the two instruments of < 0.02 for most subscales [28], and the other reported correlations of < 0.40 in all subscales [11]. As was the case for convergent validity, due to the diversity of subscale constructs it is difficult to determine which subscales should be unrelated and which should be related. As such, we rated these results as “indeterminate.”

Based on these findings, construct validity (hypothesis testing) was rated overall as “inconsistent.” Most studies did not report a priori hypotheses, and their results could therefore not be interpreted. The three remaining studies yielded an “insufficient” rating for known-group comparison [28, 36] and a “sufficient” rating for convergent validity [29], leading to the overall “inconsistent” rating. Quality of evidence was graded as “moderate,” because mostly studies of fair and poor quality were found.

Responsiveness

Two studies investigated responsiveness. Methodological quality of these studies was rated as “poor” [11, 33], because a priori hypotheses about changes in scores were not reported. The ability to detect change in HRQOL was tested in patients before and within 2 years after stoma closure, in patients receiving palliative chemotherapy and 3 months later [11], and before and after neoadjuvant or palliative chemotherapy [33]. A statistically significant reduction was found in the symptom scores for weight [11], and for BMS, SF, urinary frequency, urinary incontinence, dysuria, buttock pain, bloating, and taste [33] after chemotherapy. Other scores were unchanged, as was the case after stoma closure. Based on these findings, and because no correlations with changes in instruments measuring related constructs were reported, the evidence on responsiveness was rated “indeterminate.”

Summarized ratings of the results and the overall quality of evidence of all measurement properties are shown in Table 6.

Table 6 Summary of results and quality of the evidence of the measurement properties of the QLQ-CR29

Discussion

The QLQ-CR29 is a well-known and commonly used PROM, published in 2007 following revision of the QLQ-CR38. Both instruments cover a wide range of symptoms among CRC patients. This review shows that the current evidence on the measurement properties of the QLQ-CR29 is limited. Across the 11 included studies, the methodological quality per measurement property was most often rated as “fair” or “poor.” The evidence on internal consistency was rated as “sufficient,” reliability as “insufficient,” construct validity (hypothesis testing) as “inconsistent,” and structural validity, measurement error, and responsiveness as “indeterminate.”

Most studies performed indirect measurements of structural validity. In the single study using PCA, one of the original subscales was not replicated but was merged with single items into a new subscale [37]. We recommend that future studies perform CFA to confirm either the original or the newly found factor structure. Subsequently, internal consistency should be assessed for those subscales that are confirmed to be unidimensional.

Reliability appears to be a concern for the QLQ-CR29. Further investigation is necessary, using ICCs that take systematic error variance into account. These data can also be used to assess measurement error, by calculating the SDC. The SDC should be compared with the MIC, to determine whether the smallest change in scores that can be detected beyond measurement error (with 95% certainty) is smaller than the change that is minimally important to patients.

Criterion validity cannot be assessed for the QLQ-CR29, since there is no “gold standard” for measuring HRQOL. Therefore, it is important to assess construct validity by formulating a priori hypotheses, including direction and magnitude, for (1) known-group differences and (2) convergent and divergent validity with other PROMs. Hypotheses that are confirmed contribute to construct validity. The aim of the studies included in this review was primarily to determine whether there was overlap between the QLQ-C30 and QLQ-CR29, not to specifically test convergent/divergent validity. Therefore, the construct validity of the QLQ-CR29 needs to be investigated further with a priori formulated hypotheses. The same applies to responsiveness, which needs to be investigated in groups that are known to change, with a priori formulated hypotheses.

While none of the studies reported on tests of cross-cultural validity (i.e., measurement invariance), the original validation study was performed in an international sample [11]. Additional, formal tests of measurement invariance would be useful.

The strength of the current review is that we closely followed the COSMIN guidelines during all steps of this review. A limitation is that we used a precise instead of a sensitive search filter for measurement properties in the literature search, which has a lower sensitivity (93 vs. 97%) [22]. Therefore, we cannot rule out that some additional validation studies of the QLQ-CR29 might have been missed.

This review indicates that additional, better quality research is needed on the measurement properties of the QLQ-CR29. Future validation studies should focus on structural validity and, subsequently, internal consistency of subscales shown to be unidimensional; reliability and, with it, measurement error; construct validity (hypothesis testing) and responsiveness, tested against a priori formulated hypotheses; and cross-cultural validity. The use of the COSMIN methodology is recommended for this purpose.