Plain English summary

Bladder cancer is one of the ten most common cancers worldwide. However, little attention is being paid to the impact of the disease and its treatment on patients’ quality of life. Quality of life can be measured using questionnaires. Multiple organisations have developed one or more questionnaires investigating common symptoms and problems of patients presenting with or being treated for bladder cancer. A limitation of most bladder cancer-specific questionnaires, in particular the EORTC-QLQ-BLM30, is the uncertainty about their performance: does the questionnaire measure what it intends to measure? Can we identify patients with many problems and symptoms? The aim of this study was to answer these questions, among others. This study shows that the original grouping of questions of the EORTC-QLQ-BLM30 was inadequate. However, the questionnaire performed well after the questions were regrouped. The results of this study will help urologists and scientists interpret their patients’ questionnaire data, although some caution is warranted because only the Dutch version of this questionnaire was examined.

Introduction

Bladder cancer (BC) is one of the ten most common types of cancer worldwide [1]. About a quarter of patients present with Muscle Invasive Bladder Cancer (MIBC), and of the three quarters of patients diagnosed with Non-Muscle Invasive Bladder Cancer (NMIBC), 10–15% will subsequently progress to MIBC.

In the past decades the main focus of BC research has been on optimizing oncological outcomes, with relatively little attention being paid to the impact of the disease and its treatment on the functional health, symptom burden and health-related quality of life (HRQoL) of patients [2]. Several reports have indicated that the HRQoL outcomes of patients with BC appear to be worse than those of patients with other cancer types [3]. HRQoL outcomes are typically assessed with patient-reported outcome measures (PROMs). There are currently a number of PROMs designed to assess the HRQoL of patients with BC, including: the Bladder Cancer Index (BCI) [4]; the Functional Assessment of Cancer Therapy questionnaire for bladder cancer patients in general (FACT-Bl) and for those who undergo a cystectomy (FACT-VCI) [5, 6]; and the European Organisation for Research and Treatment of Cancer (EORTC) questionnaires for NMIBC (EORTC-QLQ-NMIBC24) [7, 8] and MIBC (EORTC-QLQ-BLM30) [9]. A major limitation of many of these PROMs is the lack of validation studies demonstrating that they accurately measure what they intend to measure [10]. This is especially true for the EORTC-QLQ-BLM30. To date, only one study has investigated the internal consistency of four of the seven scales of the QLQ-BLM30 [11], and all other psychometric properties, with the exception of known-group validity [10], have not yet been assessed.

The aim of the current study was to investigate the structural validity, reliability (i.e., internal consistency and test-retest reliability), construct validity (i.e., divergent validity and known group validity), and responsiveness of the Dutch-language version of the EORTC-QLQ-BLM30 in patients with MIBC.

Methods

Study design

The study included patients diagnosed with non-metastatic MIBC (≥cT2,cN0–2,cM0) between November 1st 2017 and November 1st 2019, who participated in the HRQoL component of the BlaZIB study (Blaaskankerzorg In Beeld, EN: Insight into bladder cancer care). BlaZIB is a Dutch population-based prospective cohort study, embedded in the Netherlands Cancer Registry (NCR), evaluating the quality of bladder cancer care in the Netherlands. BlaZIB collects comprehensive clinical data and HRQoL data of patients. More information about the BlaZIB study can be found elsewhere [12]. The Committee on Research involving Human Subjects (CMO) of Arnhem-Nijmegen deemed the BlaZIB study exempt from ethical review under the Medical Research Involving Human Subjects Act (WMO). The BlaZIB study was approved by the privacy review board of the NCR. Informed consent, either written or digital, was obtained from all patients participating in the HRQoL component of the BlaZIB study.

Questionnaires

All patients diagnosed in a hospital participating in the HRQoL component of the BlaZIB study received an invitation to complete the baseline questionnaire shortly (about 6 weeks) after histological confirmation of the bladder tumour (T6wk). Patients who completed the baseline questionnaire and were still alive at follow-up received a follow-up questionnaire at 6 months (T6mo), 12 months (T12mo) and 24 months (T24mo) after diagnosis. The questionnaires were provided digitally and in paper-and-pencil format and included questions on demographics, work, lifestyle and HRQoL. HRQoL was assessed with the Dutch version of the EQ-5D-5L, the EORTC-QLQ-C30, the EORTC-QLQ-NMIBC24 and the EORTC-QLQ-BLM30. The Bladder Cancer Index (BCI) was included as an additional non-obligatory questionnaire.

Questionnaire scoring

The EORTC-QLQ-C30 consists of 30 items assessing global health status, five functional health domains (physical, role, emotional, cognitive and social functioning) and nine symptoms (fatigue, nausea and vomiting, pain, dyspnoea, insomnia, appetite loss, constipation, diarrhoea and financial difficulties) [13]. The EORTC-QLQ-BLM30 consists of 30 items that were originally hypothesized to form seven scales (urinary symptom (US), urostomy problem, single catheter use problem (CAT), future worries (FW), abdominal bloating and flatulence (BAF), body image (BI) and sexual functioning) [14]. All items, except for global health status (seven-point scale), are scored on a four-point Likert scale ranging from 1 (not at all) to 4 (very much). Because patients who completed the online questionnaire were required to answer all questions, the response category ‘not applicable / not willing to share’ was added to the items of the sexual functioning scale (items 53 to 60) in both the online and the paper-and-pencil questionnaire. This extra response category was handled as missing in the calculation of the scores. In accordance with the EORTC guidelines, all responses were linearly transformed to a 0 to 100 scale, and missing item responses were imputed with the mean of the completed items of the scale, provided that more than 50% of the items of the scale were completed [15].
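The scoring rule described above can be illustrated with a minimal Python sketch. It is not the scoring code used in the study; the function name score_scale and its parameters are hypothetical. It assumes four-point items (range 3), averages the completed items of a scale, and applies the linear transformation to 0 to 100, reversing the direction for functional scales so that a higher score means better functioning.

```python
from typing import Optional, Sequence


def score_scale(items: Sequence[Optional[int]],
                item_range: int = 3,
                functional: bool = False) -> Optional[float]:
    """Score one scale on a 0-100 metric (illustrative sketch).

    items      -- raw responses (1-4 for four-point items); None marks a
                  missing or 'not applicable' response
    item_range -- highest minus lowest response option (3 for 1-4 items)
    functional -- True for functional scales, where the score is reversed
    """
    answered = [x for x in items if x is not None]
    # Rule stated in the text: score the scale only if more than 50%
    # of its items were completed.
    if len(answered) * 2 <= len(items):
        return None
    # Averaging the completed items is equivalent to imputing each
    # missing item with the mean of the completed ones.
    raw = sum(answered) / len(answered)
    scaled = (raw - 1) / item_range * 100
    return 100 - scaled if functional else scaled
```

For example, score_scale([2, None, 2]) scores a three-item symptom scale with one missing response, whereas score_scale([2, None, None]) returns None because only one third of the items were completed.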

Additional measures

In order to assess the test-retest reliability of the EORTC-QLQ-BLM30, 81 patients diagnosed with MIBC were asked to complete the EORTC-QLQ-BLM30 2 weeks after the T12mo assessment (T12mo + 2wk; response rate: 84.4%). For practical reasons, this latter questionnaire was only administered in a paper-and-pencil format, but included the same instructions given at T12mo. The T12mo + 2wk questionnaire contained four questions to assess whether patients had less, equal or more complaints in general and on three subscales (urinary, bowel and sexual) compared to the previous questionnaire (T12mo). Only those patients who indicated that they were stable over time on the relevant subscales (i.e. equal complaints) were included in the test-retest analysis.

To assess the divergent validity, the content of the EORTC-QLQ-BLM30 was compared with that of the core questionnaire, the EORTC-QLQ-C30.

Statistical analyses

Structural validity

Confirmatory factor analysis (CFA) was performed to evaluate the hypothesized scale structure of the QLQ-BLM30. Because the US and urostomy problem scales are mutually exclusive, the CFA was run twice: once with the urostomy problem scale and without the US scale, and once vice versa. Maximum likelihood (ML) was used as the estimator in the CFA, and missing item responses were handled with full information maximum likelihood (FIML). Model-data fit of the CFA was assessed with the model chi-square, the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA) and the Standardized Root Mean Square Residual (SRMR). A model chi-square p-value > 0.05, CFI ≥ 0.95, RMSEA < 0.05 and SRMR < 0.05 indicate a good fit, and CFI > 0.90, 0.05 < RMSEA < 0.08 and 0.05 < SRMR < 0.08 indicate an acceptable fit [16].
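As an illustration of how these cut-offs can be applied, the following Python sketch classifies a set of fit indices. The function name classify_fit is hypothetical, and two choices are assumptions rather than part of the source: all three indices must meet a threshold for the label to apply, and the model chi-square is left out of the classification.

```python
def classify_fit(cfi: float, rmsea: float, srmr: float) -> str:
    """Label CFA model fit as 'good', 'acceptable' or 'poor'.

    Cut-offs follow the text. Requiring every index to pass is an
    assumption of this sketch, not a rule stated in the source.
    """
    if cfi >= 0.95 and rmsea < 0.05 and srmr < 0.05:
        return "good"
    if cfi > 0.90 and rmsea < 0.08 and srmr < 0.08:
        return "acceptable"
    return "poor"
```

Under these assumptions, a model with CFI 0.86, RMSEA 0.07 and SRMR 0.11 is labelled poor, because both CFI and SRMR fall outside the acceptable range.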

Multitrait scaling analysis was performed for each assessment point to evaluate the unidimensionality of the scales (i.e., an assumption in classical test theory) and to examine whether the individual items could be grouped in the hypothesized scales. A correlation of ≥0.40 between an item and its own scale was regarded as adequate statistical evidence for convergent validity. Statistical evidence of discriminant validity was defined as a correlation of < 0.40 between an item and other scales in the questionnaire [17]. Items that had poor convergent and/or discriminant validity were discussed and reassigned to another or new scale if necessary. Further psychometric analyses were performed after finalizing the scale structure.
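The convergent and discriminant checks can be sketched in Python as follows. This is an illustration only, not the software used in the study. It assumes Pearson correlations, complete cases, scales of at least two items, and an item-own-scale correlation that is corrected for overlap (the item is left out of its own scale total); none of these details are specified in the text. All names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def scale_total(data, items, skip=None):
    """Row-wise sum of the given items, optionally leaving one item out."""
    return [sum(row[i] for i in items if i != skip) for row in data]


def multitrait(data, scales, threshold=0.40):
    """data: list of dicts (one per patient, item name -> response).
    scales: dict mapping scale name -> list of item names (>= 2 items)."""
    results = {}
    for own, items in scales.items():
        for item in items:
            values = [row[item] for row in data]
            # Item versus its own scale, corrected for overlap.
            convergent = pearson(values, scale_total(data, items, skip=item))
            # Item versus every other scale.
            discriminant = {
                other: pearson(values, scale_total(data, other_items))
                for other, other_items in scales.items() if other != own
            }
            results[item] = {
                "convergent": convergent,
                "convergent_ok": convergent >= threshold,
                "discriminant": discriminant,
                "discriminant_ok": all(r < threshold
                                       for r in discriminant.values()),
            }
    return results
```

An item flagged with convergent_ok or discriminant_ok set to False would be a candidate for discussion and possible reassignment, as described above.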

Floor and ceiling effects were examined for each scale. A scale was considered to have a floor or ceiling effect if more than 15% of the patients achieved the lowest or highest possible score, respectively.
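A minimal Python sketch of this check is given below. The function name floor_ceiling is hypothetical, and the sketch assumes that scale scores are on the 0 to 100 metric and that unscored scales are represented by None.

```python
def floor_ceiling(scores, lowest=0.0, highest=100.0, threshold=0.15):
    """Share of patients at the lowest and highest possible score.

    A floor or ceiling effect is flagged when that share exceeds the
    threshold (15% in the text). Scores of None are ignored.
    """
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("no valid scores")
    floor_share = sum(s == lowest for s in valid) / len(valid)
    ceiling_share = sum(s == highest for s in valid) / len(valid)
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }
```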

Measurement error and reliability

Cronbach’s coefficient α was calculated for each scale to assess internal consistency. A Cronbach’s α of 0.70 or higher was considered acceptable for group comparisons. Test-retest reliability was assessed based on the questionnaires administered at T12mo and T12mo + 2wk using the intraclass correlation coefficient (ICC) for absolute agreement (two-way mixed model, single measure) [18]. An ICC value of 0.70 or higher was considered acceptable.
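Both statistics can be written out in a few lines of Python. The sketch below is illustrative only (the study used SAS and STATA); the function names are hypothetical and complete data are assumed. Cronbach’s α is computed from the item variances and the variance of the total score, and the ICC is the two-way, absolute-agreement, single-measure coefficient derived from the mean squares of a two-way ANOVA.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of responses per item, same respondents in each."""
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


def icc_absolute_single(rows):
    """rows: one list per patient holding the k repeated scale scores
    (here k = 2: T12mo and T12mo + 2wk)."""
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(c) / n for c in zip(*rows)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ms_rows = ss_rows / (n - 1)                     # between patients
    ms_cols = ss_cols / (k - 1)                     # between time points
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Because the ICC is defined for absolute agreement, a systematic shift between the two assessments lowers the coefficient even when patients keep exactly the same rank order.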

Hypothesis testing for construct validity

Divergent validity of the QLQ-BLM30 was assessed by calculating the Spearman correlation coefficients between the scales of the QLQ-C30 and QLQ-BLM30. It was hypothesized that the symptom scales of the QLQ-BLM30 would have low to moderately negative correlations with the functioning scales of the QLQ-C30. A strong correlation was expected between scales that were conceptually related, i.e. constipation and diarrhoea (QLQ-C30) vs abdominal bloating and flatulence (QLQ-BLM30).

Known group validity was assessed by comparing patients with stage II (T2,N0,M0) and stage III (T3-4a,N0,M0 or T1-4a,N1–2,M0) disease (UICC TNM 2018 [19]), and patients with a physical function (PF) score < 90 and ≥ 90. It was expected that the HRQoL of patients with stage II and III disease would be comparable as these patients are treated similarly [20]. We hypothesized that patients with high PF (≥ 90) would report better functioning and fewer symptoms on all scales than patients with low PF (< 90). Effect sizes (ESs) were calculated using Cohen’s d statistic (mean difference divided by pooled standard deviation). These provide a distribution-based estimate of the magnitude of mean differences/changes, where an ES of 0.2 is considered small, 0.5 moderate, and 0.8 large [21].
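The effect size calculation can be sketched in Python as follows. The function names are hypothetical. Treating 0.2, 0.5 and 0.8 as lower bounds of the small, moderate and large categories, and labelling anything below 0.2 as negligible, are assumptions of this sketch rather than rules stated in the text.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Mean difference (a minus b) divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def magnitude(d):
    """Verbal label for an effect size; the binning is an assumption."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "negligible"
```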

Responsiveness

Responsiveness to change was assessed in patients who underwent treatment with curative intent (i.e., radical cystectomy (RC) or (chemo)radiotherapy ((C)RT) [20]) after completion of the baseline questionnaire, showed no disease recurrence or progression and completed the EORTC-QLQ-BLM30 questionnaire at all time points. It was hypothesized that patients would report increased urinary, bowel and sexual problems after removal of the bladder compared to baseline [22, 23]. Additionally, we hypothesized that patients who were treated with (C)RT would report better sexual function and body image than patients treated with RC [11].

The CFA was conducted with the software package R using the “lavaan” package [24]. ICCs were calculated in STATA version 16.0 (StataCorp LLC, College Station, Texas, USA). All other statistical analyses were performed using SAS version 9.4 (SAS Institute, Cary, North Carolina, USA).

Results

Patient characteristics, completion rates and missing data

Of the 1542 patients invited to participate in the HRQoL measures, 650 patients (42.2%) completed the baseline questionnaire (T6wk). Respondents were more often male, had a slightly better comorbidity profile, had a higher SES, had a more favourable stage distribution and were more likely to undergo an RC (see Additional file 2 for a comparison of the patient and tumour characteristics of the respondents and non-respondents). The follow-up questionnaires had higher completion rates: 396 (62.7% of the invited patients) at T6mo, 357 (70.3%) at T12mo and 277 (76.5%) at T24mo (Fig. 1). The majority of the patients were male (77.7%), lived with a partner (76.3%), were former smokers (64.8%), and were diagnosed with stage II BC (66.2%) (Table 1).

Fig. 1

Flowchart. HRQoL = Health-Related Quality of Life; NMIBC = Non-Muscle Invasive Bladder Cancer; MIBC = Muscle Invasive Bladder Cancer; wk = week; mo = month. a: Percentage of patients who completed the questionnaire after being invited to fill it in

Table 1 Patient and tumour characteristics (n and %)

The percentage of missing responses, including not applicable, on the items single catheter use (item 44) and female sexual function (item 60) was high (> 85%) (see Additional file 2). The percentage of missing responses was low (< 3%) for items 45 to 52 (future worries, bloating and flatulence and body image scales) and varied per assessment point for the items belonging to the original urinary symptom and urostomy problems scales (items 30 to 43), as these scales are mutually exclusive. The percentage of missing responses for the items belonging to the original sexual function scale (items 53 to 60) was < 48%, except for female sexual function, if limited to the patients reporting at least some sexual activity (item 48).

Structural validity

Items 44 and 60 were excluded from the CFA because of the high number of missing responses (> 85%). The hypothesized scale structure of the QLQ-BLM30 did not fit the data well, with a CFI of 0.80–0.86, an RMSEA of 0.07–0.08 and an SRMR of 0.11–0.13 (see Additional file 2). Multitrait scaling analysis showed that the sexual functioning and urostomy problems scales were particularly problematic (see Additional file 2). For this reason, we decided to revise the sexual functioning scale in the same way as was done for the EORTC-QLQ-NMIBC24 [7] (see Table 3), even though items 57 and 58 showed a moderate correlation (0.53–0.59) and the model fit remained largely the same (±0.002 change) after combining these items into one scale (Additional file 2).

Based on the data and the item content, the urostomy problems scale was reduced to a three-item scale (items 38, 39 and 43). The remaining items in the originally hypothesized scale, about urostomy irritation (item 40), urostomy embarrassment (item 41) and urostomy support (item 42), were kept as single items. Although the bloating and flatulence scale showed low convergent validity (< 0.40) at all time points (see Table 2), it was considered to be a unidimensional scale in patients who underwent an RC (con: 0.43; dis: −0.07 to 0.31). This led to the decision to keep the bloating and flatulence scale intact. The revised scale structure exhibited adequate to good fit at all time points (see Additional file 2).

Table 2 Item correlations within and between scales of the EORTC QLQ-BLM30 at baseline (T6wk), 6 months, 12 months and 24 months

The hypothesized and revised scale structures of the EORTC-QLQ-BLM30 are shown in Table 3.

Table 3 The originally hypothesized and revised scale structure of the EORTC QLQ-BLM30

Measurement error and reliability

The internal consistency of the revised scales at all time points was good (α > 0.70), with the exception of urostomy problems (0.55–0.65) and bloating and flatulence (0.48–0.54; Table 2). Test-retest reliability was acceptable for three scales (ICC > 0.70); nearly acceptable for two scales (ICC 0.68–0.69) and fair to moderate for the urostomy problems scale (0.61) and bloating and flatulence (0.47; Table 4).

Table 4 Intraclass correlation coefficients of the QLQ-BLM30 subscales for 81 MIBC patients participating in the test-retest analysis of BlaZIB

Hypothesis testing for construct validity

The correlations between the scales of the core questionnaire (QLQ-C30) and QLQ-BLM30 questionnaire were low (< 0.40; Table 5), with the exception of the correlation between emotional function (QLQ-C30) and future worries (QLQ-BLM30). This indicates that the module’s content does not, for the most part, overlap with the content of the core questionnaire.

Table 5 Correlations between scales of the QLQ-C30 and QLQ-BLM30 at baseline

The scores of patients with stage II and stage III bladder cancer were, as expected, quite similar (ES < 0.30). Patients with a score of 90 or higher on the physical function scale of the QLQ-C30 had higher scores on the functional scales and lower scores on the symptom scales of the QLQ-C30 and QLQ-BLM30 compared to patients with physical function scores < 90 (Table 6).

Table 6 Comparison of mean scores for the QLQ-C30 and QLQ-BLM30 between patients with stage II and stage III bladder cancer and with high and low physical function at baseline (T6wk)

Responsiveness

Future worries decreased after baseline in patients who underwent treatment with curative intent (ES = 0.67 to 1.39; Table 7). Body image (ES = −0.77 to −0.62) and male sexual problems (ES = −0.78 to −0.67) deteriorated in patients who underwent an RC, while body image (ES = 0.23 to 0.33) and urinary function (ES = 0.16 to 0.59) improved in patients undergoing (C)RT.

Table 7 Responsiveness to change over time in patients who underwent a potentially curative therapy, completed the EORTC-QLQ-BLM30 at all time points and had no disease recurrence or progression

Discussion

The aim of this study was to examine the structural validity, reliability, construct validity and responsiveness of the Dutch version of the EORTC-QLQ-BLM30. The originally hypothesized scale structure of the QLQ-BLM30 could not be substantiated using data of 650 Dutch patients with MIBC, and the scale structure was therefore revised into seven scales (urinary symptoms, urostomy problems, future worries, abdominal bloating and flatulence, body image, sexual functioning and male sexual problems) and eight single items. The revised scale structure, in general, exhibited good reliability, construct validity and responsiveness. Only the reliability (i.e. internal consistency and test-retest reliability) of the new urostomy problems scale and the abdominal bloating and flatulence scale was below the acceptable cut-off point.

Based on the data and the items’ content, we revised the six-item urostomy problems scale into a three-item scale and three single items. The new urostomy problem scale performs better (i.e., has higher internal consistency and better CFA results) than the originally hypothesized scale, but still performs below the acceptable cut-off point for both internal consistency and test-retest reliability. We would note that the originally hypothesized scale showed better internal consistency (α = 0.71) in the study of Mak et al. [11]. This indicates that more research is needed to examine the coherence of items 38 to 43, as the originally hypothesized urostomy problem scale may be adequate in populations other than the current study population.

The two items of the abdominal bloating and flatulence scale (i.e., items 48 and 49) were only moderately correlated and thus did not appear to measure one unidimensional construct in the general MIBC population. Furthermore, the test-retest analysis indicated that the scores on this scale varied more than would be desirable in generally stable patients, and no large differences were observed for this scale in patients who experienced changes in health over time. Similar results have been observed for this scale in patients diagnosed with NMIBC [7, 8]. However, the correlation between the items of the bloating and flatulence scale was higher for patients treated with RC than for the total MIBC population. It may be that the bloating and flatulence scale is not relevant for the entire MIBC population, but only for certain subgroups such as patients treated with RC or (C)RT. More research is needed to explore this scale further in other populations and patient subgroups.

The response rates for the items addressing sexual function, especially female sexual function (item 60), were generally lower than for the items of other scales if corrected for the missing responses related to having and not having had a urostomy (i.e., urinary symptom and urostomy problem scale). This finding is in line with other studies [7, 22, 25], but nevertheless resulted in the exclusion of female sexual function and the item single catheter use from the CFA. The new grouping of the items addressing sexual function is the same as proposed and confirmed for the QLQ-NMIBC24 [7, 8, 25], and therefore, we expect that this new grouping of the sexual items will be sustained in future studies investigating the measurement properties of the QLQ-BLM30.

Although this study evaluated the measurement properties of the QLQ-BLM30 in a large population-based group of patients with MIBC, it has some limitations. The primary limitation is its setting: the analysis was performed within a single study in a single country. Further research will be needed to examine the measurement properties of the QLQ-BLM30 in other countries and settings. Furthermore, the completion rate of the baseline questionnaire (T6wk) was rather low, which may affect the generalizability of the scores to the entire Dutch population with MIBC. We do, however, believe that this would have a negligible effect on the observed measurement properties of the QLQ-BLM30, as it is unlikely that potential selection bias significantly affected the measurement properties of the questionnaire. Finally, we would note that we relied on classical psychometrics to evaluate the properties of the QLQ-BLM30. Clinimetric evaluation of the questionnaire might result in a somewhat different scale structure than reported here, because that approach does not require unidimensionality and homogeneity of components [26]. Clinimetric evaluation may additionally provide insight into the clinical utility and sensitivity of the questionnaire.

Conclusion

The originally hypothesized scale structure of the EORTC-QLQ-BLM30 did not fit the data well and needed revision. The proposed revised scale structure of the QLQ-BLM30, in general, exhibits good structural validity, reliability (i.e., internal consistency and test-retest reliability), construct validity (i.e., divergent validity and known group validity) and responsiveness in Dutch patients with MIBC. The urostomy problem and bloating and flatulence scale properties remain suboptimal and further studies are needed to confirm our findings.