Equivalence testing of a newly developed interviewer-led telephone script for the EORTC QLQ-C30

Purpose
The European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life-Core Questionnaire (QLQ-C30) is a widely used generic self-report measure of health-related quality of life (HRQOL) for cancer patients. However, no validated voice script for interviewer-led telephone administration was previously available. The aim of this study was to develop a voice script for interviewer administration via telephone.

Methods
Following guidelines from the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) PRO Mixed Modes Good Research Practices Task Force, a randomised cross-over equivalence study, including cognitive debriefing, was conducted to assess equivalence between paper and telephone administration modes. Assuming an expected intraclass correlation coefficient (ICC) of 0.70 and a minimally acceptable level of 0.50, a sample size of 63 was required.

Results
Cognitive interviews with five cancer patients found the voice script to be clear and understandable. Due to a protocol deviation in the first wave of testing, only 26 patients were available for analyses. A second wave of recruitment was conducted, adding 37 patients (n = 63; mean age 55.48; 65.1% female). Total ICCs for mode comparison ranged from 0.72 (nausea and vomiting, 95% CI 0.48–0.86) to 0.90 (global health status/QoL, 95% CI 0.80–0.95; pain, 95% CI 0.79–0.95; constipation, 95% CI 0.80–0.95). For paper-first administration, all ICCs were above 0.70, except nausea and vomiting (ICC 0.55; 95% CI 0.24–0.76) and financial difficulties (ICC 0.60; 95% CI 0.31–0.79). For phone-first administration, all ICCs were above 0.70.

Conclusions
The equivalence testing results support the voice script's validity for administration of the QLQ-C30 via telephone.

Supplementary Information
The online version contains supplementary material available at 10.1007/s11136-021-02955-6.


Purpose
The European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life-Core Questionnaire (QLQ-C30) [1] is currently one of the most widely used self-report measures of health-related quality of life (HRQOL) for cancer patients [2]. Patient reports using the EORTC QLQ-C30 are collected via paper-and-pencil administration or through electronic methods. However, to increase the accessibility of the questionnaire across different research settings and populations (e.g. patients who could be at risk of exclusion because of illiteracy), and to minimise the need for otherwise unnecessary clinic trips, the EORTC Quality of Life Group (QLG) set out to develop and validate a voice script for phone administration of the questionnaire. The voice script would incorporate all relevant elements of the QLQ-C30, along with additional instructions and text, where necessary, to facilitate its interviewer-led administration.
The EORTC QLQ-C30 was first released in 1987 by Aaronson et al. [3], and underwent further revisions leading to the development of its third version [1], which is still in use today. Comprising 30 items (questions), it consists of eight multi-item scales: five functional scales (physical, role, emotional, cognitive, and social) and three symptom scales (fatigue, pain, and nausea and vomiting), along with one global health status and quality of life (QOL) scale and six single items (dyspnoea, insomnia, appetite loss, constipation, diarrhoea, and financial difficulties). Covering the majority of the core symptoms recommended for patient-reported outcome (PRO) measurement in cancer clinical trials [4], it is available for use in over 110 language versions, having undergone extensive testing to demonstrate its psychometric [5] and cultural [6] validity.
The majority of QLQ-C30 items (n = 28) are measured on a four-option Likert response scale that ranges from 1, indicating "not at all", to 4, indicating "very much", capturing the presence and/or severity of a symptom or issue and its impact on QOL. The final two items, which make up the global health status and QOL scale, are rated on a scale from 1 to 7, with 1 indicating "very poor" and 7 "excellent". The time scale for all items is "during the past week", with the exception of the first 5 items (the physical functioning scale), for which no specific timeframe is used, given the intent to capture a more global impact on physical functioning, not limited to a 1-week recall period. All single items and multi-item scales in the questionnaire are scored and transformed onto a 0-100 scale, with higher scores on the functional and global health status/QOL scales indicating higher levels of functioning and QOL, and higher scores on symptom scales and single items indicating a higher degree of symptomatology and problems.
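As a concrete illustration of this transformation, the sketch below computes a scale score as the mean of its item responses and maps it linearly onto 0-100, reversing the direction for functional scales. This is an illustrative reimplementation of the standard EORTC linear transformation, not code from this study; the function name and argument names are assumptions.

```python
def transform_score(item_responses, kind, item_range=3):
    """Linearly transform a QLQ-C30 raw score onto a 0-100 scale.

    item_responses: item values (1-4 for most items; 1-7 for the two
                    global health status/QOL items).
    kind: "functional", "symptom", or "global".
    item_range: max minus min of the item scale (3 for 4-point items,
                6 for the 7-point global items).
    """
    raw_score = sum(item_responses) / len(item_responses)
    if kind == "functional":
        # Direction is reversed: higher transformed score = better functioning.
        return (1 - (raw_score - 1) / item_range) * 100
    # Symptom and global health/QOL scales: higher raw score = higher score.
    return ((raw_score - 1) / item_range) * 100

# No problems on a functional scale scores 100; maximal symptoms score 100;
# global QOL rated 7/7 also scores 100.
print(transform_score([1, 1, 1, 1, 1], "functional"))   # 100.0
print(transform_score([4, 4], "symptom"))               # 100.0
print(transform_score([7, 7], "global", item_range=6))  # 100.0
```

Scoring in this direction means that, for example, a patient reporting "very much" on every physical functioning item receives a functional score of 0, while the same responses on a symptom scale yield 100.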
In addition to its frequent use in cancer research and clinical trials [2,7], the QLQ-C30 is increasingly used for monitoring purposes in clinical practice [8]. By providing a direct means of measuring core symptoms and issues from the patient's perspective, the QLQ-C30 yields clinically meaningful information, distinct from that offered by clinical markers and clinicians' ratings [2,7,9]. In 2018, the EORTC QLG published guidelines to help facilitate the use and migration of EORTC questionnaires into electronic PRO (ePRO) formats (e.g. computer, tablet) [10]. A computerised adaptive testing (CAT) version of the QLQ-C30, the EORTC CAT Core [11], is also available; it consists of dynamic item banks corresponding to the QLQ-C30's 14 functional and symptom domains.
The purpose of this study was to pilot test the provisional QLQ-C30 phone script through cognitive debriefing interviews to ensure its acceptability and relevance, amending it if needed, and to subsequently validate the QLQ-C30 phone-administered version by carrying out equivalence testing between the paper and phone administration modes in a population of patients actively undergoing cancer treatment. An intraclass correlation coefficient (ICC) of ≥ 0.70, the recommended threshold to demonstrate equivalence between various modes of administration, was employed in this study for the purpose of equivalence testing [12]. Previous research supports the use of ICC ≥ 0.70, as demonstrated in studies by Lundy et al. [13,14], in which an interactive voice response (IVR) version of the QLQ-C30 was developed. Similarly, in an equivalence study aimed at comparing tablet computer, IVR, and paper-based administration of the PRO-CTCAE [15], the degree of mode equivalence was assessed using ICC ≥ 0.70.
Although previous work conducted by Lundy et al. demonstrated the equivalence of an IVR version of the QLQ-C30 to its paper administration [13,14], this is the first project aimed at validating a voice script for phone administration of the QLQ-C30 by an interviewer. A considerable body of research comparing paper to screen-based (e.g. tablet, computer) administration of PROs has demonstrated high levels of reliability between both modes [16][17][18] but less work has compared paper administration to auditory modes (e.g. IVR, phone interview). Still, the existing research suggests that equivalence can be established between paper and phone PRO administration [13,15].

Methods
Patient recruitment and data collection, management, and analysis were subcontracted to contract research organisation (CRO) Mapi/ICON plc who provided a final report to the EORTC detailing the methodology and findings. Study approval was obtained in the United Kingdom (UK) by the Quorum Review independent review board.

Sample
Recruitment was carried out through a UK-based recruitment agency (Global Perspectives Limited) and patients were eligible to participate if they were 18 years or older, currently receiving cancer treatment as confirmed by a clinician, able to read and understand English, voluntarily agreed to participate in the study, and provided written informed consent.

Pilot-testing
Five patients were interviewed to test the acceptability, understanding, and relevance of the instructions for the QLQ-C30 voice script. Given that the QLQ-C30 has already been extensively tested and validated, and pilot-testing of the script had the aim of ensuring that the inclusion of additional text (e.g. instructions) was understandable, 5 patients was determined to be an acceptable number.

Equivalence testing
In addition to the previously described eligibility criteria, patients in the equivalence testing were required to have no changes in treatment planned between completion of the paper and phone versions. To support equivalence between paper-and-pen and phone administration modes using an ICC ≥ 0.70 and a minimally acceptable level of 0.50, a sample size of 63 patients was required [12]. Two waves of recruitment were conducted. In the first wave, 50 patients were recruited, the appropriate number for an equivalence threshold of ICC ≥ 0.90. However, protocol deviations were observed: only 26 patients completed the paper and phone versions of the QLQ-C30 within the pre-specified 2-day timeframe, owing to issues with mailing of the questionnaires and missing time stamps. The initial decision to use an equivalence threshold of ICC ≥ 0.90 had prompted considerable discussion among the research team members and CRO, and, following the protocol deviation, the criterion was adapted to the widely used ICC ≥ 0.70. A second wave of recruitment was therefore conducted: thirty-seven additional patients were recruited based on the same eligibility criteria, bringing the total sample size to 63.

Pilot-testing
Patients' interviews were conducted by trained qualitative researchers and audio-recorded for the purpose of analysis. Interviews lasted approximately 60 min and were based on a study-specific interview guide, which contained a summary of the methods to conduct the interview, along with semi-structured questions. The guide also contained questions regarding demographic and clinical variables to capture during the interview. Patients' responses were recorded anonymously on a grid. The QLQ-C30 phone script was subsequently revised accordingly, and the updated version was used for equivalence testing. The interview recordings were destroyed after completion of the analysis, with an anonymised copy of the recordings retained for the study files.

Equivalence testing
A randomised, cross-over design was used to compare the self-administered paper version and the phone-administered version of the QLQ-C30 in patients currently receiving treatment for cancer, following recommendations set out by the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) PRO Mixed Modes Good Research Practices Task Force [19]. Patients were randomised (1:1) to complete either the paper or the phone-administered version first. Randomisation was conducted using a random number generator. After providing informed consent, each patient completed a brief socio-demographic and clinical form. Depending on randomisation, patients were then asked either to complete the paper version of the QLQ-C30 and return it to the recruitment agency in a prepaid envelope, or to respond to the questionnaire by phone following the phone script as presented by the interviewer, a trained qualitative researcher. The interviewers recorded patients' responses on a paper version of the QLQ-C30. The paper version of the QLQ-C30 was estimated to take approximately 30 min to complete, and the administration time for the phone version was recorded for each patient. Any comments or observations made by the patient during the phone administration were recorded on a feedback form.
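The 1:1 allocation step can be sketched as follows. This is a minimal illustration only: the study reports that randomisation used a random number generator, but the exact procedure is not described, and the function name and labels below are assumptions.

```python
import random

def allocate_modes(n_patients, seed=None):
    """Balanced 1:1 allocation of patients to a first administration mode."""
    rng = random.Random(seed)
    # Build a balanced list of allocations, then shuffle its order.
    half = n_patients // 2
    allocations = ["paper-first"] * half + ["phone-first"] * (n_patients - half)
    rng.shuffle(allocations)
    return allocations

alloc = allocate_modes(63, seed=1)
print(alloc.count("paper-first"), alloc.count("phone-first"))  # 31 32
```

With an odd sample size such as 63, a balanced scheme necessarily leaves the two arms differing by one patient, consistent with the 31/32 split reported in the results.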
Two days after the first completion of the QLQ-C30, patients were asked to complete it again using the other mode of administration. The date of completion of the paper version was noted for each patient, to assess compliance with the pre-specified 2-day time frame. For patients who completed the phone interview first, the recruitment agency waited for confirmation of interview completion from the study team before sending the paper version by post.

Data analysis
Patients were described in terms of clinical and socio-demographic variables, as reported during the phone interview (pilot-testing) or on the socio-demographic/clinical form (equivalence testing). Age, gender, educational status, and disease history were reported. All data processing and analyses were performed with SAS® software for Windows, Version 9.2 or later (SAS Institute, Inc., Cary, NC, USA).

Pilot-testing
Feedback from patients was compiled in an analysis grid, and reported per patient based on a qualitative assessment of the questionnaire, its instructions and individual items, with any additional comments also recorded.

Equivalence testing
All patients who met the inclusion criteria and completed enough items in the QLQ-C30 questionnaire during each administration for each domain to be scored were included in the equivalence testing analysis. Responses to items from the QLQ-C30 were described based on completion and distribution of responses per administration mode. Missing data were described in terms of number and percent of missing responses per item along with number and percent of missing items per patient, including the number of patients with at least one missing item. Continuous variables were described based on their frequency, mean, standard deviation, median, first and third quartiles, and minimum and maximum values. Categorical variables were described based on the frequency and percentage of each response choice, with missing data included in the calculation of percentage.
Equivalence testing was performed at both the item and domain score levels, with the primary objective of evaluating equivalence at the score level between both modes of administration using ICC [20]. A two-way mixed effects, consistency, single rater/measurement approach, described by Fleiss et al. [20], was used to calculate ICCs and their confidence intervals. The widely used benchmark of ICC ≥ 0.70 was applied [21], with ICC values between 0.75 and 0.90 indicating good agreement and values greater than 0.90 indicating excellent agreement [22]. Weighted kappa coefficients [23] were used to assess the extent to which both administration modes produced the same responses by patients to the QLQ-C30 items (results are reported in Online Appendix A). Following Fleiss' guidelines [24], a kappa value greater than 0.75 was characterised as excellent, 0.40-0.75 as fair to good, and less than 0.40 as poor. Mean differences in item-level scores were also calculated and are displayed in Online Appendix B.
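For illustration, a two-way, consistency, single-measurement ICC (often labelled ICC(3,1)) can be computed from paired scores as sketched below. This is an illustrative reimplementation from the standard two-way ANOVA decomposition, not the study's SAS code, and the example data are hypothetical.

```python
def icc_consistency(scores):
    """Two-way mixed effects, consistency, single-measurement ICC (ICC(3,1)).

    scores: list of [mode_a, mode_b, ...] rows, one row per patient.
    Returns (MS_rows - MS_error) / (MS_rows + (k - 1) * MS_error).
    """
    n = len(scores)        # number of patients (rows)
    k = len(scores[0])     # number of administration modes (columns)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between modes
    ss_error = ss_total - ss_rows - ss_cols                 # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# A constant shift between modes leaves the consistency ICC at 1.0,
# because consistency ignores systematic mode effects:
paired = [[10, 15], [20, 25], [30, 35], [40, 45]]
print(icc_consistency(paired))  # 1.0
```

The "consistency" definition removes the column (mode) effect from the error term, which is why a uniform offset between paper and phone scores would not lower the ICC; an "absolute agreement" ICC would penalise such an offset.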
To ensure robustness of results between the two waves of recruitment, a sensitivity analysis was conducted to compare the ICC values between patients included prior to the study amendment (first wave of recruitment: n = 26) and those included after (second wave of recruitment: n = 37) using scores from the paper and phone administration modes of the QLQ-C30. Additional sensitivity analyses were conducted on the full group of patients included in the equivalence testing (n = 63) to compare ICC scores by age (< 60 vs. ≥ 60) and gender.

Pilot-testing
Five patients (three males and two females) with a mean age of 51 years completed the pilot-testing interviews. Patients had either liver, testicular, bowel cancer, or lymphoma, and one patient had breast, lung, and bowel cancer, as well as secondary liver cancer. More details regarding demographic and clinical characteristics are provided in Table 1.

Equivalence testing
Sixty-three patients (26 from the first wave and 37 from the second wave) made up the total sample included in the equivalence testing. Patients had a mean age of 55 years and 65% were female. Almost half of the sample (48%) was employed full- or part-time and 76% of patients were living as a couple. Education levels varied, with 41% of patients having obtained a bachelor's or postgraduate degree. Breast cancer was the most common disease type, reported in 29% of patients, followed by prostate (11%), lung (10%), and bowel (6%) cancers. A large proportion of patients (41%) reported "other" disease types. The most common treatments were chemotherapy (25%) and hormone therapy (16%); other types of treatment included surgery (11%), radiotherapy (10%), biological therapy (13%), mixed therapy (8%), and "other" treatments (18%). Detailed demographic and clinical characteristics are provided in Table 2, presented to indicate patients who completed the paper (n = 31) or phone (n = 32) versions of the QLQ-C30 first.
Participants in both waves of testing were largely similar, with more considerable differences observed based on treatment type. Whereas 16.2% of patients reported undergoing surgery in the second wave of testing, only 3.8% reported it in the first wave. Moreover, no patients reported use of biological therapy in the second wave of testing, while 30.8% of patients reported it in the first wave. The full comparison of socio-demographic differences is presented in Table 3.

Pilot-testing
All patients considered the instructions in the phone script to be clear and straightforward. Three comments were raised concerning the time and response scales of the questionnaire. Two patients made comments regarding the time scales, specifying that "during the past week" was too short a time frame. However, these comments deviated from the source questionnaire and were thus not integrated into the script. One patient suggested numbering the response options from 1 to 4, for clarity. After discussion with the study team, numbers 1 to 4 were added to the response options in the phone script, thereby creating the final version of the phone script in UK English.

Equivalence testing
All patients from both testing waves (n = 63) completed all items in both the paper and phone versions of the QLQ-C30, and there were no missing data. Equivalence testing results (ICCs) per domain are presented in Table 5. Results for mean differences at the item level are displayed in Online Appendix B. At the domain level, differences between modes were small in absolute magnitude, ranging from 0.00 to 11.00 points.
The mean time for completion of the phone version of the QLQ-C30 was 8.6 ± 1.9 min and 39 participants (62%) made comments or asked questions during the interview.
Sensitivity analyses comparing patients included before the study amendment (n = 26) with those included after (n = 37) revealed significant differences (i.e. non-overlapping 95% CIs) only for the nausea and vomiting ICC, which was lower in the first wave of patients, and the constipation ICC, which was lower in the second wave. The full results are displayed in Table 6.
The results of additional sensitivity analyses to assess possible differences in scores based on age (< 60 versus ≥ 60) and gender are displayed in Tables 7 and 8.

Discussion
This study aimed to develop and validate a voice script for phone administration of the QLQ-C30 and evaluate its equivalence to paper administration in a sample of patients actively undergoing cancer treatment. During pilot-testing, the voice script was deemed understandable and relevant with minimal comments received from patients.
Results from the final sample of patients included in the equivalence testing indicated good equivalence between paper and phone administration modes, with all total ICC scores above the 0.70 threshold, ranging from 0.72 to 0.90. When paper administration came first, two ICC scores fell below the 0.70 threshold: nausea and vomiting (ICC 0.55; 95% CI 0.24-0.76) and financial difficulties (ICC 0.60; 95% CI 0.31-0.79). When comparing differences in means at the domain score level, the differences remained well below 10 points for both administration modes, suggesting minimal clinically meaningful differences despite the lower ICCs for these two scales [25]. Failure to reach the 0.70 ICC threshold for nausea and vomiting may also reflect greater ambiguity surrounding the rating of nausea. While vomiting is a concrete occurrence, and it is unlikely that a patient's recollection of it would change over a 2-day timeframe, nausea may be subject to broader interpretation. Moreover, medications that help resolve these symptoms on a day-to-day basis are generally readily available to patients, meaning the symptoms themselves can change within a 2-day period. For financial difficulties, the potential source of discrepancy is less clear. There may have been an issue that was not detectable within the scope of this study, or the lower ICC may simply reflect random error or noise.
A more general limitation of using ICC to assess equivalence is that the absolute size of a given ICC depends on the variation observed within the sample. As such, minimal variation in nausea and vomiting and financial difficulties scores may have contributed to the lower ICCs. Still, the ICCs for nausea and vomiting and financial difficulties remained above 0.50 for paper-first administration, indicating that they stayed within the minimally acceptable range, especially since the total ICC scores for both scales were over 0.70. It is worth noting that the nausea and vomiting domain score has performed poorly in a previous test-retest study carried out by Hjermstad et al. [26], so there may be other factors influencing that scale which were not identified in this study. Such factors could also account for the lower ICC score found for nausea and vomiting when the paper version was administered first.
Differences in mean scores at the domain score level were uniformly minimal, suggesting that, overall, results from both administration modes were equivalent. The relatively short completion time of 8.6 ± 1.9 min for the voice script suggests that it can be integrated into a study protocol with relative ease and minimal patient burden.
Following guidelines from ISPOR's PRO Mixed Modes Good Research Practices Task Force, and drawing on methodology used in similar PRO equivalence studies [13,15], this study had a number of strengths. The randomised cross-over design helped to minimise the potential for bias towards either administration mode, and the inclusion criteria ensured that the voice script was evaluated and tested by patients for whom it would be relevant and feasible (i.e. those actively undergoing treatment, with the appropriate language level). The final sample of patients was diverse and sufficiently well-balanced in terms of demographic and clinical characteristics, helping to ensure representativeness across patients and disease types. Analyses were also strengthened by the fact that there were no missing data in either the paper or phone-administered versions of the questionnaire, making the results more easily interpretable. The decision to decrease the initial ICC threshold from ≥ 0.90 to ≥ 0.70, following the study amendment to include a second wave of testing, is well supported by the literature, in which an ICC of ≥ 0.70 has been used to evaluate equivalence in similar studies [13][14][15].
Moreover, the total ICCs for both waves were largely similar. Significant differences were only observed for nausea and vomiting, which was lower in the first wave, and constipation, which was lower in the second wave of testing. Although the same recruitment procedures and inclusion criteria were applied, and demographic and clinical characteristics were largely comparable between groups, differences in gender and age distribution were found between the two waves of testing. In light of these differences, sensitivity analyses were carried out across all participants by age and gender. While most scores were similar across groups, differences were found in the ICCs for the nausea and vomiting and diarrhoea domain scores, with lower ICCs among younger patients on these domains compared to older patients. ICCs were also lower for males than for females on both the nausea and vomiting and physical functioning scales.
Despite these findings, when examining all other ICCs and comparing them across subgroups, no consistent pattern was identified which would support a potential correlation between lower or higher ICCs and the age or gender of patients. Moreover, the limited sample size makes it difficult to draw robust conclusions at the subgroup level. Factors other than age and gender may be related to the experience of disease and treatment, and may help to account for the differences observed; however, such interpretations are beyond the scope of this study, which is limited to the available demographic and clinical data.
Overall, sensitivity analyses showed that differences observed between the two waves of testing are minimal, thereby further supporting the equivalence of paper-and-pen and phone administration modes.

Conclusions
Results from this study support the equivalence of paper and phone administration modes of the QLQ-C30. In addition to its initial source language (UK English) development, the QLQ-C30 voice script is now available in multiple other languages, with more translations anticipated in the future. By providing an alternative means of questionnaire completion, the QLQ-C30 voice script helps to ensure that the questionnaire remains accessible in multiple formats across a wide range of patients.