Introduction

Hepatocellular carcinoma (HCC) is a substantial global health challenge that accounts for 85% to 90% of all reported cases of liver cancer and is the fourth most common cause of cancer-related death [1]. In addition, between 80 and 90% of people worldwide with HCC have comorbid hepatitis B virus (HBV) and/or hepatitis C virus (HCV) infection [2, 3]. The majority of HCC cases (> 80%) occur in Eastern Asia and sub-Saharan Africa, with typical incidence rates of > 20 per 100,000 individuals: China alone accounts for approximately 50% of both new HCC cases and HCC-related deaths worldwide [4, 5]. Southern European countries, such as Spain, Italy, and Greece, have higher incidence rates (10 to 20 per 100,000 individuals) in comparison to Northern Europe and the Americas [4, 5].

Patients with unresectable HCC represent a population with great unmet medical need, having a 5-year overall survival (OS) rate of 18% [6]. These patients often report symptoms (e.g., muscle cramps, pain, fatigue, sleep dysfunction) severe enough to affect their health-related quality of life (HRQoL) [7]. Furthermore, these symptoms affecting HRQoL have been found to correlate with shorter OS [7,8,9,10]. As a result, there has been a shift toward increased recognition of the need to assess HRQoL alongside traditional clinical outcomes in HCC trials [11]. Several different questionnaires have been employed to measure HRQoL in studies of HCC [7]; however, only the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire Hepatocellular Carcinoma 18-question module (EORTC QLQ-HCC18) was developed specifically to assess symptom burden and impact on HRQoL in people with HCC [12, 13].

As it stands, there are limited published data demonstrating the measurement properties of the QLQ-HCC18 within an unresectable HCC population, as well as within specific subpopulations defined by viral hepatitis comorbidity (comorbid HBV and HCV versus no comorbidity), line of therapy (second- versus third-line or greater), and geographic region (Asia versus Europe). Furthermore, existing validation evidence supporting the robust psychometric properties of the QLQ-HCC18 was obtained within HCC populations distinct from that of the BGB-A317-208 trial population. In contrast to the BGB-A317-208 population, most patients in those studies had early-stage disease (i.e., Barcelona Clinic Liver Cancer [BCLC] A) and had previously undergone surgical treatment, ablation, or embolization [12, 13]. Very few patients in previous validation studies received systemic therapy, whereas all patients in this trial had received previous systemic therapy. Given these differences in the context of use, the objective of the current project was to validate the QLQ-HCC18 within the BGB-A317-208 trial population. In addition to the context of use motivation, there are currently no published thresholds of meaningful within-patient change (MWPC) for the QLQ-HCC18 as recommended under US Food and Drug Administration (FDA) draft guidance 3 [14]. Thus, following FDA guidance [14, 15], analyses of the QLQ-HCC18 were conducted to evaluate measurement properties (reliability, construct validity, ability to detect change, and MWPC) within this patient population.

Methods

This validation study was conducted using BGB-A317-208 trial data. BGB-A317-208 (NCT0341989) was an open-label, multicenter, international, phase 2 clinical trial assessing the efficacy and safety of tislelizumab, an investigational humanized immunoglobulin G4 (IgG4) monoclonal antibody with high affinity and binding specificity for programmed cell death protein-1 (PD-1) [16, 17], in patients with unresectable HCC. Enrolled patients received tislelizumab (200 mg) intravenously every three weeks for a total of three or more 21-day treatment cycles, followed by long-term safety and survival assessments.

The protocol, any amendments, and informed consent form were reviewed and approved by the Independent Ethics Committees or Institutional Review Board in conformance with Good Clinical Practice and applicable regulatory requirements. This study was conducted in accordance with sponsor procedures, which comply with the principles of Good Clinical Practice, International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use Guidelines, the Declaration of Helsinki, and local regulatory requirements. The consent forms were signed and dated by the patient or the patient’s legally authorized representative before his or her participation in the study. A copy of each signed consent form was provided to the patient or the patient’s legally authorized representative and all signed and dated consent forms were retained in each patient’s study file or in the site file.

Patients

Patients were male and female adults (≥ 18 years of age), enrolled from international study sites, with histologically confirmed HCC that was not amenable to a curative treatment approach and who had received ≥ 1 line of systemic therapy for unresectable HCC. In addition, patients were BCLC stage C or B not amenable to locoregional therapy or relapsed after locoregional therapy, and not amenable to a curative treatment approach. Patients also had a Child–Pugh A classification. All patients were required to have an Eastern Cooperative Oncology Group (ECOG) performance status score of ≤ 1 [18].

Measures

HRQoL was assessed using three patient-reported outcome (PRO) instruments: the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire Core 30 (EORTC QLQ-C30), the corresponding HCC-specific module (QLQ-HCC18), and the EQ-5D-5L. These PROs were collected at baseline and the first day of treatment cycle 2 (week 3), then every other treatment cycle up to cycle 12 (week 36). At each treatment cycle visit, the PRO administration occurred prior to any clinical activities or dosing. For purposes of this psychometric analysis, only QLQ-HCC18 and QLQ-C30 results are reported (the EQ-5D-5L was not employed in validation).

The EORTC QLQ-C30 [19] is a validated generic HRQoL instrument for cancer patients that comprises a global health status (GHS)/QoL scale (two items); five functional scales: physical functioning (five items), role functioning (two items), emotional functioning (four items), cognitive functioning (two items), and social functioning (two items); three symptom scales: fatigue (three items), nausea and vomiting (two items), and pain (two items); and six single items: dyspnea, insomnia, appetite loss, constipation, diarrhea, and financial impact (one item each) [20]. The functional and symptom items are rated on a 4-point Likert scale (with 1 = ‘not at all’ to 4 = ‘very much’), while the GHS items are rated on a 7-point Likert scale (with 1 = ‘very poor’ to 7 = ‘excellent’). A high score on the GHS and functional scales indicates high HRQoL and a high level of functioning, whereas a high score on the symptom scales and items indicates a high level of symptom severity. The two individual GHS items were used as concurrent validators. The GHS scale of the QLQ-C30 was used as the PRO anchor variable in test–retest reliability, ability to detect change, and meaningful within-patient change analyses.

The EORTC QLQ-HCC18 [21] measures HCC-specific symptoms and HRQoL. The instrument is an 18-item scale, consisting of six symptom scales and two single items: fatigue (three items), body image (two items), jaundice (two items), nutrition (five items), pain (two items), fever (two items), sexual interest (one item), and abdominal swelling (one item). Scores are based on a 4-point Likert scale (with 1 = ‘not at all’ to 4 = ‘very much’); scaled scores for each domain ranged from 0–100 with a higher score indicating worse symptoms. In addition, an overall QLQ-HCC18 index score was defined to provide an overall characterization of all domains/items. The index score was calculated as the average of all non-missing QLQ-HCC18 scales [9]. Index scores ranged from 0–100, with a higher score indicating overall worse symptoms. Reporting of fatigue and index scores was prioritized in this validation exercise because these domains are important for the assessment of PRO-based clinical significance in the BGB-A317-208 trial. Moreover, in the case of cancer-related fatigue, the field has recognized the importance of this construct and it satisfies the definition of a proximal symptomatic measure of cancer severity [22].
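The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the developer's scoring program: it assumes the standard EORTC linear transformation for symptom scales (raw score = mean of answered items, rescaled from the 1 to 4 response range to 0 to 100) and the commonly used rule that a scale is scored only when at least half of its items are answered. The function names and the example responses are hypothetical, and the assignment of items to scales is left to the caller.

```python
from typing import Dict, Optional, Sequence

ITEM_MIN, ITEM_MAX = 1, 4  # 4-point response scale described in the text


def scale_score(items: Sequence[Optional[int]]) -> Optional[float]:
    """Return the 0-100 scaled score for one scale or single item.

    Assumed missing-data rule: the scale is scored from the answered items
    only if at least half of its items were answered; otherwise None.
    """
    answered = [x for x in items if x is not None]
    if not answered or len(answered) * 2 < len(items):
        return None
    raw = sum(answered) / len(answered)          # raw score = mean of answered items
    return (raw - ITEM_MIN) / (ITEM_MAX - ITEM_MIN) * 100.0


def index_score(scale_scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Average of all non-missing scale scores, as the index is defined in the text."""
    present = [s for s in scale_scores.values() if s is not None]
    return sum(present) / len(present) if present else None


# Hypothetical responses for three scales of one patient (None = unanswered item).
responses = {"fatigue": [2, 3, None], "pain": [1, 1], "fever": [None, None]}
scores = {name: scale_score(items) for name, items in responses.items()}
overall = index_score(scores)  # fever is missing and is left out of the average
```

Higher values indicate worse symptoms, so a patient who answers 'not at all' to every item scores 0 on every scale and on the index.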

The ECOG performance status [18], a clinical measure of disease severity, was also used as a known-groups validator for this psychometric analysis. The ECOG criteria are used to assess how a patient’s disease is progressing and the effect of the disease on the patient’s activities of daily living. ECOG performance status was assessed at the baseline visit.

In addition, demographic and medical history data, including age, sex, race, geographic region, line of therapy, and viral hepatitis infection status, were collected at the screening visit.

Statistical analyses

In accordance with existing and emerging FDA guidance [13, 21], psychometric validation of the QLQ-HCC18 was conducted to measure the reliability (internal consistency and test–retest), construct validity (convergent and discriminant validity and known-groups validity), ability to detect change, and MWPC. These analyses were conducted using the safety population, which included all patients receiving at least one dose of tislelizumab. Known-groups validity and MWPC analyses were stratified on several pre-defined subpopulations, including region (China/Taiwan versus Europe), line of therapy (second-line versus third-line or greater), and viral hepatitis infection status (HBV/HCV positive versus hepatitis negative). Table 1 provides a summary of these analyses.

Table 1 Summary of psychometric analyses of QLQ-HCC18

Descriptive statistics for continuous variables were reported as means, standard deviations (SDs), medians, and missing values. Descriptive statistics for categorical variables were reported as frequency counts and the percentage of patients in corresponding categories. Statistical significance was evaluated using a two-tailed α = 0.05 level. Missing data for the QLQ-HCC18 and QLQ-C30 were handled according to the developer’s manuals and no imputation was carried out [21, 23]. All analyses were performed using SAS (version 9.4) and R statistical software (version 3.6.1).

It is important to note that several analyses were stratified by region (strata: China/Taiwan and Europe). These included known-groups validity and meaningful within-patient change. This stratification was motivated by guidance from the Chinese National Medical Products Administration (NMPA), which requires stratification to demonstrate the evidence unique to the Chinese population and whether this differs from the aggregate findings.

Reliability

Internal consistency evaluates score reliability by assessing the strength with which each item measures an assumed single domain. Internal consistency was assessed for each of the multi-item QLQ-HCC18 scales at baseline using Cronbach’s alpha [24]. Internal consistency estimates of ≥ 0.70 were considered acceptable [19].
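For readers less familiar with the statistic, Cronbach's alpha for a scale of k items is k/(k − 1) × (1 − sum of item variances / variance of the summed score). The sketch below is illustrative only and assumes complete cases (patients with every item of the scale answered); the function name and the threshold constant are hypothetical names chosen here, with 0.70 taken from the criterion stated above.

```python
from statistics import variance
from typing import Sequence

ACCEPTABLE_ALPHA = 0.70  # acceptability criterion stated in the text


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for one multi-item scale.

    rows: one sequence of item responses per patient, complete cases only
    (an assumption of this sketch).
    """
    if len(rows) < 2 or len(rows[0]) < 2:
        raise ValueError("need at least two patients and two items")
    k = len(rows[0])
    # Sample variance of each item across patients.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Single-item scores (sexual interest, abdominal swelling) have no alpha, which is why only the multi-item scales are assessed this way.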

Test–retest reliability consists of measuring the degree to which an instrument is capable of reproducing scores across time in patients whose condition has not changed [21]. Patients whose responses on the QLQ-C30 GHS scale anchor reflected no change in status between baseline and the first follow-up at week 3 were considered a stable subgroup and test–retest reliability was assessed for each of the QLQ-HCC18 scales and single items. In the case of a continuous score, one appropriate measure of test–retest reliability is the two-way random intraclass correlation coefficient (ICC), employed in this analysis and denoted as ICC(2,1) [25]. Test–retest reliability estimates of ≥ 0.70 indicate satisfactory reliability [26]. Both unconditional estimates and estimates conditioned on no change in GHS were computed. Consistent with regulatory guidance, only estimates derived from the primary GHS anchor-based no-change definition (NC1, defined by GHS change score of 0 between baseline and week 3) are reported [13, 21, 27]. To limit the impact of possible treatment effects, three definitions of no change were examined in sensitivity analyses: unconditional, + 1 response category (‘NC2’), or + 2 response categories (‘NC3’). None of these definitions outperformed the pre-specified primary NC1 definition reported in this manuscript.
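As an illustration of the ICC(2,1) estimate named above, the following sketch computes the two-way random-effects, absolute-agreement, single-measurement ICC from the usual ANOVA mean squares. It assumes one row per stable patient with one score per assessment (baseline and week 3) and no missing values; the function name is hypothetical and this is not the study's SAS or R code.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    scores: n rows (patients) by k columns (assessments), complete data.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between assessments
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because ICC(2,1) measures absolute agreement, a uniform shift between assessments lowers the estimate even when patients keep exactly the same rank order, which is one reason the analysis is restricted to patients whose anchor indicates no change.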

Construct validity

Construct validity was assessed by tests of both convergent and discriminant validity and known-groups validity. Convergent and discriminant validity are components of construct validity representing the extent to which scales assessing similar constructs are related and scales assessing dissimilar constructs are not. This was estimated from Spearman correlations between the QLQ-HCC18 and QLQ-C30 scores at baseline. Moderate to strong correlations reflect convergent validity while small correlations reflect discriminant validity [27]. QLQ-HCC18 domains (which are symptom-focused) were expected to correlate positively with QLQ-C30 symptom domains, negatively with QLQ-C30 functional domains, and negatively with the QLQ-C30 GHS. For example, the QLQ-HCC18 fatigue domain was expected to correlate with the QLQ-C30 fatigue domain strongly and positively. The QLQ-HCC18 fatigue domain was expected to correlate with the QLQ-C30 physical function domain moderately and negatively. Finally, the QLQ-HCC18 fatigue domain was expected to correlate with the QLQ-C30 GHS moderately and negatively. Spearman correlations of |r|≥ 0.40 met the pre-specified criterion for acceptable convergent validity [26]. Given the exploratory nature of this analysis within this population for the purposes of identification of relevant phase 3 endpoints, no further hypotheses were specified for correlation-based analyses.
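The correlation criterion above can be made concrete with a small, dependency-free sketch. It computes Spearman's coefficient as the Pearson correlation of average ranks (ties share the mean of their rank positions) and checks it against the |r| ≥ 0.40 criterion. The function names are hypothetical, paired complete observations are assumed, and this is not the code used in the study.

```python
from math import sqrt
from typing import List, Sequence

CONVERGENT_THRESHOLD = 0.40  # pre-specified criterion stated in the text


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of 1-based positions i..j
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def meets_convergent_criterion(r: float) -> bool:
    """True when the absolute correlation reaches the pre-specified criterion."""
    return abs(r) >= CONVERGENT_THRESHOLD
```

The absolute value matters because the expected sign differs by comparator: positive against QLQ-C30 symptom scales, negative against functional scales and the GHS.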

Known-groups validity assesses whether PRO scores can be differentiated between clinically distinct groups. Known-groups validity was estimated for the QLQ-HCC18 scores at baseline. Known-groups validators included geographic region (Asia versus Europe), line of therapy (second-line versus third-line or greater), ECOG status (0 versus 1), and viral hepatitis infection status (HBV/HCV positive versus hepatitis negative). Consistent with previous studies, the hypothesized direction of effect predicted that Europe would report lower quality of life than Asia [12, 28], third-line or greater would report lower QoL than second line, worse ECOG status would report lower QoL than better ECOG status, and that HBV/HCV infected patients would report lower quality of life than non-infected patients. The difference in QLQ-HCC18 scores between each known-group was calculated and contrasted using analysis of variance (ANOVA), from which the mean difference between known-groups, corresponding 95% confidence interval (CI), P-value, and R-squared (R2) effect size were estimated. Acceptable known-groups validity was achieved if a preponderance of the known-effect-groups had QLQ-HCC18 mean scores consistent with clinical expectations (i.e., more severe groups had worse symptoms or HRQoL compared to less severe groups). Such evidence was strengthened if and when the corresponding differences across known-groups were statistically significant and the corresponding R2 was greater than 5%. Methods to correct for multiple comparisons were not employed as part of the known-groups analysis.
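For a two-group comparison such as those above, the mean difference and the R2 effect size follow directly from the ANOVA sums of squares (R2 = between-group sum of squares / total sum of squares). The sketch below illustrates only those two quantities under that assumption; the confidence interval and P-value reported in the study are omitted because they require an F or t distribution, and the function name and example scores are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence

R2_SUPPORT_THRESHOLD = 0.05  # "greater than 5%" criterion stated in the text


def known_groups_summary(group_a: Sequence[float],
                         group_b: Sequence[float]) -> Dict[str, float]:
    """Mean difference (a minus b) and one-way ANOVA R-squared for two groups."""
    all_scores = list(group_a) + list(group_b)
    grand = mean(all_scores)
    mean_a, mean_b = mean(group_a), mean(group_b)
    ss_between = (len(group_a) * (mean_a - grand) ** 2
                  + len(group_b) * (mean_b - grand) ** 2)
    ss_total = sum((x - grand) ** 2 for x in all_scores)
    return {"mean_difference": mean_a - mean_b,
            "r_squared": ss_between / ss_total}


# Hypothetical baseline scores for two known groups (higher = worse symptoms).
summary = known_groups_summary([40, 50, 60], [20, 30, 40])
```

A positive mean difference here means the first group reports worse symptoms, so the sign can be compared directly with the hypothesized direction of effect.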

Ability to detect change

Ability to detect change is a facet of longitudinal validity that evaluates the relationship between changes in the PRO instrument of interest over time in the context of changes in external criteria (i.e., ‘anchors’) [29]. Ability to detect change was assessed by analyzing the extent to which QLQ-HCC18 change scores could be predicted by change in the QLQ-C30 GHS anchor variable. The QLQ-C30 GHS anchor groups were operationalized as follows: improvement was defined by > 0-point change from baseline to week 9; maintenance was defined as 0-point change from baseline to week 9; deterioration was defined as < 0-point change from baseline to week 9.

Analysis of covariance (ANCOVA) was used to estimate differences in QLQ-HCC18 change score marginal means across QLQ-C30 GHS anchor groups (improvement [effect] versus maintenance [reference]; deterioration [effect] versus maintenance [reference]), controlling for age, sex, region, and baseline QLQ-HCC18 mean. Effect size estimates were based on the Omega squared (ω2) statistic [30]. Acceptable ability to detect change was pre-specified as estimates meeting the following criteria: significant differences (P < 0.05) in marginal means across anchor group contrasts and effect sizes exceeding 5%.
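To show what the omega squared effect size measures, the sketch below computes it for a plain one-way layout: ω2 = (SS_between − df_between × MS_within) / (SS_total + MS_within). This is a deliberate simplification and an assumption of the example: the study's ANCOVA also adjusts for age, sex, region, and baseline score, which this sketch leaves out. The function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def omega_squared(groups: Sequence[Sequence[float]]) -> float:
    """One-way omega squared (no covariates; a simplification of the ANCOVA)."""
    all_scores = [x for g in groups for x in g]
    grand = mean(all_scores)
    group_means = [mean(g) for g in groups]

    ss_between = sum(len(g) * (m - grand) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    ms_within = ss_within / df_within
    ss_total = ss_between + ss_within
    return (ss_between - df_between * ms_within) / (ss_total + ms_within)
```

Unlike R2, omega squared subtracts the variance expected by chance from the numerator, so it is less biased upward in small samples and can be slightly negative when groups barely differ.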

Meaningful within-patient change

Traditional estimation of meaningful change thresholds has relied on distribution and anchor-based methods. Increasingly, regulatory reviewers are emphasizing the latter; therefore, anchor-based methods were the focus of the current analyses [13, 21, 24]. Furthermore, such estimates have emphasized between-group differences (e.g., minimally important differences or minimal clinically important differences). The FDA has justifiably taken the position that within-patient change is not acceptably approximated from between-group differences. Instead, regulatory guidance emphasizes MWPC for the derivation of clinical significance estimates [21].

Anchor-based methods aim to define the magnitude of MWPC on a PRO instrument of interest among patients classified as experiencing meaningful change (improvement/deterioration) on an ‘anchor.’ Anchor-based MWPC thresholds were obtained via calculation of mean change in QLQ-HCC18 scores from baseline to week 9 stratified on the QLQ-C30 GHS anchor groups described above. In addition to primary analyses based on the total sample, meaningful improvement estimates were stratified by geographic region (Asia versus Europe), line of therapy (second-line versus third-line or greater), and viral hepatitis infection status (HBV/HCV positive versus hepatitis negative). These stratified estimates were employed to assess the uniformity in clinical significance threshold estimates across known subgroups within the trial, and to characterize unique effects within the China/Taiwan population, as required by NMPA guidance. These estimates of mean change were then validated by visualizing differences in cumulative proportions achieving the point estimates stratified on anchor groups via empirical cumulative distribution functions (eCDFs) and empirical probability density functions (ePDFs).
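The anchor-based procedure can be sketched as follows, assuming each patient contributes a pair of change scores from baseline to week 9: the QLQ-C30 GHS change (the anchor) and the QLQ-HCC18 change. Anchor groups follow the definitions given earlier (GHS change above 0 is improvement, 0 is maintenance, below 0 is deterioration), the MWPC point estimate is the mean QLQ-HCC18 change within a group, and the eCDF value at a threshold is the proportion of a group with change at or below it. All names and the example records are hypothetical; this is an illustration, not the study's code.

```python
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Tuple


def anchor_group(ghs_change: float) -> str:
    """Classify a patient by change in the QLQ-C30 GHS anchor (higher GHS = better)."""
    if ghs_change > 0:
        return "improvement"
    if ghs_change < 0:
        return "deterioration"
    return "maintenance"


def changes_by_anchor(records: Iterable[Tuple[float, float]]) -> Dict[str, List[float]]:
    """Group QLQ-HCC18 change scores by anchor group.

    records: (ghs_change, hcc18_change) pairs, one per patient.
    """
    groups: Dict[str, List[float]] = {}
    for ghs_change, hcc18_change in records:
        groups.setdefault(anchor_group(ghs_change), []).append(hcc18_change)
    return groups


def mwpc_estimates(records: Iterable[Tuple[float, float]]) -> Dict[str, float]:
    """Mean QLQ-HCC18 change within each anchor group (anchor-based point estimate)."""
    return {g: mean(v) for g, v in changes_by_anchor(records).items()}


def ecdf_at(changes: Sequence[float], threshold: float) -> float:
    """Empirical CDF: proportion of change scores at or below the threshold."""
    return sum(1 for c in changes if c <= threshold) / len(changes)


# Hypothetical records: QLQ-HCC18 scores fall (negative change) when symptoms improve.
records = [(16.7, -11.1), (8.3, -5.6), (0.0, 0.0), (0.0, 2.0), (-8.3, 11.1)]
estimates = mwpc_estimates(records)
```

For an improvement threshold, comparing `ecdf_at` between the improvement and maintenance groups gives the separation that the eCDF plots display; for a deterioration threshold the relevant quantity is the proportion at or above the threshold instead.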

Results

A total of 249 patients (138 second-line and 111 third-line or greater) were enrolled from 45 international centers in the BGB-A317-208 trial. A sample size of 228 was calculated to provide a power of 0.97 to demonstrate that the objective response rate in patients with previously treated unresectable HCC is statistically higher than the historical rate of 7% in a binomial exact test at a one-sided alpha level of 0.025. The demographics and clinical characteristics of these patients are summarized in Table 2. Patients had an average age of 60.3 years, were mostly male (87.1%), 50.6% were Asian, approximately half had an ECOG score of 1, and the average elapsed time from diagnosis to first dose of study drug was 38.7 months. Approximately one third of the patients (36.1%) were not HBV/HCV infected and approximately half were experiencing progressive disease prior to entering the study (51.4%). The average elapsed time from last systemic therapy dose to first study dose was 3.4 months. These patterns were similar across second-line and third-line or greater cohorts. A single patient who did not contribute QLQ-HCC18 data at baseline was excluded, leaving a final sample of 248 patients for the psychometric analyses.

Table 2 Patient demographics and clinical characteristics

Reliability

The Cronbach’s alpha coefficients of three QLQ-HCC18 scores, namely fatigue, nutrition, and index, reflected acceptable internal consistency (0.71, 0.75, and 0.88, respectively). The remaining multi-item domains of body image, jaundice, pain, and fever did not display satisfactory internal consistency for this patient population (< 0.70).

Across the two assessments (baseline and 3-week follow-up) and across domains, 85–87 patients were included within the primary GHS-based no-change (NC1) population upon which test–retest reliability was estimated. Test–retest reliability ICC(2,1) estimates indicated satisfactory reliability for six QLQ-HCC18 scores: fatigue, body image, nutrition, pain, sexual interest, and index (0.72, 0.70, 0.73, 0.75, 0.79, and 0.83, respectively). The remaining domains of jaundice, fever, and abdominal swelling did not display adequate test–retest reliability (< 0.70).

Construct validity

Convergent and discriminant validity estimates are presented in Table 3. Results were largely consistent with expectations for which QLQ-HCC18 domains would demonstrate a preponderance of acceptable associations. Correlations between QLQ-HCC18 scores and QLQ-C30 fatigue, nausea and vomiting, and pain domains met or exceeded the pre-specified criterion of |r|≥ 0.4. As expected, the QLQ-HCC18 fatigue and pain domains correlated with QLQ-C30 fatigue and pain domains strongly and positively (0.76 and 0.60, respectively). The QLQ-HCC18 fatigue domain correlated with QLQ-C30 physical function and GHS strongly and negatively (–0.7 and –0.51 to –0.52, respectively). The fatigue domain achieved this pre-specified criterion for 13 out of 16 (81.3%) correlations, whereas the index score achieved this pre-specified criterion for 14 out of 16 (87.5%) correlations. Conversely, there were weak correlations between domains and items assessing divergent concepts, suggesting acceptable discriminant validity. For example, the correlation between the QLQ-HCC18 fever domain and the QLQ-C30 financial difficulties item was 0.21. The jaundice domain did not achieve the pre-specified criterion for any of the 16 correlations.

Table 3 Convergent and discriminant validity for the QLQ-HCC18 domains and the QLQ-C30 scores at baseline

The known-groups validity estimates are presented in Table 4. Known-groups validity of QLQ-HCC18 domains at baseline was assessed across groups defined by geographic region, line of therapy, ECOG status, and viral hepatitis status. As hypothesized, European patients had a significantly higher mean score than Asian patients for domains of fatigue (group difference: 5.28, P = 0.025), body image (group difference: 8.32, P < 0.001), jaundice (group difference: 4.88, P = 0.001), pain (group difference: 5.73, P = 0.010), and index (group difference: 4.85, P = 0.002). These mean differences were associated with effect sizes (R2) indicating 2% to 5% explained variance. A non-significant trend was observed for the remaining domains/items, whereby European patients had higher mean scores. As expected, patients in the third-line or greater therapy group had higher mean scores for all domains/items compared with patients in the second-line therapy group; however, only the jaundice domain demonstrated a significant group difference (group difference: 4.50, P = 0.003).

Table 4 Known-groups validity for QLQ-HCC18 domain and item scores at baseline

Patients with an ECOG score of 1 had a significantly higher mean score for the pain domain (group difference: 4.83, P = 0.030) compared with patients who had an ECOG score of 0. An unexpected trend was observed for jaundice (group difference: − 0.54, P = 0.726) and fever (group difference: − 2.32, P = 0.062), whereby patients with an ECOG score of 0 had higher scores. As expected, patients in the HBV/HCV positive group had higher mean scores for nutrition (group difference: − 0.49, P = 0.804), fever (group difference: − 0.67, P = 0.603), abdominal swelling (group difference: − 1.28, P = 0.650), and sexual interest (group difference: − 5.12, P = 0.241), although none of these differences was statistically significant. An unexpected trend was observed for fatigue, body image, jaundice, pain, and index, whereby patients in the HBV/HCV-negative group had higher mean scores.

The majority of known-groups validity estimates (81%) were consistent with the hypothesized direction of effect, thereby supporting validity of the QLQ-HCC18.

Ability to detect change

Change scores were computed for the QLQ-HCC18 scores based on the QLQ-C30 GHS scale anchor groups of improvement, maintenance, and deterioration. The ability to detect change estimates are presented in Table 5. Clear differentiation of the QLQ-HCC18 change scores between improvement and maintenance groups was observed for body image, fatigue, pain, and index. Effect sizes were small (less than 0.10), most likely because of the large variability in these data relative to the sample sizes, as indicated by the wide 95% CIs. No statistically significant differences were observed between improvement and maintenance groups for abdominal swelling, fever, jaundice, nutrition, and sexual interest. Clear differentiation of QLQ-HCC18 change scores between deterioration and maintenance groups was observed for fever and fatigue. No statistically significant differentiation was observed for the remaining QLQ-HCC18 symptom scores, including index.

Table 5 QLQ-HCC18 ability to detect change scores from baseline to week 9 by anchor group

Meaningful within-patient change

The point estimates for MWPC across anchor groups are presented for the total sample and stratified by region, line of therapy, and viral hepatitis infection status in Table 6. Within the primary (unstratified) analyses, point estimates for MWPC defining improvement were –7.18 for QLQ-HCC18 fatigue and − 4.07 for QLQ-HCC18 index. Meaningful improvement estimates for the index scale stratified on either region or HBV/HCV infection were identical to the primary estimates. Region-stratified estimates of meaningful improvement for fatigue were within ± 1 point of the primary estimates. Line of therapy stratified estimates were within ± 2 of primary estimates for both fatigue and index. The viral hepatitis negative sample achieved greater fatigue improvement (− 10) compared to the HBV/HCV infected sample (− 5).

Table 6 QLQ-HCC18 meaningful within-patient change estimates from baseline to week 9 by anchor group

Within the primary (unstratified) analyses, point estimates for MWPC defining deterioration for QLQ-HCC18 fatigue and index were 5.34 and 3.16, respectively. In the case of the fatigue domain, estimates stratifying on either region or HBV/HCV infection status were identical to the primary estimates (the one exception was Europe for which the estimate was 0.66 points higher). In the case of line of therapy, estimates were 2 and 9, respectively, for second-line and third-line or greater, reflecting greater heterogeneity relative to the primary estimates. In the case of the index scale, all stratified estimates were within ± 1 of the primary estimates and therefore unaltered across population stratification.

The point estimates for MWPC for each anchor group definition were validated using eCDF figures. In the case of meaningful improvement for fatigue domain scores, 60% of the improvement anchor group and 50% of the maintenance anchor group achieved the − 7.13 threshold, a difference of 10 percentage points in favor of the improvement group. In the case of meaningful deterioration for fatigue scores, 38% of the deterioration anchor group and 18% of the maintenance anchor group reached the 5.34 threshold, a difference of 20 percentage points. The eCDF for the QLQ-HCC18 fatigue score is presented in Fig. 1. The corresponding eCDF clarifies the overlap in fatigue domain change score distributions, but also demonstrates that the mass of distributions was offset as expected, with improvement skewed left, maintenance centered about a change score of zero, and deterioration skewed to the right.

Fig. 1 eCDF of QLQ-HCC18 fatigue domain change score from baseline to week 9 by anchor group. eCDF, empirical cumulative distribution function; QLQ-HCC18, Quality of Life Questionnaire – Hepatocellular Carcinoma 18-question module

Discussion

The present study examined the psychometric properties, namely reliability, construct validity, ability to detect change, and MWPC, of the EORTC QLQ-HCC18 instrument within the BGB-A317-208 trial population of patients with unresectable HCC. Within this population, evidence suggested that the QLQ-HCC18 demonstrates heterogeneous psychometric properties. However, the QLQ-HCC18 fatigue and index scores were found to consistently demonstrate robust psychometric properties.

With respect to reliability, this study found that only the QLQ-HCC18 fatigue, nutrition, and index domains demonstrated acceptable internal consistency at baseline. This is not surprising given that previous validation studies found low alpha coefficients for the QLQ-HCC18 jaundice, pain, and fever domains, citing heterogeneity within the HCC patient population as the cause [7, 9, 12]. Specifically, these studies suggested heterogeneity of the items within the scales and within the patient population (e.g., region, viral hepatitis status) may be contributing factors. That may be the case, though a simpler explanation likely exists, and is reviewed within the limitations section. Acceptable test–retest reliability was found for fatigue, body image, nutrition, pain, sexual interest, and index. The observed low ICC estimates for the jaundice domain may have resulted from few patients presenting with jaundice upon admission to the trial.

Convergent and discriminant validity analyses, as with all validation analyses within this phase 2 trial, were treated as exploratory beyond the hypotheses outlined in the Methods (i.e., direction of association with symptom, functional, and global health domains and the pre-specified criterion for acceptable association). No specific hypotheses about which domains would show greater or lesser association were pre-specified. Instead, the preponderance of evidence was examined to conclude broadly whether associations with a sufficient number of domains were detected to justify elevating a given QLQ-HCC18 domain from an exploratory to a secondary endpoint in a phase 3 clinical trial setting. Results were largely consistent with expectations for which QLQ-HCC18 domains would demonstrate a preponderance of acceptable associations. Going forward, this exploratory evidence will support confirmatory hypotheses in forthcoming phase 3 studies. The fatigue domain achieved the pre-specified criterion for 13 of the 16 correlations, whereas the index score achieved it for 14 of the 16 correlations. This was true for both convergent and discriminant validators. Most of the correlations with the QLQ-HCC18 jaundice domain and sexual interest item failed to meet the pre-specified criterion; this finding is supported by previous validation studies that reported weak correlations between the QLQ-HCC18 jaundice and sexual interest scores and both QLQ-C30 and FACT-Hepatobiliary scores [12, 13]. This is likely because these QLQ-HCC18 items are specific to symptoms/signs of HCC.

The majority of known-groups validity estimates (81%) were consistent with the hypothesized direction of effect, thereby supporting validity of the QLQ-HCC18. This suggests the QLQ-HCC18 can generally differentiate among distinct groups as hypothesized a priori.

Known-groups validity evidence for the geographic region effect was consistent with the hypothesized direction of effect, under which Europe was expected to report lower QoL/worse symptoms compared with Asia. This hypothesis was driven by findings from previous studies demonstrating that geographic area affects HRQoL in HCC. Specifically, Asian patients with HCC report significantly better scores in HCC18 scales (sexual interest, fatigue) than European patients [31]. It has been posited that these differences in scores stem from variability in management practices between Europe and Asia (i.e., active surveillance programs implemented in Asia) [28].

Interpretable ability to detect change between patients improving versus maintaining according to the pre-specified QLQ-C30 GHS anchor thresholds was found for the fatigue, body image, pain, and index domain change scores. The same was found for ability to detect change between patients deteriorating versus maintaining for the fatigue domain. As expected, unbiased effect size estimates were low, indicating less than 10% explained variance across domains. This is often the case in oncology trials due to heterogeneity within the patient population, which increases dispersion, thereby attenuating effect-size magnitudes within the data.

To date, this is the first study to estimate MWPC thresholds in line with the methods outlined in the latest FDA guidance [21]. In this study, the estimated anchor-based MWPC threshold defining clinical significance for the fatigue domain was found to be lower than previously reported within the literature for the QLQ-C30 [32, 33]. This may be due to the difference between the minimally important difference and MWPC frameworks and has implications for the application of historical QLQ-C30 meaningful change thresholds outside of the original context of use. The revised MWPC deterioration estimates can be employed to define thresholds for progression endpoints, such as time to deterioration. The same is true for improvement endpoints, for which evidence was generated in this analysis indicating an ability of the QLQ-HCC18 fatigue domain to detect meaningful clinical improvement, which is a rare phenomenon in oncology PRO applications.

While the results of this study are important, they should be considered alongside some limitations. The most noteworthy limitation is that many of the QLQ-HCC18 domains did not consistently demonstrate optimal measurement properties in this HCC population. Specifically, body image, jaundice, pain, fever, and abdominal swelling did not display acceptable reliability. However, it is important to note that these domains consist of the fewest items within the QLQ-HCC18 instrument. Consistent with theory and previous evidence, the reliability of a score has been found to increase as the number of items contributing to the score increases [34, 35]. Additional limitations were related to validity and MWPC for domains other than fatigue and index. Jaundice and sexual interest failed to display acceptable validity. In addition, fever, nutrition, jaundice, abdominal swelling, and sexual interest did not show adequate ability to detect change.

Taken together, the validation evidence suggested that the QLQ-HCC18 fatigue and index domains consistently demonstrated robust psychometric properties. This appears to support the use of the fatigue and index domains as suitable patient-reported endpoints within an unresectable HCC population that had previously received one or more systemic therapies. Moreover, the ability to detect change and meaningful within-patient change analyses demonstrated that an uncommon degree of improvement was observed in this trial and the QLQ-HCC18 fatigue domain scores sensitively detected the effect of tislelizumab.