Introduction

Besides survival and treatment-associated adverse events, patient-reported outcomes (PROs) are arguably the most relevant outcome parameters in oncology. A PRO is defined as ‘any outcome evaluated directly by the patient himself or herself and is based on patient’s perception of a disease and its treatment(s)’ [1]. PROs have many potential advantages as they may elucidate the relationship between clinical endpoints and the patient’s well-being [1], allowing for a more comprehensive evaluation of patients’ health [2].

Health-related quality of life (HRQoL) is a multidimensional PRO measure that is of special interest in oncology as it provides a ‘personal assessment of the burden and impact of a malignant disease and its treatment’ [1], thus adding valuable information for a true risk–benefit assessment. This is particularly relevant when prognosis is limited, as in primary malignancies of the liver. HRQoL tools can be divided into generic, cancer-specific, cancer-type-specific and utility-(preference-)based instruments [3]. While definitions, implementation, evaluation and analyses of survival and toxicity/complication endpoints have been well standardized over the last decades, PROs are still under-evaluated and under-reported in most clinical settings. Multiple studies have aimed to define suitable HRQoL tools for different clinical settings, e.g. [4, 5], including cancer patients [6,7,8].

Hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (CCA) account for more than 95% of all primary malignant liver tumours. Hepatitis B and C infections are the most prominent risk factors for HCC [9]. More than 840,000 patients were newly diagnosed with HCC or CCA in 2018, and numbers are estimated to rise to more than 1.3 million annually by 2040 [10]. Although age-standardized incidence rates are moderate in the Western World, they are high in most parts of Asia and parts of West Africa [10], making HCC one of the most frequent tumours in these parts of the world. Prognosis is dismal, with 5-year overall survival being around 15% in the USA and 5% in low-income countries [9]. Surgical resection, medical treatment (e.g. chemotherapy, kinase inhibitors) and interventional treatments like radiofrequency ablation (RFA) and transarterial chemoembolization (TACE) constitute the three mainstays of treatment for both HCC and CCA.

Therefore, the objectives of this systematic review and meta-analysis were threefold: (1) to perform a systematic review to identify all published HRQoL tools for primary liver cancer (HCC/CCA); (2) to assess the methodological quality and clinical relevance of these HRQoL measures; and (3) to synthesize quantitative data via means of a meta-analysis to compare surgery vs. interventional treatments vs. systemic therapies with regard to HRQoL.

Material and methods

This systematic review and meta-analysis is reported in line with current PRISMA guidelines [11]. The study was registered in the PROSPERO database on 18th July 2017 (registration number CRD42017068103).

Eligibility criteria

Studies investigating HRQoL in HCC or CCA patients were included independent of language or year of publication. All study types except case reports were eligible, i.e. randomized controlled trials (RCT), cohort-type studies (CTS), case–control studies (CCS) and cross-sectional studies. Furthermore, studies in animals (non-human studies) were excluded. The patient (P) and outcome (O) terms of the PICOT (patient–intervention–comparison–outcome–time) scheme were used to build a search strategy. The search used the ‘outcome’ term to identify patient-reported outcome measures (PROMs) describing quality of life or HRQoL and the ‘patient’ term to find studies including patients with HCC or CCA. Supplement 1 shows the search strategy for MEDLINE performed via OvidSP. If studies included mixed patient populations (e.g. including HCC patients together with metastatic cancer patients and other tumours), only those trials were included in which HRQoL data could clearly be extracted for HCC and CCA patients.

Information sources

The following databases were searched [12]: (a) MEDLINE via OvidSP last searched on 18th July 2019; (b) Ovid MEDLINE In-Process & Other Non-Indexed Citations via OvidSP last searched on 18th July 2019; (c) the Cochrane library (including Cochrane reviews, other reviews, trials, technology assessments and economic evaluations) via the Cochrane homepage (Wiley online library) last searched on 18th July 2019; (d) PsycINFO via EBSCO host last searched on 18th July 2019; (e) CINAHL via EBSCO host last searched on 18th July 2019 and (f) Excerpta Medica Database (EMBASE) via EMBASE homepage last searched on 18th July 2019. The references of the included articles were hand searched to identify additional relevant studies. Where necessary, authors were directly contacted to retrieve missing information.

Search

Sensitive search strategies were developed for all databases using wildcards and adjacency terms where appropriate. Supplement 1 shows the search strategy for MEDLINE performed via OvidSP. The search strategies for the other databases were adapted to the specific vocabulary of each database.

Study selection

Search results were imported into EndNote software (EndNote X7.7, Thomson Reuters) [13], and duplicates were removed by using the automated duplicate removal function of EndNote. Subsequently, titles and abstracts of studies were screened by two authors (KW, ALM) for fulfilment of inclusion and exclusion criteria. Remaining duplicates were removed manually. For the remaining studies, full text articles were obtained, which were then screened for eligibility by two authors independently (KW, ALM). Reasons for exclusion of full text articles were recorded (Fig. 1). All remaining articles were included in the qualitative syntheses (objectives 1 and 2). For objective 3 (quantitative assessment), all articles using adequate HRQoL measures (i.e. fulfilling objective 2) were included in the assessment of quality of reporting of HRQoL data and risk of bias assessment of individual studies. HRQoL data were extracted wherever possible and grouped according to the three clinical settings: (a) surgery; (b) interventional therapy and (c) medical treatment.

Fig. 1 Flow chart of included studies

HRQoL assessments were then grouped into 3-month periods. In a next step, quantitative data analysis was performed for those HRQoL measures for which ≥ 2 quantitative data time points were available. For quantitative data analysis, results of individual studies were entered in RevMan software, version 5.3 (Review Manager, Version 5.3. Copenhagen: The Nordic Cochrane Center, The Cochrane Collaboration, 2014).

Data collection process

Data were extracted by two authors independently (KW, ALM) and collected on pre-specified piloted forms. If required data were not reported in a study, the authors were contacted to obtain the missing data. Differences in data extraction were resolved by consensus with a third author (MKD).

Data items

The following data items were collected: title, author, year of publication, country where study was performed, journal, language, cancer type, intervention, control, co-interventions, primary endpoint, secondary endpoints, HRQoL tool used, type of study, number of centres, start and end dates of study and intervention, number of patients (total), number of patients allocated to intervention(s), number of patients allocated to control, number of patients evaluated for HRQoL (at each point in time), number of withdrawals, exclusions, conversions, duration of follow-up, HRQoL data at baseline and during follow-up, analysis strategy, subgroups measured and subgroups reported. Furthermore, the following baseline characteristics of patients (for both intervention and control group) were recorded: age, gender, severity of illness, co-morbidities and other relevant baseline characteristics.

Evaluation of methodological quality of the HRQoL measures

The methodological quality of HRQoL measures was assessed based on specific psychometric criteria. Owing to the lack of uniform consensus on how to appraise PRO measures, criteria were applied based on published recommendations [3, 14] in accordance with U.S. Food and Drug Administration guidance [15], the Oxford University PROMs Group guidelines and the COnsensus-based Standards for the selection of health status Measurement INstruments (COSMIN) [16]. The criteria and benchmarks laid out in Table 1 were used for evaluation and have been used in previous publications [4, 5]. A rating scale described in previous publications was applied to allocate a mark for each domain [4, 5]: 0, no evidence reported; −, evidence not in favour; +, evidence in favour; ±, conflicting evidence. Lack of basic psychometric evaluation was defined by a priori consensus as evaluation of fewer than 2 positive (+) aspects (other than feasibility and interpretability) in HCC/CCA patients. Evaluation was limited to primary hepatic cancers (HCC/CCA), i.e. the psychometric properties of some instruments might have been evaluated in other types of cancer, but not in HCC/CCA patients. Where psychometric data were lacking for a given instrument, searches were conducted in Medline to identify additional studies that had evaluated the psychometric properties of the HRQoL instrument in closely related patient cohorts (e.g. patients with chronic liver disease).
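The a priori rule for "lack of basic psychometric evaluation" can be illustrated with a short sketch. The marks follow the rating scale described above; the domain names and the example ratings are hypothetical placeholders, since the actual criteria are those listed in Table 1.

```python
# Marks per the rating scale in the text: "0", "-", "+", "±".
# Domain names below are illustrative assumptions, not the Table 1 wording.
NOT_COUNTED = {"feasibility", "interpretability"}


def lacks_basic_psychometric_evaluation(ratings):
    """Return True if fewer than 2 counted domains carry a '+' mark."""
    positives = sum(
        1
        for domain, mark in ratings.items()
        if mark == "+" and domain not in NOT_COUNTED
    )
    return positives < 2


# Hypothetical instrument: only one counted domain is rated '+'.
weak_instrument = {
    "reliability": "+",
    "validity": "±",
    "responsiveness": "0",
    "feasibility": "+",
    "interpretability": "+",
}

# Hypothetical instrument: two counted domains are rated '+'.
adequate_instrument = {
    "reliability": "+",
    "validity": "+",
    "responsiveness": "-",
    "feasibility": "0",
    "interpretability": "+",
}
```

In this sketch, positive marks for feasibility and interpretability never count towards the threshold, mirroring the exclusion stated in the rule.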

Table 1 Psychometric criteria used to assess the quality of the patient-reported outcome measures

Evaluation of the quality of reporting of HRQoL data

For assessment of reporting, the studies were analysed using the following questions: (a) Is HRQoL data analysis described in the methods section? (b) Has an a priori statistical analysis plan for HRQoL outcomes been implemented, addressing common problems like missing data and multiple testing? (c) Are HRQoL raw data presented? (d) Are individual patient data reported? (e) Which summary scores are used for HRQoL data? (f) Which time points of HRQoL assessment are described in the methods section? (g) For which time points are HRQoL data reported in the results section?

Assessment of risk of bias in individual studies

For RCTs, risk of bias was judged using The Cochrane Collaboration tool for assessing quality and risk of bias [17]. Risk of bias for non-randomized, interventional trials was assessed with the ROBINS-I tool (Risk Of Bias In Non-randomized Studies - of Interventions, formerly known as ACROBAT-NRSI) as recommended by the Cochrane Collaboration [11]. Non-randomized, non-interventional studies were assessed using the Newcastle–Ottawa risk of bias tool [18], and cross-sectional studies were assessed using the AHRQ checklist. RCTs were judged to be at an overall high risk of bias if there was a serious risk of bias in any of the following domains: random sequence generation, allocation concealment, missing data. For non-randomized trials, the following overall risk of bias judgement for individual studies was used in line with Cochrane recommendations [11]: (a) low risk of bias: the study is judged to be at low risk of bias for all domains; (b) moderate risk of bias: the study is judged to be at low or moderate risk of bias for all domains; (c) serious risk of bias: the study is judged to be at serious risk of bias in at least one domain, but not at critical risk of bias in any domain; (d) critical risk of bias: the study is judged to be at critical risk of bias in at least one domain.
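The overall judgement for non-randomized trials described in (a) to (d) amounts to taking the worst domain-level judgement. A minimal sketch of that rule follows; the function name and the list-based input format are assumptions made for illustration and are not part of the ROBINS-I tool itself.

```python
# Ordered from best to worst, following categories (a) to (d) in the text.
LEVELS = ["low", "moderate", "serious", "critical"]


def overall_risk_of_bias(domain_judgements):
    """Return the overall judgement, i.e. the worst domain-level judgement."""
    if not domain_judgements:
        raise ValueError("at least one domain judgement is required")
    unknown = [j for j in domain_judgements if j not in LEVELS]
    if unknown:
        raise ValueError(f"unknown judgement(s): {unknown}")
    # The position in LEVELS encodes severity, so max() picks the worst one.
    return max(domain_judgements, key=LEVELS.index)
```

For example, a study rated "low" in most domains but "serious" in one is classified as being at serious risk of bias overall.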

Statistical analysis

Data were entered in RevMan software, version 5.3 (Review Manager, Version 5.3. Copenhagen: The Nordic Cochrane Center, The Cochrane Collaboration, 2014) [19]. The level of significance was set at an alpha of 0.05. A random-effects model (inverse variance) was used because there was clinical heterogeneity between the included trials. Heterogeneity was evaluated using the I² statistic. Results lower than 25% were considered low, between 25% and 75% possibly moderate, and results of I² over 60% were considered considerable heterogeneity. HRQoL in HCC/CCA patients was compared by meta-analysis for the following types of interventions: (a) surgery; (b) interventional therapies (e.g. TACE, RFA) and (c) systemic therapies (e.g. chemotherapy). Only studies using the FACT-G/FACT-Hep could be used for meta-analysis (see results section). As these subscores are continuous variables, the mean difference in the FACT-G/FACT-Hep subscores was used as effect measure.
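For readers who want to see what an inverse-variance random-effects analysis of mean differences involves, the following is a minimal sketch. It assumes the DerSimonian-Laird estimator for the between-study variance, which the text does not name explicitly, and all function names are illustrative. It is not a substitute for the RevMan analysis.

```python
import math


def mean_difference(mean1, sd1, n1, mean2, sd2, n2):
    """Mean difference between two groups and its variance."""
    md = mean1 - mean2
    var = sd1 ** 2 / n1 + sd2 ** 2 / n2
    return md, var


def random_effects_meta(effects, variances, z=1.96):
    """Pool study effects with inverse-variance random-effects weights.

    Assumes the DerSimonian-Laird estimate of tau^2 (an assumption of this
    sketch). z = 1.96 corresponds to a two-sided alpha of 0.05.
    Returns (pooled effect, (ci_low, ci_high), tau^2, I^2 in percent).
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need matching effects/variances for >= 2 studies")
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, (pooled - z * se, pooled + z * se), tau2, i2
```

When all studies report the same effect, Q is zero, so tau² and I² are zero and the random-effects result coincides with the fixed-effect result.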

Results

Study selection

We identified 3811 studies by database search and 12 additional studies by hand search, resulting in a total of 3823 records. Of these, 453 were duplicates (Fig. 1). After screening titles and abstracts, a further 2888 records were excluded according to inclusion and exclusion criteria. Subsequently, 358 articles were excluded after full text analyses for the following reasons: no HRQoL tool (n = 74), other type of cancer (no HCC/CCA) (n = 48), no primary data (n = 198), ongoing study without report (n = 21), double publication (n = 15) and no full text available (n = 2). The remaining 124 studies were included in the final qualitative syntheses (Fig. 1).
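The selection numbers reported above can be checked with simple arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative.

```python
# Counts taken directly from the study selection paragraph.
database_records = 3811
hand_search_records = 12
total_records = database_records + hand_search_records

duplicates = 453
excluded_on_title_abstract = 2888

full_text_exclusions = {
    "no HRQoL tool": 74,
    "other type of cancer (no HCC/CCA)": 48,
    "no primary data": 198,
    "ongoing study without report": 21,
    "double publication": 15,
    "no full text available": 2,
}
excluded_on_full_text = sum(full_text_exclusions.values())

# Records remaining after each step of the flow chart.
after_deduplication = total_records - duplicates
full_texts_assessed = after_deduplication - excluded_on_title_abstract
included_studies = full_texts_assessed - excluded_on_full_text
```

The individual exclusion reasons add up to the 358 excluded full texts, and the flow ends at the 124 included studies.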

Study characteristics

The characteristics of the 124 included studies are listed in Table 2 [20–140]. Most studies were cohort-type studies (n = 50; 40.3%), either with (n = 12; 24%) or without control group (n = 38; 76%). The remaining studies were RCTs (n = 41; 33.1%), non-randomized controlled trials (n = 18; 14.5%), cross-sectional studies (n = 7; 5.6%) or case–control studies (n = 8; 6.5%) (Supplement 2). A total of 21,496 patients were included in all studies. Studies frequently investigated HCC patients only (Supplement 2). Most studies were single-centre studies (n = 83; 66.9%; Supplement 2). The country of origin is depicted in Supplement 2.

Table 2 Baseline characteristics of the included studies

Health-related quality of life instruments

In total, 29 different HRQoL instruments were identified in the 124 studies by our search (Figs. 2 and 3). Of those, 26 different HRQoL PROMs were identified in HCC patients, 8 in CCA patients and 4 different tools in mixed patient cohorts. Multiple studies used more than one HRQoL tool (Table 2). The identified instruments covered all types of HRQoL (generic, cancer-specific, cancer-type-specific and utility-based HRQoL instruments) (Fig. 2).

Fig. 2 Health-related quality of life instruments used in the included studies. Generic (black), cancer-specific (red), cancer-type-specific (green), utility-based (blue) and symptom index (yellow). EORTC European Organization for Research and Treatment of Cancer, EQ EuroQol, ESAS Edmonton symptom assessment scale, FACT Functional Assessment of Cancer Therapy, FLIC The Functional Living Index-Cancer, Pt DATA Form Patient Disease and Treatment Assessment Form, QoL quality of life, NIDDK-QA National Institutes of Diabetes and Digestive and Kidney Diseases QoL Assessment, SF Short Form Health Survey, VAS visual analogue scale, WHO World Health Organization, WHO-BREF abbreviated version of the WHOQOL-100, WHOQOL-100 WHO quality of life 100 tool

Fig. 3 Flow chart of (a) included HRQoL measures and (b) number of studies from qualitative data analyses to quantitative data analyses. PROM patient-reported outcome measure, MA meta-analyses

Despite being labelled as HRQoL instruments in the studies, a number of the identified instruments solely address cancer symptoms. They lack the multidimensionality that is required for HRQoL and were, thus, excluded from further analyses (Fig. 3 step 1). These were (a) MD Anderson symptom inventory; (b) ESAS: Edmonton symptom assessment scale; (c) MD Anderson symptom inventory – gastrointestinal and (d) FHSI-8 FACT hepatobiliary symptom index. The remaining 25 instruments (117 studies) were included in the further analyses (Fig. 3). These 25 instruments use two to eight domains covering various aspects of quality of life (e.g. physical and mental health, role functioning and symptom burden). The EORTC QLQ-C30 and the FACT-G have cancer-type-specific supplements (EORTC QLQ-HCC18 and FACT-Hep) which can only be used in combination with the more general questionnaire. The questionnaires comprise 5 (EQ-5D) to 47 questions (NIDDK-QA) and have a recall period ranging from 24 h (EQ-5D) to 4 weeks (SF-8/12/36, Patient Benefit Form). Most of them can be completed within 10 min.

Methodological assessment of HRQoL instruments

The methodological quality of the remaining 25 HRQoL instruments was assessed as outlined in the methods section. Results are shown in Table 3. If no data for a given HRQoL instrument were available for HCC/CCA patients, additional Medline searches were performed to identify methodology studies that evaluated the PROM in closely related patient populations like chronic liver disease. These studies are indicated in Table 3.

Table 3 Overview of the methodological quality of HRQoL tools in primary liver cancer

The most frequently evaluated dimension in all HRQoL tools was reliability (test–retest reliability and internal consistency). With a test–retest correlation of more than 0.70, adequate performance was confirmed for 6 out of 12 PROMs (SF-36, FACT-G, EORTC QLQ-HCC18, FACT-Hep, NIDDK-QA and QOL-LC) [41, 88, 120, 141,142,143,144,145,146]. For the EQ-5D, correlation coefficients ranging from 0.58 to 0.98 were observed, showing that not all scales in this PROM are sufficiently reliable [141]. Internal consistency was evaluated by calculating Cronbach’s α. A value greater than 0.70 was considered sufficient according to COSMIN guidelines [16]. This was observed in 8 out of 12 HRQoL tools (NHP, SF-36, WHO-BREF, EORTC QLQ-C30, FACT-G, FACT-Hep, NIDDK-QA and QOL-LC) [27, 77, 88, 120, 141, 142, 144,145,146,147,148,149,150,151]. Concerning validity, all three pre-defined categories (content, criterion and construct validity) were rarely evaluated together. More frequently, only one or two aspects of validity were examined. Content validity was evaluated by investigating the process of questionnaire creation. In the case of the FACT-G, FACT-Hep and EORTC QLQ-HCC18, the process described included qualitative studies with inclusion of expert opinions, patient reports and current literature [28, 144, 152]. Merely three PROMs (FACT-Hep, FACT-Hep and NIDDK-QA) were compared to the gold standard (i.e. an already established questionnaire), thus testing criterion validity [144,145,146]. In order to evaluate construct validity, group comparisons using performance status (such as the Karnofsky Performance Status) were used for the EORTC QLQ-HCC18 and FACT-Hep questionnaires, as it is known that a higher performance status correlates with better HRQoL [41, 88]. Construct validity within the SF-36 was evaluated using the correlation with hypothesized scores (conceptually related and unrelated scores) [141, 148, 149]. Kim et al. compared item scores between ambulatory patients and liver transplant recipients and examined correlations between the domain scores of NIDDK-QA vs. SF-36 and Mayo risk score, respectively [146]. The Wilcoxon signed-rank test was used by Chie et al. to evaluate whether the changes in score before and after treatment were significant. For example, patients undergoing surgical treatment suffered significantly more pain than before treatment, which reflects an adequate responsiveness of the EORTC QLQ-HCC18 [41]. Steel et al. evaluated the clinically meaningful changes of the FACT-Hep over time and found significant decrements in all subscales from baseline to 3-month follow-up [147]. The SF-36 performed poorly in the evaluation of floor and ceiling effects, with distinctly more than 15% of patients (the set cut-off) scoring the highest or lowest possible score [148, 149]. Valid acceptability and feasibility were assumed when the response rate was > 80%, or the time to complete the questionnaire was 10 min or less [24, 27, 46, 56, 85, 88, 120, 126, 141, 148, 149, 153]. The interpretability of all PROMs was considered acceptable as higher scores in QoL scales represent higher HRQoL, and higher scores within the symptom scales represent lower HRQoL.
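Two of the benchmarks used above, Cronbach’s α above 0.70 and the 15% cut-off for floor and ceiling effects, can be sketched in a few lines. This is a generic illustration using the standard formula for Cronbach’s α with sample variances; the function names and the input layout (one row per respondent, one column per item) are assumptions of this sketch, and the included studies may have computed the statistics differently.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    k = len(responses[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the total score across respondents.
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def floor_ceiling_effects(scores, minimum, maximum, cutoff=0.15):
    """Flag floor/ceiling effects: more than `cutoff` at the extreme score."""
    n = len(scores)
    floor_share = sum(1 for s in scores if s == minimum) / n
    ceiling_share = sum(1 for s in scores if s == maximum) / n
    return floor_share > cutoff, ceiling_share > cutoff
```

Under this sketch, an instrument would meet the internal consistency benchmark if `cronbach_alpha(...)` exceeds 0.70, and a scale would be flagged if either value returned by `floor_ceiling_effects` is `True`.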

Due to a lack of data concerning the basic psychometric evaluation or negative results, only the following 10 HRQoL instruments were considered methodologically adequate according to the pre-specified criteria (see methods section) and were subsequently included in further analyses (Table 3): (a) Generic HRQoL: NHP, SF-36, WHO-BREF; (b) Cancer (Condition)-specific HRQoL: EORTC QLQ-C30 and FACT-G; (c) Cancer type-specific HRQoL: EORTC QLQ-HCC18, FACT-Hep, NIDDK-QA and QOL-LC; (d) Utility (preference)-based HRQoL: EQ-5D. Only publications using one of the above-mentioned 10 HRQoL measures were included in further analyses (n = 98 studies) (Fig. 3 step 2).

Quality of reporting of HRQoL data

The remaining studies were evaluated for the quality of reporting of HRQoL data. Results are summarized in Supplement 3. Of the 98 included studies, 4 (4.1%) did not specify in their methods section at what exact time points HRQoL data were measured [28, 31, 74, 79]. Many studies showed a marked discrepancy between reported HRQoL data in the results section and the frequency of HRQoL data assessment specified in the methods section. Eight studies reported only baseline HRQoL data although these trials specified in their methods section to have assessed HRQoL also during follow-up [38, 41, 42, 58, 80, 94, 98, 139]. A further 18 studies lacked reporting of HRQoL data altogether in their results section, although assessment had been announced in the methods section (Supplement 3) [25, 28, 31, 44, 50, 53, 56, 66, 71, 74, 75, 80, 95, 97, 112, 134, 136, 139]. A total of 32 studies did not report raw HRQoL data and consequently could not be used for meta-analysis [21, 25, 27, 29, 32, 34, 35, 38, 40, 44,45,46, 49,50,51, 53, 56, 58, 66, 71, 75, 86, 95, 97, 112, 118, 128,129,130, 134, 136, 139]. A further 17 papers reported HRQoL data only in graphical form, which impedes meta-analysis [61, 64, 70, 72,73,74, 87, 90, 110, 113, 118, 121, 124, 128,129,130, 137]. Furthermore, although most studies reported the statistical methods they used to analyse HRQoL, only 6 publications used a pre-specified statistical analysis plan addressing common methodological problems in HRQoL analysis [41, 43, 103, 104, 108, 125]. Finally, nine publications combined patient groups undergoing different treatment options (surgery/medical therapy/interventional treatment) for the reporting of HRQoL outcomes. In these cases, assignment of HRQoL outcomes to a specific treatment (surgery vs. medical therapy vs. interventional treatment) was impossible [28, 42, 56, 58, 87, 88, 120, 127, 132]. In summary, only three studies remained for quantitative analyses (Fig. 3 step 3).

Supplement 4 illustrates the discrepancy between supposedly available and reported data for the FACT-Hep (A/B) and EORTC QLQ-C30 (C/D) HRQoL instruments.

Data synthesis for HRQoL tools

For generic HRQoL instruments like the SF-36, EQ-5D or WHO-BREF, no meta-analysis following treatment was possible, either because primary data were insufficiently reported (Supplement 4) or because only single articles reporting raw data were identified. Similarly, for cancer (type)-specific HRQoL tools like EORTC QLQ-C30, EORTC QLQ-HCC18 and QOL-LC, meta-analysis of HRQoL data following treatment was impeded either by insufficient reporting during follow-up (Supplement 3) or because studies compared interventions that were too heterogeneous for meta-analysis. Only for the FACT-G and FACT-Hep questionnaires were clinically comparable interventions analysed in several studies: six studies contained surgical study groups [35, 37, 43, 81, 99, 116], two studies contained data on RFA [37, 116], and five studies reported extractable data in TACE patients [73, 99, 103, 116, 123]. Although FACT-G or FACT-Hep was used in several studies investigating medical treatment options for HCC, these were either single-arm studies [32, 34, 94], contained placebo control groups [31, 36, 38, 53, 137] or compared two medical treatment options [72, 136], thus precluding a comparison to interventional/surgical treatments. Similarly, some studies used the FACT-G or FACT-Hep questionnaire to compare different interventional treatments [73, 103, 116, 122], again impeding meta-analysis. Consequently, only 3 studies using the FACT-G/FACT-Hep remained for meta-analysis (Fig. 3 step 3).

Meta-analyses

For the comparison of surgical resection vs. TACE, only two studies reported raw data at baseline and during follow-up [99, 116] (Supplement 5A). Poon et al. split the surgical cohort into two distinct subgroups: those with a complete follow-up of two years and those with a shorter follow-up. This is likely to introduce major bias as patients completing 2-year follow-up are likely to be healthier and have less aggressive tumour disease. We, therefore, pooled the data for the two surgical groups. Supplement 5A shows the results of this exploratory meta-analysis of the mean difference in FACT subscores (functional, physical, social and emotional well-being) at 12 months post-intervention/surgery. One additional analysis was possible: the comparison of surgery vs. RFA, as data are reported in the two studies by Huang et al. and Toro et al. [37, 116]. Supplement 5B shows the results of the exploratory meta-analysis for the 12-month post-interventional/postoperative follow-up, again comparing mean differences in FACT subscores.

Discussion

HRQoL represents an important domain of clinical outcomes in oncology. While definitions, implementation, evaluation and analyses of survival and toxicity/complication endpoints have been well standardized over the last decades, PROs are still under-evaluated and under-reported in most clinical settings. Multiple studies have aimed to define suitable HRQoL tools for different clinical settings, e.g. [4, 5], including cancer patients [6,7,8]. However, no concise evaluation has been performed for patients with primary liver cancers (HCC or CCA).

Although 124 studies were included in this systematic review, we were able to complete only the first two objectives of our study, namely to identify and evaluate HRQoL measures in HCC/CCA patients. However, meta-analysis of study results comparing the outcome of surgical, interventional or medical treatments for HCC/CCA patients with regard to HRQoL was barely possible due to the use of different HRQoL instruments, lack of data or insufficient reporting.

We identified 29 different HRQoL instruments, which indicates vast heterogeneity and lack of consensus in this field. Similar results have been reported before in other diseases [6,7,8]. Furthermore, many of the identified tools lacked basic HRQoL characteristics like multidimensionality [154, 155]. Hence, many authors seemed to be unaware of the difference between mere symptom measures and HRQoL instruments. In addition, validation of HRQoL is poor for most instruments in HCC/CCA patients (Table 3). As expected, the best psychometric data were available for cancer-type-specific HRQoL instruments, like EORTC QLQ-HCC18 or the FACT-Hep. Interestingly, even for common generic and disease-specific HRQoL tools, like the Spitzer quality of life index and the EORTC QLQ-C30, data in HCC/CCA patients are sparse. Hence, evaluation of these common tools in this patient cohort seems necessary in future studies. In addition, even for HRQoL measures developed especially for liver cancer patients, psychometric evaluation was less stringent than might have been expected. The EORTC QLQ-HCC18 shows mixed psychometric results [41, 88]. FACT-Hep, on the other hand, although showing good psychometric properties, has been validated only in mixed patient populations including patients with liver metastases and pancreatic cancer in addition to HCC/CCA patients [144, 147]. Similarly, the preference-based HRQoL EQ-5D has been extensively evaluated in chronic liver disease, but little psychometric data are available in HCC/CCA patients. Future studies should address these shortcomings.

Nevertheless, our analysis revealed suitable HRQoL instruments with sound psychometric properties that should be used in all future HRQoL studies. For generic HRQoL measurement, this is the SF-36 [156]. The SF-36 is a generic HRQoL instrument consisting of 36 items divided into eight scales (Physical Functioning, Emotional Role Functioning, Physical Role Functioning, Bodily Pain, General Health, Vitality, Social Functioning and Mental Health) and a single Health Transition item [156]. The number of response choices per item ranges from two to six. The scores for each scale range from 0 to 100. A higher score indicates a better QOL. The time frame of the SF-36 is ‘last week’ [141].

For cancer-specific HRQoL measurement in HCC/CCA patients, the EORTC QLQ-C30 [157] and the FACT-G can be recommended. Both have limited, but acceptable psychometric properties in HCC/CCA patients and have been used extensively in this patient cohort. The 30-item QLQ-C30 measures five functional scales (physical, role, emotional, cognitive and social functioning), global health status, financial difficulties and eight symptom scales (fatigue, nausea and vomiting, pain, dyspnoea, insomnia, appetite loss, constipation and diarrhoea). The scores vary from 0 (worst) to 100 (best) for the global health status and functional scales, and from 0 (best) to 100 (worst) for symptomatic scales [157]. The FACT-G consists of 27 items for the assessment of four domains of QOL: (1) Physical Well-Being and (2) Socio-Family Well-Being contain seven items each; (3) Emotional Well-Being contains six items and (4) Functional Well-Being contains seven items. The time frame of the FACT-G is ‘last week’. Each item is scored on a 5-point ordinal scale, where 0 indicates not at all and 4, very much [152].

Cancer-type-specific HRQoL should be measured via the EORTC QLQ-HCC18 or FACT-Hep. The EORTC QLQ-HCC18 is an 18-item HCC-specific supplemental module developed to augment QLQ-C30 and to enhance the sensitivity and specificity of HCC-related QOL issues. It contains six multi-item scales addressing fatigue, body image, jaundice, nutrition, pain and fever, as well as two single items addressing sexual life and abdominal swelling. The scales and items are linearly transformed to a 0 to 100 score, where 100 represents the worst status [28, 88]. The FACT-Hep is a 45-item self-reported instrument that consists of the 27-item FACT-G (see above) and the 18-item hepatobiliary cancer subscale, which assesses specific symptoms of hepatobiliary cancer and side effects of treatment. The FACT-G and hepatobiliary cancer subscale scores are summed to obtain the FACT-Hep total score [37, 144]. The QOL-LC questionnaire shows good psychometric properties but has been developed and tested exclusively in Chinese patients, thus limiting its generalizability. Similarly, NIDDK-QA as a cancer-type-specific HRQoL tool has been used in only one study and, thus, cannot be recommended currently.
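The linear transformation to a 0 to 100 score and the summation of the FACT-Hep total score can be illustrated as follows. The sketch assumes the general EORTC scoring convention (raw score as the mean of the item responses, with items scored from 1 to 4, i.e. an item range of 3), which is not spelled out in the text; function names are illustrative, and details such as missing-item handling or reverse scoring of individual FACT items are deliberately omitted.

```python
def raw_score(items):
    """Raw scale score: mean of the item responses."""
    if not items:
        raise ValueError("at least one item response is required")
    return sum(items) / len(items)


def symptom_scale_score(items, item_range=3):
    """0-100 score where 100 is the worst status (symptom scales).

    item_range is the difference between the highest and lowest possible
    item response (3 for items scored 1-4; assumed convention).
    """
    return (raw_score(items) - 1) / item_range * 100


def functional_scale_score(items, item_range=3):
    """0-100 score where 100 is the best status (functional scales)."""
    return (1 - (raw_score(items) - 1) / item_range) * 100


def fact_hep_total(fact_g_score, hepatobiliary_subscale_score):
    """FACT-Hep total: FACT-G plus hepatobiliary cancer subscale score."""
    return fact_g_score + hepatobiliary_subscale_score
```

For instance, a two-item symptom scale answered with 2 and 3 has a raw score of 2.5 and a transformed score of 50, while the lowest possible responses map to 0 and the highest to 100.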

For utility-based HRQoL measurement, the EQ-5D [158] has been identified as the instrument of choice. It fulfils basic psychometric requirements, and a sound database is available in HCC/CCA patients. The EQ-5D consists of five items (mobility, self-care, usual activities, pain/discomfort and anxiety/depression). Each item has three response categories: no problems, some problems and extreme problems. The sixth item is a global health evaluation scale, ranging from 0 (the worst imaginable health state) to 100 (the best imaginable health state). The time frame of the EQ-5D instrument is the present moment.

The quality of reporting of the HRQoL results was insufficient overall. Few trials addressed common methodological problems of HRQoL data like multiple testing, missing data or a priori hypotheses. Raw data were rarely reported, and summary measures (mean, median etc.) as well as follow-up regimes varied widely between studies. In addition, the methodological quality of the studies was generally poor. Thus, despite a total of 124 studies available, evidence regarding HRQoL in HCC/CCA patients is limited.

It is astonishing that reporting of HRQoL data does not seem to have improved over the last decades despite the publication of multiple guidelines and recommendations concerning HRQoL reporting. Few of the included studies fulfilled basic reporting standards for HRQoL like the ones proposed by Basch et al. [159], Staquet et al. [160], the International Society for QoL research (ISOQOL) [161] or the CONSORT patient-reported outcome extension [162].

These shortcomings in the methodological quality and reporting were the main reasons for the insufficient meta-analyses in our study. Studies had to be excluded at various points along the way (Fig. 3). The planned comparison of treatment options (surgery vs. medical treatment vs. interventional treatment) with regard to HRQoL can, therefore, be regarded exploratory at best. Future, high-quality HRQoL trials, adhering to basic reporting standards, are urgently needed to address these shortcomings.

One of the main strengths of the current study is the use of a comprehensive search strategy to identify all relevant publications. Furthermore, to our knowledge, this is the first study that assesses the methodological quality of HRQoL tools in HCC/CCA patients according to internationally accepted standards [3, 15, 16], thereby identifying suitable HRQoL instruments for use in future studies. In addition, this study can be used as an easy reference standard to identify available studies and raw data for the design and sample size calculation in future HCC/CCA trials. The transparent analysis process in this study can be regarded as a further strength.

The main limitation of our analysis is the heterogeneity of included studies, patients and trial designs. The variations in the application, analyses and reporting of HRQoL between studies made data synthesis difficult. The meta-analyses should be regarded as exploratory at best.

In summary, clear recommendations for generic, cancer-specific, cancer-type-specific and preference-based HRQoL instruments in HCC/CCA patients can be given. Meta-analysis of data comparing different treatment options in HCC/CCA patients was severely limited due to methodological weaknesses of the included studies and shortcomings in reporting. Future trials should address these aspects and adhere to HRQoL reporting standards.