Applicability of scoring systems predicting outcome of transarterial chemoembolization for hepatocellular carcinoma

Purpose Several scoring systems have been proposed to predict the outcome of transarterial chemoembolization (TACE) in patients with hepatocellular carcinoma (HCC). However, the application of these scores to a bridging-to-transplant setting is poorly validated. An evaluation of the applicability of prognostic scores for patients undergoing TACE with palliative intention vs. as bridging therapy to liver transplantation (LT) is therefore necessary. Methods Between 2008 and 2017, 148 patients with HCC received 492 completed TACE procedures at our center (158 for bridging to transplant; 334 with palliative treatment intention) and were analyzed retrospectively. Scores (ART, CLIP, ALBI, APRI, SNACOR, HAP, STATE score, Child–Pugh, MELD, Okuda and BCLC) were calculated and evaluated for prediction of overall survival (OS). ROC analysis was performed to assess prediction of 3-year survival and treatment discontinuation. Results In patients receiving TACE with palliative intention, most scores predicted OS in univariate analysis, but only the mSNACOR score (p = 0.006), STATE score (p < 0.001) and Child–Pugh score (p < 0.001) reached statistical significance in the multivariate analysis. In the bridging to LT cohort, only the BCLC score reached statistical significance (p = 0.002). Conclusions The clinical usability of the suggested scoring systems for TACE might be limited depending on the individual patient cohort and the indication. Especially in patients receiving TACE as bridging to LT, none of the scores showed sufficient applicability. In our study, the Child–Pugh score, STATE score and mSNACOR score showed the best performance in assessing OS in patients with TACE as palliative therapy.


Introduction
Therapeutic approaches to hepatocellular carcinoma (HCC) are multimodal. Management and prognosis of HCC patients depend strongly on tumor status, general health and actual liver functional reserve (Cabibbo et al. 2010; Llovet et al. 1999b; Marrero et al. 2005; Okuda et al. 1985). Curative treatments in terms of resection, liver transplantation or local ablation are often restricted to subgroups with preserved liver function and limited tumor load (Bruix and Sherman 2005; Llovet et al. 2005). For intermediate stage HCC patients, TACE is currently considered (palliative) first-line therapy (Bruix and Sherman 2011; Llovet and Bruix 2003; Llovet et al. 2008), offering local tumor control and prolongation of OS (Arii et al. 2000; Ikai et al. 2004; Lee et al. 2012; Takayasu et al. 2006). Apart from its use in intermediate and advanced tumor stages, another application for TACE is as bridging treatment to liver transplantation (Decaens et al. 2005; Llovet et al. 2012; Kollmann et al. 2017; Majno et al. 1997; Porrett et al. 2006; Bruix et al. 2011). Various scoring systems (Table 1) predicting the prognosis of HCC patients undergoing different therapies are available (Ho et al. 2017; Hucke et al. 2014a, b; Kadalayil et al. 2013; Kamath et al. 2001; Kim et al. 2016; Li et al. 2016; Marrero et al. 2005; Cancer of the Liver Italian Program (CLIP) Investigators 1998; Okuda et al. 1985; Sawhney et al. 2011; Song et al. 2016; Yin et al. 2016) to guide treatment decisions, such as the commonly used BCLC classification (Cillo et al. 2006; Guglielmi et al. 2008; Llovet et al. 1999a, 2008, 2012; Marrero et al. 2005; Vitale et al. 2009). In the setting of TACE, a considerable number of scores, such as Child-Pugh (Child and Turcotte 1964; Pugh et al. 1973), ALBI (Johnson et al. 2015), APRI (Song et al. 2016; Wai et al. 2003), HAP (Kadalayil et al. 2013), ART (Sieghart et al. 2013), CLIP (Cancer of the Liver Italian Program Investigators 1998), SNACOR (Kim et al. 2016), MELD (Kamath et al. 2001; Sawhney et al. 2011), Okuda (Okuda et al. 1985) and STATE (Hucke et al. 2014a), aim to predict the prognosis of HCC patients undergoing therapy. However, data on bridging-to-transplant cohorts and comparative data between scores are sparse. The current study retrospectively assessed the proposed scoring systems in HCC patients receiving TACE either as bridging to transplant or with palliative intention.

Study design
The retrospective cohort study was conducted in a tertiary care center (Heidelberg University Hospital) and was approved a priori by the institutional review board (IRB). Data collection was based on chart review. We included patients with an established diagnosis of hepatocellular carcinoma according to EASL criteria who received at least one TACE as therapy for HCC between 2011 and 2017 in our center (Llovet et al. 2012). The decision for TACE treatment and the modality of beads (DEB-TACE, conventional TACE or TACE with biodegradable particles) was made in all cases by a multidisciplinary tumor board. The board's treatment approach followed the current EASL-EORTC Clinical Practice Guidelines (Llovet et al. 2012) for patients who have unresectable lesions and are not suitable for other ablative therapies. Patients who had been diagnosed as BCLC stage A, C or D, but were unable or unwilling to receive the proposed therapy (e.g., LT, RF, Sorafenib), were also eligible for TACE therapy. For patients on the liver transplantation list, TACE was considered standard bridging treatment.

Subgroup definition
Each TACE procedure of the included patients was assigned to one of two subgroups, depending on the treatment plan at the time of TACE therapy (Fig. 1): bridging to transplant or palliative therapy. The bridging to LT dataset included all interventions in which patients were enrolled on the transplant waiting list at the time of TACE, regardless of whether the LT was performed afterwards. The palliative dataset consisted of interventions performed in patients who did not meet the criteria for a liver transplant at the time of TACE.

Calculation of scores
Scores were calculated at each TACE session according to their original formulas. In addition, we calculated a modified SNACOR (mSNACOR) score and a modified ART (mART) score. The original calculation of these two scores only includes parameters in comparison to the first TACE to assess whether a second TACE should be performed. To assess these scores with respect to each individual TACE, these parameters were calculated in comparison to the previous TACE instead of the first TACE.
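The difference between the original and the modified calculation comes down to the choice of reference session. The following minimal sketch illustrates this for a single laboratory parameter; the function name, the choice of AST as the parameter and all values are hypothetical and are not taken from the study data or from the published score definitions.

```python
from typing import List


def relative_change(values: List[float], index: int, baseline: str = "previous") -> float:
    """Relative change of a parameter at session `index` versus a reference session.

    baseline="first"    compares with the first TACE (original-style comparison).
    baseline="previous" compares with the preceding TACE (modified-style comparison).
    """
    if index < 1:
        raise ValueError("a comparison needs at least one earlier session")
    if baseline == "first":
        reference = values[0]
    elif baseline == "previous":
        reference = values[index - 1]
    else:
        raise ValueError("baseline must be 'first' or 'previous'")
    return (values[index] - reference) / reference


# Hypothetical AST values (U/L) before four consecutive TACE sessions.
ast = [40.0, 60.0, 66.0, 99.0]

original_style = [relative_change(ast, i, "first") for i in range(1, len(ast))]
modified_style = [relative_change(ast, i, "previous") for i in range(1, len(ast))]

print(original_style)  # changes relative to the first session
print(modified_style)  # changes relative to the preceding session
```

With the first session as reference, an early rise is carried through all later comparisons, whereas the comparison with the preceding session only reflects the change since the last intervention.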

Statistics
Statistical analysis was performed using SPSS-25 software (IBM, Germany). The two-tailed Chi-squared test was employed to compare categorical data between the bridging dataset and the palliative dataset. The Mann-Whitney U test was used for continuous variables. The primary endpoint was overall survival with respect to the different scores, analyzed by the Kaplan-Meier method and compared by log rank test. Scoring systems that were significant in the univariate analysis were entered into a multivariate Cox regression model to determine the adjusted risk ratio. ROC analysis was used to examine which score best reflects the probability of achieving 3-year survival or the probability of treatment discontinuation due to adverse events or death. Three-year survival was calculated from the time of each individual TACE. Statistical significance was set at a p value < 0.05 in two-tailed tests.
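For readers less familiar with the Kaplan-Meier method, the following sketch shows the product-limit estimate and the derivation of a median survival time. It is a plain-Python illustration with hypothetical follow-up times, not the SPSS procedure used in the study; the function names are our own.

```python
from typing import List, Optional, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct time with a death.

    events[i] is 1 if a death was observed at times[i] and 0 if censored.
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Collect all observations that share the same time point.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def median_survival(curve: List[Tuple[float, float]]) -> Optional[float]:
    """First time at which survival drops to 50% or below; None if not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical follow-up in months (1 = death, 0 = censored).
curve = kaplan_meier([6, 12, 12, 18, 24, 30], [1, 1, 0, 1, 0, 1])
median = median_survival(curve)

# Hypothetical cohort with heavy censoring: the median is not reached.
censored_curve = kaplan_meier([5, 10, 15, 20], [1, 0, 0, 0])
censored_median = median_survival(censored_curve)
```

The second example shows why a median can remain undefined: if most observations are censored before half of the cohort has died, the curve never falls to 50%.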

Patient characteristics and distribution of scores at TACE procedures
A total of 492 TACE sessions were included in this study (158 bridging/334 palliative sessions). As a consequence of the listing criteria, patients in the bridging cohort were younger and had limited tumor disease and different tumor properties, such as less frequent portal or hepatic vein infiltration and no extrahepatic tumor manifestation. In the palliative dataset, 28 (8.4%) procedures were performed as conventional TACE with Carboplatin or Doxorubicin as chemotherapeutic agent combined with Lipiodol®, which is only half as common as in the bridging dataset. In the palliative group, 73 (21.9%) TACE sessions were performed in patients who finally discontinued TACE therapy (and received no further local therapy) because of adverse events or death, whereas in the bridging dataset none of the patients discontinued TACE therapy (Table 2). Median overall survival after TACE was not reached in the bridging dataset due to LT and was 21.8 months in the palliative dataset (Table 4). The descriptive comparative analysis of the scoring systems between both datasets is thus confounded by the different baseline characteristics and showed significantly different distributions of the scoring values (BCLC stage, Child-Pugh class, STATE score, HAP stage, SNACOR stage, mSNACOR, ALBI group, CLIP group and MELD score), as shown in Table 3. Only the APRI score, Okuda score, mART and the ART score did not differ significantly between the two subgroups (Table 3). When the three most frequent etiologies (viral, alcoholic and cryptogenic/NASH, in descending order) were compared in ROC analysis with the endpoints "3 years survival" and "treatment discontinuation", the performance of the scores revealed etiology as a potential confounding factor (Tables 8, 9).

Median overall survival (OS)
The univariate Kaplan-Meier analysis in the palliative dataset showed significant differences in median OS for the majority of scores (Table 4, Figs. 2, 3). The ART, mART and SNACOR scores were the only three scores that showed no significant results in univariate analysis. In multivariate analysis, only three scores were statistically significant independent parameters for the assessment of median OS: the Child-Pugh score, the STATE score and the mSNACOR score (which was calculated for each TACE treatment) (Table 4, Fig. 2).

Treatment discontinuation
The ROC analysis in the palliative group showed that five scores achieved a statistically significant p value concerning the probability of treatment discontinuation due to adverse events or death (Table 6). The Child-Pugh, MELD, Okuda, HAP and ALBI scores achieved a significant p value, but their AUC values did not reach 70% (Table 6; Fig. 4). The most applicable score to predict the probability of a later TACE discontinuation due to the mentioned circumstances was the Child-Pugh score (class A versus classes B/C). Overall survival differed significantly with the number of successfully performed TACE procedures in the palliative cohort (p = 0.001; Table 7) but not in the bridging cohort (p = 0.354).

3-years survival
The ROC analysis of the database in our study showed that none of the 13 scores had an AUC of over 70%, although some of the scores reached significance concerning the probability of achieving 3-year survival, e.g., the CLIP, Okuda, HAP and Child-Pugh scores. The APRI score and MELD score also showed statistical significance in the ROC analysis, but neither of these scores reached an AUC of 60% (Table 5; Fig. 5). The most applicable score to predict the probability of achieving 3-year survival was the Okuda score (stage A versus stages B + C). As a significant number of patients in the palliative cohort received TACE in advanced disease stages (BCLC C), an ROC analysis was additionally performed exclusively for BCLC stage B (n = 182), but with comparable results (data not shown).
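The AUC values reported here can be read as the probability that a randomly chosen patient with the event has a higher score than a randomly chosen patient without it. The following sketch computes the AUC in exactly this pairwise way (equivalent to the Mann-Whitney U statistic divided by the number of pairs); the score values and the function name are hypothetical and serve only to illustrate the 70% threshold discussed above.

```python
from typing import Sequence


def auc(scores_event: Sequence[float], scores_no_event: Sequence[float]) -> float:
    """Area under the ROC curve from pairwise comparisons; ties count as one half."""
    if not scores_event or not scores_no_event:
        raise ValueError("both groups need at least one observation")
    wins = 0.0
    for e in scores_event:
        for n in scores_no_event:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(scores_event) * len(scores_no_event))


# Hypothetical score values (higher = worse prognosis).
died_within_3_years = [2, 3, 3, 1]
survived_3_years = [1, 2, 1, 2]

area = auc(died_within_3_years, survived_3_years)
print(f"AUC = {area:.2f}, above 0.70: {area > 0.70}")
```

An AUC of 0.5 corresponds to chance-level discrimination, which is why values below 0.70 are regarded here as insufficient for clinical decisions even when the associated p value is significant.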

Median overall survival (OS)
Only the BCLC score showed significance with regard to median OS in the bridging group (p = 0.002) but without decreasing survival time from stage A to stage D (Table 4; Fig. 3).

Treatment discontinuation
The bridging dataset does not include TACE sessions of patients who had to stop the general TACE therapy because of adverse events or death (Table 2). Therefore the ROC analysis concerning treatment discontinuation was only calculated in the palliative dataset (Table 6).
Receiving an unsuccessful TACE (per patient) had no influence on overall survival in the bridging cohort (p = 0.803) but did in the palliative cohort (p = 0.046).

3-years survival
In the bridging group the CLIP score reached the best AUC with a value of 60.0%, but there was no significance in the ROC analysis. Furthermore, none of the other scores reached statistical significance concerning the probability of achieving 3-years survival.

Discussion
Treatment decisions in hepatocellular carcinoma are often complex. In the context of stage migration the assessment of prognostic factors in patients with HCC is crucial for clinical management. For TACE, prognostic scores should provide adequate therapeutic guidance and avoid over-treatment or inadequate response. The aim of this study was a comparative evaluation of the reported panel of scores predicting prognosis of patients undergoing TACE.
Besides the common application of these scores in palliative treatment, the study also evaluated the applicability of these scores for patients undergoing TACE as bridging to LT. Statistical analysis showed that the different scores are not equally applicable in both datasets: in the palliative dataset, most of the scores reached statistical significance for predicting OS, whereas in the bridging dataset only the BCLC score showed significance. In contrast to the study of Hannover Medical School, no score in our analysis was equally applicable to both datasets with respect to median OS. However, a significant discriminator regarding prediction of OS between both groups was the number of successfully performed TACE procedures, which was significant in the palliative cohort (p = 0.001; Table 7) but not in the bridging cohort (p = 0.354). This is in line with the substantial number of patients suffering from advanced liver disease and enlarged tumor size in the palliative subgroup.
There is a certain selection bias due to the calculation of different endpoints per TACE rather than per patient. Nevertheless, our results for median overall survival in both datasets (independently of the subgroups of the different scores) are consistent with previous studies (Abbasi et al. 2017; Biolato et al. 2014; Groupe d'Etude et de Traitement du Carcinome Hepatocellulaire 1995; Bruix 2003, 2008). In the palliative dataset, most scores predict significant differences in median OS. Contrary to current recommendations (Hucke et al. 2014b; Sieghart et al. 2013; Yin et al. 2016), we could not validate the prognostic power of the ART score, either for the endpoint OS or for the other endpoints (3-year survival, therapy discontinuation). Various other studies have also shown that the ART score is not suitable to reflect the OS of patients undergoing TACE with palliative intention (Terzi et al. 2014; Tseng et al. 2015).
The SNACOR score also did not show any applicability for any of the endpoints in our analysis. It was developed in 2016 (Kim et al. 2016) and evaluated in one more study in 2018, in which it also failed to distinguish prognostic subgroups (Mahringer-Kunz et al. 2018). Even though the original score did not reach significance, the modified version showed a certain applicability concerning the endpoint median OS in the palliative dataset. Apart from the ART, mART and SNACOR scores, all other scores in the palliative subgroup revealed significant differences in median OS depending on their prognostic groups. These scores may stratify the prognosis of patients undergoing TACE as palliative therapy. In the multivariate analysis, only three scores independently predicted median OS of patients undergoing TACE with palliative intention: the mSNACOR, STATE and Child-Pugh scores. The applicability of the Child-Pugh score for patients undergoing TACE therapy has been validated in several studies (Brown et al. 2004; Dhanasekaran et al. 2010; El Khaddari et al. 2002; Mondazzi et al. 1994), even though there are also studies indicating that the Child-Pugh scoring system is highly subjective (Cholongitas et al. 2005; Durand and Valla 2008). Based on our analysis, we support the application of the Child-Pugh score for predicting OS in patients undergoing palliative TACE. The most applicable score to predict the probability of a later TACE discontinuation was also the Child-Pugh score. This is consistent with the laboratory and clinical markers that enter the Child-Pugh score: albumin, INR, bilirubin, encephalopathy and ascites. Although the last two are highly subjective, the Child-Pugh score seems to reflect liver synthesis best in the case of TACE therapy with palliative intention. Severely impaired liver synthesis is one of the most important reasons for treatment discontinuation besides vascular infiltration.
Due to the lack of significance of the ART score in our analysis, we do not support the recommendation of sequentially using the STATE score and the ART score to assess the prognosis of patients undergoing TACE (Hucke et al. 2014a). We can support the application of the STATE score at each TACE session for the assessment of OS in patients undergoing palliative TACE treatments. The mSNACOR is also an independent predictor of OS in the palliative setting. In general, it should be calculated in comparison to the previous TACE instead of the first TACE. Furthermore, it should be calculated at each TACE procedure instead of only at the second TACE. The SNACOR score needs further evaluation (Mahringer-Kunz et al. 2018) because, in contrast to the mSNACOR score, it did not reach statistical significance concerning the endpoint OS in the palliative dataset. The analysis showed a certain applicability of the Child-Pugh, Okuda, HAP and CLIP scores for assessing the probability of achieving 3-year survival after a TACE procedure. Nevertheless, none of the scores reached an AUC of more than 70%, which is why further evaluation or modification of the scores is needed concerning this endpoint to support clinical decision making. All four of these scores were validated in various studies, but mainly with regard to the endpoint of median OS (Allgaier et al. 1998; Dhanasekaran et al. 2010; Farinati et al. 2000; Georgiades et al. 2006; Kadalayil et al. 2013; op den Winkel et al. 2012; Pinato et al. 2016; Rabe et al. 2003). We recommend that scores be evaluated for further endpoints in addition to OS. The probability of discontinuation of TACE therapy due to adverse events (AE) or death is another important endpoint for deciding which scores are of prognostic importance. The Child-Pugh score as well as the MELD score showed the best applicability concerning AE or death in our analysis.
The MELD score is an established score especially in patients awaiting LT (Bruns et al. 2014; Kamath et al. 2001), but it may also be useful for predicting certain AE or mortality in patients undergoing TACE procedures (Hinrichs et al. 2017; Sawhney et al. 2011; Testa et al. 2003). According to our analysis, further studies that examine the relation between the MELD score before the TACE procedure and the probability of discontinuation of TACE therapy would be desirable.
In the univariate analysis of the bridging dataset only the BCLC score was a statistically significant predictor of overall survival, but in contrast to the original publication of the BCLC score (Llovet et al. 1999a), survival time did not decrease from stage A (early stage) to stage D (terminal stage), as clearly shown in Fig. 3. Child-Pugh class C is always accompanied by BCLC stage D, just as a performance status (PST) of 1 or 2 is always accompanied by BCLC stage C. Assuming that a patient has Child-Pugh class B with, e.g., 9 points at the first TACE, the score can rise to 10 points at the second TACE due to a change in a single parameter. The patient thus changes from Child-Pugh class B to C and is therefore also assigned to BCLC stage D (Llovet et al. 1999a). Accordingly, a patient may also change from BCLC stage A to stage D, because Child-Pugh class A or B does not limit the BCLC score to a specific stage, whereas Child-Pugh class C is always associated with BCLC D (Llovet et al. 1999a). However, the Child-Pugh score also includes subjective parameters (Cholongitas et al. 2005; Durand and Valla 2008), which is why this definition (Child-Pugh C = BCLC D) should be critically scrutinized for patients receiving TACE as bridging to LT. The BCLC score has been validated in several studies (Cillo et al. 2004; Llovet et al. 1999a; Marrero et al. 2005; Vitale et al. 2009; Zhao et al. 2015). The p value in the univariate analysis of our palliative dataset also suggests that the BCLC score is suitable for assessing survival of patients with TACE treatment. We do not agree with various studies that suggest that the BCLC score is generally suitable for assessing the overall survival of all patients without making any statement about the therapy indication (Dhanasekaran et al. 2010; Zhang et al. 2014).
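The class change described above, from 9 to 10 Child-Pugh points, can be made explicit with the standard point ranges (class A: 5 to 6, class B: 7 to 9, class C: 10 to 15). The sketch below is an illustration only; the helper names are hypothetical, and the second function encodes nothing more than the rule stated in the text that Child-Pugh class C is always associated with BCLC stage D.

```python
def child_pugh_class(points: int) -> str:
    """Map a Child-Pugh point total (5 to 15) to class A, B or C."""
    if not 5 <= points <= 15:
        raise ValueError("Child-Pugh points range from 5 to 15")
    if points <= 6:
        return "A"
    if points <= 9:
        return "B"
    return "C"


def implies_bclc_d(points: int) -> bool:
    """True if the Child-Pugh class alone assigns the patient to BCLC stage D."""
    return child_pugh_class(points) == "C"


# A single-point change between two TACE sessions crosses the class boundary.
first_tace, second_tace = 9, 10
print(child_pugh_class(first_tace), implies_bclc_d(first_tace))    # B False
print(child_pugh_class(second_tace), implies_bclc_d(second_tace))  # C True
```

The example shows how sensitive the BCLC assignment is to the class boundary: a one-point difference, possibly driven by a subjectively graded parameter, is sufficient to move a patient into the terminal stage.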
Our analysis does not support the statement that BCLC stage D is associated with the worst prognosis concerning OS, regardless of whether TACE is performed as bridging to LT or with palliative intention. ROC analysis in the palliative cohort reveals similar results for patients with BCLC B in comparison to all other BCLC stages for the endpoints treatment discontinuation and 3-year survival (data not shown). We therefore conclude that the performance of the scores is independent of BCLC stage. Scores do not perform better if only BCLC stage B patients are analyzed.
However, substantial differences in the performance of the various scores were evident when comparing the AUROC according to the etiology of liver disease. For the three most frequent etiologies in our cohort (viral, alcoholic and cryptogenic/NASH, in descending order), the ROC analyses for the endpoints "3 years survival" and "treatment discontinuation" were remarkably different, revealing etiology as a potential confounding factor (Tables 8, 9). Overall performance of the scoring systems seems to be best for viral etiologies, but poor in patients with alcoholic liver disease.
In general, the ROC analysis for both groups revealed that no score has sufficient selectivity to allow clear clinical decisions. This is probably influenced by the fact that the TACE procedure is still not sufficiently standardized. There are currently no standardized selection criteria for the type of intervention (conventional, DEB, biodegradable), the frequency of TACE procedures, or the different subsequent therapies (RFA, Sorafenib, BSC, etc.). According to the results of the bridging dataset, further evaluations and modifications of scores are needed, especially for patients receiving TACE procedures as bridging to LT.

Conclusion
The characteristics as well as the outcome of patients receiving TACE differ significantly depending on the therapy indication. In contrast to previous evaluations, scoring for OS after TACE should be evaluated separately for curative (LT) and palliative settings. Regarding TACE as palliative therapy, the Child-Pugh score, STATE score and mSNACOR score performed best for the prediction of median OS. In contrast to other studies, we could not validate the prognostic power of the ART score. Furthermore, the SNACOR score was only informative when serial TACE sessions were compared directly, i.e., when it was calculated as the mSNACOR.
Overall, none of the evaluated scores seems to be promising in terms of clinical decision making with respect to stage migration in both cohorts. Only the BCLC score was able to predict the OS probability in the bridging dataset, but without decreasing survival time from stage A to stage D. We conclude that further efforts are needed, especially in patients undergoing TACE as bridging to LT, to establish appropriate criteria for making valid predictions and thus support decision-making processes in daily clinical routine.