Liver cancer is among the top five most commonly occurring malignancies, ranking as the fourth highest cause of cancer-related deaths worldwide.1 Hepatocellular carcinoma (HCC) accounts for the vast majority of primary liver cancers.

For patients with sufficient liver reserve, tumor resection has been shown to improve survival.2 Resection is aimed at cure, but for 30–50% of patients, the cancer recurs within the first 2 years.3,4 Therefore, early recurrences form a major challenge because survival in this group is substantially lower and the gain from the performed surgery is less clear. Preoperative risk prediction of HCC recurrence can aid patients and doctors in deciding whether to perform major surgery. Postoperative risk prediction may help in deciding adjuvant therapy and the intensity of follow-up treatment.

In 2018, Chan et al.5 published the pre- and postoperative early recurrence after surgery for liver tumor (ERASL) models using well-established clinical parameters to categorize patients into low-, intermediate-, and high-risk groups. The ERASL models are suggested as the first in the field able to provide personalized survival predictions.

A key step before use of any risk score in clinical practice is to validate the performance of the model.6,7 Ideally, the validation process should be performed on numerous independent samples, with assessment of whether the model is correctly specified, the extent to which it can discriminate between high- and low-risk patients, and whether the predicted survival probabilities match the observed data.

Chan et al.5 have assessed the discriminatory power and calibration of the ERASL models in external validation cohorts from four countries: Japan, the United States, China, and Italy. Although the authors used external cohorts for validation, the absence of an independent validation study is a major restriction for use of the ERASL models in daily practice. Moreover, the calibration was assessed only visually, and the analysis relied heavily on categorization of the patients into risk groups. It should be stressed that the model was derived in a hepatitis B-prevalent region, and it remains to be determined how well the model generalizes to other geographic areas where other causes of liver disease are more dominant. Therefore, we performed a fully independent validation using datasets of resected HCC patients from The Netherlands and Japan.6,7

Patients and Methods

This retrospective cohort study is reported according to the critical appraisal and data extraction for systematic reviews of prediction modelling studies (CHARMS) checklist and the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guidelines (Electronic supplementary Table 1).8,9

Patients

Data were obtained from the Erasmus Medical Centre, Rotterdam, the Netherlands, and from the Okayama University Hospital, Okayama, Japan. The ethical requirements in both centers were approved (ID: MEC-2019-0498, MEC-2018-1544). The datasets contain the clinical parameters from patients with HCC who received first-time resection with curative intent.

The patients were referred either from other hospitals or from the program that routinely screened patients with chronic hepatitis B, chronic hepatitis C, or cirrhosis. In the Erasmus MC, the screening involved ultrasonography and measurement of alpha-fetoprotein (AFP), which could be combined with computed tomography (CT) or magnetic resonance imaging (MRI), every 9 to 12 months.

In Okayama, additional screening instruments included des-gamma-carboxy prothrombin (DCP) and the lectin-reactive fraction of AFP (AFP-L3). For the Okayama cohort, a default interval of 6 months was used, which was intensified to every 4 months for the patients with advanced cirrhosis.10

In both centers, patient eligibility for surgery was assessed at a multidisciplinary tumor board meeting based on performance status, liver function, and resectability of the tumor. Follow-up evaluation, including CT and laboratory assessment, generally was performed 3, 6, and 12 months after discharge and annually for a period of at least 5 years.

Recurrence-free survival (RFS), the dependent variable, was defined as the time between surgery and recurrence. Patients were censored at the date of their last radiologic examination if they had been lost to follow-up evaluation or had died without recurrence. In concordance with the derivation study, the follow-up evaluation was truncated 2 years after surgery.
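The truncation of follow-up at 2 years can be illustrated with a minimal sketch. The function name and the patient records are hypothetical and serve only to show the rule; they are not taken from the study code.

```python
def truncate_follow_up(time_months, event, horizon=24.0):
    """Administratively censor follow-up at `horizon` months.

    time_months: time from surgery to recurrence or to censoring
    event: True if a recurrence was observed, False if censored
    """
    if time_months > horizon:
        # Anything observed after the horizon counts as
        # "recurrence-free at the horizon".
        return horizon, False
    return time_months, event


# Hypothetical patients: (months of follow-up, recurrence observed?)
patients = [(6.0, True), (30.0, True), (18.0, False), (60.0, False)]
truncated = [truncate_follow_up(t, e) for t, e in patients]
```

In this sketch, a recurrence at 30 months becomes a censored observation at 24 months, so only early recurrences count as events.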

The preoperative covariates used in the ERASL scores are gender, albumin (g/l), total bilirubin (μmol/l), serum AFP (μg/l), diameter of the largest tumor (cm), and number of tumors. Microvascular invasion (MVI) is the only covariate added in the postoperative risk score, defined as tumor invasion of vessels identified on histologic microscopic examination. Patients were excluded from analysis if they had one or more missing values of these covariates. The complete case analysis and definitions of covariates were in line with those of the derivation study.5 The full specification of the albumin-bilirubin (ALBI) grade and the ERASL scores is presented in Electronic supplement 2.

Methods

The validation process consisted of three stages in which the misspecification, discrimination, and calibration were assessed using the methods and performance measures presented by Royston and Altman,11 Rahman et al.,12 van Houwelingen,13 and Steyerberg,14 and summarized in Steyerberg.15 In the following description, the linear predictor (LP) is the linear combination of the covariates and associated weights published by Chan et al.5 A patient’s risk score is the scalar value resulting from the evaluation of this patient’s LP.

Model Validity

As an overall test to determine whether the relative risks were correctly specified, the calibration slope was computed. The measure was calculated by performing a Cox proportional hazards (CPH) regression with the LP as the only covariate. With this measure, a coefficient sufficiently close to 1 provides the first evidence that the model is correctly specified.13
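As an illustration of the calibration slope, the following sketch fits a Cox model with the LP as the only covariate by Newton-Raphson on the Breslow partial likelihood. It is a plain-Python stand-in, not the R code used in the study, and the toy data are invented and say nothing about the ERASL models.

```python
import math


def calibration_slope(times, events, lp, n_iter=25, tol=1e-9):
    """Cox regression with the linear predictor as the only covariate.

    Returns the fitted coefficient (the calibration slope).
    """
    beta = 0.0
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for i, (t_i, d_i) in enumerate(zip(times, events)):
            if not d_i:
                continue  # censored patients contribute only via risk sets
            # Risk set: everyone still under observation at time t_i.
            risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
            w = [math.exp(beta * lp[j]) for j in risk]
            s0 = sum(w)
            s1 = sum(w_j * lp[j] for w_j, j in zip(w, risk))
            s2 = sum(w_j * lp[j] ** 2 for w_j, j in zip(w, risk))
            score += lp[i] - s1 / s0                 # first derivative
            info += s2 / s0 - (s1 / s0) ** 2         # minus second derivative
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Invented toy data: months to event or censoring, event flag, risk score.
times = [2, 4, 5, 7, 9, 12, 15, 20, 24, 24]
events = [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]
lp = [2.5, 1.8, 2.0, 2.2, 1.0, 1.5, 1.9, 0.8, 1.2, 0.5]
slope = calibration_slope(times, events, lp)
```

A slope below 1 would suggest that the published weights spread patients too far apart for the new cohort, and a slope above 1 would suggest the opposite. The sketch omits step-halving and standard errors, which a production routine would provide.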

Subsequently, we investigated the extent to which the coefficients of individual covariates would differ if they were re-estimated in the validation cohort. In these regressions, a CPH model was estimated in which all the individual covariates were added alongside the LP as an offset variable, with its coefficient constrained to 1. The estimated coefficients represented the differences in log hazard ratios between the derivation and validation cohorts. A likelihood ratio test was used to assess whether the estimated coefficients jointly were significantly different from zero.

Discrimination

We evaluated the same performance metrics used by Chan et al.5 to aid the comparison, and included Harrell’s C-index, Gönen and Heller’s K, Royston and Sauerbrei’s \(R^2_D\), and the time-dependent area under the receiver operating characteristic curve (tdAUC).
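To make the first of these measures concrete, the sketch below computes Harrell's C-index for right-censored data in plain Python. It is an illustrative implementation with invented inputs, not the routine from the R packages used in the study.

```python
def harrell_c(times, events, risk):
    """Harrell's C-index: share of usable pairs ranked correctly."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when patient i has an observed recurrence
            # strictly before patient j's recurrence or censoring time.
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # earlier recurrence, higher score
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied scores count as half
    return concordant / usable


# Invented example: the third patient is censored at 15 months.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
c_perfect = harrell_c(times, events, [4.0, 3.0, 2.0, 1.0])
c_reversed = harrell_c(times, events, [1.0, 2.0, 3.0, 4.0])
c_constant = harrell_c(times, events, [1.0, 1.0, 1.0, 1.0])
```

A value of 0.5 corresponds to a score that ranks patients no better than chance, and 1.0 corresponds to perfect ranking.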

Calibration

Two types of calibration plots were used to display the extent to which the predictions matched the observed data. First, the average predicted survival probabilities were superimposed on the Kaplan-Meier curve for each risk group.11 In a second plot, the predicted survival probabilities 1 and 2 years after surgery were plotted against the Kaplan-Meier estimates at these time points and compared against the 45° line. Both calibration plots rely heavily on arbitrarily formed risk groups and do not quantify the lack of fit. Hence, the ERASL models were embedded in a Weibull calibration model (Eq. 1). The parameter μ represents the accuracy of the overall risk level, with γ representing the impact of the LP and σ representing the shape of the baseline hazard. The variable T* represents the event time (t) transformed using the cumulative baseline hazard function, so that T* = −ln S0(t), where S0(t) is the baseline survival function. It is assumed that the error term \(W\) follows a type 1 extreme value distribution.13

$$\ln\left(T^{*}\right)=\mu +\gamma \,\mathrm{LP}+\sigma W$$
(1)

Thereafter, the Weibull model was used to achieve recalibrated survival probabilities using the following equation:

$$S{\left(t\mid \mathrm{LP}\right)}_{cal}=P\left[T>t\mid \mathrm{LP}\right]=\exp\left(-\exp\left(\frac{1}{\sigma }\left(\ln\left(-\ln\left({S}_{0}\left(t\right)\right)\right)-\mu -\gamma \,\mathrm{LP}\right)\right)\right)$$
(2)
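Equation 2 translates directly into a small function. The sketch below is illustrative only: the function and argument names are assumptions, and the numeric inputs are invented rather than taken from the estimated calibration parameters.

```python
import math


def recalibrated_survival(s0_t, lp, mu, gamma, sigma):
    """Equation 2: recalibrated probability of remaining recurrence-free.

    s0_t: baseline survival S0(t) at the time of interest, in (0, 1)
    lp:   the patient's linear predictor
    mu, gamma, sigma: parameters of the Weibull calibration model (Eq. 1)
    """
    log_cum_hazard = math.log(-math.log(s0_t))  # ln(-ln(S0(t)))
    return math.exp(-math.exp((log_cum_hazard - mu - gamma * lp) / sigma))


# With mu = 0, gamma = -1 and sigma = 1, Eq. 2 reduces to the usual
# Cox prediction S0(t) ** exp(LP), i.e. no recalibration is applied.
uncorrected = recalibrated_survival(0.8, 0.5, mu=0.0, gamma=-1.0, sigma=1.0)
# A negative mu lowers the predicted recurrence-free survival.
lowered = recalibrated_survival(0.8, 0.5, mu=-1.0, gamma=-1.0, sigma=1.0)
```

Under this parameterization, the reference values are μ = 0, γ = −1, and σ = 1. Estimates away from these values indicate how far the original predictions are from the observed data.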

Model Updating

We used forward selection, starting with a CPH model that contained only the LP. Thereafter, in successive rounds, the covariate with the smallest p value was added to the model. To investigate the impact of hepatitis B and C infections, these were added one by one to the model, with the coefficient of the LP constrained to 1.

Statistical Software

Data manipulations were performed in Python 3.7,16 and statistical analysis was performed in R version 3.5.1,17 using the following packages: survival, rms, survAUC, survcomp, and boot.18,19,20,21,22 The R code with detailed comments is supplied in Electronic supplemental file 2.

Results

Patient Cohorts

The Rotterdam cohort comprised data from 312 patients collected from January 2000 through December 2017. Missing data included 25 albumin, 11 total bilirubin, 12 AFP, 3 tumor size, 3 tumor number, and 27 MVI values. For the validation of the ERASL-pre model, 33 patients (11%) were excluded due to missing data on at least one of these variables. For the validation of the ERASL-post model, the data for 53 patients (17%) were excluded. Ultimately, data for 279 and 259 patients were eligible to be analyzed for the ERASL-pre and ERASL-post models, respectively.
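The complete-case counts above can be checked with a few lines of arithmetic. The helper name is hypothetical, and the inputs are the totals and exclusions reported for the Rotterdam cohort.

```python
def complete_case_summary(total, excluded):
    """Return (patients remaining, percentage excluded rounded to integer)."""
    remaining = total - excluded
    pct_excluded = round(100 * excluded / total)
    return remaining, pct_excluded


pre = complete_case_summary(312, 33)    # ERASL-pre model
post = complete_case_summary(312, 53)   # ERASL-post model
```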

In Rotterdam, disease recurrence rate and survival status of the patients were last updated in February 2020. Recurrence was experienced by 164 of the 279 patients analyzed for the ERASL-pre model. For 116 of these patients, the recurrence developed during the first 2 years after surgery. The median follow-up period was 5 years, with 77% of the patients followed up for at least 2 years.23

Of the 259 patients analyzed for the ERASL-post model, 157 were found to have recurrence. For 110 of these patients, the recurrence developed during the first 2 years after surgery. The follow-up period for 78% of these patients was at least 2 years, with a median follow-up period of 5 years.

The Okayama dataset comprised patient data collected between January 2007 and December 2017 for 392 patients. This dataset had no missing values. The disease recurrence rate and survival status were last updated in February 2020. A total of 196 patients had disease recurrence, with 139 of the patients experiencing recurrence in the first 2 years after surgery. The median follow-up period was 5 years, with 85% of the patients followed up for at least 2 years.

Baseline Comparability

Baseline characteristics are summarized in Table 1. Information from the Hong Kong derivation cohort has been added to aid comparison. In the Okayama cohort, the cause of the HCC was most often ascribed to hepatitis C (47%), whereas in the Hong Kong cohort, hepatitis B (84%) was most prominent. In the Rotterdam cohort, hepatitis infections occurred less often, including hepatitis B in 25% and hepatitis C in 15% of the patients. In Rotterdam, the median tumor size was larger with 59 mm versus 35 mm in Okayama.

Table 1 Baseline characteristics of the Rotterdam, Okayama, and original derivation cohorts

Major resection and resection combined with radiofrequency ablation were both more common in the Rotterdam cohort (Electronic supplementary Table 2). Postoperatively, MVI differed as well, with a 58% rate for the Rotterdam cohort, a 29% rate for the Okayama cohort, and a 27% rate for the Hong Kong cohort (Table 1). Differences were found in the time until recurrence, with a median of 25 months in the Rotterdam cohort, 48 months in the Okayama cohort, and 66 months in the Hong Kong cohort. This finding also was reflected in the baseline survival functions (Electronic supplementary Figure 1). In the Okayama cohort, recurrences were more often intrahepatic than in the Rotterdam cohort. Finally, treatment of HCC recurrence varied between centers. Most notable was the more frequent use of transarterial chemoembolization (TACE) (Okayama 60% vs Rotterdam 12%) and chemotherapy (Okayama 40% vs Rotterdam 23%) in the Okayama cohort.

In the Rotterdam cohort, the mean ERASL-pre and ERASL-post scores were 2.0 ± 0.70. In the Okayama cohort, this value was 2.1 ± 0.88 for the ERASL-pre score and 1.9 ± 0.90 for the ERASL-post score. The medians in the Hong Kong derivation cohort differed from those published for the pre- and post-scores. Furthermore, for both the pre- and post-scores, the Rotterdam distributions were symmetric, whereas those for the Hong Kong and Okayama cohorts were skewed to the right.

The ERASL-pre model assigned only four patients of the Rotterdam cohort to the high-risk group (Table 2). Furthermore, the differences between risk groups in terms of median survival and hazard ratios were greater overall in the Okayama cohort than in the Rotterdam cohort. In addition, the differences between the risk groups increased as information regarding the MVI was added in the ERASL-post score.

Table 2 Median recurrence-free survival (RFS) and hazard ratio (HR) for each risk group

Model Validity

All discrimination measures were higher in the Okayama cohort than in the Rotterdam cohort. Furthermore, all discrimination measures were higher for the ERASL-post score than for the ERASL-pre score (Table 3). The ERASL-pre model attained a C-index of 0.57 (95% CI 0.51–0.63) in the Rotterdam cohort, whereas in the Okayama cohort, a C-index of 0.69 (95% CI 0.65–0.73) was found.

Table 3 Measures of discrimination

Significant differences in the prognostic effects were found for both the ERASL-pre and ERASL-post models in the Rotterdam cohort, and for the ERASL-pre model in the Okayama cohort. The slope for the preoperative model in the Rotterdam cohort deviated the most, with a value of 0.32 (95% CI 0.04–0.59) (Electronic supplementary Table 3). Specifically, for both the Rotterdam and Okayama cohorts, the impact of gender was significantly smaller. Additionally, the impact of an ALBI grade greater than 1 was significantly smaller in the Rotterdam cohort (Electronic supplementary Table 4) (Fig. 1).

The ERASL models systematically overestimated the RFS for the low- and intermediate-risk groups (Fig. 2). The results from the recalibration confirmed the mismatch in overall risk level, with μ coefficients ranging from −2.21 to −0.83, all significantly different from zero (p < 0.001) (Electronic supplementary Table 5). The exaggerated impact of prognostic factors in the LP also was confirmed, with γ coefficients ranging from −0.90 to −0.39, all significant (p < 0.001). After these coefficients were used to recalibrate the model, the predictions matched the observed Kaplan-Meier curves much more closely. The calibration plots confirmed that the recalibration mainly corrected this optimism (Electronic supplementary Figure 2).

Fig. 1

Distribution of early recurrence after surgery for liver tumor (ERASL) risk scores. Distributions of the ERASL pre- and post-risk scores in the Hong Kong derivation cohort and in the Rotterdam and Okayama validation cohorts. The scores are centred on the median values described in the paper by Chan et al.5 In each histogram, the left black line represents the 50th percentile, and the right black line represents the 85th percentile.

Fig. 2

Calibration plot. The smooth solid lines represent the average predictions per risk group from the original model. The dashed curves represent the calibrated survival probabilities.

Model Extension

Addition of hepatitis B and C infections to the LP did not achieve significance in either the pre- or postoperative models for either cohort (Electronic supplementary Table 6). Modification of risk score coefficients also was investigated. Starting with only the LP, variables were added in a forward selection manner. For the preoperative model in the Rotterdam cohort, ln(AFP) (0.08; 95% CI 0.02–0.14; p = 0.02) was significantly different from zero (Electronic supplementary Tables 7 and 8).

In the postoperative setting for the Rotterdam cohort, micro-vascular invasion (0.64; 95% CI 0.12–1.16; p = 0.02) and ln(AFP) (0.07; 95% CI 0.01–0.13; p = 0.03) were significant. For the Okayama cohort, the only variable achieving significance in the pre- and postoperative models was gender, with respective coefficients of −0.67 (95% CI −1.12 to −0.22; p < 0.001) and −0.62 (95% CI −1.06 to −0.18; p < 0.001).

Discussion

This study assessed the validity of the ERASL models in independent cohorts to evaluate their applicability in daily practice. The ERASL models quantify the likelihood that a patient with HCC will experience early recurrence after resection. Preoperatively, this information may help decision-making when the risk of major surgery should be balanced against the risk of early recurrence. Postoperatively, the model enables clinicians to provide the appropriate surveillance to detect recurrent HCC and additional treatment, including re-resection or salvage transplantation.24,25,26

The most important aspect of a model’s performance is its discriminatory power to separate low- and high-risk patients. Our evaluation of discriminatory power, as reported in Table 3, can best be viewed in relation to the results published by Chan et al.5 The model performed worst in the Rotterdam cohort (C-index, 0.57), with a C-index similar to that for the Italian validation cohort (0.60) and substantially lower than that for the Hong Kong derivation cohort (0.71). In the European cohorts, the discriminatory power of the risk score can be considered low (i.e., C-index ≤ 0.6). In contrast, the models in the Okayama cohort (C-index, 0.69) almost achieved the same level attained in the derivation dataset and discriminated well compared with other models.

Apart from a low discriminative performance in the Western cohort, we found that the original models were poorly calibrated for both the Western and Eastern cohorts. The high-risk group appeared to fit better, although the number of cases supporting the Kaplan-Meier curve in this group was minimal. Poor calibration caused the original ERASL models to exaggerate the difference in survival between risk groups and to systematically overestimate the RFS. This systematic bias also was visible in all validation cohorts presented in the supplements of the derivation paper, confirming our results.5 Using the Weibull calibration model, for each cohort and model, we estimated three parameters to quantify and correct the calibration. However, we note that the calibration parameters need to be validated in turn before wider adoption.

Another point of interest is that the published 50th and 85th quantiles on which the risk score thresholds are based did not match the quantiles of the derivation cohort. The proportions of patients assigned to the intermediate- and high-risk groups were therefore smaller than the intended 35% and 15%. This also held for the other validation cohorts described in the derivation study.5 Therefore, the summary statistics describing the high-risk group are less stable and warrant a different interpretation because they describe even more extreme cases.
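The effect of thresholds that do not match a cohort's own quantiles can be shown with a small sketch. The scores and the fixed cut-offs below are hypothetical and are not the published ERASL thresholds. The aim is only to show how cut-offs set above the empirical 50th and 85th percentiles shrink the intermediate- and high-risk groups.

```python
import statistics
from collections import Counter


def empirical_cutoffs(scores):
    """Return the 50th and 85th percentiles of the observed risk scores."""
    # With n=100, statistics.quantiles returns the 1st to 99th percentiles.
    pct = statistics.quantiles(scores, n=100, method="inclusive")
    return pct[49], pct[84]


def risk_group(score, cut_50, cut_85):
    """Assign a risk group from two cut-offs on the risk score."""
    if score <= cut_50:
        return "low"
    if score <= cut_85:
        return "intermediate"
    return "high"


def group_counts(scores, cut_50, cut_85):
    return Counter(risk_group(s, cut_50, cut_85) for s in scores)


scores = [float(i) for i in range(1, 21)]        # hypothetical risk scores
own = group_counts(scores, *empirical_cutoffs(scores))
shifted = group_counts(scores, 12.5, 18.5)       # hypothetical fixed cut-offs
```

With the cohort's own percentiles, the 20 hypothetical patients split into 10 low-, 7 intermediate-, and 3 high-risk cases. With the higher fixed cut-offs, the split is 12, 6, and 2.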

Regarding the prognostic profiles, the right skewness of the Japanese cohort matched that of the derivation cohort, whereas in the Rotterdam cohort a more symmetric distribution was observed. Consequently, in the Rotterdam cohort, fewer patients were assigned to the high-risk group than in the Okayama and Hong Kong cohorts. Interestingly, in the Rotterdam cohort, the risk of early recurrence was found to be the highest of all three cohorts. This mismatch between few high-risk predictions and high rates of early recurrence underscores that the models lack sensitivity and cannot be used in daily practice for Western patients.

A candidate risk factor that might explain this difference is the presence of hepatitis B or C. In the current study, the proportion of patients presenting with hepatitis B or C strongly differed between the cohorts. In both the offset regressions and the forward selection procedure, however, neither of these variables was significant. It therefore appears that although hepatitis B and C are important factors for diagnosis and treatment, they do not accurately reflect the severity of HCC after the other variables in the ERASL model have been taken into account.

To explore directions for further research, we re-estimated variables that have already been incorporated. For the Okayama cohort, we found that the coefficient for gender differed significantly from zero in both the pre- and postoperative settings using offset regression and the forward selection procedure. The suggested modification almost completely negated the effect of the gender covariate used in the ERASL models. This result confirms the concern raised earlier by Zhang et al.27 in their letter to the editor, in which they were surprised that gender was such a strong predictor. They performed a multi-center study in which they found similar rates of early recurrence between males and females (43.3% vs 42.0%; p = 0.728).

The misspecification tests for the Rotterdam cohort in the postoperative setting were less clear. Whereas the gender, ALBI grade, and tumor size covariates were significant in the offset regression, the covariates for ln(AFP) and micro-vascular invasion were significant when the forward selection procedure was followed. In the latter, the changes in hazard ratio were substantial, with an additional 8% risk increase per unit of ln(AFP) and an additional 89% risk increase for MVI. It is remarkable that the higher impact of MVI in the Rotterdam cohort was paired with a high incidence.
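The conversion from a Cox coefficient to a percentage change in hazard is exp(coefficient) − 1. The sketch below applies it to the postoperative Rotterdam coefficients reported in the Results. Because those coefficients are rounded to two decimals, the output can differ by about one percentage point from the figures quoted in the text.

```python
import math


def percent_change_in_hazard(coef):
    """Percentage change in hazard implied by a Cox coefficient."""
    return (math.exp(coef) - 1.0) * 100.0


# Rounded coefficients as reported for the ERASL-post model in Rotterdam.
mvi_change = percent_change_in_hazard(0.64)   # micro-vascular invasion
afp_change = percent_change_in_hazard(0.07)   # per unit of ln(AFP)
print(f"MVI: {mvi_change:.1f}%  ln(AFP): {afp_change:.1f}%")
```

Because the LP enters these models with its coefficient fixed, the changes apply on top of the effect already built into the ERASL score.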

High incidence of MVI also was found in the validation cohorts from the United States and Italy. Because the higher risk was paired with a higher incidence in Western cohorts, our results may reflect differences in the timing of the diagnosis and underlying tumor biology between Eastern and Western cohorts rather than differences in definition.28 This hypothesis is further supported by the fact that the median tumor size in the Rotterdam cohort was almost double that in the Okayama cohort. In addition, early recurrences occurred more often in Rotterdam, and when recurrence was found, it was less often confined to the liver.

Although our research was not designed to inspect East–West differences, we found that the Okayama surveillance protocol was more intense, and we speculate that referring doctors might be more aware of HCC because the incidence is higher in the East.28,29,30,31

The effect of the differences in timing also translates into the RFS. The median RFS was 2 years in the Rotterdam cohort and 4 years in the Okayama cohort, whereas the median RFS for the Hong Kong derivation cohort was even longer (5.5 years). This sizeable difference was observed in all other validation cohorts published by Chan et al.5 and raises questions about the patient selection in the Hong Kong derivation cohort. The authors have not mentioned this result or investigated its origin. The impact on the predicted survival probabilities remains unclear. Although the survival data were censored at 24 months, the excellent long-term survival likely affected the baseline survival function. Because the baseline survival function is key in forming the predictions, it also affects the accuracy of the prediction model.

Finally, it is important to note that our study had several limitations. First, the analysis was performed on validation cohorts with limited sample sizes. In particular, conclusions for the high-risk group, clinically the most relevant, may be unstable. Second, we recognize that the mechanisms for missing data might have differed across cohorts and that the complete case setup results in biased estimates if the data are not missing completely at random. However, following the derivation paper, we decided not to use multiple imputation techniques.

Although outside the scope of our research, model extension is needed to explain the differences in discriminatory power between Eastern and Western cohorts because they are clearly distinct. Also, we did not investigate the adequacy of the non-parametric baseline hazard. Parametric baseline hazard functions may improve the efficiency of the model.32 Additionally, stratification of the baseline hazard, dynamic covariates, and time-varying coefficients might prove to be fertile ground for improving the model. Finally, future research should focus on the implementation of prediction models into clinical decision-making. Arbitrary risk groups or abstract survival probabilities might prove to be hard for patients and doctors to incorporate intuitively into their decisions. Currently, a framework for translating predictions into care is lacking.

Conclusions

In summary, this study showed that the discrimination of the ERASL models may be poorer for Western patients than for Japanese patients, for whom the models performed well. The ERASL models require recalibration before they are used for individual risk prediction. We conclude that a new model needs to be developed that explains the East–West difference or is representative of Western patients.