Introduction

Liver cancer is the second most frequent cause of cancer-related death worldwide [1] and the fifth leading cause in the United States [2]. In contrast to most other cancer types, liver cancer incidence and mortality rates are rising [3, 4]. Hepatocellular carcinoma (HCC) is the most prevalent form of primary liver cancer, accounting for 70–85% of liver cancers globally [5]. Imaging is a critical tool for diagnosing and staging HCC. Magnetic resonance imaging (MRI) provides high spatial resolution and allows soft-tissue characterization of the liver. The Liver Imaging Reporting and Data System (LI-RADS) [6] was developed for non-invasive diagnosis based on imaging criteria, and in most cases HCC can be detected and diagnosed using contrast-enhanced multiphasic imaging without an invasive biopsy [7].

Various scoring and staging systems have been proposed to quantify mortality risk in HCC and to stratify patients into risk categories [8,9,10,11,12,13,14]. However, these systems use only limited one-dimensional tumor size measurements [9,10,11,12,13], do not exploit quantitative imaging biomarkers, and stratify patients into risk groups based on strict thresholds. One-dimensional tumor size measurements have major limitations in reflecting tumor viability, actual tumor size, and growth potential [15] and are subject to inter-rater variability. Quantitative imaging biomarkers, such as radiomic features, can be extracted from regions and volumes of interest in medical imaging data; however, automated segmentation methods are required for integration into clinical workflows. Fully automated whole-liver segmentation based on deep learning has demonstrated robust, reproducible, and generalizable segmentation performance across disease stages with substantially altered liver morphology, with processing times on the order of seconds [16].

The aim of this study was twofold: (1) to develop and validate an automated machine learning method for mortality risk quantification in HCC patients using only routinely available standard-of-care clinical variables and radiomic features derived from automated liver segmentations on baseline multiphasic contrast-enhanced MRI, and (2) to use the resulting risk score to stratify patients into low-, intermediate-, and high-risk groups.

Materials and methods

Compliance with ethical standards

This HIPAA-compliant study was approved by the Yale School of Medicine institutional review board with full waiver of consent and conducted in accordance with the Declaration of Helsinki.

Code availability

All code for the methodological implementation and the trained framework is publicly available on GitHub: https://github.com/OnofreyLab/hcc-mortality-risk

Data availability

Patient data and imaging data used in this paper cannot be shared publicly due to legal reasons.

Patient inclusion and exclusion

This retrospective study identified all patients with treatment-naïve HCC treated at our institution between 2008 and 2019. HCC was proven either by imaging criteria or by biopsy. We included

  (i) all patients >18 years old

  (ii) who had multiphasic contrast-enhanced MRI at the time of diagnosis.

We excluded all patients

  (i) with missing clinical information,

  (ii) without a triphasic MRI acquisition, or

  (iii) with a non-diagnostic MRI.

The patients were randomly sampled into independent cohorts, with 85% for development and 15% for independent validation.

Clinical data

Clinical data were collected from the hospital’s electronic health record system, and conventional clinical staging scores [8,9,10,11,12,13,14] were calculated. Patients were evaluated for extrahepatic metastases on chest computed tomography (CT) and bone scans at the date of diagnosis, and the following laboratory values were collected closest to the date of imaging: alpha-fetoprotein (AFP), total bilirubin, direct bilirubin, serum albumin, international normalized ratio (INR), partial thromboplastin time (PTT), sodium, and creatinine.

MRI acquisition and radiomics extraction

Multiphasic contrast-enhanced MRI was acquired using a standard institutional imaging protocol with T1-weighted breath-hold sequences before contrast administration (pre-contrast phase) and at 12–18 s (late arterial phase), 60–70 s (portal venous phase), and 3–5 min (delayed phase) after contrast injection. The scans were de-identified and downloaded from the picture archiving and communication system (PACS) server. Subsequently, automated image co-registration and whole-liver segmentation were performed using a convolutional neural network [17], followed by radiomic feature extraction. Full details on the image processing pipeline can be found in Supplement 1.
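Several of the most important features in this study are first-order statistics of Laplacian-of-Gaussian (LoG) filtered images (see the feature naming convention in Fig. 3). As an illustrative sketch only, not the study's pipeline (which is detailed in Supplement 1), such features can be computed as follows; the function name, synthetic volume, mask, and sigma value are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def log_firstorder_features(volume, mask, sigma=1.0):
    """Illustrative first-order radiomic features of a LoG-filtered volume.

    The LoG filter emphasizes edges and texture at a spatial scale set by
    sigma; statistics are computed only over voxels inside the liver mask.
    """
    filtered = gaussian_laplace(volume.astype(np.float64), sigma=sigma)
    voxels = filtered[mask > 0]
    hist, _ = np.histogram(voxels, bins=32)
    p = hist[hist > 0] / voxels.size          # discrete intensity distribution
    return {
        f"log-sigma-{sigma}_firstorder_Mean": float(voxels.mean()),
        f"log-sigma-{sigma}_firstorder_StdDev": float(voxels.std()),
        f"log-sigma-{sigma}_firstorder_Entropy": float(-(p * np.log2(p)).sum()),
    }

# Synthetic 3D "MRI phase" volume and liver mask for demonstration only
rng = np.random.default_rng(0)
vol = rng.normal(100.0, 20.0, size=(16, 64, 64))
liver = np.zeros(vol.shape, dtype=np.uint8)
liver[4:12, 16:48, 16:48] = 1

features = log_firstorder_features(vol, liver, sigma=1.0)
```

In the full pipeline, analogous features would be extracted from each of the four contrast phases, yielding feature names such as "log-sigma-1.0_firstorder_Mean_art".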

Survival model and graded prognostic assessment

A random survival forest (RSF) [18] is a non-parametric method that uses an ensemble of survival trees to analyze right-censored time-to-event data. In this study, we fit an RSF with overall survival (OS) as the dependent variable and the combination of all clinical and radiomic variables as independent variables (termed "candidate variables"). OS was defined as the period between the imaging date and the date of death from any cause. Patients were censored at their last date of follow-up, at the end-of-observation date, or at the date of liver transplantation. The following RSF settings were used: 1000 estimator trees were grown, with a minimum of 10 samples required to split an internal node, 15 samples required at a leaf node, and \(\sqrt{n}\) variables considered when searching for the best split.

Variable selection proceeded in three steps. First, candidate variables with a Pearson correlation coefficient > 0.9 in the development cohort were excluded to reduce multicollinearity. Second, we fit the RSF on the remaining candidate variables and obtained a ranked list of variable importance scores by ten permutations of random shuffling. Third, we re-fit the final survival model on the 30 most important variables. To aid interpretability of the proposed method, we again computed variable importance scores, by ten permutations of random shuffling, for all included variables. Our method was implemented in Python (v3.8.11) using the scikit-survival package (v0.16.0).

Three risk groups were defined from the risk score predictions of the proposed model using the "rhier" function of the R package "rolr" (v1.0), which applies a hierarchical method based on ordered logrank tests [19], to stratify patients into low-, intermediate-, and high-risk groups. The risk score cutoffs for each risk group were derived from the development cohort, with a minimum of 25 subjects per risk group. The same cutoff values were then applied for risk group stratification in the independent validation cohort.
For performance evaluation of the proposed model and conventional clinical staging scores, we calculated Harrell’s C-index [20] and the area under the time-dependent receiver operating characteristic curve (AUC) [21] at 1–5 years after the date of imaging.
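Harrell's C-index measures the fraction of comparable patient pairs whose predicted risks are correctly ordered with respect to observed survival. A minimal reference implementation, an illustrative sketch rather than the scikit-survival routine used in the study, is:

```python
import numpy as np

def harrell_cindex(time, event, risk):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when patient i has an observed death and
    patient j survives longer; it is concordant when i also has the higher
    predicted risk. Ties in risk count as 0.5.
    """
    time, event, risk = map(np.asarray, (time, event, risk))
    concordant = comparable = 0.0
    for i in range(len(time)):
        if not event[i]:
            continue                     # pairs are anchored on observed deaths
        for j in range(len(time)):
            if time[j] > time[i]:        # j outlived i -> comparable pair
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy example: predicted risk perfectly ordered against survival -> C-index 1.0
t = [2.0, 5.0, 7.0, 9.0]
e = [1, 1, 0, 1]                          # 1 = death observed, 0 = censored
r = [4.0, 3.0, 2.0, 1.0]
c = harrell_cindex(t, e, r)               # -> 1.0
```

A C-index of 0.5 corresponds to random ordering, and 1.0 to perfect risk discrimination.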

Statistical analysis

Statistical analyses were conducted in SPSS (v27), R (v4.1.1), and Python (v3.8.11). p values < .05 were considered statistically significant, and Bonferroni correction was applied when comparing multiple groups. Cox proportional hazards regression was used to determine the association of the developed risk score and the proposed risk groups with OS. Median OS (mOS) was calculated, and Kaplan-Meier survival curves were plotted for each risk group and compared using a logrank test. Survival rates of the proposed risk groups were calculated at 1, 3, and 5 years after the date of imaging. To assess the generalizability of the risk groups' survival times to new cohorts, the survival times of each risk group were compared between the development and validation cohorts using a logrank test.
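The Kaplan-Meier curves referenced above are product-limit estimates of the survival function. A minimal sketch of the estimator on toy data is shown below; this is illustrative only (the study computed its survival curves in R), and the toy follow-up times are assumptions:

```python
import numpy as np

def kaplan_meier(time, event):
    """Product-limit (Kaplan-Meier) estimate of the survival function.

    Returns the distinct observed death times and the estimated survival
    probability S(t) just after each of them.
    """
    time, event = np.asarray(time, float), np.asarray(event, bool)
    death_times = np.unique(time[event])
    surv, s = [], 1.0
    for t in death_times:
        at_risk = np.sum(time >= t)              # neither dead nor censored before t
        deaths = np.sum((time == t) & event)
        s *= 1.0 - deaths / at_risk              # multiply conditional survival
        surv.append(s)
    return death_times, np.array(surv)

# Toy cohort: follow-up in months; event 1 = death observed, 0 = censored
t = [6, 6, 7, 10, 13, 16, 22, 23]
e = [1, 1, 0, 1, 0, 1, 1, 0]
times, s = kaplan_meier(t, e)
# times -> [6, 10, 16, 22]; s -> [0.75, 0.6, 0.4, 0.2]
```

Risk groups are then compared by applying a logrank test to such curves, as done for the low-, intermediate-, and high-risk groups in this study.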

Results

Patient characteristics

A total of 555 patients (mean age, 63.8 years ± 8.9 [standard deviation]; 118 females) with treatment-naïve HCC and multiphasic contrast-enhanced MRI at the time of diagnosis were included in the study. Patients without MRI at baseline (n = 501), aged < 18 years (n = 2), with missing clinical information (n = 16), without triphasic image acquisition (n = 84), or with non-diagnostic MRI (n = 14) were excluded (Fig. 1). Patient baseline characteristics are summarized in Table 1, and MRI parameters are reported in Supplemental Table 1. HCC was proven either by imaging criteria or by histopathology.

Fig. 1

Flowchart of patient inclusion and exclusion. From an institutional database with 1172 patients, 555 patients (118 females, 437 males, 63.8 ± 8.9 years) with imaging- or histopathologically proven treatment-naïve hepatocellular carcinoma and baseline multiphasic contrast-enhanced magnetic resonance imaging at the time of diagnosis were included in the study

Table 1 Patient baseline characteristics

A total of 287 (51.7%) patients died after a median time of 14.40 months (range, 0.20–97.12 months; interquartile range (IQR), 22.23) after the date of imaging, and patients were followed up for a median of 32.47 months (range, 0.20–118.90 months; IQR, 61.5) after the date of imaging. The median time between the laboratory results and the imaging date was 10 days (IQR, 28.3). First treatments based on the institution's multidisciplinary tumor board decisions were as follows: 192 (34.6%) patients underwent transarterial chemoembolization, 138 (24.9%) thermal ablation, 82 (14.8%) hepatectomy, 68 (12.3%) a combination of transarterial chemoembolization and thermal ablation, 24 (4.3%) sorafenib, 24 (4.3%) best supportive care, 20 (3.6%) transarterial radioembolization with yttrium-90, and 7 (1.3%) liver transplantation. For model development and validation, a total of 471 (85%) patients were randomly allocated to the development cohort and 84 (15%) to the independent validation cohort.

Survival model

Figure 2 summarizes the entire model development pipeline. The proposed model attained C-indices of 0.8503 and 0.8234 in the development and validation cohorts, respectively. Table 2 summarizes all performance metrics for the proposed model and the conventional clinical staging systems. For interpretability of the proposed model, Fig. 3 depicts the variable importance scores of the included variables. On average, the proposed framework required a running time of 1.11 min per patient (automated liver segmentation, 0.70 s; extraction of the 23 included radiomic features, 1.09 min; model prediction, 0.42 s).

Fig. 2

Model development. An automated liver segmentation framework was adopted for radiomic feature extraction after automated image co-registration. To predict overall survival, a random survival forest was fit from a combination of clinical and radiomic variables. Model performance was evaluated using Harrell’s C-index and the area under the time-dependent receiver operating characteristic curve (AUC). Patients were stratified into low-, intermediate-, and high-risk groups based on their predicted risk scores

Table 2 Performance evaluation
Fig. 3

Variable importance scores. The bar chart shows the mean variable importance score (error bars show standard deviation) of each included variable of the final risk prediction model obtained by 10 permutations of random shuffling. Naming convention of radiomic features: The prefix specifies the image type (original image or filter-derived MR image (“log”: Laplacian of Gaussian) with extraction parameters); the suffix specifies the MR contrast phase (“_pre”: pre-contrast phase, “_art”: late arterial phase, “_pv”: portal venous phase, “_del”: delayed phase). Equations for the calculation of each radiomic feature are available in ref. [22]. (AFP: Alpha-fetoprotein; INR: international normalized ratio; PTT: partial thromboplastin time)

Mortality risk predictions and graded prognostic assessment

The distribution of risk scores in the development and validation cohorts is shown in Fig. 4. The mean (± standard deviation) predicted risk score was 121.57 (± 65.31) in the development cohort and 135.64 (± 67.59) in the validation cohort. Cox proportional hazards regression analysis showed a highly significant association between the predicted risk score and OS in the development cohort (coefficient, 0.021658 (p < .00001); HR, 1.022 (95% CI: 1.020, 1.024)) and in the validation cohort (coefficient, 0.021676 (p < .00001); HR, 1.022 (95% CI: 1.016, 1.028)). The cutoff values determined by the hierarchical method to stratify patients into low-, intermediate-, and high-risk groups based on the proposed model's predicted risk scores were 93.08 and 172.73. Detailed results of the Cox proportional hazards regression analysis for each risk group can be found in Supplemental Table 2. Example cases are shown in Fig. 5. In the development cohort, 193 (41%) patients were assigned to the low-risk group, 185 (39%) patients to the intermediate-risk group, and 93 (20%) patients to the high-risk group. In the validation cohort, 27 (32%) patients were allocated to the low-risk group, 32 (38%) patients to the intermediate-risk group, and 25 (30%) patients to the high-risk group. Supplemental Table 3 shows a cross-tabulation analysis of the proposed risk groups across conventional clinical staging systems.
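The reported hazard ratios follow directly from the Cox model coefficients, since the hazard ratio per one-unit increase in the risk score is HR = exp(β). A quick arithmetic check:

```python
import math

# Cox proportional hazards: the hazard ratio per one-unit covariate
# increase is the exponential of the regression coefficient beta.
hr_dev = math.exp(0.021658)   # development-cohort coefficient from the text
hr_val = math.exp(0.021676)   # validation-cohort coefficient from the text
# both round to the reported HR of 1.022
```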

Fig. 4

Distribution of predicted risk scores in the development and independent validation cohorts. Based on the risk score predictions of the survival model in the development cohort, we derived two cutoff points (93.08 and 172.73) to stratify patients into low-, intermediate-, and high-risk groups. We applied the same cutoff points for stratification in the independent validation cohort. For plotting, a Gaussian smoothing kernel was used

Fig. 5

Example cases. Standard-of-care clinical data and axial pre-contrast, late arterial, portal venous, and delayed-phase MRI with corresponding automated liver segmentations overlaid in blue. Low-risk group: a 59-year-old male patient presenting with a focal 4.5 cm lesion in the right liver lobe; the patient was censored after 8.75 years. Intermediate-risk group: a 54-year-old male patient presenting with a focal 3.1 cm lesion in the right liver lobe; the patient died 15 months after diagnosis. High-risk group: a 54-year-old male patient presenting with a focal 5.7 cm lesion in the right liver lobe; the patient died 6.7 months after diagnosis

Survival analysis

A comprehensive table with mOS times and 95% confidence intervals for the various strata can be found in Table 3. The mOS of the entire study cohort was 32.47 (95% CI: 28.40, 37.83) months. In the development and validation cohorts, the mOS was 32.57 (95% CI: 29.13, 38.50) months and 24.57 (95% CI: 14.73, NA) months, respectively. Survival rates (± standard error) for the proposed risk groups in the development and validation cohorts are summarized in Table 4. Survival times between the development and validation cohorts showed no statistically significant difference (p = .29). In the development cohort, mOS was 90.40 (95% CI: 62.97, NA) months in the low-risk group, 25.80 (95% CI: 23.50, 31.90) months in the intermediate-risk group, and 6.40 (95% CI: 5.03, 7.73) months in the high-risk group. In the validation cohort, mOS was NA (95% CI: 48.03, NA) months in the low-risk group, 19.60 (95% CI: 14.23, NA) months in the intermediate-risk group, and 5.40 (95% CI: 3.80, 12.63) months in the high-risk group. Kaplan-Meier curves for the developed risk groups are shown in Fig. 6. The low-, intermediate-, and high-risk groups demonstrated significantly different survival times in both cohorts (development cohort, p < .0001; validation cohort, p < .0001). Notably, no statistically significant difference was found when comparing the survival times of each risk group between the development and validation cohorts (low-risk group, p = 1.0; intermediate-risk group, p = 1.0; high-risk group, p = 1.0), indicating the generalizability of the risk groups' survival times to new data. Complete results of the logrank tests for pairwise OS comparisons between the proposed risk groups can be found in Supplemental Table 4. Kaplan-Meier curves for the conventional staging scores can be found in Supplemental Figure 1.

Table 3 Median overall survival times across staging systems
Table 4 Survival rates (± standard error) for the proposed risk groups
Fig. 6

Kaplan-Meier curves of the proposed risk groups in the development and validation cohorts

Discussion

Using a large dataset (n = 555), we developed and independently validated a fully automated framework for mortality risk prediction in hepatocellular carcinoma patients using routinely available standard-of-care clinical data and radiomic biomarkers from automated liver segmentations on baseline multiphasic contrast-enhanced MRI. This completely data-driven method yielded reliable, fast, and reproducible risk predictions and attained state-of-the-art performance for mortality risk quantification at baseline. The generalizability of our method was confirmed in an independent validation cohort, and its performance was compared against conventional staging systems. In addition, the proposed stratification into low-, intermediate-, and high-risk groups yielded similar overall survival times in the development and validation cohorts, further indicating the generalizability of the method. By using automated liver segmentations, we ensure that the extracted radiomic imaging markers are stable and reproducible, and we substantially accelerate the workflow, with segmentation computation times under a second and no need for human interaction [17]. Finally, we developed our method using a data-driven approach: instead of stratifying patients into risk groups based on one-dimensional image measurements [9,10,11,12,13], we derived risk group cutoff values using the full three-dimensional imaging volume. Such a risk prediction framework could enable personalized follow-up strategies, guide management decisions, and improve clinical workflow efficiency in tumor boards.

We hypothesize that the proposed framework outperformed conventional systems for mortality risk quantification in terms of C-index and AUC because it uses advanced quantitative imaging biomarkers from the whole liver volume and a data-driven approach for mortality risk prediction. Three-dimensional quantitative assessment of tumor burden has been shown to be a stronger predictor of patient survival than one-dimensional tumor size measurements [23], and one-dimensional tumor size measurements, as used in conventional staging systems, have shown major limitations in reflecting viability, actual tumor size, and growth potential [24]. Our approach uses the full 3D MR imaging data for the extraction of quantitative biomarkers, a far richer representation of the MRI study than one-dimensional representations such as diameter-based measurements. The 30 biomarkers with the highest importance scores comprise a combination of clinical features and radiomic features across all four phases of the MRI study (Fig. 3). These biomarkers effectively represent disease through the intensity patterns and textures summarized by the radiomics in the different phases of the imaging study. The performances of the conventional staging systems are in line with previously published studies [25,26,27,28,29,30,31]. Previous works have developed machine learning models for mortality risk quantification; however, direct comparison is challenging due to the different datasets, imaging modalities, and study endpoints used. Mei et al. [32] developed and validated a prognostic nomogram to predict survival in patients with unresectable HCC after hepatic arterial infusion chemotherapy in a cohort of 463 predominantly advanced-stage patients, attaining C-indices of 0.710 and 0.716 in their development and validation cohorts, respectively.
Furthermore, the authors proposed a risk stratification approach to classify patients into three or four risk groups, based on a trisection cutoff and a quartile cutoff point, with significantly different OS times. While their approach did not include quantitative imaging biomarkers, their data-driven design achieved superior performance compared to conventional staging systems in their cohort, similar to the findings of our study. Liu et al. [33] developed and validated a prognostic nomogram to predict overall survival in HCC patients after hepatectomy using radiomic features from tumor segmentations on portal venous phase computed tomography together with clinical and pathological variables in a cohort of 544 Chinese patients, and achieved C-indices of 0.747 and 0.777 in their development and validation cohorts, respectively. Based on the predicted risk scores, patients were allocated into low- and high-risk groups with significantly different OS times. The authors significantly improved performance by integrating radiomic features into the risk prediction model compared to clinical and pathological features alone. Blanc-Durand et al. [34] proposed a scoring system to stratify HCC patients undergoing transarterial radioembolization with yttrium-90 into a low- and a high-risk group. Radiomic features were derived from semi-automatic liver segmentations on pretreatment 18F-fluorodeoxyglucose positron emission tomography images. Their proposed risk score was significantly correlated with OS and could stratify patients into two risk groups with distinct OS times. However, their method was not evaluated on an independent validation cohort, no measures of prognostic power, such as the C-index or time-dependent AUCs, were reported, and their method relies on manually revised semi-automatic liver segmentations. In their multi-institutional study, Ji et al. [35] presented a machine learning framework to predict the time to recurrence after resection.
Their method was developed and evaluated in 470 patients with solitary HCC lesions, used radiomic features derived from manual tumor and peritumoral segmentations on baseline contrast-enhanced computed tomography images combined with clinical variables, and achieved C-indices of 0.733–0.801, outperforming conventional staging systems. The authors proposed a risk stratification approach to allocate patients into low-, intermediate-, and high-risk groups with significantly different OS times. Their study confirms the utility of a data-driven approach and of advanced quantitative imaging biomarkers for time-to-event data. However, their method relies on time-consuming manual tumor segmentation rather than a fully automated segmentation method such as our liver segmentation approach, which yields segmentations in under a second (0.70 s).

Our study has several limitations. First, our method was developed using retrospective data from a single institution; thus, prospective multicenter studies and external validation are warranted before potential clinical translation. However, our method was developed using a large dataset and yielded precise and generalizable results in an independent validation cohort. Furthermore, the large cohort size of this study adds a degree of robustness to the findings, and the relative simplicity of the analytical elements involved enhances the likelihood of reproducibility and the reliability of the findings. Due to these factors, we anticipate that other groups can replicate our study using the publicly available code. Second, we did not control for the effects of different treatment types on OS. Nevertheless, our method is predictive of OS even without any treatment information. Finally, the proposed method relies on multiphasic contrast-enhanced MRI and cannot be applied to patients who underwent other imaging modalities at baseline. In future work, we will incorporate longitudinal data from multiple external contributors and evaluate our method on a larger independent validation cohort.

In conclusion, we present a fully automated framework for mortality risk prediction in hepatocellular carcinoma patients using routinely available standard-of-care clinical data and radiomic features derived from automated liver segmentations on baseline multiphasic contrast-enhanced MRI, outperforming conventional staging systems for mortality risk quantification. The developed method, based on machine learning, can help personalize follow-up strategies, guide management decisions, and improve clinical workflow efficiency in tumor boards.