Background

Hepatocellular carcinoma (HCC), which accounts for 75–85% of all primary liver cancer cases, remains a leading cause of cancer-related morbidity and mortality worldwide [1]. The poor prognosis of HCC after surgical resection is mainly due to recurrence and metastasis [2, 3]. The expression level of the nuclear protein Ki-67 indicates cell proliferation activity, which corresponds with tumor biological behavior, treatment efficacy and prognosis [4, 5]. Previous studies have demonstrated that high Ki-67 expression is associated with poor overall survival (OS) [6,7,8,9,10], disease-free survival (DFS) [6, 9, 11], and relapse-free survival (RFS) [8, 9, 12]. In particular, Ki-67 has been proposed as an attractive therapeutic target for cancer because it is highly expressed in most malignant cells but rarely detected in normal cells, although Ki-67-targeted therapy has not yet been applied in clinical practice [5]. Accurate identification of the Ki-67 expression level is therefore crucial for prognosis and treatment decision-making to achieve a satisfactory outcome. However, it is difficult to differentiate HCCs with different Ki-67 expression levels through conventional imaging.

Radiomics, which involves extracting numerous quantitative, high-throughput features from medical images, has been used to develop diagnostic, predictive, and prognostic models [13, 14]. Previous studies have reported that tumor characteristics at the cellular and genetic levels can be reflected in phenotypic patterns and subsequently captured by radiomics signatures [15,16,17,18,19,20]. Gadolinium ethoxybenzyl-diethylenetriamine pentaacetic acid (Gd-EOB-DTPA), which has characteristics of both a blood-pool agent and a hepatobiliary agent, is commonly used in clinical practice. Previous studies have applied texture analysis to Gd-EOB-DTPA-enhanced MRI to preoperatively predict Ki-67 expression in patients with HCC, and indicated that texture analysis predicted Ki-67 expression well and was superior to subjective MRI characteristics determined by radiologists [21, 22]. Although these studies were valuable, they did not compare the predictive performance of radiomics models derived from different sequences and phases of Gd-EOB-DTPA-enhanced MRI.

Thus, this study aimed to develop radiomics models derived from different sequences and phases of Gd-EOB-DTPA-enhanced MRI, to compare their predictive performance, and then to validate the optimal model for preoperative prediction of Ki-67 expression in patients with HCC.

Methods

Patients

This is a retrospective study for which ethical approval was obtained and informed consent from patients was waived. Between January 2013 and November 2019, patients who underwent Gd-EOB-DTPA-enhanced MRI examination before surgery or biopsy were consecutively included in this study according to the following inclusion and exclusion criteria. The inclusion criteria were: (1) pathologically confirmed HCC; (2) Gd-EOB-DTPA-enhanced MRI of the liver performed within 1 month before surgery or biopsy; (3) images without obvious artifacts; (4) if multiple lesions were present, the largest lesion with a matched pathological and immunohistochemical diagnosis was selected. The exclusion criteria were: (1) previous treatment, such as anti-tumor therapies, radiofrequency ablation, or transcatheter arterial chemoembolization (TACE); (2) incomplete clinical or pathological information. All enrolled patients were randomly divided into training and validation cohorts at a ratio of 7:3.
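The random division into cohorts can be sketched as follows. This is a minimal illustration only: the seed, the rounding rule and the absence of stratification are assumptions, since the exact procedure is not described here, and the sketch does not attempt to reproduce the cohort sizes reported in the Results.

```python
import random


def split_cohort(patient_ids, train_fraction=0.7, seed=0):
    """Randomly split patient IDs into training and validation cohorts.

    The seed and the truncating rounding rule are illustrative assumptions.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_train = int(train_fraction * len(ids))  # truncates toward zero
    return ids[:n_train], ids[n_train:]


# Hypothetical patient identifiers 1..151
train_ids, val_ids = split_cohort(range(1, 152))
print(len(train_ids), len(val_ids))
```

Fixing the seed keeps the assignment reproducible, which matters because all feature selection is later restricted to the training cohort.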

Histopathological examination

The tumor tissue sections were stained using monoclonal mouse anti-human Ki-67 antibody (Beijing Zhongshan Golden Bridge Biotechnology Company, Beijing, China). Ki-67 expression was evaluated by calculating the frequency of Ki-67-positive cells. A cell was considered Ki-67 positive when its nucleus was stained brown-yellow. Tumors were classified as low Ki-67 expression (≤ 14% immunoreactivity) or high Ki-67 expression (> 14% immunoreactivity) according to previous studies [5, 16]. Following a previous study, we dichotomized histologic subtypes into low-grade and high-grade tumors. Low-grade tumors correspond to well differentiated, well and moderately differentiated, and moderately differentiated HCC. High-grade tumors correspond to moderately and poorly differentiated, poorly differentiated, and undifferentiated HCC.
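The dichotomization described above can be expressed as a short sketch. The function names and the use of raw nuclei counts are hypothetical; only the 14% cutoff and the direction of the inequality come from the text.

```python
def ki67_index(positive_nuclei, total_nuclei):
    """Percentage of counted tumor cell nuclei that stain positive for Ki-67."""
    if total_nuclei <= 0:
        raise ValueError("total_nuclei must be positive")
    return 100.0 * positive_nuclei / total_nuclei


def ki67_group(index_percent, cutoff=14.0):
    """Return 'high' when the index is strictly above the cutoff, else 'low'."""
    return "high" if index_percent > cutoff else "low"


# Hypothetical counts: 28 positive nuclei out of 200 gives exactly 14%,
# which falls in the low-expression group because the cutoff is inclusive.
print(ki67_group(ki67_index(28, 200)))
```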

MRI protocol

The details of the MRI protocol and the sequences used in this study are presented in Additional file 1.

Tumor segmentation

Tumor segmentation was manually performed on arterial phase (AP), portal venous phase (PVP), hepatobiliary phase (HBP) and T2W images with 3D Slicer (http://www.slicer.org), and a three-dimensional (3D) region of interest (ROI) that covered the whole tumor was delineated along the tumor border. HBP or T2W images were segmented first, followed by AP and PVP images, because the tumor margins were clearer on HBP or T2W images than on AP and PVP images. This delineation order was intended to mitigate software-related segmentation errors. The segmentation was independently performed by two radiologists (Y.Y., 10 years of liver imaging experience; Y.F., 8 years of liver imaging experience) in 30 randomly chosen patients to assess inter-observer reproducibility. One radiologist (Y.F.) repeated the segmentation on another day to assess intra-observer reproducibility. The images of the remaining patients were segmented by the same radiologist (Y.F.). Both radiologists were blinded to the clinical outcomes.

Preprocessing and radiomic features extraction

Before radiomic feature extraction, image preprocessing was performed, including Laplacian of Gaussian (LoG) filtering, wavelet transformations, bin discretization and enforcement of radiomic matrix symmetry. Feature extraction was performed using the Slicer Radiomics extension, which incorporates the PyRadiomics library into 3D Slicer [23]. Extracted features included first-order statistics, shape features and texture features; the texture features were derived from the gray level co-occurrence matrix (GLCM), gray level size zone matrix (GLSZM), gray level run length matrix (GLRLM), gray level dependence matrix (GLDM) and neighboring gray tone difference matrix (NGTDM). Among these features, flatness and least axis from the shape features were excluded based on the definition of the feature, as discussed in the PyRadiomics documentation, and sum average was excluded because it is directly correlated with joint average [24]. Thus, a total of 1,300 radiomic features were extracted for each unique lesion.

Radiomic feature selection and model development

The least absolute shrinkage and selection operator (LASSO) logistic regression with 5-fold cross-validation was used to select the most useful features in the training cohort. Rad-score was calculated for each patient using the linear combination of selected features multiplied by their respective coefficients.
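As a minimal sketch of the Rad-score calculation, the function below forms the weighted sum of the selected features. The feature names and coefficient values are hypothetical placeholders (the actual formula is given in Additional file 2), and the optional intercept is an assumption, since fitted LASSO logistic models usually include one.

```python
def rad_score(features, coefficients, intercept=0.0):
    """Linear combination of selected features weighted by their coefficients.

    `features` maps feature name -> value for one patient.
    `coefficients` maps feature name -> nonzero LASSO coefficient.
    """
    missing = set(coefficients) - set(features)
    if missing:
        raise KeyError(f"missing feature values: {sorted(missing)}")
    return intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )


# Hypothetical feature names and values, for illustration only
coefs = {"feature_a": 0.5, "feature_b": -1.0}
patient = {"feature_a": 2.0, "feature_b": 1.0, "unused_feature": 7.0}
print(rad_score(patient, coefs, intercept=0.25))
```

Features whose coefficients were shrunk to zero by LASSO simply do not appear in `coefficients`, so they have no influence on the score.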

Comparison of radiomics model in the training and validation cohort

The models developed in the training cohort were applied to the validation cohort. The receiver operating characteristic (ROC) curve, DeLong test, calibration curve and decision curve analysis (DCA) were used to evaluate the diagnostic performance of the constructed models, and cutoff values were selected according to the Youden index to determine the corresponding sensitivity and specificity.
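The AUC and the Youden-index cutoff can be illustrated with a small self-contained sketch. This is not the MedCalc implementation used in the study; the labels and scores are hypothetical, and a patient is treated as predicted positive when the score is at or above the cutoff.

```python
def auc(labels, scores):
    """Mann-Whitney estimate of the area under the ROC curve.

    Probability that a positive case outscores a negative one; ties count half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(labels, scores):
    """Return (cutoff, sensitivity, specificity) maximizing J = sens + spec - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        # Keep the first (lowest) cutoff in case of ties in J
        if best is None or sens + spec - 1 > best[1] + best[2] - 1:
            best = (t, sens, spec)
    return best


# Hypothetical data: 1 = high Ki-67 expression, 0 = low Ki-67 expression
labels = [0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
print(auc(labels, scores))
print(youden_cutoff(labels, scores))
```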

Combined model development and validation

To develop the combined model, we performed multivariate logistic regression analysis of clinical factors in the training cohort, including age, sex, hepatitis B, hepatitis C, cirrhosis, serum alanine aminotransferase (ALT) level, serum aspartate aminotransferase (AST) level, serum gamma-glutamyl transferase (GGT) level, and serum alpha-fetoprotein (AFP) level. Clinical factors that reached statistical significance (P < 0.05) were entered into the combined model, which also included the optimal Rad-score.

Calibration curves were used to assess the agreement between predicted and observed outcomes for the combined model in both the training and validation cohorts. Decision curve analysis was conducted to determine the clinical usefulness of the combined model by quantifying the net benefits at different threshold probabilities in the validation cohort.
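The net benefit plotted in a decision curve can be sketched as follows, using the standard definition: true positives per patient minus false positives per patient weighted by the odds of the threshold probability. The study itself used the "rmda" package in R; the data below are hypothetical.

```python
def net_benefit(labels, probabilities, threshold):
    """Net benefit of treating patients whose predicted probability >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probabilities) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probabilities) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: every patient is classified as positive."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical validation data: 1 = high Ki-67 expression
labels = [1, 1, 0, 0]
probabilities = [0.9, 0.6, 0.7, 0.2]
for pt in (0.3, 0.5, 0.6):
    print(pt, net_benefit(labels, probabilities, pt), net_benefit_treat_all(labels, pt))
```

A model is considered useful at a given threshold probability when its net benefit exceeds both the treat-all reference and the treat-none reference, whose net benefit is zero.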

Statistical analysis

Continuous variables were described as median and interquartile range, and categorical variables were described as frequency and percentage. The D’Agostino–Pearson test was used to test the normality of the data. For continuous variables, the independent-samples t-test or the Mann–Whitney U test was used to compare clinical characteristics between the training and validation cohorts, and between the high and low Ki-67 expression groups within each cohort, while the Chi-squared test or Fisher exact test was used for categorical variables. Two-sided P values < 0.05 were considered statistically significant. The inter-observer and intra-observer reproducibility of the extracted features was assessed by the intra-class correlation coefficient (ICC). ICC ≥ 0.8, 0.5–0.79 and < 0.5 indicated high, middle, and low consistency, respectively [25]. LASSO logistic regression and multivariable logistic regression analysis were performed to select radiomics features and clinical risk factors using the “glmnet” and “rms” packages in R software, version 3.0.1 (http://www.Rproject.org). The calibration and decision curves were plotted using the “rms” and “rmda” packages. Other statistical analyses were performed using MedCalc software (Version 16.2.0, https://www.medcalc.org).
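The reproducibility assessment can be illustrated with the sketch below. The ICC form used in the study is not stated, so the two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1), is assumed here. The treatment of values between 0.79 and 0.8 as middle consistency is also an assumption, and the feature values are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) from a table with one row per lesion and one column per reading.

    Computed from the two-way ANOVA mean squares:
    (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-lesion mean square
    msc = ss_cols / (k - 1)               # between-reading mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def consistency_level(icc):
    """Map an ICC value to the high / middle / low categories used in the text."""
    if icc >= 0.8:
        return "high"
    if icc >= 0.5:
        return "middle"
    return "low"


# Hypothetical values of one feature measured by two readers on three lesions
feature_values = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
value = icc_2_1(feature_values)
print(value, consistency_level(value))
```

Because ICC(2,1) measures absolute agreement, a constant offset between readers lowers the coefficient even when the rank order of lesions is identical.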

Results

Baseline characteristics

One hundred fifty-one patients were enrolled, including 103 patients in the training cohort and 48 patients in the validation cohort (Table 1). Baseline characteristics were not significantly different between the training and validation cohorts. Among all 151 patients, high Ki-67 expression was pathologically diagnosed in 112 patients (74.2%) and low Ki-67 expression in 39 patients (25.8%). In both cohorts, serum AFP levels and tumor grade were significantly higher in the high Ki-67 expression group than in the low Ki-67 expression group, and low-grade tumors were more frequent in the low Ki-67 expression group. In the training cohort, the number of patients with hepatitis B was larger in the high Ki-67 expression group than in the low Ki-67 expression group (Table 2).

Table 1 Baseline clinical characteristics of the training and validation cohort
Table 2 Baseline clinical characteristics of the high and low Ki-67 expression in training and validation cohort

Features selection and radiomics model development

No statistically significant difference was found between observers or between the repeated measurements of the same observer (P values ranged from 0.691 to 0.815 and from 0.755 to 0.891, respectively). For the AP, HBP, and T2W radiomics models, 1300 features were reduced to 12 (Fig. 1a, b), 6, and 12 potential predictors, respectively, in the 103 patients of the training cohort. For PVP images, no valuable features were selected by the LASSO regression analysis. The Rad-score was calculated for each patient using the linear combination of selected features multiplied by their respective coefficients. The selected features are presented in the Rad-score calculation formula (Additional file 2).

Fig. 1

Feature selection using the least absolute shrinkage and selection operator (LASSO) logistic regression in the AP radiomics model. a Tuning parameter (λ) selection in the LASSO model using 5-fold cross-validation. Dotted vertical lines are drawn at the optimal values according to the minimum criterion and the 1 standard error of the minimum criterion (the 1-SE criterion). A λ value of 0.045, with log(λ) of −2.725, was chosen (1-SE criterion). b A vertical line is drawn at the selected value, where the optimal λ resulted in 12 nonzero coefficients

Comparison of predictive performance among radiomics models in training and validation cohorts

The AUC values, sensitivity, specificity, and accuracy of the AP, HBP, T2W, and combined AP and HBP radiomics models in predicting Ki-67 expression in the training and validation cohorts are shown in Table 3 and Fig. 2. The DeLong test showed no significant difference in AUC values among the AP, HBP, combined AP and HBP, and T2W radiomics models. DCA showed that the curve of the AP radiomics model was generally higher than those of the HBP and T2W radiomics models (Fig. 3), and that the combined AP and HBP radiomics model did not provide a meaningful additional benefit compared with the AP radiomics model alone (Fig. 4).

Table 3 Comparison of the predictive performance of the five models in predicting Ki-67 expression
Fig. 2

ROC curves for the radiomics model in predicting Ki-67 expression in the training and validation cohort, respectively. a ROC curve in training cohort. b ROC curve in validation cohort 

Fig. 3

Decision curve analysis of the AP, HBP, and T2W radiomics models and the combined model in the validation cohort. The red, blue, yellow, and green lines represent the AP, HBP, T2W and combined models, respectively. The combined model includes AP Rad-score and serum AFP level. The curve of the AP radiomics model was generally higher than those of the HBP and T2W radiomics models. The decision curve shows that, over a threshold probability range of 30–60%, the combined model adds net benefit compared with the AP radiomics model alone

Fig. 4

Decision curve analysis of the AP radiomics model and the combined AP and HBP radiomics model in the validation cohort. The red line and blue line represent the AP radiomics model and the combined AP and HBP radiomics model, respectively. The decision curve shows that the combined AP and HBP radiomics model does not provide a meaningful additional benefit compared with the AP radiomics model in the validation cohort

Combined model development and validation

The multivariate logistic regression analysis showed that only serum AFP level and AP Rad-score were associated with Ki-67 expression in the training cohort (P < 0.05). The combined model was constructed with AP Rad-score and serum AFP level. It yielded an AUC value of 0.922 (95% CI 0.852–0.965) in the training cohort and 0.863 (95% CI 0.733–0.945) in the validation cohort (Table 3; Fig. 2). The DeLong test showed that the AUC value of the combined model (0.922, 95% CI 0.852–0.965) was higher than that of the AP radiomics model (0.873, 95% CI 0.793–0.930) in the training cohort (P = 0.015). In the validation cohort, the AUC value of the combined model (0.863, 95% CI 0.733–0.945) was also higher than that of the AP radiomics model (0.813, 95% CI 0.674–0.911), although the difference was not statistically significant (P = 0.254). The calibration curves showed good agreement between predicted and actual events in the training and validation cohorts (Fig. 5a, b). The DCA of the validation cohort revealed that, over a threshold probability range of 30–60%, the combined model added net benefit compared with the AP radiomics model (Fig. 3).

Fig. 5

Calibration curve for the combined model in training and validation cohort. a Calibration curves for the combined model in training cohort. b Calibration curves for the combined model in validation cohort

Discussion

In this study, we compared the predictive performance of AP, HBP, T2W, and combined AP and HBP radiomics models. We then established and validated a combined model, including AP Rad-score and AFP, based on Gd-EOB-DTPA-enhanced MRI for preoperative prediction of Ki-67 expression in patients with HCC. The results showed that the AP radiomics model yielded an incremental performance in predicting Ki-67 expression of HCC over the HBP and T2W radiomics models, and that the combined AP and HBP radiomics model did not provide additional benefit compared with the AP radiomics model alone. The combined model yielded the highest performance, with an AUC value of 0.922 (95% CI 0.852–0.965) in the training cohort.

High Ki-67 expression indicates active cell proliferation, which requires more neovascularization for tumor growth, and AP images of enhanced MRI best demonstrate the neovascularity of tumors. Accordingly, the AP radiomics model added more net benefit in predicting Ki-67 expression of HCC compared with the HBP radiomics model. A previous study reported that the AP model of Gd-EOB-DTPA-enhanced MRI was inferior, possibly because artifacts affected the extraction and calculation of texture-based features [26]; in contrast, our study excluded patients with obvious artifacts caused by transient severe motion (TSM) [27, 28].

Radiomics, including texture analysis and other features, such as shape and intensity [29], is considered to be a potential bridge between medical imaging and personalized medicine [30]. In our study, 41 features most relevant for Ki-67 expression were selected. Among these features, 13 were first-order statistics and 28 were texture features, including gray level co-occurrence matrix (GLCM), gray level dependence matrix (GLDM), gray level run length matrix (GLRLM), gray level size zone matrix (GLSZM), and neighboring gray tone difference matrix (NGTDM) features. Although articles on the same topic of using a radiomics model based on Gd-EOB-DTPA-enhanced MRI to predict Ki-67 expression in HCC have recently been published [21, 22], they differ from our study in many details. In the study of Li et al. [21], a single slice with the largest proportion of the lesion was delineated, and the predictive performance of the models was compared only by misclassification rate. In our study, all slices covering the whole tumor were delineated, and the predictive performance of the different models was compared by AUC values, calibration curves and DCA. In the study of Ye et al. [22], a sum of texture signatures derived from AP, PVP, pre-contrast T1W and T2W images was used to predict Ki-67 expression by multivariate logistic regression, and the predictive performance of radiomics models derived from different phases was not compared. The C-index (AUC) of the combined model in the study of Ye et al. [22] (0.936) was approximately equivalent to the AUC value of the combined model in the training cohort of our study (0.922); however, Ye et al. incorporated a sum of texture signatures derived from multiple phases into one radiomics model, which is cumbersome in clinical practice. Our study developed and compared the predictive performance of radiomics models derived from different sequences and phases, including T2W, AP, PVP, and HBP images, and then further validated the optimal model for preoperative prediction of Ki-67 expression in HCC, which obtained a good result and would be feasible for clinical practice. Moreover, both of the previous studies lacked a validation cohort to assess whether their models were overfitted.

There are several limitations in this study. Firstly, the sample size is still small compared with the number of included variables, especially in the low Ki-67 expression group, and our validation cohort was from the same institution as the training cohort, which restricts the generalizability of our findings to other institutions or settings. Secondly, our study compared the predictive performance of the AP, HBP, and T2W radiomics models of Gd-EOB-DTPA-enhanced MRI for predicting Ki-67 expression of HCC; however, it did not compare the AP radiomics model of Gd-EOB-DTPA-enhanced MRI with that of Gd-diethylenetriaminepentaacetic acid (Gd-DTPA)-enhanced MRI. Thirdly, there is currently no standardized Ki-67 expression level threshold in HCC, and our choice of 14% as the cutoff value may be controversial. In summary, interpreting the complex associations between biologic processes and radiomics features remains an enormous challenge, although it is in line with the current trend toward precise and personalized medicine.

Conclusions

Our study established and validated a combined model, including AP Rad-score and serum AFP level based on enhanced MRI, for predicting Ki-67 expression in HCC patients. It provides a new non-invasive approach for accurate diagnosis.