Introduction

Peritoneal metastasis (PM) is the most common metastasis in advanced abdominal tumors, with a median survival of no more than 12 months in some patients, and a worse prognosis than other metastasis, such as the liver or lung [1, 2]. In recent years, cytoreductive surgery (CRS) combined with hyperthermic intraperitoneal chemotherapy (HIPEC) has been widely accepted for treatment of PM, emerging as the most effective approach to prolonging survival in the majority of cases [3, 4]. CRS + HIPEC has been endorsed by the Peritoneal Surface Oncology Group International (PSOGI) as the established standard of care for patients diagnosed with peritoneal pseudomyxoma, malignant peritoneal mesothelioma, and colorectal peritoneal metastases. Furthermore, it is the recommended therapeutic approach for patients with peritoneal metastases associated with gastric and ovarian cancers [5,6,7,8,9,10].

However, not all patients may experience positive outcomes from CRS + HIPEC treatment. The peritoneal carcinomatosis index (PCI), together with the cytoreductive completeness score (CCS), has been shown to be a major determinant of overall patient survival [11,12,13]. Patients with severe PM often have a relatively poor prognosis and are unlikely to achieve a clinical cure after CRS treatment (CCS of 0 or 1). For instance, in colorectal cancer patients, PCI > 20 is considered a contraindication to CRS + HIPEC treatment [14], while gastric cancer patients with normal tumor markers and PCI ≤ 16 are more likely to benefit from surgical treatment [15]. Currently, the gold standard for assessing PCI is laparoscopic [16], and some patients with extensive metastasis may have to forgo further surgical treatment after the invasive examination. Open-close surgery has been reported in up to 23.4% of cases [17].

Imaging examinations such as CT and MRI have become common preoperative procedures for patients with PM. While they can provide information about the presence and extent of metastasis, their low sensitivity may result in discrepancies between the imaging results and the actual condition of the patient. As a result, CT-PCI and MRI-PCI demonstrate limited ability to stratify patients for selection when compared to surgical PCI (S-PCI), with a concordance index of only 0.47–0.79 [18,19,20,21]. Furthermore, the unique anatomical characteristics of the peritoneum pose a significant challenge in distinguishing it from surrounding visceral fat on imaging, thereby necessitating a higher level of expertise from the radiologist.

Radiomics (Rad) is a promising image data analysis method that has emerged in recent years, capable of quantifying traditional medical images into structured data features and interpreting images from multiple dimensions [22]. It has demonstrated potential in determining the benignity and malignancy of tumors, staging, and prognosis [23,24,25]. Similarly, deep learning (DL) can also extract quantified features from images with higher dimensionality and more accurate performance than Rad features (RF) [26]. Previous studies have shown that DL exhibits better performance than RF in predicting preoperative occult PM or lymphatic metastasis in gastric cancer patients [27, 28]. However, there are currently no reported studies utilizing DL for the preoperative staging of PM patients.

In summary, the purpose of this study was to develop and validate a DL model for the preoperative staging of PM patients to individualize patient treatment based on accurate staging.

Materials and methods

Patient enrolment

This study included 362 PM patients who underwent surgery at Zhongnan Hospital of Wuhan University from January 2014 to December 2021. The inclusion criteria for patients were as follows: (1) evaluation of PCI via laparoscopic exploration or open surgery, (2) confirmation of PM through postoperative pathological examination, and (3) receipt of a whole contrast-enhanced abdominal CT within three weeks prior to surgery. Patients were excluded if they had poor CT imaging quality or incomplete baseline clinical or pathological information. Of the initial pool of patients, 168 patients were included in the final analysis and categorized into two groups: Mild peritoneal involvement (MPI) group (PCI < 18) and Heavy peritoneal involvement (HPI) group (PCI ≥ 18), based on the S-PCI. This division considered the established indications for gastrointestinal CRS, which typically fall below the range of 16–20 for PCI. The training and validation cohorts were randomly divided into an 8:2 ratio. A flow chart of this study design is provided in Fig. 1.

The Medical Ethics Committee of Wuhan University Zhongnan Hospital approved this study. Since this study was retrospective, informed consent from patients was waived.

Fig. 1
figure 1

Flow chart of the study cohort recruitment

Intraoperative evaluation

The S-PCI was determined during laparoscopy exploration or open surgery by a team of surgeons. The abdominopelvic cavity was partitioned into 13 distinct regions, and the size of the tumor in each region was used to assign a score, with a score of 0 indicating no tumor, 1 for tumor diameter < 0.5 cm, 2 for tumor diameter ranging from 0.5 to 5 cm, and 3 for tumor diameter exceeding 5 cm. The summation of scores from all regions yielded the S-PCI, which could range from 0 to 39 [29]. All samples obtained during surgery were confirmed as PM through postoperative pathological examination.

CT image acquisition

CT examinations were conducted on 128-slice (Siemens SOMATOM Definition CT) and 64-channel (GE Discovery CT750, Philips Ingenuity CT) scanners with patients in the supine position following a 6-h fasting period. Patients were trained to control their breathing before the scan to minimize any breathing-related artifacts. The CT scans encompassed the region extending from the diaphragm to the bony pelvic floor. Prior to undergoing the contrast-enhanced CT examination, patients were administered contrast agents (5.300 mL/kg, iohexol 40 mg I/mL) via the anterior elbow vein at a rate of 5.1 mL/s. The following parameters were used for the CT scan: tube current ranging from 150 to 350 mA, tube voltage of 120 kVp, field of view spanning 30–45 cm, matrix size of 512 × 512, and reconstructed slice thickness between 1 and 5 mm.

Image preprocessing and segmentation

We chose CT images of the portal venous-phase for the relevant study because they could well separate between normal tissues, blood vessels, tumors and non-neoplastic organs. To reduce the effect of different volume pixels, we used the nearest interpolation method to resample the voxels to 1 mm × 1 mm × 1 mm. The 3D peritoneal volume of interest (VOI) segmentation was using ITK-SNAP software (v.4.0.0, http://www.itksnap.org). A novel approach to segmenting the 3D peritoneal volume of interest was developed in this study. This was achieved by dividing the entire abdomen into three sections: upper, middle, and lower abdomen. The abdomen was first segmented without organs and organs, musculoskeletal and vascular areas using semi-automatic segmentation on ITK-SNAP (see Supplementary 1 for details), and then by two experienced imaging physicians manually corrected the VOI in the abdominal window (window level: 30HU, window width: 700HU) to obtain the peritoneal VOI of the upper, middle and lower parts. Thirty patients were randomly selected after 30 days to have the VOI segmented again to exclude intra- and interobserver variation. The flow in the whole study was as in Fig. 2.

Fig. 2
figure 2

The workflow in this study. Firstly, the clinical and imaging data of the patients were collected. Subsequently, ITK-SNAP was employed to outline the imaging data and obtain the region of interest (ROI). The features of the ROI were then extracted using both radiomics and deep learning techniques. Finally, a machine learning-based peritoneal carcinomatosis index (PCI) prediction model was constructed based on filtered features. The direct predictive ability of different models was compared through evaluation metrics including area under the curve (AUC), confusion matrix, and decision curve analysis (DCA)

Radiomics feature extraction

RF extraction of peritoneal VOI was performed using PyRadiomics (v3.1.0) based on the Python (3.9.7) open-source platform. The types and distribution of RF areshown in the supplementary material. The extraction process aligns with the Imaging Biomarker Standardization Initiative (IBSI) [30]. The reproducibility of RF was assessed by the intraclass correlation coefficient (ICC), and features with ICC ≥ 0.75 proved to have low variance and high reproducibility.

Deep learning feature extraction

The layer with the largest peritoneal area from the three abdominal VOI of patients were selected as the patient’s DL region of interest (ROI). DenseNet, InceptionV3, ResNet18, ResNet50, and ResNet101 were pre-trained convolutional neural network models based on ImageNet. In the present study, DL models were utilized to extract DL features (DLF) from the ROI. To obtain the DLF, the last fully connected layer of the model was removed and global maximum pooling was applied to extract the maximum value of each feature layer. The original value of the output image features was used as the DLF. To visualize the DLF, the guided class activation map (Grad-CAM) was used. Grad-CAM is a visualization technique utilized to identify the key regions within an input image that significantly influenced the output of the neural network [31]. The PyTorch 1.4.1 library in Python 3.7.0 (https://pytorch.org) was used to implement the neural network.

Feature selection and model construction

Normalization of data is a common step in machine learning to scale and center the data, which helps in improving the accuracy of the model. In this study, the Z-Score method was used for normalization, which transforms the features to have a mean of zero and standard deviation of one. After normalization, Spearman correlation coefficient was calculated to identify highly correlated features, and one of the features with a strong correlation was retained (correlation coefficient > 0.9). To further select the most representative features, the least absolute shrinkage and selection operator (LASSO) regularization was introduced, which shrinks the regression coefficients towards zero and selects the features with non-zero coefficients. The regularization parameter λ was determined using five-fold cross-validation.

Nine machine learning methods (Table 2) were used to construct the classification model. The training and validation cohorts were randomly split in an 8:2 ratio, and the optimal model was selected based on the area under the receiver operating characteristic (ROC) curve (AUC). Finally, the performance of the model was evaluated using the decision curve analysis (DCA) and calibration curve.

Statistical analysis

Statistical analyses were performed using the Python 3.9.7 platform (https://www.python.org/). Continuous variables were presented as mean values and standard deviations, and analyzed using either Mann-Whitney U test or Student t-test. Categorical variables were reported as absolute number and analyzed using chi-square test or Fisher’s exact test. The Delong test was utilized to determine the statistical difference between the AUC values. A significance level of 0.05 was chosen for all tests, meaning that a p value below this level was considered statistically significant.

Results

Clinical characteristics of the patients

In this study, a total of 168 patients were recruited and allocated randomly into either the training cohort (n = 134) or the validation cohort (n = 34) at an 8:2 ratio. Table 1 presents the characteristics of all patients. Among the patients, 79 (47.02%) were male and 89 (52.98%) were female, while 89 (52.98%) belonged to the MPI group, and 79 (47.02%) were in the HPI group (Table 1). The distribution of each clinical information was balanced between the two cohorts. Chi-square and t tests indicated that Primary tumor, CA125, CA199, and L/M were significantly correlated with PCI in patients (P < 0.05). Subsequently, univariate and multivariate regression analyses were conducted, revealing that Primary tumor (P = 0.004), CA125 (P < 0.001), and L/M (P = 0.001) could serve as effective diagnostic indicators of PCI staging in patients preoperatively.

Table 1 Characteristics of patients

Feature extraction and selection for model construction

A total of 4683 RF were extracted from the upper, middle, and lower peritoneal ROIs, with 1561 features in each part. After excluding 1231 features with poor reproducibility (ICC < 0.75) based on ICC reproducibility assessment, 35 RF (as shown in Fig. 3A) were obtained for model construction through further screening with Spearman correlation coefficient and Lasso to exclude features with weak correlation. The Rad-SVM model demonstrated the best prediction performance (as demonstrated in Table 2), and the AUC in the validation cohort was 0.906 (95% CI 0.804–1.000).

Regarding the combination of six DL models and nine machine learning methods used for feature extraction, DenseNet121-SVM demonstrated the best performance (as shown in Table 3), with 4704 DLF extracted (1568 features each in the upper, middle, and lower abdomen). After filtering, 53 DLF were finally selected (as shown in Fig. 3B), and the AUC of the constructed model in the validation cohort was 0.951 (95% CI 0.887–1.000).

Fig. 3
figure 3

Histogram of filtered feature weighting coefficients. A RF; B DLF

Table 2 Performance of radiomics models constructed by different machine learning methods
Table 3 Performance of SVM diagnostic models constructed based on different DLF

Performance of diagnostic models

We developed three pre-procedural diagnostic models for predicting PCI using clinical data, RF, and DLF. It is worth noting that the AUC of both Rad and DL models was significantly higher than Clinic model. In the training cohort, the AUC of DL model was 0.996, which was higher than Clinic model (AUC = 0.851, P < 0.001) and Rad model (AUC = 0.986, P = 0.1468). In the validation cohort, the AUC of DL model was 0.951, which was higher than Clinic model (AUC = 0.797, P = 0.074) and Rad model (AUC = 0.906, P = 0.4654), as shown in Fig. 4A and D.

Fig. 4
figure 4

The performance evaluation of Clinic, RF and DL signatures. The ROC, DCA and calibration curves of the training (A–C) and validation cohort (D–F)

The results of the DCA demonstrated that patients could derive greater net benefits from DL model in comparison to both Clinic and Rad models (Fig. 4B and E). Additionally, the calibration curves indicated a superior level of agreement between predicted and actual groupings within DL models (Fig. 4C, F).

We also constructed a Clinic-Rad-DL model, which had an AUC of 0.962 (95% CI 0.901–1.000) in the validation cohort. However, the Clinic-Rad-DL model did not result in a substantial improvement in efficacy compared to DL model alone (P = 0.695).

While the features extracted by DL may be challenging to interpret in practical terms, the study implemented Grad-CAM to provide interpretability to the DL model. This involved characterizing the distribution of contributions to the output prediction results (Fig. 5).

Fig. 5
figure 5

Grad-CAM visualization for DenseNet121 feature extraction. A–C shows the upper, middle and lower abdominal CTs of a patient with PM. The degree of contribution to the predicted results gradually increases from red to blue. (Color figure online)

Furthermore, we assessed the generalization ability of the DL model in practical applications by dividing patients into different subgroups. The DL model showed strong generalization ability in the stratified analysis according to age, gender, primary tumor, presence or absence of ascites, and different CT layer thicknesses (Table 4).

Table 4 Enhanced performance evaluation of the DL model across various subgroups

Discussion

The purpose of this study was to develop and validate a DL model using preoperative enhanced abdominal CT to determine the degree of preoperative peritoneal involvement in PM patients. The results demonstrated that the model had satisfactory performance in stratifying patients and could potentially aid in treatment decision-making for PM patients.

Clinically, PM patients are usually at the end stage of the disease. Among the limited treatment options, CRS + HIPEC can maximize the overall survival time of patients, but this is only for patients with a small PM load [32, 33]. However, accurate preoperative staging of patients can avoid unnecessary surgical damage. Imaging examination, the most commonly used noninvasive diagnostic modality, has a limited ability to identify PM [34, 35]. CT combined with DL analysis in patients with PM has been widely used in predicting occult metastasis and recurrence of PM, demonstrating a strong predictive capability [27, 36, 37]. In a study of 163 patients, Zhang and colleagues [38] attempted to use radiomics rather than DL for preoperative PCI scoring, and used an external validation cohort to assess the applicability of the model. In their study, limited by intact peritoneal segmentation, only 6 regions were selected as representatives. In our study, we are the first to segment the complete peritoneum and use DL analysis of CT images for staging patients with preoperative PM. The results demonstrate the superior discriminatory performance of the Rad and DL models over clinical data alone. The DL model demonstrated superior AUC performance compared to the Rad model in both training and validation cohorts, while the DCA curve indicated that both models could enhance patient outcomes following CRS + HIPEC treatment. Additionally, Grad-CAM was used to address the black box problem of DL and provide clinicians with explanations. The focus on mesenteric heterogeneity in DL is consistent with what clinicians see intraoperatively, as the mesentery is a common site for PM.

In previous studies on using Rad or DL for PM, one of the main challenges was the segmentation of the ROI. Manual segmentation of ROI is subjective and can vary among different imaging physicians. Furthermore, the complexity of the peritoneum makes it difficult to manually sketch the entire peritoneum, and most studies [3, 27, 36, 39] only assess the status of the peritoneum in primary foci ROI or a portion of the peritoneum, which may not represent the entire peritoneum. Additionally, there is significant variation in the primary foci in patients with PM from numerous sources, and using only the primary foci ROI does not generalize the entire model to all patients with PM of tumor origin. To address these issues, we segmented the entire peritoneal region of patients semi-automatically using ITK-SNAP to segment the 3D peritoneal VOI for the first time. This approach allows us to accurately and comprehensively assess the peritoneal status of PM patients efficiently. Moreover, we divided the whole abdomen into three parts, and found that the majority of PM patients of gastric origin in our cohort and the majority of the finally screened RF and DLF also originated from the upper and middle abdomen. Clinically, gastric cancer PM is rarely found in the lower abdomen. This finding suggests that we can select different abdominal partitions for the classification of primary tumor origin to further optimize the efficacy of the model. Overall, the semi-automatic segmentation of the peritoneal VOI provides a more objective and comprehensive assessment of PM patients, and dividing the abdomen into different partitions may improve the model’s efficacy in identifying the primary tumor origin.

In our study cohort, a significant proportion of patients (47.0%) with S-PCI scores of 18 or higher underwent surgical exploration but did not receive any further surgical treatment, indicating that our DL model can potentially optimize available surgical options. Furthermore, our model holds promise for application in patients with higher PM load, where multi-stage HIPEC may be a viable option for achieving surgical indications similar to preoperative neoadjuvant chemotherapy [3]. As such, we anticipate that our model can enable selection of patients for surgical treatment and evaluation of HIPEC-treated SIP patients in the future.

However, there are limitations to our study that must be considered. First, fewer patients were evaluated for the S-PCI procedure, and larger samples may be needed to improve the reliability. Second, as our study was retrospective and based on a single-center cohort, the generalizability of the model must be validated with prospective and multicenter data. Finally, while DL analysis in medical imaging is primarily based on 2D imaging, the potential for future models utilizing 3D imaging analysis has the capacity to further optimize model performance.

In summary, our study presents a promising CT-based DL model for preoperative staging assessment of PM patients, providing a non-invasive means to optimize clinical treatment decisions.