Introduction

Cystic renal lesions (CRLs) are being detected more frequently as CT becomes more commonplace [1, 2]. While a small number of CRLs are malignant renal neoplasms requiring surgical intervention, most CRLs are benign cystic renal neoplasms or simple renal cysts which do not require surgery [3, 4]. Precisely identifying the components of CRLs is crucial for deciding on suitable treatment approaches. Contrast-enhanced computed tomography (CECT) imaging is frequently utilized to distinguish among the different types of CRLs [5]. However, the complex pattern of malignant CRLs on CECT images, including septal thickness, mural nodule enhancement, and calcifications presence, pose significant challenges for both diagnosis and management, particularly when dealing with early-stage cases [6].

Existing limitations in diagnosing malignant CRLs necessitate novel approaches [2]. While the Bosniak-2019 classification aimed to improve accuracy, recent studies have shown limited improvement compared to its predecessor [3]. Additionally, the updated criteria reclassify a significant portion of previously diagnosed class III lesions, which also adds to the follow-up burden. The misapplication of Bosniak categorization can lead to inaccurate treatment and associated diagnostic errors, resulting in adverse outcomes such as impairment of renal function, re-operation surgery, and increased medical disputes [7].

To overcome the challenges of subjective biases in visual image evaluations and improve diagnostic sensitivity (SEN), machine learning algorithms, trained on quantitative data extracted from CT scans, are demonstrating potential in differentiating between benign and malignant CRLs [8]. Minisk et al identified six features from nephrogenic CT phase scans, while Dana et al focused on extracting radiomics features from nephrogenic CT phase scans. Both studies achieved promising results in differentiating between benign and malignant CRLs by modeling these features [4, 9].

Our prior study has developed a blending ensemble machine learning algorithm integrating quantitative radiomics features with deep learning features, which demonstrated favorable classification performance in accurately differentiating between benign and malignant CRLs [10]. However, the prior study only utilized corticomedullary phase CT images to build the machine learning algorithm instead of multi-phase CT scans. In CECT phases imaging (renal protocol), renal mass enhancement is assessed by analyzing the difference in Hounsfield units (HU) before and after contrast administration. An increase of fifteen or more HU value within the solid tumor regions indicates enhancement, highlighting critical malignant renal neoplasms components [11]. In this study, we aim to develop and validate an interactive 3D U-Net segmentation model which can efficiently segment CRLs in corticomedullary and nephrogenic phase CT images [12]. To enhance the SEN and specificity (SPE) of diagnosing CRLs, a spatial encoder temporal decoder (SETD) deep learning classifier was created and validated. SETD model utilizes a pre-trained 3D ResNet50 encoder block to extract spatial features and a gated recurrent unit (GRU) network to decode temporal features from multi-phase CT scans.

Methods

Study participants and inclusion criteria

Since this study was retrospective, informed consent was not required, and the research ethics committees of each participating hospital approved it. This study is in accordance with the CLEAR checklist guidelines [13]. In the segmentation model cohort, the inclusion criteria were as follows: individuals with CRLs comprising less than 25% solid tissue, diameter more than 1 cm, and absence of history of prior kidney surgery or polycystic kidney disease.

To ensure the generation of an optimal machine learning model applicable to clinical practice, the inclusion criteria for the SETD model’s training cohort were histopathological analysis results after surgery as well as CRLs cases with no progression observed in CT or MRI imaging over a period of four years. To guarantee the reliability of the SETD model’s performance, all enrolled participants in the SETD validation and testing datasets had confirmed histological diagnosis results. During the construction of the SETD model, CRLs were incorporated in a balanced distribution across the Bosniak classification categories, enabling the model to capture the full variability of imaging appearances. Detailed inclusion and exclusion criteria are displayed in Fig. 1.

Fig. 1
figure 1

Flowchart of the enrollment process and pathology results for cystic renal lesions (CRLs) in each cohort. In the training cohort, benign CRLs were identified using pathology criteria and a 4-year CT imaging follow-up. The validation cohort was employed for model validation. The external validation cohort 1, cohort 2, and cohort 3 were combined for model testing

CT acquisition protocol

The training cohort (The First Affiliated Hospital of Chongqing Medical University), validation cohort (The Second Affiliated Hospital of Chongqing Medical University), external validation cohort 1 (Guangzhou First People’s Hospital), and external validation cohort 3 (Chongqing University Fuling Hospital) used 128-slice or 64-slice spiral CT scanners to obtain contrast-enhanced CT scans. External validation cohort 2 (Yongchuan Hospital of Chongqing Medical University) used a 256-slice spiral CT scanner. All cohorts used standardized CT imaging scanning protocols (Appendix Table E2). The prior study only utilized corticomedullary phase CT images to build the machine learning algorithm [10]. In this study, the corticomedullary phase and nephrogenic phase CT images were utilized to create the artificial intelligence (AI) system.

ROI sketching and Bosniak-2019 system reclassification quality control

All CRLs regions of interest (ROIs) labeling and Bosniak-2019 reclassification was undertaken by two abdominal radiologists (with more than 10 years clinical experience in diagnostic radiology). To precisely delineate the CRL margin, radiologists carefully reviewed multi-phase CT image data using ITK-SNAP software, incorporating information from the corticomedullary and nephrogenic phases across axial, sagittal, and coronal planes. In cases where there was contentious CRL sketching and conflicting classifications, another senior radiologist (FJ-L, over 20 years of hands-on clinical experience in diagnostic radiology, particularly in the interpretation of multi-phase CT scans) would participate in the discussion. This collaboration ensured a consensus was reached on the final ROIs sketching and Bosniak-2019 reclassification outcomes.

Interactive segmentation model architecture

The proposed deep interactive segmentation method based on 3D U-Net and geodesic distance transforms is depicted in Appendixes E1 and E2. To reduce the interactions required with radiologists, we implemented a two-stage 3D U-Net segmentation approach, consisting of an initial Proposal Network (P-net) for segmentation, followed by a Refinement Network (R-net) for enhancing the segmentation accuracy. P-net provides an initial automatic segmentation by using a raw image with single channel. The radiologist then verifies the segmentation results and adds some interactions (clicks or scribbles) to indicate the mis-segmented regions and 3D boundaries of target CRLs. R-net takes the original image, the initial segmentation, and the user interactions (encompassing a total of four image channels) as inputs to provide a refined segmentation result. This iterative process allows the radiologist to repeatedly provide feedback to R-net, refining the outcomes based on P-net’s preliminary automatic segmentation results.

Spatial encoder and temporal decoder model architecture

The SETD model adopts an encoder-decoder based modular design, which takes preprocessed multiphase CT images as input and generates the corresponding CRL malignancy probability. To extract spatial features from multi-phase CT images, the spatial encoder module uses a deep convolutional neural network composed of the encoder layers of 3DResnet50 network with pre-trained weights. Following the outputs of the spatial encoder module, the SETD incorporates a temporal decoder module. This module consists of two GRU layers. The GRU network is particularly effective in identifying evolving patterns across the multiphase CT images, a critical aspect in the differential diagnosis of malignant renal tumors. The final step in the process involves passing the GRU output through a linear layer, which calculates the probability of malignancy (Fig. 2).

Fig. 2
figure 2

Flowchart providing a detailed procedures used to construct the CRLs artificial intelligence (AI) system

Statistical analysis

SEN, SPE, Intersection over union (IOU), and dice similarity (Dice) were used to evaluate the segmentation model. The area under the receiver operator characteristic curve (AUC), decision curve analysis (DCA), and accuracy score (ACC) were utilized to evaluate the performance of SETD model compared with prior machine learning models and Bosniak-2019 version. The level of statistical significance was determined by two-sided p value of less than 0.05. All models were implemented using MONAI framework (version 1.12, https://monai.io/). The interactive segmentation model, SETD model and data processing procedure are publicly available on GitHub (https://github.com/jacobhqh1997/CRL_AI).

Results

Patient clinical characteristics

The characteristic distributions of the patients in each cohort are summarized in Table 1. From 2012 to 2023, we included 477 CRLs in the training cohort, 226 CRLs in the validation cohort, and 239 CRLs in the testing cohort (external validation cohort 1, cohort 2, and cohort 3).

Table 1 Baseline characteristics of patients with CRLs

Based on the 2005 Bosniak version, 11 CRLs in the SETD training cohort were originally categorized as Bosniak II (8 lesions) and III (3 lesions). After a 4-year imaging follow-up, these were confirmed to be benign renal cysts. Additionally, one CRL was reclassified to Bosniak IV according to the updated 2019 Bosniak criteria. Histopathological results have been obtained for all lesions within the SETD validation and test cohorts, along with malignant CRLs in the SETD training cohort.

Interactive segmentation model performance

Detailed performance metrics for both P-net and R-net in the validation and testing datasets are presented in Table 2. The proposed 3D R-net exhibited superior accuracy (validation: Dice = 0.847, IOU = 0.743, SEN = 0.840, SPE = 1.000; testing: Dice = 0.861, IOU = 0.762, SEN = 0.876, SPE = 1.000) with geodesic distance compared to the P-net (validation: Dice = 0.781, IOU = 0.713, SEN = 0.840, SPE = 1.000; testing: Dice = 0.803, IOU = 0.735, SEN = 0.775, SPE = 1.000). Figure 3 shows an example usage of the initial CRLs segmentation and refined CRLs segmentation by using radiologist-provided margin points, respectively.

Table 2 Performance of P-net and R-net models in validation and testing datasets
Fig. 3
figure 3

Example usage of the initial CRLs segmentation model and refined CRLs segmentation model. a Original image inputs, (b) initial segmentation results combined with negative indicator points (red dots), (c) initial segmentation results combined with positive indicator points (green dots), (d) refined segmentation results

SETD classifier performance in CRL classification

The SETD model was evaluated on the validation and testing datasets using the best training weights. The model achieved a SEN of 100% and a SPE of 90.3% for the malignancy prediction in the validation cohort, and a SEN of 100% and a SPE of 97.2% for the malignancy prediction in the testing cohort.

Bosniak-2019 system performance in CRL classification

The Bosniak-2019 classification system classified CRLs into two groups: benign (I, II) and potentially malignant (IIF, III, IV). The system achieved a SEN of 100% and a SPE of 68.1% for the malignancy prediction in the validation cohort, and a SEN of 88.2% and a SPE of 85.5% for the malignancy prediction in the testing cohort.

Machine learning algorithm performance in CRL classification

We extensively assessed the preceding four machine learning algorithms based on our earlier research. The detailed radiomic features used in the previous machine learning algorithms are described in the Appendix E4. The Blending ensemble classifier was the best-performing model in the validation set (AUC = 0.902 (95% CI: 0.768–1.000), ACC = 0.940 (95% CI: 0.938–0.941)). The lightgbm classifier was the best-performing model in the testing set (AUC = 0.968 (95% CI: 0.942–0.993), ACC = 0.883 (95% CI: 0.881–0.884)). The detailed performance metrics of the four models are presented in Table 3.

Table 3 Performance of each model and Bosniak-2019 classification in validation and testing datasets

Clinical utility of SETD classifier

Figure 4 presents a comparative analysis of the AUC performance of each model alongside the Bosniak-2019 classification, with the latter being specifically evaluated by radiologists. Figure 5 shows a detailed comparison between the confusion matrix of the SETD model and the Bosniak-2019 classification, with the Bosniak classification being evaluated by radiologists.

Fig. 4
figure 4

The diagnostic performance of each model evaluated by receiver operating characteristic (ROC) curve. a ROC curve in validation dataset, (b) ROC curve in testing dataset

Fig. 5
figure 5

Confusion matrices for the SETD Model and the Bosniak-2019 classification. a Confusion matrix for SETD model in validation datasets, (b) confusion matrix for Bosniak-2019 in validation datasets, (c) confusion matrix for SETD model in testing datasets, (d) confusion matrix for Bosniak-2019 in testing datasets

In the validation and testing datasets, the SETD model demonstrated a statistically significant higher AUC value (validation: 0.973, 95% CI: 0.944–1.000; testing: 0.998, 95% CI: 0.993–1.000) compared to the Bosniak-2019 classification, as indicated by the AUC DeLong test (p < 0.05). Figure 6 depicts the results of DCA in each model performed on validation and testing datasets, showing that the SETD model provides a greater net benefit than the “none” and “all” treatment strategies for all considered threshold probabilities. The deep learning system outperformed the Bosniak-2019 based management guideline in both the validation and testing datasets, suggesting that clinical decision support could be improved by employing the SETD algorithm. The Appendix Figs. E3 and E4 shows a detailed confusion matrix for machine learning models and the Bosniak-2019 classification.

Fig. 6
figure 6

Decision Curve Analysis (DCA) for each model in validation and testing dataset. a DCA in validation dataset, (b) DCA in testing dataset. The y-axis represents the net benefit, while the x-axis represents the corresponding threshold probability. The SETD classifier is represented by red line

Discussion

In this retrospective multicenter study, we developed a non-invasive AI system to stratify malignant and benign CRLs. The reliability and applicability of the AI algorithm are determined by the following key factors: 1. Radiologists can provide feedback to the interactive CRL segmentation model, which can improve the segmentation results and make the process more efficient and accurate. 2. Histopathologic results serve as the diagnostic gold standard for CRL classification in both validation and testing datasets. 3. Early stopping approaches and weighted cross-entropy loss prevented overfitting in SETD training steps. 4. The SETD classifier demonstrated strong diagnostic performance in both validation and testing datasets. 5. The calculated RQS quality score of this study is 43, which demonstrates that the AI algorithm construction is trustworthy and repeatable.

In diagnosing malignant renal neoplasms, evaluating changes in HU value before and after CECT scans is highly effective. For example, clear cell renal cell carcinoma (RCC) typically exhibits a fast-in, fast-out CT presentation in CECT scans. Chromophobe RCC and Papillary RCC tend to be relatively homogeneous contrast enhancement in  CECT scans [14]. During the corticomedullary and nephrographic phases, benign CRLs, such as simple renal cysts and cystic nephroma, typically exhibit a more gradual and less pronounced intense enhancement [15]. We developed the SETD model to detect useful features from multi-phase CT images, with a focus on feature changes in multi-phase CT images. This greatly improved the model’s predictive performance and interpretability [16]. Given that feature-based machine learning algorithms are more sensitive to variations in imaging parameters, the SETD algorithm exhibits a more robust performance in external validation dataset compared to the machine learning model used in our previous study.

A growing body of evidence suggests that cystic RCC is less aggressive and has a better prognosis than solid RCC of the same size [17]. One plausible explanation for this is that only the solid portions of malignant CRLs contain malignant cells, which generally make up a very small portion of the lesion’s overall volume [18]. Developing reliable methods to differentiate benign cysts from malignant tumors, especially in larger cysts is crucial. Risk stratification based on our deep learning system can help guide management decisions. Low-risk cysts may be monitored conservatively or treated with minimally invasive procedures like laparoscopic decortication, while high-risk cysts may require more extensive evaluation or intervention [19].

Updated guidelines from various medical organizations have shifted toward more conservative approaches to manage renal cysts [20]. Active surveillance has become the recommended option for Bosniak IIF and even some Bosniak III and IV lesions since the majority of CRLs are benign renal cysts [21, 22]. The updated Bosniak classification version aims to reduce inter-reader variability and enhance the accuracy of malignancy predictions in CRLs. However, the effectiveness of this revised classification in clinical practice still requires validation. Moreover, the implications of prolonged medical follow-up for high-level Bosniak renal cysts should be carefully considered. To our knowledge, this is the largest CRLs related AI research with multicenter validation, providing a more reliable tool for preoperatively discriminating malignant CRLs. A recent study on IIF-classified CRLs with a median follow-up duration of 50 months showed a significantly lower incidence of IIF renal lesions upgrading and malignancy than previously reported. The research also observed that all participants did not demonstrate any significant radiological progression beyond 36 months of follow-up. This indicates that follow-up beyond 36 months may not be necessary in most cases [23]. In this research, the SETD model delivered a clear, concise, and binary (yes/no) determination regarding the likelihood of malignancy in CRLs. This demonstrates its practical utility in clinical settings, aiding radiologists and urologists in evaluating and selecting the optimal surgical strategy for CRL treatment.

Our research has several limitations. Firstly, its retrospective design necessitates further prospective validation to reinforce findings. Secondly, the strict inclusion and exclusion criteria may introduce selection bias, suggesting the need for broader international multi-center studies to confirm the generalizability of our results. Thirdly, all ROIs in interactive segmentation model training were labeled by radiologists, and this approach seems a little bit old-fashioned despite it can produce more reliable and convincing results. Our follow-up study will examine whether semi-supervised segmentation methods can be applied in clinical practice to reduce radiologists’ workload [24]. Lastly, clinical characteristics like age and gender may be potential predictors. A mixture model that combines deep learning networks with clinical features may further improve the diagnostic model’s performance.

In conclusion, we developed and validated a novel AI system, combining an interactive 3D U-Net segmentation model with a SETD model, to accurately differentiate between benign and malignant CRLs. The AI algorithm demonstrated good discrimination capability across validation and testing datasets. The non-invasive AI system will facilitate accurate diagnosis of renal cystic lesions, reducing excessive follow-up and overtreatment.