Introduction

The most common extracranial solid tumor in childhood is neuroblastoma (NB). Approximately one-third of NBs originate from the adrenal gland [1]. Wilms tumor (WT) is the most common renal tumor in childhood and typically presents as an abdominal mass [2, 3]. These tumors differ markedly in clinical course and management strategy; thus, early distinction is essential.

MRI is the preferred modality for the assessment of NB and WT. However, anesthesia is required for MRI, and some surgeons, including those in our institute, prefer CT before surgery. When the tumor is large, which is the usual case with childhood intraabdominal solid tumors, identifying the organ of origin and the actual tumor type becomes increasingly difficult [3,4,5,6].

In contrast to North American practice, European guidelines favor preoperative chemotherapy and avoid biopsy in WT management, because surgery becomes easier, tumor spillage decreases, and the tumor may be downstaged, which eventually reduces the need for radiotherapy [3, 5]. NB, on the other hand, responds well to neoadjuvant chemotherapy, which is usually followed by surgery for local disease control in conjunction with radiotherapy. The chemotherapy protocols for the two tumors are quite different, and a misdiagnosis may lead to administration of an inappropriate chemotherapy regimen. Therefore, preoperative differentiation between WT and NB is essential for selecting the appropriate chemotherapy regimen and for optimal management of these patients [5, 6].

However, without pathologic specimens, differentiation of these tumors using only classical imaging findings yields a global accuracy of 86% [7, 8]. In suspected NB cases, measuring urinary catecholamines may additionally help [9]. Nevertheless, more accurate, reliable, reproducible, and automated methods are needed to differentiate NB and WT.

Machine learning methods, including radiomics and deep learning, have recently gained considerable attention for the quantitative assessment of medical imaging data. Successful applications of radiomics to WT stage determination, NB MYCN proto-oncogene (MYCN) status determination, and pathologic type determination have been published [10,11,12,13,14].

To our knowledge, no study has applied machine learning to differentiate WT and NB on preoperative CT images. Our aim in this study was to differentiate NB and WT on preoperative CT scans using machine learning.

Methods

Patients

This is a single-center retrospective study approved by the Institutional Ethical Board (GO 21/1282). Informed patient consent was waived by the institutional Ethics Committee because of the retrospective nature of the study. CT scans of consecutive patients diagnosed with NB or WT and admitted to our hospital from January 2005 to December 2021 were evaluated. Patients who received treatment before CT were excluded from the study. A total of 147 patients with pathologically confirmed diagnoses were included. The inclusion and exclusion criteria are presented in Fig. 1.

Fig. 1

Overview of patient selection and study pipeline. a Patient selection flow. b Study pipeline

CT protocol

In our institution, CT images were obtained after administering 2 mL/kg of non-ionic iodinated intravenous contrast material with an automatic injector at a rate of 2–3 mL/s. All CT scans, including those from external centers, were acquired in the portal venous phase. The slice thickness of the CT scans was 1.25 mm, 2 mm, 2.5 mm, 3 mm, or 5 mm. The kV settings and slice thickness distributions are provided in the supplementary material (Figure S1, Figure S2).

Clinic-radiologic data collection

Age was dichotomized using 24 months as the cut-off. Vomiting, diarrhea, constipation, abdominal distension, fever, abdominal mass, pain, and incidental detection were recorded as clinical variables. Three distinct datasets were curated. For the first dataset, patients were separated according to the institution where the images were acquired: images acquired in our institute comprised the development set, and images that were acquired in other institutes for patients referred to our hospital and then uploaded to our local PACS comprised the external validation set. For the second and third datasets, a CT machine-based partition was applied. In the second dataset, images acquired by any machine other than Siemens were used as the development set, and the Siemens images were used as the external validation set. In the third dataset, images acquired by any machine other than GE were used as the development set, and the images acquired with GE CT machines were used as the external validation set. The CT machines used in this study are listed in the supplementary material.
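The three partitions described above can be sketched as simple filters over a scan list. This is a minimal illustration only: the `Scan` record, its field names, and the toy entries are hypothetical and are not taken from the study data.

```python
from collections import namedtuple

# Hypothetical record layout; the field names are illustrative assumptions.
Scan = namedtuple("Scan", ["patient_id", "vendor", "in_house"])


def split_by_center(scans):
    """In-house scans form the development set; referred scans form the external set."""
    development = [s for s in scans if s.in_house]
    external = [s for s in scans if not s.in_house]
    return development, external


def split_by_vendor(scans, held_out_vendor):
    """Hold out one vendor as the external set; all other vendors form the development set."""
    development = [s for s in scans if s.vendor != held_out_vendor]
    external = [s for s in scans if s.vendor == held_out_vendor]
    return development, external


# Toy data, invented for illustration.
scans = [
    Scan("p1", "Siemens", True),
    Scan("p2", "GE", False),
    Scan("p3", "Toshiba", True),
    Scan("p4", "Siemens", False),
]
dev_center, ext_center = split_by_center(scans)
dev_siemens, ext_siemens = split_by_vendor(scans, "Siemens")
dev_ge, ext_ge = split_by_vendor(scans, "GE")
```

Holding out a whole vendor, rather than a random subset, keeps every scan from that vendor unseen during development, which is what makes the held-out set a test of robustness to acquisition hardware.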

Ground truth segmentation, human reader study, and image labelling

Three radiologists (B.B., A.A.T., and G.O., with 3, 5, and 10 years of experience, respectively), blinded to the clinicopathologic data, performed volumetric segmentation of 60 patients using 3D Slicer [15]. This dataset was used to determine the feature set that is robust to differences between the radiologists' segmentations. One radiologist (A.A.T., with 5 years of experience) completed the segmentation of the remaining patients with the same protocol (Figs. 2, 3, and 4).

Fig. 2

Difference maps overlaid on images. The lower left corner is a 3D mesh of the error map. Red represents oversegmented; pink represents undersegmented regions compared to the reference. The other three corners are axial, sagittal, and coronal single-slice overlays. a Error map of observer 1 relative to observer 2. b Error map of observer 1 relative to observer 3. c Error map of observer 2 relative to observer 3

Fig. 3

Neuroblastoma with large retroperitoneal mass. Typical characteristics such as crossing the midline, calcifications, and encasement of the vessels are evident. a Axial. b Sagittal. c Coronal slices with segmentation map overlays in the lower rows. d Posterior view of a 3D mesh of the tumor (red), left (blue), and right (green) kidneys, along with encased aorta (purple)

Fig. 4

Wilms tumor with typical features. a, b Axial. c Sagittal. d Coronal slices. Tumor thrombus extending into the inferior vena cava is evident in a. Typical heterogeneous areas with large low-density necrotic zones not extending across the midline in b, c, and d

The same three radiologists also examined the images after segmentation. They predicted the tumor type based on CT imaging features along with calcification, necrosis, tumor thrombus, and extension across the midline.

Radiomic feature extraction and selection

Pyradiomics [16], an open-source Python package (v3.0 https://pyradiomics.readthedocs.io/en/latest/), was used for feature extraction. Laplacian of Gaussian (LoG) filtering with five distinct sigma values and a one-level 3D wavelet transformation (WaT) were applied in addition to the original images. A total of 1218 features complying with Image Biomarker Standardization Initiative guidelines were extracted [16,17,18,19]. Unsupervised and supervised feature selection methods were applied sequentially. In the unsupervised step, robust, non-redundant, and high-variance features were retained. In the supervised step, four different feature selection methods (statistical filter based, volcano plot based, recursive feature elimination (RFE) model based, and maximum relevance minimum redundancy (MRMR) based) were used to leverage each method's advantages [20,21,22]. More details are provided in the supplementary material.
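As an illustration of the unsupervised step, the sketch below drops near-constant features and then greedily removes features that are highly correlated with one already kept. The thresholds (`min_variance`, `max_abs_corr`) and the greedy ordering are assumptions made for this example; the criteria actually used are in the supplementary material, and the robustness filter across segmentations is not shown here.

```python
import math


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def unsupervised_filter(features, min_variance=1e-8, max_abs_corr=0.9):
    """features maps a feature name to its per-patient values.

    Both thresholds are illustrative assumptions, not values from the study.
    """
    kept = []
    for name, values in features.items():
        if variance(values) <= min_variance:
            continue  # near-constant feature carries no information
        if any(abs(pearson(values, features[k])) > max_abs_corr for k in kept):
            continue  # redundant with a feature that was already kept
        kept.append(name)
    return kept


# Toy feature table, invented for illustration.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],   # almost a multiple of "a"
    "c": [5.0, 5.0, 5.0, 5.0],   # constant
    "d": [4.0, 1.0, 3.0, 2.0],
}
selected = unsupervised_filter(toy)
```

The order in which features are visited determines which member of a correlated pair survives, so a real pipeline would normally rank features (for example by robustness) before applying the redundancy filter.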

For the clinical dataset, the number of features was already low; therefore, no feature selection method was applied.

Model building and selection

Support vector machines (SVM) and random forests (RF) are well-established classifiers that can be applied to both linear and non-linear classification tasks [23, 24]. SVM is a linear classifier; however, a kernel that projects the data into a higher-dimensional feature space turns it into a non-linear classifier. We used the radial basis function (RBF) kernel for the non-linear application of SVM [23]. To combat class imbalance, we used the Synthetic Minority Oversampling Technique (SMOTE) and weighting of the loss function. We therefore experimented with seven model types (SVM, SVM with RBF, SVM with loss function weighting, SVM with SMOTE, RF, RF with loss function weighting, and RF with SMOTE), each combined with the four feature sets produced by the feature selection methods. To better estimate the skill of the models, we trained them with fourfold stratified cross-validation repeated 10 times and reported the performance of the best models as mean and SD. An overview of the model building and training process is given in Fig. 1. We also provide bar plots of selected features against model coefficients (Fig. 5) and ROC curves of the models for each task (Fig. 6).
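The resampling scheme can be sketched in plain Python as follows. This is a simplified stand-in written for illustration, not the implementation used in the study (which relied on scikit-learn); the seed and the round-robin fold assignment are assumptions.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=4, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) pairs with class proportions kept per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so every fold gets its share.
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test


# Class counts of the center-based training set reported in the Results.
labels = ["NB"] * 30 + ["WT"] * 61
splits = list(repeated_stratified_kfold(labels))
```

Ten repeats of four folds give 40 train/test pairs, which matches the 40 experiments over which the mean and SD are reported.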

Fig. 5

Selected features with model coefficients for each task. a Clinic-radiologic model. b Clinical center-based radiomics model. c Siemens vs. non-Siemens CT machine-based radiomics model. d GE vs. non-GE CT machine-based radiomics model. e Combined model

Fig. 6

ROC curves and decision curves for the models in the training phase. a Clinic-radiologic model ROC curve. b Clinical center-based radiomics model ROC curve. c Combined model ROC curve. d Decision curve for the models in a, b, and c

Statistics

Python (version 3.7.3, https://www.python.org) was used for feature extraction, selection, and statistical analysis. The Matplotlib, seaborn, NumPy, pandas, SciPy, and scikit-learn packages were employed for the analysis. The Mann–Whitney U test was used for the continuous variable (age), and the chi-square test was used for the categorical variables. Statistical significance was set at p < 0.05. The Dice score, defined as twice the size of the intersection of two segmentation masks divided by the sum of their sizes, was used to assess the concordance of segmentation ROIs among users. Cohen’s kappa was used to assess the agreement between the observers' predictions.
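For reference, the standard Dice coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). A minimal sketch on flattened masks follows; the toy masks are invented, and returning 1.0 for two empty masks is a convention chosen for this example.

```python
def dice_score(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as flat 0/1 sequences."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks are treated as perfectly concordant (assumed convention).
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks: three voxels each, two of them shared.
observer_1 = [1, 1, 1, 0, 0, 0]
observer_2 = [0, 1, 1, 1, 0, 0]
score = dice_score(observer_1, observer_2)
```

In this toy case the intersection has two voxels and each mask has three, so the score is 2 × 2 / 6 ≈ 0.67; the intersection-over-union of the same masks would be 2 / 4 = 0.5, which shows that the two measures are not interchangeable.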

Code availability

The code will be publicly available at https://github.com/ozgurkoska78/wt_nb after the manuscript is accepted.

Results

Patients

Of 174 patients with pathologically confirmed NB or WT, 57 consecutive NB and 90 consecutive WT patients were included after applying the inclusion and exclusion criteria. The patients’ demographic, clinical, and radiological characteristics are summarized in Table 1. For the image acquisition center-based analysis, the training set acquired in our university hospital included 30 NB and 61 WT patients. The external validation set acquired in different centers included 27 NB and 29 WT patients.

Table 1 Demographic and clinical characteristics of the dataset. Counts, percentages (in parentheses), and p values are presented

NB patients tended to be younger than WT patients (mean age 23.77 months vs. 34.78 months); however, after dichotomization at the 24-month cut-off, there was no statistically significant difference between the two groups in either the center-based training set or the external validation set (p = 0.07). There was no statistically significant difference in gender between NB and WT patients in either the center-based training set (NB male: 16 (53.3%), female: 14 (46.7%); WT male: 28 (45.9%), female: 33 (54.1%); p = 0.65) or the external validation set (NB male: 15 (55.5%), female: 12 (44.5%); WT male: 15 (51.7%), female: 14 (48.3%); p = 0.98).
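The gender comparison is a chi-square test on a 2 × 2 table, which can be sketched with the standard library alone. Whether a continuity correction was applied in the study is not stated; the sketch assumes Yates' correction, under which the p values come out close to those reported, which suggests but does not confirm that a corrected test was used.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square test of independence for the table [[a, b], [c, d]] (1 degree of freedom)."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction (assumed here, not confirmed by the text).
        diff = max(0.0, diff - n / 2.0)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    statistic = n * diff ** 2 / denominator
    # With 1 degree of freedom the survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Rows: NB, WT. Columns: male, female. Counts as reported in the text.
stat_train, p_train = chi_square_2x2(16, 14, 28, 33)
stat_ext, p_ext = chi_square_2x2(15, 12, 15, 14)
```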

Twenty different machines from four vendors were used for image acquisition. Images of 62 patients were acquired by Siemens (42%), 59 by GE (40%), 18 by Toshiba (12%), and eight by Philips (6%) CT machines.

Interobserver correlation and human reader study

Dice scores were calculated to assess the robustness of the segmentation maps. The mean Dice score was 0.86 between observers 1 and 2, 0.88 between observers 1 and 3, and 0.92 between observers 2 and 3. The error maps between observers for one NB patient are provided in Fig. 2.

Cohen’s kappa was calculated to assess the observers’ agreement on tumor class prediction. The kappa value was 0.65, interpreted as substantial agreement between raters. Observer 1 and observer 2 each predicted the correct class with an accuracy of 0.96. The accuracy of observer 3 was 0.93. Two patients could not be correctly classified by any of the observers.
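Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance from each rater's label frequencies: kappa = (p_o − p_e) / (1 − p_e). The sketch below handles one pair of raters; how the single value reported above was obtained from three observers is not specified here, and the toy ratings are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of cases")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of the raters' marginal frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy ratings, invented for illustration.
reader_1 = ["NB", "NB", "WT", "WT", "WT", "NB", "WT", "WT"]
reader_2 = ["NB", "WT", "WT", "WT", "WT", "NB", "WT", "NB"]
kappa = cohens_kappa(reader_1, reader_2)
```

In the toy example the raters agree on 6 of 8 cases (p_o = 0.75) while chance agreement is 34/64, so kappa is 7/15 ≈ 0.47, noticeably lower than the raw agreement.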

Clinical data analysis

Age, gender, and clinical symptoms and signs (including diarrhea, constipation, mass, pain, fever, failure to thrive, incidental detection, and metastasis) were used for the clinical model. The clinical model achieved a mean AUC of 0.80 (Table 2). At inference with the best model, the training accuracy was 0.83 and the external validation accuracy was 0.70 (Table 3).

Table 2 Performance metrics of the models. The metrics were derived from 40 experiments based on fourfold cross-validation repeated 10 times for better estimation of model skill. Mean values are given with standard deviations in parentheses
Table 3 Performance metrics with best-performing features, model pipeline and hyperparameter combinations

Clinical center-based radiomics model

The model with the best F1 score was an SVM model with weighted coefficients built with MRMR-selected features. This model achieved an F1 score of 0.90 (SD:0.05) and an AUC of 0.91 (SD:0.05) (Table 2). With the best combination of model and feature set, the training accuracy was 0.92 and the training AUC was 0.94, while the external validation accuracy was 0.71 and the external validation AUC was 0.80 (Table 3).

CT machine-based radiomics model

There were 20 NB and 42 WT patients in the Siemens dataset, which served as the external validation set, and 37 NB and 48 WT patients in the dataset from the other machines, which served as the training set. Similarly, there were 23 NB and 36 WT patients in the GE dataset, which served as a distinct external validation set, and 34 NB and 54 WT patients in the dataset from the other machines, which served as a distinct training set for the alternative classification task. With the Siemens-based partition, the most successful model was an SVM built with RFE-selected features, which achieved an F1 score of 0.84 (SD:0.05) and an AUC of 0.90 (SD:0.06). Similarly, with the GE-based partition, an SVM model with RFE-selected features had the highest performance, with an F1 score of 0.89 (SD:0.04) and an AUC of 0.91.

Clinic-radiological and radiomics combined model

We further extended our analysis by combining the clinic-radiologic and radiomics data. For this analysis, we first created a union radiomics dataset, which included the features selected by each of the four feature selection strategies. This yielded a 13-feature dataset. We then added all 10 clinic-radiologic features to obtain a combined dataset, followed by RFE-based feature reselection. With this strategy, we selected the seven best features, of which two were clinic-radiologic (age and metastasis) and five were radiomics based (two LoG-transform, two wavelet-transform, and one original shape-based feature). An SVM with a radial basis function kernel trained on this feature set achieved the best overall performance, with an F1 score of 0.94 (SD:0.03) and an AUC of 0.96 (SD:0.04).

Clinical utility of the prediction models

Compared with scenarios in which no combined prediction model was used, the combined model produced a better net benefit than clinical and radiomics models for all thresholds (Fig. 6d).
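Decision curves plot net benefit against the threshold probability p_t, where net benefit is conventionally defined as TP/n − (FP/n) · p_t / (1 − p_t). The following sketch shows that calculation together with the "treat all" reference strategy; the labels and predicted probabilities are invented for illustration and are not study data.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on every case whose predicted probability is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy in which every case is treated as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy labels (1 = positive class) and predicted probabilities.
y_true = [1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.7, 0.4, 0.2, 0.6, 0.8]
model_nb = net_benefit(y_true, y_prob, threshold=0.5)
all_nb = net_benefit_treat_all(y_true, threshold=0.5)
```

A model is considered useful at a given threshold when its net benefit exceeds both the "treat all" value and zero, the latter corresponding to treating no one.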

Subgroup analysis based on lesion volume

We further analyzed the effect of lesion size on the predictions of the radiomics model. There were 44 patients with lesions smaller than 150 cc and 64 patients with lesions between 150 and 500 cc (Supplementary material Figure S3). Human observers classified these small and intermediate-size lesions correctly because the organ of origin was more easily determined. In contrast, the model failed more often on these lesions because their smaller size made textural patterns harder to find. When these small and intermediate-size lesions were excluded, the predictive accuracy of the three human observers dropped to 0.84, 0.81, and 0.71, whereas the predictive accuracy of the radiomics model in the external validation set increased to 0.84. When we restricted the analysis to lesions larger than 1000 cc, the predictive accuracy of the human observers dropped further to 0.7, 0.7, and 0.6, while that of the radiomics model on the external validation set increased to 0.9. This analysis further supports the benefit of the radiomics model as a decision-making aid for large tumors, in which the organ of origin is difficult to determine.
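The subgroup analysis amounts to recomputing accuracy on cases above a volume cut-off. A minimal sketch follows; the case tuples are invented and do not reproduce the reported accuracies.

```python
def accuracy_above_volume(cases, min_volume_cc):
    """Accuracy restricted to lesions larger than min_volume_cc.

    cases is an iterable of (volume_cc, true_label, predicted_label) tuples.
    Returns None when no lesion exceeds the cut-off.
    """
    subset = [(truth, pred) for volume, truth, pred in cases if volume > min_volume_cc]
    if not subset:
        return None
    return sum(1 for truth, pred in subset if truth == pred) / len(subset)


# Toy cases, invented for illustration.
cases = [
    (120, "WT", "WT"),
    (300, "NB", "WT"),
    (800, "NB", "NB"),
    (1200, "WT", "WT"),
    (1500, "NB", "NB"),
    (2000, "WT", "NB"),
]
acc_over_500 = accuracy_above_volume(cases, 500)
acc_over_1000 = accuracy_above_volume(cases, 1000)
```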

Radiomics quality score

We further calculated the radiomics quality score (RQS) as a percentage. The score for our study was 50%. However, some RQS items, such as cost-effectiveness analysis, biological basis analysis, and test–retest assessment with CT, could not be applied to our study [25].

Discussion

Although interest in radiomics and machine learning for medical imaging is increasing, built and validated CT radiomics models that preoperatively distinguish neuroblastoma from Wilms tumor among solid pediatric abdominal tumors are lacking. We constructed a machine learning-based CT clinic-radiomics model to noninvasively predict neuroblastoma or Wilms tumor on abdominal CT examinations of pediatric patients, with a mean F1 score of 0.94, a mean accuracy of 0.93, and a mean AUC of 0.96. We showed that incorporating clinical and radiological knowledge into the radiomics features increased the model’s performance. Importantly, two patients who were not correctly identified by any of the observers were correctly predicted by the developed model, which is additional evidence of the benefit of machine learning-based radiological decision support systems. When the analysis was restricted to the largest tumors, with a volume greater than 1000 cc, which pose the greatest difficulty for differential diagnosis, our model surpassed human-level accuracy by 20–30 percentage points. Radiomics has proved to be an important digital biopsy approach for defining the underlying histopathologic features of tumors [26, 27]. We performed hierarchical screening and reduction of radiomics features and integrated clinical and radiological knowledge to build a robust classifier. The inferior performance of the model in the external validation sets may be attributed to the heterogeneity of the dataset, which was obtained from multiple institutions and different machines. In this cohort, 20 machines with various reconstruction kernels and acquisition parameters were included to better reflect real-life conditions. We tried to reduce the data heterogeneity by preprocessing the images, including voxel resampling to 1 mm³ resolution, intensity shift, and gray-level discretization.
Regardless, the combined model, which integrated the imaging-based features with demographic and clinical features, demonstrated remarkable performance in predicting NB or WT. Increasing the number of samples acquired by different machines, using different protocols, and different kernel reconstructions, which we aim to carry out in future studies, might be a good strategy to mitigate the performance drop in external test sets. This approach involves modeling and learning the inherent heterogeneity of different image acquisition conditions.
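Of the preprocessing steps mentioned above, gray-level discretization is the simplest to illustrate. The sketch below uses the fixed-bin-width scheme, in which an intensity x is mapped to floor(x / W) − floor(min / W) + 1. The bin width of 25 and the toy intensities are assumptions made for this example; the setting used in the study is not stated in this section.

```python
import math


def discretize_fixed_bin_width(intensities, bin_width=25.0):
    """Map intensities to 1-based gray levels using bins of constant width.

    bin_width is an assumed example value, not a parameter reported in the study.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    offset = math.floor(min(intensities) / bin_width)
    return [math.floor(v / bin_width) - offset + 1 for v in intensities]


# Toy intensities in Hounsfield units, invented for illustration.
gray_levels = discretize_fixed_bin_width([-30.0, 0.0, 24.0, 25.0, 60.0])
```

With a fixed bin width, the same intensity difference maps to the same gray-level difference in every scan, which is one reason this scheme is often chosen for CT, where intensities are calibrated.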

Few research papers in the radiological literature deal with the differential diagnosis of tumors of renal and non-renal origin using imaging methods [10, 28]. A research paper that reported the diagnostic accuracy based on these findings achieved 82% global accuracy [29]. In another study, the authors identified nine patients with NB who had undergone exploratory laparotomy with a preoperative diagnosis of WT; eight underwent nephrectomy at exploration [28]. However, in NB, every effort should be made to preserve both kidneys [10, 28, 29]. The authors recommended checking urinary catecholamine levels, which are elevated in 90% of NB cases, if there is any question about the diagnosis [28, 29]. In our dataset, urinary catecholamine records were available for the NB patients but not for the WT patients; thus, we did not include them in our models. Nevertheless, our model reached an accuracy of 0.93 and an AUC of 0.96, outperforming urinary catecholamines alone. Including laboratory variables such as urinary catecholamines might further improve our results, which we plan to test in a future study.

In our combined model, seven features, three of which were age, presence of metastasis, and short axis length, were selected for the final classification task. Consistent with our findings, one study found that the probability of diagnosing WT increases with tumor volume and tumor size (OR:7.93 and 4.37, respectively) and that the presence of metastasis is inversely associated with WT (OR:0.19) [29].

Radiomics has shown promising results for several types of cancer [30,31,32], and more recently, it has been applied to NB [10, 11, 33,34,35] and WT [15]. Most of the research on radiomics for NB focuses on identifying MYCN amplification [10, 33,34,35]. These studies reported AUC scores ranging from 0.72 to 0.97 [10, 34, 35]. No external validation cohort was used in the published studies. In our study, the center-based model and the two CT machine-based models achieved AUC scores of 0.82, 0.95, and 0.80 in the external validation sets, reaching 0.94 AUC in all training sets.

123I-MIBG is selectively concentrated in more than 90% of neuroblastomas; therefore, it can also be used to differentiate NB and WT [1]. However, the high radiation exposure of this technique limits its use. The limited availability of this imaging facility and the additional costs related to the radiotracer should also be considered. CT is routinely acquired to evaluate the anatomical relationships of the tumor, among other benefits. Our model evaluates the readily available CT images without additional cost or intervention to the patient, providing a significant advantage over radiotracer-based methods.

Although no previous research has differentiated WT and NB on abdominal CT with machine learning, the studies discussed above support the successful application of the radiomics approach to diagnosis at the cellular and molecular level for WT and NB. The central dogma of radiomics states that imaging phenotypes reflect pathophysiological processes that may alter the lesions’ morphology. These complex interactions of different tissue types and pathological processes can be captured by computational tools [16]. In our study, four second-order radiomics features were selected for the final classifier; two were LoG-based, and the remaining two were WaT-based. The Laplacian operator takes a second-order derivative of the Gaussian-smoothed image, capturing sudden changes in pixel values [19]. The GLSZM Large Area Low Gray Level Emphasis (LALGLE) feature obtained from the LoG filter with sigma = 1 and the GLDM Dependence Variance (DV) feature obtained from the LoG filter with sigma = 3 were selected for the final classification. The edges in the image represent the interfaces between regions of different composition. Therefore, LoG-transformed images can leverage accentuated edge information. High GLDM DV values may correspond to more similar regions and thus a homogeneous texture, whereas low values may represent local heterogeneity. GLSZM LALGLE measures the proportion in the image of the joint distribution of larger-size zones with lower gray-level values [16, 17]. We can interpret this as the model focusing on low-density zones interspersed within the gross tumor volume.
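The edge-accentuating behaviour of LoG filtering can be illustrated in one dimension. The sketch below samples the second derivative of a Gaussian and applies it to a step signal; it is a simplified 1D analogue written for illustration, not the 3D filter used for feature extraction, and the kernel radius of four sigma and the mean subtraction are choices made for this example.

```python
import math


def log_kernel_1d(sigma, radius=None):
    """Sampled second derivative of a Gaussian, the 1D analogue of a LoG kernel."""
    if radius is None:
        radius = int(math.ceil(4 * sigma))  # assumed truncation radius
    kernel = []
    for x in range(-radius, radius + 1):
        gauss = math.exp(-x * x / (2.0 * sigma ** 2))
        kernel.append((x * x / sigma ** 4 - 1.0 / sigma ** 2) * gauss)
    # Subtract the mean so that flat regions give an approximately zero response.
    mean = sum(kernel) / len(kernel)
    return [k - mean for k in kernel]


def filter_valid(signal, kernel):
    """Slide the kernel over the signal, keeping only fully overlapping positions."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


# A single sharp edge between two flat regions.
signal = [0.0] * 20 + [100.0] * 20
response = filter_valid(signal, log_kernel_1d(sigma=1.0))
```

The response is close to zero inside both flat regions and shows a positive lobe followed by a negative lobe around the edge; a larger sigma widens the lobes, so different sigma values emphasize structures at different spatial scales.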

The wavelet transform decomposes the image into high- and low-frequency components by applying high-pass and low-pass filters in each of the three directions, which yields eight different image sets with different combinations of high- and low-frequency components in each direction [18]. Our model selected two WaT-based features: WaT HHL GLRLM Long Run Low Gray Level Emphasis (LRLGLE) and WaT HLL GLSZM Zone Entropy.
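The eight sub-bands arise from choosing a low-pass (L) or high-pass (H) filter independently along each of the three axes. The sketch below enumerates those combinations and shows one low/high split of a 1D signal. A Haar filter is used purely for brevity and is a swapped-in simplification; the wavelet family used in the study is not stated in this section, and the axis order implied by the sub-band names is an assumption.

```python
import math
from itertools import product


def wavelet_subbands(n_dims=3):
    """All low/high filter combinations for a one-level separable decomposition."""
    return ["".join(combo) for combo in product("LH", repeat=n_dims)]


def haar_step(signal):
    """One level of a 1D Haar transform: pairwise averages (low) and differences (high)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return low, high


subbands = wavelet_subbands()
low, high = haar_step([4.0, 4.0, 2.0, 6.0])
```

The high-pass output is zero wherever neighbouring samples are equal and non-zero where they differ, which is why sub-bands containing H respond to local intensity changes along the corresponding axis.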

The selected feature set also indicated that our model focused on relative distributions of different tissue groups, which gives a heterogeneous appearance to the naked eye, particularly giving more importance to low-density regions.

Our study’s major limitations were the dataset size and the retrospective design, which may introduce selection bias. Nevertheless, 147 patients from a single-center archive is not a small cohort for a childhood cancer population, and we took great care to follow best practices in machine learning for medical image classification. Another limitation was the reproducibility problem of radiomics studies. We applied strict criteria to select features that are robust to different user segmentations. However, the value of the selected features should be validated in larger and more heterogeneous cohorts. Finally, we could not integrate some important clinical and radiological variables, such as urinary catecholamine levels, presence of tumor thrombus, encasement of vessels, and presence of calcification, because the information was collected for one class only. In a future study, we plan to integrate richer clinical and radiological variables and more advanced feature engineering for better performance.

Conclusion

In conclusion, the CT-based combined clinic-radiologic and radiomics model could noninvasively differentiate Wilms tumor from neuroblastoma preoperatively. Notably, the model correctly classified two patients whom none of the radiologists classified correctly. This model may serve as a noninvasive preoperative decision support tool for neuroblastoma/Wilms tumor differentiation on CT, and it should be further validated in large prospective studies.