Introduction

Craniopharyngioma is an uncommon benign intracranial tumor with two histological subtypes: adamantinomatous craniopharyngioma (ACP) and squamous papillary craniopharyngioma (PCP) [1, 2]. Because of its locally aggressive growth pattern and anatomic location close to vital brain structures, it behaves like a malignant tumor, leading to devastating neurological dysfunction along with significant mortality and morbidity [3, 4]. Patients with craniopharyngioma may suffer from severe symptoms, including endocrine deficiencies, visual impairment, and increased intracranial pressure [5, 6].

Magnetic resonance imaging (MRI) before and after gadolinium injection is now the gold-standard imaging method for both detection and assessment of craniopharyngioma [7]. The importance of reliable tumor segmentation on MRI has been highlighted. On the one hand, it supports more individualized pre-treatment assessment for both neuroradiologists and neurosurgeons when formulating therapeutic schemes, since segmentation-based three-dimensional assessment provides more sensitive tumor detection and progression monitoring in image reading [8, 9]. On the other hand, previous studies have proposed a series of segmentation-dependent models that demonstrated feasibility in pathological discrimination, differential diagnosis, and brain-invasion prediction [10,11,12,13,14,15]. However, manual delineation is tedious, laborious, and time-consuming, and commonly introduces inter- and intra-rater variation that may bias the final analysis [16]. There is therefore an urgent need for a reliable model that automatically segments craniopharyngioma on routine MRI.

Deep neural networks (DNNs) are a multidisciplinary and interdisciplinary research field that, in medical imaging, can shorten image processing time and improve the reliability of diagnostic results by leveraging big data [17, 18]. DNN segmentation has been used as a necessary pre-processing step that enables more tumor-specific analysis. Compared with other brain tumors, however, segmenting craniopharyngioma with a DNN is more challenging because its pathological subtypes vary dramatically in their imaging presentation. On routine MRI, PCP usually presents as a noncalcified, solid mass, whereas ACP shows a typical combination known as the "90% rule": roughly 90% of tumors show more or less prominent calcifications, 90% are predominantly cystic, and about 90% take up contrast in the cyst walls [19, 20]. This variability in imaging presentation hampers the performance of DNN segmentation models. In previous research, a classic DNN showed limited performance, suggesting that the network should be tailored to the imaging features of craniopharyngioma [14].

Given the importance of tumor segmentation in both routine clinical practice and academic research, we aimed in this study to develop a novel deep learning model that automatically segments craniopharyngioma on MR images using convolutional neural network (CNN) technology.

Methods

Patient selection

This was a retrospective study performed in two medical centers. The Institutional Review Boards of our hospitals approved the research, and patient consent was waived (2021-S-851). From January 2016 to December 2021, 293 and 54 cases of pathologically diagnosed craniopharyngioma were retrieved from medical centers A and B, respectively. Patients underwent pre-surgical MR scans in the radiology departments and received surgical resection in the neurosurgery departments of our institutions. The exclusion criteria were as follows: (1) unreadable MRIs with severe motion artifacts; (2) tumor intervention prior to the MR scans, such as biopsy or radiotherapy; (3) unclear pathological subtype. After applying these criteria, 264 patients from center A and 38 patients from center B were included in the current study. Figure 1 shows the flowchart of study population enrollment. The MRIs were exported from the Picture Archiving and Communication System. Clinical characteristics were also collected, including age, gender, tumor characteristics, and pathological findings.

Fig. 1 Flowchart of the study population

Sequence selection and MRI acquisition

Four sequences were available: T1-weighted imaging, T2-weighted imaging, contrast-enhanced T1-weighted imaging (T1CE), and fluid-attenuated inversion recovery. Among these, T1CE was chosen because it was optimal for identifying the tumor boundary. Unlike previous studies, which mostly used axial images, we chose coronal images because they provide more detail on the relationship between the tumor and surrounding structures, especially internal carotid artery (ICA) compression and cavernous sinus invasion. MR scans were performed in the head-first supine position. Gadolinium-DTPA (Beijing Beilu Pharmaceutical Corp.) was used as the contrast agent (0.1 mmol/kg).

In center A, MR scans were performed on 3 T Philips Achieva scanners (Philips Corp.) equipped with a head or head–neck coil. Parameters were as follows: slice thickness, 5 mm; repetition time, 142.2 ms; echo time, 2.3 ms; phase field of view, 73.44%; acquisition matrix, 0/256/204/0; flip angle, 90°; average series time, 27.24 min. In center B, MR scans were performed on 1.5 T Siemens Sonata scanners (Siemens AG) equipped with a head or head–neck coil. Parameters were as follows: slice thickness, 4 mm; repetition time, 631 ms; echo time, 7.7 ms; phase field of view, 75%; acquisition matrix, 0/256/144/0; flip angle, 80°; average series time, 29.15 min.

Image pre-processing and ground truth segmentation

The images were resized to 256 × 256, resampled to a spatial resolution of 1 × 1 × 1, and normalized to [0, 1]. A contrast-limited adaptive histogram equalization (CLAHE) approach was also used to enhance the images.
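For illustration, this pre-processing pipeline can be sketched as follows. It is a minimal sketch assuming SimpleITK and OpenCV are available; the interpolation choices, processing order, and CLAHE parameters (clip limit, tile grid) are assumptions, not the authors' original code.

```python
# A minimal pre-processing sketch: resample to isotropic spacing, apply CLAHE,
# resize slices to 256 x 256, and normalize intensities to [0, 1].
import SimpleITK as sitk
import cv2
import numpy as np

def preprocess_volume(volume_path):
    image = sitk.ReadImage(volume_path)
    # Resample to 1 x 1 x 1 spacing (linear interpolation, identity transform).
    old_spacing, old_size = image.GetSpacing(), image.GetSize()
    new_size = [int(round(sz * sp)) for sz, sp in zip(old_size, old_spacing)]
    image = sitk.Resample(image, new_size, sitk.Transform(), sitk.sitkLinear,
                          image.GetOrigin(), (1.0, 1.0, 1.0),
                          image.GetDirection(), 0.0, image.GetPixelID())
    array = sitk.GetArrayFromImage(image).astype(np.float32)

    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # assumed values
    slices = []
    for sl in array:
        # Rescale to 8-bit, enhance contrast, resize, and normalize to [0, 1].
        sl8 = cv2.normalize(sl, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        sl8 = clahe.apply(sl8)
        sl8 = cv2.resize(sl8, (256, 256), interpolation=cv2.INTER_LINEAR)
        slices.append(sl8.astype(np.float32) / 255.0)
    return np.stack(slices)
```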

Manual segmentation was performed on the coronal images using the ITK-SNAP platform (version 3.8.0) [21]. Two neuroradiologists manually drew the regions of interest (ROIs) along the tumor boundary in consensus reading, excluding peripheral blood vessels (mainly the ICA) and invaded structures (mainly the cavernous sinus). All contours were reviewed by two senior neuroradiologists (the corresponding authors, with 20 years of experience in image reading). Twenty randomly selected cases were segmented a second time, after a 10-day interval, to examine intra-observer reproducibility.

U-Net framework architecture and training

The networks were implemented in the PyTorch framework. Data augmentation was applied: for each epoch, images were randomly flipped horizontally and vertically and rotated within a range of −10 to 10 degrees. The standard U-Net structure was used as the baseline model. Based on the model's performance and the imaging characteristics of craniopharyngioma, the network was improved in two respects: first, a non-local attention block was introduced to capture long-range dependencies in the images [22]; second, the loss function was changed to loss_total = 0.5 × loss_Tversky + 0.5 × loss_cross-entropy [23, 24]. The hyperparameters of the Tversky loss were set to α = 0.1 and β = 0.9. The network structure used in this research is shown in Fig. 2.
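A minimal PyTorch sketch of this combined loss is given below; the weighting (0.5/0.5) and the Tversky hyperparameters (α = 0.1, β = 0.9) follow the text, while the smoothing constant and the sigmoid-based binary formulation are assumptions.

```python
# Combined loss: 0.5 * Tversky loss + 0.5 * cross-entropy, for binary masks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TverskyCombinedLoss(nn.Module):
    def __init__(self, alpha=0.1, beta=0.9, smooth=1e-6):
        super().__init__()
        self.alpha, self.beta, self.smooth = alpha, beta, smooth

    def forward(self, logits, target):
        # logits: (N, 1, H, W) raw scores; target: (N, 1, H, W) binary mask.
        prob = torch.sigmoid(logits)
        tp = (prob * target).sum()
        fp = (prob * (1 - target)).sum()
        fn = ((1 - prob) * target).sum()
        tversky = (tp + self.smooth) / (
            tp + self.alpha * fp + self.beta * fn + self.smooth)
        loss_tversky = 1.0 - tversky
        loss_ce = F.binary_cross_entropy_with_logits(logits, target.float())
        return 0.5 * loss_tversky + 0.5 * loss_ce
```

With β > α, the Tversky term penalizes false-negative pixels more heavily than false positives, which matches the stated goal of recovering regions (such as dark cysts) that the classic U-Net tends to miss.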

Fig. 2 The basic structure of the modified U-Net. A non-local block is added at the bottom layer of the network to capture long-range information in the tumor images
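As a sketch of what such a block can look like, the following implements an embedded-Gaussian non-local block in the style of Wang et al. [22]; the channel-reduction factor and the single placement at the bottleneck are assumptions based on the figure description, not the authors' exact implementation.

```python
# Non-local (self-attention) block: every spatial position attends to all others.
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.inter = channels // 2            # assumed channel reduction
        self.theta = nn.Conv2d(channels, self.inter, 1)
        self.phi = nn.Conv2d(channels, self.inter, 1)
        self.g = nn.Conv2d(channels, self.inter, 1)
        self.out = nn.Conv2d(self.inter, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Pairwise affinities between all spatial positions (embedded Gaussian).
        theta = self.theta(x).view(n, self.inter, -1).permute(0, 2, 1)  # (N, HW, C')
        phi = self.phi(x).view(n, self.inter, -1)                       # (N, C', HW)
        attn = torch.softmax(theta @ phi, dim=-1)                       # (N, HW, HW)
        g = self.g(x).view(n, self.inter, -1).permute(0, 2, 1)          # (N, HW, C')
        y = (attn @ g).permute(0, 2, 1).view(n, self.inter, h, w)
        return x + self.out(y)  # residual connection keeps training stable
```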

Figure 3 shows the workflow of the development of the deep CNN model. Cases from center A were randomly divided into a training group (N = 212) and an internal test group (N = 52) at a ratio of 8:2; cases from center B were used as an independent external test group (N = 38) to assess model generalization. Ten cases were randomly selected from the training group and set aside as a tuning set to validate performance at each training epoch. As for the network parameters, the input size was set to 256 × 256 and the optimizer was Adam. The batch size was set to 8, the learning rate to 1e-5, and the number of training epochs to 300.
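A hedged sketch of this training configuration is shown below. ModifiedUNet and CranioDataset are hypothetical placeholder names (the paper does not publish its code); TverskyCombinedLoss refers to the loss sketch above.

```python
# Training configuration sketch: Adam, lr = 1e-5, batch size 8, 300 epochs.
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ModifiedUNet(in_channels=1, out_channels=1).to(device)  # hypothetical
criterion = TverskyCombinedLoss(alpha=0.1, beta=0.9)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# CranioDataset is a placeholder; it would yield pre-processed 256 x 256 slices
# with random horizontal/vertical flips and +/-10 degree rotations applied.
train_loader = DataLoader(CranioDataset("train"), batch_size=8, shuffle=True)
tune_loader = DataLoader(CranioDataset("tune"), batch_size=8, shuffle=False)

for epoch in range(300):
    model.train()
    for images, masks in train_loader:
        images, masks = images.to(device), masks.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
    # Validate on the 10-case tuning set after each epoch (metric code omitted).
```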

Fig. 3 Workflow of the development of the convolutional neural network (CNN) models

Evaluation metrics

The trained model was quantitatively evaluated in both the internal and external test groups by comparing ground-truth masks with predictions at the pixel level. For each case, the Dice similarity coefficient (DSC) was calculated as an indicator of similarity by measuring the overlap of segmentations. The Hausdorff distance (HD; reported as its 95th percentile, 95HD) was calculated as an indicator of how far the contours of the predicted segmentation must move to enclose the contours of the manual ROIs. Classic quantitative metrics were also calculated, including the Jaccard value, true positive rate (TPR), false positive rate (FPR), tumor volume, and computation time. The definitions of the metrics are provided in Supplement Material 1.
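Under the usual pixel-level definitions (the paper's exact formulas are in its Supplement Material 1), the overlap metrics can be computed as in the following sketch; the suggestion of MedPy for 95HD is an assumption, not the authors' tooling.

```python
# Pixel-level overlap metrics under their standard definitions.
import numpy as np

def segmentation_metrics(pred, truth):
    """pred, truth: binary masks (numpy arrays) of identical shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    tn = np.logical_and(~pred, ~truth).sum()
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),  # Dice similarity coefficient
        "Jaccard": tp / (tp + fp + fn),
        "TPR": tp / (tp + fn),               # sensitivity
        "FPR": fp / (fp + tn),
    }

# The 95th-percentile Hausdorff distance (95HD) can be obtained with, e.g.,
# medpy.metric.binary.hd95(pred, truth, voxelspacing=spacing).
```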

All statistical analyses were performed using IBM SPSS Statistics version 25.0 (IBM Corp.). Continuous variables were summarized as mean values, and categorical variables as numbers and percentages. The significance of intra-group differences was evaluated by one-way ANOVA. Bland-Altman plots and intraclass correlation coefficients (ICCs) were used to assess intra-observer reproducibility and the difference between manual and automatic segmentation.
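As an aside, although the study used SPSS for these analyses, the Bland-Altman limits of agreement reduce to a simple computation on the paired volume measurements, sketched here for clarity.

```python
# Bland-Altman bias and 95% limits of agreement for paired measurements.
import numpy as np

def bland_altman_limits(a, b):
    """a, b: paired measurements (e.g., manual vs automatic tumor volumes)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    diff = a - b
    bias = diff.mean()
    loa = 1.96 * diff.std(ddof=1)  # half-width of the 95% limits of agreement
    return bias - loa, bias, bias + loa
```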

Results

Patient characteristics

A total of 302 patients were included in this research. The mean age of all patients was 32.92 years, and the gender ratio was 179:123. In total, 230 patients were pathologically diagnosed with ACP and 72 with PCP. The imaging characteristics of the tumors varied: 34.4% were predominantly solid, 17.9% predominantly cystic, and 47.7% mixed solid-cystic. A lobulated appearance was found in 43.0% of patients. Moreover, a compressed or enclosed ICA was found in 40.1% of cases, and cavernous sinus invasion in 54.0% of cases. Table 1 shows the detailed patient characteristics and histopathological information.

Table 1 Clinical, histopathological, and radiological features of patients

Model performance and intra-group analysis

As shown in Table 2, the classic U-Net model showed limited performance in tumor segmentation, with an average DSC of 0.767, Jaccard of 0.647, TPR of 0.802, FPR of 0.001, and 95HD of 5.726 mm. The modified model achieved improved performance in the internal test (center A), with a DSC of 0.840, Jaccard of 0.734, TPR of 0.820, FPR of 0.000, and 95HD of 3.669 mm. Further ablation results regarding the CNN model modification are provided in Supplement Material 2. Reliability of the automated volumetric assessment was good, with an ICC of 0.987. Sample segmentations by the modified model on the internal test group are shown in Fig. 4. Bland-Altman plots indicated that the automated method performed as well as expert manual segmentation, as presented in Supplement Material 3.

Table 2 CNN Models’ performance in internal test and external test
Fig. 4 Tumor segmentation results of craniopharyngiomas from center A (internal test). In the labeled images, blue arrows indicate the solid component of the tumor and green arrows the dark cyst. A–C A case with predominantly solid composition (DSC 0.865, 95HD 4.975 mm). D–F A case with mixed solid-cystic composition (DSC 0.858, 95HD 2.236 mm). G–I A case with predominantly cystic composition (DSC 0.791, 95HD 6.001 mm)

Four radiological features were evaluated: predominant composition, lobulated shape, compressed or enclosed ICA, and cavernous sinus invasion. One-way ANOVA suggested no significant intra-group differences, indicating that model performance was not statistically associated with the radiological characteristics of the tumors (DSC: p = 0.370, 0.353, 0.809, and 0.283, respectively; 95HD: p = 0.504, 0.508, 0.113, and 0.401, respectively), as provided in Table 3. An overview of the segmentation metrics (DSC and 95HD) obtained with the modified CNN model is summarized in box plots in Supplement Materials 4 and 5.

Table 3 Relationship between modified model performance in segmentation and radiological features of tumors

Model generalization

The proposed model also proved feasible when tested on the external dataset (center B), with a DSC of 0.816, Jaccard of 0.704, TPR of 0.765, FPR of 0.000, and 95HD of 4.201 mm. Sample segmentations from the external group are shown in Fig. 5. The performance of the automated method compared with expert manual segmentation is presented in Supplement Material 6.

Fig. 5 Tumor segmentation results of craniopharyngiomas from center B (independent external test). In the labeled images, blue arrows indicate the solid component of the tumor, yellow arrows the mixed component, and green arrows the dark cyst. A–C A case with predominantly solid composition (DSC 0.809, 95HD 3.606 mm). D–F A case with mixed solid-cystic composition (DSC 0.810, 95HD 4.583 mm). G–I A case with predominantly cystic composition (DSC 0.808, 95HD 6.115 mm)

Comparison of radiologists and models

On average, tumor segmentation with the CNN model required 3 min 27 s of processing (including image preprocessing and loading time) on a laptop equipped with an 8-core Intel Core i7-11800H CPU and an NVIDIA RTX 2060 GPU. In contrast, the average processing time for the radiologists' segmentation was 23 min 34 s. For the 20 selected cases, Bland-Altman plots suggested good intra-observer reproducibility of manual segmentation. They also illustrated that the automatic segmentation was highly consistent with the manual labels, as shown in Fig. 6.

Fig. 6 Bland-Altman plots of comparisons between radiologists and model predictions. A Intra-observer reproducibility of manual segmentation. B First manual segmentation vs. model prediction. C Second manual segmentation vs. model prediction

Discussion

In this study, the trained CNN model showed promising results for segmenting craniopharyngioma, with a DSC of 0.840 on the internal test dataset and 0.816 on the external test dataset. Our model clearly outperformed previous research, demonstrating a new automatic method for craniopharyngioma segmentation [14].

These results highlight the importance of automated tumor segmentation, which has strong potential for clinical implementation as it enhances both diagnostic evaluation and tumor growth monitoring. Automated segmentation may also improve routine clinical imaging assessment by enabling convenient detection, volumetric assessment, and precise therapy planning [25, 26]. Recent advances in radiomics also suggest that automated segmentation can serve as an image pre-processing step for subsequent tumor analysis. Compared with laborious manual delineation, it is highly reproducible and alleviates the inter- and intra-rater variability of clinicians [27].

Many previous studies have attempted to identify methods or algorithms for brain tumor segmentation [28, 29]. Whereas most focused on gliomas and brain metastases, the dramatic variability in the imaging characteristics of craniopharyngioma sets a higher bar for CNN segmentation [30]. Radiologically, PCP occurs mostly in adults and is predominantly mixed solid-cystic or solid, presenting as a spherical suprasellar mass. By comparison, ACP generally appears as a lobulated cystic or predominantly cystic mass in an intrasellar/suprasellar location [31]. Besides, tumor compression and/or invasion of important surrounding structures, such as the hypothalamus and ICA, is also common [1, 32]. Consequently, traditional DNN models have not shown promising performance. In previous research, a DNN model built on the classic U-Net achieved a DSC of 0.662, precision of 0.668, and TPR (sensitivity) of 0.677 [14]. Our study showed similar results: the classic U-Net achieved a DSC of 0.767 and a TPR of 0.802. When analyzing the segmentation results, we found that the classic model performed well in segmenting the solid component and cyst walls of the tumor, but poorly in segmenting the dark cysts within the tumor. This suggested a relatively high false-negative rate and a relatively low false-positive (FP) rate in prediction. We therefore improved the network by adding a non-local attention block to capture long-range dependencies in the tumor images, and by incorporating the Tversky index into the loss function to increase sensitivity to false-negative pixels while remaining conservative toward false positives. The results suggested that the modification was effective: our model clearly outperformed the previous one, with a DSC of 0.840. Generalizability also proved feasible in the external dataset test, with a DSC of 0.816. However, it should be noted that the main purpose of the previous research was to build a two-path deep learning system for predicting the invasiveness of craniopharyngioma; the segmentation network used there was only a sub-net of that system. More deep learning research is required to validate our results and to investigate new network structures for craniopharyngioma segmentation in the future.

One point that deserves further discussion is the difference in model performance across subgroups. The results showed no significant relationship between model performance and the radiological characteristics, including tumor composition, lobulated shape, compressed or enclosed ICA, and cavernous sinus invasion. Therefore, the proposed segmentation algorithm appears feasible for different craniopharyngioma entities. However, it should be mentioned that, compared with homogeneous tumors, the worst results were commonly found in heterogeneous tumors with mixed solid-cystic composition (Supplement Material 7). Although this finding was not statistically significant, it deserves attention because these are the lesions for which neuroradiologists, neurosurgeons, and neuro-oncologists may need the most assistance from a CNN for volumetric detection. In such cases, where the segmentation model performs unsatisfactorily, manual corrections are still needed to ensure precise delineation.

Our research also provides a convenient segmentation tool for subsequent tumor analysis, saving manual effort. However, we used coronal images for DNN segmentation, whereas previous researchers usually used axial images [10, 12, 15, 33]. Coronal images were chosen because, in some cases, it was hard to determine the boundary between the craniopharyngioma and the invaded cavernous sinus on axial images. This is an important consideration for future computer-aided diagnosis (CAD) research that relies on DNN models for lesion segmentation. Thus, additional research may be required to determine whether, or to what extent, radiomics features extracted from coronal images affect the performance of previously reported machine learning classifiers.

There were several limitations in the current research. First, although this was a multi-center, retrospective study, the data size was relatively small; a large-scale, multi-center dataset is required to verify our results. Second, invaded structures were excluded in the current research. For some cases, this should be considered, and manual correction should be performed if clinicians want the entire tumor volume. Third, the U-Net-based CNN segmentation model was studied only on coronal T1CE images; multimodal images were not included. Previous studies suggested that multisequence feature sets are superior to single-sequence feature sets in tumor image analysis [34, 35]. Automated tumor segmentation on multisequence images should be investigated in future research. Fourth, the methodological investigation of CNN network design is still limited. For example, the model might be further improved by adding the non-local attention block at an earlier layer of the network or by adding more than one such block. Our method is a balance between computational cost and model performance; with a more advanced platform, more complicated and efficient CNN models could be trained to perform the segmentation. More computer-science research is required to investigate whether the CNN model can be further improved for segmenting craniopharyngioma.

Conclusions

This study proposed an automated segmentation DNN model based on U-Net for craniopharyngioma on MRI. The proposed model proved feasible for segmentation and offers a fully automated diagnostic aid that could reduce the workload of radiologists, as well as a segmentation algorithm for future CAD research.