Introduction

Ocular adnexal lymphoma (OAL) is a common orbital malignant tumor in adults and accounts for 11–55% of orbital malignancies [1,2,3]. Over the past few decades, there has been a marked rise in the incidence of OAL, adversely affecting the quality of life for affected individuals [3,4,5,6]. Presently, radiotherapy is commended as an important treatment strategy for OAL in clinics, as it can easily pinpoint and shrink the tumor [7,8,9]. Determining the tumor boundaries and volume has essential benefits for radiation therapy planning and treatment efficiency evaluation [10, 11].

Prior studies have demonstrated that magnetic resonance imaging (MRI), which offers both quantitative and qualitative information for diagnosis and guided treatment, represents the optimal non-invasive examination approach for orbital tumors [12,13,14]. MRI plays a significant role in delineating tumor boundaries, not only for initial diagnosis but also for monitoring treatment progress and making necessary adjustments based on the tumor’s characteristics. However, in typical clinical settings, radiologists often rely on subjective assessments of tumor size on MRI images, lacking objective tumor volume data. This is primarily due to the laborious and time-consuming nature of manual segmentation, which may introduce inter-observer and intra-observer errors. Thus, the precise and automated recognition of tumor contours and the measurement of tumor volume on MRI images may serve as a pivotal factor in optimizing the daily clinical workflow for patients with OAL.

Recent research suggests that deep neural networks can process multi-modal data [15, 16]. For MRI analysis, convolutional neural networks have been explored for multi-sequence modeling and have proven effective [17]. However, the complexity of deep learning involves configuring numerous parameters, posing a challenge for non-experts in model training. To address this, recent years have seen the development of fully automated deep learning frameworks and pipelines for doctors and radiologists, such as nnU-net, enabling training without extensive coding [18].

This study aimed to develop a multi-sequence deep neural network, utilizing a fully automated self-configuring training framework, nnU-net, for segmenting OAL and measuring its volume on a multi-sequence MRI dataset.

Methods

Study design and participating datasets

This retrospective study obtained approval from the hospital’s institutional review board (No. TREC2023-KY107). Given its inherently retrospective nature, the requirement for written informed consent was waived.

MRI images of patients with ocular adnexal lymphoma (OAL) were retrospectively collected between January 2015 and March 2022 from our institution. Inclusion criteria consisted of (1) histopathologically confirmed diagnosis of OAL; (2) a comprehensive MRI image dataset, including axial T1- and T2-weighted images (T1, T2) and T1-weighted contrast-enhanced (T1c) sequences; (3) presence of identical OAL lesions on MRI images. Exclusion criteria included: (1) severe motion artifacts impeding accurate interpretation; (2) small lesions occupying fewer than consecutive two slices; and (3) lesions lacking a clear pathological diagnosis. All images were anonymized and exported in Digital Imaging and Communication in Medicine (DICOM) format.

Based on the aforementioned inclusion and exclusion criteria, a total of 147 patients from Center 1 were enrolled in the training set. Furthermore, thirty-three OAL patients from three separate institutions (Centers 2–4) meeting the aforementioned criteria during the same period were selected as the external test set. A data-sharing protocol was established and signed among the participating centers. The study workflow is illustrated in Fig. 1.

Fig. 1
figure 1

Flowchart of the patient selection

MRI acquisition

The training dataset (n = 147) was acquired using 1.5T (Philips Ingenia, n = 2) and 3T scanners [(1) Philips Ingenia, n = 83; (2) GE Discovery MR750, n = 50; (3) Siemens Magnetom Prisma, n = 12)] from various MRI manufacturers. The test dataset (n = 33) was obtained exclusively from 3T scanners: (1) Center 2 (n = 18): GE Discovery MR750 (n = 5), Siemens Somatom Skyra (n = 8), and Siemens Magnetom Verio (n = 5); (2) Center 3 (n = 5): Philips Ingenia; (3) Center 4 (n = 10): United Imaging uMR790 (n = 2), Philips Achieva (n = 1), and Siemens Somatom Skyra (n = 7). Due to variations in scanning protocols across different centers, T2 and T1c images were presented in two forms: with and without fat saturation (FS/nFS). However, each patient had only one form of T2 or T1c images. T1c images were acquired by intravenous injection of a gadolinium-based contrast agent at a dose of 0.1 mmol/L. Detailed information on the MRI scanners and sequences is provided in Table 1, with corresponding protocols outlined in Supplementary Table S1.

Table 1 MRI scanners and sequences of the training set and test set

Manual segmentation

All segmentations of OAL lesions were initially conducted by a junior radiologist (Radiologist 1) with four years of experience in head and neck imaging. The regions of interest (ROIs) along the boundaries of OAL on multi-sequence MRI images (T1, T2_FS or T2_nFS, T1c_FS or T1c_nFS) were outlined slice by slice using ITK-SNAP (version 4.0.0, http://www.itksnap.org/pmwiki/pmwiki.php). Subsequently, a senior radiologist (Radiologist 2) with eight years of experience in head and neck imaging reviewed and validated the initial annotations. Any discrepancies were resolved through discussions or consultation with another experienced expert to achieve consensus. The manually segmented results that obtained consensus were deemed to be the ground truth. Furthermore, to assess intra-observer consistency, we randomly selected 30 cases for re-segmentation by Radiologist 1 after one month.

Data processing

All scans were automatically resampled to a consistent voxel spacing of 0.31 × 0.31 × 3.00 mm3 (the median spacing across scans). Subsequently, a Z-score normalization was applied to standardize the distribution of intensities. We employed a patch-based training strategy adapted from the NIFTYNET framework [19]. Given a median image dimension of 16 × 506 × 511 voxels, the patch size was automatically set to 12 × 384 × 384 voxels.

Online spatial data augmentation encompassed random flipping, rotation, and affine transformations. Intensity augmentations included Gaussian blurring, contrast adjustment, low-resolution simulation, and gamma transformation with default parameters. The calculation of tumor volume involved multiplying the pixel by the product of voxel spacing.

Model development

In the development of MRI segmentation models for OAL, we employed the nn-UNet V2 pipeline (version 2.1) [18]. Considering patients in clinical settings may lack contrast-enhanced MRI images due to contrast agent allergies or other reasons, we developed and trained two distinct models to broaden the algorithm’s applicability. The first, labeled “Model 1,” was trained on T1, T2, and T1c images, while the second, labeled “Model 2,” exclusively used T1 and T2 images. The input channel for both models was set to 1. Both models employed a five-fold cross-validation approach during the training process [20,21,22,23]. Specifically, the training dataset was split into 5 nonoverlapping folds, with each fold used for validation while the remaining four folds were utilized for training. This resulted in the creation of five sub-models, where each fold was utilized once for testing and four times for training. Importantly, the final selection of the model for each variation was determined based on its performance in internal validation across each fold. At the inference stage, the OAL segmentation was achieved by averaging the predictions from the 5 sub-models using the nnUNetv2_ensemble module (Supplementary Fig. S1).

The training configuration employed the “3d_fullres” mode with the default nnU-Net V2 sampling strategy, randomly selecting 250 patches containing the OAL region per epoch. Each model underwent 1000 epochs of training with a batch size of 2. The training process was executed on two GPUs, each equipped with 24 GB of graphics memory, using PyTorch (version 2.0.0, https://pytorch.org/get-started/locally/) and CUDA (version 11.7, https://developer.nvidia.com/cuda-11-7-0-download-archive). The workflow of data processing is shown in Fig. 2.

Fig. 2
figure 2

The processing workflow of nnU-net for OAL segmentation and volumetric assessment

Statistical analysis

Statistical analysis was conducted by Python (version 3.9) and SPSS (version 20.0, SPSS Inc., Chicago, USA). To assess the normality of the data distribution, the Shapiro-Wilk test was employed. Data not conforming to a normal distribution were expressed as the median and quartile ranges, or else were summarized as mean ± standard deviation (SD). The t-test and chi-square test were used to compare the continuous and categorical variables, respectively. The performance of nnU-net was evaluated by the Dice similarity coefficient (DSC, ranging from 0 to 1), sensitivity, and positive prediction value (PPV). Bland–Altman plots were generated using the volumes of the ground truth and predicted tumor, and Lin’s concordance correlation coefficient (CCC) was employed to evaluate the consistency between them. The consistency between nnU-net prediction and manual segmentation was interpreted as almost perfect (> 0.99), substantial (0.95 ~ 0.99), moderate (0.90 ~ 0.95), and poor (< 0.90) [24]. The intra-observer agreement and the consistency of tumor volume obtained through manual segmentation across the three MRI sequences were evaluated using the intraclass correlation coefficient (ICC) with a two-way mixed absolute agreement model. The interpretation of ICC values was as follows: poor (ICC < 0.5), moderate (0.5 ≤ ICC < 0.75), good (0.75 ≤ ICC < 0.9), and excellent (ICC ≥ 0.9) [25]. A p value less than 0.05 is regarded as a statistical difference.

Results

Patient characteristics

The characteristics of the patients included in this study are presented in Table 2. There were no significant differences in terms of sex (p = 0.327) or the side of the lesion location (p = 0.496) between the training set and test set. Patients in the test set were younger than those in the training set (p = 0.030), and a statistically significant difference in the distribution of tumor volume between these two datasets was observed (p = 0.022). The intra-observer agreement demonstrated great consistency with an ICC range of 0.889 to 0.956, indicating the robust stability of ROI delineation. Tumor volume by manual annotation among the three MRI sequences showed excellent consistency, with an ICC of 0.945.

Table 2 Patient characteristics of the training set and test set

Evaluation of performance of training set

Using 5-fold cross validation, nnU-net achieved a DSC in the range of 0.78–0.87 considering all MRI sequences in Model 1. The volumetric difference between nnU-net prediction and the ground truth was 0.02–1.28 cm3, with the CCC of the five sub-models ranging from 0.92 to 0.95. For Model 2, nnU-net obtained DSC ranging from 0.75 to 0.86, with a volumetric variance of 0.08–1.71 cm3. The CCC was 0.84–0.89 for all of the sub-models in Model 2 (Supplementary Fig. S2, Table S2-S3).

Performance of Model 1 for automated segmentation

In the evaluation of the test set in Model 1 (Table 3), nnU-net demonstrated the highest median DSC of 0.84 (0.70, 0.90) for T1c_FS, slightly outperforming T2_FS with 0.82 (0.75, 0.87). The DSC for T1 was 0.80 (0.75, 0.87), essentially equivalent to T1c_nFS with 0.80 (0.77, 0.85). The nnU-net showed the lowest DSC of 0.79 (0.75, 0.89) for T2_nFS (Fig. 3a). For PPV, the nnU-net achieved better results for T1 and T1c_FS, with values of 92.8% (88.4%, 95.0%) and 92.1% (86.9%, 94.8%), respectively. T2_nFS and T1c_nFS yielded similar PPVs of 90.2%, which were superior to the PPV for T2_FS with 86.1% (74.7%, 93.7%) (Fig. 3b). The nnU-net exhibited the highest sensitivity of 82.3% (67.3%, 87.9%) for T2_nFS, surpassing the sensitivity of T2_FS at 81.2% (73.2%, 87.9%). These values exceeded the sensitivities of 78.7% (66.0%, 86.1%), 76.3% (64.8%, 82.4%), and 72.0% (66.0%, 85.0%) for T1c_FS, T1, and T1c_nFS, respectively (Figs. 3c and 4).

Table 3 The segmentation performance of nnu-net in the test set on multi-sequence MRI images
Fig. 3
figure 3

The segmentation performance of nnU-net for OAL. The values of Dice (a), PPV (b), and sensitivity (c) of multi-sequence MRI images in Model 1 and Model 2. Outliers are represented as circles. Bland-Altman plots of volumetric comparisons between nnU-net predictions and the ground truth defined by radiologists in Model 1 (d) and Model 2 (e). The mean differences between nnU-net and manual segmentation are 1.07 cm3 and 2.99 cm3 for all lesions in Model 1 and Model 2, respectively. In the cases of smaller lesions with a volume less than 10 cm3, the mean differences between predicted and ground truth are 0.27 cm3 and 1.39 cm3 in Model 1 and Model 2, respectively. The concordance correlation coefficients (CCCs) of tumor volume between the prediction by nnU-net and the manual segmentation in Model 1 and Model 2 on T2_FS images (f) are 0.96 and 0.93, respectively

Fig. 4
figure 4

Tumor segmentation performance of OAL with multi-sequence MRI images in Model 1. In the labeled images, the green area represents the manual segmentation, and the red area represents the predicted tumor by nnU-net. a-c A 75-year-old male with OAL of the right lacrimal gland. The DSCs of T1 (a), T2_FS (b), and T1c_FS (c) are 0.89, 0.90, and 0.95, respectively. d-f A 46-year-old male was diagnosed with OAL of the right superior outer orbit. The DSCs of T1 (d), T2_FS (e), and T1c_nFS (f) are 0.78, 0.84, and 0.89, respectively

Performance of Model 1 for volumetric measurement

The Bland-Altman plot of nnU-net demonstrated excellent performance, with a mean difference of 1.07 cm3 between the predicted tumor volume and the ground truth value (Table 4; Fig. 3d). For smaller lesions with a volume less than 10 cm3, the mean difference was 0.27 cm3 between nnU-net prediction and the ground truth (Fig. 3d). The CCC of volume between prediction and manual segmentation was 0.90 for the entire test set (Table 4, Supplementary Fig. S3).

Table 4 The concordance correlation coefficients (CCCs) and differences in tumor volume between nnu-net predictions and manual segmentation in model 1 and model 2

For each sequence, nnU-net achieved a minor difference of 0.03 cm3 and 0.22 cm3 between the predicted volume and the manual annotation for T2_nFS and T2_FS, respectively, while the maximal difference was 2.37 cm3 for T1c_nFS. The variation between nnU-net predicted volumes and manual outlines in T1c_FS and T1 was 0.99 cm3 and 1.45 cm3, respectively (Table 4, Supplementary Fig. S4). Among all sequences, T2_FS exhibited the highest CCC of volume between prediction and manual delineation, with a value of 0.96 (Fig. 3f). The CCCs on T1, T2_nFS, T1c_FS, and T1c_nFS images were nearly the same, with values of 0.87, 0.87, 0.86, and 0.87, respectively (Table 4, Supplementary Fig. S3).

Performance of Model 2 for automated segmentation

We conducted further evaluation of the segmentation performance in Model 2, which contained only non-contrast enhanced T1 and T2 images for training. As shown in Table 3, the DSCs were 0.80 (0.59, 0.86), 0.79 (0.74, 0.87), and 0.76 (0.76, 0.90) for T2_FS, T1 and T2_nFS, respectively (Fig. 3a). However, there were 58% (19/33) cases of T1c that could not be automatically segmented in the test set. Furthermore, nnU-net failed to detect 21% (3/14) cases of T1c_nFS from Center 2, as well as 84% (16/19) cases of T1c_FS (3 cases from Center 2, 4 cases from Center 3, and 9 cases from Center 4). Nonetheless, nnU-net exhibited nearly moderate segmentation performance for T1c_nFS in the remaining 14 cases, achieving a DSC of 0.59 (0.10, 0.85), despite these T1c images not being seen in the training dataset (Table 3; Fig. 3a). The PPV and sensitivity were 92.2% (89.8%, 95.1%) and 72.7% (63.0%, 81.8%), 84.5% (76.1%, 93.2%) and 77.6% (61.4%, 83.8%), 91.8% (69.5%, 97.4%) and 84.3% (62.7%, 87.8%) for T1, T2_FS, and T2_nFS, respectively (Table 3; Fig. 3b-c). For T1c_nFS, the PPV and sensitivity were 91.2% (20.1%, 95.5%) and 51.4% (6.1%, 80.2%), respectively (Table 3; Figs. 3b-c and 5).

Fig. 5
figure 5

Tumor segmentation performance of OAL with multi-sequence MRI images in Model 2. In the labeled images, the green area represents the manual segmentation, and the red area represents the predicted tumor by nnU-net. a-c A 54-year-old female with OAL of the right lacrimal gland. The DSCs of T1 (a), T2_FS (b), and T1c_nFS (c) are 0.77, 0.88, and 0.13, respectively. d-f A 70-year-old male was diagnosed with OAL of left space within the muscle cone. The DSCs of T1 (d), T2_FS (e), and T1c_FS (f) are 0.77, 0.79, and 0.77, respectively

Performance of Model 2 for volumetric measurement

The Bland-Altman plot demonstrated that nnU-net had a mean difference of tumor volume with 2.99 cm3 between the prediction and the ground truth in the test set (Table 4; Fig. 3e). The CCC of volume between prediction and ground truth was 0.69 for the entire test set (Table 4, Supplementary Fig. S3).

For each sequence, nnU-net achieved a subtle difference of 0.03 cm3 and 1.24 cm3 on T2_nFS and T2_FS images, respectively (Table 4, Supplementary Fig. S5). The variation in tumor volume between nnU-net prediction and the ground truth was 2.25 cm3, 6.03 cm3, and 4.89 cm3 for T1, T1c_FS, and T1c_nFS images, respectively (Table 4, Supplementary Fig. S5). The tumor volume predicted by nnU-net moderately matched that outlined by radiologists on T2_FS images, with a CCC of 0.93 (Table 4; Fig. 3f). The CCCs were lower on T1, T2_nFS, T1c_FS and T1c_nFS images with values of 0.85, 0.85, 0.25 and 0.44, respectively. (Table 4, Supplementary Fig. S3).

Analysis of outliers

Upon analyzing the outliers, we found that nnU-net in Model 1 produced a false-positive region in a small tumor (≤ 10 cm³) on the T2_FS sequence (Fig. S6). For Model 2, nnU-net’s predictions for OAL regions were smaller than the ground truth in two T1c_FS cases (Fig. S7) and larger in the same cases’ T1c_nFS and T2_FS (Fig. S8) images. Notably, the outlier in Model 1 and the overestimated regions in Model 2 were from the same patient, whose lesion was located in the left eyelid. The other two outliers in Model 2 are due to nnU-net predicting false negative regions on T1_FS images. This suggests that in Model 2, nnU-net may be more prone to false negatives on T1_FS images.

Discussion

In this multi-center study, we have introduced a self-configuring framework based on deep-learning, nnU-net, and have utilized two training models to assess its potential for automatic segmentation and volumetric measurement in OAL. We have further confirmed its generalizability and reproducibility on an external test set with heterogeneous MRI data. The main findings are: (1) Model 1 outperformed Model 2 in segmentation performance across all MRI sequences; (2) nnU-net achieved exceptional performance and reliability in delineating OAL on T2_FS images for both Model 1 and Model 2, with a DSC ranging from 0.80 to 0.82. Additionally, minor discrepancies of 0.22–1.24 cm3 in tumor volume between nnU-net predictions and the ground truth were observed on T2_FS, showing acceptable consistency (CCC: 0.93–0.96). These results indicate that nnU-net displays strong adaptability and versatility in delineating and measuring the volume of OAL on multi-sequence MRI datasets, regardless of the imaging equipment and scanning parameters.

Interestingly, despite the absence of T1c_nFS in the training set of Model 1, nnU-net demonstrated great segmentation performance on the test set, achieving a DSC of 0.80 and PPV of 90.2%. Similarly, Model 2’s training set lacked T1c images, yet nnU-net successfully identified the boundaries of OAL in 79% (11/14) of T1c_nFS cases and even in as few as 16% (3/19) of T1c_FS images in the heterogeneous MRI dataset. These findings suggest nnU-net’s ability to identify the boundary of OAL lesions in previously unseen images, further supporting its intrinsic representation learning capability of deep learning. The incorporation of contrast-enhanced images in the training dataset significantly enhanced the scores of DSC for T1c images, with DSCs increasing from 0.59 to 0.80 for T1c_nFS. However, fewer alterations were noted for T1 and T2 images, indicating that a more extensive distribution of the dataset results in improved segmentation performance.

In Model 1, nnU-net yielded the highest DSC with 0.84 for T1c_FS, yet the volumetric difference (0.99 cm3vs. 0.22 cm3) and CCC (0.86 vs. 0.96) were inferior to T2_FS. Nonetheless, the DSC of T2_FS resembled that of T1c_FS (0.82 vs. 0.84). Although the difference in volume between the nnU-net prediction and manual segmentation on T2_nFS was negligible, at 0.03 cm3 for both models. However, this could be influenced by bias originating from the small number of cases (n = 3). These findings indicate that T2_FS could be a more appropriate imaging modality for the follow-up of OAL patients, offering meaningful and time-saving benefits for evaluating radiotherapy effectiveness and assisting in optimal treatment selection.

With the advancement of deep learning in the field of radiology, several studies have concentrated on using these methods to select MRI radiomics features and then employing them, such as the conventional neural network, to distinguish between OAL and idiopathic orbital inflammation [26]. Nevertheless, there is limited research exploring the potential of deep learning models for automated segmentation in cases of OAL. The nnU-net’s potential performance for segmenting tumor lesions has gained attention in neuroblastoma [27], lung cancer [28], glioma [29], and meningioma [30]. As far as we know, this marks the initial implementation of nnU-net for automatically segmenting OAL lesions using a multi-sequence MR dataset from various institutions.

Radiotherapy stands as one of the primary treatment approaches for OAL patients due to its widely reported effectiveness [7, 31]. However, many OAL patients typically show residual tumors after radiation therapy [32], which increases the risk of recurrence and necessitates follow-up orbital MRI examinations. The ability to automatically segment OAL lesions on MRI images can significantly aid in monitoring these patients during follow-up. In this present study, we introduced a deep learning-based self-configuring nnU-net network for automatic segmentation of OAL lesions and volume measurement using multi-sequence MRI images. The findings indicated a minimal difference between nnU-net predictions and manual delineations on T2_FS images, implying that nnU-net could serve as an efficient tool for evaluating radiotherapy outcomes in OAL.

In our study, we manually delineated the tumor volumes on each MRI sequence to capture the unique contrast and detail each sequence provides. This approach allowed us to determine that the nnU-net performs best on the T2_FS images, with the smallest difference between the predicted tumor volumes and the manual delineations. While we agree that ultimately there is only one tumor volume, our methodology enabled us to understand how well the algorithm adapts to different MRI contrasts and details. Image registration based on the three MRI sequences and then performing a single delineation is indeed a proposed approach. Currently, our approach of delineating in the original space of each MRI sequence has the advantage of avoiding potential errors introduced by registration. We further evaluated the consistency of the tumor volumes across the three MRI sequences, and our results showed high consistency, which supports the validity of our approach.

The current investigation has several limitations that need to be acknowledged. Firstly, we focused on evaluating the automated segmentation capability of nnU-net specifically for OAL, which resulted in the inclusion of only patients diagnosed with OAL. As a result, there is a shortfall in the discussion concerning the differentiation of OAL from other orbital tumors. We are trying to construct an MRI database of orbital tumors in our country (NCT06336499), and future research will expand upon this foundation by analyzing a larger, more diverse dataset encompassing various types of orbital tumors. Secondly, our assessment of the segmentation performance and volume measurement was restricted to a single deep learning model, nnU-net, thereby overlooking the opportunity to utilize and compare alternative models. In future work, we plan to conduct a broader comparative study encompassing multiple methods to provide a more comprehensive analysis for aiding in clinical method selection. Thirdly, the findings of this study are still in the preliminary stage due to the diversity of MRI data considered. To further validate the feasibility of nnU-net in OAL segmentation, it is essential to carry out studies with larger sample sizes and incorporate more diverse MRI data types, such as diffusion-weighted imaging (DWI), apparent diffusion coefficient (ADC) maps, and dynamic contrast-enhanced MRI (DCE-MRI). Finally, due to the relatively small sample size of the external test set, we cannot definitively determine common causes for the outliers in each model. This indicates the need to increase the sample size in future research for a more thorough investigation and summary. Nevertheless, we remain hopeful that the analysis of typical cases identified among outliers will offer valuable insights for future research or for anyone attempting to replicate the proposed method.

Conclusions

In summary, nnU-net offered excellent performance and robustness in the automated segmentation and volumetric evaluation of routine MRI images for OAL patients, particularly on T2_FS images. This suggests its promising potential for assessing therapeutic efficacy and streamlining the follow-up workflow. Besides, nnU-net demonstrates the ability to predict the boundaries of OAL tumors on T1c images trained by unseen contrast-enhanced MRI images, which reduces the requirement for manual delineation and minimizes the duration of model training.