Introduction

Clinical target volume (CTV) and organ-at-risk (OAR) delineation is crucial to successful radiotherapy (RT) treatment planning: accurate segmentation is vital for delivering safe and effective radiation doses to tumor lesions while minimizing damage to surrounding normal tissues. Conventionally, radiation oncologists manually contour CTV and OAR structures slice by slice, a process subject to a degree of inter-observer variability. Over the past years, considerable effort has been devoted to developing deep-learning-based auto-segmentation (DLAS) models for CTV and OAR delineation in radiotherapy. Numerous studies have demonstrated promising benefits of DLAS in both accuracy and efficiency over atlas-based methods [1,2,3,4]. Moreover, DLAS's effectiveness in mitigating dose inconsistencies was notably observed in a simulation study [5] based on RTOG 0617 [6, 7], a multi-institutional clinical trial, highlighting its substantial potential for streamlining and standardizing clinical workflows. Despite these proven advantages, most advancements rely predominantly on in-house DLAS models, and the adoption of DLAS as a routine clinical tool falls far short of expectations. Challenges persist in the clinical adoption of DLAS models, as highlighted by a survey of 246 institutions in which only 26% reported using DLAS in clinical practice [8].

Generalizability remains a primary challenge for current DL models: validated models may perform poorly in clinical scenarios not represented in their training data. For example, deploying top-rated DLAS models from prestigious challenges directly at local institutions resulted in suboptimal performance [9]. Similarly, several studies [10,11,12] reported notable performance declines when applying DL models to external data. Duan et al. [13] evaluated three commercial DLAS products on local cases and likewise observed compromised performance.

The mismatch between training and deployment datasets, known as data shift, contributes significantly to the performance deterioration of DLAS models after clinical deployment [14]. Such shifts may result from variations in clinical practices [15], evolution of delineation guidelines [16], or differences in imaging equipment.

To address these issues, upfront model recalibration or adaptation is recommended to meet institution-specific standards prior to clinical application [15]. Instead of retraining DLAS models from scratch, a viable and efficient solution is to incrementally retrain or locally fine-tune pretrained models using pooled institutional data to incorporate institution-specific protocols. Balagopal et al. [17] proposed a network named PSA-Net that segments the CTV for postoperative prostate cancer and observed a 5% DSC improvement when adapting it to the style of a separate institution.

Fortunately, some commercial vendors offer model retraining services or research tools that allow users to customize their DLAS models with institution-specific data. However, studies on their clinical implementation are quite limited. Previous studies by Duan et al. [13] and Hobbis et al. [18] investigated fine-tuning a commercial DLAS software (INTContour, CarinaAI) for OAR structures in prostate cancer patients. However, experience with localized adaptation, particularly for the CTV or other tumor sites, remains unexplored.

This study addresses this notable gap by detailing the process and outcomes of localized fine-tuning and validation of a popular commercial DLAS product for rectal cancer radiotherapy in a clinical setting. The key novelties and contributions of our work are manifold: (1) it is the first study on DLAS model fine-tuning specifically for rectal cancer radiotherapy; (2) the retrained model builds on a DLAS product widely used in mainland China; (3) it comprehensively covers the adaptation and validation of both CTV and OAR structures; and (4) it offers practical insights into model generalizability in the context of imaging-equipment changes, a scenario that arises frequently in clinical settings and that we ourselves encountered.

Materials & methods

Fig. 1

Conceptual design and implementation workflow of this study in model fine-tuning and performance evaluation (external validation and generalizability evaluation)

The conceptual design and overall workflow are shown in Fig. 1. The work comprises two procedures: model fine-tuning and performance evaluation. The latter includes external validation, which evaluates model performance on patients scanned on the same CT simulator as the training patients but not used during model training, and generalizability evaluation, which assesses model performance on patients scanned on a different CT simulator and likewise not involved in model training.

Data collection

Patient cohort

This retrospective study was approved by the institutional review board (IRB) of Peking University Cancer Hospital. A total of 120 patients were included, all diagnosed with Stage II/III mid-low rectal cancer (i.e., gross tumors located within 10 cm of the anal verge) and treated with chemoradiation at the institutional radiotherapy department. Of the enrolled cohort, 71 were female and 49 were male, and ages ranged from 33 to 86 years (median, 65).

The enrolled patients were grouped into three datasets: a training dataset, an external validation dataset denoted ExVal, and a generalizability evaluation dataset denoted GenEva, as shown in Table 1. The training dataset comprised 60 patients treated between March 2020 and October 2022. The ExVal dataset comprised 30 patients treated between November 2022 and May 2023. At the end of 2022, a Philips RT-specific CT scanner was commissioned into clinical service at our institution, and 30 patients scanned on this simulator between February 2023 and May 2023 were collected as the GenEva dataset to evaluate model generalizability.

Table 1 Description of patient data grouping

Image acquisition

In this study, patients were immobilized with a pelvic thermoplastic mask in the supine position. The training and ExVal datasets were scanned on a Siemens Sensation Open CT simulator, while the GenEva dataset was scanned on a Philips Big-Bore CT simulator. Detailed specifications of the scan parameters are listed. The CT images were imported into the Eclipse Treatment Planning System (Varian Medical Systems, Inc., USA) for physicians to delineate target and OAR structures. The contours and plans were reviewed by an internal panel before being approved for clinical treatment.

In this retrospective study, we retrieved the planning CT images along with the CTV and OAR contours from the treatment planning system in anonymized form under the IRB approval. The treatment-approved CTV and OAR contours were used as the ground-truth (GT) reference in model training and performance evaluation. It is important to emphasize that all contours were based on real-world data; no editing was done to refine them specifically for this study.

DL model and localized fine-tuning

DL kernel network

The DL model for rectal cancer neoadjuvant radiotherapy herein was adopted from the work by Wu et al. [19, 20] and commercialized as RT-Mind-AI (MedMind Technology Co. Ltd., Beijing, China). The backbone network, referred to as DpnUNet, is characterized by integrating dual-path-network (DPN) modules into the UNet structure. The overall architecture of DpnUNet is depicted in Fig. 2.

Fig. 2

Schematic of the kernel DpnUNet network architecture
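To make the dual-path idea concrete, the following is a minimal PyTorch sketch of a DPN-style block. The actual DpnUNet layer configuration is not detailed in the source publications, so the class name, channel split, and layer choices below are illustrative assumptions rather than the vendor's implementation.

```python
# Hypothetical sketch of a dual-path block; NOT the vendor's DpnUNet code.
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Combines a residual (additive) path and a dense (concatenative) path,
    the defining trait of DPN modules."""
    def __init__(self, in_ch: int, res_ch: int, dense_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, res_ch + dense_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(res_ch + dense_ch),
            nn.ReLU(inplace=True),
        )
        self.proj = nn.Conv2d(in_ch, res_ch, kernel_size=1)  # match residual width
        self.res_ch = res_ch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)
        res, dense = out[:, : self.res_ch], out[:, self.res_ch :]
        residual = self.proj(x) + res                  # additive (ResNet-like) path
        return torch.cat([residual, dense], dim=1)     # concatenative (DenseNet-like) path
```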

Localized model fine-tuning

The model was pretrained on data from 122 patients at a single institution [19]. We further trained it with the enrolled training data (60 patients) to adapt it to the institutional contouring protocol. The contours of interest were the CTV, bladder, femoral heads, and small intestine. A class-weighted cross-entropy loss was used to account for overall accuracy across both the CTV and the OARs. Localized model fine-tuning was performed on a single-GPU workstation (Nvidia GeForce RTX 2080 Ti) using 5-fold cross-validation (48 training vs. 12 validation patients per fold). The optimizer was Adam with a batch size of 4. The initial learning rate was 0.0001 and decayed exponentially by a factor of 0.9 each epoch. Training ran for 60 epochs, and the model with the lowest cross-validation loss was selected as the final output.
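A minimal sketch of this fine-tuning recipe in PyTorch is given below. Since the vendor's training code is proprietary, `build_dpn_unet`, `RectalCTDataset`, the checkpoint path, and the class-weight values are hypothetical placeholders; only the optimizer, learning-rate schedule, batch size, and epoch count follow the text above.

```python
# Fine-tuning sketch matching the stated hyperparameters; placeholders are hypothetical.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

NUM_CLASSES = 5  # background, CTV, bladder, femoral heads, small intestine

model = build_dpn_unet(num_classes=NUM_CLASSES)             # hypothetical constructor
model.load_state_dict(torch.load("vendor_pretrained.pth"))  # start from vendor weights

class_weights = torch.tensor([0.1, 1.0, 1.0, 1.0, 1.5])     # illustrative values only
criterion = nn.CrossEntropyLoss(weight=class_weights)       # class-weighted cross-entropy

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # decay 0.9/epoch

train_loader = DataLoader(RectalCTDataset("fold_0/train"), batch_size=4, shuffle=True)

for epoch in range(60):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
    # The cross-validation loss is evaluated here; the checkpoint with the
    # lowest value across the 5 folds is kept as the final model.
```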

Performance evaluation

External validation and generalizability evaluation

This study used two datasets (ExVal and GenEva), with 30 cases each, to evaluate model performance in two respects. The data in ExVal were acquired on the same CT simulator as the training data and were therefore used for external validation. The data in GenEva were acquired on a different CT simulator and were used to evaluate model generalization in the context of imaging-equipment changes.

Quantitative metrics

Two sets of deep-learning-predicted contours were generated for all 60 testing cases, using the vendor-provided pretrained model (VPM) and the localized fine-tuned model (LFT), respectively. We used several widely adopted metrics to quantify segmentation performance, including the Dice Similarity Coefficient (DSC), the 95th percentile of the Hausdorff Distance (95HD), sensitivity, and specificity, with the clinically approved CTV and OAR contours as GT.

DSC, the most widely used measure in medical image segmentation, provides an effective assessment of similarity and is defined as:

$$\mathrm{DSC}(D,G)=\frac{2\left|D\cap G\right|}{\left|D\right|+\left|G\right|}$$
(1)

where $D$ and $G$ represent the DLAS-predicted and GT contours respectively, and $|D\cap G|$ represents the intersected volume between $D$ and $G$.
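As an illustration, Eq. (1) can be computed on binary masks in a few lines of NumPy; this is a sketch, not the 3D Slicer pipeline actually used for evaluation in this study.

```python
# Minimal NumPy illustration of Eq. (1); the study computed metrics in 3D Slicer.
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray) -> float:
    """DSC between two binary masks (any dimensionality)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum())
```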

The 95HD is a routinely used spatial distance-based metric that measures the distance between the DLAS-predicted and GT contours; it is defined as

$$95\mathrm{HD}(D,G)=\mathrm{percentile}\left(h(D,G)\cup h(G,D),\ 95\mathrm{th}\right)$$
(2)
$$h(D,G)=\left\{\min_{g_j\in G}\|d_i-g_j\|\;:\;d_i\in D\right\}$$
(3)

where $h(D,G)$ is the set of distances from each point of $D$ to its nearest point in $G$, and $\|\cdot\|$ denotes the Euclidean norm.
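A sketch of this computation on voxelized binary masks is shown below, using SciPy distance transforms over the mask surfaces; it mirrors common open-source implementations rather than the exact 3D Slicer routine used in this study.

```python
# Sketch of Eqs. (2)-(3) on binary masks; not the exact 3D Slicer routine.
import numpy as np
from scipy import ndimage

def hd95(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    # Surface voxels: mask minus its erosion.
    pred_surf = pred ^ ndimage.binary_erosion(pred)
    gt_surf = gt ^ ndimage.binary_erosion(gt)
    # Distance maps to the nearest surface voxel of the *other* contour,
    # honoring anisotropic voxel spacing (mm).
    dist_to_gt = ndimage.distance_transform_edt(~gt_surf, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~pred_surf, sampling=spacing)
    # h(D, G) and h(G, D): per-point nearest-neighbor distances.
    d_to_g = dist_to_gt[pred_surf]
    g_to_d = dist_to_pred[gt_surf]
    return float(np.percentile(np.concatenate([d_to_g, g_to_d]), 95))
```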

Sensitivity and specificity are popular metrics for the evaluation of medical image segmentation performance [14, 15], which are defined as

$$\mathrm{Sensitivity}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(4)
$$\mathrm{Specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}$$
(5)

where TP, FP, TN, and FN denote the numbers of true-positive, false-positive, true-negative, and false-negative pixels, respectively, for the DLAS-predicted CTV and OAR contours, reflecting the number of pixels classified correctly or incorrectly with respect to the GT [21].
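For completeness, a NumPy illustration of Eqs. (4)-(5) on binary masks (again a sketch, not the evaluation code actually used):

```python
# NumPy illustration of Eqs. (4)-(5); a sketch, not the study's evaluation code.
import numpy as np

def sensitivity_specificity(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # correctly labeled foreground
    fn = np.logical_and(~pred, gt).sum()   # missed foreground
    tn = np.logical_and(~pred, ~gt).sum()  # correctly labeled background
    fp = np.logical_and(pred, ~gt).sum()   # spurious foreground
    return tp / (tp + fn), tn / (tn + fp)
```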

In addition, the CTV volume was measured. The DSC, 95HD, sensitivity, specificity, and CTV-volume values for each testing case were calculated in the 3D Slicer software (version 5.4.0) [16].

Statistical analysis

The mean and standard deviation (SD) values were calculated for each metric. Within each testing dataset, the paired Wilcoxon signed-rank test was used to compare the performance of VPM and LFT. Statistical analysis was performed in OriginPro (version 2021a, OriginLab, USA), with the significance level set at 0.05.
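The analysis was performed in OriginPro; an equivalent paired test in Python is shown below, where the per-case DSC arrays are synthetic stand-ins for the 30 scores of each model on one dataset.

```python
# Equivalent paired test in SciPy; the DSC arrays here are synthetic stand-ins.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
dsc_vpm = rng.normal(0.79, 0.03, 30)  # stand-in per-case DSC values (VPM)
dsc_lft = rng.normal(0.88, 0.03, 30)  # stand-in per-case DSC values (LFT)

stat, p_value = wilcoxon(dsc_vpm, dsc_lft)  # paired signed-rank test
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p_value:.4f}")
# p < 0.05 indicates a statistically significant paired difference.
```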

Results

Fig. 3

Representative patient cases with CTV and OAR (bladder, femoral head, and small intestine) contours. The upper two rows are cases from ExVal, the lower two rows from GenEva. (GT: red line; VPM: blue line; LFT: green line)

Visualization of representative cases

Figure 3 shows representative patient cases selected from the ExVal and GenEva datasets, acquired on the two different imaging devices listed in Table 1, for qualitative visual illustration. Images in ExVal (Fig. 3(a-f)) exhibit finer texture patterns, while images in GenEva (Fig. 3(g-l)) appear smoother with reduced noise. The CTV and OAR contours predicted by both the VPM and LFT models are generally consistent with the GT, especially for the bladder and femoral heads, where the VPM, LFT, and GT contours overlap closely. For the small intestine, although some deviations are visible in Fig. 3(a), (d), and (j), the majority of intestinal loops predicted by VPM and LFT conform closely to the GT.

Quantitative assessment of CTV

Figure 3 also shows that, although both VPM and LFT contours are generally consistent with the GT, the CTV volumes predicted by VPM appear larger than either the GT or LFT volumes. This over-contouring effect is observed in both the ExVal and GenEva datasets and is further confirmed by the case-by-case comparison in Fig. 4.

The CTV volumes (mean ± SD) predicted by VPM and LFT in ExVal are 720.304 ± 90.789 cm³ and 606.443 ± 90.677 cm³ respectively, against a GT benchmark of 617.879 ± 110.506 cm³ (p-value < 0.05); the corresponding relative errors with respect to GT are 17.921 ± 11.663% and −1.281 ± 5.655%. The CTV volumes predicted by VPM and LFT in GenEva are 735.997 ± 109.678 cm³ and 610.804 ± 74.917 cm³ respectively, against a GT benchmark of 630.320 ± 82.261 cm³ (p-value < 0.05); the corresponding relative errors with respect to GT are 16.923 ± 9.661% and −2.798 ± 6.228%.

Figure 5 shows the distributions of DSC, sensitivity, specificity, and 95HD values for the DLAS-predicted CTV contours. The LFT model is superior to VPM in DSC, specificity, and 95HD on both the ExVal and GenEva datasets. Specifically, the mean DSC improves by 11.406% in ExVal and 9.340% in GenEva with statistical significance (p-value < 0.01). The mean specificity improves by 2.497% in ExVal and 2.591% in GenEva (p-value < 0.01), and the mean 95HD decreases by 46.866% and 42.120% respectively (p-value < 0.01). In contrast, the sensitivity distributions of the VPM- and LFT-predicted CTV contours are very close in both ExVal and GenEva (p-value > 0.05).

Fig. 4

Profiles and relative errors of VPM- and LFT-predicted CTV volumes in comparison with GT over (a) ExVal and (b) GenEva. (GT: black line; VPM: red line/bar; LFT: blue line/bar)

Fig. 5

Distribution comparison and statistical analysis of CTV contours predicted by the VPM and LFT models in (a) ExVal and (b) GenEva, compared with GT, in terms of DSC, sensitivity, specificity, and 95HD

Quantitative assessment of OAR

Table 2 summarizes the statistical analysis of DSC, 95HD, sensitivity, and specificity for the OAR structures. For the bladder and femoral heads, both the VPM and LFT models exhibit sufficient and comparable performance. Although some p-values are < 0.05, the differences in metric values between LFT and VPM are negligible, as are those between ExVal and GenEva. This is consistent with the overlapping contours in Fig. 3 and demonstrates that both models are adequately accurate in segmenting the bladder and femoral heads. For the small intestine, LFT generally outperformed VPM (p-value < 0.05 in DSC-GenEva, 95HD-GenEva, sensitivity-ExVal, and sensitivity-GenEva) except in specificity, where the difference is negligible.

Table 2 Summary and statistical analysis of OAR structures (bladder, femoral heads, and small intestine) predicted by VPM and LFT in comparison with GT in metrics of DSC, 95HD, sensitivity, and specificity

Discussion

In this study, we detailed our process and outcomes from localized fine-tuning and validation of a popular commercial DLAS product, RT-Mind-AI, specifically targeting rectal cancer radiotherapy. This work marks a significant stride in addressing not only the imperative need to enhance DLAS model performance in real-world clinical settings but also the generalizability of RT-Mind-AI in the context of imaging-equipment changes.

For localized model fine-tuning, we used real-world patient data that had been approved for clinical treatment, with minimal data preprocessing. The training cohort comprised 60 patients, much smaller than the cohorts in pertinent studies of in-house DLAS model development (135 patients in Wu Y et al. [19], 218 in Men K et al. [22], 136 in Song Y et al. [23], and 100 in Larsson R et al. [24]) and in the incremental training of a commercial DLAS model (100 patients) [13]. The use of a small volume of real-world patient data underscores the practicality and efficiency of on-site data preparation, which facilitates localized fine-tuning without extensive data collection or labor-intensive preprocessing.

The CTV evaluation results demonstrate that the fine-tuned RT-Mind-AI performs comparably to in-house models (mean DSC = 0.879 in ExVal / 0.874 in GenEva, versus 0.90 in Wu Y et al. [19], 0.877 in Men K et al. [22], 0.88 in Song Y et al. [23], and 0.90 in Larsson R et al. [24]). Notably, the LFT model outperformed the VPM model, especially in the accuracy of CTV volume estimation and the reduction of over-contouring tendencies. The substantial improvements in metrics such as DSC and 95HD highlight the effectiveness of adapting models to institution-specific clinical standards.

For the bladder and femoral heads, the segmentation performance of both VPM and LFT was adequately accurate, as indicated by the high metric values. This finding indicates that the vendor-provided model is well trained and that these model components may not require further retraining, potentially easing implementation in clinical settings; this is consistent with the study by Hobbis D et al. [18]. The result is mainly attributable to the distinct anatomical characteristics of the bladder and femoral heads and the lower inter-observer variability among physicians. Accurate segmentation of the small intestine poses a significantly greater challenge. Although the LFT model showed significant improvements in DSC, 95HD, and sensitivity (p-value < 0.05), these metrics still fall short of the benchmarks established for bladder and femoral head segmentation. This discrepancy is primarily due to two factors. First, the small intestine's anatomy is inherently complex, with considerable variability in the position and filling of intestinal loops, which makes distinguishing loops from adjacent tissues particularly difficult. Second, inter-observer variation in contouring the small intestine is significant [13, 18], making a strictly consistent standard hard to achieve. This also highlights the importance of enhancing DLAS accuracy for complex anatomical structures.

Equipment changes occur occasionally in clinical practice, and a notable aspect of this study is its assessment of the impact of different CT simulators on model performance. The robustness of the commercial DLAS product to changes in imaging equipment is a promising finding for institutions undergoing technological upgrades or using multiple imaging systems. Such robustness is crucial for the widespread adoption of DLAS technologies, ensuring consistent performance across varied clinical environments.

Despite these promising results, we acknowledge several limitations. First, this study is based on a specific commercial DLAS product (RT-Mind-AI) and two CT simulators; some findings may not transfer directly to other DLAS products or simulators, and their generalizability needs further investigation. Second, the cohort size for model fine-tuning was chosen empirically based on our previous work in Geng J et al. [25]; a smaller cohort might suffice, and future studies should explore the optimal cohort size for effective and efficient fine-tuning. Third, this study focused on rectal cancer radiotherapy; next-step efforts will be directed toward fine-tuning DLAS models for other tumor sites to facilitate clinical application and further validation.

Conclusion

We detailed the process and outcomes of localized fine-tuning and validation of a popular commercial DLAS product (RT-Mind-AI) specifically for rectal cancer radiotherapy in a real-world clinical setting. The comprehensive validation underscores the necessity and potential benefits of institution-specific DLAS model adaptation and continued model updating, indicating that localized fine-tuning for varied clinical settings is crucial to realizing the full potential of DLAS in enhancing the precision and effectiveness of radiotherapy treatments. This work also demonstrates that the RT-Mind-AI software is highly robust to imaging-equipment changes and exhibits superior accuracy once locally fine-tuned.