Introduction

External beam radiation therapy (EBRT) and brachytherapy (BT) are both the critical treatment modalities for cervical cancer with early and locally advanced stages. The delineation of clinical target volumes (CTVs) and organs at risk (OARs) is the first step and important task that may affect the clinical outcomes in cervical cancer radiotherapy1,2,3. Indeed, manual contouring of these planning structures is such a labor-intensive part of the workflow and maybe inaccurate4,5,6. The workload pressures and most errors could be avoided if a rapid and accurate auto-segmented methods were available. With the development of machine learning (ML), particularly the advent of deep learning (DL) represented by convolutional neural networks (CNNs), auto-segmented tasks are thought to provide excellent assistance and superior results7,8,9,10.

The U-Net model used for auto-segmentation of OARs in cervical cancer obtained highly consistency with those of expert contouring which was assessed by radiation oncologists11. The DpnUNet model applied to CTV segmentation in cervical cancer achieved an acceptable clinical results with the mean dice similarity coefficient (DSC) of 0.8612. As a novel technique, however, there are still some limitations with uncommon clinical practice13. In fact, DL based methods are usually to generate the expected outcomes because the tested datasets are typically related to the training and validating datasets. Therefore, the quality and reliability of DL models should be further verified using an independent cohort in the process of cervical cancer radiotherapy.

The geometric metrics and subjective assessment are always chosen as the standard analysis indicators of contour comparison14,15,16. A few studies have reported the relationship between auto-segmentation and dosimetry in head and neck which proved more accurate auto-segmentation carried out smaller dosimetric differences17. However,whether or not the differences of DL based auto-segmentation would affect the clinical relevance of cervical cancer is rarely mentioned.

The purpose of this study used geometric and dosimetric metrics to evaluate the accuracy of DL based auto-segmentation and focus on the question of whether DL based approach could generate precise dosimetric endpoints compared to standard manual contours in a real-world independent cohort of cervical cancer patients.

Methods and materials

Experiments

The work flowchart of this study is illustrated in Fig. 1. Briefly, the evaluation was divided into 3 sections. Section 1, the accuracy of DL based auto-segmentation was assessed using geometric metrics. Section 2, the dosimetric comparison was performed between standard manual contours and auto-segmented contours form original EBRT plans. Section 3, the correlation analysis was explored followed by geometric and dosimetric metrics.

Figure 1
figure 1

The flowchart of manual and DL based auto-segmentation evaluation experiment. Original EBRT plans were designed and optimized based on the standard manual contours and the auto-segmentation structures were transmitted to original EBRT plans for dosimetric evaluation.

Clinical datasets

The independent cohort of this study was consisted of 75 cervical cancer patients who received EBRT at our department between August 2021 and December 2021. All patients were diagnosed with FIGO stage IA2-IVB and histology G1-G3, treated with prescription dose of 45 Gy-50.4 Gy (1.8 Gy/fraction). The average age ± standard deviation of these patients was 55.60 ± 13.35 years old. For each patient, the contrast agent was required to intravenously inject before computed tomography (CT) scanning, meanwhile, the CT images were covered from the lower lumbar spine to the whole pelvic cavity and reconstructed with 512 × 512 matrix size and 5 mm slice thickness using a Philips Brilliance Big Bore CT scanner system (Philips Healthcare,Best, the Netherlands).

CTVs delineation of 75 patients were defined manually by junior radiation oncologists including entire cervix, uterus, bilateral parametria, upper half of vagina, and lymph nodes following the guideline of Radiation Therapy Oncology Group (RTOG) protocol18. Relevant OARs included for EBRT plans were spinal Cord, left kidney (Kidney L), right kidney (Kidney R),bladder, left femoral Head (Femoral Head L), right femoral Head (Femoral Head R), pelvic bone, rectum, and small intestine. The EBRT planning structures were performed on the Pinnacle Treatment Planning System (Pinnacle, V9.16.2, Philips Corp, Fitchburg, WI, USA). All of the manual contours were reviewed and approved by senior radiation oncologists specialized in cervical cancer to generate the standard delineation.

Deep learning based auto-segmentation

We introduced a deep learning model based on CNN19 to segment the CTVs and OARs for cervical cancer patients. As shown in Fig. 2, the network consists of three encoders and three decoders. The InProj was used to extract the features of medical image, and the OutProj performed the pixel-wise classification. Down-sampling and up-sampling were performed by each encoder and each decoder. All the weight filters of the 2D convolution (Conv2d) had a window size of 3 × 3 and a stride of 1. Batch Normalization (BN) was a process by which biased output distribution and used for the feature normalization. For this network, rectified linear unit (ReLu) followed by every Conv2d was used as the feature activation function. Max Pooling could reduce the number of parameters and computation in the network. ConvTranspose2d was opposite of that used for Conv2d, whereby pixel size is increased using a 3 × 3 pixels filter. The skip connection was used to concatenate the encoder and decoder of the same level to facilitate the fusion of multi-layer features. We used some general methods for data enhancement (cut and flip) to obtain a superior model. This model is an end-to-end segmentation architecture that can predict pixel class labels in CT images.

Figure 2
figure 2

Architecture of DL based automatic segmentation network.

A total of 300 retrospective clinical CT scans diagnosed with cervical cancer who received radiotherapy were enrolled for training and validating this model, and the datasets were come form multiple cancer centers in order to verify the robustness of CNN model. The cross-entropy loss was selected as the loss function, and all of the training computations were performed using Intel-Core i7 processor with a graphics card.

Geometric metrics

The geometric accuracy of contours was compared using the Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (HD) and Jaccard Coefficient (JC). DSC and JC describe the relative overlap between segmentation A and B. HD is used to quantify the 3D distance between two segmentation surfaces. The 95%HD is the distance that indicates the largest surface-to-surface separation among the closest 95% of surface points.The definitions are as follows:

$$\begin{aligned} & DSC = 2\left| {A \cap B} \right|/(\left| A \right| + \left| B \right|) \\ & HD = \max (h(A,B),h(B,A)),\;h(A,B) = \mathop {\max }\limits_{b \in B} (\mathop {\min }\limits_{a \in A} \left\| {a - b} \right\|) \\ & JC = \left| {A \cap B} \right|/\left| {A \cup B} \right| \\ \end{aligned}$$

For the complete overlap, the value of HD is 0, and the values of DSC and JC are 1. For the incomplete overlap, the value of HD is large, and the values of DSC and JC are close to 0. In order to verify the recognition performance of DL based model in boundary of segmentation,no cropping of the superior or inferior borders for contours was performed for this study particularly in spinal cord, femoral head and pelvic bone.

Dosimetric metrics

The EBRT plans were calculated and optimized with these standard manual contours by using Pinnacle Treatment Planning System. Table 1 is presented the constraints and dosimetric metrics. For CTV, we mainly focused on Dmean and V100%. For serial organs and parallel organs, we mainly focused on Dmax and Dmean, respectively. Dmean and Dmax are defined as the average dose and maximum dose of structures receiving. V100 is defined as the volume of CTV receiving 100% prescription dose.

Table 1 The constraints and dosimetric metrics for EBRT planning structures.

Statistical analysis

IBM SPSS Statistics software (version 19.0, IBM Inc., Armonk, NY, USA) and Python software (version 3.6.5,Anaconda Inc.) were used for statistical analysis,where mean ± standard deviation (SD) was used for presenting and summarizing the results. For the test of agreement between manual and DL based methods, the Bland–Altman test was used to calculate the consistent limits for each EBRT planning structures. P > 0.05 means agreement of two segmented methods. For the difference, the Wilcoxon’s paired nonparametric signed-rank test was performed to compare the variables. P < 0.05 indicates that the difference is statistically significant. The correlations between geometric metrics and dosimetric difference were evaluated with Spearman’s correlation analysis.

Results

The geometric accuracy of the DL based auto-segmentation for EBRT planning structures is presented in Fig. 3. Automatic delineation produced the results for CTV with average DSC value of 0.77 ± 0.03, 95%HD of 5.81 ± 1.83 mm and JC of 0.62 ± 0.04. The right kidney, left kidney, bladder, right femoral head and left femoral head were generated the similar geometric performance between two methods with average DSC value of 0.88–0.93, 95%HD of 1.03–2.96 mm and JC of 0.78–0.88. The quality of the automatically generated pelvic bone was barely satisfactory with average DSC value of 0.65 ± 0.05,95%HD of 18.14 ± 9.77 mm and JC of 0.49 ± 0.05.

Figure 3
figure 3

DSC, 95%HD and JC box plot from comparing DL based auto-segmented contours to standard contours for CTV and OARs.

The Bland–Altman test was not calculated for CTV because of abnormal distribution. The Fig. 4 showed 95% consistent limits for all of the OARs between two methods. The test of agreement for DL based auto-segmentation method can be evaluated according to the number of the points outside the 95% consistent limits (brown horizontal dotted lines) and the maximum difference within the consistent limits (distance between blue and green horizontal lines). From the Bland–Altman plot, right and left kidney, bladder, right and left femoral head showed no significant inconsistency (P > 0.05) between two segmented methods.

Figure 4
figure 4

Bland–Altman plot for OARs. The brown horizontal dotted lines represents the upper and lower bounds of 95% limit agreement; the blue horizontal solid lines represent the average of the differences; the green horizontal dotted lines represent the location with difference equal to 0.

Examples of delineations and dose distributions from manual and DL based auto-segmented methods are illustrated in Fig. 5. The comparisons of dosimetric parameters between two methods using Wilcoxon’s paired nonparametric signed-rank test are presented in Table 2. No significant dosimetric differences were found except for CTV, spinal cord and pelvic bone (P < 0.001). For all of the OARs, both the manual and automatic delineation were able to meet the clinical dose constraints. However, the dose-volume index (DVI) of CTV was hard to meet the clinical requirements with V100 (%) of 94.27 ± 1.86 (D99% > Prescription).

Figure 5
figure 5

Results of delineations and dose distributions for CTV and OARs in CT slices. The green lines represent manual contours approved by the senior physician; the blue lines represent DL based contours; colourwash represent dose distributions with the range of 95% prescription to 100% prescription.

Table 2 Dosimetric metrics of manual and DL based auto-segmented delineations in the original clinical treatment plans.

Table 3 shows the results of Spearman’s correlation analysis between three geometric metrics and dosimetric differences (Δdose). No structures showed strong correlation except for the ΔDmean of pelvic bone and its 95%HD (R = 0.843,P < 0.001), and the correlation heatmap was used to further prove the weak link between all of the dosimetric difference and its geometric metrics in the remaining EBRT planning structures (Fig. 6).

Table 3 The correlation between geometric metrics and dosimetric differences.
Figure 6
figure 6

The heatmap of Spearman’s correlation analysis between all the geometric metrics and dosimetric differences for EBRT planning structures.

Discussion

Modern radiotherapy has become a systematized and programmed process resulting in a nearly reliance on human–machine interactions with the development of mechanical technology and computer science. Meanwhile, the growth of Artificial intelligence (AI) has the potential possibilities to change the way of radiation oncology because of its recognition and analysis in complex medical data. Various studies have investigated the advantages of AI based method during each stage of radiotherapy,such as AI platforms might improve the efficiency and quality of automated segmentation20,21,22, predict and optimize the radiation dose of the targets23,24, provide the clinical decision of radiation toxicities25, and build the robust models to manage the treatment outcomes26,27. However, these studies were always fragmented and we should establish the complete radiotherapy workflow using AI technology with validating every step for the real-world cohort.

Delineations of CTV and OARs are an essential step for precise delivery28 which would affect the overall survival in the radiotherapy treatment planning process,even in standardizing clinical trials29. However, the manual process always suffers from inter- and intra-observer variability in structure delineations. Automatic contouring of structures is highly desired in radiotherapy because of the minimized variability. The purpose of this study is to compare the performance of DL based autosegmentation against standard contours from senior radiation oncologists on independent datasets.

As for geometric metrics, we observed that DL based model generated structures with average DSC of 0.77 for the CTV, 0.74 the spinal cord, 0.93 for the left and right kidney, 0.91 for the bladder, 0.88 for the left and right femoral head, 0.65 for the pelvic bone, and 0.71 for the rectum, respectively. The comparison of DSC and HD for other DL based model is presented in Table 4. Overall, the geometric similarity of kidney, bladder and femoral head were equivalent to or better than other published literature. Nevertheless, the DSC values of CTV, pelvic bone and rectum from our model showed poor results compared with other DL based models. Generally, the accuracy maybe decrease when using the independent testing datasets. Rhee et al.30 reported the DSC values of automatic CTV segmentation was 0.86 using internal test CT scans and the clinical acceptance decreased to 80% for external test CT scans. However, the mean 95%HD value of CTV used our model was 5.81 mm, which was comparable to DpnUNet model12 and superior than 3D CNN and 3D V-Net models31,32. These findings seemed to indicate that the discrepancy between these DL based models might caused by the difference of training datasets, and our DL based model showed a relative strong robustness for most EBRT planning structures enrolled the independent cohort. In this study, the boundaries of the spinal cord in cervical cancer were not clear (the resolution of soft tissue in CT images was deficient and we didn’t modify the superior or inferior borders), the delineations generated by DL based model were always been overestimated or underestimated compared with standard contours. The small intestine was absent to assess because the contours of the small intestine in CT images was different from the location during EBRT process. Indeed, small intestine is an important organ for dosimetric evaluation especially in the EBRT combined with high-dose rate BT for cervical cancer, and the DL based performance of small intestine would be included in our further study with “dose prediction”.

Table 4 Summary of DL based auto-segmentation results for CTV and OARs in cervical cancer from other published literature.

The quality of auto segmented contours cannot be determined only by geometric values which was reported by Kaderka33, and few studies have focused on dosimetric impact on the automatic CTV and OARs delineations for cervical cancer radiotherapy. For CTV dosimetric metrics, the most significant dose difference was V100 with 94.27% for DL based model and 99.98% for standard contour (P < 0.001), and the original dose distribution showed poor results in automatic CTV segmentation (Fig. 5). These data indicated the final CTV segmentation generated by DL based model remains necessary to be reviewed by senior radiation oncologists rather than geometric values. For the test of agreement, the DL based segmented method has been proven to obtain dose consistency for kidney, bladder and femoral head compared with expert contouring. For dosimetric metrics of OARs, no significant differences were found except for spinal cord and pelvic bone (P < 0.001). Point dose such as Dmax in spinal cord was sensitive to the range of the segmentation in radiotherapy which means the performance of identifying boundaries in DL based model should be improved.

The heatmap of Spearman’s correlation analysis showed that there was no clear strong relationship between geometric metrics and dosimetric differences for most structures (Fig. 5). The only strong correlation was shown for the mean dose of pelvic bone and its 95%HD (R = 0.843, P < 0.001). This phenomenon cloud be explained that the dosimetric differences were generated by random noise because of the similar delineation between two methods such as kidney and bladder. Otherwise, the weak link was caused by the segmented reproducibility of DL based model such as CTV and femoral head. However, significant correlation between geometric metrics and dosimetric differences could still be observed due to the inaccurate delineation such as pelvic bone.

In this work, we investigated the performance of DLbased auto segmentation in cervical cancer for patients treated with EBRT. Indeed, as an assisted and efficient tool, automatic approach would relieve physicians from the labor-intensive tasks as well as increase the accuracy and reproducibility of structure delineation.Instead of incorporating a prior knowledge into the process of segmentation that describe as atlas-based segmentation (ABS)34, DL based auto segmentation explores the informative representations in a self-learning algorithm and utilizes hierarchical layers of extracted abstraction to accomplish high-level tasks efficiently. Furthermore, in spite of the superior performance of DL based methods on algorithm, the studies are confined mostly to the field of segmentation rather than to establish the workflow solution which have been mentioned above.In other words, DL based methods could play an important role in the complete process of radiotherapy such as “dose prediction”, “toxic prediction” , “efficacy prediction”, etc., segmentation/ “delineation prediction” is only a part of this workflow. Certainly, this work was focus on the question of segmented accuracy which would be a basic part implemented in the workflow of cervical cancer radiotherapy.

Several limitations still exist in our study. First,this work was lack of subjective assessment such as radiation oncologist evaluation or Turing imitation test35. Second,the diversity of CT scanner machines,image acquisition protocols, standard contouring,and even tumor staging hampered meaningful comparison of our results with other CNN models. Overall, increasing the amount of training data from different centers using different techniques could make the DL based model more robust, improving the segmentation accuracy.

Conclusion

This study has demonstrated through both geometric and dosimetric metrics that our DL based auto-segmentation can achieve clinically acceptable contours for most of the EBRT planning structures in cervical cancer patients, although the dosimetric consistency of CTV was a concern. Automatic delineation will be an essential component in cervical cancer workflow which would generate the accurate contouring.