Introduction

Organs at risk (OARs) delineation is a critical task in radiotherapy. It affects many aspects of treatment planning, which can further affect the probability of local tumor control and normal tissue complications [1,2,3,4]. However, manual OARs delineation is time-consuming and tedious work. This fact is especially true for cancers with complex anatomy, such as nasopharyngeal carcinoma (NPC).

Auto-segmentation can reduce the work intensity of oncologists and improve work efficiency [5,6,7,8,9,10]. Recently, deep learning-based auto-segmentation has become a mainstream assistance segmentation technique provided by many software vendors [7, 11,12,13]. The latest relevant studies have shown promising results for these systems, improving consistency among oncologists and shortening the delineation time [14,15,16].

As an emerging technique, auto-segmentation requires sufficient assessment in clinical application. Although many studies have evaluated the performance of auto-segmentation in terms of geometric metrics [7, 14, 15, 17,18,19], few studies have focused on its dosimetric impact [11, 20]. OARs delineation directly affects plan optimization and the local dose distribution, and in turn the plan evaluation and the normal tissue complication probability. Dosimetric evaluation therefore has important clinical significance, and evaluation by geometric metrics alone is not sufficient for clinical application.

There are many approaches for dosimetry evaluation of OARs auto-segmentation. Van Dijk et al. [11] compared the dosimetric difference between auto-segmented and manually delineated OARs with a clinically approved treatment plan. The results showed that more accurate auto-segmentation translated into smaller dosimetric differences compared to the manual contours. Kaderka et al. [20] used an atlas-based method for cardiac substructure segmentation and showed that the quality of auto-segmented contours cannot be determined by geometric metrics alone, and that geometric measures did not predict the accuracy of dosimetric parameters. However, both studies used clinically approved treatment plans based on manual delineation and evaluated them on auto-segmented contours.

The future goal of OARs auto-segmentation is to be applied to clinical plan optimization and evaluation with little or no manual modification. The OARs delineation will directly affect the plan optimization and local dose distribution. Re-optimizing the plan based on auto-segmented OARs is more in line with the actual clinical situation, so we think it may be the most reasonable approach. However, existing studies have not evaluated the feasibility of applying auto-segmented OARs to plan optimization.

We believe that evaluating the feasibility of applying auto-segmented OARs to plan optimization has important clinical significance, because it is the basis for automating the whole treatment planning process (including automatic delineation, automatic planning, plan evaluation, etc.). This paper presents a preliminary exploration of this question. In this study, we reoptimized the treatment plan based on auto-segmented contours and then used manual contours to evaluate the dosimetric differences between the reoptimized plans and the original clinical treatment plans.

To further assess the dosimetric impact of deep learning-based auto-segmentation, we designed a dosimetric comparison study. Two sites, the nasopharynx and rectum, and two deep learning-based auto-segmentation systems, a commercial tool from United Imaging Healthcare (UIH, Shanghai, China) and an in-house auto-segmentation tool developed by our institution, were investigated. To evaluate the application of deep learning-based auto-segmentation in clinical situations, the whole planning process followed our clinical routine. In addition, the correlation between geometric metrics and dosimetric differences was investigated.

Methods

A schematic workflow of this study is presented in Fig. 1. After auto-segmentation, the assessment was divided into three parts. First, the accuracy of auto-segmentation was evaluated based on geometric metrics. Second, we reoptimized the plan based on the auto-segmented OARs and compared it with the original treatment plan to evaluate the dosimetric differences. Third, we explored the correlation between the geometric metrics and dosimetric differences.

Fig. 1 The workflow of this study

Patients and treatment protocol

Two sites, including the nasopharynx and rectum, were investigated. Ten patients for each site who received radiotherapy at Fudan University Shanghai Cancer Center between 2017 and 2019 were randomly selected from our database and enrolled in this study. The details of the patient characteristics are shown in Additional file 1: Supplement A, Table S1. For NPC patients, the prescription was 70.4 Gy in 32 fractions for T3-T4 stage patients and 66 Gy in 30 fractions for T1-T2 stage patients. For rectal cancer, all of the patients received 50 Gy in 25 fractions.

OARs manual delineation

Manual delineation was performed on the Pinnacle (Pinnacle, v9.10, Philips Corp, Fitchburg, WI, USA) treatment planning system. The targets and OARs are presented in Table 1. These contours were delineated by radiation oncologists with more than 5 years of experience in radiation oncology and revised and approved by senior radiation oncologists. All of the manually delineated OARs were used for patient treatment.

Table 1 The target and the OARs constraint functions and dosimetric evaluation metrics

Deep learning-based auto-segmentation

Two deep learning-based auto-segmentation systems were used in this study. FD is an in-house deep learning-based auto-segmentation system. The details of the network and model training have been presented in our recent studies [21,22,23,24]. Briefly, we used approximately 200 NPC and 200 rectal cancer cases from our institution as the training dataset. The delineations in the training dataset came from clinical routine without modification for this task. The network was a modified 2D U-Net. It has been in clinical testing for OARs auto-segmentation on NPC and rectal cancer since 2018. The OARs segmented by this system were marked as OAR_FD.

UIH is a commercial treatment planning system developed by UIH Corporation [25, 26]. It uses a two-phase 3D U-Net for OARs location and segmentation. The training data did not come from our institution. We have used UIH for clinical testing since 2019. This system provides OARs auto-segmentation for NPC and rectal cancer, which was used in this study. The OARs segmented by this system were marked as OAR_UIH.

Treatment planning

Pinnacle (Pinnacle, v9.10, Philips Corp, Fitchburg, WI, USA) and a Varian Trilogy Linac (Varian, Palo Alto, CA, USA) with a 120-leaf multileaf collimator were used for treatment planning for all plans. All of the treatment planning processes were the same as our clinical routine for consistency.

The NPC clinical treatment plans used the 9-field static intensity modulated radiotherapy (sIMRT) technique, and the gantry angles were 0°, 45°, 85°, 120°, 160°, 200°, 245°, 275°, and 315°. The field could be split based on field width. The maximum number of segmented subfields was set to 55. The rectal cancer clinical treatment plans used the 7-field sIMRT technique. The beam angles were chosen based on clinical experience, mainly with the aim of reducing the radiation exposure of the bladder and femoral heads. The maximum number of segmented subfields was set to 35. For all of the plans, the minimum subfield area was set to 10 cm², and the minimum number of monitor units per subfield was set to 10 MU. The dose calculation grid was set to 3 mm.

The prescription was normalized to the mean dose of PTV as in our clinical routine. For NPC, we prescribe 220 cGy per fraction to 97% of the PTV70.4 mean dose for 32 fractions or 220 cGy per fraction to 97% of the PTV66 mean dose for 30 fractions. For rectal cancer, we prescribe 200 cGy per fraction to 96% of PTV mean dose for 25 fractions. In this setting, the D95 of PTV was close to the prescription dose. All of the treatment plans were completed by medical physicists with more than 3 years of experience.
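As a worked illustration of this normalization, and assuming that "prescribing to 97% (or 96%) of the PTV mean dose" means the prescription dose equals that fraction of the PTV mean dose, the planned PTV mean doses would be approximately as follows. The symbol $\bar{D}$ is introduced here only for illustration.

$$\bar{D}_{\text{PTV70.4}} \approx \frac{220\ \text{cGy} \times 32}{0.97} = \frac{7040\ \text{cGy}}{0.97} \approx 7258\ \text{cGy}$$
$$\bar{D}_{\text{PTV66}} \approx \frac{220\ \text{cGy} \times 30}{0.97} = \frac{6600\ \text{cGy}}{0.97} \approx 6804\ \text{cGy}$$
$$\bar{D}_{\text{PTV, rectum}} \approx \frac{200\ \text{cGy} \times 25}{0.96} = \frac{5000\ \text{cGy}}{0.96} \approx 5208\ \text{cGy}$$

These values follow only from the fractionation stated above under that assumption. They are not reported plan results.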

Each patient had three plans: Plan_Manual, Plan_FD and Plan_UIH. Plan_Manual was a clinically approved plan that was used for patient treatment. Plan_FD and Plan_UIH were reoptimized based on manually delineated PTVs and auto-segmented OARs. For OARs that were not generated by the auto-segmentation system (temporomandibular joints and chiasm), we used manually delineated OARs to replace them. The beam angles and initial optimization parameters for the reoptimized plans (Plan_FD and Plan_UIH) were consistent with Plan_Manual. The physicist could adjust the optimization objective function based on his or her experience and judgment, the same as the routine clinical treatment planning process.

Geometric evaluation

Manually delineated contours were used as references. The performance of auto-segmentation was evaluated by the following four geometric metrics: Hausdorff distance (HD), mean distance to agreement (MDA), Dice similarity coefficient (DICE), and Jaccard index [27,28,29]. HD and MDA were used to quantify the maximum and mean 3D distances between contours A and B, respectively. DICE and the Jaccard index were measures of the overlap between contours A and B. The definitions are as follows:

$$\text{HD}(A, B) = \max \left\{ H(A, B),\; H(B, A) \right\}$$
$$H(A, B) = \max_{a \in A} \left\{ \min_{b \in B} \left\{ d(a, b) \right\} \right\}$$

where d(a, b) represents the 3D Euclidean distance between point a from contour A and point b from contour B.

$$\text{MDA}(A, B) = \frac{h(A, B) + h(B, A)}{2}$$
$$h(A, B) = \underset{a \in A}{\text{mean}} \left\{ \min_{b \in B} \left\{ d(a, b) \right\} \right\}$$
$$\text{DICE} = \frac{2\,\left| A \cap B \right|}{\left| A \right| + \left| B \right|}$$
$$\text{Jaccard} = \frac{\left| A \cap B \right|}{\left| A \cup B \right|}$$

For a perfect overlap, the values of HD and MDA are 0, and the values of DICE and Jaccard are 1. The poorer the agreement, the larger the values of HD and MDA, and the closer the values of DICE and Jaccard are to 0.
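The four definitions above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this study: it assumes each contour is given as a small set of 3D voxel coordinates, uses the same point set for both the distance and the overlap metrics, and relies on brute-force nearest-point search. All function names and the toy contours are hypothetical.

```python
import math


def directed_distances(A, B):
    """For each point a in A, the distance to its nearest point in B."""
    return [min(math.dist(a, b) for b in B) for a in A]


def hausdorff(A, B):
    """HD(A, B) = max{H(A, B), H(B, A)}."""
    return max(max(directed_distances(A, B)), max(directed_distances(B, A)))


def mean_distance_to_agreement(A, B):
    """MDA(A, B) = (h(A, B) + h(B, A)) / 2."""
    h_ab = sum(directed_distances(A, B)) / len(A)
    h_ba = sum(directed_distances(B, A)) / len(B)
    return (h_ab + h_ba) / 2


def dice(A, B):
    """DICE = 2 |A ∩ B| / (|A| + |B|)."""
    A, B = set(A), set(B)
    return 2 * len(A & B) / (len(A) + len(B))


def jaccard(A, B):
    """Jaccard = |A ∩ B| / |A ∪ B|."""
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)


# Toy contours (hypothetical): two 3-voxel rows shifted by one voxel.
manual = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
auto = [(1, 0, 0), (2, 0, 0), (3, 0, 0)]

print("HD     :", hausdorff(manual, auto))
print("MDA    :", mean_distance_to_agreement(manual, auto))
print("DICE   :", dice(manual, auto))
print("Jaccard:", jaccard(manual, auto))
```

For clinical contours, the distance metrics are normally evaluated on surface points in physical units (mm), and a spatial index would replace the brute-force search.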

Dosimetric evaluation

Plan_Manual was the clinically approved treatment plan, whereas Plan_FD and Plan_UIH were plans reoptimized based on manually delineated PTVs and auto-segmented OARs. In the dosimetric evaluation, we used manually delineated OARs to compare dose-volume metrics between Plan_FD, Plan_UIH and Plan_Manual. In Fig. 2, the red solid line represents the method we used in this study, and the blue dashed line represents the traditional evaluation method. For serial organs, we mainly focused on Dmax. For parallel organs, we mainly focused on Dmean, V30 or V40 (Table 1). The dose-volume metrics of PTVs and manually delineated OARs were extracted from Plan_FD, Plan_UIH and Plan_Manual.

Fig. 2 Different dosimetric evaluation methods between this study and others
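The dose-volume metrics named above, and the difference between a reoptimized plan and Plan_Manual evaluated on the same manual contour, can be illustrated with a short sketch. It assumes equal-volume voxels, doses in Gy, and Vx defined as the percentage of the organ volume receiving at least x Gy. The function names and dose values are hypothetical and are not data from this study.

```python
def d_max(doses):
    """Maximum voxel dose inside the contour."""
    return max(doses)


def d_mean(doses):
    """Mean voxel dose inside the contour (equal voxel volumes assumed)."""
    return sum(doses) / len(doses)


def v_x(doses, threshold):
    """Percentage of voxels receiving at least `threshold` (e.g. V30, V40)."""
    return 100.0 * sum(1 for d in doses if d >= threshold) / len(doses)


def metric_difference(metric, doses_reopt, doses_manual, *args):
    """Metric of the reoptimized plan minus metric of Plan_Manual,
    both sampled inside the same manually delineated contour."""
    return metric(doses_reopt, *args) - metric(doses_manual, *args)


# Hypothetical voxel doses (Gy) inside one manual OAR contour.
plan_manual = [10.0, 20.0, 30.0, 40.0]
plan_reopt = [12.0, 18.0, 31.0, 43.0]

print("delta Dmean:", metric_difference(d_mean, plan_reopt, plan_manual))
print("delta Dmax :", metric_difference(d_max, plan_reopt, plan_manual))
print("delta V30  :", metric_difference(v_x, plan_reopt, plan_manual, 30.0))
```

The key point of the design is that only the plan changes between the two evaluations, while the contour used for sampling stays fixed to the manual delineation.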

A 3D gamma analysis was performed with 3% and 3 mm for whole-body and PTV dose distribution comparison. The homogeneity index (HI) and conformity index (CI) for PTV were further calculated using the following formulas:

$$\text{HI} = \frac{D_{2} - D_{98}}{D_{P}}$$
$$\text{CI} = \frac{V_{R} \times V_{R}}{V_{P} \times V_{\text{dose}}}$$

where D2 and D98 are the doses received by 2% and 98% of the PTV volume, respectively, DP is the prescription dose, VP and Vdose are the volume of the PTV and of the prescription dose region, respectively, and VR is the intersection volume of VP and Vdose.
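The two indices can be computed directly from these quantities. The sketch below is illustrative only: the function names are hypothetical, and the input values are invented numbers chosen to show the arithmetic, not results from this study.

```python
def homogeneity_index(d2, d98, d_p):
    """HI = (D2 - D98) / DP; 0 corresponds to a perfectly uniform PTV dose."""
    return (d2 - d98) / d_p


def conformity_index(v_r, v_p, v_dose):
    """CI = (VR * VR) / (VP * Vdose); 1 corresponds to perfect conformity."""
    return (v_r * v_r) / (v_p * v_dose)


# Hypothetical inputs: doses in cGy, volumes in cm^3.
hi = homogeneity_index(d2=5350.0, d98=4900.0, d_p=5000.0)
ci = conformity_index(v_r=950.0, v_p=1000.0, v_dose=1100.0)

print("HI:", round(hi, 3))
print("CI:", round(ci, 3))
```

Because VR cannot exceed either VP or Vdose, CI is bounded above by 1 and decreases both when the PTV is under-covered and when the prescription isodose spills outside the PTV.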

Correlation between the geometric metric and dosimetric metric

The correlation between each geometric metric and ΔDose was analyzed with Spearman’s correlation test. ΔDose is the difference in a dose-volume metric between a reoptimized plan (Plan_FD or Plan_UIH) and Plan_Manual. Note that differences in volume metrics (V30 or V40) are also denoted by ΔDose.
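Spearman's coefficient is the Pearson correlation computed on ranks, which is why it detects any monotonic relationship rather than only a linear one. A minimal pure-Python sketch follows. It computes the coefficient only (no p value), assigns average ranks to ties, and uses hypothetical function names and data.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-patient values: HD (mm) and the matching dose difference (cGy).
hd_values = [2.0, 3.5, 5.0, 8.0]
dose_differences = [1.0, 4.0, 9.0, 20.0]

print("Spearman R:", spearman(hd_values, dose_differences))
```

In this toy example the relationship is nonlinear but strictly increasing, so the rank-based coefficient reaches its maximum even though a linear fit would be imperfect.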

Statistical analysis

R software (v4.0) was used for statistical analysis. For a value comparison, the Shapiro–Wilk normality test was performed first. If the data were normally distributed, the paired-sample t test was performed; otherwise, Wilcoxon’s paired-sample nonparametric signed-rank test was performed. A p value < 0.05 was considered statistically significant. The correlations between geometric metrics and dose-volume metric differences were evaluated with Spearman’s correlation coefficient R.
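The test-selection rule described above can be sketched as follows. The study itself used R; this Python version with SciPy is only an illustration of the same decision logic. It assumes that normality is checked on the paired differences, which the text does not specify, and the function name and data are hypothetical.

```python
from scipy import stats


def compare_paired(reoptimized, manual, alpha=0.05):
    """Choose a paired test from a Shapiro-Wilk check on the differences."""
    differences = [r - m for r, m in zip(reoptimized, manual)]
    _, p_normal = stats.shapiro(differences)
    if p_normal > alpha:
        # Normality not rejected: paired-sample t test.
        test_name = "paired t test"
        p_value = stats.ttest_rel(reoptimized, manual).pvalue
    else:
        # Normality rejected: Wilcoxon signed-rank test.
        test_name = "Wilcoxon signed-rank test"
        p_value = stats.wilcoxon(reoptimized, manual).pvalue
    return test_name, float(p_value)


# Hypothetical Dmax values (cGy) for ten patients under two plans.
manual_plan = [6400, 6100, 5900, 6700, 6300, 6000, 6550, 6200, 5800, 6450]
reopt_plan = [6332, 6080, 5930, 6650, 6310, 5960, 6495, 6215, 5765, 6405]

test_name, p_value = compare_paired(reopt_plan, manual_plan)
print(test_name, "p =", round(p_value, 3))
```

With only ten patients per site, the normality check has limited power, so the choice between the two tests can be sensitive to individual cases.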

Results

Geometric evaluation

Figure 3 shows the geometric evaluation results of auto-segmentation. The two deep learning systems provided similar results for some OARs, including the parotids, temporal lobes, lens, and eyes (DICE, p > 0.05). Here, the p value refers to the DICE difference between OAR_FD and OAR_UIH. For the brainstem and spinal cord, although there was a significant difference (p < 0.05), the deviation was small (less than 0.05 in DICE). OAR_FD had better performance in the optic nerves, oral cavity, larynx, and femoral heads, whereas OAR_UIH had better performance in the bladder. Representative rectal cancer and NPC examples of auto-segmentation are illustrated in Fig. 4 and Additional file 1: Supplement B, Fig. S1. More examples are presented in Additional file 1: Supplement D, Figs. S6-S12.

Fig. 3 Geometric evaluation results of auto-segmentation. a The DICE; b The mean distance to agreement (MDA); c The Jaccard; d The Hausdorff distance (HD)

Fig. 4 An example of a rectal cancer patient. a Manual OARs; b FD OARs; c UIH OARs; d Contour comparison; e Plan_Manual dose distribution; f Plan_FD dose distribution; g Plan_UIH dose distribution; h PTV contour; i Plan_FD with manual OARs; j Plan_UIH with manual OARs; k 3D gamma analysis of Plan_FD, red color represents gamma index > 1; l 3D gamma analysis of Plan_UIH, red color represents gamma index > 1

PTV dosimetry evaluation

Table 2 lists the PTV dosimetric parameters of Plan_Manual, Plan_FD and Plan_UIH. No significant dosimetric differences were found when comparing Plan_FD and Plan_UIH with Plan_Manual.

Table 2 Summary of the PTV dosimetry parameters of the reoptimized treatment plans (Plan_FD and Plan_UIH) and the original clinical treatment plans (Plan_Manual). All of the values are reported as the mean ± standard deviation

OARs dosimetry evaluation

Table 3 lists the OARs dosimetry parameters. No significant dosimetric differences were found except for the left temporal lobe Dmax for Plan_FD vs. Plan_Manual (6376 ± 2126 cGy vs. 6444 ± 2156 cGy, p = 0.05). Figure 4 and Additional file 1: Supplement B, Fig. S1 present the dose distributions of Plan_Manual, Plan_FD and Plan_UIH for representative rectal cancer and NPC cases. Figure 5 shows the dose-volume histogram (DVH) of Plan_Manual, Plan_FD and Plan_UIH for a representative rectal cancer case. The dose-volume metrics of Plan_Manual evaluated on OAR_FD and OAR_UIH are provided in Additional file 1: Supplement B, Table S2.

Table 3 Summary of the OARs dosimetry parameters of the reoptimized treatment plans (Plan_FD and Plan_UIH) and the original clinical treatment plans (Plan_Manual). All of the values are reported as the mean ± standard deviation

Fig. 5 The dose-volume histogram (DVH) of Plan_Manual, Plan_FD and Plan_UIH for a representative rectal cancer case

The correlation between dosimetric differences and the geometric metrics

Table 4 shows the results of the correlation analysis between dosimetric differences and geometric metrics. No OAR showed a strong correlation between its ΔDose and all four geometric metrics. The only significant correlation was found between the femoral head ΔDmean and its HD (R = 0.40, p = 0.01). Although the brainstem ΔDmax and its DICE were significantly correlated, this might be a random statistical error, since the trend is contrary to our expectations. For detailed data, please refer to Additional file 1: Supplement C, Figs. S2–S5.

Table 4 The correlation between dosimetric differences and the geometric metrics

Discussion

In this study, we assessed the dosimetric impacts of deep learning-based OARs auto-segmentation on nasopharyngeal and rectal cancers. Our results showed that deep learning-based OARs auto-segmentation had no significant impact on the PTV dose distribution or most OARs dose-volume metrics, while the correlation between the geometric metrics and OARs dosimetric differences was weak.

Two deep learning auto-segmentation systems were investigated. Both systems are under clinical testing in our institution. The clinical test for FD started in November 2018. Radiation oncologists can use this system for NPC and rectal OARs auto-segmentation in our institution. These auto-segmented contours were usually reviewed and modified by radiation oncologists before clinical approval. This process has been applied to more than 500 patients. For UIH, we started testing in March 2019. As with the FD system, radiation oncologists are required to review auto-segmented contours before clinical approval. Preliminary feedback suggests that both systems can reduce radiation oncologists’ workload. More detailed data are being collected.

For quantitative geometric evaluation, both systems provided similar performance for five OARs (eyes, parotids, lens, oral cavity and temporal lobes, p > 0.05 DICE). These results are similar to those reported in other studies [7, 11]. Although the differences for the spinal cord and brainstem were significant, the deviation was small (approximately 0.04 in DICE and < 0.5 mm in MDA). Six OARs, including the bladder, femoral heads, spinal cord, brainstem, optic nerves and larynx, were significantly different between the two systems (p < 0.05, DICE). The reasons might be as follows.

FD provided better performance than UIH (p < 0.05 DICE) for the femoral heads, optic nerves, spinal cord, and larynx, which might be caused by the different OARs definitions between our clinical routine and the UIH training data. For example, we did not include the femoral necks in femoral head segmentation, whereas UIH included them (Fig. 4c, red arrow). Additionally, OARs that do not have clearly visible boundaries on CT images, such as the temporal lobes, can have large delineation variations (Additional file 1: Supplement D, Fig. S10). Retraining the auto-segmentation model on our institution's data might eliminate these deviations.

The performance of the bladder for FD was worse than that for UIH (p < 0.05 DICE). This finding might have been caused by the algorithm difference between the two systems. Our system used a 2D U-Net network, which could have some outliers, as our previous study demonstrated [21, 22]. UIH used a two-phase algorithm, which was more robust according to region location.

In the dosimetric analysis, no difference was found for the PTV (p > 0.05). The most significant dose difference was in rectal PTV D2 (Manual: 5349 ± 177 cGy, FD: 5384 ± 167 cGy, UIH: 5383 ± 160 cGy, p = 0.08). This study did not involve auto-segmentation of the target volume, and all reoptimized plans used the manually delineated PTV. The small dosimetric difference in the PTV might be mainly caused by the experience, skills and operating habits of different dosimetrists. For OARs dose-volume metrics, the most significant dose difference was in the left temporal lobe Dmax for Plan_FD vs. Plan_Manual (6376 ± 2126 cGy vs. 6444 ± 2156 cGy, p = 0.05). This finding might have been caused by the large variation in the delineation of the temporal lobes (Additional file 1: Supplement D, Fig. S10).

Although no significant difference in dose-volume metrics was found for the PTV and OARs, a review of the plan dose distribution remains necessary to fully investigate the dosimetric impact of an auto-segmentation system, because the delineation could still affect the final dose distribution. As we demonstrated in Fig. 4c, g, the femoral neck delineated by the UIH system was spared from 10% dose coverage. The low-dose isodose lines (10 Gy and 25 Gy) of Plan_UIH have different shapes compared to Plan_Manual and Plan_FD. In contrast, the difference between the UIH and manual delineations of the oral cavity did not cause a significant dose distribution difference (Additional file 1: Supplement B, Fig. S1c, g, red arrow).

This study showed that there was no clear monotonic relationship between the geometric metrics and dosimetric differences for most OARs. The only significant correlation was shown for the femoral head mean dose. There could be several reasons for this result. First, the difference between manual and automatic delineation might be too small to cause a dosimetric difference beyond the level of random dose noise. In other words, the performance of our two auto-segmentation systems was “good enough”. When the delineation difference is sufficiently large, such as with the femoral head definition, the correlation between geometric metrics and dosimetric difference can still be observed. Second, the interoperator or intraoperator difference during treatment planning could cause a larger difference than auto-segmentation. These differences are difficult to avoid in a manual planning process. Automatic planning could decrease these subjective deviations. However, because we aimed to analyze the impact on routine clinical practice, we did not use automatic planning in this study.

In this study, we used manually delineated contours as references. This fact does not mean that manual delineation is “better” or more “accurate” than deep learning-based delineation. In our ongoing evaluation study, radiation oncologists preferred auto-segmented contours over manual delineation for the parotids, optic nerves, lens and eyes. This phenomenon was also observed in [11]. Manual delineation represents a clinically acceptable and approved contour quality, which also implies some clinical experience or the habits of local institutions. Therefore, using a commercial auto-segmentation system that is not trained on local data requires more investigation.

For segmentation evaluation, geometric evaluation is a straightforward way to assess auto-segmentation performance. Many studies using these indices have been published in recent years [17, 27,28,29]. Geometric metrics, such as DICE and MDA, are the critical indices for segmentation algorithm development. Using high-quality and consistent training or validation data, the algorithm performance can be quantified and compared. However, the clinical assessment of auto-segmentation can be much more complicated and should be based on clinical purposes. A small improvement in geometric metrics, for example, a DICE increase of 0.05, could represent substantial progress in the algorithm, yet its clinical value is likely to improve only marginally. A more practical assessment procedure should mimic clinical practice as much as possible. This principle is also consistent with some task-based evaluation procedures proposed by other studies [30, 31].

The main limitation of this study was that it did not investigate interoperator variations. Using an auto-planning technique might reduce these variations, in turn increasing objectivity when plans are compared. These tasks are left for future work.

Conclusion

Deep learning-based OARs auto-segmentation for NPC and rectal cancer might not have a significant impact on PTV and OARs doses. Correlations between the auto-segmentation geometric metric and dosimetric difference were not observed for most OARs. A dosimetric evaluation is recommended for applying auto-segmentation systems in the clinic.