Introduction

To create a patient-specific radiotherapy plan, radiation oncologists (ROs) manually contour the tumour or target region and organs at risk (OARs) on the patient’s computed tomography (CT) or magnetic resonance (MR) images. The accuracy of these contours is essential, as inaccurate contours have the potential to affect the outcome of the treatment. The manual contouring process is time-consuming, and the time taken can vary with the professional’s abilities and knowledge; it can take several hours to complete contouring for one patient [4]. Previous studies found that manual contouring can take up to 3 h in head and neck intensity-modulated radiotherapy (IMRT) planning [9].

These factors can also lead to noticeable delays in treatment, resulting in unwanted treatment outcomes [4]. A previous study found that increased waiting time for radiotherapy can increase the risk of local recurrence, which can translate into a decreased overall survival rate in some clinical situations [6].

Additionally, the contouring process suffers from large inter- and intra-observer variability between professionals [9, 12, 17, 20]. Considerable mean volume variations of about 50% were found during parotid delineation [9]. A study of inter-observer/inter-institution variability in target and OAR contouring for breast radiotherapy planning found that the overlap between manually contoured structures was low (as low as 10%) and that the variation between manually contoured volumes had standard deviations of up to 60% [17]. Inter-observer variations were also found in radiotherapy planning for other anatomical sites, such as cervical cancer radiotherapy [12] and oral cavity cancer radiotherapy [20]. Inter-observer variation has been shown to have a dosimetric impact on radiation therapy planning [17].

Auto-segmentation has the potential to replace manual contouring. Auto-contouring techniques were developed based on the capability of algorithms to use prior knowledge. In the early stages, auto-contouring techniques had little or no capability to use prior knowledge, due to limitations in computing power and the limited availability of prior segmentation data; these were low-level segmentation approaches such as intensity thresholding, region growing and heuristic edge detection [4]. As computing power rapidly developed, along with a much larger availability of prior knowledge, auto-contouring also developed rapidly, producing, for example, atlas-based auto-contouring and deep-learning auto-segmentation, which differ in the amount of prior knowledge used.

Deep-learning auto-segmentation is a machine learning technique in which algorithms learn, or are trained, to calculate the final contour. This technique uses multi-layer neural networks called convolutional neural networks (CNNs) [4, 31]. A large set of pre-contoured data, referred to as training data, is passed through the CNN to train the algorithm and optimise its parameters through the backpropagation algorithm, so that it can calculate and create optimised contours for target structures [16, 31]. The type and performance of deep-learning-based auto-segmentation depend on the network structure used, such as U-Net [24], V-Net (a 3D version of U-Net) [4] or ResNet [14], and on the quality and quantity of the training data set [2, 31]. More advanced network structures such as the vision transformer (ViT) have been introduced [28], and other studies showed that ViTs performed better than CNNs when both networks were trained on larger datasets [11].

Many studies have compared the OAR delineation accuracy of in-house AI-based and atlas-based auto-contouring systems in different cancer types, such as head and neck [5], breast [8] and liver [1]. Even though these studies demonstrated better performance in OAR contouring and better efficiency over atlas-based auto-contouring, the development and implementation of in-house AI-based auto-contouring can be complex due to challenges such as the expertise required to develop and implement the programming code and the difficulty of collecting a large training set [26].

In this study, we compared the performance of seven different commercially available AI-based auto-contouring systems in OAR delineation: Radiotherapy AI (Radiotherapy AI, Sydney, Australia), two different versions of Limbus Contour (Limbus AI Inc, Regina, SK, Canada), Therapanacea ART-plan Annotate (Therapanacea, Paris, France), MIM Contour Protégé AI (MIM, Cleveland, USA), Siemens AI-Rad Companion Organs RT (Siemens Healthineers, Erlangen, Germany) and RadFormation AutoContour (RadFormation, New York, USA).

Method

Clinical dataset

A total of 42 clinical cases (10 head and neck (HN), 10 brain (B), 10 pelvis (PLV), 4 breast (BT), 4 lung (L) and 4 abdomen (ABO) cases) treated at Chris O’Brien Lifehouse between 2019 and 2021 were selected for this study. The patient scans were selected consecutively from the clinical patient scans for each relevant body site. The computed tomography (CT) images were acquired with a Canon Aquilion LB CT scanner. Different CT scan parameters were used depending on the patient and the anatomical site scanned, as illustrated in Table 1. Twenty-four organs at risk were delineated by a single expert for each corresponding case, including brain (total number of samples, n = 10), brainstem (n = 19), left eye (n = 12), right eye (n = 12), spinal cord (n = 19), oesophagus (n = 12), optic chiasm (n = 11), left optic nerve (n = 11), right optic nerve (n = 11), left parotid gland (n = 10), right parotid gland (n = 9), left submandibular gland (n = 5), right submandibular gland (n = 4), bladder (n = 10), left femoral head (n = 10), right femoral head (n = 10), heart (n = 7), liver (n = 6), left kidney (n = 5), right kidney (n = 5), left lung (n = 9), right lung (n = 9), rectum (n = 10), and stomach (n = 4). Throughout this study, the manual contours of the OARs in each case were considered the reference contours against which the automated contours from the AI systems were compared.

Table 1 CT parameters used for each tested case

AI-based auto-contouring systems

Seven different AI-based segmentation systems were used to delineate the same OARs contoured in each case during this study: Limbus Contour versions 1.5 and 1.6, MIM Contour Protégé AI version 1.1.1, Radformation AutoContour version 2.0.19, Radiotherapy AI version RTAI lifehouse-v0.2.0, Siemens AI-Rad Companion Organs RT (AIRC) version VA31A and Therapanacea ART-plan Annotate version 1.10.1. Each AI system uses a different network structure to train its model. Limbus Contour [22] and MIM Contour Protégé AI [29] both use a CNN based on the U-Net structure. Radformation AutoContour [18] uses a CNN based on the V-Net structure. Siemens AI-Rad Companion Organs RT [15] uses a deep image-to-image network (DI2IN). Radiotherapy AI uses an adapted 3D U-Net. The authors were unable to identify the network used for Therapanacea ART-plan Annotate.

Radiotherapy AI used clinical data from Chris O’Brien Lifehouse as the training data set for its model. The training data set and the data set used for this study were mutually exclusive. Radiotherapy AI is in the development stage and is not commercially available yet.

Quantitative evaluation method

The volumetric Dice Similarity Coefficient (DSC), surface Dice Similarity Coefficient (sDSC) and maximum Hausdorff Distance (HD) between the manual segmentations and the AI-based auto-contouring systems’ segmentations were calculated to quantitatively evaluate the performance of each AI-based auto-contouring system in OAR delineation [25]. The DSC, sDSC and HD were calculated using a Python script with PlatiPy version 0.4.0 [7]. The volumetric DSC measures the overlap between two contoured volumes and is defined as:

$$\begin{aligned} DSC = \frac{2|A \cap B|}{|A| + |B|} \end{aligned}$$

where A is the volume of the manual contour and B is the volume of the contour delineated by an AI system. The DSC ranges from 0, indicating no overlap between the two contours, to 1, indicating complete overlap.
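As an illustration, the DSC can be computed directly on binary voxel masks. The following is a minimal sketch in NumPy (the study itself used PlatiPy); `dice_coefficient` is a hypothetical helper written for this example, not part of any library:

```python
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Volumetric DSC between two boolean masks: 2|A ∩ B| / (|A| + |B|)."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: define as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy 2D example: two 16-voxel squares sharing a 2x2 overlap
a = np.zeros((10, 10), dtype=bool); a[2:6, 2:6] = True
b = np.zeros((10, 10), dtype=bool); b[4:8, 4:8] = True
print(dice_coefficient(a, b))  # 2*4 / (16+16) = 0.25
```

The same function applies unchanged to 3D CT masks, since the sums run over all voxels regardless of dimensionality.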

The surface Dice Similarity Coefficient (sDSC) is a newer metric for assessing segmentation performance, introduced by Nikolov et al. [21]. It calculates the overlap between two surfaces at a defined tolerance (\(\tau\)) and is defined as:

$$\begin{aligned} sDSC^{(\tau )}_{A,B}=\frac{|S_A \cap B_{B}^{(\tau )}| + |S_B \cap B_{A}^{(\tau )}|}{|S_A|+|S_B|} \end{aligned}$$

where \(S_A\) and \(S_B\) are the surfaces of the manual contour (A) and the AI contour (B), and \(B_A^{(\tau )}\) and \(B_B^{(\tau )}\) are the border regions of the manual contour (A) and the AI contour (B), respectively. In radiotherapy, OARs are contoured slice by slice, and segmentation performance is assessed by the fraction of the contour surface that needs to be edited. The sDSC has therefore been suggested as a more suitable metric than the volumetric DSC, which weighs all regions where the two volumes do not overlap equally and independently of their distance from the surface, and is biased towards OARs with large volumes [21]. Another study showed that the sDSC is a better indicator than the DSC and HD of the time needed to edit contours and the time saved by using auto-contouring systems [27]. The tolerance parameter \(\tau\) needs to be set to a clinically acceptable variation, for example by measuring inter-observer variation in contouring [21]. For this study, a \(\tau\) value of 0 mm was used for the sDSC calculation to evaluate the absolute difference between the manual and AI systems’ contours, and additional sDSC calculations with \(\tau\) values of 1, 2 and 3 mm were performed, as a previous study by Rhee et al. found that the sDSC with tolerance values of 1, 2 and 3 mm was the most accurate similarity metric, compared to the other metrics considered, for detecting errors in contours [23].
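The sDSC can be sketched on voxel masks using distance transforms: surface voxels are extracted by subtracting a binary erosion from each mask, and a surface voxel of one contour counts as overlapping if it lies within \(\tau\) of the other contour’s surface. This is a simplified voxel-based approximation of the surface construction in Nikolov et al. [21] (the study itself used PlatiPy), and `surface_dice` is a hypothetical helper:

```python
import numpy as np
from scipy import ndimage

def surface_voxels(mask: np.ndarray) -> np.ndarray:
    """Surface = mask voxels removed by one binary erosion."""
    return mask & ~ndimage.binary_erosion(mask)

def surface_dice(mask_a: np.ndarray, mask_b: np.ndarray,
                 tau: float = 0.0, spacing: float = 1.0) -> float:
    """Approximate sDSC at tolerance tau (same units as voxel spacing)."""
    s_a, s_b = surface_voxels(mask_a), surface_voxels(mask_b)
    # Euclidean distance from every voxel to the nearest surface voxel
    # of the *other* contour
    dist_to_b = ndimage.distance_transform_edt(~s_b, sampling=spacing)
    dist_to_a = ndimage.distance_transform_edt(~s_a, sampling=spacing)
    overlap = (dist_to_b[s_a] <= tau).sum() + (dist_to_a[s_b] <= tau).sum()
    return overlap / (s_a.sum() + s_b.sum())

# Identical contours agree perfectly even at tau = 0
cube = np.zeros((12, 12, 12), dtype=bool)
cube[3:9, 3:9, 3:9] = True
print(surface_dice(cube, cube, tau=0.0))  # 1.0
```

Increasing \(\tau\) forgives small surface deviations, which is why the tolerance is tied to the clinically acceptable inter-observer variation.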

The maximum Hausdorff Distance (HD) between two contoured volumes calculates the greatest distance from a point in one contour to the closest point in the other contour, based on the equation:

$$\begin{aligned} HD(A,B)\ =\ \text {max}(h(A,B),h(B,A)) \end{aligned}$$

where h(A,B) is the directed Hausdorff distance from A to B, expressed as:

$$\begin{aligned} h(A,B)\ =\ \text {max}_{a\in A} \text {min}_{b\in B}||a-b|| \end{aligned}$$

\(||a-b||\) is the Euclidean distance between point a in A and point b in B. An HD value of zero indicates that the two contours’ shapes are identical; as the HD value increases, the difference between the two contours’ shapes increases.
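Given the contour surfaces as point sets, the symmetric HD above can be computed with SciPy’s `directed_hausdorff`, which returns the directed distance h(A,B); taking the maximum of both directions gives HD(A,B). A minimal sketch (the study itself used PlatiPy), with `hausdorff_distance` as a hypothetical helper:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """Symmetric HD: max of the two directed Hausdorff distances."""
    h_ab = directed_hausdorff(points_a, points_b)[0]  # h(A, B)
    h_ba = directed_hausdorff(points_b, points_a)[0]  # h(B, A)
    return max(h_ab, h_ba)

# Toy example: two congruent point sets offset by 3 units along x
a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
b = a + np.array([3.0, 0.0])
print(hausdorff_distance(a, b))  # 3.0
```

Because the HD is a maximum over all points, a single outlying voxel on either contour dominates the value, which is why it complements the overlap-based DSC and sDSC.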

To ensure a valid comparison, cases with non-identical numbers of data sets were divided into separate groups, ensuring that each set had an equal number of data points when calculating mean DSC, sDSC and HD. For instance, 19 cases were selected for testing in spinal cord segmentation. However, data from RTAI was unavailable for 9 out of the 19 cases, as the RTAI model was exclusively designed for Head and Neck cases at the time of the study.

Statistical analysis

The statistical difference between each index of DSC, sDSC and HD for each tested AI-based system was calculated using a suitable statistical test chosen from three: (1) Student’s t-test, (2) Welch’s t-test and (3) the Wilcoxon signed-rank test, depending on the properties of the compared data sets, with a p-value less than 0.05 indicating significance [26]. The testing was automated using an in-house Python script combined with published Python packages. Box plots of each data set in each case were created to check for outliers, and histograms were created to visually inspect the distribution of the data. The Shapiro-Wilk test and Q-Q plots were used to test the normality of the distribution of each sample. When the data were assumed to be normally distributed, the F-test was used to determine whether the variances of the compared data sets were equal. Student’s t-test was used in the case of equal variances between the two compared data sets, and Welch’s t-test was used in the case of unequal variances. The Wilcoxon signed-rank test was used when both compared data sets were not normally distributed, and when a normally distributed data set was compared with one that was not. It was also used to compare two data sets where either or both had outlier data points [13]. The detailed results of the statistical tests conducted during the study can be found in supplementary data A (DSC), B (HD) and C (sDSC).
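The test-selection logic described above can be sketched as follows. This is a simplified illustration of the decision flow only: it omits the box-plot outlier check and Q-Q inspection, and the function name `select_and_run_test` is hypothetical, not the in-house script used in the study:

```python
import numpy as np
from scipy import stats

def select_and_run_test(x, y, alpha=0.05):
    """Pick Student's t, Welch's t, or Wilcoxon signed-rank:
    Shapiro-Wilk for normality, then an F-test for equality of variances."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    normal = (stats.shapiro(x).pvalue > alpha
              and stats.shapiro(y).pvalue > alpha)
    if not normal:
        return "Wilcoxon signed-rank", stats.wilcoxon(x, y).pvalue
    # Two-sided F-test for equality of variances
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    p_f = 2 * min(stats.f.cdf(f, len(x) - 1, len(y) - 1),
                  stats.f.sf(f, len(x) - 1, len(y) - 1))
    equal_var = p_f > alpha
    name = "Student's t-test" if equal_var else "Welch's t-test"
    return name, stats.ttest_ind(x, y, equal_var=equal_var).pvalue

# Hypothetical paired DSC samples from two AI systems
rng = np.random.default_rng(42)
x = rng.normal(0.85, 0.05, size=20)  # e.g. DSC values from system A
y = rng.normal(0.80, 0.05, size=20)  # e.g. DSC values from system B
name, p = select_and_run_test(x, y)
print(name, p)
```

Note that the Wilcoxon signed-rank test requires paired samples of equal length, matching the per-case pairing of contours in this study.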

Fig. 1
figure 1

Manual and AI systems’ contour of the spinal cord in Varian Eclipse Treatment Planning system

Fig. 2
figure 2

3D representation of manual and AI systems’ contour of both left and right femoral heads in Varian Eclipse Treatment Planning system

Table 2 Dice Similarity Coefficient (DSC) values between manual contours and individual automated contours of OARs considered
Table 3 Dice Similarity Coefficient (DSC) values between manual contours and individual automated contours of OARs considered
Table 4 Maximum Hausdorff Distance (HD) values between manual contours and individual automated contours of OARs considered
Table 5 Maximum Hausdorff Distance (HD) values between manual contours and individual automated contours of OARs considered
Table 6 Surface Dice Similarity Coefficient (sDSC) values between manual contours and individual automated contours of OARs considered
Table 7 Surface Dice Similarity Coefficient (sDSC) values between manual contours and individual automated contours of OARs considered

Results

The performance of each individual AI-based auto-contouring system in contouring the twenty-four organs at risk considered in the various clinical cases (head and neck, brain, lung, breast, pelvis, and abdomen) was quantitatively evaluated by calculating the DSC, HD and sDSC between the contours of each tested organ contoured manually by an expert (Manual) and automatically by each system: Radiotherapy AI (RTAI), Limbus AI version 1.5 (Lim1.5) and version 1.6 (Lim1.6), Therapanacea (TH), MIM (MIM), Siemens AIRC (SAIRC) and RadFormation (RF). Higher DSC and sDSC values and lower HD values illustrate better agreement with the Manual contours. The mean, standard deviation, range, and maximum absolute difference of the DSC for each considered OAR in head and neck and brain cases are presented in Table 2, and for lung, breast, pelvis, and abdomen cases in Table 3. The corresponding values of the maximum HD are presented in Tables 4 and 5, and of the surface DSC in Tables 6 and 7. The highest mean DSC and sDSC values and the lowest HD value for each case are presented in bold and highlighted. The distribution of the individual data for each OAR was tabulated and illustrated in scatter and box plots, with the corresponding statistical results, in Supplementary data A (DSC), B (HD) and C (sDSC). Box plots of the data for the individual AI systems for all considered OARs are shown in Supplementary data 1 (DSC), 2 (HD) and 3 to 6 (sDSC with different \(\tau\) values).

Discussion

In this study, seven different AI-based auto-contouring systems were tested to assess each system’s performance in contouring the organs at risk considered in different clinical cases. In general, the study showed that sDSC values were considerably smaller than volumetric DSC values, especially for OARs with large volumes, as reported in previous studies [10, 21, 27].

In head and neck and brain cases, the contours delineated by each AI system showed good agreement with the reference contours for most of the OARs considered. The DSC values for the brain, brainstem, left eye, right eye, left parotid gland, right parotid gland, left submandibular gland and right submandibular gland from the tested AI systems were comparable to the previous studies by Doolan et al. [10] and Liu et al. [19]. This study found slightly lower sDSC values for the same set of OARs from the tested AI systems [10], and the HD values for these OARs were slightly higher than previously reported HD [10].

The study found that the AI systems showed reduced and inconsistent performance in contouring small and complex structures, such as the optic structures and the oesophagus, which are difficult to visualise in CT images compared with MR. The reduced and inconsistent performance of auto-contouring systems on small and complex structures has been reported previously: Liu et al. [19] reported a low DSC value for the optic chiasm and wide variation in DSC values for the left and right optic nerves across multiple previous studies. Similarly, reduced and inconsistent performance was found in this study for the oesophagus, which correlates with previously reported DSC, sDSC and HD values for the oesophagus [10].

The Radiotherapy AI software showed the best performance across all tested systems. The better agreement between the Radiotherapy AI contours and the manual contours in this study may be because the Radiotherapy AI model was trained on our clinic’s contours and therefore produced contours similar to those used in our clinic. This result demonstrates the advantage of an in-house built AI system, or of AI systems trained on clinic-specific data: they provide contours more similar to those currently used in that clinic. On the other hand, this could perpetuate incorrect contouring and does not provide a review of current contouring practice, nor would it lead to standardisation of contours across radiation therapy centres. However, the study found very small maximum differences in both DSC and HD values across all tested systems, so in most test cases the shapes of the contours delineated by the AI systems were comparable to each other.

A low DSC for the spinal cord was found across all tested AI systems during this study, whereas previously reported spinal cord DSC values were considerably higher [10, 19]. This large disagreement occurs because the manual contours cover only the part of the spinal cord that lies in the treatment field, while the AI systems contour the entire spinal cord visible in the image, as shown in Fig. 1.

No specific AI-based system showed overall superior performance compared to the others in lung, breast, pelvis and abdomen cases. Again, the very small maximum differences in both DSC and HD values across all tested systems support the conclusion that the shapes of the contours delineated by each AI system are comparable to each other.

The DSC values for the bladder, left and right lungs, heart, left and right kidneys, liver, rectum and stomach from the tested AI systems were comparable to previous studies [1, 10]. This study found slightly lower sDSC values for the bladder, heart, left and right lungs and liver compared to previously reported sDSC [10], and slightly higher HD values for the same set of OARs [10]. Slightly lower performance was found for the rectum compared to previously reported DSC, sDSC and HD values [10].

For both the left and right femoral heads, the DSC and sDSC were comparable to, and the HD slightly higher than, previously reported values [10]. The study found that the DSC values of RadFormation were lower and its HD values higher than those of the other tested AI systems for both femoral heads. The low DSC values, high HD values and large variation in the average DSC value compared with the other AI software were due to a difference in RadFormation’s contouring method, which delineated the femoral head only, while the other systems and the manual reference contours included a small portion of the femoral neck, as shown in Fig. 2.

There were several limitations in this study. Firstly, there were limitations in a few of the tested AI systems’ models: the Radiotherapy AI model was available only for the head and neck and brain regions, while the MIM model contoured only structures in male pelvis cases at the time of the study. Not long after the analysis for this study was performed, most AI systems updated their models to improve contouring quality and to offer additional structures. Due to the rapid development of the field, it was not feasible to reflect the up-to-date performance of all tested AI systems, so it must be noted that this study reflects only the specific version of each tested system stated in the method section. This implies that clinics, whether planning to implement or having already integrated an AI system, require a set of workflows or a tool to assess the AI system’s performance; this will be crucial for keeping pace with the rapid advancements in this field. Secondly, the sample size used may have been insufficient to provide adequate power for the statistical tests [30]. The sample size for some OARs was very small, with only four or five reference contours for the right submandibular gland and the stomach, so the statistical tests performed for data sets with fewer than five samples were ignored and denoted as ***** in supplementary data A, B and C. Thirdly, in a few cases, some software systems were not able to produce particular contours for every patient; for instance, Radiotherapy AI produced an incomplete contour of the left optic nerve, contouring only a single CT image slice, in case HN10. Fourthly, the manual contours considered as the reference during this study were contoured by only a single expert; using cross-validated contours would have better ensured the accuracy of the reference data. Lastly, Baroudi et al. [3] discussed that, for automated contours to be clinically accepted, AI systems need to be evaluated in multiple domains: quantitative evaluation of automated contours using geometric metrics; qualitative evaluation by the end users using Likert scales and Turing tests; dosimetric evaluation by assessing the impact on the dose to OARs and targets when automated contours are used in planning; and assessment of the improvement in clinical workflow efficiency when the AI system is used. This study exclusively conducted a quantitative evaluation of automated contours; as one of the main intentions of this study was to provide a starting point or guidance to other clinics considering implementing an AI system into their clinical workflow, additional forms of evaluation are planned as future work.

Conclusion

This study investigated the performance of multiple AI-based auto-contouring systems by performing quantitative comparisons. Each tested AI system was able to produce contours of organs at risk comparable to the expert’s contours, which implies that these contours can potentially be used clinically after experts’ assessment and QA of the system. This study has demonstrated a method of comparing contouring software options which could be replicated in clinics or used for ongoing quality assurance of purchased systems. Statistically significant differences between the AI systems’ performance in various cases were found, but the absolute differences between values were not large, which illustrates that all tested AI systems’ performance was comparable. Reduced performance of the AI systems for small and complex anatomical structures was found and reported, showing that it is still essential to review each contour produced by an AI system before clinical use.

Supplementary information There are nine supplementary files that contain all result sets collected during the study.

Supplementary files 1 and 2 contain box plots of all DSC (Supplementary data 1_DSC Box plot) and HD (Supplementary data 2_HD Box plot) data for each tested AI-based contouring system, and supplementary files 3 to 6 contain box plots of all sDSC data with different \(\tau\) values applied (0 to 3 mm) for each tested AI-based contouring system.

Supplementary file 7 (Supplementary data A_DSC) contains all results data for each tested organ at risk obtained from the method conducted in this study. Each tab with the name of the organ at risk tested has:

  • The table of calculated dice similarity coefficient

  • The scatter plot and box plot of data

  • Histogram, Q-Q plot and table of Shapiro-Wilk Test results

  • The table of statistical test results

Supplementary file 8 (Supplementary data B_HD) contains all results data for each tested organ at risk obtained from the method conducted in this study. Each tab with the name of the organ at risk tested has:

  • The table of calculated maximum Hausdorff distance

  • The scatter plot and box plot of data

  • Histogram, Q-Q plot and table of Shapiro-Wilk Test results

  • The table of statistical test results

Supplementary file 9 (Supplementary data C_sDSC) contains all results data for each tested organ at risk obtained from the method conducted in this study. Each tab with the name of the organ at risk tested has:

  • The table of calculated surface dice similarity coefficient with different \(\tau\) value applied (0 to 3 mm)

  • The scatter plot and box plot of data

  • Histogram, Q-Q plot and table of Shapiro-Wilk Test results

  • The table of statistical test results