Background

Structure delineation is a necessary, yet time-consuming manual procedure in radiotherapy. Consistent and accurate delineation of organs-at-risk (OARs) and target structures for prostate patients is vital when performing dose escalation and treating patients with highly conformal plans [1]. Traditionally, computed tomography (CT) has been used for radiotherapy simulation and structure delineation [2]. In the last few decades, magnetic resonance imaging (MRI) has found its way into radiotherapy simulation, as it provides superior soft-tissue contrast compared to CT [3, 4] and thus enables more accurate delineation of target regions and critical structures [5-7].

The manual segmentation of anatomical structures is a time-consuming process [8]. Moreover, with the advent of MR-guided radiotherapy [9-11], the accuracy and speed of delineation have become the weakest link [12], hindering online adaptive radiotherapy by lengthening fraction times [13].

To automatically delineate targets and OARs for patients affected by prostate cancer, various methods have been developed over the past years. For example, three-dimensional (3D) deformable surface models [14], organ-based modelling [15], and atlas-based solutions [16, 17] have been demonstrated. For all these methods, the time required to perform segmentation is on the order of minutes, if not hours, which is too long for online adaptive treatments. To work around this limitation, in current online treatments only the target delineations and the OARs in the vicinity of the target (e.g. within a ring of 3-5 cm) are adjusted [18-20].

Recently, deep learning has been proposed to speed up and automate segmentation, with promising results [8, 21, 22]. Deep learning is a branch of artificial intelligence and machine learning that involves the use of neural networks to generate a hierarchical representation of the input data to achieve a specific task without the need for hand-engineered features [23, 24].

Many studies have focused on target delineation [8], reaching mean dice similarity coefficients against manual delineations in the range 0.82-0.95 [25-31]. Automatic delineation of OARs is also crucial to achieve fully online adaptive radiotherapy and to possibly save time on manual contouring.

In this study, we investigate the feasibility of convolutional neural network-based automatic OAR delineation on MRI. A preliminary retrospective study was conducted to select a suitable network architecture and prepare for clinical implementation. After selecting the most suitable convolutional network and implementing it clinically, we report the performance of automatic deep learning-based OAR delineation in our clinic.

Material and methods

Patient data collection

Patients diagnosed with intermediate and high-risk prostate cancer undergoing MR-only radiotherapy [32] in the period between June 2018 and January 2020 were included in the study. Further inclusion criteria were the presence of four gold fiducial markers for position verification and the absence of hip implants. The patients were also scanned with a specific radio-frequency spoiled gradient-recalled echo (SPGR) sequence that will be described in more detail further on. The clinical exclusion criteria for MR-only radiotherapy were: patients with more than four positive lymph-nodes (N1, as on PET-CT or after pelvic lymph-nodes dissection), life expectancy <10 years (as from WHO >3), prior pelvic irradiation, IPSS >20, presence of prostatitis, active Crohn’s disease, colitis ulcerosa or diverticulitis, an anastomotic bowel in the high dose region and patients undergoing trans-rectal prostate resection less than three months before treatment. After applying these exclusion criteria, a total of 150 patients were included in this study and treated with external beam radiotherapy.

For all patients, 3T MRI (Ingenia MR-RT, v 5.3.1, Philips Healthcare, the Netherlands) was acquired after requesting the patients to empty their bladder and drink 200-300 ml of water one hour before the acquisition. Patients were positioned on a vendor-provided flat table using a knee support cushion (lower extremity positioning system, without adjustable FeetSupport, MacroMedics BV, the Netherlands). Patients were tattooed at the MRI with the aid of a laser system (Dorado3, LAP GmbH Laser Applikationen, Germany) to facilitate treatment positioning. Also, MR-visible markers (PinPoint for Image Registration 128, Beekley Medical, USA) were used to identify the set-up location on MRI. MR images were acquired using anterior and posterior phased array coils (dS Torso and Posterior coils, 28 channels, Philips Healthcare, the Netherlands). Two in-house-built bridges supported the anterior coil to avoid skin contour deformation.

OARs were contoured on Dixon images [33] obtained with a dual-echo three-dimensional (3D) Cartesian radio-frequency SPGR sequence. For each patient, in-phase (IP), water (W), and fat (F) images [34] (Fig. 1) were reconstructed as in [35]. Dixon images were generated as part of a proprietary solution (MRCAT, rev. 257, Philips Healthcare, Finland) that enabled MR-based dose calculation for patients with prostate cancer [36, 37]. The imaging parameters, reported in Table 1, were locked by the vendor; therefore, they were stable through the whole study. Radiotherapy technicians (RTTs) with dedicated experience in contouring delineated bladder, rectum and femurs using IP, W and F Dixon images. The OARs delineations were approved or revised by a radiation oncologist. Besides, the radiation oncologist delineated the target structures. The delineation indications followed RTOG guidelines [38] requiring that the rectum was delineated from the outer part of the sphincter (anus) until the sigmoid fold (expected length of the rectum was 10-15 cm), as described in [39], with the sphincter delineated as a separate structure. The bladder was entirely delineated, while the femurs were delineated in the whole FOV of the image. In the case of regional radiotherapy, the bowel bag was also included.

Fig. 1

Transverse view of in-phase (IP), water (W) and fat (F) images for a patient (69 yo) diagnosed with T2b cancer. Note the large portion of void space surrounding the patient body. Cropping has been applied as preprocessing to remove such void regions

Table 1 Image parameters of the sequences used for the OARs contouring. The term FOV refers to the field-of-view, while AP refers to anterior-posterior and LR to left-right

Study design

The first 48 patients (treated until January 2019) were included in a feasibility study in which two state-of-the-art 3D convolutional networks, DeepMedic [40] and dense V-net (dV-net) [41], were trained (“Networks architecture, image processing and training” section). Three-fold cross-validation was performed, splitting the patients 32/16 into training and validation sets. The network hyperparameters were optimised on the first fold and maintained for the other two folds. For example, the number of epochs was chosen from the loss function on the validation set, applying early stopping when the loss did not decrease for five consecutive epochs.
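The split and the stopping rule described above can be illustrated with a minimal sketch. The function names, the contiguous assignment of patients to folds and the reading of "did not decrease" as "did not improve on the best validation loss so far" are assumptions made for illustration; they are not taken from the training code used in the study.

```python
def three_fold_splits(patient_ids, n_folds=3):
    """Yield (train, validation) lists; each fold holds out one third."""
    fold_size = len(patient_ids) // n_folds
    for k in range(n_folds):
        val = patient_ids[k * fold_size:(k + 1) * fold_size]
        train = [p for p in patient_ids if p not in val]
        yield train, val


def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs (assumed interpretation);
    if that never happens, the last epoch is returned.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


patients = list(range(1, 49))  # 48 hypothetical patient indices
splits = list(three_fold_splits(patients))
```

With 48 patients, each of the three folds trains on 32 patients and validates on the remaining 16, and every patient appears in exactly one validation set.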

The performance of the networks was compared against a research version of commercial software based on multi-atlases and deformable registration and against the clinically used delineations (“Evaluation” section).

This preliminary study enabled us to choose among the three automatic methods. The preferred approach was retrained on 97 patients imaged and treated until August 2019 and implemented for automatic use in the clinical workflow. The performance of the implemented model is reported for the 53 patients treated consecutively thereafter. A schematic overview of the study design is presented in Fig. 2.

Fig. 2

Schematic of the study design representing the timeline and the number of patients included. Also, the length and the number of patients for the preliminary study, the training of the final model and the patients used for testing the clinical implementation are reported

Networks architecture, image processing and training

Three-dimensional network architectures were chosen to investigate performance differences when the receptive field covers the whole volume or smaller patches. In particular, DeepMedic [40] was chosen for patch-based training, while dV-net [41] was chosen for training on whole volumes. The two architectures, which are described in detail in the next sections, required similar pre-processing. Three channels were used as input: IP, W and F images. The OARs considered as targets were bladder, rectum, right femur and left femur; they were encoded as non-overlapping masks with values from 1 to 4. To increase the amount of contextual information, the clinical target volume (CTV) was also encoded with a value of 5, which means that the networks also output the CTV. Note that the CTV was not considered in our study, given that CTV delineation is clinically based on a different MRI sequence, i.e. T2-weighted turbo spin-echo [42]. The networks were trained on a Tesla P100 graphical processing unit (GPU; NVIDIA Corporation, USA) with 16 GB of memory. To allow the whole volume to fit on the GPU, the IP, W and F images were first cropped by 90 voxels at the borders in the anterior-posterior and lateral directions, obtaining matrices of 348x348x120 voxels. Note that an observer verified the presence of the femurs within the FOV. Also, the image intensities of IP, W and F were clipped at their respective 99.9 percentile for each patient volume. Images were subsequently divided by the standard deviation (σ), and then a fixed value of 1 was subtracted.
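The cropping, clipping and rescaling steps can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the function name `preprocess`, the axis order (the two in-plane axes first, slices last), and the choice to compute the percentile and σ on the cropped, clipped volume are all hypothetical.

```python
import numpy as np


def preprocess(volume, crop=90, percentile=99.9):
    """Crop, clip and rescale one Dixon channel (IP, W or F).

    Assumed axis order: (anterior-posterior, lateral, slices).
    """
    # Remove `crop` voxels from both borders of the two in-plane axes.
    cropped = volume[crop:-crop, crop:-crop, :].astype(np.float32)
    # Clip intensities at the per-volume 99.9 percentile.
    upper = np.percentile(cropped, percentile)
    clipped = np.minimum(cropped, upper)
    # Divide by the standard deviation, then subtract a fixed value of 1.
    return clipped / clipped.std() - 1.0


# Toy volume with synthetic intensities, much smaller than a clinical scan.
rng = np.random.default_rng(0)
toy = rng.random((200, 200, 12))
out = preprocess(toy)
```

Each of the three input channels would be processed independently in this way before being stacked as network input.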

After training and inference of the networks, the delineations were post-processed generating four binary volumes. Morphological operations of closure and hole filling by one voxel were applied. The largest 3D connected region was selected for each delineated structure. These operations were performed to remove possible small-sized delineations that may have been found by the networks.
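A minimal sketch of this post-processing with `scipy.ndimage` is given below. It assumes the network output is an integer label map, uses SciPy's default cross-shaped structuring element for the one-voxel closing, and interprets hole filling as filling fully enclosed cavities; the function name `clean_structure` and these choices are illustrative and may differ from the operations applied in the study.

```python
import numpy as np
from scipy import ndimage


def clean_structure(label_map, label):
    """Return a cleaned binary mask for one structure of a label map."""
    mask = label_map == label
    # Morphological closing by one voxel (default 3D cross-shaped element).
    mask = ndimage.binary_closing(mask, iterations=1)
    # Fill cavities fully enclosed by the structure.
    mask = ndimage.binary_fill_holes(mask)
    # Keep only the largest 3D connected region.
    components, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = np.bincount(components.ravel())[1:]  # skip background (0)
    return components == (int(np.argmax(sizes)) + 1)


# Toy label map: a 5x5x5 cube with a one-voxel hole plus an isolated speck.
toy = np.zeros((20, 20, 20), dtype=np.uint8)
toy[5:10, 5:10, 5:10] = 1
toy[7, 7, 7] = 0
toy[15, 15, 15] = 1
cleaned = clean_structure(toy, label=1)
```

In the toy example the hole inside the cube is filled and the isolated voxel is discarded, which mirrors the stated purpose of removing small spurious regions.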

DeepMedic

The DeepMedic [40] implementation employed was provided by Kamnitsas et al. (footnote 1) in TensorFlow v1.7. The model employed a three-pathway architecture for multi-resolution processing of 3D patches. Low-, medium- and high-resolution pathways with receptive fields of \(85^{3}\), \(51^{3}\) and \(17^{3}\) voxels, respectively, were employed, with each pathway consisting of 11 layers. A fully connected network (FCN) was used for combining the pathways and post-processing, as presented by Kamnitsas et al. [40]. Note that the size of the receptive fields has been modified compared to the original implementation.

The training configuration was kept as the original, with learning rate = 0.001, Adam optimiser with momentum = 0.6, epochs = 35, batch size = 10 and \(\mathcal {L}_{1}\) and \(\mathcal {L}_{2}\) regularisations (footnote 2) weighted with factors 0.000001 and 0.0001, respectively. The configuration file is reported in the Supplementary Material. All the OARs were equally sampled during training, enforcing that the patches considered in each epoch contain each of the four OARs the same number of times. Also, as in Kamnitsas et al. [40], the volumetric dice similarity coefficient was adopted as the loss function. Data augmentation was applied in terms of random shifts and rescaling perturbations of the intensity (I) according to I = (I + s) · m, where s and m were Gaussian distributed with μ = 0 and 1 and σ = 0.05 and 0.01, respectively. For training, DeepMedic made use of about 9 GB of GPU memory.
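The intensity perturbation I = (I + s) · m can be written in a few lines. The sketch below assumes one shift and one scale factor are drawn per patch; whether DeepMedic draws them per patch or per channel is not stated here, and the function name `perturb_intensity` is hypothetical.

```python
import numpy as np


def perturb_intensity(image, rng):
    """Apply I' = (I + s) * m with one random shift and scale per call."""
    s = rng.normal(loc=0.0, scale=0.05)  # shift: mu = 0, sigma = 0.05
    m = rng.normal(loc=1.0, scale=0.01)  # scale: mu = 1, sigma = 0.01
    return (image + s) * m, s, m


rng = np.random.default_rng(42)
patch = np.ones((4, 4, 4))  # synthetic patch of normalised intensities
augmented, s, m = perturb_intensity(patch, rng)
```

Because the inputs are normalised by σ beforehand, a shift with σ = 0.05 and a scale with σ = 0.01 amount to small perturbations relative to the intensity spread of each volume.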

Dense V-net

The dV-net implementation provided in NiftyNet was employed (footnote 3). It consisted of a 3D U-Net with a sequence of three downsampling and dense upsampling feature strided stacks with skip connections to propagate higher resolution information to the final segmentation. Dilated convolutions were employed to reduce the number of features [41].

The training configuration was kept as the original, with learning rate = 0.001, Adam optimiser with momentum = 0.6, batch size = 6, \(\mathcal {L}_{2}\) regularisation (weight = 0.001) and epochs = 25. The configuration file is reported in the Supplementary Material. Dice was adopted as the loss function, and data augmentation was applied in terms of elastic deformation, as implemented within NiftyNet. For training, dV-net made use of about 16 GB of GPU memory.

Evaluation

Preliminary study

The first 48 patients, treated between June 2018 and January 2019, were included in a preliminary study to compare the performance of the two networks and the atlas-based approach against the delineations used during clinical treatment planning.

The advanced medical imaging registration engine (ADMIRE, research version 1.13.5, Elekta AB, Sweden) was the software considered; ADMIRE is based on multi-atlases [43, 44] and gradient-free dense mutual information deformable registration [45]. In particular, the rectum was delineated based on the F image, bladder and femurs were delineated based on IP images using an atlas of 9 patients that were previously acquired with the same sequence. ADMIRE took about 10 to 15 minutes to generate automatic contouring on a Tesla K20c GPU (NVIDIA Corporation, USA) with 6 GB of memory.

The performance of the three automatic approaches was evaluated against clinical delineations in terms of (volumetric) dice similarity coefficient (DSC), 95% boundary Hausdorff distance (HD95) [46] and mean surface distance (MSD). All the metrics were calculated using Plastimatch (footnote 4), except for the surface distance, which was calculated as in https://github.com/deepmind/surface-distance. In particular, violin plots [47] representing the mean, σ, 95% percentile and the probability distribution were obtained for the three metrics. Also, Wilcoxon signed-rank tests were conducted between the methods for each of the three evaluation metrics at a significance level of 0.05.
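For reference, the volumetric DSC between an automatic mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below illustrates this definition with NumPy; it is not the Plastimatch implementation used in the study, and the convention of returning 1 for two empty masks is an assumption made here.

```python
import numpy as np


def dice(auto_mask, ref_mask):
    """Volumetric DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    a = np.asarray(auto_mask, dtype=bool)
    b = np.asarray(ref_mask, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention chosen here: two empty masks agree fully
    return 2.0 * np.logical_and(a, b).sum() / denom


# Two synthetic 4x4x4 cubes, shifted by one voxel along the first axis.
ref = np.zeros((10, 10, 10), dtype=bool)
auto = np.zeros((10, 10, 10), dtype=bool)
ref[2:6, 2:6, 2:6] = True
auto[3:7, 2:6, 2:6] = True
score = dice(auto, ref)
```

In this toy case each cube contains 64 voxels and the overlap contains 48, so the DSC is 2 · 48 / 128 = 0.75.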

For a subset of 8 patients, an RTT with five years of experience in contouring scored the quality of the delineations for all three methods. The delineations were scored from zero to three, corresponding to clinically acceptable, small modifications needed, large modifications needed, or clinically unacceptable contours, respectively. In total, the RTT scored 96 delineations (8 patients × 4 OARs × 3 methods). The percentage of each score over all the contours was reported for the three methods and visualised in a pie chart. Also, the most challenging structures (structures with an average score ≥2) were reported for each method.

Clinical implementation

After a choice was made among the three automatic approaches, the best performing network was retrained on the first 97 patients, who were included up to August 2019. The hyperparameters were identical to those of the preliminary study. The network was implemented for clinical use complying with the medical device regulation (MDR 2017/745) (footnote 5). Quantitative evaluation was performed in terms of DSC, HD95 and MSD for the 53 consecutive patients undergoing MR-only radiotherapy from August 2019 to January 2020. The delineations adopted for clinical use, i.e. delineated by RTTs and approved or re-adjusted by a radiation oncologist, were considered as reference. Also, the surface dice similarity coefficient (SDSC) [48] was calculated (footnote 6) to enable comparison with previous work [49]. In addition, the performance of the clinically implemented network was compared with the performance of the same network obtained during the preliminary study.

Results

Timing performance

The inference time of the network was about 60 s for DeepMedic and approximately 4 s for dV-net using the full resolution images of 328x328x120 voxels on GPU. ADMIRE generated contours in approximately 14 min on GPU.

Preliminary study

Figure 3 represents the violin plots for DSC, HD95 and the MSD. One can observe that performances were higher for both the networks compared to ADMIRE. For the bladder, no significant differences were observed between the networks, but significant differences were observed between the networks and ADMIRE. For the rectum, no significant differences were observed among the three automatic methods. When considering the femurs, DeepMedic outperformed both dV-net and ADMIRE. For example, for the right femur, the mean (±σ) HD95 was 2.2 ±1.4, 2.5 ±1.8 and 3.2 ±1.4 mm for DeepMedic, dV-net and ADMIRE, respectively.

Fig. 3

Violin plots representing the mean (white dot), σ (black vertical rectangle), 95% percentile (black vertical line) and the probability distribution for the dice similarity coefficient (DSC, top), 95% Hausdorff distance (HD95, middle) and surface distance (bottom) for the OARs against clinical contours in the preliminary study. The statistical significance of the Wilcoxon signed-rank test is reported as well as the mean (±σ) of each metric. The asterisks represent p ≤0.05 (∗), p ≤0.01 (∗∗) and p ≤0.001 (∗∗∗)

The qualitative scoring by an expert RTT (Fig. 4) demonstrated that delineations from DeepMedic required the fewest adaptations, followed by dV-net and then ADMIRE. Specifically, over all the structures, the proportion of delineations that were acceptable or needed small adjustments was 81%, 59% and 3% for DeepMedic, dV-net and ADMIRE, respectively. For both networks, the rectum followed by the bladder were indicated as the most challenging structures, while for ADMIRE, the bladder followed by rectum and femurs (same scoring) were the structures considered as the most challenging (score ≥ 2).

Fig. 4

Pie chart reporting the percentage of the qualitative scoring performed by the expert RTT for each auto-segmentation method

Clinical implementation

On the basis of the preliminary analysis, we decided to implement DeepMedic for our clinic. Clinical implementation was performed in August 2019.

The performance of DeepMedic in the preliminary study and after clinical implementation is presented in Table 2. After retraining DeepMedic and testing on the subsequent patients, the performance slightly improved. For example, on average, DSC, HD95 and MSD improved by 0.01-0.03, 1.2-1.4 mm and 0.1-0.4 mm, respectively, after retraining the network on a larger set. Delineations obtained with DeepMedic for a patient in the test set are presented in Fig. 5.

Fig. 5

Example of in-phase MRI after cropping along with segmentations (OARs) obtained with DeepMedic (contours) versus clinical segmentations (filled contours) in the transverse (left), coronal (centre) and sagittal (right) view for a patient in the test set. This patient showed average performance, with DSC of 0.96, 0.86, 0.97 and 0.97 for bladder, rectum and the two femurs, respectively. Note that DeepMedic also outputs CTV, but it was not considered for clinical evaluation

Table 2 Comparison of performance between the preliminary study (PS) and after the clinical implementation (Clinic) for DeepMedic in terms of (volumetric) dice similarity coefficient (DSC), 95% Hausdorff distance (HD95) and mean surface distance (MSD)

Also, the SDSC was calculated for several thresholds, τ = 0.5, 1, 1.5, 2 and 3 mm, as reported in Fig. 6. For τ = 2 mm, the mean (±σ) SDSC was 0.98 ±0.03, 0.92 ±0.05, 0.989 ±0.008 and 0.997 ±0.003 for bladder, rectum, left and right femur, respectively.

Fig. 6

Boxplots for each structure of surface Dice similarity coefficient (SDSC) as a function of threshold (τ) for the 53 patients after clinical implementation. The data are plotted for the range of τ from sub-pixel (0.5 mm) to above the voxel size (3 mm). Box plots are shown with an inter-quartile range from 25 to 75% with the horizontal line representing mean value. Lower and upper whiskers represent the 2.5 and 97.5 percentiles

Discussion

The use of MRI for prostate radiotherapy delineation is becoming increasingly common among radiotherapy departments [50]. MRI is also used to plan radiotherapy [32, 51]. Moreover, the use of MRI is accelerated by advances in linear accelerator technology that make daily MR imaging in treatment position possible [9-11].

In this study, we demonstrated that deep learning-based approaches can utilise MRI to automatically segment OARs achieving high conformality. Also, a convolutional network has been implemented for clinical use, demonstrating the capability to maintain the performances obtained in a preliminary study.

Table 3 compiles previous work based on the use of convolutional networks and a selection of conventional approaches [16, 17, 52] for OARs delineation in the pelvic area. One can notice that CT-based segmentation [53-55] achieved mean DSC in the range 0.88-0.95 for prostate, rectum and bladder. Also, MRI-based segmentation [27, 49, 56] achieved mean DSC in the range 0.82-0.95. This study seems to outperform previous studies in almost all the metrics (in bold in the Table) except for the rectum, as obtained by Kazemifar et al. [54], and the HD95 and MSD as obtained by Kazemifar et al. and Dong et al. [56]. Comparing the results of automated contouring methods should be done with caution. For example, the guidelines used for clinical delineation may be different, and the impact of inter-observer variability on deep learning-based methods is not generally investigated [57]. In this sense, our study is novel given that a comparison of approaches based on CNNs to an atlas-based method is presented.

Table 3 Overview of the performance of automatic OARs delineations based on MRI and CT subdivided in convolutional network-based and conventional approaches. The number of patients included in the study (Pts), the imaging modality, a brief description of the method and metrics as dice similarity coefficient (DSC), 95% boundary Hausdorff distance (HD95) and mean surface distance (MSD) were reported for each study. HD95 and MSD are expressed in mm

In this study, a qualitative assessment by a manual observer has been presented. Unfortunately, it has not been recorded whether the overall time for the delineation has been reduced. Previous studies investigated this aspect [58] when introducing deep learning-based techniques in their clinic. Also, it is unclear whether the performance of the network may further improve when a dataset larger than 97 patients is used for training. This may be an object of future research.

The time necessary for automatic delineation on the full FOV is within a minute. Such a time-scale can be of interest for conventional radiotherapy and for MR-guided treatments. On the one hand, for conventional radiotherapy, fast automatic OAR segmentation may help reduce delays in the start of treatment, which may otherwise compromise clinical outcomes [59]. On the other hand, for online adaptive MR-guided radiotherapy, fast OAR segmentation may relieve clinicians from dedicating effort to OAR segmentation while facilitating the delineation of the target [60]. Currently, it has been reported that about 5-10 min is necessary for delineation in an online setting [19]. The time frame reported in our work may facilitate online adaptive radiotherapy, especially with an integrated automatic workflow.

Conclusion

High conformality for OARs delineation was achieved with two in-house trained networks, obtaining a significant speed-up of the delineation procedure. One of the networks, DeepMedic, was successfully adopted in the clinical workflow maintaining in the clinical setting the accuracy obtained in the feasibility study conducted before clinical implementation.