Keypoints

  • 3D U-Net exhibits good performance for segmentation of normal pelvic bones.

  • A SCORE system was designed for the qualitative evaluation of segmentation.

  • It lays a foundation for the detection of pelvic bony metastases.

Background

Multiparametric magnetic resonance imaging (mpMRI) has previously demonstrated high sensitivity and specificity in diagnosing pelvic bone metastases [1]. As a well-recognised and widely used sequence in oncologic imaging, diffusion-weighted imaging (DWI) has been reported to be able to offer both qualitative (signal intensity) and quantitative (apparent diffusion coefficient [ADC] maps) information for lesion detection and characterisation [2,3,4]. Adverse bone events, such as pathological fracture and spinal cord compression, were often led by bone metastases [5, 6]. Therefore, timely and accurate detection of bone metastases on DWI and ADC images is of great significance in guiding patient care and assessing therapeutic benefits.

When radiologists interpret pelvic magnetic resonance imaging (MRI) images, bone metastasis location should first be determined, followed by size and ADC measurement of the metastatic foci. Thus, the initial step to achieve accurate bone metastases detection on DWI and ADC images requires accurate skeleton segmentation with their semantic labels. It is the first step in developing an automated method for quantifying skeletal metastatic tumour burden. Deep learning techniques lead the transformation of manual segmentation into automated segmentation [7,8,9]. For example, fully convolutional neural networks (CNNs) such as the U-Net model proposed by Ronneberger et al. [10] and the V-Net model proposed by Milletari et al. [11] have significantly increased the potential of automated image analysis to an unprecedented level.

Previously, several studies have reported CNNs for segmentation of normal bone structures on CT images and bone scans [12, 13], however, only a few studies of automatic segmentation of normal bone structures on MR images are available. Quantitative evaluation is a routine method for segmentation models, while qualitative evaluation has a better correlation with clinical practice [14]. Quantitative and qualitative evaluations are complementary, and together evaluate the segmentation model more completely [15]. The purpose of this study is to develop a 3D U-Net model to automatically segment different pelvic bony structures on DWI and ADC images—lumbar vertebra, sacrococcyx, ilium, acetabulum, femoral head, femoral neck, ischium, and pubis, which represent the most frequent sites of bone metastases for prostate cancer [16], and test its feasibility quantitively and qualitatively. This research may provide essential localisation information for subsequent research on pelvic bone metastases.

Materials and methods

This retrospective study was performed with permission from the local Institutional Ethical Committee. The need for written informed consent was waived.

Patients enrollment

For algorithm development, we retrospectively collected 288 patients who performed pelvic mpMRI scans for suspected prostate cancer at our institution between August 2018 and August 2019. The inclusion criteria included the following: (1) no sign of obvious bone structure abnormality, (2) DWI images with low (0 s/mm2) and high (800 or 1000 s/mm2) b values and ADC maps accordingly reconstructed, and (3) good image quality without obvious artefacts. The exclusion criteria included those with: (1) pelvic surgical history, and (2) bone diseases that occurred within the pelvis such as primary bone tumour and necrosis. In total, 264 patients remained enrolled for this study, with patients excluded for various reasons including poor image quality (n = 7), those with primary bone diseases such as degeneration, hemangioma and sarcoma (n = 13), and those with a pelvic surgical history (n = 4). Retrospectively, a further 60 consecutive pelvic mpMRI data were collected between January 2020 and March 2020 from our institution that satisfied the above inclusion criteria in order to provide external validation (5 patients were excluded due to poor image quality) (Fig. 1). All mpMRI data were de-identified before inclusion.

Fig. 1
figure 1

The workflow of patient enrollment and distribution

mpMRI data

A consecutive cohort of pelvic mpMRI data from our hospital was retrospectively collected to develop and test the algorithm used in this study. Details of the imaging parameters and techniques of the DWI sequence are summarised in Table 1. The ADC maps were calculated from two b values (0, 800 or 1000 s/mm2) using the corresponding scanner software.

Table 1 Imaging system and typical parameters

Algorithm development

The neural network model developed for the segmentation of pelvic bone structures on axial DWI and ADC images is a 3D U-Net [17], which replaced all the 2D operations of U-Net architecture with 3D counterparts (Additional file 1). Considering the 3D nature of bones, pelvic bone segmentation may be better performed using 3D U-Net, where 3D segmentation can utilise the inherent dependency between the spatial location of multiple slices [18] in contrast to 2D CNN which ignores 3D continuity of segmented bones between slices.

The developed CNN takes the combination of 264 DWI images (b = 800 or 1000 s/mm2) and 264 ADC images acquired from three vendors as input, and each image sequence was an independent input data. The CNN model was developed with one input channel. The 264 patients were randomly divided into either the training (n = 208), validation (n = 28) or testing (n = 28) sets with a ratio of 8:1:1. The independent dataset in the external validation set (n = 60) was used to evaluate further the performance and generalizability of the 3D U-Net model.

The training set was used to fit the classifier weights, and the validation set was used to tune the classifier architecture. The testing set was used to assess the fully specified classifier performance, and the external set was used to externally evaluate the model using data collected from various times (Fig. 1).

The original sizes of DWI images (acquired from three vendors) were 24 × 180 × 156 (z, y, x); 25 × 256 × 256; 24 × 224 × 224. All the input images were unified and resized to 64 × 256 × 256 (z, y, x) before training in order to maintain the optimal image features. All of the images were normalised of pixel intensity to [0, 1]. The CNN was coded by Python3.6, Pytorch 0.4.1, Opencv, Numpy, SimpleITK, and trained on the GPU NVIDIA Tesla P100 16G. A total of 300 epochs of training were performed. The Adam Optimizer was employed to minimise loss with a learning rate of 0.0001, a batch of size 2, and a binary cross-entropy loss function. Other hyperparameters tuning (such as weight initialisation and dropout for regularisation) were randomly searched and automatically executed in the validation set during U-Net development.

Manual annotation

Under the supervision of a board-certified radiology expert (with more than 15 years of reading experience), a radiology resident with three years of reading experience evaluated all mpMRI examinations and, section by section, manually labelled the pelvic bones (containing bone marrow and cortex) on DWI images with high b values. The labels were created with the following values: 1 = lumbar vertebra, 2 = sacrococcyx, 3 = ilium, 4 = acetabulum, 5 = femoral head, 6 = femoral neck, 7 = ischium, and 8 = pubis. A connected domain (CD) is defined as the part of a label area with a continuous structure in 3D space, and one label may contain multiple CDs. As shown in Fig. 2, a lumbar vertebral label is a single CD, while the ilium label contains two CDs (one on the left, and one on the right). Therefore, there are 14 CDs in the complete reconstructed pelvic bone, including one lumbar vertebra, one sacrococcyx, two ilia, two acetabula, two femoral head, two femoral necks, two ischia and two pubes.

Fig. 2
figure 2

Reconstruction of manual segmentations of pelvic bones. A multiclass label was created for the patients; 1 = lumbar vertebra, 2 = sacrococcyx, 3 = ilium, 4 = acetabulum, 5 = femoral head, 6 = femoral neck, 7 = ischium, 8 = pubis

Since DWI and ADC images were co-registered by the scanner (ADC maps were calculated from DWI images), the manually segmented labels on DWI images could be matched to the ADC maps. The supervisor reviewed all labels on ADC images that copied from DWI images and made corrections wherever necessary. Images were manually annotated with an image segmentation software (ITK-SNAP 3.6; Penn Image Computing and Science Laboratory, Philadelphia, Pa).

Quantitative evaluation

The Dice Similarity Coefficient (DSC) between CNN-predicted and manual segmentation was computed to evaluate the segmentation accuracy of DWI images and ADC maps quantitatively. The DSC was defined as the volume of overlap between the CNN-predicted and manual segmentation, given by:

$${\text{DSC}} = \frac{{2 \times (P \cap M)}}{{P + M}}$$

where P and M are the volume of voxels in the predicted segmentation and manual annotation, respectively, and P ∩ M is the volume of voxels that are consistent between the two methods. The DSC ranged between 0 and 1, with higher values indicating a better segmentation performance.

The volume calculation also quantified the accuracy of bone segmentation. The Bland–Altman method was used to assess the volume difference between CNN-predicted and manual segmentations.

Qualitative evaluation

A qualitative SCORE system established in this work was used to evaluate the CNN-predicted results at the CD level, which focuses on evaluating whether the predicted results meet the requirements for clinical application. The evaluation criteria of DWI images and ADC maps at the CD level are shown in Fig. 3. Given the different usage of DWI and ADC images in evaluating bone metastases (i.e. DWI images for the qualitative diagnosis, ADC maps for quantitative measurement) [2, 3], the evaluation criteria are slightly different. For example, if the location of the predicted CD overlaps the manually annotated CD (i.e. Condition A), and the range of the predicted CD is larger than the manual CD (A1), which is clinically acceptable segmentation on DWI images for which does not affect the detection and localisation of lesions. While if the range of the predicted CD is smaller than the manual CD (A2), that is clinically acceptable for ADC images, eliminating the interference of other tissues to ADC measurements.

Fig. 3
figure 3

The SCORE system and evaluation criteria on DWI images and ADC maps. According to the diagram of the SCORE system, the evaluation of the predicted segmentations on DWI images and ADC maps included three steps: firstly to calculate the DSC value of each label predicted by the model, then to judge whether the location of the predicted CD of the label overlaps the manually segmented CD, and finally to calculate the average scores of the labels according to the number of the predicted CDs that do not overlap with manual CD. Condition A means that the location of the predicted CD overlaps manual CD, and the range of the predicted CD is larger than (A1) or less than (A2) the manual CD, or partially overlaps with the manual CD (A3). CD connected domain, DSC dice similarity coefficient

The score of a single CD is between 0–10, where CDs with the same label were used to calculate the average label score, and the score at the patient level is the average value of all labels. A radiology expert (with more than 15 years of reading experience) assessed the predicted results according to the SCORE system. A resident (with three years of reading experience) also qualitatively evaluated the segmentation performance of the model to compare the consistency between readers.

Statistical analysis

One-way analysis of variance was performed to compare age and prostate-specific antigen (PSA) level (total PSA [T-PSA], free PSA [F-PSA], F/T-PSA) among four data sets (the training set, validation set, testing set and external validation set). A Kruskal–Wallis test was applied for the comparisons of these metrics among different vendors. Comparisons of DSCs between testing and external validation sets were performed using the Student’s t-test. One-way analysis of variance was applied to compare DSC values among different vendors, and the least significant difference (LSD) was used for post hoc multiple comparisons. To compare manual versus CNN-predicted bone volume, linear regression and Bland–Altman analyses were performed on both testing and external validation sets. The Student’s t-test was also used to compare the mean scores between DWI and ADC images. SCORE results between two readers were assessed using the single-measure intraclass correlation coefficient (ICC), with ICC > 0.75 considered an excellent concordance. A two-sided p < 0.05 was considered statistically significant.

Results

Patient demographics

The demographics of patients are presented in Table 2 and Additional file 2. There was no significant difference in age and PSA level (T-PSA, F-PSA, F/T-PSA) among the four data sets (Table 2, p > 0.05) and the three vendors (Additional file 2: Table S1, p > 0.05).

Table 2 Demographics of patients among different dataset

The different field-of-views among different MR scanners may result in an unequal number of labels and CDs in each dataset, where some patients may lack the section of pubis or lumbar vertebra. The distribution of different CDs among the datasets can be seen in Fig. 4, where it shows that all CDs of the bones had roughly equal distribution among the four datasets, confirming the network’s results are not biased.

Fig. 4
figure 4

The distribution of different connected domains. The distribution of the connected domains among four data sets shows that all bones had roughly equal distribution in different data sets

Segmentation accuracy of pelvic bones

DWI and ADC images were independently inputted to train the model. Also, both the DWI and ADC were predicted independently to get their segmentation. As shown in Table 3, the model achieved good segmentation accuracy of pelvic bones in both testing and external validation sets, attaining average DSC values of 0.80 ± 0.05 and 0.79 ± 0.06 on DWI images and 0.85 ± 0.04 and 0.84 ± 0.04 on ADC images. The average DSC values from ADC images were higher than those of DWI images (testing set: tDWI vs ADC = − 4.238, PDWI vs ADC < 0.001; external validation set: tDWI vs ADC = − 5.490, PDWI vs ADC < 0.001).

Table 3 The DSC values of 3D U-Net in pelvic bones segmentation on DWI and ADC images

Additionally, the DSC values among different vendors on external validation sets for DWI and ADC images were compared. As shown in Table 4, the images from vendor 1 (3.0 T Discovery) attained significantly higher DSC averages than vendor 2 (1.5 T Intera) and vendor 3 (1.5 T Avanto) (DWI images: FV1 vs V2 vs V3 = 22.405, PV1 vs V2 vs V3 < 0.001; ADC images: FV1 vs V2 vs V3 = 13.420, PV1 vs V2 vs V3 < 0.001).

Table 4 The DSC values of 3D U-Net in pelvic bones segmentation on external validation set grouped by vendors

Figure 5 shows the predicted section examples of eight bony structures (lumbar vertebra, sacrococcyx, ilium, acetabulum, femoral head, femoral neck, ischium and pubis) from DWI and ADC images in the external validation set (Fig. 5a, c) and the corresponding overlapping images with manual segmentations (Fig. 5b, d).

Fig. 5
figure 5

Examples of the comparison between CNN-predicted and manual segmentations. a Section examples of eight bones on DWI image; b The corresponding overlapping images between manual segmentation (white background) and CNN-predicted segmentation; c Section examples of eight bones on ADC image; d The corresponding overlapping images between manual segmentation (white background) and CNN-predicted segmentation

Volume calculation of pelvic bones

The overall bone volume between CNN-predicted and manual segmentation strongly correlated (DWI images in testing set: R2 = 0.94, Fig. 6a; DWI images in external validation set: R2 = 0.85, Fig. 6c; ADC images in testing set: R2 = 0.97, Fig. 6e; ADC images in external validation set: R2 = 0.94, Fig. 6g). When compared with the manual method, the automated CNN method slightly overestimated bone volume on both DWI images (testing set: mean bias = 4.5 cm3, 95% limits of agreement [LoA] were − 8.3 cm3 and 17.3 cm3, Fig. 6b; external validation set: mean bias = 2.6 cm3, 95% LoA were − 20.6 cm3 and 25.9 cm3, Fig. 6d) and ADC images (testing set: mean bias = 4.3 cm3, 95% LoA were − 4.0 cm3 and 12.6 cm3, Fig. 6f; external validation set: mean bias = 4.2 cm3, 95% LoA were − 9.7 cm3 and 18.1 cm3, Fig. 6h). The detailed volume difference between CNN-predicted and manual segmentations of pelvic bones is provided in Additional file 3.

Fig. 6
figure 6

Agreement of bone volume assessments between CNN-predicted and manual segmentations. a, c Linear regression and (b, d) Bland–Altman analysis of bone volume assessments from DWI images. e, g Linear regression and (f, h) Bland–Altman analysis of bone volume estimates from ADC images

SCORE results of pelvic bones

To identify if the qualitative segmentation results of the developed 3D U-Net model meet the requirement for clinical application, two readers scored every single CD on DWI and ADC images on testing and external validation sets (Fig. 7). A summary of the average scores at the patient level is provided in Table 5. The relatively high scores and excellent concordance between the two readers (ICC = 0.904, 95% confidence interval: 0.871–0.929) confirmed the feasibility of using the CNN clinically.

Fig. 7
figure 7

SCORE results on DWI and ADC images. a The range of the CNN-predicted segmentations of sacrococcyx (yellow label) and ilium (deep blue label) was slightly more extensive than the manual segmentations (white background), attaining DSC values of 0.92 and 0.88, respectively. According to the SCORE system, 10 on the DWI image and 8 on the ADC image were obtained. b The range of the CNN-predicted segmentation of lumbar vertebra (light blue label) was slightly smaller than manual segmentation (white background) with a DSC value of 0.82. According to the SCORE system, 8 on the DWI image and 10 on the ADC image were obtained

Table 5 The summary of patient scores on DWI and ADC images

Discussion

In this study, a 3D U-Net model was trained to segment normal bony structures on pelvic DWI and ADC images to provide localisation information for subsequent detection of pelvic bone metastases. This method has been successfully used for segmentation of the lung lobes on Computed Tomography (CT) scans—a task that solves similar challenges related to the localisation of lung tumours [18]. In this research, the 3D U-Net model achieved good segmentation performance on pelvic bones with high average DSC values on testing and external validation sets. The quantitative volume comparison between CNN-predicted and manual segmentations was highly correlated and in close agreement.

Both quantitative and qualitative evaluations were done to determine the value of the 3D U-Net model for clinical applications. Generally, quantitative evaluation indicators are horizontally comparable among different technical studies [19,20,21]. However, the specified qualitative evaluation is more important when faced with different clinical problems raised in clinical research, which yields the difference between basic and clinical research in this field. Taking the evaluation of this semantic segmentation (multiple areas of pelvic bones) as an example, the quantitative evaluation indicators include DSC values and volume comparisons between CNN-predicted and manual segmentations. However, considering the different usage of DWI and ADC images on bone metastases evaluation, identical quantitative results on DWI and ADC images may result in different evaluation criteria and clinical acceptability [2, 22, 23]. Besides, on ADC maps, the DSC of lumbar vertebra in the testing set was significantly larger than in the external set (t = 2.564, p = 0.012), while the difference has no clinical significance since the DSC values were large enough (both with > 0.80).

Undefined performance metrics adapted to clinical requirements represent one of the barriers to the clinical evaluation and adoption of fully automated segmentation methods [24]. At present, the majority of qualitative evaluations of automated medical image segmentations are mainly based on visual observation [25,26,27], and the lack of standardised criteria on visual observations can introduce some degree of variability and inconsistency between intra- and inter-readers. It is therefore vital to ensure uniformity on the qualitative evaluation standard. In this study, a SCORE system was formulated to promote standardisation in the qualitative evaluation of pelvic bone segmentation on DWI and ADC images, potentially diminishing estimation variability and increasing precision.

DWI was primarily used to detect bone metastases. When the prediction range is slightly more extensive than the manual segmentation (i.e., A1) on DWI, it is still acceptable for lesion detection. The ADC map was mainly used to measure ADC values. The ROI used for the ADC value measurement should be contained within bone structures. Therefore, the slightly smaller range of CNN prediction than the manual segmentation (i.e., A2) can meet the demand of ADC value measurement.

Notably, these results present a trend that the ADC images outperformed the DWI images in terms of the average DSC value in both the testing (PDWI vs ADC < 0.001) and external validation sets (PDWI vs ADC < 0.001), while the SCORE results showed no significant difference between DWI images and ADC images (p > 0.05). We believe this is due to greater variation in signal intensity of DWI images than in ADC images among the different MR units, resulting in a lower DSC value. However, considering the clinical usage of DWI images for lesion localisation, the lower DSC value due to overestimation is not necessarily a flaw according to the SCORE system thus allowing for bias. The segmentation performance on DWI and ADC images acquired from vendor 1 was better than vendors 2 and 3 (DWI images: PV1 vs V2 vs V3 < 0.001; ADC images: PV1 vs V2 vs V3 < 0.001), which may be due to more data from vendor 1 were acquired in the clinical practice and collected during model development. These results remind us that multiple and multi-vendor data are necessary for segmentation algorithm development instead of single-vendor algorithms that are not suitable for real clinical applications.

Comprehensive clinical research based on deep learning is usually divided into multiple sequential steps, with each step employing deep learning or traditional image processing methods [28, 29]. Regarding the segmentation of kidney stones on CT images, the kidney is firstly segmented by the deep learning method, after which the high-density stone is identified using the traditional threshold segmentation method [30]. The division of the complex clinical tasks can not only improve the acceptability of the model but can also save training resources, which highlights the value of the clinicians participating in model training. This research focused on one of the sequential tasks to detect pelvic bone metastases using the deep learning method, aiming to achieve the localisation of pelvic bones. An algorithm for detecting bone metastases is to subsequently develop and, finally, achieve automation of both detection and localisation of pelvic bone metastases.

Substantial diversities across mpMRI images (different patients and vendors) make automated segmentation challenging for real clinical applications. In this study, a 3D U-Net model was supplied with MR data collected from three independent vendors with various parameters. The performance demonstrated the high robustness of the model to different technical parameters and scanner types, which would greatly benefit patients who undergo different scans in MRI examinations during routine clinical care. Moreover, to simulate clinical application scenarios, independent and consecutive MR datasets were collected from different periods as a source for external validation and to evaluate the generalizability of this CNN.

There are several limitations to this research. Like all supervised learning techniques, this method relies heavily on manual annotation which gives rise to user variability. No images with metastasis have been included in the dataset, and this CNN was not tested on pelvic bone structures containing metastatic lesions. Thus its performance in the segmentation of such cases still needs to be verified, and the generalisation of the model to the presence of these lesions need further confirmation. The SCORE system for qualitative evaluation of segmentation established by a radiology expert at our institution may be subjective and limited, and may necessitate a consensus from several experts from multiple institutions for actual clinical research. Furthermore, the 3D U-Net model was only trained and validated with retrospective data, while prospective evaluation still needs to be conducted, particularly on benign bone abnormalities and common imaging artifacts. Multicentre data may be required before the algorithm can be deployed in a clinical workflow.

Conclusions

In conclusion, the presented 3D U-Net CNN can achieve automated and accurate segmentation of pelvic bones without metastases on DWI and ADC images acquired from different MR vendors. This work presents a promising step toward a highly desired automated mpMRI-based imaging methodology to detect skeletal metastases.