Background

Digital image segmentation involves labeling each pixel or voxel so that the image is divided into regions sharing a common set of attributes. When applied to medical images, these segmentations may support surgical planning, promote patient empowerment, aid students in education through augmented or virtual reality visualization, provide input for three-dimensional (3D) printing, and serve as the initial step toward surgical simulators built on personal data [1,2,3]. Furthermore, segmentation may be implemented as a tool for diagnostic interpretation, allowing precise volume estimation and tissue localization in three spatial dimensions [4,5,6].

The diagnosis of knee injuries relies on a synthesis of information collected through injury history, imaging, and clinical examination. However, the most specific knee instability tests require numerous repetitions and training to achieve the necessary skills, and there are numerous examples in clinical practice of knee injuries being diagnosed years after the original incident [7]. Recent research has identified artificial intelligence (AI) as a tool to predict the need for overnight hospital stays in ligament surgery and total knee arthroplasty (TKA) [8]. AI has also contributed to new advanced treatments of meniscus injuries [9]. Orthopedic surgery made significant progress in the last century by adopting new techniques such as arthroscopy, and machine learning holds similar potential for a groundbreaking shift towards optimal treatment. The first step is to delineate normal anatomy and explore the potential of this tool to reveal knee pathology so that therapeutic handling can be initiated as early as possible.

Most commonly, the annotation of MR images involves manual labeling of grayscale image data. Although established semi-automated methods such as region growing, intensity thresholding, and logical operators make manual annotation more efficient, the process remains time-consuming and labor-intensive (a minimal example is sketched below). With the advancement of artificial intelligence and machine learning, the possibility of rapidly producing accurate automatic segmentations of medical images has emerged [4, 10]. In radiology, convolutional neural network (CNN) algorithms have proven ideally suited for image-based tasks such as segmentation, object detection, classification, and image generation, among others [10,11,12,13].
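
As a minimal illustration of the semi-automated methods mentioned above (a sketch, not code from this study; the file names, seed point, and threshold values are hypothetical placeholders), intensity thresholding and region growing can be combined with a logical operator using SimpleITK:

```python
# Semi-automated labeling of a single tissue: intensity thresholding,
# region growing from a manual seed, and a logical AND of the two masks.
import SimpleITK as sitk

image = sitk.ReadImage("knee_t1.nii.gz")  # grayscale MR volume (hypothetical file)

# Global intensity thresholding: voxels within [300, 800] become label 1.
threshold_mask = sitk.BinaryThreshold(
    image, lowerThreshold=300, upperThreshold=800, insideValue=1, outsideValue=0)

# Region growing from a manually placed seed voxel (x, y, z): keeps all
# connected voxels whose intensity falls within the same range.
region_mask = sitk.ConnectedThreshold(
    image, seedList=[(120, 140, 80)], lower=300, upper=800)

# Logical operator combining the two semi-automated results.
combined = sitk.And(threshold_mask, region_mask)
sitk.WriteImage(combined, "tissue_label.nii.gz")
```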

The performance of CNNs is not only related to the algorithms themselves but also depends on the image features and contextual information available in the datasets applied [14, 15]. Network performance is therefore strongly related to both the quality and the quantity of the grayscale images and of the manually annotated ground truth definitions [16, 17]. The majority of studies on deep learning segmentation of joints are based on datasets comprising protocols of either single or multiple channels of two-dimensional (2D) MR sequences [18,19,20,21,22,23], with slice thicknesses usually in the range of 1–3 mm. In addition, most datasets possess a limited number of anatomical classes defined in the ground truth, often narrowed down to bone, meniscus, and cartilage. Segmenting multiple tissues is important for a more complete anatomical visualization [20, 24]. For virtual and augmented surgical simulators, a complete knee segmentation furnishes realism and may aid in navigating anatomical landmarks [25,26,27,28,29]. Segmenting all the different tissues is often a demanding task due to morphological complexity, homogeneous intensities across tissue boundaries, and class imbalance in the data [30].

Rendering an anisotropic multislice 2D MR scan has the disadvantage of decreased through-plane resolution [31]. For a more realistic spatial representation of all anatomical structures in three dimensions, it is recommended to operate with isotropic voxels of reasonably high resolution [32, 33]. Our hypothesis is that a protocol consisting of multiple MRI weightings at superior resolution will increase the image features available to a neural network, leading to better results and more precise 3D models [1, 4, 10, 14, 15, 34]. To further extend the available contextual data, the ground truth extraction ought to be as complete as possible.

The main purpose of this work was to determine the performance of a convolutional neural network as a deep learning method to automatically segment the musculoskeletal anatomy of the human knee for visualization. The ultimate aim is to use the resulting segmentations to build 3D models for preoperative planning and to detect pathology.

Materials and Methods

Magnetic Resonance Imaging

The study design involved retrospective interpretation of prospectively acquired data. Imaging was performed on a 1.5 Tesla whole-body MR system (Philips Achieva, software release 3; Philips) fitted with gradients of 33 mT/m maximum strength and a maximum slew rate of 160 T m\(^{-1}\) s\(^{-1}\) (Nova gradients). A dedicated eight-channel knee coil was applied. The MR imaging protocol consists of four pulse sequences, T1 TSE, PD TSE, PD FS TSE, and Angio GE, all following a 3D sampling pattern. Compared with the standard clinical knee protocol applied daily at the imaging center, the segmentation protocol acquires four times the number of data points (voxels) per examination. The essential imaging parameters of the sequences are presented in Table 1.

Table 1 MR protocol imaging parameters. PD = Proton Density, FOV = Field of View, TSE = Turbo Spin Echo, TR = Repetition Time, TE = Echo Time, NSA = Number of Signal Averages, Refoc. ctrl = Refocusing control

The gradient echo-based angio sequence [35] was scanned in the transversal plane for maximum inflow effects. It was subsequently reconstructed into 400 slices in the sagittal orientation to match the exact geometrical features of the 3D TSE sequences. This specialized protocol requires approximately 40 minutes, compared to 15–20 minutes for the routine examination protocol. The imaging sequences T1, PD, FS, and Angio are presented as axial projections in Fig. 1.

Fig. 1

(a) Axial T1 TSE, (b) Axial PD TSE, (c) Axial PD FS TSE, (d) Axial Angio GE, (e) Manually annotated ground truth, (f) Volume view of the 3D imaging planes (TSE = Turbo Spin Echo, FS = Fat Saturated)

Sample selection and dataset

The right knees of 46 participants were scanned exclusively for this study. The sample size was based on experience gained during a pilot study conducted prior to this work [36]. All participants signed an informed written consent agreement regarding the handling of data, along with the standard MRI safety sheet. The inclusion criteria were adults with fused growth plates and without any known damage to the knee joint such as fractures, cartilaginous wear, ligamentous tears, or meniscal damage. We excluded two subjects due to excessive movement during the scan, three subjects whose images revealed apparent pathology, and one subject with significant wrap-around artifacts.

The included participants were divided into independent subgroups of 20, 5, and 15 for the training, validation, and test datasets, respectively. The mean ages of the volunteering participants in these subgroups were 36.7, 37.2, and 28.8 years (range: 21–75 years), and the ratios of men to women were 13:7, 2:3, and 2:3, respectively. The single pathology case was a 27-year-old woman. The research was approved by the Norwegian Regional Committee for Medical and Health Research Ethics (REK nr. 61225). The study was performed in accordance with the ethical standards of the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

The DICOM images were initially anonymized and converted to the NIfTI format. For the network to operate correctly, it is imperative that all four input channels of each individual scan are co-registered (a minimal verification sketch is given below). The ITK-SNAP [37] and NordicICE 4.2.0 (NordicNeuroLab) software packages were used to create the manual ground truth of the 40 healthy participants. The manual segmentations (ground truth) consist of 13 classes: bone medulla, posterior cruciate ligament (PCL), anterior cruciate ligament (ACL), cartilage, meniscus, cortical bone, collateral ligament, tendon, adipose tissue, artery, vein, nerve, and muscle, shown in Fig. 1e. The manual ground truth annotation was performed using selected images from each of the four MR sequences depending on the tissue to annotate, taking advantage of the inherent intensity and contrast characteristics of each sequence. For example, the PD FS sequence allowed easier annotation of cartilage, while the Angio sequence was mainly used for vessels. A feature of the annotation software made it possible to merge the individual tissue ground truths into a final multiple-tissue ground truth, which in turn reduced the overall manual annotation time. Ground truth annotation of all tissues segmented in this work can be performed using the T1 TSE sequence exclusively, but at the cost of considerably more manual annotation time. The annotations were crafted by a clinical medical physicist and validated by a radiologist, each with over 25 years of experience in their respective fields.
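
The following minimal sketch illustrates the co-registration requirement (an assumed workflow with hypothetical file names, not the authors' actual pipeline): it verifies that the four NIfTI channels of one scan share matrix size and affine before being stacked into a multichannel volume.

```python
# Verify that the four input channels of one scan are geometrically aligned.
import numpy as np
import nibabel as nib

channel_paths = ["t1.nii.gz", "pd.nii.gz", "pdfs.nii.gz", "angio.nii.gz"]
images = [nib.load(p) for p in channel_paths]

reference = images[0]
for img in images[1:]:
    # Co-registered channels must agree on matrix size and on the affine
    # mapping voxel indices to scanner coordinates.
    assert img.shape == reference.shape, "matrix size mismatch"
    assert np.allclose(img.affine, reference.affine), "affine mismatch"

# Stack into a single 4-channel volume with shape (x, y, z, channel).
multichannel = np.stack([img.get_fdata() for img in images], axis=-1)
print(multichannel.shape)
```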

Neural network

In recent years, several platforms facilitating the development of deep learning neural networks for medical imaging have emerged from the artificial intelligence community [10, 38]. One of these is NiftyNet [39], an open-source neural network platform based on the TensorFlow framework. We used the DenseVNet convolutional neural network (CNN) for automated knee MR image segmentation; the network architecture is presented in Fig. 2.

Fig. 2

The neural network architecture

We present the network hyperparameters for training and inference in Table 2. These parameters are detailed in the configuration manual of NiftyNet [40]. DenseVNet requires the spatial window size to be divisible by twice the dilation rate [41]. The dilation rate was set to the default value of 4, resulting in a mandatory patch size divisible by 8 (see the sketch below). A \(320\times 320\times 320\) patch is taken from each image, with a batch size of one, as the network input. The image volume is down-sampled from the input to \(160\times 160\times 160\) using a stride-2 convolution and average pooling. In the down-sampling path, three dense feature stack (DFS) blocks each output a skip connection through a stride-1 convolution, followed by a stride-2 convolution that reduces the spatial size. Finally, the skip connection channels of the second DFS block and the bottleneck layer channels were up-scaled to half the spatial size of the input shape. All the output channels (average pooling and skip connections) were concatenated and convolved to yield a single-channel output. The parametric rectified linear unit (PReLU) [42] was chosen as the activation function, and a combination of Dice loss and cross-entropy loss (DicePlusXEnt) [43] was applied as the loss function. A constant learning rate [44] of 0.0001 was used throughout training. Data augmentation was achieved through random flipping of axes, elastic transformations, and random scaling [45]. To maximize network performance, the patch size was kept as large as possible without exceeding the system video memory.
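
As a small sketch of the window-size rule above (illustrative only, not NiftyNet code), the following checks divisibility by twice the dilation rate and rounds an arbitrary size down to the nearest valid window dimension:

```python
# DenseVNet window-size constraint: each spatial window dimension must be
# divisible by 2 * dilation_rate (2 * 4 = 8 with the default rate used here).
def valid_window(size: int, dilation_rate: int = 4) -> bool:
    return size % (2 * dilation_rate) == 0

def round_down_to_valid(size: int, dilation_rate: int = 4) -> int:
    step = 2 * dilation_rate
    return (size // step) * step

assert valid_window(320)                # the 320^3 patch used in this study
assert round_down_to_valid(321) == 320  # nearest valid size below 321
```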

Table 2 Selected hyper parameters for training and inference

The following computer hardware components were acquired for training the network: AMD Ryzen 3900X 12-core CPU, Corsair 128 GB DDR4 memory, and a single NVIDIA TITAN RTX 24 GB GPU.

Validation metrics

The correctness of deep learning segmentation outputs can be measured using the confusion matrix, which consists of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). One of the most commonly used metrics is the Dice similarity coefficient (DSC) [46, 47]. The DSC measures the similarity between the manually segmented ground truth and the automated deep learning prediction, and is defined by

$$\begin{aligned} \text {DSC} = \frac{\text {2TP}}{\text {2TP+FP+FN}}. \end{aligned}$$

The Jaccard index is defined as the intersection of the AI prediction and the manual segmentation over their union, that is,

$$\begin{aligned} \text {Jaccard}= \frac{\text {TP}}{\text {TP} +\text {FP} + \text {FN}}. \end{aligned}$$

The Hausdorff distance (HD) measures the distance between the manual and AI-predicted segmentation surfaces. It is defined by

$$\begin{aligned} \text {HD}(A, B) = \max \left\{ \sup _{a \in A}\, \inf _{b \in B}\, d(a,b),\; \sup _{b \in B}\, \inf _{a \in A}\, d(a,b) \right\} , \end{aligned}$$

where A and B are the two surfaces, and sup and inf denote the supremum and infimum, respectively [48].
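
For reference, the three metrics can be computed for a single binary class as in the following sketch (an illustrative implementation over boolean numpy volumes, not the evaluation code used in this study; the Hausdorff distance here is taken over foreground voxel coordinates, in voxel units):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def confusion_counts(pred: np.ndarray, truth: np.ndarray):
    # TP, FP, FN for one binary class; TN is not needed by these metrics.
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    return tp, fp, fn

def dice(pred, truth):
    tp, fp, fn = confusion_counts(pred, truth)
    return 2 * tp / (2 * tp + fp + fn)

def jaccard(pred, truth):
    tp, fp, fn = confusion_counts(pred, truth)
    return tp / (tp + fp + fn)

def hausdorff(pred, truth):
    # Symmetric Hausdorff distance between the foreground voxel
    # coordinates of the two masks.
    a, b = np.argwhere(pred), np.argwhere(truth)
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
```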

We trained the deep learning network with different combinations of image channels to determine the effect of multiple MR weightings on the segmentation results. The network was trained for approximately 14 days, terminating at 25,000 iterations, for each combination of input channels. The objective of the validation dataset was to determine the optimal iteration at which training should be terminated to avoid overfitting. The mean Dice score over all label classes was determined for the validation dataset at intervals of one thousand iterations, and the iteration with the highest Dice score for each image channel combination was used to evaluate the test dataset (a sketch of this selection step follows below). The results obtained from validating the individual combinations are shown as Supplementary Fig. 1 in the Supplementary materials.
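
The selection of the optimal iteration can be sketched as follows (hypothetical validation scores sampled every 1,000 iterations; not the study's actual values or code):

```python
# Mean validation Dice over all 13 classes, keyed by training iteration.
val_scores = {1000: 0.62, 5000: 0.81, 10000: 0.90, 21000: 0.95, 25000: 0.94}

# Pick the iteration with the highest mean validation Dice; the test set
# is then evaluated once, at this iteration only, to avoid overfitting.
best_iteration = max(val_scores, key=val_scores.get)
print(best_iteration, val_scores[best_iteration])
```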

Results

Deep learning results

Our experiments indicate that the learning of the neural network varies with the combination of image channels, as shown in Fig. 3. The input channel combinations T1PDFS and T1FS perform better than T1 and PDFS. We recognize T1PDFSAngio as the best performing model in comparison with the other combinations of sequences. The T1PDFSAngio model attained optimal learning at 21,000 iterations with a mean Dice score of 0.967 (±0.032) averaged across all 13 classes. The results of the remaining image channel combinations can be found in Supplementary Table 1 of the Supplementary materials.

Fig. 3

Evaluation of the validation dataset by training different combinations of the respective image channels (T1, PDFS, T1PDFS, T1FS, T1PDFSAngio)

The test dataset was evaluated at the optimal iteration for each image channel combination model. We observed that the beneficial effect of combining image channels seen during validation was reinforced by the test dataset. The effect is particularly pronounced when training on all MR weightings, and the relation between the different image channel combinations is presented in Fig. 4a. For example, the hypo-intense appearance of vascular tissue typically seen in T1 images was insufficient for the network to learn arteries and veins, but adding the Angio GE sequence to the other sequences elevated the scores of vascular tissue.

Fig. 4

The Dice score evaluation distribution of T1PDFSAngio for the 15 subjects of the test dataset. (a) Different combinations of MRI input data channels, I: T1, II: T1FS, III: PDFS, IV: T1PDFS, V: T1PDFSAngio, versus averages of all classes of the test dataset. (b) The individual anatomical classes: 1. Bone medulla, 2. PCL, 3. ACL, 4. Muscle, 5. Cartilage, 6. Cortical bone, 7. Artery, 8. Collateral ligament, 9. Tendon, 10. Meniscus, 11. Adipose tissue, 12. Vein, 13. Nerve

The evaluation metrics of the test dataset for each of the 13 classes are summarized in Fig. 4b and Table 3. Averaged across all classes, the mean DSC (±standard deviation) was 0.967 (±0.040). The achieved Dice scores were higher and more stable for larger structures such as bone medulla, muscle, and fat, while the smaller structures scored lower. This is believed to be partly due to the class imbalance of the data. Additionally, the commonly low intensities of the ligaments, tendons, and meniscus might make these connective tissues indistinguishable to the network. Another possible explanation is that some subjects may have more or less degenerative tissue changes, resulting in false predictions.

Table 3 The average Dice scores, Jaccard indices, and Hausdorff distances (±standard deviation) of the test dataset evaluated at 21,000 iterations by the neural network model trained with all image channels combined

Pathology case

We chose the trained deep learning network model from the T1PDFSAngio input sequence combination and ran inference on the left knee image volume of an adolescent patient who had undergone ACL reconstruction surgery. The patient had a complicated displacement of the intra-articular graft preventing full extension, resulting in a mechanically locked knee [49]. The network prediction correctly revealed that the ACL was missing and visualized the tunnel left after drilling the femoral and tibial bone, shown as segmentations and a hologram in Fig. 5. The surgical graft visualized in the hologram was annotated manually on top of the neural network inference output.

Fig. 5

Pathology case: (a) Post-surgical proton density fat suppressed TSE (FS TSE) sagittal plane image showing the torn anterior cruciate ligament. (b) Predicted output by the neural network; the tibial drill hole and the intra-articular graft from the reconstructive surgery are visible. (c) 3D render of the segmentation. (d) Holographic 3D model representation

Discussion

Generally, neural network performance depends on the algorithms applied and on the input data, both quantitatively and qualitatively. In our study, 1600 slices were acquired per knee, divided across four pulse sequences. This is considerably more detailed anatomical coverage than most standard clinical protocols. The total protocol scan time used in this study was just under 40 minutes, reflecting that the main objective of this study was detailed anatomical visualization and segmentation based on isotropic data, not disease detection.

Other relevant research on deep learning with MR images of the knee joint has been carried out, most of it focused on cartilage segmentation and lesion classification [19,20,21,22,23]. The quantitative evaluation scores obtained in our study cannot be directly compared to the studies listed in Table 4, mainly because different datasets were used; those datasets contain subjects whose images demonstrate pathology, which is likely to significantly affect the results. One possible explanation for the improved evaluation metrics of the method used in this work is that the dataset described here is based on higher-resolution 3D scans of multiple weightings and a more complete annotation of the entire knee anatomy. This reasoning is justified given that both the quality and the quantity of input data affect the performance of this neural network configuration. The very high DSC could also be explained by the fact that the initial starting point for the ground truths was generated using a neural network and then thoroughly corrected by manually removing false-positive voxels and adding voxels where the network had made mistakes. However, because the same ground truths were used to evaluate all channel combinations, the means of finding the best fitting model would not be affected by this approach. The mean DSC calculated across all tissue classes may also be inflated because larger structures such as bone, muscle, and adipose tissue raise the average, even if the smaller structures score lower.

Table 4 Other work relevant to automated segmentation of knee structures [19,20,21,22,23]. DSC = Dice Score Coefficient, OAI = Osteoarthritis Initiative, DESS = Double Echo Steady State

This study had some noteworthy limitations. The network has been tested only on images acquired by a single-vendor 1.5 T scanner. The training dataset is limited to 20 subjects; a larger quantity of training data is believed to benefit the neural network in terms of performance. Furthermore, the pathology case was a post-surgical knee, and the ability to differentiate between normal anatomy and injured pathology will likely become more sensitive and specific with a broader spectrum of injured knees added to the training dataset. External validity may be reduced as the network is only tested on a limited number of healthy participants. Matching the predicted output against a validated ground truth is one of the most common methods of evaluating the accuracy of a neural network. However, the method may be debatable as the result is strongly dependent on the validated ground truth definition. A poorly defined validated ground truth consequently causes statistical uncertainty, mainly related to errors of omission and commission: structures labeled correctly by the neural network might not be defined in the validated ground truth, so a voxel counted as false positive during validation might in fact be true positive. Differential classification of a given tissue during manual annotation may in some instances be difficult and is prone to individual subjectivity, particularly in anatomical regions occupied by overlapping isointense voxels.

The obtained spatially realistic segmentations may be of clinical value for surgical planning, patient empowerment, and student education. Moreover, neural network-labeled outputs based on healthy subjects may be further annotated using appropriate software. Such extended datasets, including labels mimicking pathological lesions, may be reintroduced to the neural network and trained progressively. The segmented structures, as shown in Fig. 6, can easily be imported into a 3D-printing environment (a conversion sketch is given below). Additionally, they can be made viewable through augmented or virtual reality as inputs to a surgical simulator.
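
One possible conversion path (a hedged sketch with assumed tooling, scikit-image and trimesh; file names, label value, and voxel size are hypothetical, and this is not the authors' pipeline) extracts a surface mesh from one predicted class for 3D printing or AR/VR use:

```python
# Convert one class of a predicted label map into an STL surface mesh.
import nibabel as nib
import numpy as np
import trimesh
from skimage import measure

labels = nib.load("prediction.nii.gz").get_fdata()
mask = (labels == 1)  # hypothetical label value for one tissue class

# Marching cubes extracts a triangulated isosurface from the binary mask;
# `spacing` should be the voxel size so the mesh comes out in millimetres.
verts, faces, normals, _ = measure.marching_cubes(
    mask.astype(np.uint8), level=0.5, spacing=(0.6, 0.6, 0.6))

mesh = trimesh.Trimesh(vertices=verts, faces=faces)
mesh.export("tissue.stl")  # ready for a slicer or an AR/VR asset pipeline
```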

Fig. 6

Results of the test dataset in 3D: (a), (b) Bone, ACL, PCL, meniscus, and collateral ligaments, (c) Bone, muscles, ligaments, and veins, (d) Complete segmentation with transparent adipose tissue, (e) Cartilage, (f) ACL, (g) PCL, (h), (i) Meniscus

Conclusion

By introducing an optimized MR imaging protocol for depicting human knee anatomy, our study demonstrates that a convolutional neural network can achieve automatic semantic segmentation with high performance. The neural network predictions provide spatially realistic representations of the entire knee joint in three dimensions, well suited for visualization purposes.

Figures 5 and 6 are examples of the three-dimensional anatomical layout of the knee. We predict that in the future, orthopedists will have such models available before an operation, either 3D printed or as holograms. We also predict that the model's capability to detect pathology may become a useful tool for both radiologists and orthopedists.