Deep learning-based high-accuracy detection for lumbar and cervical degenerative disease on T2-weighted MR images

To develop and validate a deep learning (DL) model for detecting lumbar degenerative disease in both sagittal and axial views of T2-weighted MRI and to evaluate its generalized performance in detecting cervical degenerative disease. T2-weighted MRI scans of 804 patients with symptoms of lumbar degenerative disease were retrospectively collected from three hospitals. The training dataset (n = 456) and internal validation dataset (n = 134) were randomly selected from center I. Two external validation datasets comprising 100 and 114 patients were from center II and center III, respectively. A DL model based on 3D ResNet18 and transformer architecture was proposed to detect lumbar degenerative disease. In addition, a cervical MR image dataset comprising 200 patients from an independent hospital was used to evaluate the generalized performance of the DL model. The diagnostic performance was assessed by the free-response receiver operating characteristic (fROC) curve and precision–recall (PR) curve. Precision, recall, and F1-score were used to measure the performance of the DL model. A total of 2497 three-dimensional degeneration annotations were labeled for training (n = 1157) and multicenter validation (n = 1340). The DL model showed excellent detection performance in the internal validation dataset, with F1-scores of 0.971 and 0.903 on the sagittal and axial MR images, respectively. Good performance was also observed in external validation dataset I (F1-score, 0.768 on sagittal MR images and 0.837 on axial MR images) and external validation dataset II (F1-score, 0.787 on sagittal MR images and 0.770 on axial MR images). Furthermore, the robustness of the DL model was demonstrated via transfer learning and generalized performance evaluation on the external cervical dataset, with F1-scores of 0.931 and 0.919 on the sagittal and axial MR images, respectively.
The proposed DL model can automatically detect lumbar and cervical degenerative disease on T2-weighted MR images with good performance, robustness, and feasibility in clinical practice.


Introduction
Intervertebral disc degeneration was first described by Dexler in 1896 [1]. It is now a worldwide health problem associated with enormous medical and social costs [2]. Degeneration occurs most commonly in the cervical and lumbar intervertebral discs. In clinical practice, magnetic resonance imaging (MRI) is the best noninvasive method for assessing degenerative discs. However, MR imaging has several pitfalls. Some benign lesions and normal variants can be confused with more severe pathologies. Intervertebral disc prolapse sometimes mimics infective spondylitis or vertebral osteophytes. A sequestrated disc fragment can easily be mistaken for a neurogenic tumor, epidural synovial cyst, epidural hematoma, or conjoined nerve root [3].
In the last decade, there has been a massive increase in the use of artificial intelligence (AI) and machine learning (ML) technologies for auxiliary clinical diagnosis. These technologies show excellent image recognition ability for the intervertebral disc, and they can improve diagnostic accuracy, mark critical information, and reduce human error.
Previous reports show that deep learning approaches can achieve good performance in identifying, diagnosing, and grading lumbar degenerative diseases. The accuracy of some classification methods is comparable to that of radiologists [4][5][6][7]. However, most of these studies used only limited data from a single center, so the robustness and generalization ability of the algorithms could be unreliable. In addition, many of them still require manual segmentations as input, which is expensive and time-consuming and may introduce subjective bias.
In this multicenter study, we proposed a deep learning model for detecting degenerative discs in both sagittal and axial views of T2-weighted MRI. As an exploration of the application of deep learning for detecting lumbar degenerative disease, this work also assesses the robustness and generalization ability in detecting cervical degenerative disease on an independent dataset via a transfer learning approach.

Patient enrollment
Patients who underwent MR scans from Jan 2019 to Jan 2020 at the center I hospital, from July 2021 to Oct 2021 at the center II hospital, and from May 2021 to July 2021 at the center III hospital were retrospectively reviewed. The inclusion criterion was a diagnosis of lumbar intervertebral disc prolapse. Patients were excluded if (1) they had received any treatment (surgery or chemoradiation) before the MR scan; (2) they had been diagnosed with neurogenic tumors, epidural synovial cysts, epidural hematomas, or infective spondylitis; or (3) MR images could not be obtained or interpreted. Ultimately, 590 patients from center I, 100 patients from center II, and 114 patients from center III were enrolled in our study. The patients from center I were randomly divided into a training dataset (n = 456) and an internal validation dataset (n = 134). The patients from center II and center III were used as independent external validation datasets.
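The random division of the center I cohort can be illustrated with a minimal sketch. The patient identifiers, the fixed seed, and the function name `split_patients` are assumptions made for illustration; the study does not describe its randomization procedure. The sizes follow from the enrollment numbers: 590 − 456 = 134 patients remain for internal validation, matching the abstract.

```python
import random


def split_patients(patient_ids, n_train, seed=0):
    """Randomly split patient IDs into training and internal validation sets."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed is an assumption, used for reproducibility
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]


# Hypothetical identifiers for the 590 patients enrolled at center I.
center_one = [f"C1-{i:03d}" for i in range(590)]
train_ids, internal_val_ids = split_patients(center_one, n_train=456)
print(len(train_ids), len(internal_val_ids))
```

Splitting at the patient level, as here, keeps all images of one patient in the same subset.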
The scanning parameters of MR for the patients from different hospitals are listed in Table 1.

Pre-processing of MR images
The axial and sagittal views of MR images were collected from the picture archiving and communication system (PACS) and stored on a workstation in Digital Imaging and Communications in Medicine (DICOM) format for annotation and further analysis. The low-frequency intensity nonuniformity present in MR images was corrected with the N4 bias field correction algorithm [8]. The lumbar degenerative discs were manually labeled as ground truth in consensus by two radiologists with more than ten years of experience. When the two radiologists disagreed, a senior expert with more than 20 years of experience in musculoskeletal imaging was consulted for the final decision. Pixel interpolation was used to resample the original sagittal-view and axial-view MR images to volumes of 12 × 256 × 256 and 24 × 256 × 256 voxels, respectively. In total, 8630 annotations of 2497 lumbar degenerative discs from 804 patients were labeled as ground truth. To improve the training and robustness of the convolutional neural network, data augmentation techniques, including random flipping, random rotation, contrast perturbation, Gaussian random noise, and nonlinear brightness transformation, were also applied [9].
We modified the original ResNet18 network into a 3D version according to our data characteristics, and the channels between each layer were reduced to 16-32-64-128-192 because of the parameter redundancy in 3D networks. Transformer-based multi-modality cross attention was also applied to enhance the interaction between the two MR modalities and better exploit multi-modal paired attention. The number of heads in the multi-head attention was set to eight [10].
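Some of the augmentation steps listed in this section can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it covers only random flipping, contrast perturbation, and Gaussian noise (random rotation and nonlinear brightness transformation are omitted), and the function name `augment` and all parameter values are assumptions.

```python
import random


def augment(volume, rng, flip_prob=0.5, noise_sigma=0.02, contrast_range=(0.9, 1.1)):
    """Apply a random flip, contrast perturbation, and Gaussian noise to a 3D volume.

    `volume` is a nested list indexed [slice][row][column] with intensities in [0, 1].
    All default parameter values are illustrative assumptions, not study settings.
    """
    # Random left-right flip: reverse every row of every slice.
    if rng.random() < flip_prob:
        volume = [[row[::-1] for row in sl] for sl in volume]
    # Contrast perturbation: scale intensities around the volume mean.
    voxels = [v for sl in volume for row in sl for v in row]
    mean = sum(voxels) / len(voxels)
    gain = rng.uniform(*contrast_range)
    # Add Gaussian noise to each voxel, then clip back to [0, 1].
    return [
        [
            [min(1.0, max(0.0, mean + gain * (v - mean) + rng.gauss(0.0, noise_sigma)))
             for v in row]
            for row in sl
        ]
        for sl in volume
    ]


rng = random.Random(42)
# Toy 3 x 4 x 4 volume standing in for a resampled MR series.
toy = [[[(z + y + x) / 10.0 for x in range(4)] for y in range(4)] for z in range(3)]
augmented = augment(toy, rng)
```

In a real pipeline the same geometric transform would also have to be applied to the annotation boxes so that labels stay aligned with the image.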

Development of the deep learning model
The proposed DL model was trained with a unified loss function that generalizes both Dice loss and focal loss. The weights of the hidden layers were randomly initialized, and the batch size and initial learning rate were set to 4 and 0.0001, respectively. AdamW, a variant of Adam that decouples weight decay from the gradient update, was used as the optimizer in the training stage owing to its fast convergence [11]. Training was terminated when the loss on the validation set stopped decreasing, and the maximum number of epochs was set to 300.
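To make the two ingredients of the loss concrete, the sketch below computes a Dice loss and a focal loss on flat lists of predicted probabilities and combines them as a weighted sum. This is a simplification under stated assumptions: the weighted sum, the mixing weight of 0.5, and the focusing parameter gamma = 2 are illustrative choices, and the study's unified loss may combine the terms differently.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss: 1 - 2|P ∩ T| / (|P| + |T|)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Binary focal loss: down-weights examples that are already well classified."""
    total = 0.0
    for p, t in zip(probs, targets):
        pt = p if t == 1 else 1.0 - p      # probability assigned to the true class
        pt = min(max(pt, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -((1.0 - pt) ** gamma) * math.log(pt)
    return total / len(probs)


def combined_loss(probs, targets, weight=0.5):
    """Weighted sum of Dice and focal loss (the weight is an assumed value)."""
    return weight * dice_loss(probs, targets) + (1.0 - weight) * focal_loss(probs, targets)


targets = [1, 1, 0, 0]
good = combined_loss([0.9, 0.8, 0.1, 0.2], targets)  # confident, correct predictions
bad = combined_loss([0.4, 0.3, 0.6, 0.7], targets)   # mostly wrong predictions
```

The Dice term rewards overlap between prediction and annotation regardless of class balance, while the focal term concentrates the gradient on hard examples, which is why the two are often paired when positives are rare.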
The supervised training process of the DL model was performed on a computer with a Core i7-6700K 4.00-GHz central processing unit (Intel, Santa Clara, Calif), 32 GB memory, and a GeForce GTX 2070 graphics processing unit (NVIDIA, Santa Clara, Calif). The DL model was developed

Transfer learning for the detection of cervical degenerative disease
To investigate the generalizability of our DL model in the detection of common vertebral lesions, we also applied a transfer learning framework to the detection of cervical degenerative disease. In detail, the convolutional layers of the DL model were frozen and transferred into the new model, while the fully connected layers (softmax layer) were retrained with randomly initialized parameters on top of the transferred convolutional layers. The newly initialized DL model then took the original image bottlenecks as input and was retrained to detect degenerative spine lesions. We collected an external cervical MR image dataset containing the sagittal-view and axial-view MR images of 200 patients with cervical degenerative disease in the center I hospital from Oct 2021 to Jan 2022. The inclusion criterion was a diagnosis of cervical intervertebral disc prolapse. Patients were excluded if (1) they had received any treatment (surgery or chemoradiation) before the MR scan; (2) they had been diagnosed with neurogenic tumors, epidural synovial cysts, epidural hematomas, or infective spondylitis; or (3) MR images could not be obtained or interpreted. The annotation and pre-processing of MR images were the same as those described above for the patients with lumbar degenerative disease.
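The freeze-and-retrain idea can be shown on a toy problem. In this sketch a fixed linear map with ReLU stands in for the frozen convolutional layers, and a logistic head with randomly initialized weights is retrained on the resulting "bottleneck" features. Everything here (the function names, the toy data, the learning rate, and the number of epochs) is a hypothetical illustration and not the authors' network or training setup.

```python
import math
import random


def frozen_features(x, weights):
    """Stand-in for the frozen convolutional layers: fixed linear map plus ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def predict(head, feats):
    """Logistic output of the retrainable head."""
    z = head["b"] + sum(w * f for w, f in zip(head["w"], feats))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(head, data, weights):
    """Mean binary cross-entropy of the head on top of the frozen features."""
    total = 0.0
    for x, y in data:
        p = min(max(predict(head, frozen_features(x, weights)), 1e-7), 1.0 - 1e-7)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(data)


def retrain_head(data, weights, epochs=200, lr=0.1, seed=0):
    """Train only the head; the feature-extractor weights are never updated."""
    rng = random.Random(seed)
    head = {"w": [rng.uniform(-0.1, 0.1) for _ in weights], "b": 0.0}
    for _ in range(epochs):
        for x, y in data:
            feats = frozen_features(x, weights)
            err = predict(head, feats) - y  # gradient of the log-loss w.r.t. the logit
            head["w"] = [w - lr * err * f for w, f in zip(head["w"], feats)]
            head["b"] -= lr * err
    return head


pretrained = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical "lumbar-trained" weights
snapshot = [row[:] for row in pretrained]  # copy, to confirm they stay frozen
cervical_toy = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]

untrained_loss = log_loss({"w": [0.0, 0.0], "b": 0.0}, cervical_toy, pretrained)
head = retrain_head(cervical_toy, pretrained)
trained_loss = log_loss(head, cervical_toy, pretrained)
```

Because only the small head is optimized, far fewer parameters have to be estimated from the new dataset, which is the usual motivation for this approach when labeled data are limited.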

Model performance evaluation
To evaluate the capacity of the DL model for detecting the lumbar degenerative disc in the validation dataset, a precision-recall curve was plotted to show the precision-recall pairs for different probability thresholds. The precision is the ratio TP/(TP + FP), and the recall is the ratio TP/(TP + FN), where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. A TP was defined as a correct detection of the lumbar degenerative disc. An FP was defined as a wrong prediction of the lumbar degenerative disc. An FN was defined as a missed detection of the lumbar degenerative disc. Since there were no true negatives in the lumbar degenerative disc detection task, the ROC curve and specificity were not applicable in this study. Therefore, the free-response receiver operating characteristic (fROC) curve was used to evaluate the comprehensive performance of the DL model. The F1 score, the harmonic mean of precision and recall, was also used to measure the performance of our DL model and was defined as follows: F1 = 2 × Precision × Recall/(Precision + Recall).
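The definitions above translate directly into code. The sketch below computes precision, recall, and F1 from detection counts and then builds precision-recall pairs by sweeping a confidence threshold. The counts and scored detections are hypothetical, the function names are illustrative, and the sketch assumes each true-positive detection matches a distinct annotated disc.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def pr_pairs(scored_detections, n_ground_truth, thresholds):
    """Precision-recall pairs for a list of (confidence, is_true_positive) detections."""
    pairs = []
    for thr in thresholds:
        kept = [hit for score, hit in scored_detections if score >= thr]
        tp = sum(kept)               # detections kept that match an annotated disc
        fp = len(kept) - tp          # detections kept that match nothing
        fn = n_ground_truth - tp     # annotated discs left undetected
        precision, recall, _ = detection_metrics(tp, fp, fn)
        pairs.append((thr, precision, recall))
    return pairs


# Hypothetical counts, not results from the study.
precision, recall, f1 = detection_metrics(tp=90, fp=10, fn=10)

# Hypothetical detections for a case with four annotated discs.
detections = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.30, False)]
curve = pr_pairs(detections, n_ground_truth=4, thresholds=[0.5, 0.8])
```

Raising the threshold from 0.5 to 0.8 discards the one false positive but also one correct detection, which is the precision-recall trade-off the PR curve visualizes.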

Statistical analysis
The Mann-Whitney U test was used to evaluate the differences in the numerical variables across different categories. Differences in the dichotomous variables were calculated with the Chi-squared test. A two-sided p value less than 0.05 was considered statistically significant. All analyses were performed using SPSS for Windows (version 26.0, IBM).
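As an illustration of the Chi-squared test for a dichotomous variable, the sketch below computes the Pearson statistic for a 2 × 2 table. The example table uses the male and female counts that the paper reports for the training and internal validation datasets, but the comparison itself is shown only as a worked example and is not a result reported by the authors. The analyses in the study were run in SPSS; this hand-written version applies no continuity correction, so its output may differ slightly from SPSS output that does.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rows: training, internal validation. Columns: male, female.
sex_table = [[268, 188], [67, 67]]
stat, p_value = chi_squared_2x2(sex_table)
significant = p_value < 0.05  # two-sided significance level used in the study
```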

Study population characteristics
The flowchart of patient enrollment is presented in Table 2. A total of 268 males and 188 females were assigned to the training dataset, with an average age of 43.4 years (range 12-86 years). There were 67 males and 67 females in the internal validation dataset with an average age of 48.0 years (range 12-79 years), 51 males and 49 females in external validation dataset I with an average age of 52.7 years (range 16-84 years), and 55 males and 59 females in external validation dataset II with an average age of 59.5 years (range 17-88 years). The total number of labeled lumbar degenerative discs in the training dataset, internal validation dataset, external validation dataset I, and external validation dataset II was 1157, 395, 194, and 144, respectively. There were 607 labeled cervical degenerative regions from 128 males and 72 females with an average age of 53.9 years (range 23-82 years) in the external cervical MR image dataset.

Model performance for detecting lumbar degenerative disc
The DL model showed favorable performance in lumbar degenerative disc detection in the internal validation dataset, with the F1 score and the area under the fROC curve reaching 0.971 (95% CI 0.951-0.987) and 0.968 (95% CI 0.943-0.994) for the sagittal-view MR images, and 0.903 (95% CI 0.870-0.935) and 0.896 (95% CI 0.870-0.932) for the axial-view MR images, respectively. The DL model also showed promising detection capability in external validation dataset I, where the F1 score and area under the fROC curve were 0.768 (95% CI 0.693-0.838) and 0.764 (95% CI 0.691-0.837) for the sagittal-view MR images, and 0.837 (95% CI 0.762-0.895) and 0.808 (95% CI 0.636-0.880) for the axial-view MR images. Similar performance was observed in external validation dataset II, with the F1 score and area under the fROC curve reaching 0.787 (95% CI 0.723-0.855) and 0.732 (95% CI 0.648-0.817) for the sagittal-view MR images, and 0.770 (95% CI 0.678-0.844) and 0.721 (95% CI 0.640-0.802) for the axial-view MR images, respectively. The detailed precision and recall of the DL model in these multicenter validation datasets are summarized in Table 3. The fROC curve is shown in Fig. 2, and the precision-recall curve is shown in Fig. 3.

Transfer learning performance on cervical degenerative disease
In the evaluation of model robustness and generalization on the external cervical MR image dataset, the DL model achieved an F1 score of 0.931 (95% CI 0.907-0.955), with a precision of 0.974 (95% CI 0.953-0.991) and a recall of 0.893 (95% CI 0.857-0.929), for the sagittal MR images, and an F1 score of 0.919 (95% CI 0.891-0.944), with a precision of 0.942 (95% CI 0.909-0.973) and a recall of 0.897 (95% CI 0.858-0.934), for the axial MR images. The areas under the fROC curve were 0.911 (95% CI 0.873-0.950) for the sagittal MR images and 0.882 (95% CI 0.839-0.925) for the axial MR images. These results demonstrated the generalization capability of our DL model, which had good overall accuracy for cervical degenerative disease detection. The fROC curve and the precision-recall curve of the DL model in the external cervical MR image dataset are shown in Fig. 4.

Discussion
The aging of society has resulted in a significant increase in spinal imaging over the past decades, owing to both the growing burden of spinal disease and the wider use of MR. Artificial intelligence (AI) and machine learning (ML) technologies are playing an increasingly critical role in the diagnosis of spinal disorders [12]. Researchers and engineers are working together to develop AI-assisted diagnostic systems to improve the accuracy of diagnosis and reduce the disease burden on society. In recent years, machine learning techniques have been applied to the diagnosis of spinal disorders, including degenerative disease, trauma, tumors, and deformity. Recent literature also shows the potential of deep learning-based approaches for reliable quantification of the vertebrae and discs and for decision-making in the treatment of lumbar disc herniations. However, most ML studies are based on limited data from a single center. It is essential to develop optimized algorithms that allow AI programs to analyze images precisely across medical centers with varying scanning parameters. Furthermore, an ideal AI system should give both qualitative and quantitative descriptions of lesions. For example, in the case of intervertebral disc prolapse, the AI system should not only tell surgeons the level of the prolapsed disc but also show the border of the lesion and its relationship to other critical organs and tissues.
This study included a wide range of patients aged 12-86 years, which added a relatively wide variety of disc morphology and hydration to the deep learning database and should make the results more representative. The major findings of our study show that the deep learning algorithm achieved good performance and was highly consistent with radiologists' expert reading in detecting lumbar degenerative disease. Several previous studies have reported automated detection of intervertebral disc degeneration using deep learning approaches [13]. However, these studies used only limited data from a single center to validate performance. As an important step in evaluating a deep learning-based model, external validation is necessary for assessing the model's robustness and generalization, and it helps avoid overestimating model performance because of overfitting [14]. Various studies have demonstrated that the diagnostic performance of deep learning models can vary across datasets [15,16]. Our proposed model showed high detection accuracy on both sagittal and axial views of T2-weighted MRI in the internal validation dataset. The model also showed good robustness, achieving favorable performance in two external validation datasets.
Quantitative analysis provides valuable information and can help clinicians make better decisions, and some deep learning-based lumbar disc quantitative models have been developed in recent years [2,17,18]. However, the reliability and validity of quantitative segmentation and measurement of disease regions still need further validation in clinical practice. Moreover, the diseased area of the degenerative cervical spine is commonly irregular, with unclear boundaries on MR images, which makes accurate annotation of the disease region difficult. Meanwhile, the quality of MR images depends mainly on the scanning parameters, which usually differ across institutions and may even vary between MR scanners in the same center [19]. Therefore, subjective bias in the assessment and segmentation of lesion areas on MR images seems inevitable, even among experienced radiologists [20]. This, in turn, has a substantial impact on the training and validation of segmentation neural networks. For this reason, we chose to use a detection network rather than a segmentation network such as U-Net.
An important distinction of our work from previous studies is the use of transfer learning to detect cervical disc herniation. Cervical and lumbar disc herniations are both degenerative spinal diseases in their pathophysiological mechanism, and the degenerative discs in the cervical and lumbar regions show similar characteristics on MR imaging. For these reasons, our research further explored the possibility of transferring the model learned from the lumbar dataset to evaluate cervical discs. Although it was only a preliminary study, the results were encouraging. As a highly effective technique, transfer learning is suitable for dealing with medical images, particularly when data are limited [21,22]. Using transfer learning, Kermany et al. [23] built a generalized platform that could both classify diabetic macular edema and age-related macular degeneration on optical coherence tomography images and diagnose pediatric pneumonia on chest X-ray images. In this study, we also investigated the effectiveness of transfer learning in detecting cervical degenerative disease on limited T2-weighted MR images from 200 patients. We found that our model retained high detection accuracy, illustrating the generalization of our proposed deep learning algorithm and the power of transfer learning to achieve good performance even with a small dataset.
There are several limitations of our study. First, although a relatively large cohort from several institutions was collected for model development and validation, we are aware that the sample size is still not large enough for a deep learning algorithm that contains millions of parameters. More patients from different hospitals should be collected to improve the accuracy of lumbar and cervical degenerative disease detection. Second, our study focused on the analysis of T2-weighted MR images, which are most commonly used to investigate degenerative discs. However, other modalities such as radiographs, CT, and lumbar discography can also reveal anatomic changes and help to exclude other diagnoses [24]. The fusion of multi-modality images such as CT and MR could be a promising area of future investigation. Third, although our model achieved high accuracy in detecting degenerative discs, there were some technical difficulties in accurately identifying the boundary of a herniated disc, especially the floor of the lesion on MRI. Further quantitative analysis, including characterization of different degeneration grades, would be of greater value in clinical practice and will be the focus of our future work.

Conclusion
The proposed DL model can automatically detect lumbar and cervical degenerative disease on T2-weighted MR images with good performance, robustness, and feasibility in clinical practice.

A deep learning (DL) model was proposed for lumbar degenerative disc detection. The schematic illustration of the DL model is shown in Fig. 1. The DL model was based on the U-Net encoder-decoder architecture. Two modified 3D ResNet18 networks without parameter sharing were applied separately as encoders for the two different modalities (sagittal-view MR image and axial-view MR image).

Fig. 1 Schematic illustration and detection flowchart of the 3D ResNet18 and transformer architecture of the proposed deep learning model

Fig. 2 Free-response receiver operating characteristic (fROC) curve analysis of the deep learning model in the internal validation dataset (A), external validation dataset I (B) and external validation dataset II (C)

Fig. 3 Precision-recall (PR) curve analysis of the deep learning model in the internal validation dataset (A), external validation dataset I (B) and external validation dataset II (C)

Table 2 Clinical characteristics of the enrolled patients