Introduction

Chronic low back pain is a leading cause of disability worldwide and increases the medical burden [1, 2]. Lumbar intervertebral disc (IVD) degeneration is a common cause of pain and is considered a precursor to other lumbar degenerative diseases [3, 4]. Magnetic resonance imaging (MRI), with its unique advantages in soft tissue visualization, provides clear depictions of the IVDs’ morphology and structure, establishing itself as the main diagnostic tool for IVD degeneration [5]. However, the interpretation of lumbar spine MRI is complex and time-consuming, necessitating considerable surgical expertise [6]. The increasing number of patients in recent years has further amplified the demand for radiologists and spinal surgeons.

The development of artificial intelligence (AI) presents the potential for rapid, accurate, and stable imaging analysis [7,8,9,10]. AI, once trained on extensive datasets, can surpass human experts in medical image processing [8]. Despite the scarcity of AI systems for widespread clinical use in spinal surgery, there has been a significant increase in research concerning AI’s role in IVD degeneration [8, 10]. In 2023, a systematic review and meta-analysis revealed that machine learning and deep learning (DL) algorithms can offer relatively accurate and repeatable diagnosis of lumbar disc herniation and degeneration grading [5]. These techniques can also be applied to diagnose other disc-related diseases, support clinical decision-making, and predict patient outcomes [11,12,13].

However, the further optimization and clinical application of these algorithms hinge on a fundamental requirement: image segmentation. In MRI assessment of IVDs, accurate segmentation delineates the regions of interest for diagnostic models, enhancing their precision and interpretability [14]. The IVD segmentation technique can be applied to quantitative imaging assessments, including the automatic measurement of disc height and protrusion distance. These assessments were previously performed manually by physicians, which was a tedious and time-consuming process with low consistency in measurement results [15]. Moreover, AI algorithms can use image segmentation data to construct three-dimensional models of IVDs for applications in CT/MRI image fusion, surgical planning, and navigation [16]. DL algorithm frameworks, such as U-net, have become the state-of-the-art and primary methods for image segmentation [6, 10, 17]. However, to our knowledge, there has been no systematic investigation or summary of the performance of DL technology in IVD segmentation and quantitative measurement within lumbar spine MRI.

This systematic review and meta-analysis aims to bridge this knowledge gap by evaluating the performance of DL models in segmenting and measuring IVDs in MRI scans, with a focus on segmentation accuracy. We believe that this review will offer a comprehensive overview for further research and application in this critical area.

Methods

General guidelines

This systematic literature review strictly followed the guidelines outlined by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA, see supplementary file 1) [18, 19]. The protocol of this study has been registered in PROSPERO (https://www.crd.york.ac.uk/prospero/) under the registration number CRD42024534092. Given the nature of this systematic review and meta-analysis, ethical approval and informed consent from participants were not required.

Search strategy and review process

A systematic literature search was conducted independently by two researchers (A.W. and C.Z.), with records collected from three major databases up to the search date of April 10, 2024. The databases included PubMed, Embase, and Web of Science (including Medline). The following key terms were used for the literature search: “Deep Learning,” “Artificial Intelligence,” “Neural Networks,” “Segmentation,” “Feature extraction,” “Intervertebral Disc,” and “Lumbar Vertebrae.” Additionally, references from included studies were reviewed to identify any relevant literature.

Titles and abstracts of the identified studies were screened for eligibility by the two researchers independently. A list of references from relevant studies and systematic reviews was also screened. Disagreements were resolved by a third researcher and co-author (L.Z.).

Inclusion and exclusion criteria

Inclusion criteria for this review were: (1) studies involving adult participants; (2) utilization of MRI to assess IVDs; (3) application of DL methodologies with comprehensive data on segmentation performance; (4) acceptance of both retrospective and prospective study designs.

Exclusion criteria included: (1) reviews, letters, guidelines, editorials, or errata; (2) studies involving animals, cadavers, in vivo biomechanics, or patients with lumbar tumors or trauma; (3) studies with overlapping cohorts, which would be summarized but not included in meta-analyses; (4) use of machine learning algorithms other than DL; (5) studies of low quality; and (6) non-English publications.

Quality assessment

The quality of included studies was assessed using the second version of the Quality Assessment Tool for Diagnostic Accuracy Studies (QUADAS-2) [20], which included four domains: patient selection, index test, reference standard, and flow and timing. For patient selection, the focus was on the inclusion of a well-defined patient population with clear criteria for inclusion and exclusion. For the index test, the explicit description of the DL algorithm for segmenting and evaluating IVDs was scrutinized. The reference standard domain evaluated the reliability of the ground truth determination through manual segmentation and quantitative measurement. The flow and timing domain assessed the clarity of the research flow [21].

Data extraction

The following variables were extracted and recorded: (1) study attributes, including the primary author, publication year, study design, and duration; (2) medical data, including patient count and object of the study; (3) characteristics of MRI scanning; (4) DL specifics including the algorithm framework, dataset partition, and data augmentation strategies; (5) performance metrics for IVD segmentation, including dice similarity coefficient (DSC) score, precision, recall, and Intersection over Union (IoU).
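The four overlap metrics listed above can all be derived from the same pixel-wise confusion counts. The following is a minimal sketch, assuming each mask is supplied as a flat sequence of 0/1 values; the function name `segmentation_metrics` and the example masks are hypothetical and are not taken from any included study.

```python
def segmentation_metrics(pred, truth):
    """Compute DSC, IoU, precision, and recall for two binary masks.

    Both masks are flat sequences of 0/1 values of equal length
    (an assumption made for this sketch).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of pixels")

    # Pixel-wise confusion counts for the foreground (disc) class.
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)

    def ratio(num, den):
        # A metric is undefined when its denominator is zero.
        return num / den if den else float("nan")

    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Hypothetical 8-pixel masks, for illustration only.
example_pred = [0, 1, 1, 1, 0, 0, 1, 0]
example_truth = [0, 1, 1, 0, 0, 1, 1, 0]
example_metrics = segmentation_metrics(example_pred, example_truth)
print(example_metrics)
```

Because DSC and IoU are computed from the same counts, they are monotonically related (IoU = DSC / (2 − DSC) for a single mask pair), which is why studies reporting only one of the two can still be compared qualitatively.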

For summarization and subgroup analysis, the algorithms applied in the included studies were categorized into U-Net variants, Deeplab variants, Generative Adversarial Networks (GAN) variants, and CNN variants. The U-Net variants included 2D/3D U-Net networks and those combined with frameworks like ResNet, as well as other U-Net-like algorithms. CNN variants referred to other CNN or FCN frameworks not covered by the aforementioned categories. Given the specialized features of most algorithms, these classifications may be very crude. For detailed characteristics of a particular algorithm, consultation of the original literature is strongly recommended.

Statistical analysis

Statistical analyses were performed using the Comprehensive Meta-Analysis software (version 3, Biostat, Englewood, NJ, USA). A random-effects model was applied for the meta-analysis, with p < 0.05 indicating statistical significance. Forest plots were generated to visualize the estimated DSC and IoU scores and the overall performance. Subgroup analyses were performed to explore relationships between outcomes and potential influencing factors. Heterogeneity between studies was assessed using the Q-test and the Higgins I² statistic, categorized as follows: 0–25% (not important), 26–50% (low), 51–75% (moderate), and 76–100% (high). Publication bias was investigated using a funnel plot, with asymmetry evaluated by Egger’s test.
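To make the pooling procedure concrete, the sketch below shows one common way a random-effects estimate, Cochran's Q, and I² can be computed. It assumes the DerSimonian-Laird estimator and untransformed effect sizes; the source does not state which estimator or transformation the Comprehensive Meta-Analysis software applied, so this is an illustration rather than a reproduction of the analysis. The function names and the input values are hypothetical.

```python
import math


def heterogeneity_label(i2_percent):
    """Map I² (in percent) to the categories used in this review.

    Values falling between the stated bands (e.g. 25.5) are assigned
    to the next band up, which is an assumption of this sketch.
    """
    if i2_percent <= 25:
        return "not important"
    if i2_percent <= 50:
        return "low"
    if i2_percent <= 75:
        return "moderate"
    return "high"


def random_effects_pool(effects, std_errors, z=1.96):
    """DerSimonian-Laird random-effects pooling (assumed estimator)."""
    k = len(effects)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies with matching inputs")

    # Fixed-effect (inverse-variance) weights and estimate.
    w = [1.0 / se ** 2 for se in std_errors]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Cochran's Q and the between-study variance tau².
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights add tau² to each within-study variance.
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sw_re = sum(w_re)
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sw_re
    se_pooled = math.sqrt(1.0 / sw_re)

    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci_low": pooled - z * se_pooled,
        "ci_high": pooled + z * se_pooled,
        "q": q,
        "tau2": tau2,
        "i2": i2,
        "heterogeneity": heterogeneity_label(i2),
    }


# Hypothetical per-study DSC values and standard errors (illustrative only).
example_result = random_effects_pool([0.91, 0.89, 0.92, 0.88],
                                     [0.010, 0.015, 0.020, 0.012])
print(example_result)
```

When Q does not exceed its degrees of freedom, tau² is truncated to zero and the random-effects estimate coincides with the fixed-effect one.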

Results

Basic characteristics

The PRISMA flowchart for the literature search is shown in Fig. 1. Initially, 583 publications were identified through database searching, and an additional 4 publications were retrieved through cross-referencing. After removing duplicates, 376 publications were screened, and 295 of them were excluded based on the titles and abstracts. Ultimately, 45 publications were included in the systematic review after full text screening [15, 22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65]. However, only 16 publications were eligible for the meta-analysis because they provided sufficient quantitative data [22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37]. It should be noted that 2 of the publications were based on the same cohort [29, 36], but both were included in the meta-analysis because they differed in the MRI slices used for training and in the algorithm frameworks. Attempts to contact the corresponding authors of other publications did not yield the necessary data.

Fig. 1 PRISMA flowchart for the current meta-analysis

Table 1 outlines the basic characteristics of the included studies and objects. Most studies were designed retrospectively and performed based on single-center datasets, including public datasets. However, since these studies only involve the processing of medical images, retrospective or prospective studies may not make a significant difference in data quality. The number of patients across the studies ranged considerably, from as few as 8 [57] to as many as 520 [55]. The study subjects primarily included healthy individuals and patients with various types of degenerative lumbar diseases. Many publications did not report the study durations.

Table 1 Patient and study characteristics

Table 2 shows the information about the MR scans and DL strategies. 35 of the studies used sagittal slices for IVD segmentation, while axial and coronal slices were also used in other studies. The specific MRI slices selected for segmentation varied. 16 of the studies used mid- or para-sagittal slices, which can clearly show the IVD structures. Some studies used several or all sagittal or axial slices, or used 3D SPACE sequences. In terms of image capturing methods, 35 of the studies implemented T2 sequences, while other studies implemented both T1 and T2 sequences, or fused images produced by T2 images registered in T1 images. Although the scanner, slice thickness, Tesla, and image size may affect the IVD segmentation, many studies did not report detailed information about these items, especially those with multiple data sources. Therefore, this study did not further summarize these data.

Table 2 Characteristics of the datasets and algorithm frameworks used

Data preprocessing and DL algorithms

The preprocessing of medical image data aims to enhance image quality and augment sample size to improve training effectiveness. Image cropping and resizing aim to standardize image dimensions for ease of training, or to pre-crop images to specific segments or regions of interest, and were used in many studies [15, 22,23,24,25, 27, 29, 30, 33, 36, 39, 40, 42,43,44, 46, 48,49,50,51, 62, 64, 65]. Normalization, which standardizes the intensity values of images, was also employed in some studies [15, 24, 25, 28,29,30,31, 35, 36, 39, 44, 46, 48,49,50,51, 62, 64]. Data augmentation applies transformations such as rotation, flipping, and contrast enhancement, and some studies used this strategy to enlarge the training dataset [15, 22, 28,29,30, 32, 34, 36, 39, 43, 44, 46,47,48, 60, 62, 64]. Padding [24, 25, 36, 48, 62, 65] is also an optional preprocessing method. As shown in Table 2, all data were randomly or manually partitioned into training, testing, and validation sets. Some studies randomly grouped MR image slices, while others, especially those employing 3D algorithmic frameworks, grouped data by patient, meaning that all data from a single patient’s examination belonged exclusively to one dataset.
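The difference between slice-level and patient-level partitioning can be illustrated with a short sketch of the latter. It assumes each slice is identified by a (patient ID, slice ID) pair; the function name `split_by_patient`, the 70/15/15 proportions, and the random seed are hypothetical choices for illustration and do not reflect the split used in any particular study.

```python
import random


def split_by_patient(slice_records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Partition slices so that each patient appears in exactly one set.

    slice_records: list of (patient_id, slice_id) tuples.
    fractions: assumed train/validation/test proportions of patients.
    """
    # Shuffle patients, not slices, so no patient is shared between sets.
    patients = sorted({pid for pid, _ in slice_records})
    rng = random.Random(seed)
    rng.shuffle(patients)

    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    return {
        name: [rec for rec in slice_records if rec[0] in ids]
        for name, ids in groups.items()
    }


# Hypothetical dataset: 10 patients with 3 slices each.
example_records = [(f"P{p:02d}", s) for p in range(10) for s in range(3)]
example_split = split_by_patient(example_records)
print({name: len(recs) for name, recs in example_split.items()})
```

Shuffling individual slices instead would allow adjacent slices from the same examination to fall into both the training and testing sets, which can inflate the reported segmentation performance.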

Most studies included in this review employed specifically designed or improved algorithms for IVD segmentation, with the U-net network and its variants being the most commonly applied models (28 studies), including classic 2D/3D U-net, V-net, ResUnet, etc. GANs and DeepLab segmentation networks were utilized in 4 and 2 studies, respectively, while the remaining studies employed other variants of CNNs or FCNs. The information about the algorithms is summarized in Table 2, and we will further discuss the characteristics of the various algorithms in the discussion section.

DSC and IoU are the most commonly used performance metrics for automatic IVD segmentation. The reported DSC ranged from 0.810 [27] to 0.982 [40], while the reported IoU ranged from 0.771 [42] to 0.972 [52]. Among the included studies, 3 studies performed IVD segmentation only at a specific segment (L4/5 or L5/S1). 5 studies also segmented other structures of the lumbar spine, such as the vertebral body and spinal canal, and reported only the overall segmentation results for all structures, with the reported DSC ranging from 0.803 [62] to 0.948 [58]. Other evaluation indexes of IVD segmentation, such as precision and recall, were also reported in several studies [24, 25, 29, 31, 41, 43, 46, 48, 65], with the reported precision ranging from 0.868 [41] to 0.986 [46], and the recall ranging from 0.904 [24] to 0.950 [46].

Several studies conducted automatic quantitative measurements of IVDs based on image segmentation [15, 49, 61], including measurements of disc height and area. These studies all reported good consistency between the automatic measurements and the gold standard (manual measurements). However, due to differences in measurement methods and evaluation metrics, this study did not summarize the performance of the quantitative measurements. The authors believe that the accuracy of quantitative measurements also reflects the accuracy of automatic segmentation.
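As an illustration of how a segmentation mask can feed a quantitative measurement, the sketch below derives disc area and a mean disc height from a binary sagittal mask. The measurement definitions here are assumptions made for the example (area as pixel count times pixel size, height as the mean column-wise extent of the disc, with the disc roughly horizontal in the image); the included studies used differing methods, and none of them is reproduced here. The function name and example mask are hypothetical.

```python
def disc_area_and_mean_height(mask, row_spacing_mm, col_spacing_mm):
    """Estimate disc area (mm²) and mean height (mm) from a binary mask.

    mask: list of rows, each a list of 0/1 values (sagittal slice).
    Height is taken as the mean vertical extent over the columns that
    contain disc pixels; this definition is an assumption of the sketch.
    """
    n_pixels = sum(1 for row in mask for value in row if value)
    area_mm2 = n_pixels * row_spacing_mm * col_spacing_mm

    n_cols = len(mask[0]) if mask else 0
    column_heights = []
    for col in range(n_cols):
        count = sum(1 for row in mask if row[col])
        if count:
            column_heights.append(count * row_spacing_mm)

    mean_height_mm = (
        sum(column_heights) / len(column_heights) if column_heights else 0.0
    )
    return area_mm2, mean_height_mm


# Hypothetical 4 x 5 mask with 0.5 mm isotropic pixels.
example_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
example_area, example_height = disc_area_and_mean_height(example_mask, 0.5, 0.5)
print(example_area, example_height)
```

Under this definition, any boundary error in the segmentation propagates directly into the measured area and height, which is consistent with the view that measurement accuracy reflects segmentation accuracy.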

Methodological quality

All integrated studies underwent quality assessment using the QUADAS-2 tool. Regarding bias risk within patient selection, 8 studies were classified as having a high risk of bias since they did not report clear inclusion and exclusion criteria [23, 30, 34, 42, 52, 58, 59, 64]. The ambiguity of the subjects may limit the applicability of the results. 6 studies exhibited an indeterminate risk of bias [33, 38, 49, 50, 62, 63]. Concerning the reference standard, 5 studies were assessed with a high risk of bias due to the lack of description on ground truth establishment [22, 49, 50, 52, 62]. 2 studies exhibited an indeterminate risk of bias [38, 57]. All studies were considered to have low risk of bias in the index test and flow and timing, since an explicit algorithm model and a clarified research flow are necessary conditions for this type of research to be recognized. However, judging the performance of the proposed models solely based on the description in the articles may not be sufficiently accurate. The repeatability and applicability of applying algorithmic models for automatic IVD segmentation can only be confirmed through further external validation.

Regarding applicability, subjects recruited from the community and patients with lower back or leg pain were considered to match the review question. 9 studies were assessed as having high concern in patient selection [22, 28, 32,33,34, 40, 52, 57, 64]. Their datasets included images from specific treatment stages, reformatted images, or images without clear patient information. 9 studies were assessed as having indeterminate concern [23, 30, 42, 49, 50, 58, 59, 62, 63]. Their datasets were derived from hospital databases but lacked associated patient information. Concerning the reference standard, 5 studies were assessed with high concern [22, 49, 50, 52, 62] and one study exhibited an indeterminate concern [57] due to a lacking or insufficient description. The detailed results of the quality assessment are shown in Figure S1 and Table S1.

Meta-analysis of the included studies

As shown in Fig. 2, the pooled value of DSC from 14 studies was 0.900 (95% confidence interval [CI]: 0.887–0.914) [22, 24,25,26,27,28,29,30,31,32, 34,35,36,37]. The Higgins I² statistic indicated that heterogeneity across the studies was not important (I² = 20.501). A sensitivity analysis confirmed the robustness of the results, as the overall effect sizes remained statistically significant even when any individual study was excluded from the analysis (Figure S2). In addition, 4 studies reported the IoU of IVD segmentation [23, 28, 33, 35], and the pooled value was 0.863 (95% CI: 0.730–0.995, I² = 0.000, p = 0.073, Fig. 3).

Fig. 2 Forest plot of deep learning algorithms’ dice similarity coefficient

Fig. 3 Forest plot of deep learning algorithms’ Intersection over Union score

Subgroup analysis

Subgroup analyses were conducted to determine if factors such as network dimensionality, type of algorithm, publication year, number of patients included, scanning direction, data augmentation, and cross validation might influence the effect of IVD segmentation (DSC). The detailed results are shown in Table 3. Although the Higgins I² statistics indicated moderate heterogeneity between subgroups for network dimensionality (I² = 73.174) and publication year (I² = 65.760), the Q-test suggested no significant difference between subgroups when stratified by these factors (p > 0.05). The forest plots of the subgroup analyses are presented in Figures S3-S8.

Table 3 Results of subgroup analyses

Publication bias

Publication bias analysis was conducted for the DSC using a funnel plot (Figure S9), as the number of studies available for other outcome evaluation metrics was limited. The p-value of the Egger’s test was 0.458, suggesting no significant publication bias.
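For readers unfamiliar with the asymmetry test, the sketch below outlines Egger's regression in its classical form: each study's standardized effect (effect divided by its standard error) is regressed on its precision (the reciprocal of the standard error), and an intercept that differs from zero suggests funnel plot asymmetry. This is a minimal illustration under that standard formulation, not a reproduction of the software output reported above; the function name and input values are hypothetical, and the p-value step (comparing the t statistic with a t distribution on k − 2 degrees of freedom) is left out to keep the example dependency-free.

```python
import math


def egger_regression(effects, std_errors):
    """Ordinary least squares of standardized effect on precision.

    Returns (intercept, slope, t_statistic, degrees_of_freedom).
    """
    k = len(effects)
    if k < 3 or k != len(std_errors):
        raise ValueError("need at least three studies with matching inputs")

    x = [1.0 / se for se in std_errors]                  # precision
    z = [y / se for y, se in zip(effects, std_errors)]   # standardized effect

    x_bar = sum(x) / k
    z_bar = sum(z) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be identical")
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))

    slope = sxz / sxx
    intercept = z_bar - slope * x_bar

    # Residual variance and the standard error of the intercept.
    rss = sum((zi - (intercept + slope * xi)) ** 2 for xi, zi in zip(x, z))
    s2 = rss / (k - 2)
    se_intercept = math.sqrt(s2 * (1.0 / k + x_bar ** 2 / sxx))

    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, slope, t_stat, k - 2


# Hypothetical DSC values and standard errors (illustrative only).
example_egger = egger_regression([0.91, 0.89, 0.92, 0.88, 0.90],
                                 [0.010, 0.015, 0.020, 0.012, 0.030])
print(example_egger)
```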

Discussion

The global prevalence of low back pain is 18%, with IVD pathology identified as a significant contributor [66]. The interpretation of imaging for IVD diseases is often time-consuming and challenging. In light of recent advancements in AI, particularly DL, the application of these technologies to medical imaging has the potential to improve current medical practices. This systematic review and meta-analysis showed that the pooled DSC for lumbar IVD segmentation in MRI using various DL techniques was 0.900, with a 95% CI of 0.887–0.914, indicating a satisfactory level of accuracy. This technology can be further applied in diagnosis, measurement and evaluation, and surgical planning. To the best of our knowledge, this is the first systematic review and meta-analysis to address this topic. However, due to inconsistencies in reporting metrics and algorithm frameworks among the included studies, the interpretation of the results should be approached with caution.

In this systematic review and meta-analysis, the reported DSC for IVD segmentation ranged from 0.810 to 0.982. The studies exhibited no significant heterogeneity. This may be because the structural contours of IVDs in lumbar MR images are usually clear, making automatic IVD segmentation easier and more stable than the segmentation of structures that are harder to distinguish, such as tumors [67]. The included studies were divided into different subgroups based on various criteria, but no statistically significant differences were found between any of the subgroups. A possible explanation is that most studies used high-quality lumbar MRI datasets and followed similar research designs and processes. Additionally, all studies were based on limited datasets, and the algorithm design is the main factor affecting IVD segmentation performance. Although the applied algorithms can be broadly classified, almost no two studies used identical algorithm frameworks. For instance, the U-Net network, widely used for medical image segmentation due to its symmetric encoder-decoder structure, captures both global context and local details [68]. With technological advancements, there are now several common variants such as U-Net++ and V-Net. Researchers can optimize the performance of specific algorithms by adjusting the number of convolutional layers, replacing convolution kernels and pooling layers, and introducing other structures. It may be difficult to quantify the specific impact of these strategies on segmentation performance.

Based on the above discussion, it is worth noting the improvement strategies proposed by studies to achieve a precise and practical algorithm model for IVD segmentation. Researchers have made several representative improvements in various aspects:

(1) To mitigate poor training quality or overfitting caused by limited datasets, in addition to common strategies such as cross-validation, Pang [15] proposed a method called adaptive local shape-constrained manifold regularization. This method forces the output of the cascade amplifier regression network to lie on the target output manifold using local linear representation, which reduces overfitting. Das [28] and Li [22] conducted IVD segmentation using the MICCAI Challenge datasets, which contained multimodal MR images of a few patients. They used region-to-image matching and dropout strategies to improve feature learning and generalization, maximizing the utilization of multiple MRI sequences. Another solution is to make the data more useful. Gaonkar [30] designed the Eigenrank by Committee (EBC) algorithm. EBC selects images that are harder to classify for training, which improves the effectiveness of manual annotation and gives better results than randomly partitioned training sets. Although data augmentation is a common strategy to increase the sample size, some scholars believe that methods such as rotation and contrast adjustment may change key information in the MRI and therefore may not be suitable for such rigorous medical images [5]. In the subgroup analysis of this study, data augmentation showed no significant effect on segmentation performance.

(2) To improve the performance and usability of algorithms, a common strategy is to use multi-scale feature fusion [32, 33, 43, 49, 50, 62]. By utilizing multi-branch structures, such as BiSeNet and PSPNet, it combines high-resolution details from low-level features and semantic information from high-level features. This approach fully leverages the small inter-class differences and large intra-class variations in spinal anatomy features, enhancing detection capability. Multi-scale feature fusion also helps the model understand the broader anatomical context and relationships within the spinal structure, which allows it to segment various spinal structures at the same time [50]. Another popular approach is semi-supervised learning (SSL), which combines labeled data with weakly labeled or unlabeled data. Common methods include self-training, pseudo-labeling, and generative models. SSL has many advantages, such as increasing the amount of data, reducing the workload of expert labeling, enhancing the model’s generalization ability, and allowing AI systems to identify imaging features that may be undetectable by human doctors [29, 64]. Other strategies include the level set approach [34] and residual refinement attention [33] (for tracking and refining image boundaries), mixed loss functions [63] (to enhance model robustness), and ensemble learning [43] (to combine outputs from multiple models), etc. Additionally, He [38, 41, 53, 56, 58, 59], Wang [38], and Liu [57] proposed reducing convolutional parameters and calculation complexity. These approaches minimize the algorithm’s size and memory usage while maintaining segmentation performance, which is crucial for applying and deploying such algorithms in further clinical practice.

(3) Several studies have further explored the use of IVD segmentation. The segmented IVDs can be used to diagnose degenerative diseases such as IVD herniation [26, 44]. Recent advancements include the application of meta-interpretive learning, dimensionality reduction and integration of MR images, and multi-input, multi-class algorithms, aiming for more precise diagnosis and report generation [37, 41, 46]. To improve the appearance of segmentation, Hou [60] introduced Gaussian divergence loss and contour loss to address issues such as irregular edges and isolated segments. Meanwhile, He [58] developed a filling algorithm for sparse segmented images, which leverages contextual information from adjacent slices to generate interpolated slices and smooth the 3D reconstructed image. 3D reconstruction of the IVD and surrounding structures has been explored for clinical applications, particularly for morphological evaluation and surgical planning of the lumbar spine [24, 25, 65]. However, these applications still rely on manual planning by physicians, and there is a lack of systematic automated surgical planning algorithms.
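Among the strategies listed above, a mixed loss function is straightforward to illustrate. The sketch below combines a soft Dice loss with binary cross-entropy, which is one widely used pairing in segmentation; the source does not specify which loss terms the cited study combined, so the components, the weighting parameter `alpha`, and the function names are all assumptions made for this example.

```python
import math


def soft_dice_loss(probs, targets, eps=1e-6):
    """1 minus the soft Dice overlap between probabilities and 0/1 targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clipped for stability."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def mixed_loss(probs, targets, alpha=0.5):
    """Weighted sum of the two terms; alpha = 0.5 is an arbitrary choice."""
    return (alpha * soft_dice_loss(probs, targets)
            + (1.0 - alpha) * bce_loss(probs, targets))


# Hypothetical per-pixel foreground probabilities and labels.
example_targets = [1, 0, 1, 0]
print(mixed_loss([0.9, 0.1, 0.8, 0.2], example_targets))
print(mixed_loss([0.6, 0.4, 0.5, 0.5], example_targets))
```

The region-based Dice term is insensitive to the large background area around a disc, while the pixel-wise cross-entropy term provides smoother gradients, which is the usual motivation for combining the two.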

Through our review of the studies in this field, we have identified areas that require improvement to drive technological advancements. First, the training data, rather than the algorithm framework, are the key determinant of algorithm performance. However, due to gaps in specialized knowledge and the confidentiality requirements of medical data, we have not yet encountered research that can truly be considered “big data”. Such research should encompass diverse ages, races, and other variables to minimize bias and include a broader range of pathologies, such as internal fixation, infections, and deformities, to ensure applicability. Second, most related studies are conducted by engineers rather than clinicians, resulting in a primary focus on algorithm design. Many studies lack detailed reports on patient inclusion and ground truth establishment, which may limit the models’ applicability. Therefore, closer collaboration between engineers and clinicians is essential for further research. Third, although no significant differences were observed in the subgroup analysis, we still recommend that future studies employ more 3D MRI data, as it provides richer detail and aids subsequent applications. A considerable number of studies train segmentation on specific slices, such as midsagittal slices, which provide critical information but are not sufficient for direct clinical application. Finally, as many scholars have recently highlighted, the practical issues of software implementation of models, ethical approval, and cost-effectiveness must be addressed in future research [7, 10]. Despite these challenges, current advancements demonstrate a promising outlook for the application of DL technology. Therefore, continuing to explore ways for DL technology to provide tangible benefits to clinicians and patients remains worthwhile.

Additionally, this systematic review and meta-analysis has some limitations. First, there are certain discrepancies in the reporting metrics of the reviewed studies, and many do not provide complete data on segmentation performance, such as standard deviations or confidence intervals, resulting in a limited number of studies that can be included in quantitative summaries. Second, image segmentation requires less complete patient baseline data than diagnostic and prognostic studies; however, the heterogeneity of datasets may still limit the significance of this study’s results. Third, there is currently very limited peer review and external validation of the models.

Conclusion

In conclusion, DL algorithms enable automatic segmentation of IVDs in MRI with relatively satisfactory performance. This technology has potential applications in diagnosis, measurement and evaluation, and surgical planning. However, the current results should be interpreted with caution due to limitations such as small sample sizes, differences in reporting metrics, and lack of external validation of the algorithms.

In future studies, it is recommended to use larger and more diverse datasets for training, and to promote external validation and applied research of the algorithms. Clinicians and DL experts can work together to guide this technology and bring tangible benefits to patients and clinical practice.