Introduction

Intervertebral Disc Herniation (IVDH) is a common and severe spinal disease in dogs, accounting for 2.3–3.7% of veterinary hospital admissions1,2,3. Its manifestation in canine patients varies, depending on the type and location of the herniation. Clinical symptoms can range from mild discomfort and pain to severe neurological deficits. In more severe instances, IVDH may cause muscle atrophy and paralysis of the hind limbs, significantly impacting the quality of life of the affected dogs2,4,5.

With the increasing availability of veterinary magnetic resonance imaging (MRI), the prospects for early detection of IVDH are promising6. However, a global shortage of radiologists skilled in veterinary MRI interpretation leads to diagnostic challenges7,8,9. While artificial intelligence (AI) and deep learning research has advanced IVDH detection in humans10,11,12,13,14, the anatomical differences between humans and animals, especially the smaller size of animal intervertebral discs, create unique challenges in adapting these methods for veterinary applications. Moreover, while a few studies15,16 have applied AI techniques to canine IVDH, their focus has predominantly been on image quality improvement15 or image-level disease classification16. The localization of IVDH lesions at the segment level, an arguably more challenging yet crucial task, remains largely unexplored.

Recent advances in deep learning-based object detection methods offer a way to address this gap. Object detection methods generally fall into two types: one-stage and two-stage detectors17. Two-stage methods, such as the influential Faster Region-based Convolutional Neural Network (R-CNN) method18, involve a two-step process: a Region Proposal Network (RPN) first identifies regions that may contain objects, and a detection network then classifies these regions and localizes them with refined bounding boxes. Although Faster R-CNN is indeed substantially “faster” than earlier two-stage methods, two-stage detectors still generally involve more computational steps and complexity than one-stage methods17.

In contrast, one-stage methods, such as the You Only Look Once (YOLO) algorithms19, adopt a more efficient approach. YOLO divides the image into a grid and predicts bounding boxes and class probabilities directly from this grid structure. Widely recognized for their high inference speed, these methods are particularly suitable for real-time applications on mobile devices. However, one-stage methods generally trade off some accuracy, especially in detecting smaller or more irregularly shaped objects, compared to their two-stage counterparts.

As one-stage detection methods continue to advance, they are increasingly seen as capable of matching the accuracy of two-stage methods20,21,22. However, there is still debate over whether state-of-the-art one-stage methods can fully replace two-stage methods, especially in specialized areas23,24. For IVDH detection on veterinary MRI, where the objects of interest, i.e., discs, are small and the difference between normal and herniated discs is subtle, the suitability of one-stage methods remains an open question.

To address the above research gap, this study aims to investigate the feasibility and methodology of AI-assisted detection of IVDH, with a specific focus on pet dogs. Our primary hypothesis is that simply adopting the latest object detection models from the computer vision field may not be the optimal solution for accurately detecting IVDH lesions in the context of veterinary radiology. Here, “accuracy” can be quantitatively assessed by the Average Precision (AP) metric. Our experiments reveal that more traditional two-stage detection models outperform more popular one-stage models in terms of IVDH detection accuracy. Furthermore, we propose a novel spine localization module and demonstrate that it can be used to enhance the IVDH detection accuracy of various models. Lastly, we show that it is possible to adapt the IVDH detection model to pet cats via transfer learning, potentially broadening the applicability of the proposed method.

Materials and methods

Dataset compilation

From September 2019 to August 2022, our study collected 487 mid-sagittal plane MRI images from 213 pet dogs. All pet owners were informed of the details of the study and signed a consent form before their dog participated in the experiments. All procedures were approved by the Ethics Committee of Shenzhen Technology University (reference number: SZTU20200208), and were carried out in accordance with relevant guidelines and regulations. All animal experiments complied with the ARRIVE guidelines (https://arriveguidelines.org). The dog samples represented a variety of breeds, sexes, ages, and weights. The most frequent breeds included Poodles (n = 55), Mixed breeds (n = 31), French Bulldogs (n = 18), Pomeranians (n = 16), and Welsh Corgis (n = 13). The age range of the dogs was from 0.3 to 18 years (mean value = 5.62, standard deviation = 3.94), and their weights varied from 1.5 to 46 kg (mean value = 9.45, standard deviation = 7.83).

MRI data acquisitions were performed on a super-conductive animal MRI scanner (1.5 T vPetMR, GSMED) at a local veterinary hospital (Pet Burgh YangZi Pet Hospital). A multi-slice 2D T2-weighted fast spin-echo sequence was used with the following imaging parameters: repetition time (TR) = 2895 ms, echo time (TE) = 110 ms, matrix size = 256 × 384, and slice thickness = 3.5 mm. For all dogs, anesthesia was induced intravenously with propofol (2.5 mg/kg), and maintained with inhaled isoflurane at a 1.5% concentration during imaging. For simplicity, only T2-weighted sagittal MRI images were used in this study.

The MRI images were initially in Digital Imaging and Communications in Medicine (DICOM) format and were later converted to .bmp files. Two veterinary radiologists interpreted the images and marked the spine region and intervertebral disc herniation (IVDH) lesions using the labelMe software (https://github.com/labelmeai/labelme). One radiologist holds a bachelor's degree in veterinary medicine and has 7 years of experience in veterinary radiology, while the other holds a master's degree in veterinary medicine and has 4 years of experience in veterinary radiology. Subsequently, the dataset was divided randomly into a training set (50%, 106 dogs) and a test set (50%, 107 dogs). Figure 1 shows the distributions of the subject weights, ages, and the number of annotations per subject. Out of the 213 dogs, 139 (64 in the training set and 75 in the test set) had at least one IVDH lesion, while 74 (42 in the training set and 32 in the test set) had none.
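As a rough illustration of this preprocessing step, the sketch below converts one DICOM slice to an 8-bit .bmp file and performs a random subject-level 50/50 split; the function names, directory layout, and random seed are hypothetical and not part of the original pipeline.

```python
import random
from pathlib import Path

import numpy as np
import pydicom
from PIL import Image


def dicom_to_bmp(dicom_path: Path, bmp_path: Path) -> None:
    """Convert a single DICOM slice to an 8-bit grayscale .bmp image."""
    pixels = pydicom.dcmread(str(dicom_path)).pixel_array.astype(np.float32)
    # Rescale intensities to the 0-255 range before saving as 8-bit .bmp.
    span = max(float(pixels.max() - pixels.min()), 1e-6)
    pixels = 255.0 * (pixels - pixels.min()) / span
    Image.fromarray(pixels.astype(np.uint8)).save(str(bmp_path))


def split_subjects(subject_ids, seed=0):
    """Randomly split subjects (not individual images) into 50% training and 50% test sets."""
    ids = sorted(subject_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]
```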

Figure 1

Distribution of subject ages, weights, and annotations per subject of the compiled canine IVDH dataset. The dataset represented a variety of breeds, sexes, ages, and weights, and was divided randomly into a training set (50%, 106 dogs) and a test set (50%, 107 dogs).

Proposed methodology

Figure 2 presents an overview of our proposed IVDH detection workflow where a coarse-to-fine strategy is employed. After obtaining annotations of the spine and IVDH lesions on MRI images, a preprocessing model, termed the spine localization module, is first trained to identify the spine regions. Once the spine regions are detected, they are cropped from the full-sized images, effectively eliminating irrelevant background tissues. The IVDH detection model (IVDH detection module) is subsequently trained on these cropped images, providing bounding boxes and confidence scores for IVDH lesions.

Figure 2

Illustration of the proposed IVDH detection workflow using deep learning models. After obtaining annotations of the spine and IVDH lesions on MRI images, the spine localization module is trained to effectively eliminate irrelevant background tissues. The IVDH detection model (IVDH detection module) is subsequently trained on the cropped images, providing bounding boxes and confidence scores for IVDH lesions. The inference (test) phase is similarly performed with this coarse-to-fine strategy.

During testing, the spine localization module first processes the raw images to identify the spine region and crop images. The cropped image is then fed into the IVDH detection module, which locates the IVDH lesions with bounding boxes and provides a confidence score for each detection.
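To make the coarse-to-fine inference flow concrete, a minimal sketch is given below; the detector interfaces, the crop margin, and the box format are illustrative assumptions rather than the exact implementation used in this study.

```python
import numpy as np


def detect_ivdh(image, spine_detector, ivdh_detector, margin=10):
    """Coarse-to-fine inference: locate the spine, crop it, then detect IVDH lesions.

    Both detectors are assumed to return (boxes, scores), with boxes given as
    [x1, y1, x2, y2] arrays in pixel coordinates.
    """
    # Coarse stage: keep the most confident spine box.
    spine_boxes, spine_scores = spine_detector(image)
    x1, y1, x2, y2 = spine_boxes[int(np.argmax(spine_scores))].astype(int)

    # Crop the spine region (with a small safety margin) to remove background tissue.
    h, w = image.shape[:2]
    x1, y1 = max(x1 - margin, 0), max(y1 - margin, 0)
    x2, y2 = min(x2 + margin, w), min(y2 + margin, h)
    crop = image[y1:y2, x1:x2]

    # Fine stage: detect IVDH lesions inside the cropped spine region.
    lesion_boxes, lesion_scores = ivdh_detector(crop)

    # Map lesion boxes back to the coordinate frame of the original full-sized image.
    lesion_boxes = lesion_boxes + np.array([x1, y1, x1, y1])
    return lesion_boxes, lesion_scores
```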

Implementation details

Since the spine detection task is relatively simple, we implemented only one model, Dynamic R-CNN25, for the spine localization module.

For the IVDH detection module, our experiments involved various well-known one-stage models, including YOLOv326, FCOS27, YOLOF21 and YOLOX20, as well as various two-stage models, including Faster R-CNN18, Cascade R-CNN28, Grid R-CNN29, Cascade Region Proposal Network (Cascade RPN)30, and Dynamic R-CNN25. These models were selected for their high impact in the field of computer vision. They encompass a broad spectrum from earlier methods such as Faster R-CNN (proposed in 2015) and YOLOv3 (proposed in 2018) to more recent approaches like Dynamic R-CNN (proposed in 2020) and YOLOX (proposed in 2021). Additionally, the YOLOX model was implemented in both a small (YOLOX-S) and a large (YOLOX-L) version.

We used the mmdetection framework31 to implement these models. Following standard configurations, the FCOS, YOLOF, and all two-stage models utilized ResNet5032 backbones pretrained on ImageNet33, and were trained on the IVDH dataset for 100 epochs. The YOLOv3 and YOLOX models, which do not have official ResNet50 backbones, had different configurations: YOLOv3 used a DarkNet5334 backbone pretrained on ImageNet, and was trained on the IVDH dataset for 273 epochs; YOLOX models used CSPDarkNet35 backbones without pretraining and were trained on the IVDH dataset for 300 epochs. Note that it is a common practice not to pretrain the backbones of YOLOX20. For comparative purposes, we also trained IVDH detection models using a more straightforward, end-to-end approach, i.e., directly on the full-sized images. All model training and testing were conducted on an Ubuntu 22.04 server equipped with two NVIDIA RTX 3090 GPU cards.
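As an illustration of how such a model can be configured in the mmdetection framework, a minimal config sketch for a two-stage detector is shown below; the base config path, annotation file names, and image directories are placeholders and do not correspond to the exact files used in this study.

```python
# Hypothetical mmdetection-style config sketch (all file paths are placeholders).
_base_ = './faster_rcnn_r50_fpn_1x_coco.py'  # ResNet50 backbone pretrained on ImageNet

# A single foreground class: the IVDH lesion.
model = dict(roi_head=dict(bbox_head=dict(num_classes=1)))

data = dict(
    train=dict(ann_file='annotations/ivdh_train.json', img_prefix='images/train/'),
    val=dict(ann_file='annotations/ivdh_test.json', img_prefix='images/test/'),
    test=dict(ann_file='annotations/ivdh_test.json', img_prefix='images/test/'),
)

# Two-stage models were trained for 100 epochs, as described above.
runner = dict(type='EpochBasedRunner', max_epochs=100)
evaluation = dict(metric='bbox')
```

Training and testing can then be launched with mmdetection's standard tools/train.py and tools/test.py scripts.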

Evaluation metrics

Two key evaluation metrics are employed to quantify the IVDH detection accuracy: the precision-recall (PR) curve and average precision (AP)36. Before defining the PR curve and AP, the Intersection over Union (IoU)36 must first be introduced, as it determines what counts as a true positive detection. IoU is calculated as follows:

$$\mathrm{IoU} = \frac{\mathrm{area}\left( B_{p} \cap B_{gt} \right)}{\mathrm{area}\left( B_{p} \cup B_{gt} \right)}$$
(1)

where \(B_{p}\) is the predicted bounding box and \(B_{gt}\) is the ground truth bounding box. The IoU evaluates how well the predicted bounding box overlaps with the ground truth box. In this study, we used an IoU threshold of 0.5, meaning that a predicted box must have an IoU of at least 0.5 with a ground truth box to be considered a true positive detection.
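For reference, Eq. (1) can be computed directly from box coordinates; a minimal implementation, assuming boxes in [x1, y1, x2, y2] format, is shown below.

```python
def iou(box_p, box_gt):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    # Coordinates of the intersection rectangle.
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)

    # Union area = sum of both box areas minus the intersection area.
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0


# A prediction counts as a true positive when iou(pred_box, gt_box) >= 0.5,
# the IoU threshold used throughout this study.
```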

Given a specific IoU threshold, a PR curve can be plotted as the graphical representation of model precision (the ratio of true positive predictions to the total positive predictions) and recall (the ratio of true positive predictions to all actual positive instances) at various confidence threshold levels. Then, the average precision (AP) score can be defined as the area under the PR curve:

$$AP = \int_{0}^{1} p(r)\,dr$$
(2)

In Eq. (2), p(r) is the precision as a function of recall r. The integral computes the area under the precision-recall curve from recall 0 to 1. In practice, since the PR curve is discrete, the AP score is calculated as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight37. The AP score is therefore a comprehensive metric that combines precision and recall, providing a holistic view of model performance in object detection tasks.
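A compact sketch of this discrete AP computation (precision weighted by recall increments) is given below; it assumes that each detection has already been matched to the ground truth at the 0.5 IoU threshold, and the function name and inputs are illustrative.

```python
import numpy as np


def average_precision(scores, is_true_positive, num_ground_truths):
    """AP as the weighted mean of precisions, weighted by the increase in recall.

    scores            : confidence score of each detection
    is_true_positive  : 1 if the detection matches a ground-truth box (IoU >= 0.5), else 0
    num_ground_truths : total number of ground-truth lesions in the test set
    """
    order = np.argsort(scores)[::-1]                # sort detections by descending confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]

    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)   # precision after each detection
    recall = cum_tp / num_ground_truths             # recall after each detection

    # Weight each precision by the recall increment since the previous threshold.
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum(precision * (recall - prev_recall)))
```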

Statistical analysis

Statistical analysis is conducted on the AP score difference between one-stage and two-stage methods, as well as on the impact of the spine localization module. A bootstrap-based hypothesis test method38 is employed. Specifically, the test set is resampled with replacement to its original size, and this process is repeated 10,000 times to create 10,000 bootstrap datasets. For each bootstrap dataset, the AP scores of the detection methods are recalculated, resulting in estimated AP score distributions. To assess the statistical significance of the observed AP differences, the bootstrap distributions of AP differences are shifted to have a mean of zero to approximate the null distributions38. Subsequently, the p-value is calculated as the proportion of samples with a value as large as or larger than the observed difference. Such hypothesis tests are performed between the top-ranked one-stage and two-stage methods, the mid-ranked one-stage and two-stage methods, and the bottom-ranked one-stage and two-stage methods. Additionally, hypothesis tests are conducted for each method before and after incorporating the spine localization module.
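The bootstrap procedure can be sketched as follows; the per-subject record structure and the AP-scoring callables are simplified placeholders for the actual evaluation pipeline.

```python
import numpy as np


def bootstrap_p_value(subjects, ap_method_a, ap_method_b, n_boot=10000, seed=0):
    """Shifted-null bootstrap test for the AP difference between two detection methods.

    subjects    : list of per-subject detection and ground-truth records (placeholder structure)
    ap_method_* : callables that compute an AP score from a list of subjects
    """
    rng = np.random.default_rng(seed)
    observed = ap_method_a(subjects) - ap_method_b(subjects)

    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample subjects with replacement to the original test-set size.
        sample = [subjects[j] for j in rng.integers(0, len(subjects), len(subjects))]
        diffs[i] = ap_method_a(sample) - ap_method_b(sample)

    # Shift the bootstrap distribution to mean zero to approximate the null distribution.
    null = diffs - diffs.mean()
    # p-value: proportion of null samples at least as large as the observed difference.
    return float(np.mean(null >= observed))
```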

Transfer learning to the feline IVDH detection

Considering that IVDH affects various animal species beyond dogs1,39,40, we conducted a pilot study extending our research to cats. A feline IVDH dataset was constructed, consisting of 111 images from 63 cats, using the same acquisition and annotation methods as for the canine dataset. This dataset was also collected from a local veterinary hospital with written consent from the pet owners, and was subsequently divided into training (n = 33) and test (n = 30) sets.

With limited training samples, this feline dataset was used to explore the adaptability of models across species. We focused primarily on the Dynamic R-CNN model and evaluated four training strategies: (1) directly applying the canine model without model retraining (no retraining), (2) retraining the model on the feline dataset (retraining on cats), (3) retraining the model on a combined dataset of both dogs and cats (retraining on dogs and cats), and (4) using the canine model weights as a starting point and fine-tuning on the feline dataset (transfer learning). For strategies (1), (2), and (3), the learning rate was set to 2.5 × 10−3. In contrast, for the transfer learning approach, the learning rate was 2.5 × 10−5. Other model parameters remained consistent across all four methods.
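A PyTorch-style sketch of the transfer learning setup (strategy 4) is shown below; the checkpoint file name and optimizer choice are illustrative assumptions, while the reduced learning rate mirrors the 2.5 × 10−5 setting described above.

```python
import torch


def build_finetune_optimizer(model, canine_checkpoint='dynamic_rcnn_canine.pth'):
    """Initialize from canine IVDH weights, then fine-tune on the feline dataset."""
    # Strategy 4: start from the canine Dynamic R-CNN weights (checkpoint name is illustrative).
    state = torch.load(canine_checkpoint, map_location='cpu')
    model.load_state_dict(state.get('state_dict', state))

    # Fine-tuning uses a learning rate 100x smaller than retraining from scratch
    # (2.5e-5 instead of 2.5e-3); other hyperparameters stay unchanged.
    return torch.optim.SGD(model.parameters(), lr=2.5e-5, momentum=0.9, weight_decay=1e-4)
```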

Results

Figure 3 illustrates typical results for canine IVDH detection without the use of the spine localization module. Due to space limitations, only four representative methods are presented: YOLOv3, YOLOX, Faster R-CNN, and Dynamic R-CNN. These methods are chosen as they represent the earliest and most recent advancements in one-stage and two-stage methods. The first three rows show typical examples where all models effectively identify and locate most of the IVDH lesions, albeit with some false positives or negatives. It is observed that two-stage models tend to have fewer incorrect predictions compared to one-stage models. The fourth row presents a more complex case, where the models are more likely to generate false positives outside the spinal area, emphasizing the need for developing a spine localization module.

Figure 3

Visualization of deep learning based IVDH detection without using the spine localization module. The green boxes are ground truth labeled by the radiologists, while the red boxes are detections by the deep learning models. (A–C) Typical detection results, where all models could locate most of the IVDH lesions annotated by the radiologists (green boxes), despite the fact that the one-stage models (YOLOv3 and YOLOX) resulted in more false negatives. (D) A more challenging case, where the models were prone to false positives outside the spinal area. A confidence score threshold of 0.4 was used for all methods.

Figure 4 shows typical spine localization results and the corresponding PR curve, demonstrating the high accuracy of the trained spine localization module, which achieves a notable AP of 99.8%. This makes the proposed spine localization a highly reliable and fully automatic preprocessing step. Table 1 compares the AP scores of all tested models, both with and without the spine localization module. The results show that two-stage models outperform one-stage models irrespective of the inclusion of the spine localization module. The incorporation of this module particularly benefits Faster R-CNN and Dynamic R-CNN, with AP score increases of 5.93% and 4.18%, respectively. The positive impact of the spine localization module is further visible in the PR curves shown in Fig. 5, where curves including the module generally surpass those without it at most recall levels.

Figure 4

Spine localization results using the trained deep learning model. (A) Visualization of spine localization results on nine dogs of different body sizes. The green boxes are ground truth labeled by the radiologists, while the red boxes are detections by the deep learning model. (B) The overall precision-recall curve of spine localization. The spine localization module was found to be highly accurate as a fully automatic preprocessing step, with average precision reaching 99.8%.

Table 1 Average precision (AP) of the trained models with and without spine localization. The two-stage models outperformed the one-stage models, and the spine localization module improved IVDH detection accuracy for nearly all models.
Figure 5

Precision-recall (PR) curves of the IVDH detection models with and without the spine localization module. The spine localization module led to higher detection accuracy at nearly all recall levels.

The statistical analysis reveals that two-stage models consistently achieve significantly better AP scores than their one-stage counterparts using a p-value threshold of 0.05. Specifically, without the spine localization module, the top-ranked two-stage model significantly outperforms the top-ranked one-stage model (Cascade RPN versus YOLOX-S, p < 0.05), and similar significant differences are observed for the mid-ranked (Grid R-CNN versus FCOS, p < 0.05) and bottom-ranked (Faster R-CNN versus YOLOF, p < 0.01) models. With the spine localization module, the top-ranked two-stage model again significantly outperforms the top-ranked one-stage model (Dynamic R-CNN versus YOLOX-L, p < 0.05), and the same pattern holds for the mid-ranked (Cascade RPN versus FCOS, p < 0.05) and bottom-ranked (Grid R-CNN versus YOLOF, p < 0.0001) models. Furthermore, the AP improvements brought by the spine localization module are statistically significant (p < 0.05) for four models: YOLOX-L, Faster R-CNN, Cascade R-CNN, and Dynamic R-CNN. In addition, the p-value for FCOS is 0.054, which is close to the significance threshold.

Figure 6A presents a typical image with annotations from the feline dataset. Figure 6B shows the precision-recall (PR) curves for the four training strategies evaluated on the feline dataset, with the corresponding AP scores. The results highlight the difficulty of training an IVDH detection model using only the limited feline data, which leads to a low AP score of 29.82%. The model trained exclusively with canine data yields satisfactory results, achieving an AP score of 63.40%, suggesting certain similarities between the anatomical structures of the two species. Interestingly, training on a combined dataset of both cats and dogs results in a slightly lower score of 60.53%. This could be due to the complexity of learning the distinct image features of both species simultaneously. In contrast, the efficacy of our transfer learning strategy is evident, leading to the highest AP score of 67.65%. This underscores the effectiveness of transfer learning in adapting models for successful application to smaller, species-specific datasets.

Figure 6

Pilot test to adapt the IVDH detection model to the feline dataset. (A) A typical image from the feline IVDH dataset. (B) Precision-recall (PR) curves obtained using the four training strategies. The transfer learning strategy led to the best PR curve and the highest average precision among the four strategies. Average precision scores (in percent) are shown in brackets.

Discussion

In this study, we explored the capability of AI-assisted intervertebral disc herniation (IVDH) detection in veterinary medicine. A key development is the use of a spine localization module in the preprocessing phase. This module not only effectively removes false positives from outside the spinal area but also concentrates the model's attention on the spine region. Our internal tests indicate that this approach is more effective than directly applying models to full-sized images and subsequently using a spine localization module for false positive removal. These results highlight that in medical applications, tailored preprocessing/postprocessing strategies can be essential for pursuing high accuracy.

Another key insight from our study is the superior performance of two-stage models. Despite the rising popularity of one-stage detectors, even the relatively old two-stage method Faster R-CNN outperforms the recent one-stage model YOLOX. Two-stage models, which first generate region proposals and then classify these regions, are more effective for handling small target lesions. Also, the relatively slow inference speed of two-stage models is not a concern in most medical imaging applications like ours, since the imaging process itself takes minutes, whereas running the two-stage model, even with the extra spine localization module, takes less than 1 s per subject. This finding cautions against the blanket application of the newest computer vision models to medical contexts without considering their suitability. Indeed, Fig. 7 shows that models with higher accuracy on COCO41, a natural image detection dataset, do not necessarily achieve higher accuracy on animal IVDH detection.

Figure 7

Detection accuracy on the COCO dataset versus on the canine IVDH dataset for various models. Because the COCO dataset contains multiple classes of objects, mean average precision (mAP) is used as its accuracy metric. Higher mAP on COCO does not necessarily lead to higher average precision (AP) on IVDH detection.

The detection accuracy achieved in our study (AP score up to 75.32%) generally falls below that reported for human IVDH detection; for example, AP scores of 89.3%14 and 92.4%12 have been demonstrated for human lumbar disc herniation. Several factors likely contribute to this discrepancy. Firstly, the absolute size of spinal segments in dogs and cats is small. Secondly, a large imaging field-of-view (FOV), typically covering over 16 spinal segments, was used in this study. A large FOV is necessary for veterinary MRI due to the varied anatomy of animals and their inability to communicate specific pain points. This, however, leads to relatively large voxel sizes and even fewer voxels per spinal segment. In contrast, similar human studies10,11,12,13 often focus on imaging only a few spinal segments and thus have substantially more voxels per segment. Finally, due to the absence of standardized diagnostic criteria for animal IVDH, the annotations on samples presenting borderline conditions can be subjective and somewhat inconsistent, which further complicates the training of AI models. These factors together highlight the existing challenges and the importance of further research in AI-assisted veterinary medicine.

Our study also has limitations that should be addressed in future research. The absence of evaluations for other diseases means our findings may not fully reflect clinical accuracy. Additionally, we did not differentiate between acute and chronic intervertebral disc herniations, which is an important aspect of clinical diagnosis. Lastly, it is important to acknowledge that all images in this study were obtained from a single institution. Future work should explore multi-institutional, multi-vendor datasets to ensure broader applicability and reliability in different clinical settings.

The development of automated segment-level lesion localization, as in this study, is not in competition with image-level spinal disease classification16; rather, it serves as a complement, providing more detailed spatial information and additional interpretability of the image-level classification results. Combining these two methods could lead to a more comprehensive diagnostic approach, where the high-level perspective of image-level classification is merged with the detailed insights from segment-level analysis. Such integration has the potential to significantly improve diagnostic accuracy and inform more precise treatment strategies. From another perspective, optimizing imaging protocols and enhancing image reconstruction quality15 can also play a crucial role in supporting further improvements of animal IVDH diagnosis. For example, refining MRI sequences and applying advanced reconstruction algorithms for better contrast and spatial resolution could lead to clearer visualization of IVDH lesions, thereby aiding AI models in more reliable detection. One promising direction here is to incorporate 3D imaging techniques42 that can provide thinner slices to reveal subtle structural changes.

In summary, this study has demonstrated the feasibility of AI-assisted intervertebral disc herniation detection for veterinary care and has explored various strategies in preprocessing, model design, and training to effectively improve detection accuracy. Our proposed strategies, supported by experimental results, offer valuable guidelines for future research. They emphasize the importance of creating methodologies that go beyond merely replicating the latest AI advancements, focusing instead on addressing the specific challenges and needs of the target domain.