This study is based on a cohort collected during the Genodisc Project. The primary selection criterion for recruitment to Genodisc was “patients who seek secondary care for their back pain or spinal problem”. Genodisc sourced MRI scans from centers in the UK, Hungary, Slovenia and Italy. The scans were not standardized: they came from routine care in a number of different centers using different machines, and therefore varied in acquisition protocol. In this study, we excluded subjects whose MRI scan was of poor quality or in a non-DICOM format, and used only the T2 sagittal scans. The scans were annotated with various radiological scores (global, for the whole spine, and local, per disc) by a single experienced expert spinal radiologist (IMcC).
In all, we obtained images of 12,018 individual discs, up to six discs per patient, together with their scores. Some scans contained fewer than six discs, but the majority showed the complete lumbar region. To train and test the system, patients were divided into two sets: 90% in a training set of 1806 patients, and 10% in an independent test set of 203 patients. The test set was used to assess the accuracy, or concurrent validity, of the automated ratings, which were compared with the reference standard of the experienced radiologist’s ratings.
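The key property of the split described above is that it is made per patient, so that all discs from one patient fall in the same set. A minimal sketch of such a split is given below. The random shuffling, the seed, the `split_patients` helper and the patient identifiers are assumptions made for illustration; the text does not state how patients were assigned, and the resulting set sizes will not match the study's exactly.

```python
import random

def split_patients(patient_ids, test_fraction=0.1, seed=0):
    """Split unique patient IDs into disjoint training and test sets.

    Splitting by patient (not by disc) keeps all discs of one patient
    in the same set. Random assignment is an assumption of this sketch.
    """
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = round(len(unique_ids) * test_fraction)
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids

# Hypothetical disc records: (patient_id, disc_level)
LEVELS = ["T12-L1", "L1-L2", "L2-L3", "L3-L4", "L4-L5", "L5-S1"]
discs = [(f"P{p:04d}", level) for p in range(2009) for level in LEVELS]

train_ids, test_ids = split_patients([pid for pid, _ in discs])
train_discs = [d for d in discs if d[0] in train_ids]
test_discs = [d for d in discs if d[0] in test_ids]
```

Filtering the disc records by patient set, as in the last two lines, guarantees that no patient contributes discs to both training and testing.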
An overview of the system is shown in Fig. 1. The system uses routine MRI scans acquired from a DICOM file stored on a standard laptop computer. The first step in the analysis is the delineation of the vertebral bodies, followed by the discs. The discs are then analyzed for the desired radiological features and classified. Here, the automatically generated classification was compared with the radiologist’s score for each feature.
Intervertebral disc localization
The radiological scores are tied to individual intervertebral discs. The six discs per patient (T12 to sacrum) are usually visible in standard clinical MRI protocols, and they have to be detected accurately. In the first part of the study, we followed a conventional image analysis approach that detects vertebral bodies from T12 to S1 [9, 10]. From these detected vertebral bodies, a more suitable region is defined and annotated for each disc, i.e., T12–L1 to L5–S1, for each spine (Fig. 2).
The detection regions are in the form of 3D bounding volumes in the scan where each volume includes a disc and the surrounding upper and lower endplate regions. These volumes are normalized to reduce the signal inhomogeneity across a scan, and are centered on the detected middle slice for each disc to reduce lateral shifts (for example, from a scoliosis) (Fig. 3). Examples of the output regions are shown in Fig. 4.
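The two preprocessing steps described above, centering on the middle slice and intensity normalization, can be illustrated with a short sketch. The exact normalization is not specified in the text, so dividing by the median intensity is purely an assumption here, as are the `center_and_normalize` helper, the number of slices kept, and the nested-list volume layout.

```python
from statistics import median

def center_and_normalize(volume, mid_slice, n_slices=3):
    """Crop a disc volume around its middle slice and rescale intensities.

    volume:    list of sagittal slices, each a 2D list of intensities.
    mid_slice: index of the detected middle slice of the disc.
    n_slices:  number of slices to keep (hypothetical default).
    Median scaling is an assumed normalization, not the study's method.
    """
    half = n_slices // 2
    start = max(0, mid_slice - half)
    stop = min(len(volume), mid_slice + half + 1)
    kept = volume[start:stop]
    voxels = [v for sl in kept for row in sl for v in row]
    scale = median(voxels) or 1.0  # guard against an all-zero region
    return [[[v / scale for v in row] for row in sl] for sl in kept]

# Toy volume: 5 slices of 2x2 voxels with increasing brightness
volume = [[[i + 1, 2 * (i + 1)], [3 * (i + 1), 4 * (i + 1)]] for i in range(5)]
region = center_and_normalize(volume, mid_slice=2, n_slices=3)
```

After scaling, the median intensity of the retained region is 1, which makes regions from scanners with different overall brightness more comparable.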
Radiological scores classification
In the second part of the study, a classifier that predicts the radiological features is trained using the detected regions as the input, and the radiological features previously determined from the experienced radiologist’s assessments as the output. Since each intervertebral level/disc possesses eight radiological scores, the classifier should preferably be able to predict them simultaneously without human intervention. To this end, we opted for a convolutional neural network (CNN), which can both learn without feature crafting (human input) and classify multiple scores at once. Hence, there is no need to create individual descriptors suited to each radiological score. This method is the current state-of-the-art approach in machine learning, and employs deep learning. This is the use of multiple layers of abstraction to describe the relationship between the raw input data and the target output. Another advantage of using a CNN model as a classifier is the ease of troubleshooting the model’s predictions. For each prediction of a specific radiological score, there is a corresponding probability that indicates the model’s degree of confidence in that prediction.
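To make the idea of simultaneous prediction with per-score confidence concrete, the sketch below assumes that the network ends in one output head per radiological score and that a softmax converts each head's raw outputs into class probabilities. The softmax, the head names and the numeric values are assumptions for illustration and are not taken from the study; the convolutional layers that would produce these raw outputs are omitted.

```python
import math

def softmax(logits):
    """Convert raw outputs into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_scores(head_logits):
    """Return {score name: (predicted class index, probability)}.

    head_logits maps each score name to the raw outputs of its head,
    one value per class. Class indices are 0-based.
    """
    predictions = {}
    for name, logits in head_logits.items():
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions[name] = (best, probs[best])
    return predictions

# Hypothetical raw outputs for a single disc
head_logits = {
    "pfirrmann": [0.1, 0.3, 2.0, 0.5, -1.0],   # 5 grades
    "disc_narrowing": [1.5, 0.2, -0.3, -1.0],  # 4 grades
    "spondylolisthesis": [2.2, -0.7],          # binary
}
preds = predict_scores(head_logits)
```

A low probability for the predicted class flags a disc on which the model is uncertain, which is the kind of troubleshooting signal referred to above.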
This study has focused on six main radiological features that can be seen in part or totally on sagittal T2 images (Fig. 5): (1) Pfirrmann grading, (2) disc narrowing, (3) spondylolisthesis, (4) central canal stenosis, (5) endplate defects, and (6) marrow signal variations (Modic changes).
‘Pfirrmann grading’ classifies disc degeneration into 5 grades using criteria of disc signal heterogeneity, brightness of the nucleus and disc height. ‘Disc narrowing’ is defined as a multi-class measurement of the disc height, with 4 grades. In this study, ‘spondylolisthesis’ is a binary measure of the vertebral slip. ‘Central canal stenosis’ is the constriction of the central canal in the region adjacent to each intervertebral disc. The radiologist’s score is based on assessment of both sagittal and axial images, whereas our system uses only sagittal images, so we have studied only a binary ‘present’ or ‘absent’ stenosis. ‘Endplate defects’ are any deformities of the endplate regions, both upper and lower, with respect to the intervertebral disc. ‘Marrow signal variations’ may correspond to either Type 1 or Type 2 Modic changes; the two types are not distinguished here, as both T1 and T2 scans are needed to differentiate them. Both types of Modic changes manifest as visible signal variations at the endplate extending into the vertebral body, observed on a T2 scan. Table 1 shows a summary of the grading of each radiological feature, and Fig. 5 shows examples of each radiological score and some output examples of the system.
The performance measure used for validation was ‘class average accuracy’, which is generally used in image analysis systems for highly unbalanced classifications such as those in the Genodisc dataset, where, e.g., only 9% of discs showed upper marrow changes [16, 17]. For our benchmark, the average class intra-rater agreement was calculated from two separate sets of labels by the same radiologist on a subset of the dataset consisting of 121 patients. We are essentially comparing the radiologist’s reliability against the reproducibility of our model.
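The reason for preferring class average accuracy on unbalanced data can be seen in a small worked example. The sketch below takes the measure to be the mean of the per-class accuracies, which is its usual definition; the `class_average_accuracy` helper and the toy labels are illustrative assumptions, with the 9% prevalence chosen to echo the figure quoted above.

```python
from collections import defaultdict

def class_average_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

def plain_accuracy(y_true, y_pred):
    """Fraction of all samples that are labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example: 9% of discs positive, classifier always predicts 'absent'
y_true = [0] * 91 + [1] * 9
y_pred = [0] * 100

print(plain_accuracy(y_true, y_pred))          # 0.91
print(class_average_accuracy(y_true, y_pred))  # 0.5
```

A classifier that never detects the rare class scores 0.91 on plain accuracy but only 0.5 on class average accuracy, since the accuracy on the positive class is zero.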