3D Intervertebral Disc Segmentation from MRI Using Supervoxel-Based CRFs

  • Hugo Hutt
  • Richard Everson
  • Judith Meakin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9402)


Segmentation of intervertebral discs from three-dimensional magnetic resonance images is a challenging problem with numerous medical applications. In this paper we describe a fully automated segmentation method based on a conditional random field operating on supervoxels. A mean Dice score of \(90\pm 3\) % was obtained on data provided for the intervertebral disc localisation and segmentation challenge held in conjunction with the 3rd MICCAI Workshop & Challenge on Computational Methods and Clinical Applications for Spine Imaging (MICCAI–CSI2015).



1 Introduction

Segmentation of intervertebral discs in three-dimensional (3D) magnetic resonance (MR) images is an important step in many applications, but remains a difficult problem for automated computational methods. In this paper we report a fully automated method and evaluate it on data provided for the intervertebral disc localisation and segmentation challenge held in conjunction with the 3rd MICCAI Workshop & Challenge on Computational Methods and Clinical Applications for Spine Imaging (MICCAI–CSI2015). The approach described here is adapted from a method for segmenting vertebrae in MR imaging and computed tomography (CT) data that we introduced in our previous work [1, 2]. The method is based on a conditional random field operating on supervoxels and incorporating a classifier and distance metric learned on sparse supervoxel features. Compared to our previous work, the most notable difference is that we do not use any location features for intervertebral disc segmentation. In Sect. 2 we give a brief description of the main components of the method, before presenting the segmentation results.

2 Methods

2.1 Supervoxels

We formulate the segmentation problem as one of assigning class labels to supervoxels (groups of similar voxels). Operating on supervoxels enables more descriptive features to be extracted over the supervoxel regions and greatly reduces computational complexity compared to operating directly on the individual voxels of the images. To generate supervoxels for a volume, we use a modified version of simple linear iterative clustering (SLIC) [3] which results in supervoxels with approximately equal physical extent in all directions. We determine the supervoxel parameters empirically by searching for the maximum supervoxel size that still preserves almost all object boundaries in the training images.

2.2 Multi-scale Dictionary Learning

We aim to characterise the supervoxels by extracting descriptive features from them which can be used to learn a model from training data to estimate the class label (i.e. disc or background). We next describe our supervoxel features, which are obtained by encoding and pooling the responses from learned multi-scale dictionaries of linear filters.

To learn the dictionaries, we first construct a Gaussian pyramid representation for each of the training volumes by successive smoothing and downsampling by a factor of 2. We then randomly sample 100 000 patches of dimension \(5 \times 5 \times 5\) voxels from each pyramid level of the training images and reshape them into vectors \(\{\mathbf {v}_i\}_{i=1}^M\). The sampled vectors are whitened and then used to learn a separate dictionary of filters for each pyramid level via sparse coding [4]. For the results given in this paper we used 3-level pyramids and learned a dictionary of 128 filters at each level of the training pyramids. This results in a set of dictionaries which capture large-scale structure in the volumes, due to being learned over multiple scales, yet remain very efficient to compute.
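One level of this pipeline can be sketched with scikit-learn's dictionary learner standing in for the sparse coding of [4]; the data, patch count, and dictionary size here are toy values, not those used in the paper:

```python
# Sketch of learning one dictionary level from random 3D patches.
import numpy as np
from sklearn.decomposition import PCA, MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
patches = rng.standard_normal((2000, 125))   # 5x5x5 patches flattened; toy data

# Whiten the sampled vectors before dictionary learning.
whitener = PCA(whiten=True).fit(patches)
white = whitener.transform(patches)

# Learn a dictionary of filters by sparse coding on the whitened vectors.
dico = MiniBatchDictionaryLearning(n_components=32, alpha=1.0,
                                   batch_size=256, random_state=0)
dico.fit(white)
D = dico.components_.T                       # shape (125, 32): one filter per column
print(D.shape)
```

In the full method this is repeated once per pyramid level, giving one dictionary per scale.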

To obtain the final supervoxel features for a volume, patches are first sampled densely over the pyramid using a step-size of 2 voxels. The sampled patches are then encoded using non-linear functions of the linear filter responses, given by
$$\begin{aligned} \mathbf {u}_i = \max \left\{ 0, \bigl [-\mathbf {D}, \mathbf {D}\bigr ]^{\top }\mathbf {v}_i\right\} , \end{aligned}$$
where \(\bigl [-\mathbf {D}, \mathbf {D}\bigr ]\) is a matrix formed by column-wise concatenation of the dictionary \(\mathbf {D}\) of learned filters.1 Features corresponding to different levels of the pyramid are concatenated into a single vector at each location. The densely extracted features are then aggregated within each supervoxel using a max pooling operation and \(\ell _2\)-normalised to unit length.
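The encoding and pooling steps can be written in a few lines of NumPy; the dictionary, patch vectors, and supervoxel assignment below are synthetic placeholders:

```python
# Sketch of the split rectified encoding u = max(0, [-D, D]^T v) followed by
# supervoxel max pooling and l2 normalisation.
import numpy as np

rng = np.random.default_rng(1)
D = rng.standard_normal((125, 32))           # dictionary: one filter per column
V = rng.standard_normal((125, 500))          # densely sampled patch vectors
sv_ids = rng.integers(0, 10, size=500)       # supervoxel id of each patch location

# Non-linear encoding: separate positive and negative filter responses.
U = np.maximum(0.0, np.hstack([-D, D]).T @ V)    # shape (64, 500)

# Max pool the codes within each supervoxel, then l2-normalise.
features = np.zeros((10, 64))
for s in range(10):
    features[s] = U[:, sv_ids == s].max(axis=1)
features /= np.linalg.norm(features, axis=1, keepdims=True) + 1e-12
print(features.shape)
```

Splitting the responses into \([-\mathbf{D}, \mathbf{D}]\) before rectification preserves the sign information of each filter response while keeping all features non-negative for the max pooling.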

2.3 CRF with Learned Potentials

Given the set of supervoxel feature vectors for a volume, we define a conditional random field (CRF) over the supervoxels that relates the features to the underlying class labels. The resulting model promotes spatial consistency of the labels and enables segmentation to be carried out very efficiently using graph cut algorithms.
Fig. 1.

(a), (c) Segmentation results overlaid onto a mid-sagittal slice from two subjects (numbers 2 and 10, respectively). (b), (d) Overlap between the CRF segmentations (cyan) and manual annotations (magenta). (Color figure online)

The CRF defines a conditional distribution over the supervoxel class labels \(\mathbf {x}\) given the features \(\mathbf {y}\) and can be formulated in terms of an energy function. The energy function is a sum over a unary data term and a pairwise smoothness term:
$$\begin{aligned} E(\mathbf {x}, \mathbf {y}) = \sum _{i \in \mathcal {S}} \underbrace{\psi (\mathbf {y}_i \mid x_i)}_{\text {Data term}} + \lambda \sum _{i \in \mathcal {S}} \sum _{j \in \mathcal {N}_i} \underbrace{\phi (\mathbf {y}_i, \mathbf {y}_j \mid x_i, x_j)}_{\text {Smoothness term}}, \end{aligned}$$
where \(\mathcal {S}\) is the set of supervoxels and \(\mathcal {N}_i\) are the neighbours of supervoxel i. The constant \(\lambda \) controls the relative importance of the data and smoothness terms. The data term of the CRF is defined as the negative log likelihood of the supervoxel feature vector given the class label:
$$\begin{aligned} \psi (\mathbf {y}_i \mid x_i) = -\log \Big (P(\mathbf {y}_i \mid x_i)\Big ). \end{aligned}$$
The likelihood \(P(\mathbf {y}_i \mid x_i)\) is the probability estimate for the supervoxel given by a learned support vector machine (SVM) classifier. We train the SVM on labelled supervoxel examples using a generalised RBF kernel, given by
$$\begin{aligned} K(\mathbf {y}_i, \mathbf {y}_j) = \exp \Big (-\gamma (\mathbf {y}_i - \mathbf {y}_j)^{\top }\mathbf {M}(\mathbf {y}_i - \mathbf {y}_j)\Big ), \end{aligned}$$
where \(\gamma \) is an overall kernel width parameter. The matrix \(\mathbf {M}\) defines a pseudometric between supervoxel features which we learn from training data using the large margin nearest neighbour (LMNN) [5] algorithm. The smoothness term of the CRF incorporates the learned distance metric as follows
$$\begin{aligned} \phi (\mathbf {y}_i, \mathbf {y}_j \mid x_i, x_j) = {\left\{ \begin{array}{ll} \exp \Big (-(\mathbf {y}_i - \mathbf {y}_j)^{\top }\mathbf {M}(\mathbf {y}_i - \mathbf {y}_j)\Big ) &{} \text {if} \quad x_i \ne x_j \\ 0 &{} \text {otherwise} \end{array}\right. }, \end{aligned}$$
which penalises neighbouring supervoxels that have similar feature vectors and are assigned to different classes.
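The learned-metric kernel can be plugged into a standard SVM via a precomputed Gram matrix; in this sketch \(\mathbf{M}\) is an identity placeholder rather than a matrix learned with LMNN, and the features and labels are synthetic:

```python
# Sketch of the generalised RBF kernel with a learned pseudometric M,
# used with an SVM through a precomputed Gram matrix.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
Y = rng.standard_normal((40, 8))             # supervoxel feature vectors
labels = (Y[:, 0] > 0).astype(int)           # toy disc/background labels
M = np.eye(8)                                # placeholder for the LMNN metric
gamma = 0.5

def metric_kernel(A, B):
    # K(a, b) = exp(-gamma * (a - b)^T M (a - b)) for all pairs.
    diff = A[:, None, :] - B[None, :, :]
    d2 = np.einsum('ijk,kl,ijl->ij', diff, M, diff)
    return np.exp(-gamma * d2)

svm = SVC(kernel='precomputed', probability=True, C=5.0)
svm.fit(metric_kernel(Y, Y), labels)
probs = svm.predict_proba(metric_kernel(Y, Y))   # P(y_i | x_i) estimates
print(probs.shape)
```

Note that the same quadratic form \((\mathbf{y}_i - \mathbf{y}_j)^{\top}\mathbf{M}(\mathbf{y}_i - \mathbf{y}_j)\) appears in both the kernel and the smoothness term, so the metric is learned once and reused.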
We compute soft estimates (max-marginals) for the supervoxels \(P(x_i \mid \mathbf {y}_i)\) from the CRF and obtain the final voxel-level segmentation by thresholding the max-marginals after smoothing with a Gaussian filter.
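This final voxel-level step can be sketched as follows; the supervoxel map and max-marginals are synthetic, and the smoothing width and threshold are illustrative:

```python
# Sketch of the final step: map supervoxel max-marginals back to voxels,
# smooth with a Gaussian filter, and threshold.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(3)
sv_labels = rng.integers(0, 50, size=(16, 32, 32))   # supervoxel id per voxel
max_marginals = rng.random(50)                       # P(x_i = disc | y_i)

prob_volume = max_marginals[sv_labels]               # voxel-level probabilities
smoothed = gaussian_filter(prob_volume, sigma=1.0)
segmentation = smoothed > 0.5                        # binary disc mask
print(segmentation.shape, segmentation.dtype)
```

Smoothing before thresholding softens the blocky supervoxel boundaries so that the final mask is not constrained to the supervoxel tessellation.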
Table 1.

Segmentation results on the training dataset. The table shows the Dice score (%) and average absolute surface distance (ASD) (mm) for each individual subject. The final column gives the median value over all subjects.

3 Results

The method was evaluated on a dataset consisting of T2-weighted turbo spin echo MR images from 25 different subjects provided for the MICCAI–CSI2015 intervertebral disc localisation and segmentation challenge [6].2 Each image contains intervertebral discs of the lower spine from T11 to L5. A total of 7 discs in each image have been manually annotated. The complete dataset is split into an initial training dataset consisting of 15 subjects and two test datasets each consisting of 5 subjects.

Leave-one-out testing was used to evaluate the performance of the method on the training dataset. At each iteration the model was learned on the 14 training subjects and then evaluated on the single held out subject. The process was repeated for all subjects, thus ensuring that the training and test data were always from separate subjects. For each test subject, parameters, such as \(\lambda \), were learned using leave-one-out cross validation on the 14 training subjects; here too, the validation subject was always separate from the remaining 13 training subjects. The average value of \(\lambda \) over all leave-one-out iterations was 0.67, while the average values of the SVM regularisation and kernel parameters were \(C = 5.24\) and \(\gamma = 0.52\). The execution time for processing a single volume after learning was approximately 6 min using an Intel Core i5 2.50 GHz machine with 8 GB of RAM.
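The nested leave-one-out protocol can be sketched as below; the candidate \(\lambda\) values and the inner scoring are stand-ins (a real run would train the CRF and score Dice on the held-out subject):

```python
# Sketch of the nested leave-one-out protocol: the outer loop holds out one
# subject, the inner loop tunes lambda on the remaining 14.
import numpy as np
from sklearn.model_selection import LeaveOneOut

subjects = np.arange(15)
lambdas = [0.25, 0.5, 1.0]                   # illustrative candidate values

chosen = []
for train_idx, test_idx in LeaveOneOut().split(subjects):
    train = subjects[train_idx]
    # Inner leave-one-out over the 14 training subjects to pick lambda.
    inner_scores = {lam: [] for lam in lambdas}
    for tr_idx, val_idx in LeaveOneOut().split(train):
        for lam in lambdas:
            # Placeholder score; a real run would train with this lambda on
            # train[tr_idx] and evaluate Dice on train[val_idx].
            inner_scores[lam].append(-abs(lam - 0.5))
    best = max(lambdas, key=lambda lam: np.mean(inner_scores[lam]))
    chosen.append(best)
print(len(chosen))
```

The key property is that the held-out subject never influences the parameter choice, inside or outside the inner loop.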

The segmentations were evaluated using the Dice score and average absolute surface distance (ASD); the results on the training dataset are summarised in Table 1. The mean Dice score on the training dataset was \(90 \pm 3\) % and the mean ASD was \(0.63 \pm 0.32\) mm. On the two test datasets the mean Dice scores were \(90 \pm 4\) % and \(91\pm 3\) %; the mean ASD values were \(1.24 \pm 0.24\,\text {mm}\) and \(1.19 \pm 0.20\,\text {mm}\). Figure 1 provides a visual comparison, for single slices of the 3D segmentation, between example automatic segmentations and manual annotations.
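The two metrics can be sketched on synthetic masks; this is one common way to compute the average absolute surface distance via distance transforms, and may differ in detail from the challenge's evaluation code:

```python
# Sketch of the evaluation metrics: Dice overlap and average absolute
# surface distance (ASD) between two binary masks.
import numpy as np
from scipy.ndimage import distance_transform_edt, binary_erosion

def dice(a, b):
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def surface(mask):
    # Surface voxels: in the mask but not in its erosion.
    return mask & ~binary_erosion(mask)

def asd(a, b, spacing=(1.0, 1.0, 1.0)):
    # Distance from each surface voxel of one mask to the other mask's
    # surface, averaged symmetrically over both directions.
    da = distance_transform_edt(~surface(a), sampling=spacing)
    db = distance_transform_edt(~surface(b), sampling=spacing)
    dists = np.concatenate([da[surface(b)], db[surface(a)]])
    return dists.mean()

a = np.zeros((16, 16, 16), dtype=bool); a[4:12, 4:12, 4:12] = True
b = np.zeros((16, 16, 16), dtype=bool); b[5:13, 4:12, 4:12] = True
print(dice(a, b), asd(a, b))
```

Dice measures volumetric overlap, while ASD is sensitive to boundary placement, which is why the paper reports both.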

4 Conclusion

We described a method for automated segmentation of intervertebral discs from MR imaging data based on a conditional random field on supervoxels. The method was shown to obtain accurate and consistent results on the challenge data, with each volume taking approximately 6 min to segment. An advantage of our approach is its generality, which means it can be applied to segment different structures without fundamental change to the model. This is illustrated by the consistently good performance of the method on the intervertebral disc dataset, along with previous vertebra segmentation results obtained on MR imaging and CT data.


  1. Separate dictionaries are used to encode patches at different levels of the pyramid.

  2. Available from the SpineWeb:


We are grateful to the organisers of the challenge and to the SpineWeb initiative for making the data available.


  1. Hutt, H., Everson, R., Meakin, J.: Segmentation of lumbar vertebrae slices from CT images. In: Yao, J., et al. (eds.) CSI 2014. LNCVB, vol. 20, pp. 61–71. Springer, Switzerland (2015)
  2. Hutt, H., Everson, R., Meakin, J.: 3D segmentation of the lumbar spine from MRI using supervoxel-based CRFs. Technical report, University of Exeter, UK (2015)
  3. Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2274–2282 (2012)
  4. Olshausen, B., Field, D.: Emergence of simple-cell receptive field properties. Nature 381(6583), 607–609 (1996)
  5. Weinberger, K., Saul, L.: Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 10, 207–244 (2009)
  6. Chen, C., Belavy, D., Yu, W., Chu, C., Armbrecht, G., Bansmann, M., Felsenberg, D., Zheng, G.: Localization and segmentation of 3D intervertebral discs in MR images by data driven estimation. IEEE Trans. Med. Imaging 34(8), 1719–1729 (2015)

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. University of Exeter, Exeter, UK
