Introduction

Chiari II malformation is a complex developmental malformation of the central nervous system. It is characterized by a small posterior fossa and downward displacement of the cerebellum and brainstem through an enlarged foramen magnum (hindbrain herniation) [1]. Chiari II malformation is almost uniquely associated with open spinal dysraphism [2]. McLone and Knepper [3] hypothesized that leakage of cerebrospinal fluid through the spinal anomaly reduces the distention of the embryonic ventricular system. The decreased inductive pressure on the surrounding mesenchyme results in an abnormally small posterior fossa. Approximately one-third of the patients with Chiari II malformation develop signs and symptoms of brainstem compression [4]. The mortality in this symptomatic group is 15 to 35 % [5, 6].

Usually, Chiari II malformation is clinically diagnosed with the help of MR imaging to assess severity. Although the malformation is characterized by a constellation of morphological features [7–11], the evaluation of MR images may not always be straightforward. A previous study showed that the assessment of several features is unreliable because judgment of these features varied between observers (see part 1). Assessment of MR images is complicated by the morphological diversity of the malformation, the qualitative nature of the features, and the fact that the distinction between normal and abnormal brain development is not defined by an unambiguous cutoff point.

Still, brain MR imaging plays a substantial role in clinical decision making regarding the management of children with spina bifida [9, 10, 12]. On the one hand, the discussion on selective treatment of severely affected newborn infants is still ongoing [13]. On the other hand, fetal imaging and prenatal surgery are becoming increasingly important. Recently, a randomized controlled trial showed important improvement of hindbrain herniation following prenatal surgery for spina bifida [14]. However, the assessment of Chiari II malformation may be even more complicated in prenatal MR imaging. A discrepancy of 41 % was seen in judgment of the degree of hindbrain herniation in prenatal MR imaging studies [15]. When choices have to be made about pre- and postnatal treatment options, morphometric analyses may improve the assessment of severity of Chiari II malformation on MR images in clinical and research settings. Measurements of the cerebellum, brainstem, and posterior fossa may give quantitative information about the extent of the malformation and may provide objective cutoff points between normal and abnormal brain development. A few morphometric studies on Chiari II malformation have been reported [16–21]. These studies generally focused on the small posterior fossa and the degree of cerebellar herniation in the midsagittal plane, but not on dimensions in the axial or coronal planes. Interobserver reliability and diagnostic performance of the morphometric measures are hardly addressed in the literature.

Therefore, we investigated the interobserver reliability and diagnostic performance of morphometric measures of the cerebellum, brainstem, and posterior fossa, not only in the midsagittal plane but also in the axial and coronal plane, to select appropriate measures for the MR assessment of Chiari II malformation.

Materials and methods

Patients

Brain MR images of 79 children [mean age 10.6 (SD 3.2; range, 6–16) years] were evaluated. Of these children, 43 children had spinal dysraphism (26 with open spinal dysraphism and 17 with closed spinal dysraphism [22]). The majority of these children (n = 36) were recruited at the outpatient clinics of Pediatric Neurology of the Radboud University Nijmegen Medical Centre (RUNMC) as part of a prospective research program dedicated to the outcome and prognosis of spina bifida. MR images of the remaining seven children were obtained retrospectively from the archives of the Department of Radiology of the RUNMC, from which we also obtained brain MR images of 36 children without spinal dysraphism. Although MR imaging in these 36 children was performed with suspicion of or to rule out cerebral pathology, the MR images had been assessed as normal by an independent radiologist in a clinical setting before the start of the study. All 79 children were reassessed for Chiari II malformation using the criteria: cerebellar herniation on a sagittal MR image and the presence of open spinal dysraphism. Consequently, the study population consisted of three diagnostic groups: 23 children with spinal dysraphism and Chiari II malformation [SDCM+ group; mean age 11.4 (SD 2.9; range, 6–16) years], 20 children with spinal dysraphism, but without Chiari II malformation [SDCM− group; mean age 10.9 (SD 3.1; range, 7–16) years], and 36 children without spinal dysraphism or cerebral pathology [reference group; mean age 9.9 (SD 3.2; range, 6–16) years].

MR imaging

All MR images were acquired using a 1.5-T MR imaging unit (Siemens Avanto; Siemens Medical Solutions, Erlangen, Germany) with a standard head coil. MR imaging in the 36 children who were part of the prospective research program consisted of T1-weighted images in the sagittal plane and T2-weighted images in the axial and coronal planes. The retrospectively obtained MR images were acquired using comparable sequences. For different reasons, MR images were not acquired in three planes for all 79 children. Images in the sagittal plane were available for 69 children (21 in the SDCM+ group, 20 in the SDCM− group, and 28 in the reference group), images in the axial plane for 58 children (19 in the SDCM+ group, 13 in the SDCM− group, and 26 in the reference group), and images in the coronal plane for 51 children (18 in the SDCM+ group, 19 in the SDCM− group, and 14 in the reference group).

The Regional Committee on Research involving Human Subjects approved the study protocol. Prior to inclusion in the study, written informed consent was obtained from the parents of all 36 children and all children above 12 years of age taking part in the prospective research program.

Image analysis

All MR images were blinded for demographic and diagnostic information. The MR images of the three diagnostic groups were mixed and arranged by plane into three data sets: a sagittal set, an axial set, and a coronal set. These three data sets were reviewed independently by three observers: a junior pediatric neurologist (N.G.) with 6 years of experience in reviewing pediatric brain MR images, a senior pediatric neurologist (R.A.M.), and a senior neuroradiologist (T.V.), both with more than 20 years of experience in reviewing pediatric brain MR images. The images were available on compact disks and were reviewed on an Agfa workstation or on a personal computer using Agfa software (Impax Client, release 4.5).

The MR images were reviewed for 13 sagittal, 4 axial, and 4 coronal morphometric measures (Table 1). Most of the measures in the sagittal plane were selected from the literature. The measures in the axial and coronal plane were defined by the authors to appraise the width of the cerebellum, the degree of wrapping of the cerebellar hemispheres around the brainstem, and the degree of upward tentorial herniation of the cerebellar hemispheres.

Table 1 Morphometric measures of Chiari II malformation

First, the feasibility of the protocol was evaluated in a pilot study (n = 10), resulting in the final set of measures with their definitions. Measures were assessed to the nearest tenth of a millimeter. If an observer could not identify a landmark or could not assess the measure for other reasons, the measurement was classified as “indeterminable.”

Statistical analysis

For each measure, the indeterminable measurements were tallied up per observer to assess the feasibility of each measure. If at least two observers considered a measure to be indeterminable in more than 5 % of the MR images, the measure was qualified as unfeasible and subsequently excluded from the further analyses.
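The feasibility rule described above can be illustrated with a minimal sketch. It is not the software used in the study; the function names, the use of `None` to encode an indeterminable measurement, and the example data are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def indeterminable_fraction(values: List[Optional[float]]) -> float:
    """Fraction of images for which one observer recorded no value (None)."""
    if not values:
        raise ValueError("no measurements supplied")
    return sum(v is None for v in values) / len(values)


def is_feasible(by_observer: Dict[str, List[Optional[float]]],
                threshold: float = 0.05,
                min_observers: int = 2) -> bool:
    """Return False when at least `min_observers` observers marked the
    measure indeterminable in more than `threshold` of the images."""
    n_over = sum(indeterminable_fraction(v) > threshold
                 for v in by_observer.values())
    return n_over < min_observers


# Hypothetical data: 20 images, three observers, None = "indeterminable".
example = {
    "A": [1.2] * 18 + [None] * 2,  # 10 % indeterminable
    "B": [1.3] * 17 + [None] * 3,  # 15 % indeterminable
    "C": [1.1] * 20,               # 0 % indeterminable
}
print(is_feasible(example))  # two observers exceed 5 %, so not feasible
```

In this sketch a measure that is indeterminable in exactly 5 % of the images for an observer does not count against feasibility, because the rule is stated as "more than 5 %".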

The interobserver agreement of the feasible measures was quantified by the agreement index (AI), defined as AI = 1 − RRE, where RRE denotes the relative random measurement error expressed as the pooled coefficient of variation across patients of the observations made by the three observers. This AI can be seen as an extension to more than two observers of the AI defined for two observations per patient [23, 24]. The relative random measurement error was used instead of the absolute random measurement error in order to compare measures among each other. An AI ≥ 0.90 was considered to indicate reliable interobserver agreement. Using this method, the overall interobserver agreement, the interobserver agreement between pairs of observers, and the interobserver agreement per diagnostic group were calculated.
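To make the definition AI = 1 − RRE concrete, the following sketch computes an agreement index from repeated measurements. The exact pooling formula is not spelled out in the text, so the sketch assumes one common reading: the coefficient of variation (SD divided by mean) is computed per patient across observers, and these are pooled as the root mean square. The function name and the example values are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def agreement_index(measurements: Sequence[Sequence[float]]) -> float:
    """measurements[i] holds the values of all observers for patient i.

    Assumption: RRE is the root mean square of the per-patient
    coefficients of variation (one possible form of pooling).
    """
    squared_cvs = []
    for obs in measurements:
        if len(obs) < 2:
            raise ValueError("need at least two observers per patient")
        m = mean(obs)
        if m == 0:
            raise ValueError("coefficient of variation undefined for mean 0")
        # abs() keeps the CV positive for measures recorded with a sign
        squared_cvs.append((stdev(obs) / abs(m)) ** 2)
    rre = math.sqrt(mean(squared_cvs))
    return 1.0 - rre


# Hypothetical values (mm) for two patients, three observers each.
ai = agreement_index([[9.0, 10.0, 11.0], [20.0, 20.0, 20.0]])
print(ai >= 0.90)  # compare against the reliability threshold
```

Because the error is expressed relative to the mean, measures with values close to zero (for example, levels measured relative to a reference line) can yield a very low index even when the absolute disagreement is small.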

The reliable measures were also analyzed for diagnostic performance regarding Chiari II malformation. Initially, the measurements of observer A were used for this purpose. Differences between the three diagnostic groups were analyzed with the Kruskal–Wallis test. Using the diagnosis of Chiari II malformation (defined as cerebellar herniation on a sagittal MR image and presence of open spinal dysraphism) as the reference standard, a receiver operating characteristic (ROC) curve was constructed for each measure. The area under the ROC curve (AUC) and its 95 % confidence interval (CI) were calculated to assess the diagnostic performance. The cutoff value with the optimal sensitivity and specificity was ascertained from the curve. Subsequently, the consistency of the measures with a high diagnostic performance (AUC > 0.90) was assessed using the measurements of the other two observers. All statistical analyses were performed using SPSS software version 14.0.1.
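The ROC steps can be sketched as follows. The analyses in this study were run in SPSS; the code below is only an illustration in plain Python. It assumes that "optimal sensitivity and specificity" means maximizing Youden's J (sensitivity + specificity − 1), which the text does not state, and it omits the confidence interval of the AUC. All function names and data are hypothetical.

```python
from typing import Sequence, Tuple


def auc(affected: Sequence[float], unaffected: Sequence[float]) -> float:
    """AUC as the probability that an affected child has a larger value
    than an unaffected child (ties count one half)."""
    wins = 0.0
    for a in affected:
        for u in unaffected:
            if a > u:
                wins += 1.0
            elif a == u:
                wins += 0.5
    return wins / (len(affected) * len(unaffected))


def sens_spec_at(cutoff: float, affected: Sequence[float],
                 unaffected: Sequence[float]) -> Tuple[float, float]:
    """Sensitivity and specificity when 'value >= cutoff' means affected."""
    sens = sum(a >= cutoff for a in affected) / len(affected)
    spec = sum(u < cutoff for u in unaffected) / len(unaffected)
    return sens, spec


def best_cutoff(affected: Sequence[float],
                unaffected: Sequence[float]) -> Tuple[float, float, float]:
    """Return (cutoff, sensitivity, specificity) maximizing Youden's J.
    Assumption: Youden's J defines the 'optimal' point."""
    best, best_j = (float("nan"), 0.0, 0.0), -1.0
    for c in sorted(set(affected) | set(unaffected)):
        sens, spec = sens_spec_at(c, affected, unaffected)
        if sens + spec - 1.0 > best_j:
            best_j, best = sens + spec - 1.0, (c, sens, spec)
    return best


# Hypothetical measurements (cm) from one observer.
affected_a = [1.6, 1.8, 1.9, 2.1]
unaffected_a = [0.9, 1.0, 1.2, 1.7]
print(auc(affected_a, unaffected_a))
cutoff, sens, spec = best_cutoff(affected_a, unaffected_a)

# Consistency check: apply the same cutoff to another observer's values.
print(sens_spec_at(cutoff, [1.5, 1.8, 2.0], [1.0, 1.6, 1.1]))
```

For a measure that is smaller in affected children, such as cerebellar width, the values can be negated before they are passed to these functions so that larger still means more abnormal.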

Results

Reliability

Most measures turned out to be feasible, except for fourth ventricle level in the sagittal plane and vermis length in the axial and coronal planes. These three measures were excluded from the further interobserver agreement and diagnostic performance analyses.

The interobserver agreement of the remaining measures is presented in Table 2. For most measures, the interobserver agreement was reliable (AI ≥ 0.90), both overall and per diagnostic group. In general, the agreement was slightly weaker in the SDCM+ group than in the other diagnostic groups, but this difference was only meaningful for tentorial length. The agreement was very poor for vermis level, tonsil level, and cisterna magna width. The interobserver agreement for pairs of observers showed that the poor agreement for cisterna magna width and tonsil level was not observer dependent. The poor agreement for vermis level, however, was observer dependent (Table 3). For all other measures, pairwise agreement did not differ among pairs of observers.

Table 2 Agreement indexes (calculated as 1 − RRE; for further details, see “Materials and methods”) of morphometric measures overall and per diagnostic group
Table 3 Agreement indexes (calculated as 1 − RRE; for further details, see “Materials and methods”) of the three measures with poor interobserver agreement, overall and by observer pair

Diagnostic performance

In the sagittal and axial plane, all but one measure differed statistically significantly between the SDCM+ group and the other two diagnostic groups (Table 4). In the coronal plane, only cerebellar width was statistically significantly smaller in the SDCM+ group than in the other two groups. No differences were present between the SDCM− group and the reference group.

Table 4 Measurements (mean values in cm) by diagnostic group (data obtained from observer A)

The diagnostic performance of the measures based on the data from observer A is presented in Table 5 and illustrated by ROC curves in Fig. 1. The AUC was substantial (>0.90) for five measures: foramen magnum diameter, pons length, pons thickness, and mamillopontine distance in the sagittal plane (Fig. 2), and cerebellar width in the axial plane (Fig. 3). For pons length and pons thickness, however, sensitivity or specificity was only moderate. Consistency of the performance of these five measures was evaluated using the measurement values of observers B and C (Table 6). In this analysis, only mamillopontine distance and cerebellar width maintained their excellent diagnostic performance. Despite the high sensitivity and specificity in the primary analysis, foramen magnum diameter failed the consistency test.

Table 5 Results of ROC analyses showing the diagnostic performance of Chiari II malformation measures (data obtained from observer A)
Fig. 1 Receiver operating characteristic curves for measures with a good diagnostic performance (AUC > 0.90). See Table 5 for further details

Fig. 2 a Sagittal T1-weighted brain MR image of a 16-year-old child with open spinal dysraphism and Chiari II malformation. The arrows indicate foramen magnum diameter (FM), pons length (PL), and pons thickness (PT); b sagittal T1-weighted brain MR image of an 8-year-old child with open spinal dysraphism and Chiari II malformation. The arrow indicates mamillopontine distance (MPD)

Fig. 3 a Axial T2-weighted brain MR image of a 16-year-old child with open spinal dysraphism and Chiari II malformation. The arrow indicates axial cerebellar width; b coronal T2-weighted brain MR image of a 13-year-old child with open spinal dysraphism and Chiari II malformation. The arrow indicates coronal cerebellar width

Table 6 Consistency [tested by applying the results of the ROC analysis (see Table 5) to the data obtained from observer B and C] of the measures with the best diagnostic performance in the ROC analyses

Discussion

On brain MR images, Chiari II malformation is generally evaluated based on a constellation of morphological characteristics in the midsagittal plane. The current study provides quantitative measures that may provide information about the extent or severity of Chiari II malformation. The measures mamillopontine distance and cerebellar width seem to be highly specific and sensitive for assessing Chiari II malformation.

In the present study, most measures turned out to be reliable, both overall and per diagnostic group. The literature provides some morphometric studies of Chiari II malformation [16–21, 25], but only the study of Salman et al. [21] deals with interobserver agreement of several measures. As far as the same measures were studied, our results agree with the previous findings. The additional value of our study is that we investigated measures in three planes and in different diagnostic groups. The interobserver agreement in the Chiari II malformation group was slightly lower than in the unaffected groups. This may be due to anatomical distortions, which may hamper precise identification of landmarks. However, this did not affect reliability to a large extent.

Unreliable measures in the present study were predominantly complex measures, depending on reference lines, which are susceptible to differences in interpretation as well. For example, the disagreement found for foramen magnum diameter will have contributed to the disagreement for the measures that depend on it, such as vermis level.

The unreliability of vermis level and tonsil level was remarkable. Blurred boundaries in a crowded posterior fossa and upper cervical spinal canal may have hampered precise delineation of the tonsils and vermis, so that these structures could not be distinguished precisely. On the other hand, the disagreement for vermis level may also be observer dependent, as two of the three observers moderately agreed on vermis level, whereas these two observers systematically disagreed with the third observer (Table 3). To elucidate this, we performed a post hoc analysis using the most caudal extent of cerebellar tissue (vermis or tonsil) as a variable. As this derivative measure also failed to be reliable (AI = 0.29), however, observer dependency seems to play a minor role. In contrast, Salman et al. [21] presented a comparable measure, “herniation distance,” as reliable, but they used other statistical methods and a smaller sample. Although cerebellar herniation remains a key feature of Chiari II malformation and its morphological appearance can reliably be judged on MR images (see part 1), the present study shows that measuring the degree of cerebellar herniation can be unreliable.

The majority of the reliable measures differed statistically significantly between children with Chiari II malformation and unaffected children (Table 4). These differences are in accordance with the morphogenesis of Chiari II malformation. Increased cerebellar height and vermis length and decreased cerebellar width support the hypothesis of a small posterior fossa [3] with squeezing of the vermis and enlargement of the midsagittal vermis area [21]. An increased mamillopontine distance results from caudal displacement of the brainstem and pons. For a few measures, reference values have been reported in the literature (Table 4). Our values for foramen magnum diameter corresponded well with the values reported by Aboulezz et al. [16] and our values for cerebellar height and vermis length with the values reported by Salman et al. [21]. The pons length in affected children in our study was longer than the pons length reported by Tsai et al. [20]. A different identification of the inferior pontine notch and a different age range of the investigated populations might explain this difference.

The substantial differences in the measurement values between affected and unaffected children warrant the search for cutoff points. The ROC analyses showed reasonably accurate cutoff points for more than half of the reliable measures (Table 5), but only two measures, mamillopontine distance and cerebellar width, showed consistent diagnostic performance. Some caution is justified, however. From the ROC analyses, very precise cutoff points were calculated, but this amount of precision will not be feasible in clinical practice.

Clinicians should be aware of the imprecise judgment of the degree of cerebellar herniation in the midsagittal plane. The reliable measures presented are more suitable to assess the morphological distortions. They appraise the cerebellum and brainstem not only in the midsagittal plane but also in the axial and coronal plane. Since measures differ substantially between affected and unaffected children, they are considered to be of diagnostic value. Cerebellar width provides an indication of the size of the posterior fossa, and cerebellar height and vermis length reflect the enlarged vermis area. Mamillopontine distance, pons length, and medulla length provide quantifications of downward displacement and stretching of the brainstem. Although hemispheral length and hemispheral height were reliable measures, they did not differ substantially between affected and unaffected children and thus failed to provide objective cutoff values for wrapping of the cerebellar hemispheres around the brainstem and upward tentorial herniation, respectively. The reliable measures might be suitable to assess severity of clinical signs and symptoms. However, the association between measurements and severity of Chiari II malformation is a matter of further study.

The results of this study may have implications for prenatal surgery for spina bifida as well. Intrauterine spina bifida repair appears to reverse the degree of hindbrain herniation [14, 26, 27]. The currently used scoring system might be imprecise, as it is based on the degree of vermis herniation and the position of the fourth ventricle. The present study provides reliable measures, which may be more suitable to objectively evaluate the effect of prenatal surgery on Chiari II malformation in three dimensions. However, the results cannot simply be extrapolated to prenatal imaging, since unshunted hydrocephalus might have an effect on the measures in the prenatal setting. In particular, this may be relevant for mamillopontine distance, as this distance may decrease as a result of raised intracranial pressure [28]. Hydrocephalus may have less influence on most other measures. Nevertheless, additional evaluation of the measures in a prenatal setting is recommended.

The study also had some limitations. Due to its partly retrospective design, the study population comprised a heterogeneous set of MR images. Furthermore, the reference standard used in the ROC analyses might be questionable. However, a better reference standard is currently not available. Finally, we could not take into account a possible age effect even though brain dimensions change in a growing child. However, Salman et al. [21] showed that MR measurements of the posterior fossa did not correlate with age in children with Chiari II malformation. In the present study, the strong differences between affected and unaffected children seem to outweigh the influence of age.

In conclusion, the use of morphometric measures represents a reliable and feasible method to quantify the morphological distortions of Chiari II malformation on MR images. These measures are easily used on standard MR images without the need for specific software. They appraise different parts of the cerebellum, brainstem, and posterior fossa, providing quantitative information about the extent of Chiari II malformation in three dimensions. The measures may have added value in assessment of severity of Chiari II malformation in clinical decision making as well as in research settings, such as studies on the effect of prenatal surgery for spina bifida. The excellent diagnostic performance of mamillopontine distance and cerebellar width makes these measures particularly helpful in cases in which the diagnosis of Chiari II malformation is ambiguous.