Introduction

Replication is the cornerstone of science. Its absence reduces any scientific endeavor to a set of unverified beliefs. Brain imaging studies are no exception, although they have several specific characteristics that conspire to make quantification of reliability especially difficult. First, measurements are complex and idiosyncratic for each modality. Second, the definition of the actual target to be measured is often imperfect. Third, the data sets are large and not amenable to standard investigations of replication. Fourth, there is relatively little cross-pollination of research between different imaging modalities. Finally, setting up replication experiments can be difficult under many scenarios.

A variety of methods have been proposed for measuring the reliability of images, particularly in the context of functional neuroimaging studies (see Bennett & Miller, 2010, for an overview). One approach, the intraclass correlation (ICC) (Shrout & Fleiss, 1979), can be used to measure the similarity between region of interest (ROI) summaries of activation, intensity, or shape metrics in multiple subjects under two or more experimental replications. Another approach, the Dice coefficient (Rombouts, Barkhof, Hoogenraad, Sprenger, & Scheltens, 1998) measures what proportion of voxels exceed a threshold, such as one indicating activation, in both of two separate imaging sessions. A third approach, predictive modeling, measures the ability of a training data set to predict the structure of test data. One of the best established predictive modeling techniques within functional neuroimaging is the nonparametric prediction, activation, influence, and reproducibility sampling approach (NPAIRS; Strother et al., 2002), which has been used to illustrate how small changes in a functional magnetic resonance imaging (fMRI) processing pipeline can have dramatic effects on final results.

In this work, we propose a general model for brain imaging replication studies and introduce the image intraclass correlation (I2C2) as a measure of data reliability. This measure generalizes the classic (scalar) ICC to the case when the measurement target is an image. Resampling approaches are then developed to quantify I2C2 variability under the replication design and to test whether it is different from the I2C2 obtained under a random permutation of subject matching. Notably, the proposed framework is applied to three replication studies utilizing data from different brain imaging modalities. These include regional analysis of volumes in normalized space (RAVENS) imaging (a technique used to investigate localized changes in brain morphology; Davatzikos, Genc, Xu, & Resnick, 2001), seed-voxel brain connectivity maps based on resting-state fMRI (rs-fMRI), and fractional anisotropy (FA) measured using diffusion tensor imaging (DTI) in an area surrounding the corpus callosum.

The image intraclass correlation coefficient

To better understand the underlying issue, consider the most basic replication study where J = 2 and scalar replicate measurements are collected for each of I subjects. An example would be measuring total white matter brain volume from two imaging sessions. Yet even in such a seemingly straightforward setting, the study of and expectations for the extent of replication can vary dramatically. For example, in one study, replicate images may be collected on the same day, using the same scanner, and processed by the same technicians, while in another, replicate images may be collected weeks apart, in different laboratories, with different technicians and scanners. Using our example for context, let \( X_i \) denote the true (unknown) white matter volume and \( W_{ij} \) the white matter volume measurements from two replications. Succinctly, the observed \( W_{ij} \)’s are the measured proxies of the quantity of interest, \( X_i \). The classical measurement error model (Carroll, Ruppert, Stefanski, & Crainiceanu, 2006; Fuller, 1987) in replication studies is

$$ {W}_{ij}={X}_i+{U}_{ij}, $$
(1)

with the assumptions that the true measurements, \( X_i \), are independent across subjects and that the measurement errors, \( U_{ij} \), are independent across both subjects and replicates and are mutually independent of \( X_i \), for i = 1, …, I, and j = 1, …, J. Conceptually, \( U_{ij} \) is the error that occurs during each individual measurement of the true target, \( X_i \). The classical measurement error model further assumes that the measurement error variates \( U_{ij} \) have the same variance, \( \sigma_U^2 \). Likewise, we denote the variance of \( X_i \) by \( \sigma_X^2 \). This model is then equivalent to a one-way ANOVA model with random effects. Note that the observed measurements, \( W_{ij} \), for the same subject, i, are correlated, since they share the same \( X_i \). Specifically, the correlation is equal to

$$ \mathrm{corr}\left({W}_{i1},{W}_{i2}\right)=\frac{\sigma_X^2}{\sigma_X^2+{\sigma}_U^2}=\frac{\sigma_W^2-{\sigma}_U^2}{\sigma_W^2}=1-\frac{\sigma_U^2}{\sigma_W^2}. $$

This is the well-known ICC. Here, the “class” is the replication experiment, and the correlation is between replicated measurements for the same subject. In the measurement error literature, ICC is referred to as the reliability ratio. The ICC is a scale-free quantity between 0 and 1, where 0 corresponds to exact independence of measurements \( W_{i1} \) and \( W_{i2} \); that is, they are unrelated, despite attempting to measure the same underlying quantity. Correspondingly, 1 indicates perfect reliability for every subject, \( W_{i1} = W_{i2} = X_i \). Estimation is simple: \( \sigma_W^2 \) can be estimated as the variance of the \( W_{ij} \), and \( \sigma_U^2 \) can be estimated by the variance of \( (W_{i2} - W_{i1})/\sqrt{2} \), since \( \mathrm{var}(W_{i2} - W_{i1}) = 2\sigma_U^2 \).
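As an illustration of this scalar estimator, the following minimal Python sketch computes the ICC for J = 2 replicates. The function name `scalar_icc` and the white matter volumes are hypothetical and are not taken from any of the data sets analyzed here. The sketch scales the difference by \( \sqrt{2} \) because \( \mathrm{var}(W_{i2} - W_{i1}) = 2\sigma_U^2 \) under the classical measurement error model.

```python
import math
from statistics import variance


def scalar_icc(w1, w2):
    """Method-of-moments ICC for two replicate measurements per subject."""
    # sigma_W^2: sample variance of all observed measurements pooled together
    var_w = variance(list(w1) + list(w2))
    # sigma_U^2: sample variance of (W_i2 - W_i1) / sqrt(2),
    # because var(W_i2 - W_i1) = 2 * sigma_U^2 under the model
    var_u = variance([(b - a) / math.sqrt(2) for a, b in zip(w1, w2)])
    return 1.0 - var_u / var_w


# Hypothetical white matter volumes (cm^3): five subjects, two sessions each
session1 = [480.0, 510.0, 455.0, 530.0, 495.0]
session2 = [484.0, 505.0, 459.0, 526.0, 499.0]
print(round(scalar_icc(session1, session2), 3))
```

Because the session-to-session differences in this toy example are small relative to the spread between subjects, the estimate is close to 1.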

Generalizations of the ICC to high-dimensional multivariate settings, such as images, are not obvious. However, a need for reliability metrics from these settings arises frequently. For example, the target of measurement might be a measure of brain morphology in a template, an rs-fMRI connectivity map, an FA map in an ROI such as the area surrounding the corpus callosum (see the Methods section), and so forth. In specific terms, let X i (v) be the (unknown) true image and W ij (v) be the proxy measurements of X i (v) at voxel v. The classical image measurement error can then be written as

$$ {W}_{ij}(v)={X}_i(v)+{U}_{ij}(v), $$
(2)

where all images are represented as V × 1 dimensional vectors; \( W_{ij} = \{W_{ij}(v): v = 1, \dots, V\} \) are the observed proxy images; \( X_i = \{X_i(v): v = 1, \dots, V\} \) are the true images, assumed to be independent across subjects; and \( U_{ij} = \{U_{ij}(v): v = 1, \dots, V\} \) are the measurement error images, assumed to be independent across subjects and replicates and (mutually) independent of \( X_i \). Here, i = 1, …, I, and \( j = 1, \dots, J_i \). Thus, we consider a general case involving different numbers of replicates per subject, with \( J_i \) taking any value greater than or equal to 2.

The model further assumes that the measurement error vector, \( U_{ij} \), has covariance \( K_U \) and that \( X_i \) has covariance \( K_X \); that is, \( \mathrm{cov}(U_{ij}, U_{ij}) = K_U \) and \( \mathrm{cov}(X_i, X_i) = K_X \). These cannot be directly estimated, since the \( U_{ij} \) and \( X_i \) are unobserved. Note that the covariance operator of the observed data, \( K_W = \mathrm{cov}(W_{ij}, W_{ij}) \), a quantity directly estimable from the data, can be written as \( K_W = K_X + K_U \) via the straightforward application of the multivariate variance operator to Equation 2. Exactly paralleling the univariate setting, \( K_X \) is interpreted as the between-subjects covariance, and \( K_U \) as the covariance of the measurement error.

On the basis of the aforementioned connection with the classical measurement error model (Equation 1), we propose the following I2C2 coefficient:

$$ \rho =\frac{\mathrm{trace}\kern0.5em \left({K}_X\right)}{\mathrm{trace}\kern0.5em \left({K}_W\right)}=\frac{\mathrm{trace}\kern0.5em \left({K}_W\right)-\mathrm{trace}\kern0.5em \left({K}_U\right)}{\mathrm{trace}\kern0.5em \left({K}_W\right)}=1-\frac{\mathrm{trace}\kern0.5em \left({K}_U\right)}{\mathrm{trace}\kern0.5em \left({K}_W\right)}. $$
(3)

One possible way of calculating I2C2 is to estimate the smoothed covariance matrices using multilevel functional principal component analysis (MFPCA; Di, Crainiceanu, Caffo, & Punjabi, 2009) or its extension to high-dimensional data (Zipunnikov et al., 2011). Alternatively, we obtain the following method of moments estimators based on formulas from Carroll et al. (2006) to reduce the computational cost,

$$ \widehat{\mathrm{trace}}\left({K}_W\right)=\frac{1}{{\displaystyle {\sum}_{i=1}^I{J}_i-1}}{\displaystyle \sum_{i=1}^I{\displaystyle \sum_{j=1}^{J_i}\kern0.50em {\displaystyle \sum_{v=1}^V{\left\{{W}_{ij}(v)-{{\displaystyle \overline{W}}}_{..}(v)\right\}}^2}}}, $$

and

$$ \widehat{\mathrm{trace}}\left({K}_U\right)=\frac{1}{{\displaystyle {\sum}_{i=1}^I}\;\left({J}_i-1\right)}{\displaystyle \sum_{i=1}^I}\;{\displaystyle \sum_{j=1}^{J_i}}\kern0.50em {\displaystyle \sum_{v=1}^V}\;{\left\{{W}_{ij}(v)-{{\displaystyle \overline{W}}}_{i.}(v)\right\}}^2. $$

Here, \( \overline{W}_{..}(v)=\sum_{i=1}^{I}\sum_{j=1}^{J_i} W_{ij}(v)/\sum_{i=1}^{I} J_i \) is the average of all images over all subjects and visits, and \( \overline{W}_{i.}(v)=\sum_{j=1}^{J_i} W_{ij}(v)/J_i \) is the average image for subject i over all visits j. Thus, an estimate of I2C2 can be obtained by plugging these estimates into Equation 3.
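The two moment estimators and the plug-in step can be sketched in a few lines of Python. This is an illustration only, not the MATLAB or R code distributed with the article; the function name `i2c2` and the nested-list data layout are assumptions made for the example, and in practice array libraries would be used for real images.

```python
def i2c2(images):
    """Method-of-moments I2C2 estimate.

    images[i][j] is the flattened image (a list of V numbers) for
    subject i and visit j; subjects may have different numbers of visits.
    """
    n_voxels = len(images[0][0])
    n_images = sum(len(subject) for subject in images)

    # Grand mean image: average over all subjects and visits, voxel by voxel
    grand_mean = [
        sum(img[v] for subject in images for img in subject) / n_images
        for v in range(n_voxels)
    ]

    ss_total = 0.0   # sum of squares around the grand mean image
    ss_within = 0.0  # sum of squares around each subject's mean image
    for subject in images:
        subj_mean = [sum(img[v] for img in subject) / len(subject)
                     for v in range(n_voxels)]
        for img in subject:
            ss_total += sum((img[v] - grand_mean[v]) ** 2 for v in range(n_voxels))
            ss_within += sum((img[v] - subj_mean[v]) ** 2 for v in range(n_voxels))

    trace_kw = ss_total / (n_images - 1)
    trace_ku = ss_within / sum(len(subject) - 1 for subject in images)
    return 1.0 - trace_ku / trace_kw  # Equation 3


# Toy example: three subjects, two visits each, two voxels per image
toy = [
    [[1.0, 2.0], [1.2, 1.9]],
    [[3.0, 5.0], [2.9, 5.2]],
    [[0.0, 1.0], [0.1, 0.8]],
]
print(i2c2(toy))
```

Only sums over voxels are accumulated, so no V × V matrix is ever formed. Note that the estimate can fall below 0 in small samples, because the two traces are estimated separately.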

Calculating the I2C2 is both quick and scalable, because it does not require dealing with the V × V dimensional matrices. Indeed, the computational burden for calculating \( \mathrm{trace}(K_W) \) and \( \mathrm{trace}(K_U) \) is linear in V. Moreover, the formulas separate by subject, making the calculations simple and easy to implement even on very modest computational resources. Both MATLAB (MATLAB, 2010) and R (R Core Team, 2012) code for calculating I2C2 is provided at http://www.biostat.jhsph.edu/~ccrainic/software.html. In practice, one may also be interested in the reliability of imaging in a particular ROI. The formulas for an ROI are almost identical to the ones for the whole image, except that the summation over v is done only within the ROI mask. This is especially useful when one suspects that the reliability of image measurements varies across functional or anatomical brain regions.

To assess the variability of the I2C2 parameter, a method is proposed to calculate a confidence interval by nonparametrically bootstrapping subjects and applying the same estimation procedure for every bootstrap sample. There are multiple sources of variability for the I2C2 estimator, but the major source will be the limited number of subjects, I, and the imbalance in the number of replicates, where applicable.
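The subject-level bootstrap can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names, the number of resamples, and the simulated toy data are all hypothetical, and the interval is a simple equal-tail percentile interval.

```python
import random


def i2c2(images):
    """Method-of-moments I2C2; images[i][j] is a flat list of voxel values."""
    n_vox = len(images[0][0])
    all_imgs = [img for subj in images for img in subj]
    grand = [sum(img[v] for img in all_imgs) / len(all_imgs) for v in range(n_vox)]
    ss_total = sum((img[v] - grand[v]) ** 2 for img in all_imgs for v in range(n_vox))
    ss_within = 0.0
    for subj in images:
        mean = [sum(img[v] for img in subj) / len(subj) for v in range(n_vox)]
        ss_within += sum((img[v] - mean[v]) ** 2 for img in subj for v in range(n_vox))
    trace_kw = ss_total / (len(all_imgs) - 1)
    trace_ku = ss_within / sum(len(subj) - 1 for subj in images)
    return 1.0 - trace_ku / trace_kw


def bootstrap_ci(images, n_boot=500, level=0.95, seed=0):
    """Equal-tail percentile interval from resampling subjects with replacement."""
    rng = random.Random(seed)
    n_subj = len(images)
    # Each resample draws whole subjects, so a subject's visits stay together
    estimates = sorted(
        i2c2([images[rng.randrange(n_subj)] for _ in range(n_subj)])
        for _ in range(n_boot)
    )
    lower = estimates[int((1 - level) / 2 * (n_boot - 1))]
    upper = estimates[int((1 + level) / 2 * (n_boot - 1))]
    return lower, upper


# Hypothetical toy data: 20 subjects, 2 visits, 5 voxels, strong subject signal
rng = random.Random(1)
images = []
for _ in range(20):
    truth = [rng.gauss(0, 3) for _ in range(5)]
    images.append([[x + rng.gauss(0, 1) for x in truth] for _ in range(2)])

lower, upper = bootstrap_ci(images)
print(f"I2C2 = {i2c2(images):.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Resampling subjects rather than individual images preserves the within-subject replication structure, which is the quantity the I2C2 is meant to measure.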

Lastly, the distribution of the I2C2 under complete random sampling (that is, no reliability) is investigated. In this case, the model is \( W_{ij}(v) = U_{ij}(v) \); recall that the \( U_{ij}(v) \) are independent. Draws from such a null distribution can be realized using permutation sampling. More precisely, all indexes, (i, j), are collected and relabeled as \( k_{i,j} \), for \( k_{i,j} = 1, \dots, \sum_{i=1}^{I} J_i \). Let \( \sigma(k_{i,j}) \) be a random permutation obtained by sampling the k-vector without replacement. Denote the image corresponding to \( \sigma(k_{i,j}) \) by \( \tilde{W}_{ij}(v) \), and estimate the I2C2 coefficient for the model \( \tilde{W}_{ij}(v)=\tilde{X}_i(v)+\tilde{U}_{ij}(v) \). Under permutation, the (i, j) pairing does not have the same meaning as before, because the images \( \tilde{W}_{ij}(v) \) are not necessarily from the same subject. By breaking the subject associations via random permutation, a null distribution that is otherwise close to the variation in the data is obtained. Because the number of resamples must be large to minimize Monte Carlo error, for both bootstrapping and permutation testing, the speed of the proposed methods is crucial. We first investigate the “reliability” of the proposed metric in the next section and then show how these quantities can be calculated and used in three different imaging applications in the Methods section.
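The permutation scheme described above can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the function names, the number of permutations, the toy data, and the one-sided Monte Carlo p-value with a +1 correction are assumptions added for the example.

```python
import random


def i2c2(images):
    """Method-of-moments I2C2; images[i][j] is a flat list of voxel values."""
    n_vox = len(images[0][0])
    all_imgs = [img for subj in images for img in subj]
    grand = [sum(img[v] for img in all_imgs) / len(all_imgs) for v in range(n_vox)]
    ss_total = sum((img[v] - grand[v]) ** 2 for img in all_imgs for v in range(n_vox))
    ss_within = 0.0
    for subj in images:
        mean = [sum(img[v] for img in subj) / len(subj) for v in range(n_vox)]
        ss_within += sum((img[v] - mean[v]) ** 2 for img in subj for v in range(n_vox))
    trace_kw = ss_total / (len(all_imgs) - 1)
    trace_ku = ss_within / sum(len(subj) - 1 for subj in images)
    return 1.0 - trace_ku / trace_kw


def permutation_null(images, n_perm=200, seed=0):
    """I2C2 values obtained after randomly reassigning images to subjects."""
    rng = random.Random(seed)
    sizes = [len(subj) for subj in images]          # original J_i
    pool = [img for subj in images for img in subj]  # all images, labels dropped
    null = []
    for _ in range(n_perm):
        shuffled = rng.sample(pool, len(pool))  # random permutation of the pool
        relabeled, start = [], 0
        for size in sizes:  # deal images back out, keeping each J_i
            relabeled.append(shuffled[start:start + size])
            start += size
        null.append(i2c2(relabeled))
    return null


# Hypothetical toy data: 20 subjects, 2 visits, 5 voxels, strong subject signal
rng = random.Random(1)
images = []
for _ in range(20):
    truth = [rng.gauss(0, 3) for _ in range(5)]
    images.append([[x + rng.gauss(0, 1) for x in truth] for _ in range(2)])

observed = i2c2(images)
null = permutation_null(images)
# One-sided Monte Carlo p-value with the +1 correction
p_value = (1 + sum(x >= observed for x in null)) / (1 + len(null))
print(f"observed I2C2 = {observed:.3f}, p = {p_value:.4f}")
```

The permuted data keep the same set of images and the same numbers of replicates per subject; only the subject labels are broken.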

Simulations

The I2C2 metric is developed on the basis of the assumptions that the signal and noise are independent and normally distributed across repeated measurements. Through extensive simulations, we investigate the effects of various model violations on the performance of I2C2. In particular, we examine the performance of our algorithm when the model is correctly and incorrectly specified. When the model is misspecified, we study scenarios where (1) replication errors are non-Gaussian, (2) replication errors are correlated over repetitions, and (3) the signal is correlated with the replication errors.

Correctly specified model

Consider the data-generating mechanism \( W_{ij}(v) = X_i(v) + U_{ij}(v) \), \( i = 1, \dots, I \); \( j = 1, \dots, J_i \); \( v \in V \), where each subject i has \( J_i \) images repeatedly measured on a group of voxels V. Let \( U_{ij}(v) = V_{ij}(v) + \varepsilon_{ij}(v) \), where \( X_i(v) \) and \( V_{ij}(v) \) are mutually uncorrelated with smooth covariance operators and the \( \varepsilon_{ij}(v) \) are i.i.d. across voxels, repetitions, and subjects. Generate \( X_i(v)=\mu(v)+\sum_{k=1}^{K_1}\xi_{ik}\phi_k(v) \) and \( V_{ij}(v)=\sum_{k=1}^{K_2}\zeta_{ijk}\psi_k(v) \), where \( \xi_{ik} \sim N(0,\lambda_k^X) \) and \( \zeta_{ijk} \sim N(0,\lambda_k^V) \). To approximate the DTI-MRI example in the Methods section, we set \( \mu(v) \) to be the vector obtained by concatenating the population average of corpus callosum images. Let \( V = \{v_1, v_2, \dots, v_V\} \), with the number of voxels also denoted by V; here, V = 38 × 72 × 11 = 30,096. We set \( K_1 = K_2 = 4 \), \( \lambda_k^X = 1400 \times 0.5^{k-1} \), and \( \lambda_k^V = 840 \times 0.5^{k-1} \), k = 1, 2, 3, 4. The eigenfunctions \( \phi_k(v) \) and \( \psi_k(v) \) are chosen to be orthonormal blocks, as in Zipunnikov et al. (2011). Data were simulated for I = 200 subjects, each with \( J_i = 2 \) replications. By definition, the theoretical I2C2 is \( \sum_{k=1}^{K_1}\lambda_k^X/\left(\sum_{k=1}^{K_1}\lambda_k^X+\sum_{k=1}^{K_2}\lambda_k^V+V\sigma^2\right) \), where \( \sigma^2 \) is the variance of \( \varepsilon_{ij}(v) \). We show the results for the following distributions of \( \varepsilon_{ij}(v) \): Gaussian, heavy-tailed t, and mixture normal with two components. For each scenario, we conduct 100 iterations.

  • \( \varepsilon_{ij}(v) \sim N(0, \sigma^2) \). The model is correctly specified, and results are highly reliable (see the left panel in Fig. 1). The box plots show the distribution of estimated I2C2 over 100 iterations with respect to a range of signal-to-noise ratios. The red line indicates the theoretical I2C2 values as a function of \( \sigma^2 \).

    Fig. 1 Left panel: True I2C2 (red line) and estimated I2C2 (box plots over 100 simulations) for \( \varepsilon_{ij}(v) \sim N(0, \sigma^2) \) and a range of \( \sigma^2 \). Right panel: True I2C2 (red line) and estimated I2C2 (box plots over 100 simulations) for \( \varepsilon_{ij}(v) \sim t_3/s \) and a range of t distribution variances

  • \( \varepsilon_{ij}(v) \sim t_3/s \), s = 0.5 × (1:20). Here, the t distribution generates measurement errors with a heavy-tailed distribution and a variance controlled by s. Results are displayed in the right panel of Fig. 1. Performance is very good, although a slight overestimation can be noted in the very low signal-to-noise setting.

  • \( \varepsilon_{ij}(v) \sim pN(\mu_1, s_1^2) + (1-p)N(\mu_2, s_2^2) \). This scenario corresponds to the case in which measurement error has two possible sources, so that the noise distribution is a mixture of two normal components. We consider the following three settings corresponding to three different reliability ratios: (1) p = 0.8, \( \mu_1 = -0.2 \), \( \mu_2 = 0.8 \), \( s_1 = 0.005 \), and \( s_2 = 0.1 \); (2) p = 0.5, \( \mu_1 = -0.02 \), \( \mu_2 = 0.02 \), \( s_1 = 0.02 \), and \( s_2 = 0.1 \); (3) p = 0.3, \( \mu_1 = -1 \), \( \mu_2 = 0.43 \), \( s_1 = 0.05 \), and \( s_2 = 0.1 \). The parameters are chosen so that the distribution of the noise has a mean of 0. The density of selected distributions and the estimated I2C2 under each setting are shown in Fig. 2, indicating excellent performance of the I2C2 estimators.

    Fig. 2 Left panel: Density plots of the mixture normal distributions used for measurement noise. Right panel: True I2C2 (red lines) and estimated I2C2 (box plots) for the different mixtures of normal distributions

    Fig. 3 Estimated I2C2 (red horizontal lines) and 95% equal tail probability confidence intervals for ventricles, white matter, and gray matter RAVENS images. Gray distributions correspond to the I2C2 estimator under the zero reliability assumption (random permutations of labels)

    Fig. 4 Estimated image intraclass correlation coefficients (I2C2s) (red horizontal lines) and 95% equal tail probability confidence intervals for fMRI seed-voxel correlation maps for the posterior cingulate cortex (PCC), the dorsal region of the motor cortex corresponding to control of the lower limbs (M1), the premotor cortex (M3), and the ventral-most region of the motor cortex corresponding to oro-motor function (M5). Gray distributions correspond to the I2C2 estimator under the zero reliability assumption (random permutations of labels)

We conclude that the I2C2 is properly recovered when the model is correctly specified. This is due to the fact that we use a method of moments estimator that is insensitive to the distribution of measurement error.
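For reference, the theoretical I2C2 in the correctly specified setting can be evaluated directly from the stated simulation parameters. The short Python sketch below does this as a function of the noise variance \( \sigma^2 \); the function name `theoretical_i2c2` is an illustrative choice, while the eigenvalues and the number of voxels are the values given in the text.

```python
def theoretical_i2c2(sigma2, n_voxels=38 * 72 * 11, n_components=4):
    """Theoretical I2C2 for the simulation design with noise variance sigma2."""
    # Eigenvalues of the signal (X) and smooth noise (V) processes
    lam_x = [1400 * 0.5 ** (k - 1) for k in range(1, n_components + 1)]
    lam_v = [840 * 0.5 ** (k - 1) for k in range(1, n_components + 1)]
    signal = sum(lam_x)  # 1400 + 700 + 350 + 175 = 2625
    return signal / (signal + sum(lam_v) + n_voxels * sigma2)


for sigma2 in (0.0, 0.01, 0.1, 1.0):
    print(sigma2, round(theoretical_i2c2(sigma2), 4))
```

With these eigenvalues the signal and smooth-noise variances sum to 2625 and 1575, so the theoretical I2C2 cannot exceed 2625/4200 = 0.625 and decreases toward 0 as \( \sigma^2 \) grows.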

Misspecified model

When the model assumptions are violated, we show that the estimated I2C2 still reflects the magnitude of reliability. Note that the theoretical I2C2 can be equivalently defined as \( \mathrm{I2C2}=\sum_{v\in V}\mathrm{Cov}\left\{W_{ij}(v),W_{ij'}(v)\right\}/\sum_{v\in V}\mathrm{Var}\left\{W_{ij}(v)\right\} \). Thus, I2C2 is a measure of the fraction of variability that is shared among repeated measurements, without distinguishing whether the correlation is from the signal or the noise. We consider the following scenarios where correlation among images is due not only to the signal, but also to the correlation of replication errors. This violates a basic assumption of the measurement error model, although in the absence of gold standard measurements, it is difficult to determine whether the true errors are correlated.
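The equivalence of the two definitions follows in one step from the assumptions of Equation 2. As a brief sketch, for j ≠ j′ the errors \( U_{ij}(v) \) and \( U_{ij'}(v) \) are independent of each other and of \( X_i(v) \), so at each voxel

$$ \mathrm{Cov}\left\{W_{ij}(v),W_{ij'}(v)\right\}=\mathrm{Var}\left\{X_i(v)\right\}=K_X(v,v),\qquad \mathrm{Var}\left\{W_{ij}(v)\right\}=K_W(v,v). $$

Summing over voxels gives

$$ \frac{\sum_{v\in V}\mathrm{Cov}\left\{W_{ij}(v),W_{ij'}(v)\right\}}{\sum_{v\in V}\mathrm{Var}\left\{W_{ij}(v)\right\}}=\frac{\mathrm{trace}\left(K_X\right)}{\mathrm{trace}\left(K_W\right)}=\rho . $$

When the errors are correlated across replicates or with the signal, the first equality no longer holds, and the left-hand ratio picks up the extra shared covariance.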

  • Correlated noise across replications. Consider the case where \( \varepsilon_{ij}(v) \sim N(0, \sigma^2) \) and \( \mathrm{corr}\left\{\varepsilon_{ij}(v), \varepsilon_{ij'}(v)\right\} = \rho \) for every j ≠ j′. The theoretical I2C2 is \( \left(\sum_{k=1}^{K_1}\lambda_k^X+V\rho\sigma^2\right)/\left(\sum_{k=1}^{K_1}\lambda_k^X+\sum_{k=1}^{K_2}\lambda_k^V+V\sigma^2\right) \), which is larger than in the uncorrelated case when ρ > 0. Similarly to the previous analysis, we examine the estimated I2C2 with respect to \( \sigma^2 \) and ρ. The mean square errors of the estimated I2C2 under a range of correlations ρ are shown in the left panel of Table 1.

    Table 1 Mean square errors (MSEs) of the estimated image intraclass correlation coefficients (I2C2s) under a range of correlations, for both the correlated noise case and the correlated signal and noise case

The case where noise variables are not exchangeable is more difficult because defining the true I2C2 becomes tricky. For example, consider the case of AR(1) dependence: that is, \( \varepsilon_{i,j+1}(v) = \alpha\varepsilon_{ij}(v) + z_{i,j+1}(v) \), \( \varepsilon_{i1}(v) \sim N(0, \sigma^2) \), and \( z_{ij}(v) \sim N(0, (1-\alpha^2)\sigma^2) \), which ensures that the \( \varepsilon_{ij}(v) \) have the same marginal distribution. A possible way to define I2C2 is to start with the pairwise correlations

$$ {{\displaystyle \mathrm{I}2\mathrm{C}2}}_{jj\prime }={\displaystyle \sum_{v\in V}\mathrm{Cov}\left\{{W}_{ij}(v),{W}_{i{j}^{\prime }}(v)\right\}}/{\displaystyle \sum_{v\in V}\mathrm{Var}{\left\{{W}_{ij}(v)\right\}}^{1/2}\mathrm{Var}{\left\{{W}_{i{j}^{\prime }}(v)\right\}}^{1/2}}. $$

The true I2C2 could then be defined as the average over all possible pairs, \( \mathrm{I2C2}=\binom{J}{2}^{-1}\sum_{j<j'}\mathrm{I2C2}_{jj'} \). Although this is a rather contrived example, our simulations indicate good estimation of the I2C2 (results not shown).

  • Correlated signal and noise. Consider the case where the true underlying image intensity is correlated with the magnitude of noise at each voxel. Consider \( W_{ij}(v)=\tilde{X}_i(v)+\tilde{U}_{ij}(v) \), where \( \tilde{X}_i(v)=X_i(v)+z_i \) and \( \tilde{U}_{ij}(v)=V_{ij}(v)+v_{ij} \), and \( X_i(v) \), \( V_{ij}(v) \) are generated as in the previous sections. Correlation between signal and noise is modeled using the trivariate normal distribution \( N(0,\Sigma) \) for \( \{z_i, v_{i1}, v_{i2}\} \), where

$$ \Sigma =\begin{pmatrix} \sigma_x^2 & \rho \sigma_{xu}^2 & \rho \sigma_{xu}^2 \\ \rho \sigma_{xu}^2 & \sigma_u^2 & 0 \\ \rho \sigma_{xu}^2 & 0 & \sigma_u^2 \end{pmatrix}. $$

    We assume that \( \sigma_{xu}^2 = \sigma_x^2 \) and \( \sigma_u^2 = 5\sigma_x^2 \). In this case, the theoretical I2C2 is \( \left\{\sum_{k=1}^{K_1}\lambda_k^X+V\left(1+2\rho \right)\sigma^2\right\}/\left\{\sum_{k=1}^{K_1}\lambda_k^X+\sum_{k=1}^{K_2}\lambda_k^V+V\left(6+2\rho \right)\sigma^2\right\} \), where \( \sigma^2 \) denotes the common value \( \sigma_x^2 = \sigma_{xu}^2 \). By varying the correlation ρ, we examine the estimated I2C2 in the right panel of Table 1.
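The expression for the theoretical I2C2 in this scenario can be checked term by term. The following is a short derivation sketch under two assumptions that the text does not state explicitly: that \( \sigma^2 \) in the formula stands for \( \sigma_x^2 \) (equal to \( \sigma_{xu}^2 \) here), and that the scalar terms \( z_i \) and \( v_{ij} \) are independent of \( X_i(v) \) and \( V_{ij}(v) \). For j ≠ j′ and each voxel v,

$$ \mathrm{Cov}\left\{W_{ij}(v),W_{ij'}(v)\right\}=\mathrm{Var}\left\{X_i(v)\right\}+\sigma_x^2+2\rho \sigma_{xu}^2=\mathrm{Var}\left\{X_i(v)\right\}+\left(1+2\rho \right)\sigma_x^2, $$

$$ \mathrm{Var}\left\{W_{ij}(v)\right\}=\mathrm{Var}\left\{X_i(v)\right\}+\mathrm{Var}\left\{V_{ij}(v)\right\}+\sigma_x^2+\sigma_u^2+2\rho \sigma_{xu}^2=\mathrm{Var}\left\{X_i(v)\right\}+\mathrm{Var}\left\{V_{ij}(v)\right\}+\left(6+2\rho \right)\sigma_x^2. $$

Summing over the V voxels, and using \( \sum_{v}\mathrm{Var}\{X_i(v)\}=\sum_{k=1}^{K_1}\lambda_k^X \) and \( \sum_{v}\mathrm{Var}\{V_{ij}(v)\}=\sum_{k=1}^{K_2}\lambda_k^V \) for orthonormal eigenfunctions, recovers the numerator and denominator given above.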

In sum, simulation results demonstrate the robustness of the I2C2 estimation approach when there is a correlation among noise variables or between the signal and the noise. However, it is important to note that I2C2 is not designed to distinguish between these cases and is unbiased with respect to the true correlation; this true correlation may be different from the proportion of variability explained when model assumptions are violated. We now proceed to show how I2C2 can be calculated and used in three different imaging applications.

Methods

RAVENS acquisition

This work employs the “multimodal MRI reproducibility resource” (Landman et al., 2011), colloquially known as the Kirby21 data set, which is publicly available through the Neuroimaging Informatics Tools and Resources Clearinghouse (www.nitrc.org). The Kirby21 data set consists of test–retest structural MRI and rs-fMRI scans from 21 healthy adult volunteers with no history of neurological conditions (11 male and 10 female; 31.76 ± 9.47 years of age) who were each scanned twice on the same day. Further details of the study can be found in Landman et al. (2011).

The structural MRI data were acquired on a 3.0T scanner (Achieva, Philips Medical Systems) using a high-resolution 3-D magnetization-prepared rapid acquisition of gradient echoes sequence with a resolution of 1.0 × 1.0 × 1.2 mm; TR,∼6.7 ms; TE, 3.1 ms; TI, 842 ms; flip angle, 8°; SENSE factor, 2. All images were spatially normalized via registration of T1 maps into the mean template generated using ANTS (Avants et al., 2011; Avants et al., 2010). Details of how the average template is generated can be found in Chen et al. (2012). All T1 images were segmented into ventricles (VNs), gray matter (GM), and white matter (WM) using Lesion-TOADS (Shiee et al., 2010). After segmentation, the final tissue maps of VNs, WM, and GM were spatially normalized using the HAMMER-SUITE (Shen & Davatzikos, 2002) to generate RAVENS images. Finally, the RAVENS maps were smoothed individually with a 4-mm FWHM Gaussian kernel using SPM8.

fMRI acquisition

The Kirby21 data set was also used to investigate the reproducibility of seed-based functional connectivity analysis as follows. In short, two 7-min resting state scans were acquired from each subject using a single-shot, partially parallel (SENSE) gradient-recalled echo planar sequence with an ascending slice order (TR/TE, 2000/30 ms; FA, 75; 3-mm axial slices with a 1-mm slice gap) and an 8-channel head coil. Subjects were instructed to relax and fixate on a cross-hair while remaining as still as possible. The two resting-state scans were separated by a short break, during which the subject exited the scanner; the T1-weighted anatomical image described in the RAVENS acquisition section was also acquired to be used as a template for spatial registration of the functional images.

Image processing was performed using SPM8 and custom MATLAB scripts. Anatomical images were registered to the first functional volume and normalized to MNI space using unified segmentation/normalization (SPM8). Functional data were adjusted for slice time acquisition, as well as subject motion, and were transformed to MNI space. Nuisance covariates from white matter and CSF were estimated using CompCor (Behzadi, Restom, Liau, & Liu, 2007) and regressed from the data along with the motion realignment estimates, their derivatives, global mean signal, and linear trends. Data were then spatially smoothed (6-mm kernel) and temporally filtered using a 0.01–0.10 Hz band-pass filter. Data from one subject were excluded from analysis due to a misalignment of the first and second resting-state scans.

Seed voxel analysis is commonly used in fMRI studies to analyze the functional connectivity of the brain via a seed voxel from an ROI (Lindquist, 2008). Here, we investigated the reproducibility of this approach for our data set considering four different seeds, each with a 6-mm radius: the posterior cingulate cortex (labeled PCC; Fox et al., 2005), the premotor area (labeled M3) (Chouinard & Paus, 2006), and two seeds from the dorsal–ventral extremes of the motor strip, the dorsal seed representing lower limb control (labeled M1; Meier, Afalo, Kastner, & Graziano, 2008) and the ventral one corresponding to oro-motor function (labeled M5). Within each seed, fMRI time series were averaged across voxels, and a correlation map for each of the resulting four time courses was then obtained with each voxel in the brain.
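The seed-based map construction described above (average the time series within the seed, then correlate that average with every voxel) can be sketched as follows. This is a minimal Python illustration on made-up numbers, not the SPM8/MATLAB pipeline used in the study; the function names and the toy time series are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def seed_correlation_map(timeseries, seed_voxels):
    """Correlate the mean seed time course with every voxel's time course.

    timeseries[v] is the time course of voxel v; seed_voxels lists the
    indices of the voxels that make up the seed region.
    """
    n_time = len(timeseries[0])
    seed_mean = [
        sum(timeseries[v][t] for v in seed_voxels) / len(seed_voxels)
        for t in range(n_time)
    ]
    return [pearson(seed_mean, ts) for ts in timeseries]


# Toy data: four voxels, four time points; voxels 0 and 1 form the seed
toy_series = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 4.0],
]
corr_map = seed_correlation_map(toy_series, seed_voxels=[0, 1])
print([round(r, 3) for r in corr_map])
```

Each subject and session yields one such map per seed; these maps play the role of the images \( W_{ij}(v) \) whose reliability the I2C2 then quantifies.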

DTI-MRI acquisition

The data were collected as part of an ongoing observational study being conducted at the National Institutes of Health and at Johns Hopkins University. Study subjects with multiple sclerosis (MS) were recruited from the outpatient neurology clinic and healthy volunteers from the community. Prior to MRI scanning, all subjects gave signed, informed consent, and all procedures were approved by the institutional review board. Cohort characteristics are summarized in Reich, Ozturk, Calabresi, and Mori (2010); Goldsmith, Crainiceanu, Caffo, & Reich (2011). Longitudinal analyses of the DTI-MRI substudy can be found in Greven, Crainiceanu, Caffo, and Reich (2010); Zipunnikov et al. (2012).

Scans were performed on a 3T scanner (Intera; Philips, Best, The Netherlands) over a 4.6-year period, using the body coil for transmission and either a 6-channel head coil or the eight head elements of a 16-channel neurovascular coil for reception (both coils are made by Philips). Each session included two sequential DTI scans using a conventional spin-echo sequence and a single-shot EPI readout. Whole-brain data were acquired in nominal 2.2-mm isotropic voxels with the following parameters: TE, 69 ms; TR, automatically calculated (shortest); slices, 60 or 70; parallel imaging factor, 2.5; noncollinear diffusion directions, 32 (Philips overplus high scheme); high b-value, 700 s/mm²; low b-value (b0), approximately 33 s/mm²; repetitions, 2; reconstructed in-plane resolution, 0.82 × 0.82 mm. A 3-D gradient-echo magnetization-transfer sequence was also performed with segmented EPI readout (nominal acquired resolution, 1.5 × 1.5 × 2.2 mm; TE, 15 ms; TR, 64 ms; parallel imaging factor, 2; EPI factor, 7; magnetization-transfer pulse, sinc-shaped, 1.5 kHz off-resonance; repetitions, 3), the data from which were rigidly registered to the DTI scan before calculation of magnetization transfer ratio (MTR) maps (defined as 1 minus the voxel-wise ratio of data from this sequence to those obtained using the same sequence without the magnetization-transfer pulse). Prior to analysis, data were adjusted to account for changes in average tract-specific MRI indices that resulted from the scanner upgrades that inevitably occur over the course of a study such as this. The procedure by which this adjustment was made has been previously described (Harrison et al., 2011).

The diffusion-weighted scans were processed using CATNAP (Landman et al., 2007) to create maps of FA, mean diffusivity (MD), axial diffusivity (AD), and radial diffusivity (RD). These four quantities, together with MTR, are hereafter termed MRI indices. Whole-brain MRI indices were calculated by slice-wise averaging of all diffusion-weighted images, removal of the low-intensity voxels that are characteristic of extracerebral tissues on these images, and final removal of voxels with MD > 1.7 μm²/ms to exclude cerebrospinal fluid (Ozturk et al., 2010). The resulting brain mask was applied to all DTI maps and also to the coregistered MTR maps. The images were obtained from a natural history study where 176 MS patients were followed for up to 5.5 years, which generated a total of 446 MRI scans. The number of scans per subject varied from one to six. The scanning time is shown in Fig. 5, where time zero indicates the first scan. For illustration purposes, we focus on the measurements in a region of 30,096 voxels that contains the corpus callosum. At each voxel, data are FA weighted by the probability of being in the corpus callosum. Images are registered using affine transformations.

Results

RAVENS replication results

RAVENS maps produce an image of the deformation of the brain necessary to fit in a given template and are proxies of brain morphology. Here, the focus is on ventricular, WM, and GM regions considered separately, segmented via Lesion-TOADS (Shiee et al., 2010). The measurement error is an uncontrollable combination of sources, including image acquisition, biological error (natural within-day brain variation), movement, magnetic field inhomogeneities, preprocessing, spatial normalization, and segmentation. Apportioning error variability is beyond the scope of this article. Instead, interest lies, first, in establishing that estimating the effect of total measurement error variability (regardless of its source) is possible and, then, in investigating its impact on image reliability.

Figure 3 displays the I2C2 estimators (\( \widehat{\rho} \)) as a red line with 95% equal tail probability confidence intervals obtained using the nonparametric bootstrap of subjects. The reliability in the VNs is by far the largest (roughly .9), followed by reliability in white matter (.55) and gray matter (.45). Determining the source and type of error could be done, for example, by investigating various ROIs or by inspecting the principal components of measurement error variability based on HD-MFPCA (Zipunnikov et al., 2011). The distribution of I2C2 estimators under zero reliability, \( \widehat{\rho}_0 \), is shown in gray, with the median displayed as a black horizontal line. These results indicate strong evidence that the observed reliability values are inconsistent with zero reliability. Interestingly, the null distribution (gray histogram plot) for VNs has a long right tail, with a nontrivial probability above .3. This is somewhat unexpected and may indicate stronger between-subjects correlations of measurement error processes in the VNs. Further investigation of this postulate is left for future study.

fMRI replication results

The I2C2 metric was used to quantify the reproducibility of the resulting connectivity map (correlation matrix) for each of the four seed regions. Results are shown in Fig. 4, using the same notation and symbols as in Fig. 3. The overall message is that the seed-voxel-based correlation maps are not reliable, with the reliability estimates varying between approximately .20 (for M1, M3, and M5) and .37 (for the PCC). These low values suggest that state-of-the-art seed-voxel-based correlation maps based on rs-fMRI data are unreliable, although the PCC seems to indicate higher (nearly double) reliability than do other regions. Thus, caution is warranted in the interpretation of these maps and in the analysis of connectivity maps obtained from thresholding unreliable fMRI resting-state correlation operators. These results are inconsistent with the large and increasing literature (Braun et al., 2012; Chen et al., 2008; Damoiseaux et al., 2006; Honey et al., 2009; Meindl et al., 2010; Schwarz & McGonigle, 2011; Shehzad et al., 2009; Wang et al., 2011; Zhang et al., 2011; Zuo, Di Martino, et al., 2010; Zuo, Kelly, et al., 2010) on rs-fMRI that reports high reliability of measurements. Much deeper investigation is needed to address these divergent findings and to establish identical estimands, estimators, and evaluation procedures. Our procedure provides a clear, simple, and easy-to-use step in this direction.

Fig. 5

Image scanning time for 176 patients. Every person has a baseline scan at time 0. The x-axis is time in years. The y-axis is patient IDs. We match visit number from different patients by rounding their scan time to quarter month, as indicated by gray dashed lines

DTI-MRI replication results

To highlight methods, a subset of the complete data collection (Fig. 5) consisting of subjects who have at least six visits was selected. This reduced the data set to 117 scans from 18 subjects: 14 subjects with 6 scans, 1 with 7, 2 with 8, and 1 with 10. Henceforth, the subset is viewed as the complete data set, with no further reference to the omitted subjects. We also consider four further subsets labeled “T≤4,” “T≤3,” “T≤2,” and “T≤1.” The notation refers to the number of years since the baseline scan; for example, the T≤4 data set considers only images obtained within the first 4 years from the baseline scan, resulting in 110 scans from the 18 subjects (4 ∼ 5, 11 ∼ 6, 3 ∼ 8, where 4 ∼ 5 refers to 4 subjects with 5 scans). The T≤3 data set contains 88 scans broken down as 6 ∼ 4, 9 ∼ 5, 2 ∼ 6, and 1 ∼ 7. The T≤2 data set contains 70 scans broken down as 7 ∼ 3, 7 ∼ 4, 3 ∼ 5, 1 ∼ 6, and 1 ∼ 7. Finally, the T≤1 data set contains 45 scans broken down as 1 ∼ 1, 1 ∼ 4, 8 ∼ 2, and 8 ∼ 3.

In Zipunnikov et al. (2012), the existence of a longitudinal change over time in these data was studied, with the finding that less than 1% of the variability was explained by longitudinal within-subjects changes. Thus, modeling these data as exchangeable image measurement error processes is likely a valid approximation of the underlying processes. All five data sets are unbalanced, having a different number of replicates per subject. The left panel in Fig. 6 displays the reliability estimators (red horizontal line) and the associated equal tail probability 95% confidence intervals. These results indicate that the reliability of these measurements hovers slightly below .8, which is consistent with the findings in Zipunnikov et al. (2012).

Fig. 6

Estimated image intraclass correlation coefficients (I2C2s; red horizontal lines) and 95% equal tail probability confidence intervals for fractional anisotropy in an area containing the corpus callosum. Gray distributions correspond to the I2C2 estimator under the zero reliability assumption (random permutations of labels). Left panel results are based on 18 subjects who have at least six visits (“All”) and subsets of the “All” data set containing all scans within the first 4, 3, 2, and 1 year from baseline, respectively. Right panel results are based on pairs of images obtained, at most, 1, 2, 3, 4, and 5 years apart. The number of subjects in each data set (from left to right) was 119, 64, 49, 31, and 18, respectively

Our work investigated the reliability of the imaging studies as a function of time by selecting subjects who have at least two replications and constructing five additional replication substudies labeled “1 apart,” “2 apart,” “3 apart,” “4 apart,” and “5 apart,” respectively. To be specific, each such substudy contains exactly two replicates per subject: the baseline observation and the replicate that is closest to being 1, 2, 3, 4, or 5 years apart, respectively. The number of subjects in each data set was 119, 64, 49, 31, and 18, respectively, with more subjects in data sets with shorter between-observation intervals.
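One way to build such "k apart" substudies is sketched below. The text does not state how subjects without a scan near the target interval were handled, so the tolerance `tol`, the function name `pairs_k_apart`, and the toy scan times are all assumptions introduced for illustration; only the rule of pairing the baseline with the follow-up closest to k years comes from the description above.

```python
def pairs_k_apart(scan_times, k, tol=0.5):
    """Pair each subject's baseline scan with the follow-up closest to k years.

    scan_times maps a subject ID to scan times in years, with baseline at 0.
    tol (hypothetical) drops subjects whose closest follow-up is further
    than tol years from the target interval k.
    """
    pairs = {}
    for subj, times in scan_times.items():
        followups = [t for t in times if t > 0]
        if not followups:
            continue  # a subject needs at least two replications
        best = min(followups, key=lambda t: abs(t - k))
        if abs(best - k) <= tol:
            pairs[subj] = (0.0, best)
    return pairs

# Toy scan times in years since baseline (illustrative values only).
scan_times = {"a": [0, 0.9, 2.1, 4.8], "b": [0, 1.2], "c": [0]}
substudies = {k: pairs_k_apart(scan_times, k) for k in (1, 2, 3, 4, 5)}
for k, pairs in substudies.items():
    print(f"{k} apart: {len(pairs)} subjects")
```

Under a rule of this kind, fewer subjects qualify as k grows, which matches the decreasing sample sizes reported above.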

The right panel in Fig. 6 displays the reliability estimators for these replication studies as a function of how many years apart images were taken. The estimated reliability of observations taken within 1 year of each other is quite high, roughly .9, which indicates that there are very few changes in the FA measurements along the corpus callosum of MS subjects within 1 year. This may be good news for individuals with MS if the lack of measured change in FA reflects an actual lack of change in fiber integrity. However, this finding may be disheartening to investigators searching for biomarkers of neuronal fiber degradation, if degradation is actually there. As expected, the reliability of image replication decreases as the time between visits increases, with median reliability of roughly .8 for images collected 5 years apart. However, this decline in reliability is relatively small and likely to be indicative of small observable longitudinal changes. The variability around the estimated I2C2 also increases from the replication study “1 apart” to “5 apart,” although this is most likely due to the decrease in sample size from 119 to 18 subjects with repeat samples.

Discussion

This article proposes an extension of the classical ICC to image replication studies. The resulting parameter, denoted I2C2, provides a global measurement of reliability that is intuitive and easy to calculate. Moreover, I2C2 can readily be calculated for given ROIs by simply restricting the summations in the I2C2 estimator to those voxels within the ROI mask. In practice, one may actually report the I2C2 on a partition of the image into mutually disjoint ROIs, say, \( R_1, \ldots, R_P \). Then I2C2 can be calculated for each \( R_p \), \( p = 1, \ldots, P \), and compared with the overall I2C2. Areas of unexpectedly small estimated I2C2 may further indicate the source and type of measurement error. Another practical approach would be to calculate the I2C2 hierarchically, that is, at the voxel level, then at overlapping neighborhoods of increasing size and, ultimately, at the image level. This could provide an interesting multiresolution approach to visualizing the structure of the measurement error.
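Restricting the computation to ROI masks can be sketched as follows. The data are simulated and the two ROIs, their noise levels, and the names `i2c2` and `roi_masks` are assumptions for illustration; the estimator is assumed to take the moment-based form based on within-subject and total sums of squares.

```python
import numpy as np

def i2c2(W, subject):
    """Moment-based I2C2 for W of shape (n_scans, n_voxels); subject labels each scan."""
    W = np.asarray(W, dtype=float)
    ids, idx = np.unique(np.asarray(subject), return_inverse=True)
    n, I = W.shape[0], len(ids)
    onehot = np.eye(I)[idx]
    subj_mean = (onehot.T @ W) / onehot.sum(axis=0)[:, None]
    within = np.sum((W - subj_mean[idx]) ** 2)
    total = np.sum((W - W.mean(axis=0)) ** 2)
    return 1.0 - (n - 1) / (n - I) * within / total

# Simulated images: voxels 0-29 have small measurement error, voxels 30-59 large.
rng = np.random.default_rng(4)
I, J, V = 40, 2, 60
subject = np.repeat(np.arange(I), J)
noise_sd = np.r_[np.full(30, 0.3), np.full(30, 2.0)]
W = rng.normal(size=(I, V))[subject] + noise_sd * rng.normal(size=(I * J, V))

# A partition of the image into two disjoint ROIs (boolean voxel masks).
roi_masks = {"R1": np.arange(V) < 30, "R2": np.arange(V) >= 30}

overall = i2c2(W, subject)
by_roi = {name: i2c2(W[:, mask], subject) for name, mask in roi_masks.items()}
print(overall, by_roi)
```

Here the ROI with the larger measurement error stands out with a much smaller estimated I2C2 than the overall value, which is the kind of contrast that can point to where the measurement error sits.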

An equally simple measure of reproducibility could be the average of ICC at the voxel levels. An unbiased estimator of the average ICC would then be

$$ 1-\frac{1}{V}\,\frac{\sum_i J_i - 1}{\sum_i \left(J_i - 1\right)} \sum_{v=1}^V \frac{\sum_{i=1}^I \sum_{j=1}^{J_i} \left\{ W_{ij}(v) - \overline{W}_{i.}(v) \right\}^2}{\sum_{i=1}^I \sum_{j=1}^{J_i} \left\{ W_{ij}(v) - \overline{W}_{..}(v) \right\}^2}. $$

Irrespective of the replication estimand and estimation procedure, the subject-level bootstrap and permutation tests introduced in this article can be applied. However, there are reasonable arguments for preferring the I2C2 to the average ICC value. Indeed, the variability attributable to variation among subjects is equal to \( \mathrm{trace}(K_X) \), whereas the variability attributable to visits is \( \mathrm{trace}(K_U) \). Thus, I2C2 is the proportion of variability explained by subject-level variability out of the total variability of the data in the multivariate image measurement error model. In contrast, the average ICC is the average of the proportion of variability explained by subject-level variability out of the total variability of the data in the sequence of univariate (marginal) measurement error models. This distinction has practical implications. Consider, for example, the case where there are 1,000 voxels in every image. At 500 voxels, the absolute variability of the data and reliability are very low. However, at the other 500 voxels, the variability and reliability are large. In this context, the average ICC would place too much emphasis on the low-variability voxels, because it ignores the relative variability of the data at different voxels. A second problem occurs at locations with small visit-to-visit variability, since this variance is used in the denominator of the ICC estimator and may lead to serious computational instabilities.
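The 1,000-voxel example can be made concrete with population-level arithmetic. The specific variances below are assumptions chosen for illustration (they do not come from any of the data sets): the first 500 voxels have total variance .01 and voxel-level ICC .1, and the other 500 have total variance 10 and voxel-level ICC .9.

```python
import numpy as np

# Assumed voxel-wise variance components (illustrative values only).
var_x = np.r_[np.full(500, 0.001), np.full(500, 9.0)]  # subject-level variance, diagonal of K_X
var_u = np.r_[np.full(500, 0.009), np.full(500, 1.0)]  # visit-level variance, diagonal of K_U

# Average ICC: every voxel contributes equally, whatever its variability.
voxel_icc = var_x / (var_x + var_u)
average_icc = voxel_icc.mean()

# I2C2: trace(K_X) / {trace(K_X) + trace(K_U)}, so high-variability voxels dominate.
i2c2_value = var_x.sum() / (var_x.sum() + var_u.sum())

print(f"average ICC = {average_icc:.3f}, I2C2 = {i2c2_value:.3f}")
```

With these values the average ICC is .5, whereas the I2C2 is approximately .90, because almost all of the variability in the image comes from the 500 reliable voxels.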

While data rarely satisfy the measurement error model (Equation 2) exactly, the model is a reasonable starting point for defining the data structure under explicit assumptions. Model assumptions notwithstanding, we prefer this explicit statistical approach to an algorithmic one that obscures assumptions. Moreover, the model can easily be extended to include some obvious data-supported complications. For example, if each visit has a different mean, one can easily expand the model to include (so-called) batch or visit effects,

$$ {{\displaystyle W}}_{ij}={{\displaystyle B}}_j+{{\displaystyle X}}_i+{{\displaystyle U}}_{ij}, $$

as proposed in Di et al. (2009). Here, the images \( B_j \) are visit-specific fixed effect images. Such deterministic changes across all subjects from one visit to another could be due to the use of different scanners or imaging parameters, scanner drift, and so forth. In quality control, agriculture, and lab sciences, such effects arise from a batch being run for measurement or assay (hence, the term “batch effect”). For subjects returning to a scanner, batches are visits. Note that the visit-specific effects can be easily estimated as \( \widehat{B}_j = \sum_{i=1}^I W_{ij}/I \), and one can define the I2C2 for the residuals \( W_{ij} - \widehat{B}_j \).
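A minimal sketch of this visit-effect adjustment is given below for a balanced design in which every subject is scanned at every visit. The simulated visit shift, the variances, and the function name `i2c2` are assumptions for illustration, and the estimator is assumed to take the moment-based form based on within-subject and total sums of squares.

```python
import numpy as np

def i2c2(W, subject):
    """Moment-based I2C2 for W of shape (n_scans, n_voxels); subject labels each scan."""
    W = np.asarray(W, dtype=float)
    ids, idx = np.unique(np.asarray(subject), return_inverse=True)
    n, I = W.shape[0], len(ids)
    onehot = np.eye(I)[idx]
    subj_mean = (onehot.T @ W) / onehot.sum(axis=0)[:, None]
    within = np.sum((W - subj_mean[idx]) ** 2)
    total = np.sum((W - W.mean(axis=0)) ** 2)
    return 1.0 - (n - 1) / (n - I) * within / total

# Simulated balanced study W[i, j, v] = B_j + X_i(v) + U_ij(v) with a shift at visit 2.
rng = np.random.default_rng(2)
I, J, V = 30, 2, 40
B = np.array([0.0, 3.0]).reshape(1, J, 1)     # visit-specific fixed effects (assumed)
X = rng.normal(size=(I, 1, V))                # subject-specific images
U = 0.5 * rng.normal(size=(I, J, V))          # measurement error
W = B + X + U

B_hat = W.mean(axis=0)                        # visit-specific mean image, shape (J, V)
resid = W - B_hat                             # W_ij - B_hat_j

subject = np.repeat(np.arange(I), J)          # row i * J + j belongs to subject i
raw_i2c2 = i2c2(W.reshape(I * J, V), subject)
adj_i2c2 = i2c2(resid.reshape(I * J, V), subject)
print(raw_i2c2, adj_i2c2)
```

In this simulation the unadjusted I2C2 is badly deflated because the visit shift is counted as visit-to-visit error, while the I2C2 of the residuals is close to the population value of .8.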

In more complex models, one may also be interested in, or worried about, the longitudinal effects of collecting the data. For example, in the DTI study, some images are taken within a few months of each other, whereas other images are collected years apart. In such situations, it is reasonable to add a term that accounts for longitudinal changes. A reasonable model for such an approach could be

$$ W_{ij} = B\left(T_{ij}\right) + X_{i,0} + X_{i,1} T_{ij} + U_{ij}, $$

where \( B(T_{ij}) \) is an effect that depends on the time of the visit, \( T_{ij} \), because in most longitudinal studies visits are not equally spaced, and \( X_{i,0} \) and \( X_{i,1} \) are the subject-specific random intercept and random slope images. In this model, \( B(T_{ij}) + X_{i,0} \) is the true unobserved image at baseline (\( T_{ij} = 0 \)), \( B(T_{ij}) + X_{i,0} + X_{i,1} T_{ij} \) is the true unobserved image at time \( T_{ij} > 0 \), and \( U_{ij} \) is the image measurement error process. Estimation of these types of models is thoroughly discussed in Greven et al. (2010) and Zipunnikov et al. (2012), but it is worth noting that reasonable assumptions about the data can easily be incorporated into statistical models.

Regardless of the model under investigation, the image error process, \( U_{ij} \), deserves particular attention. Indeed, for all the models discussed in this article, one can estimate the covariance operator, \( K_U \), and visually inspect its first eigenvectors. This provides clues into the structure of measurement error. For further reading on measurement error modeling, we recommend Carroll et al. (2006) and Fuller (1987). For the effect of image measurement error on estimating associations with outcomes, we recommend Crainiceanu, Staicu, and Di (2009), while for inference in the means of two imaging processes, we recommend Crainiceanu, Staicu, Ray, and Punjabi (2012).
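For the basic model without visit or time effects, the leading eigenvectors of the estimated measurement error covariance can be obtained from the within-subject residuals, as sketched below. The simulated error structure (a single dominant error direction plus white noise) and all variable names are assumptions for illustration; the singular value decomposition of the residual matrix is used here so that the voxel-by-voxel covariance matrix never has to be formed, which is one of several possible computational choices.

```python
import numpy as np

# Simulated data: measurement error concentrated along one known spatial direction.
rng = np.random.default_rng(3)
I, J, V = 30, 3, 40
subject = np.repeat(np.arange(I), J)
direction = np.zeros(V)
direction[:10] = 1.0 / np.sqrt(10.0)          # unit-norm error pattern (assumed)
X = rng.normal(size=(I, V))
U = 3.0 * rng.normal(size=(I * J, 1)) * direction + 0.3 * rng.normal(size=(I * J, V))
W = X[subject] + U

# Within-subject residuals W_ij - W_i. remove the subject-specific image X_i.
subj_mean = np.stack([W[subject == i].mean(axis=0) for i in range(I)])
R = W - subj_mean[subject]

# K_U_hat = R'R / sum_i (J_i - 1); its eigenvectors are the right singular vectors of R.
n = W.shape[0]
_, s, Vt = np.linalg.svd(R, full_matrices=False)
eigvals = s ** 2 / (n - I)
eigvecs = Vt                                   # row k is the k-th eigenvector (an image)
share = eigvals / eigvals.sum()                # proportion of error variability explained

alignment = abs(eigvecs[0] @ direction)
print(f"first component explains {share[0]:.2f} of the error variability")
print(f"alignment with the simulated error direction: {alignment:.2f}")
```

Reshaping `eigvecs[0]` back to the image grid gives a map that can be inspected visually for spatial structure in the measurement error.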