1 Introduction

The negative impact of head motion on measurements derived from brain MRI has recently received considerable attention in the neuroimaging literature. In the context of functional connectivity studies, it has been shown that head motion has substantial, systematic effects on the timecourses of fMRI data, leading to spurious variations in correlation estimates and thus in measured functional coupling [1, 2]. In diffusion MRI, motion typically produces increased radial diffusivity estimates, while decreasing axial diffusivity and fractional anisotropy measures [3]. In morphometric studies with structural MRI, it has recently been shown that head motion decreases the estimates of cortical thickness and gray matter volumes [4]. Therefore, head motion is an important confounding factor that can undermine the conclusions of MRI-based neuroimaging studies. While motion certainly affects studies of a single healthy group, it is a particularly important factor in group studies involving clinical or special populations, in which one group is more prone to moving in the scanner than the other (e.g., Parkinson’s disease).

To mitigate these problems, one would ideally use motion correction methods, which can be prospective or retrospective. The former attempt to dynamically keep the measurement coordinate system fixed with respect to the subject during acquisition; head motion can be tracked with an external system (e.g., camera and markers [5]) or with image-based navigators [6, 7]. Retrospective methods attempt to correct for motion after the acquisition: some algorithms exploit information from external trackers as well [8], while others use the raw k-space data [9]. Unfortunately, neither prospective motion correction nor external trackers are widely available yet. Moreover, there are immense amounts of legacy MRI data for which the raw k-space data are not available (since only reconstructed images are normally stored in the PACS), which limits the applicability of retrospective k-space techniques.

A simpler, more widely applicable alternative to reconstructing motion-free images is to estimate a measure of motion, either manually or automatically. Manual assessment typically takes the form of a quality control (QC) step, in which a human rater discards scans that display motion artifacts. Despite its simplicity, manual QC is neither continuous nor reproducible, and can introduce bias in subsequent analyses. These problems can be ameliorated with automated techniques, which generate continuous, reproducible motion scores that can be used in two different ways: as automated QC and as nuisance factors. In automated QC, subjects with scores over a threshold are excluded in a systematic and reproducible manner. When used as nuisance factors, scores are regressed out of the target variable to reduce the impact of motion on the analysis [3], so no subjects are discarded.

In functional and diffusion MRI, head motion can be estimated from the parameters of the transforms that co-register the different frames. In structural MRI, however, the absence of temporal information makes extracting measures of motion more difficult. Here we present a machine learning approach to retrospectively quantify motion from structural brain MRI. To the best of our knowledge, this is the first motion estimation method that relies solely on image intensities. Motion detection is cast as a supervised classification problem, which is solved with a convolutional neural network (CNN). We use a 3D network architecture (similar to 3D U-net [10]) with a nonlinear data augmentation scheme that enables learning with sparsely annotated MRI scans. This is a key feature in our application, since image regions corrupted by motion artifacts (e.g., ghosting, blurring) have ill-defined boundaries, and are difficult to manually delineate with precision – especially in 3D. We also model uncertainty in the CNN with dropout at testing [11], and a scalar motion score is produced by averaging the probability map estimated by the CNN across an application-dependent ROI.

Our technique requires no specialized equipment, and can be used to analyze both prospective and legacy MRI data. We evaluated the method with two datasets involving motion-prone populations (healthy children and children with autism). Using a ROI comprising the cortical ribbon and an underlying layer of white matter, we show that our motion score is strongly associated with cortical thickness (which is known to be sensitive to motion [4]), accurately predicts the results of human QC, and recovers group differences confounded by motion in a group study.

2 Methods

2.1 Voxel Classifier

The core of our method is a classifier that produces, for each voxel, an estimate of the probability that its intensity is corrupted by motion artifacts. As classifier, we use a 3D CNN based on the 3D U-net architecture [10], which is robust against sparsely labeled training data. Our architecture is shown in Fig. 1. The network is leaner than in [10]: a large receptive field is not needed to detect motion artifacts, and the smaller size speeds up training and inference.

The network has an analysis and a synthesis stage with three levels of resolution. The input is an image patch of size \(64^3\) voxels (1 mm isotropic). At the analysis stage, the convolution layers have kernels of size 3\(\times \)3\(\times \)3 (stride 1), and are followed by rectified linear units (ReLU), batch normalization [12] and max pooling (2\(\times \)2\(\times \)2, stride 2). At the synthesis stage, deconvolutions (2\(\times \)2\(\times \)2, stride 2) are followed by a 3\(\times \)3\(\times \)3 convolutional layer and a ReLU. At test time, we also apply random dropout at the ReLUs, in order to obtain different samples of the posterior distribution of the output [11]. Shortcut connections link layers of matching resolution at the analysis and synthesis stages, providing the latter with information at increasingly higher resolution at each level. In the last layer, a 1\(\times \)1\(\times \)1 convolution reduces the number of outputs to two: motion and no motion. We used weighted cross-entropy as the loss function, and trained on sparsely labeled data by setting the weight of unlabeled voxels to zero. The output is a \(42^3\) voxel tile, with a receptive field of size \(22^3\) voxels. To classify a whole volume, we simply partition it into overlapping tiles of size \(64^3\) voxels.

Fig. 1. CNN architecture. Conv. stands for convolution, BN for batch normalization, and ReLU for rectified linear unit. The number of feature maps is displayed above each layer.
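
For concreteness, the snippet below sketches a voxel classifier of this kind in PyTorch. It is an illustrative reconstruction rather than the Caffe model used in this work: the feature-map widths, dropout placement and class weights are assumptions (Fig. 1 gives the actual feature maps), and convolutions are zero-padded here for simplicity, whereas the unpadded convolutions in the paper shrink a \(64^3\) input tile to a \(42^3\) output tile.

```python
# Minimal 3D U-net-style voxel classifier with two output classes (motion /
# no motion). Illustrative sketch only; not the architecture of Fig. 1.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3x3 convolution -> ReLU -> batch normalization (Sect. 2.1)
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm3d(out_ch),
    )

class MotionUNet3D(nn.Module):
    def __init__(self, feats=(16, 32, 64)):            # assumed channel widths
        super().__init__()
        f1, f2, f3 = feats
        self.enc1 = conv_block(1, f1)
        self.enc2 = conv_block(f1, f2)
        self.enc3 = conv_block(f2, f3)
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)
        self.up2 = nn.ConvTranspose3d(f3, f2, kernel_size=2, stride=2)
        self.dec2 = conv_block(2 * f2, f2)
        self.up1 = nn.ConvTranspose3d(f2, f1, kernel_size=2, stride=2)
        self.dec1 = conv_block(2 * f1, f1)
        self.drop = nn.Dropout3d(p=0.5)                 # kept active at test time (MC dropout)
        self.out = nn.Conv3d(f1, 2, kernel_size=1)      # per-voxel logits n(x), m(x)

    def forward(self, x):                               # x: (batch, 1, 64, 64, 64)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))                   # lowest resolution
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))   # shortcut connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))   # shortcut connection
        return self.out(self.drop(d1))

# Sparse labels: voxels marked 2 ("unlabeled") are ignored, i.e. receive zero
# weight; the class weights (assumed here) balance the motion / no-motion terms.
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 3.0]), ignore_index=2)
```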

2.2 Computation of the Measure of Head Motion

Following [13], we use the average probability within a ROI as the global score:

$$\begin{aligned} M = \frac{1}{|\varOmega _{ROI}|} \sum _{\varvec{x} \in \varOmega _{ROI}} p_m(\varvec{x}) = \frac{1}{|\varOmega _{ROI}|} \sum _{\varvec{x} \in \varOmega _{ROI}} \frac{\exp [m(\varvec{x})]}{\exp [n(\varvec{x})]+\exp [m(\varvec{x})]}, \end{aligned}$$
(1)

where M is our global motion score, \(\varOmega _{ROI}\) is the ROI domain, \(\varvec{x}\) is a voxel location, and \(p_m(\varvec{x})\) is the probability that the voxel at location \(\varvec{x}\) is motion corrupted. This probability is computed as the softmax of \(n(\varvec{x})\) and \(m(\varvec{x})\), the activation strengths of the no-motion and motion units at the final layer of the CNN, respectively. While a single \(p_m(\varvec{x})\) is a weak measure of head motion, its average across the ROI provides a robust estimate [13].
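
As a sketch of how Eq. (1) can be evaluated in practice (names are placeholders: `logits` is assumed to hold the two CNN output channels and `roi` a boolean mask of the same spatial shape):

```python
import numpy as np

def motion_score(logits, roi):
    """Average motion probability over the ROI, as in Eq. (1)."""
    n, m = logits[0], logits[1]                 # no-motion / motion activations
    p_m = np.exp(m) / (np.exp(n) + np.exp(m))   # per-voxel softmax probability of motion
    return float(p_m[roi].mean())               # M: mean over the ROI domain
```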

3 Experiments and Results

3.1 MRI Data and Manual Annotations

We used two different datasets in this study. The first dataset (henceforth the “in-house” dataset) consists of brain MRI scans from \(n=48\) healthy children aged 7.1–11.5 years, acquired with a 3T Siemens scanner using an MP-RAGE sequence at 1 mm isotropic resolution. Two separate sets of ground truth annotations were created for this dataset: at the scan level (for testing automatic QC) and at the voxel level (for training the CNN). At the scan level, we made two sets of QC annotations: one by a trained RA (SM), which we used as ground truth (\(n_{\text {pass}}=34\), \(n_{\text {fail}}=14\)), and a second by JEI, used to assess inter-rater variability.

At the voxel level, creating dense segmentations is time consuming and hard to reproduce due to the difficulty of placing accurate boundaries around regions with motion artifacts, particularly in 3D. Instead, we made sparse annotations as follows. First, the RA went over the QC-passed scans, and identified slices in different orientations (axial / sagittal / coronal, approximately 30 per scan) that displayed no motion artifacts. The voxels inside the brain in these slices were all labeled as “no motion”, whereas all other voxels in the scan were not used in training. Then, the RA went over the QC-failed scans, and drew brushstrokes on regions inside the brain that clearly showed motion artifacts, making sure that the annotations were highly specific. These voxels were labeled as “motion”, whereas the remaining voxels were not used to train the classifier. The process took approximately 10–15 min per scan.

In order to test our classifier in a practical scenario and assess its generalization ability, we used a second dataset: the Autism Brain Imaging Data Exchange (ABIDE [14]). Even though effect sizes are notoriously small in autism spectrum disorder (ASD), ABIDE is a representative example of the type of application for which our method can be useful, since children with ASD might be more prone to moving in the scanner. We used a subset of ABIDE consisting of the \(n=111\) subjects (68 controls, 47 ASD) younger than 12 years (range: \(10-12\)). This choice was motivated by: 1. staying in the age range in which children with ASD still have increased cortical thickness [15, 16]; and 2. matching the population with that of the in-house dataset. This subset of ABIDE was acquired on nine different scanners across different sites, mostly with MP-RAGE sequences at 1 mm resolution (see [14]).

In both datasets, image intensities were coarsely normalized by dividing them by their robust maximum, computed as the \(98^{th}\) percentile of their intensity distribution. Cortical thickness measures were obtained with FreeSurfer [17].
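
The normalization step amounts to the following (a minimal sketch; `vol` is a placeholder for the image array, and whether the percentile is taken over the whole volume or only over brain voxels is an assumption here):

```python
import numpy as np

def normalize_intensities(vol):
    robust_max = np.percentile(vol, 98)   # robust maximum: 98th percentile of intensities
    return vol / robust_max
```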

3.2 Experimental Setup

The motion metric from Eq. 1 was computed for the scans from both datasets as follows. For the in-house dataset, we used cross-validation with just two pseudorandom folds (since training the CNN is computationally expensive), ensuring that the number of QC-failed scans was the same in both. For ABIDE, rather than retraining the CNN on the whole in-house dataset, we processed the scans with the two CNNs that had already been trained (one per fold) and averaged their outputs.

The 3D CNNs were trained end-to-end from scratch using a modified version of the publicly available 3D U-net implementation, which is based on the Caffe framework [18]. Data augmentation included: translations; linear mapping of image intensities (slope between 0.8 and 1.2); rotations (up to 15\(^\circ \) around each axis); and elastic deformations based on random shifts of control points and B-spline interpolation (control points 16 voxels apart, random shifts with standard deviation of 2 voxels). Stochastic gradient descent was used to minimize the weighted cross-entropy. We used different (constant) weights for the positive and negative samples to balance their total contributions to the loss function. We trained until the cross-entropy flattened on the training data (i.e., no validation set was used), which happened at 60,000 iterations (approximately 10 h on a Nvidia Titan X GPU). In testing, we used a 50% overlap between output tiles to mitigate boundary artifacts. Further smoothness was achieved by the dropout-at-testing scheme [11] (dropout probability: 0.5), which also enriched the distribution of output probabilities. The final probability of motion for each voxel was computed as the average of the available estimates at that spatial location.
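
The augmentation scheme can be sketched as follows. This is an illustrative reconstruction, not the modified Caffe code used here: random translations and the corresponding warping of the label maps are omitted, and cubic-spline upsampling of random control-point shifts stands in for the B-spline deformation described above.

```python
import numpy as np
from scipy.ndimage import zoom, rotate, map_coordinates

def augment(vol, rng=np.random.default_rng()):
    # Linear intensity mapping with slope in [0.8, 1.2]
    out = vol * rng.uniform(0.8, 1.2)

    # Rotations of up to 15 degrees around each axis
    for axes in [(0, 1), (0, 2), (1, 2)]:
        out = rotate(out, rng.uniform(-15, 15), axes=axes, reshape=False, order=1)

    # Elastic deformation: control points 16 voxels apart, random shifts with a
    # standard deviation of 2 voxels, upsampled to a dense displacement field with
    # cubic splines (order=3). Assumes patch dimensions are multiples of 16 (e.g. 64^3).
    cp_shape = [s // 16 for s in out.shape]
    coords = np.meshgrid(*[np.arange(s) for s in out.shape], indexing='ij')
    warped_coords = []
    for axis in range(3):
        shifts = rng.normal(0.0, 2.0, size=cp_shape)
        warped_coords.append(coords[axis] + zoom(shifts, 16, order=3))
    return map_coordinates(out, warped_coords, order=1, mode='nearest')
```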

We evaluated our proposed approach both directly and indirectly. For the direct validation, we assessed the ability of the motion score to predict the output of human QC on the in-house dataset. For the indirect validation, we examined the relationship between our motion score and average cortical thickness, as well as the ability of the score to enhance group differences when regressed out. To compute the motion score, we used a ROI (\(\varOmega _{ROI}\)) comprising the cortical ribbon (as estimated by FreeSurfer) and an underlying 3 mm layer of cerebral white matter, computed by inward dilation with a spherical kernel.
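
A possible implementation of this ROI is sketched below (an assumption-laden sketch: `ribbon` and `wm` are placeholder boolean masks derived from FreeSurfer's segmentations, at 1 mm isotropic resolution so that 3 mm corresponds to a radius of 3 voxels):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def motion_roi(ribbon, wm, radius_vox=3):
    # Spherical structuring element with the requested radius (in voxels)
    grid = np.mgrid[-radius_vox:radius_vox + 1,
                    -radius_vox:radius_vox + 1,
                    -radius_vox:radius_vox + 1]
    ball = (grid ** 2).sum(axis=0) <= radius_vox ** 2
    # Dilate the cortical ribbon and keep only voxels that fall in white matter,
    # yielding the underlying layer; the ROI is the ribbon plus this layer.
    inner_layer = binary_dilation(ribbon, structure=ball) & wm
    return ribbon | inner_layer
```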

3.3 Results

Qualitative Results: Figure 2 shows sagittal slices of four sample MRI scans with increasingly severe artifacts, along with the corresponding outputs of the CNN: (a) is crisp and motion-free, and few voxels yield a high probability of motion; (b) shows minimal ringing, mostly in the superior region; (c) shows moderate motion; and (d) displays severe blurring and ringing due to motion, such that the CNN produces high probabilities across most of the ROI.

Fig. 2. Sagittal slices of four cases and corresponding probability maps (masked by the ROI, outlined in blue). (a) \(M=0.12\) (lowest in dataset). (b) \(M=0.19\). (c) \(M=0.25\). (d) \(M=0.32\) (failed QC). The arrows point at motion artifacts.

Fig. 3. (a) Distribution of motion scores for the two QC groups. (b) ROC for automatic QC based on score thresholding; the dot marks the operating point: 91.6% accuracy. (c) Distribution of cortical thickness with and without correction.

Quantitative Results on In-house Dataset: Figure 3(a) shows the distributions of the motion scores for the two QC groups, which are far apart: a non-parametric test (Wilcoxon rank-sum) yields \(p = 5 \times 10^{-8}\). Therefore, a classifier based on thresholding the score can closely mimic human QC, reaching 0.916 accuracy and 0.941 area under the receiver operating characteristic (ROC) curve; see Fig. 3(b). This performance is close to the inter-rater agreement, which was 0.958. We also found a strong negative correlation between our score and mean cortical thickness: \(\rho =-0.66\) (95% C.I. [-0.79,-0.46], \(p=3\times 10^{-7}\)). When correcting for motion, the variance of the cortical thickness decreased from 0.0191 mm\(^2\) to 0.0108 mm\(^2\), i.e., by 43% (\(R^2_{adj}=0.42\)); see Fig. 3(c).
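
The reported reduction in thickness variance corresponds to regressing the motion score out of the mean thickness; a minimal sketch (with placeholder 1-D arrays `thickness` and `score` over subjects):

```python
import numpy as np

def variance_before_after(thickness, score):
    X = np.column_stack([np.ones_like(score), score])      # intercept + motion score
    beta, *_ = np.linalg.lstsq(X, thickness, rcond=None)   # ordinary least squares fit
    residuals = thickness - X @ beta                        # thickness with motion regressed out
    return thickness.var(ddof=1), residuals.var(ddof=1)     # variance before / after correction
```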

Results on ABIDE Dataset: Using a Wilcoxon rank-sum test, we found differences in motion scores between the two groups (\(p=0.03\)), a circumstance that can undermine the conclusions of cortical thickness comparisons. We built a general linear model for the left-right averaged mean thickness of each FreeSurfer cortical region, with the following covariates: age, gender, group, site of acquisition and, optionally, our motion score. Introducing motion as a covariate changed the results considerably, as shown by the significance maps in Fig. 4, which are overlaid on an inflated reference surface (“fsaverage”).
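
A sketch of this region-wise model, assuming a statsmodels-style formula interface and hypothetical column names in a per-subject data frame `df`:

```python
import statsmodels.formula.api as smf

def fit_region(df, with_motion=True):
    # Mean thickness of one region, with age, gender, group and site as covariates;
    # the motion score is added as an optional nuisance covariate.
    formula = 'thickness ~ age + C(gender) + C(group) + C(site)'
    if with_motion:
        formula += ' + motion_score'
    return smf.ols(formula, data=df).fit()   # inspect .params / .pvalues for the group effect
```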

Figure 4(a,d) shows an inferior-posterior view exposing the occipital lobe and lingual gyrus, areas in which increased cortical thickness has been reported in children with ASD [16]. The motion-corrected model increases the effect size in the occipital lobe (particularly the inferior region) and detects differences in the lingual gyrus that were missed by the model without motion – possibly because the effect of motion was very strong in this region (\(p=5\times 10^{-7}\) for its slope).

Figure 4(b,e) shows a lateral view, in which correction for motion reveals effects in the temporal lobe and the insula that would otherwise have been missed. The thicknesses of both of these regions showed a strong association with our motion score: \(p=5\times 10^{-9}\) and \(p=2\times 10^{-8}\), respectively. Finally, the model with motion also detected differences in the mid-anterior cingulate cortex that the uncorrected model missed, as shown in the medial view in Fig. 4(c,f) (effect of motion: \(p=3\times 10^{-8}\)).

Fig. 4. Region-wise significance map for differences in cortical thickness between the ASD and control groups (left-right averaged). The color map represents \(-\log _{10} p\). (a) Inferior-posterior view, model without motion. (b) Lateral view, model without motion. (c) Medial view, model without motion. (d-f) Model with motion.

4 Discussion

This work constitutes a first step toward retrospectively estimating in-scanner motion from structural MRI scans, without requiring external trackers or raw k-space data. The technique not only enables sites without access to specialized MRI acquisition to account for motion, but also makes it possible to reanalyze legacy datasets while correcting for motion, which can considerably change the results – as we have shown on ABIDE, without even fine-tuning our CNN to this dataset.

Our method is specific to the population and MRI contrast it was trained on. However, once a CNN has been trained, accurate motion estimates can be obtained automatically for all subsequent scans within a center, with some generalization ability to other datasets. Training datasets for other MRI contrasts can be created with limited effort (ca. 10 h), since training relies on sparsely labeled data. Moreover, manual labeling effort could in principle be saved by fine-tuning our CNN to a new dataset, using only a handful of (sparsely) annotated scans.

Future work will follow three directions: 1. Fine-tuning the CNN to other datasets; 2. Testing the method on other morphometric measures and ROIs (e.g., hippocampal volume); and 3. Extension to motion correction, by training on a (possibly synthetic) set of matched motion-free and motion-corrupted scans.