1 Introduction

Accurate prediction of Alzheimer’s disease (AD) in individuals with Mild Cognitive Impairment (MCI) is highly valuable for clinical management and treatment. Volumetric MRI can be used to quantify brain atrophy, a primary pathologic process in AD due to widespread neuronal death (Jack et al. 2010). Many studies suggest that structural changes in the brain can be detected in the early stages of AD (Fennema-Notestine et al. 2009; Karow et al. 2010; Aksu et al. 2011). Several studies analyzed brain atrophy patterns in different regions of interest (ROIs) and at different disease stages (Fennema-Notestine et al. 2009; Karow et al. 2010; Aksu et al. 2011; Leung et al. 2010; McDonald et al. 2009; Leung et al. 2013). They found that the rates of atrophy are not uniform, but vary with disease stage and region. For example, Leung et al. (2010) found a higher rate of hippocampal atrophy in MCI converters (MCI-c) than in MCI nonconverters (MCI-nc). Fennema-Notestine et al. (2009) found that some regions, such as the mesial temporal cortex, exhibited a linear rate of atrophy throughout both MCI stages and AD, whereas other regions, such as the lateral temporal cortex and middle gyrus, showed accelerated atrophy later in the disease. These results imply that brain atrophy in certain regions can be used to differentiate MCI-c from MCI-nc subjects. Longitudinal data capture atrophy patterns that change over time across disease stages. Compared with single-time-point imaging data, longitudinal data describe brain atrophy more completely and are therefore likely to perform better in predicting conversion from MCI to AD.

MRI offers a non-invasive, widely available and cost-effective way to obtain imaging biomarkers. Image analysis software such as FreeSurfer can provide accurate estimates of regional brain volume, cortical thickness, and even curvature. Regional brain volume has been investigated as a potential biomarker for monitoring atrophy and thereby the progression of AD. In this paper we developed prediction models for conversion from MCI to AD based on the longitudinal volumetric data of five ROIs measured with MRI: the hippocampus (H), entorhinal cortex (EC), middle temporal cortex (MTC), fusiform gyrus (FG) and whole brain (WB).

Among the papers that investigate prediction of conversion from MCI to AD using longitudinal MRI data, several stand out as particularly pertinent to our present work (Zhang et al. 2018; Misra et al. 2009; Zhang et al. 2012; Li et al. 2012; Lee et al. 2016; Arco et al. 2016; Adaszewski et al. 2013; Cuingnet et al. 2011). Zhang et al. (2018) introduced a supervised learning method in which a specifically designed logistic regression uses the results of a multi-state Markov model to predict progression from normal cognition and MCI to AD, with an accuracy of \(78.5\%\). Misra et al. (2009) applied a high-dimensional pattern classification method to longitudinal MRI scans to predict conversion from MCI to AD within an average period of 15 months; an abnormality score was calculated for classification, achieving an accuracy of \(81.5\%\). Zhang et al. (2012) performed prediction on multimodal longitudinal data (i.e., MRI, PET, etc.) using a longitudinal feature selection method, with a multi-kernel SVM for classification. They could predict conversion from MCI to AD at least 6 months in advance with an accuracy of \(78.4\%\), sensitivity of \(79.0\%\), and specificity of \(78.0\%\). In Li et al. (2012), 262 features were calculated from longitudinal cortical thickness measured by MRI; mRMR (minimum redundancy maximum relevance) was first applied for feature selection, and an SVM (support vector machine) was then trained for classification. Their method could detect \(81.7\%\) (\(\hbox {AUC} = 0.875\)) of the MCI converters 6 months ahead of conversion to AD. Lee et al. (2016) used baseline plus 1-year follow-up callosal MRI scans to predict conversion from MCI to AD in the following 2–6 years, applying a logistic regression model with fused lasso regularization; the prediction accuracy was \(84\%\) in females and \(61\%\) in males. Arco et al. (2016) performed prediction based on baseline and 1-year follow-up MRI data and clinical scores (MMSE and ADAS-Cog), with feature selection based on a two-sample t-test and classification based on maximum-uncertainty linear discriminant analysis. They achieved a classification accuracy of \(73.95\%\), with a sensitivity of \(72.14\%\) and a specificity of \(73.77\%\) at 6 months before conversion. In Adaszewski et al. (2013), voxel-based longitudinal structural MRI data were used, with weight-based feature selection and an SVM for prediction. Both sensitivity and specificity reached \(65\%\) at 1–4 years before conversion; notably, conversion could be detected 4 years in advance with a sensitivity of \(62\%\).

As mentioned in Eskildsen et al. (2013), one factor preventing high predictive power is the heterogeneity of the MRI data due to subjects being at various disease stages. Without homogenizing the MRI data by disease stage, it is impossible to establish a reasonable pattern of atrophy at a specific disease stage. Adaszewski et al. (2013) attempted to solve this issue by aligning the data according to the “time of conversion”. We adopted the same alignment strategy in this paper. This strategy allows us to assess how early we can predict the conversion from MCI to AD with our algorithms.

The number of data points varies across subjects. In the ADNI cohort, some subjects have 9 years of data with up to 10 data points, but many have only 1–3 data points. When data points are sparse and irregular, traditional longitudinal data analysis does not have enough power to provide accurate estimation and prediction. Therefore, we adopted a technique known as principal component analysis through conditional expectation (PACE; Yao et al. 2005) to analyze the realigned longitudinal MRI data. In PACE, the mean curve and covariance structure of the biomarker over time are estimated from the pooled data of all individuals. In this way, the longitudinal observations of each subject can be recovered as a smooth trajectory even if only one or a few observations are available. Once the smooth trajectories are obtained, they are treated as functional data. A functional prediction method proposed by Hall and Maiti (2012) is then employed to determine, using only part of the trajectory, when an early decision can reasonably be made and which ROI is best for prediction. Moreover, we examined whether any combination of the ROIs provides a better prediction.

The remainder of the paper is organized as follows. Section 2 introduces the PACE approach for analyzing longitudinal data and the functional prediction method. Section 3 gives a detailed description of the data used in the analysis. Section 4 presents the numerical results. We conclude with a discussion in Sect. 5.

2 Methodology

2.1 Functional principal component analysis for longitudinal data

PACE, a functional principal component (FPC) method, constructs the FPC scores as conditional expectations. It extends the applicability of the regular FPC method to longitudinal data analysis, especially where only a few repeated and irregularly spaced measurements are available for each subject. In PACE, the longitudinal data are modeled as noisy sample points from a collection of trajectories that are assumed to be independent realizations of a smooth random function X(t), with unknown mean function \(EX(t)=\mu (t)\) and covariance function cov\((X(s),X(t))=G(s,t),\) where \(t,s \in \mathcal {T}\) and \(\mathcal {T}\) is a bounded and closed time interval. The covariance function G of the process has an orthogonal expansion (in the \(L^2\) sense) in terms of eigenfunctions \(\phi _k\) and non-increasing eigenvalues \(\lambda _k\): \(G(s,t)=\sum _k\lambda _k\phi _k(s)\phi _k(t),\ s,t \in \mathcal {T}.\) From classical functional principal component analysis, the ith underlying random curve can be expressed through the Karhunen–Loève expansion as

$$\begin{aligned} X_i(t)=\mu (t)+\sum _k\xi _{ik}\phi _k(t),\ t\in \mathcal {T} \end{aligned}$$
(1)

where \(\xi _{ik}\) are uncorrelated random variables with mean 0 and variance \(E\xi _{ik}^{2}=\lambda _k\), with \(\sum _k\lambda _k<\infty\) and \(\lambda _1\ge \lambda _2 \ge \cdots\).

Let \(Y_{ij}\) be the jth observation of the random function \(X_i(\cdot ),\) observed at a random time \(T_{ij}\in \mathcal {T}.\) Then \(Y_{ij}\) can be modeled as:

$$\begin{aligned} Y_{ij}&=X_i(T_{ij})+\epsilon _{ij}\nonumber \\&=\mu (T_{ij})+\sum _{k=1}^{\infty }\xi _{ik}\phi _k(T_{ij})+\epsilon _{ij} \end{aligned}$$
(2)

where \(\epsilon _{ij}\) are additional measurement errors that are assumed to be independent and identically distributed, independent of \(\xi _{ik}\), with \(E(\epsilon _{ij})=0\) and var\((\epsilon _{ij})=\sigma ^2\), for \(i=1,\ldots ,n\), \(j=1,\ldots ,N_i\), where \(N_i\) is the number of observations for the ith subject. To reflect sparse and irregular designs, the \(N_i\) are assumed to be independent and identically distributed random variables, independent of all other random variables.

Now we need to estimate the mean function \(\mu (t)\), covariance function \(G(s,t)\), eigenfunctions \(\phi _k(t)\) and eigenvalues \(\lambda _k\), as well as the functional principal component scores (FPC scores) \(\xi _{ik}\) for \(k=1,2,\ldots\) and each subject \(i=1,2,\ldots ,n.\)

Assuming that the mean function, covariance function and eigenfunctions are smooth, they can be estimated by local linear smoothing. First the mean function \(\mu (t)\) is estimated based on the pooled data from all individuals by minimizing

$$\begin{aligned} \sum _{i=1}^{n}\sum _{j=1}^{N_i}\kappa _1\left( \frac{T_{ij}-t}{h_{\mu }}\right) (Y_{ij}-\beta _0-\beta _1(t-T_{ij}))^2 \end{aligned}$$

with respect to \(\beta _0\) and \(\beta _1\), where \(\kappa _1(\cdot )\) is a kernel function and \(h_\mu\) is a bandwidth. Then the estimate of \(\mu (t)\) is \(\hat{\mu }(t)=\hat{\beta }_0(t)\).
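For illustration, a minimal Python sketch of this local linear smoothing step is given below; it assumes a Gaussian kernel and a user-supplied bandwidth, and the function name is ours rather than part of the PACE package.

```python
import numpy as np

def local_linear_mean(T, Y, t_grid, h, kernel=lambda u: np.exp(-0.5 * u ** 2)):
    """Local linear estimate of the mean function mu(t) from the pooled
    (T_ij, Y_ij) pairs of all subjects; mu_hat(t) = beta0_hat(t)."""
    T, Y = np.asarray(T, float), np.asarray(Y, float)
    mu_hat = np.empty(len(t_grid))
    for m, t in enumerate(t_grid):
        w = kernel((T - t) / h)                          # kernel weights kappa_1((T_ij - t)/h_mu)
        X = np.column_stack([np.ones_like(T), t - T])    # regressors for (beta0, beta1)
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * Y, rcond=None)  # weighted least squares
        mu_hat[m] = beta[0]                              # mu_hat(t) = beta0_hat(t)
    return mu_hat
```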

Let \(G_i(T_{ij},T_{il})=(Y_{ij}-\hat{\mu }(T_{ij}))(Y_{il}-\hat{\mu } (T_{il}))\) be the “raw” covariances. The local linear surface smoother for \(G(s,t)\) is defined by minimizing

$$\begin{aligned} \sum _{i=1}^{n}\sum _{1\le j\ne l\le N_i}\kappa _2 \left( \frac{T_{ij}-s}{h_G},\frac{T_{il}-t}{h_G}\right) \times (G_i(T_{ij},T_{il})-f(\varvec{\beta },(s,t),(T_{ij},T_{il})))^2, \end{aligned}$$

where \(f(\varvec{\beta },(s,t),(T_{ij},T_{il}))=\beta _0 +\beta _{11}(s-T_{ij})+\beta _{12}(t-T_{il})\), \(\kappa _2(\cdot ,\cdot )\) is a kernel function and \(h_G\) is a bandwidth. Minimization is with respect to \(\varvec{\beta }=(\beta _0,\beta _{11},\beta _{12})\). Then the estimate of \(G(s,t)\) is \(\hat{G}(s,t)=\hat{\beta }_0(s,t).\)
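As a rough illustration (not the PACE implementation), the surface smoother can be written as a weighted least-squares fit of a local plane at each target point \((s,t)\); the sketch below assumes a Gaussian product kernel and that `mu_fn` is a vectorized estimate of the mean function from the previous step.

```python
import numpy as np

def local_linear_cov(T_list, Y_list, mu_fn, s, t, h,
                     kernel=lambda u, v: np.exp(-0.5 * (u ** 2 + v ** 2))):
    """Local linear surface estimate G_hat(s, t) = beta0_hat(s, t) built from
    the 'raw' covariances G_i(T_ij, T_il), excluding the diagonal j = l."""
    U, V, Graw = [], [], []
    for Ti, Yi in zip(T_list, Y_list):
        Ti = np.asarray(Ti, float)
        r = np.asarray(Yi, float) - mu_fn(Ti)            # residuals Y_ij - mu_hat(T_ij)
        for j in range(len(Ti)):
            for l in range(len(Ti)):
                if j != l:                               # diagonal carries sigma^2, so omit it
                    U.append(Ti[j]); V.append(Ti[l]); Graw.append(r[j] * r[l])
    U, V, Graw = map(np.asarray, (U, V, Graw))
    w = kernel((U - s) / h, (V - t) / h)                 # kappa_2 weights
    X = np.column_stack([np.ones_like(U), s - U, t - V]) # (beta0, beta11, beta12)
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * Graw, rcond=None)
    return beta[0]                                       # G_hat(s, t) = beta0_hat(s, t)
```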

The estimates of eigenfunctions \(\hat{\phi }_k(t)\) and eigenvalues \(\hat{\lambda }_k\) are the solutions of the eigenequations,

$$\begin{aligned} \int _{\mathcal {T}}\hat{G}(s,t)\hat{\phi }_k(s)ds=\hat{\lambda }_k\hat{\phi }_k(t). \end{aligned}$$

The eigenfunctions can be estimated by discretizing the smoothed covariance.
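Assuming an equally spaced grid, the discretization step can be sketched as follows (illustrative code, not the PACE implementation):

```python
import numpy as np

def eigen_from_covariance(G_hat, t_grid):
    """Eigenvalues and eigenfunctions of the covariance operator obtained by
    discretizing the smoothed surface G_hat on an equally spaced t_grid."""
    dt = t_grid[1] - t_grid[0]
    evals, evecs = np.linalg.eigh((G_hat + G_hat.T) / 2 * dt)   # symmetrize, scale by dt
    order = np.argsort(evals)[::-1]                             # non-increasing eigenvalues
    lam = np.clip(evals[order], 0.0, None)                      # drop negative numerical noise
    phi = evecs[:, order] / np.sqrt(dt)                         # so that sum(phi_k**2) * dt = 1
    return lam, phi                                             # phi[:, k] approximates phi_k on t_grid
```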

Traditionally, when the measurements for each subject are densely sampled, the FPC scores \(\xi _{ik}=\int (X_i(t)-\mu (t))\phi _k(t)dt\) are estimated by numerical integration. For longitudinal data, however, the time points are sparse and vary widely across subjects, so the FPC scores cannot be well approximated by the usual integration method. Since PACE further assumes that, in (2), \(\xi _{ik}\) and \(\epsilon _{ij}\) are jointly Gaussian, the FPC scores for the ith subject can instead be estimated by the best prediction:

$$\begin{aligned} \tilde{\xi }_{ik}=E[\xi _{ik}|\mathbf {Y}_i]=\lambda _k \varvec{\phi }_{ik}^{T}\Sigma _{Y_i}^{-1}(\mathbf {Y_i}-\mu _i), \end{aligned}$$
(3)

where \(\mathbf {Y}_i=(Y_{i1},Y_{i2},\ldots ,Y_{iN_i})^T\), \(\varvec{\phi }_{ik}=(\phi _{k}(T_{i1}),\ldots ,\phi _{k}(T_{iN_i}))^T,\) \(\Sigma _{Y_i}=cov(\mathbf {Y}_i,\mathbf {Y}_i)\), i.e. the \((j,l)\)th entry of \(\Sigma _{Y_i}\) is \((\Sigma _{Y_i})_{j,l}=G(T_{ij},T_{il}) +\sigma ^2\delta _{jl}\) with \(\delta _{jl}=1\) if \(j=l\) and 0 if \(j\ne l\). Substituting the estimates of \(\mu _i, \lambda _k, \varvec{\phi }_{ik}, \Sigma _{Y_i}\) into (3), we obtain the estimated FPC scores:

$$\begin{aligned} \hat{\xi }_{ik}=\hat{E}[\xi _{ik}|\mathbf {Y}_i]=\hat{\lambda }_k \hat{\varvec{\phi }}_{ik}^{T}\hat{\Sigma }_{Y_i}^{-1}(\mathbf {Y}_i-\hat{\mu }_i), \end{aligned}$$
(4)

where the \((j,l)\)th element of \(\hat{\Sigma }_{Y_i}\) is \((\hat{\Sigma }_{Y_i})_{j,l}=\hat{G}(T_{ij},T_{il})+\hat{\sigma }^2\delta _{jl}\).
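A sketch of the conditional-expectation formula (4) for a single subject is given below; it assumes that the estimated mean function, covariance surface, eigenvalues, eigenfunctions and error variance are available as Python callables and arrays from the previous steps (the names are ours).

```python
import numpy as np

def pace_scores(Yi, Ti, mu_fn, G_fn, lam, phi_fns, sigma2):
    """FPC scores for subject i via xi_hat_ik = lam_k * phi_ik' Sigma_Yi^{-1} (Y_i - mu_i)."""
    Yi, Ti = np.asarray(Yi, float), np.asarray(Ti, float)
    Ni = len(Ti)
    Sigma = np.array([[G_fn(s, t) for t in Ti] for s in Ti]) + sigma2 * np.eye(Ni)
    centered = Yi - mu_fn(Ti)                          # Y_i - mu_i
    rhs = np.linalg.solve(Sigma, centered)             # Sigma_Yi^{-1} (Y_i - mu_i)
    scores = [lam[k] * np.array([phi(t) for t in Ti]) @ rhs
              for k, phi in enumerate(phi_fns)]        # one score per retained component
    return np.array(scores)
```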

Assuming the infinite-dimensional process in (1) can be approximated by its projection onto the function space spanned by the first K eigenfunctions, the predicted trajectory \(X_i(t)\) for the ith subject is:

$$\begin{aligned} \hat{X}_i^K(t)=\hat{\mu }(t)+\sum _{k=1}^{K}\hat{\xi }_{ik} \hat{\phi }_{k}(t) \end{aligned}$$
(5)

According to Yao et al. (2005), the number of eigenfunctions K can be selected by cross-validation based on prediction error or by an AIC criterion. In this study we selected K by setting a threshold: we kept the first K eigenfunctions such that they explained more than \(95\%\) of the total variation.
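The threshold rule amounts to taking the smallest K whose leading eigenvalues explain at least \(95\%\) of the total variation; a minimal sketch:

```python
import numpy as np

def choose_K(lam, threshold=0.95):
    """Smallest K such that the first K eigenvalues explain >= threshold of total variation."""
    frac = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(frac, threshold) + 1)

print(choose_K(np.array([5.0, 1.2, 0.2, 0.05])))   # -> 2 (about 96% of the variation)
```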

In summary, to recover a trajectory for each subject from longitudinal observations, we first estimate the mean curve \(\hat{\mu }(t)\) and covariance \(\hat{G}(s,t)\) from the pooled data of all individuals, from which the eigenvalues \(\hat{\lambda }_k\) and eigenfunctions \(\hat{\phi }_k\) can be estimated. The FPC scores are then estimated by conditional expectation using the available observations of each subject, even if only one observation is available. From (5), the FPC scores \(\hat{\xi }_{ik}\) (\(k=1,\ldots ,K\)) characterize each subject \(i=1,\ldots ,n\) and can be used to describe differences between subjects. As a result, the FPC scores can be used for classification or prediction (Müller 2005).

For details of PACE methodology, please refer to Yao et al. (2005). The MATLAB package for “PACE” is available from http://www.stat.ucdavis.edu/PACE/.

2.2 Early prediction and choosing trajectory

Hall and Maiti (2012) introduced a methodology for classifying and predicting a future state using functional data. It provides an approach to determine when an early decision can reasonably be made using only part of the trajectory, and shows how to use the method to choose a better biomarker as a predictor.

The method assumes that there are q types of biomarkers from two classes of the population. Let \(X_{ji}^{[k]}(t)\) be the observed data function of the kth biomarker from the ith subject in population \(\Pi _{j}\), where \(j=1,2\), \(i=1,2,\ldots ,n_j\), \(k=1,2,\ldots ,q\), \(t\in \mathcal {I}\). Without loss of generality, assume \(\mathcal {I} =\left[ 0,1\right]\). First, the dimension of the functional data is reduced by discretizing on a grid, i.e. by confining attention to \(X_{ji}^{[k]}(t_l)\), where \(l=1,2,\ldots ,p\) and p denotes the number of grid points. Second, a classifier based on the resulting p-variate data is constructed (linear discriminant analysis and logistic classification were considered in Hall and Maiti (2012)). The classifier is then applied to each type of biomarker using only a portion of the trajectory, for example on the interval \(\left[ 0,t\right]\) with \(t\in [0,1]\), to predict which class the subject belongs to in the end, i.e. at \(t=1\). The estimated error rates are denoted by \(\hat{err}^{[k]}(t)\). Comparing \(\hat{err}^{[k]}(t)\) over a range of values t of interest (\(t\in [0,1]\)) and \(k=1,\ldots ,q\) gives an idea of when a relatively early prediction can be made and which biomarker is more reliable. The results are examined both numerically and theoretically by checking the consistency of \(\hat{err}^{[k]}(t)\).

In this paper, we fit the model to sparse longitudinal data, unlike Hall and Maiti (2012), who used dense functional data. As a result, instead of reducing dimension by discretizing on a grid, we reduce dimension by PACE, which was introduced in Sect. 2.1. There are two main steps. First, we apply PACE to the data to calculate FPC scores for each subject; to check the appropriate time period for early prediction, following the idea of Hall and Maiti (2012), we use only part of the observations from the longitudinal data in this step. Second, we apply the logistic classification method to the resulting FPC scores for prediction.
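A schematic of this adapted procedure is sketched below; `fit_pace_and_score` is a hypothetical helper wrapping the PACE steps of Sect. 2.1, and a cross-validated misclassification rate stands in for \(\hat{err}^{[k]}(t)\).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def error_curve(subjects, labels, cutoffs, fit_pace_and_score, cv=5):
    """Estimated error rate err_hat(t) when only observations up to cutoff t are used.

    subjects           : list of (times, values) numpy arrays, one pair per subject
    labels             : end-of-follow-up class labels (1 = converter)
    cutoffs            : candidate truncation times t
    fit_pace_and_score : hypothetical helper that runs PACE on the truncated
                         data and returns an (n, K) matrix of FPC scores
    """
    errs = []
    for t in cutoffs:
        truncated = [(Ti[Ti <= t], Yi[Ti <= t]) for Ti, Yi in subjects]
        scores = fit_pace_and_score(truncated)                        # PACE on partial data
        acc = cross_val_score(LogisticRegression(), scores, labels, cv=cv).mean()
        errs.append(1.0 - acc)                                        # err_hat(t)
    return np.array(errs)
```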

In the numerical study, we compared the 1-year, 2-year, and 3-year early prediction results. We also compared the error rates of the different ROIs to determine which one is the most reliable biomarker.

2.3 Prediction with logistic classification

Logistic regression links continuous predictors to a dichotomous decision between two groups, \(G_1\) (MCI-c subjects) and \(G_0\) (MCI-nc subjects). We can therefore model whether a subject belongs to \(G_1\) or \(G_0\) as a binary response.

Suppose \(x_{i1}, x_{i2},\ldots , x_{iq}\) are the predictors for subject i. The logistic regression equation is

$$\begin{aligned} logit(p_i)=\log \frac{p_i}{1-p_i}=\beta _0 + \sum _{k=1}^{q} x_{ik}\beta _k, \quad i=1,2,\ldots ,n \end{aligned}$$

where n is the number of subjects and \(p_i\) is the probability that the ith subject belongs to class \(G_1\). The regression coefficients \(\beta _k\) are usually estimated by maximum likelihood estimation (MLE). Then \(p_i\) is given by

$$\begin{aligned} p_i=\frac{1}{1+e^{-X_i\beta }} \end{aligned}$$
(6)

where \(X_i=(1,x_{i1},x_{i2},\ldots ,x_{iq})\), and \(\beta =(\beta _0,\beta _{1},\beta _{2},\ldots ,\beta _{q})'.\) If \(p_i\) is greater than a certain threshold (usually set to 0.5), subject i is classified into class \(G_1\); otherwise, it is classified into class \(G_0\).

In this paper, we use the trajectory of an ROI’s volume as the predictor (we use the function \(X(t), t \in [0,T]\) to denote the trajectory). As introduced in the previous section, we first use PACE to reduce dimension and denote the predicted trajectory for subject i as

$$\begin{aligned} \hat{X}_{i}(t)=\hat{\mu }(t)+\hat{\xi }_{i1}\hat{\phi }_1(t)+\hat{\xi }_{i2} \hat{\phi }_2(t)+\cdots +\hat{\xi }_{iK}\hat{\phi }_K(t). \end{aligned}$$

Thus we can do logistic regression with FPC scores \(\hat{\xi }_{i1},\ldots ,\hat{\xi }_{iK}\) as predictors. For example, if two FPC scores are chosen, i.e. \(\hbox {K}=2\), the logistic regression is

$$\begin{aligned} logit(p_i)=\beta _0+\beta _1\hat{\xi }_{i1} +\beta _2\hat{\xi }_{i2}, \quad i=1,2,\ldots ,n \end{aligned}$$
(7)

Technically, we can easily include more variables in the logistic regression model. For example, to adjust for relevant covariates, we can include the following clinical features in the logistic regression model: baseline age, baseline MMSE (Mini-Mental State Examination) and APOE (apolipoprotein E) genotype. Then the corresponding model becomes

$$\begin{aligned} logit(p_i)=\beta _0+\beta _1\hat{\xi }_{i1}+\beta _2\hat{\xi }_{i2} +\beta _3 Age_i+\beta _4 APOE_i+ \beta _5 MMSE_i, \ i=1,2,\ldots ,n \end{aligned}$$
(8)

As this part is not our main focus, the numerical study in Sect. 4 only includes the results of the base model (7). Model (8) is referred to as the MRI-plus model, and the corresponding numerical results are listed in the Appendix. Gender is not included in Eq. (8) because it is not significant in differentiating MCI converters from MCI nonconverters (Table 1) in this paper. The statistical inference on the gender coefficient in the logistic regression also demonstrates this insignificance; see the comparisons between models with and without gender (Tables 15 and 16).
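As a hedged illustration of models (7) and (8), the following Python sketch fits both with simulated stand-ins for the FPC scores and clinical covariates; none of these numbers come from the ADNI data, and the variable names are ours.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 400
# Illustrative stand-ins for quantities estimated from the data.
xi1, xi2 = rng.normal(size=n), rng.normal(size=n)           # FPC scores from PACE
age = rng.normal(73, 7, n)                                   # baseline age
apoe = rng.binomial(1, 0.4, n)                               # APOE genotype (coded 0/1 for illustration)
mmse = rng.normal(27, 2, n)                                  # baseline MMSE
eta = -0.5 + 1.0 * xi1 - 0.7 * xi2 + 0.5 * apoe - 0.1 * (mmse - 27)
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))                  # 1 = MCI-c, 0 = MCI-nc

X7 = sm.add_constant(np.column_stack([xi1, xi2]))                      # base model (7)
X8 = sm.add_constant(np.column_stack([xi1, xi2, age, apoe, mmse]))     # MRI-plus model (8)
fit7 = sm.Logit(y, X7).fit(disp=0)
fit8 = sm.Logit(y, X8).fit(disp=0)

p_hat = fit8.predict(X8)                                     # conversion probabilities p_i
pred = (p_hat >= 0.5).astype(int)                            # threshold p_0 = 0.5
print(fit7.params.round(2), fit8.params.round(2))
```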

3 Data description

The longitudinal volumetric variables of H, EC, MTC, FG and WB are considered as potential predictors of conversion from MCI to AD in this paper. Baseline and follow-up volumetric MRI data, along with the corresponding clinical information in the “ADNIMERGE” file, were downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI, http://www.adni-info.org/). The “ADNIMERGE” data merge several of the key variables from various case report forms and biomarker lab summaries across the ADNI protocols (ADNI1, ADNIGO and ADNI2). The volumetric MRI scans in this dataset were acquired on 1.5T or 3T GE, Siemens, or Philips scanners. Regional brain segmentation and volume estimation were carried out with FreeSurfer by the UCSF/SF VA Medical Center (Weiner et al. 2013; Hartig et al. 2014).

From a total of 872 individuals with a baseline diagnosis of MCI who were recruited for ADNI, 66 subjects were excluded due to missing data and 5 subjects were excluded because they reverted from AD to MCI. In the end, 801 MCI subjects were included in this study. They were split into two categories: the MCI subjects who converted to AD during follow-up are labeled MCI-c (mild cognitive impairment converters; \(n=272\)), and those who did not convert to AD during the follow-up period are labeled MCI-nc (mild cognitive impairment nonconverters; \(n=529\)). At baseline, all subjects underwent a comprehensive clinical evaluation, cognitive/functional assessments, and a structural brain MRI scan. Subjects also provided a blood sample for apolipoprotein E (APOE) genotyping and proteomic analysis. The subjects were then followed longitudinally at specific time points (6, 12, 18, 24, 36… months); however, the number of visits and the intervals between visits varied. Table 1 shows the characteristics of the MCI subjects included in this study. Except for gender, the baseline characteristics (baseline age, baseline MMSE and APOE) are significantly different between MCI-c and MCI-nc subjects, with p-values less than 0.05. Across subjects, the duration of MRI follow-up varies from 0 to 9.18 years, and the number of time points varies from 1 to 10 (see Table 2).

Table 1 Subjects characteristics

Before the group data analysis, all regional volumes were normalized by dividing by the intracranial volume (ICV) to correct for individual differences in head size. In the spaghetti plot (Fig. 1), panel (f) shows the longitudinal ICV for each subject. It shows that the ICV for MCI-c and MCI-nc subjects has a similar pattern on average; also, as expected, the ICV is relatively constant over the observed period for each subject. For the other five normalized regional volumes (Fig. 1a–e), on average the MCI-c group has a higher rate of volume atrophy than the MCI-nc group.

Fig. 1

Longitudinal volume of ROIs for MCI subjects. In (a)–(e), the value on the Y axis is the normalized volume, i.e. ROI volume / ICV. In (f), the value on the Y axis is the ICV (\(\hbox {mm}^3\)). The value on the X axis is “disease year”. For MCI converters (MCI-c), year 0 is the time of conversion. For MCI nonconverters (MCI-nc), year 0 is the last observed time point. Thin lines are observations for each subject: blue lines are for MCI-c subjects and red lines are for MCI-nc subjects. Blue and red thick lines are pooled mean curves for the MCI-c group and the MCI-nc group, respectively (Color figure online)

As shown in the spaghetti plot (Fig. 1), the MCI-c subjects were aligned by the “time of conversion”. Specifically, the time point of conversion is defined as year 0, and year \(-\,n\) is defined as the time point observed n years prior to conversion. The MCI-nc subjects were aligned by the last time point of observation, which is defined as year 0; the time point observed n years prior to the last observation is defined as year “\(-\,n\)”, and so forth. In this way, the data were homogenized along the timeline of disease progression. The goal is to predict whether a subject will convert to AD in the end, i.e. at year 0.
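The realignment can be expressed as a small helper (illustrative code, not from the original analysis):

```python
import numpy as np

def to_disease_years(visit_times, conversion_time=None):
    """Map a subject's visit times to 'disease years': year 0 is the time of
    conversion for MCI-c subjects and the last observed visit for MCI-nc subjects."""
    visit_times = np.asarray(visit_times, float)
    anchor = conversion_time if conversion_time is not None else visit_times.max()
    return visit_times - anchor

# A converter scanned 0, 1.1 and 2.9 years after baseline who converted at the
# 2.9-year visit gets disease years -2.9, -1.8 and 0.0.
print(to_disease_years([0.0, 1.1, 2.9], conversion_time=2.9))
```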

The other goal of this paper is to determine a reasonable prediction window, so we need to compare the prediction accuracy across different prediction windows. Here we compare 1-year, 2-year and 3-year early prediction. To make the results of the three prediction windows comparable, we pick the subjects who have observations at all three of the most recent years, year “\(-\,3\)”, year “\(-\,2\)” and year “\(-\,1\)”, as the testing set (labeled as “all 3y” subjects in Tables 2 and 3). There are 127 “all 3y” subjects in total, of which 30 are MCI-c and 97 are MCI-nc. In Table 3, the characteristics of the subjects who have observations at year “\(-\,3\)”, “\(-\,2\)”, and “\(-\,1\)” are listed in the first three columns, and the last column lists the characteristics of the testing set. All subjects are included in the training set, and leave-one-out testing is applied to the subjects in the testing set.

Table 2 Data characteristics
Table 3 Subjects characteristics by year

As described in Sect. 2.1, one big strength of the PACE analysis is that it pools information from all subjects to estimate the curve of a single subject, even if the observations for that subject are sparse and few. From the data description (Tables 1, 2, 3 and Fig. 1), we have a good number of subjects and data points both for the PACE analysis and for prediction. More subjects and data points are available for the shortest prediction window (1-year prediction); we will see in Sect. 4 that the variance of the prediction results is accordingly smaller for 1-year prediction. Please note that both the training data and the testing data are imbalanced: we have more MCI-nc than MCI-c subjects for 3-year, 2-year and 1-year prediction, with MCI-c to MCI-nc ratios of about 1:2 for 1-year prediction and 1:3 for 3-year prediction. Accuracy alone is therefore a misleading metric of prediction performance, owing to the so-called “accuracy paradox” (Chawla et al. 2002): classification with imbalanced data usually leads to low sensitivity or specificity even when the accuracy is high. To overcome this issue, we applied SMOTE (Synthetic Minority Over-sampling Technique) from Chawla et al. (2002) in the prediction. SMOTE oversamples the minority class by generating synthetic data and is a powerful tool for classification with imbalanced data.
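The paper uses SMOTE as described in Chawla et al. (2002); one commonly used implementation is in the imbalanced-learn package, which the following sketch assumes, together with simulated stand-in features (roughly one MCI-c per three MCI-nc subjects):

```python
import numpy as np
from imblearn.over_sampling import SMOTE          # imbalanced-learn implementation of SMOTE
from sklearn.linear_model import LogisticRegression

# Illustrative imbalanced training set: label 1 = MCI-c, label 0 = MCI-nc.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (300, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.array([0] * 300 + [1] * 100)

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # synthetic minority samples added
clf = LogisticRegression().fit(X_res, y_res)
print(np.bincount(y), "->", np.bincount(y_res))           # [300 100] -> [300 300]
```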

4 Numerical results

This section presents the numerical results and conclusions. We first take the hippocampus as an example to illustrate the details of the prediction procedure using a single ROI’s volume in Sect. 4.1. We then list all prediction results using a single ROI in Sect. 4.2. Finally, we investigate whether any combination of the ROIs improves the prediction performance in Sect. 4.3. There are 26 different combinations of the 5 ROIs, and the prediction results from all 26 combination models are listed.

4.1 Prediction using hippocampus

In this section, we take the volume of the hippocampus as an example to describe how to use a single ROI for prediction. We start with the 1-year early prediction.

The first step is the PACE analysis of the partially observed data and the calculation of the FPC scores for each subject. As outlined in Sect. 3, the longitudinal data were realigned according to disease status (Fig. 1). We want to predict whether or not a subject will convert to AD at year 0; 1-year prediction means using all the data observed prior to year “\(-\,1\)” to predict the state at year 0. Since the data points were irregularly collected, we use the data with year \(t\in [-9.18,-0.5)\) for 1-year prediction (9.18 years is the length of the longest trajectory over all subjects and ROIs, see Fig. 1). Figure 2 shows the PACE analysis results for 1-year early prediction using the hippocampus. Figure 2a, b show the estimates of the mean curve \(\mu (t)\) and covariance surface \(G(s,t)\) for \(s,t \in [-\,9.18,-\,0.5)\), which were introduced in Sect. 2.1. We select the number of eigenfunctions used in (5) by setting the threshold to \(95\%\); panel (d) shows that the first two principal components explained \(98.9\%\) of the variance. Thus only the first two eigenfunctions and the corresponding FPC scores \(\hat{\xi }_{i1}\) and \(\hat{\xi }_{i2}\) of each subject i are employed for prediction. Panel (c) shows the estimates of the first two eigenfunctions \(\phi _1(t)\) and \(\phi _2(t)\) for \(t \in [-\,9.18,-\,0.5)\). The FPC scores are calculated by (4) and then used in the logistic regression model (7). Figure 3 shows the scatter plot of the first two FPC scores for all the training subjects.

Fig. 2

PACE analysis using hippocampus for 1-year prediction. (a)–(c) show the estimates of the mean function \({\mu }(t)\), the covariance surface \(G(s,t)\) and the first two eigenfunctions \(\phi _1(t)\) and \(\phi _2(t)\). (d) shows that the first two eigenfunctions explained \(98.907\%\) of the total variance

Fig. 3

Second versus first FPC scores for hippocampus (1-year prediction). The triangles indicate MCI-nc and the crosses indicate MCI-c

Table 4 provides the coefficient estimates obtained from the logistic regression of the group indicator (“MCI-c” \(=1\), “MCI-nc” \(=0\)) with model (7). It shows that both of the first two FPC scores, \(\hat{\xi }_1\) and \(\hat{\xi }_2\), are significantly associated with the odds of being MCI-c. The resulting coefficient estimates from the logistic model (7) are plugged into (6) to calculate the conversion probability. We let \(p_i\) be the probability of conversion to AD for subject i and let \(p_0\) be a threshold. If \(p_i\ge p_0\), subject i is classified as MCI-c; otherwise, it is classified as MCI-nc. With \(p_0=0.5\), the sensitivity, specificity, and accuracy for 1-year early prediction are 0.83, 0.71 and 0.74, respectively. By varying the threshold \(p_0\) from 0 to 1, we obtain an ROC (receiver operating characteristic) curve with an AUC (area under the curve) of 0.82.
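For reference, the threshold rule and the threshold-sweep AUC can be computed as in the following sketch (assuming scikit-learn; the function name is ours):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def summarize(y_true, p_hat, p0=0.5):
    """Sensitivity, specificity and accuracy at threshold p0, plus the AUC
    obtained by sweeping the threshold from 0 to 1."""
    y_pred = (np.asarray(p_hat) >= p0).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sens = tp / (tp + fn)                      # true positive rate (MCI-c detected)
    spec = tn / (tn + fp)                      # true negative rate (MCI-nc detected)
    acc = (tp + tn) / len(y_true)
    return sens, spec, acc, roc_auc_score(y_true, p_hat)
```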

Table 4 Parameter estimation (MRI model of hippocampus for 1y prediction)

We then apply the same procedure to 2-year and 3-year early prediction with the hippocampus volume. With the threshold \(p_0\) set to 0.5, the corresponding sensitivity, specificity, and accuracy are summarized in Table 5. Varying the threshold \(p_0\) from 0 to 1, the ROC curves are shown in Fig. 4. From the ROC curves, as expected, 1-year early prediction performs best, but its improvement over 2-year and 3-year early prediction is marginal. In other words, 3-year early prediction is also usable.

Table 5 Prediction performance (hippocampus)
Fig. 4

ROC curves for prediction using the hippocampus. Solid line, dotted line and dot-dashed line are the ROC curves for 1-year, 2-year and 3-year prediction respectively

4.2 Prediction results using single ROI

In Sect. 4.1, we took the hippocampus as an example to illustrate the procedure of 1-year, 2-year and 3-year early prediction using a single ROI; the results shown there are based on one training data set. In this section, we apply the same procedure to all five ROIs: H, WB, EC, FG and MTC. To check the robustness of the prediction performance, we repeat the procedure on 100 samples of training data. Specifically, we randomly sample 2/3 of the training data (2/3 of the 272 MCI-c subjects and 2/3 of the 529 MCI-nc subjects) 100 times and repeat the prediction procedure of Sect. 4.1 on each sample. The mean and standard deviation of sensitivity, specificity, and accuracy are calculated with the classification probability threshold \(p_0\) set to 0.5. Varying \(p_0\) from 0 to 1, the mean and standard deviation of the AUCs of the resulting ROC curves are also calculated. The results are shown in Table 6.
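The stratified 2/3 subsampling can be sketched as follows; refitting PACE and the logistic model on each subsample, and averaging the resulting metrics, is omitted here:

```python
import numpy as np

def stratified_two_thirds(n_c, n_nc, rng):
    """Indices of a random 2/3 subsample drawn within each class, preserving
    the MCI-c / MCI-nc imbalance of the full training set (MCI-c indexed first)."""
    idx_c = rng.choice(n_c, size=(2 * n_c) // 3, replace=False)
    idx_nc = n_c + rng.choice(n_nc, size=(2 * n_nc) // 3, replace=False)
    return np.concatenate([idx_c, idx_nc])

rng = np.random.default_rng(0)
subsamples = [stratified_two_thirds(272, 529, rng) for _ in range(100)]
print(len(subsamples), subsamples[0].size)     # 100 subsamples of 181 + 352 = 533 subjects
```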

From Table 6, all the volumetric MRI brain ROIs have predictive power as early as 3 years in advance. For 1-year, 2-year and 3-year prediction in all the single-ROI models, the overall sensitivity is around \(80\%\), specificity is above \(70\%\), accuracy is around \(75\%\) and AUC is above \(80\%\).

Table 6 Prediction performance using single ROI

4.3 Prediction results using combinations of ROIs

We then check the prediction performance of combinations of ROIs. Combination prediction is realized by plugging the combined ROIs’ FPC scores into the logistic regression model. For example, to predict using the combination of H and EC, we first calculate the FPC scores of H (\(\hat{\xi }_{i1}, \hat{\xi }_{i2}\)) as well as the FPC scores of EC (\(\hat{\eta }_{i1}, \hat{\eta }_{i2}\)) for each subject i (suppose the number of eigenfunctions selected for both H and EC is 2). Then the logistic regression model for prediction using the combination of H and EC is:

$$\begin{aligned} logit(p_i)=\alpha +\beta _1\hat{\xi }_{i1}+\beta _2\hat{\xi }_{i2} +\beta _3\hat{\eta }_{i1}+\beta _4\hat{\eta }_{i2}, \quad i=1,2,\ldots ,n \end{aligned}$$
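A minimal sketch of fitting such a combination model, with simulated stand-ins for the FPC scores of H and EC (illustrative only; none of these values come from the data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
scores_H = rng.normal(size=(801, 2))     # (xi_i1, xi_i2): FPC scores of H (illustrative)
scores_EC = rng.normal(size=(801, 2))    # (eta_i1, eta_i2): FPC scores of EC (illustrative)
y = rng.binomial(1, 0.34, 801)           # 1 = MCI-c, 0 = MCI-nc (illustrative labels)

X = np.hstack([scores_H, scores_EC])     # combining ROIs = concatenating their FPC scores
p_hat = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
print(p_hat[:5].round(3))
```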

There are 26 different combinations in total. Table 7 shows the results of combinations of two ROIs. Table 8 shows the results of combinations of three ROIs. Table 9 shows the results of combinations of more than three ROIs.

Table 7 Prediction performance using combination of two ROIs
Table 8 Prediction performance using combinations of three ROIs
Table 9 Prediction performance using combination of four or more ROIs

4.4 Summary

The prediction performances of the 31 different combinations of the ROIs (5 single-variable models and 26 combination models) are plotted in Fig. 5. Tables 6, 7, 8, 9 and Fig. 5 can be summarized as follows.

Fig. 5

Prediction performance summary using single and combinations of volumetric MRI biomarkers: a 3-year prediction, b 2-year prediction and c 1-year prediction. In each panel, every point represents a prediction result using a different combination of biomarkers (from left to right labeled as 1–31): hippocampus (H), whole brain (WB), entorhinal cortex (EC), fusiform gyrus (FG), middle temporal cortex (MTC), H + WB, H + EC, H + FG, H + MTC, WB + EC, WB + FG, WB + MTC, EC + FG, EC + MTC, FG + MTC, H + WB + EC, H + WB + FG, H + WB + MTC, H + EC + FG, H + EC + MTC, H + FG + MTC, WB + EC + FG, WB + EC + MTC, WB + FG + MTC, EC + FG + MTC, H + WB + EC + FG, H + WB + EC + MTC, H + WB + FG + MTC, H + EC + FG + MTC, WB + EC + FG + MTC, H + WB + EC + FG + MTC. The number labels of the ROI combinations are listed in Table 10

First, the longitudinal volumes of the listed brain ROIs measured by MRI have predictive power as early as 3 years in advance. In 1-year, 2-year, and 3-year prediction for all the models, the overall accuracy is above \(70\%\) with AUC above 0.8. The sensitivity varies across prediction windows and models from \(70\) to \(90\%\), with specificity from \(67\) to \(77\%\).

Second, short-term prediction performs better, i.e. 1-year early prediction performs best compared with 2-year and 3-year prediction. This is consistent with intuition, because more information is included and less noise is introduced in the short-term prediction procedure. Nevertheless, the error curves over time are clearly helpful for planning early interventions.

To compare the models using different combinations of ROIs, we use AUC as the metric to evaluate the performance of the MRI biomarkers in the following paragraphs. Among the five single-ROI models, EC, H and MTC perform best with respect to AUC (0.80 to 0.84). For prediction using combinations of two or more MRI biomarkers, EC + MTC and H + EC + MTC are the best combinations, with the highest AUC (0.85 to 0.86). Adding WB and FG to these best combinations does not improve the performance. On the other hand, the combinations without EC and the hippocampus, two of the best single biomarkers (WB + FG, WB + FG + MTC, see Tables 7 and 8), perform relatively worse than the other combinations. This observation implies that the prediction power of the combination models comes largely from the prediction power of EC and the hippocampus.

Table 10 Number labels of MRI regions combinations

5 Discussion and conclusion

In this study we use sparsely observed volumes of ROIs quantified by longitudinal volumetric MRI to predict whether or not an MCI subject will convert to AD in a specified time window. A longitudinal data prediction method based on functional data analysis is developed for early prediction over varying time windows, as well as for identifying the most important ROI(s) in the process.

Because the number of time points and the intervals between them vary among the ADNI subjects, traditional prediction methods are not applicable. To resolve this data complexity issue, we developed a prediction method based on functional data analysis: we first use PACE to extract statistically validated information from the longitudinal data, and the resulting FPC scores are then used for prediction. This method is innovative for analyzing ADNI data.

Our best prediction model (H + EC + MTC, with 1-year, 2-year and 3-year prediction AUCs of 0.86, 0.85, and 0.82, respectively, and a 1-year prediction accuracy of \(77\%\), sensitivity of \(91\%\), and specificity of \(73\%\)) performs favorably compared with the existing literature on prediction using the same longitudinal data, even though the comparison is not exact since those studies used different biomarkers and different prediction windows. In fact, our model uses fewer predictors (longitudinal ROI volumes and three clinical features) and a longer prediction window, suggesting that it uses less information yet derives comparable prediction results. Since the primary goal of this paper is to introduce a statistically validated methodology for prediction using longitudinal MRI data, we do not attempt to elaborate the results by including more predictors. However, one could continue experimenting with other available biomarkers using the techniques adapted here.

In addition, due to the irregularity and sparsity of longitudinal data, most other studies make predictions for only one fixed time window. Our proposed method allows flexible prediction windows, which enables us to compare the prediction performance across windows and helps us understand disease risk over time.

Moreover, most of the methods employed in the literature are machine learning techniques, which are less amenable to statistical validation and interpretation. Our model is based on two well-established statistical methods, which are easy to validate and interpret.

One limitation of the proposed model is the assumption required by the PACE method: the measurement errors \(\epsilon _{ij}\) and FPC scores \(\xi _{ik}\) are assumed to be jointly Gaussian. This assumption is not always valid. However, existing simulation studies indicate that the method is, to some extent, robust to violations of the Gaussian assumption (Yao et al. 2005).

In conclusion, the procedure proposed in this paper provides a method to predict conversion from MCI to AD using longitudinal data. The longitudinal prediction curve can be utilized to understand disease risk over time. The key finding is that the AUC of 1-year prediction is not much different from that of 3-year prediction; in other words, prediction can be made as early as 3 years in advance. In addition to AD prognosis, the method can easily be adapted and applied in other biomedical research to choose a better prediction window and identify biomarker(s), since longitudinal studies are a common design for assessing human health and disease.