1 Introduction

Alzheimer’s disease (AD) is an incurable neurodegenerative disease. Typical clinical symptoms include memory loss, disorientation, and language and behavioral problems. The progression of AD is not reversible; however, treatments are available that modify the disease's effects in its early stage [1]. Early diagnosis or prognosis of AD is therefore of high value in clinical practice, since it leaves more time for treatment and thus improves the quality of life of both patients and their caregivers.

AD introduces both structural and functional loss that is known to have dynamically evolving morphological patterns [13, 1215]. In the past decade, longitudinal studies have been actively pursued for AD diagnosis, with special attention to mild cognitive impairment (MCI) [1, 4], an intermediate stage between normal control (NC) and AD. For example, tensor-based morphometry is used in [4] to reveal brain atrophy patterns from 91 probable AD patients and 189 MCI subjects scanned at baseline and after 6, 12, 18, and 24 months. Moreover, the trend of longitudinal cortical thickness is used as the morphological pattern in [5] to identify subjects who eventually convert to AD. However, current longitudinal AD diagnosis methods place very strong restrictions on the longitudinal image sequence. For example, each subject recruited in [5] must have at least 5 time points acquired at six-month intervals and must develop AD at least 12 months after the baseline scan. For convenience, many longitudinal approaches assume, albeit implicitly, that every subject has the same number of scans. In a real clinical setting, however, not all patients have a large or an equal number of imaging scans.

In order to accurately measure the tiny structural changes over time, current state-of-the-art computer-assisted diagnosis methods have to wait until the patient has a sufficient number of longitudinal scans. More critically, the prediction is short term, e.g., only 6 months before the real onset of AD in [5]. Although promising results have been achieved in predicting whether a subject has progressed to AD or stays in the MCI stage, the limitation to short-term prediction substantially hampers deployment in clinical practice.

In light of this, we propose a flexible solution for early detection of AD that sequentially and consistently recognizes abnormal patterns of structural change from a longitudinal MR image sequence. First, we present a novel temporally structured SVM (TS-SVM), which is trained on a set of partial image sequences cut from the complete longitudinal data. Compared to the conventional SVM, our TS-SVM has two major improvements that achieve early alarm and high accuracy in detecting AD progression: (1) Temporal consistency. We enforce a monotonic constraint to avoid inconsistent detection results over time. Since convergent evidence suggests that AD progression is non-reversible [6, 7], we require that the risk of AD progression increase monotonically within each subject as more time points are inspected. (2) Early detection. We employ sequential recognition to achieve the best balance between early alarm and detection accuracy. In the training stage, we make the classification margin adaptive to the length of the partial image sequence. Given the longitudinal image sequence of a new subject with an arbitrary number of scans, we sequentially examine the longitudinal imaging patterns starting from baseline and raise an alarm for AD conversion as soon as an abnormal change is detected with high confidence. Thus, our AD early detection method places no requirement on the number of scans. Second, we present a joint feature selection and classification framework, so that the selected features are optimal for the learned support vector machine. We have evaluated the performance of AD early detection on more than 150 longitudinal subjects from the ADNI dataset. Our method achieved promising results, alarming AD onset 12 months prior to the clinical diagnosis with at least 82.5 % accuracy.

2 Methods

2.1 Temporally Structured SVM for Early Detection of AD

The goal of our method is to accurately predict AD conversion as early as possible by longitudinally tracking structural changes. Since magnetic resonance (MR) imaging is non-invasive and widely used in clinical practice, we build a novel temporally structured SVM on longitudinal MR image sequences.

Morphological Features.

Suppose we have \( N \) training subjects, and each subject \( S_{n} \) has an MR image sequence \( \varvec{I}^{n} = \{ I_{t}^{n} |t = 1, \ldots ,T_{n} \} \) (\( n = 1, \ldots ,N \)) with \( T_{n} \) longitudinal scans. For each volumetric image \( I_{t}^{n} \), we first register the template image (http://qnl.bu.edu/obart/explore/AAL/), which carries 90 manually labeled ROIs (regions of interest), to the underlying image \( I_{t}^{n} \) using the HAMMER registration tool. We then extract seven morphological features in each ROI: the tissue percentiles (volumetric percentiles of the ROI volume) of white matter (WM), gray matter (GM), cerebral-spinal fluid (CSF), and background, and the averaged voxel-wise Jacobian determinant in the WM, GM, and CSF regions. Therefore, the image feature \( {\mathbf{f}}_{t}^{n} \) for each volumetric image \( I_{t}^{n} \) is a feature vector of dimension \( 90 \times 7 = 630 \).

Decomposition to Partial Image Sequences.

We decompose the complete longitudinal image sequence \( \varvec{I}^{n} \) into \( \left( {T_{n} - 1} \right) \) partial image sequences \( \varvec{P}^{n} = \{ \varvec{P}^{n} (b)|b = 2, \ldots ,T_{n} \} \), where each \( \varvec{P}^{n} \left( b \right) = \{ I_{t}^{n} |t = 1, \ldots ,b\} \) is the partial image sequence with \( b \) time points, from baseline to the \( \left( {b - 1} \right) \)-th follow-up. For each \( \varvec{P}^{n} (b) \), we further extract a longitudinal feature representation and form a column vector \( {\mathbf{h}}\left( {b,n} \right) = \left[ {\frac{1}{b}\sum\nolimits_{t = 1}^{b} {{\mathbf{f}}_{t}^{n} } ,\;\left( {{\mathbf{f}}_{1}^{n} - {\mathbf{f}}_{b}^{n} } \right)} \right]^{'} \), where the first half of the elements is the average of the morphological features from baseline to the last time point and the second half measures the longitudinal difference of the morphological features between baseline and the last follow-up. Each feature representation \( {\mathbf{h}}(b,n) \) thus describes both spatial and temporal morphological patterns. As we will explain in Sect. 2.2, feature selection is necessary to remove data redundancy from such a high-dimensional representation (\( d = 2 \times 630 = 1,260 \)).
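The decomposition above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `partial_sequence_features` and the toy numbers are assumptions, and plain Python lists are used to keep the sketch dependency-free.

```python
def partial_sequence_features(scan_features):
    """Return {b: h(b)} for b = 2..T, given T per-scan feature vectors f_1..f_T.

    h(b) concatenates the mean of f_1..f_b with the difference f_1 - f_b.
    """
    T = len(scan_features)
    baseline = scan_features[0]
    running_sum = list(baseline)  # holds f_1 + ... + f_b as b grows
    features = {}
    for b in range(2, T + 1):
        f_b = scan_features[b - 1]
        running_sum = [s + v for s, v in zip(running_sum, f_b)]
        mean_part = [s / b for s in running_sum]
        diff_part = [f1 - fb for f1, fb in zip(baseline, f_b)]
        features[b] = mean_part + diff_part
    return features


# Hypothetical subject with T = 3 scans and 2 features per scan
# (the paper uses 630 features per scan, giving d = 1,260).
toy_scans = [[1.0, 4.0], [2.0, 4.0], [3.0, 1.0]]
h = partial_sequence_features(toy_scans)
print(h[2])  # [1.5, 4.0, -1.0, 0.0]
print(h[3])  # [2.0, 3.0, -2.0, 3.0]
```

A subject with \( T_{n} \) scans therefore contributes \( T_{n} - 1 \) training samples, one per partial sequence length.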

Naive Way to Achieve Early Detection by Classic SVM.

In our application, the goal of classification is to determine (1) whether we can detect the conversion to AD for a new testing subject based on its MR image sequence \( \varvec{Z} = \{ Z_{t} |t = 1, \ldots ,T_{z} \} \) up to the current time point \( T_{z} \); and (2) whether we can detect the AD onset as early as possible, i.e., push \( T_{z} \) as close to baseline as possible. Thus, we regard the early detection of AD as a binary classification problem between MCI non-converters (MCI-NC for short) and MCI converters (MCI-C for short). Without loss of generality, we assume the first \( M \) subjects belong to the MCI-NC group and the remaining subjects belong to the MCI-C group. Therefore, we divide all partial image sequences used for training into two groups: the MCI-NC group \( {\mathbf{X}} = \left\{ {{\mathbf{x}}_{b,p} |{\mathbf{x}}_{b,p} = {\mathbf{h}}\left( {b,p} \right),p = 1, \ldots ,M,b = 2, \ldots ,T_{p} } \right\} \) and the MCI-C group \( {\mathbf{Y}} = \left\{ {{\mathbf{y}}_{b,q} |{\mathbf{y}}_{b,q} = {\mathbf{h}}\left( {b,q} \right),q = M + 1, \ldots ,N,b = 2, \ldots ,T_{q} } \right\} \). To achieve the above goal, the naïve way is to train an SVM by:

$$ {\text{arg min}}_{{\mathbf{W}}} \left\| {\mathbf{W}} \right\|_{F}^{2} + \lambda\varvec{\varepsilon}^{2} ,\;s.t.\left\{ {\begin{array}{*{20}c} {\delta_{{\mathbf{x}}} - \left( {{\mathbf{w}}_{{\mathbf{x}}}^{T} - {\mathbf{w}}_{{\mathbf{y}}}^{T} } \right){\mathbf{x}}_{b,p} < \varepsilon ,\;\varepsilon > 0,\;\forall {\mathbf{x}}_{b,p} \in {\mathbf{X}}} \\ {\delta_{{\mathbf{y}}} - \left( {{\mathbf{w}}_{{\mathbf{y}}}^{T} - {\mathbf{w}}_{{\mathbf{x}}}^{T} } \right){\mathbf{y}}_{b,q} < \varepsilon ,\;\varepsilon > 0,\;\forall {\mathbf{y}}_{b,q} \in {\mathbf{Y}}} \\ \end{array} } \right., $$
(1)

where \( {\mathbf{W}} = [{\mathbf{w}}_{{\mathbf{x}}} \;{\mathbf{w}}_{{\mathbf{y}}} ] \) is a matrix consisting of the classifier \( {\mathbf{w}}_{{\mathbf{x}}} \in {\mathbb{R}}^{d \times 1} \) for the MCI-NC group and the classifier \( {\mathbf{w}}_{{\mathbf{y}}} \in {\mathbb{R}}^{d \times 1} \) for the MCI-C group. The intuition behind the constraints is that the probability score \( \left( {{\mathbf{w}}_{{\mathbf{x}}}^{T} {\mathbf{x}}_{b,p} } \right) \) of each MCI-NC sample \( {\mathbf{x}}_{b,p} \) staying in the MCI-NC group should exceed its score \( \left( {{\mathbf{w}}_{{\mathbf{y}}}^{T} {\mathbf{x}}_{b,p} } \right) \) for jumping to the MCI-C group by an inter-class margin \( \delta_{{\mathbf{x}}} \). The same principle applies to each sample \( {\mathbf{y}}_{b,q} \) from the MCI-C group with margin \( \delta_{{\mathbf{y}}} \). \( \varepsilon \) is the slack variable, which compensates for misclassification errors.

It is clear that there are strong structural correlations among the partial image sequences of each subject. However, the naïve SVM solution in Eq. (1) treats each partial sequence separately. As shown on the left of Fig. 1, the resulting probability scores of converting to AD and of staying in the MCI stage are not stable over time, which is unrealistic since structural change and AD progression are normally regarded as non-reversible.

Fig. 1. Advantages of our TS-SVM (right) over the naïve SVM solution (left). In our TS-SVM method, we enforce the temporal monotony and consistency constraints on the extracted partial image sequences (shown in the middle).

Temporally Structured SVM on Longitudinal MR Image Sequences.

To improve the accuracy of early AD detection, we propose the temporally structured SVM as:

$$ \begin{gathered} {\text{arg min}}_{{\mathbf{W}}} \left\| {\mathbf{W}} \right\|_{F}^{2} + \lambda\varvec{\varepsilon}^{2} ,\;s.t.\;C_{1} :\left\{ {\begin{array}{*{20}c} {\delta_{{\mathbf{x}}} (b) - \left( {{\mathbf{w}}_{{\mathbf{x}}}^{T} - {\mathbf{w}}_{{\mathbf{y}}}^{T} } \right){\mathbf{x}}_{b,p} < \varepsilon ,\;\varepsilon > 0,\;\forall {\mathbf{x}}_{b,p} \in {\mathbf{X}}} \\ {\delta_{{\mathbf{y}}} (b) - \left( {{\mathbf{w}}_{{\mathbf{y}}}^{T} - {\mathbf{w}}_{{\mathbf{x}}}^{T} } \right){\mathbf{y}}_{b,q} < \varepsilon ,\;\varepsilon > 0,\;\forall {\mathbf{y}}_{b,q} \in {\mathbf{Y}}} \\ \end{array} } \right.,\;{\text{and}} \hfill \\ C_{2} :\tau_{{\mathbf{y}}} (l) - {\mathbf{w}}_{{\mathbf{y}}}^{T} ({\mathbf{y}}_{b,q} - {\mathbf{y}}_{a,q} ) < \varepsilon ,\;l = b - a,\;\varepsilon > 0,\;2 \le a < b,\;\forall {\mathbf{y}}_{a,q} ,{\mathbf{y}}_{b,q} \in {\mathbf{Y}}. \hfill \\ \end{gathered} $$
(2)

Compared to the objective function of the naïve SVM in Eq. (1), two new constraints (\( C_{1} \) and \( C_{2} \)) are used. (1) We first turn the inter-class margins \( \delta_{{\mathbf{x}}} \) and \( \delta_{{\mathbf{y}}} \) in Eq. (1) from scalar values into monotonically increasing functions of \( b \) (the length of the partial image sequence). The constraint \( C_{1} \) is mainly used to achieve early detection, i.e., we require that the probability of making an accurate classification increase as more time points become available. (2) The second constraint \( C_{2} \) takes advantage of the non-reversible nature of AD progression. Suppose \( {\mathbf{y}}_{a,q} \) and \( {\mathbf{y}}_{b,q} \) are the morphological features of the same MCI-C subject, with \( {\mathbf{y}}_{b,q} \) extracted at a later time point than \( {\mathbf{y}}_{a,q} \) (i.e., \( a < b \)). Then we require that the probability of the underlying MCI-C subject converting to AD be higher at the later time point \( b \) than at the earlier time point \( a \), i.e., \( {\mathbf{w}}_{{\mathbf{y}}}^{T} {\mathbf{y}}_{b,q} > {\mathbf{w}}_{{\mathbf{y}}}^{T} {\mathbf{y}}_{a,q} \), since AD conversion is irreversible. Furthermore, the intra-class margin \( \tau_{{\mathbf{y}}} \) is a monotonically increasing function of \( l \), where \( l = b - a \) is the length difference between the two partial image sequences. Intuitively, the bigger the gap between two time points, the larger the increase in AD conversion risk. It is worth noting that the constraint \( C_{2} \) is not applied to MCI-NC subjects, since an MCI-NC subject might still convert to AD as more follow-ups are scanned in the future; it is therefore unreasonable to assume that an MCI-NC subject will stay in the MCI stage. As shown on the right of Fig. 1, for a particular MCI-C subject, both the probability score of AD conversion and the difference between the probability scores of converting to AD and staying in MCI increase monotonically as the partial image sequence becomes longer. Thus, our TS-SVM can detect AD onset at an early stage with high confidence. We set \( \delta_{{\mathbf{x}}} \left( b \right) = b \), \( \delta_{{\mathbf{y}}} \left( b \right) = b \), and \( \tau_{{\mathbf{y}}} \left( l \right) = l \) in all experiments.
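To make the two constraints concrete, the following sketch evaluates how strongly a given pair of classifiers violates them. It is an illustration under stated assumptions rather than the authors' code: violations are accumulated with the quadratic hinge that Sect. 2.3 uses for the unconstrained form, the margins are the experimental choice \( \delta_{{\mathbf{x}}}(b) = \delta_{{\mathbf{y}}}(b) = b \) and \( \tau_{{\mathbf{y}}}(l) = l \), and the function names and the one-dimensional toy features are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sq_hinge(value):
    """Quadratic hinge: zero when the constraint holds, squared violation otherwise."""
    return max(0.0, value) ** 2


def ts_svm_penalty(w_x, w_y, nc_subjects, c_subjects):
    """Sum of squared violations of C1 and C2.

    nc_subjects / c_subjects: one dict {b: h(b)} per subject, with b >= 2.
    Margins: delta_x(b) = delta_y(b) = b and tau_y(l) = l.
    """
    total = 0.0
    for subject in nc_subjects:              # C1, MCI-NC samples
        for b, x in subject.items():
            total += sq_hinge(b - (dot(w_x, x) - dot(w_y, x)))
    for subject in c_subjects:
        for b, y in subject.items():         # C1, MCI-C samples
            total += sq_hinge(b - (dot(w_y, y) - dot(w_x, y)))
        lengths = sorted(subject)
        for i, a in enumerate(lengths):      # C2, pairs a < b within one subject
            for b in lengths[i + 1:]:
                gain = dot(w_y, subject[b]) - dot(w_y, subject[a])
                total += sq_hinge((b - a) - gain)
    return total


w_x, w_y = [0.0], [1.0]
monotone = {2: [2.0], 3: [3.5]}      # conversion score rises with sequence length
non_monotone = {2: [3.0], 3: [2.5]}  # conversion score drops at the later scan
print(ts_svm_penalty(w_x, w_y, [], [monotone]))      # 0.0
print(ts_svm_penalty(w_x, w_y, [], [non_monotone]))  # 2.5
```

In the second case the penalty of 2.5 is dominated by the \( C_{2} \) term (2.25), which is exactly the within-subject inconsistency that the naïve SVM of Eq. (1) cannot see.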

2.2 Joint Feature Selection and Classification on TS-SVM

Since the morphological features are high-dimensional, feature selection is a standard procedure for removing data redundancy. Usually, feature selection is applied independently, prior to training the classifiers. So that the selected features are also optimal for TS-SVM, we propose to jointly select the best features and train the classifiers by introducing the \( L_{2,1} \) norm on the classification matrix \( {\mathbf{W}} \):

$$ {\text{arg min}}_{{\mathbf{W}}} \left\| {\mathbf{W}} \right\|_{2,1} + \lambda\varvec{\varepsilon}^{2} ,\;s.t.\,C_{1} \,{\text{and}}\,C_{2}. $$
(3)

The intuitions behind using \( \left\| {\mathbf{W}} \right\|_{2,1} \) are twofold: (1) a sparsity constraint on each column of \( {\mathbf{W}} \), so that only a small number of features are selected, which helps suppress noisy and redundant patterns; and (2) a group-wise constraint on each row of \( {\mathbf{W}} \), so that the MCI-NC and MCI-C classifiers select or discard the same morphological features. In this way, \( {\mathbf{W}} \) can be regarded simultaneously as a coefficient matrix for feature selection and as a classifier.
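As a small illustration of the row-wise grouping, the sketch below computes the \( L_{2,1} \) norm of a toy \( d \times 2 \) matrix (rows are features, columns are \( {\mathbf{w}}_{{\mathbf{x}}} \) and \( {\mathbf{w}}_{{\mathbf{y}}} \)) and lists the features whose rows are non-zero. The helper names, the tolerance `tol`, and the numbers are assumptions made for the example.

```python
import math


def row_norm(row):
    return math.sqrt(sum(v * v for v in row))


def l21_norm(W):
    """Sum over rows (features) of the Euclidean norm of each row."""
    return sum(row_norm(row) for row in W)


def selected_features(W, tol=1e-8):
    """Indices of features whose row of W is not numerically zero."""
    return [i for i, row in enumerate(W) if row_norm(row) > tol]


W = [[3.0, 4.0],    # feature 0: used by both classifiers
     [0.0, 0.0],    # feature 1: discarded by both classifiers
     [0.0, -2.0]]   # feature 2: kept because its row is non-zero
print(l21_norm(W))           # 7.0
print(selected_features(W))  # [0, 2]
```

Because the penalty acts on whole rows, a feature is dropped only when both classifiers assign it zero weight.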

2.3 Optimization

Although Eq. (3) is a convex problem, it is hard to optimize directly because of the large number of linear inequality constraints. To solve it efficiently, we reformulate it as an unconstrained problem following the framework of the Alternating Direction Method of Multipliers (ADMM) [8, 9, 16]. Specifically, we rewrite Eq. (3) as an unconstrained convex optimization problem by introducing a dummy variable \( {\mathbf{Z}} \) that decouples the group-sparsity term from the inequality constraints:

$$ \begin{aligned} {\text{arg min}}_{{{\mathbf{W}},{\mathbf{Z}}}} \left\| {\mathbf{Z}} \right\|_{2,1} + \lambda \left[ {\mathop \sum \limits_{{{\mathbf{x}}_{b,p} \in {\mathbf{X}}}} \left\| {\delta_{{\mathbf{x}}} \left( b \right) - \left( {{\mathbf{w}}_{{\mathbf{x}}}^{T} - {\mathbf{w}}_{{\mathbf{y}}}^{T} } \right){\mathbf{x}}_{b,p} } \right\|_{h} + \mathop \sum \limits_{{{\mathbf{y}}_{b,q} \in {\mathbf{Y}}}} \left\| {\delta_{{\mathbf{y}}} \left( b \right) - \left( {{\mathbf{w}}_{{\mathbf{y}}}^{T} - {\mathbf{w}}_{{\mathbf{x}}}^{T} } \right){\mathbf{y}}_{b,q} } \right\|_{h} + } \right. \hfill \\ \left. {\mathop \sum \limits_{{{\mathbf{y}}_{a,q} \in {\mathbf{Y}}, {\mathbf{y}}_{b,q} \in {\mathbf{Y}}}} \left\| {\tau_{{\mathbf{y}}} \left( {b - a} \right) - {\mathbf{w}}_{{\mathbf{y}}}^{T} \left( {{\mathbf{y}}_{b,q} - {\mathbf{y}}_{a,q} } \right)} \right\|_{h} } \right] + \mu \left\| {{\mathbf{W}} - {\mathbf{Z}}} \right\|_{F}^{2} + Tr\left( {{\varvec{\Lambda}}^{T} \left( {{\mathbf{W}} - {\mathbf{Z}}} \right)} \right) \hfill \\ \end{aligned} $$
(4)

where \( \left\| \cdot \right\|_{h} \) is a hinge loss function that measures the misclassification error with a quadratic loss, \( \left\| {\mathbf{x}} \right\|_{h} = \left\| {\hbox{max} (0,{\mathbf{x}})} \right\|_{2}^{2} \); \( \mu \) is the penalty parameter for the constraint \( {\mathbf{W}} = {\mathbf{Z}} \); \( {\varvec{\Lambda}} \in {\mathbb{R}}^{d \times 2} \) is the Lagrange multiplier matrix for the equality constraint \( {\mathbf{W}} = {\mathbf{Z}} \); \( Tr( \cdot ) \) is the trace operator; and \( \lambda \) is the penalty parameter for the constraints \( C_{1} \) and \( C_{2} \). Equation (4) can be optimized by alternately solving for \( {\mathbf{W}} \) and \( {\mathbf{Z}} \) until the overall energy function converges.
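The update rules are not spelled out here, so the following is only a sketch of one plausible \( {\mathbf{Z}} \)-step. With \( {\mathbf{W}} \) and \( {\varvec{\Lambda}} \) fixed, the terms of Eq. (4) that involve \( {\mathbf{Z}} \) are \( \left\| {\mathbf{Z}} \right\|_{2,1} + \mu \left\| {{\mathbf{W}} - {\mathbf{Z}}} \right\|_{F}^{2} + Tr({\varvec{\Lambda}}^{T} ({\mathbf{W}} - {\mathbf{Z}})) \). Completing the square gives \( \mu \left\| {{\mathbf{Z}} - {\mathbf{V}}} \right\|_{F}^{2} \) plus a constant, with \( {\mathbf{V}} = {\mathbf{W}} + {\varvec{\Lambda}}/(2\mu ) \), so each row of \( {\mathbf{Z}} \) is the corresponding row of \( {\mathbf{V}} \) shrunk towards zero with threshold \( 1/(2\mu ) \). The function names and toy values are hypothetical, and the \( {\mathbf{W}} \)-step (a smooth problem) is not shown.

```python
import math


def z_update(W, Lam, mu):
    """Row-wise shrinkage minimizing ||Z||_{2,1} + mu*||W - Z||_F^2 + Tr(Lam^T (W - Z))."""
    Z = []
    for w_row, l_row in zip(W, Lam):
        v = [w + l / (2.0 * mu) for w, l in zip(w_row, l_row)]
        norm = math.sqrt(sum(c * c for c in v))
        # Rows with norm below 1/(2*mu) are set exactly to zero (feature discarded).
        scale = max(0.0, 1.0 - 1.0 / (2.0 * mu * norm)) if norm > 0 else 0.0
        Z.append([scale * c for c in v])
    return Z


def z_objective(Z, W, Lam, mu):
    """The Z-dependent part of Eq. (4), used here only to sanity-check z_update."""
    group = sum(math.sqrt(sum(c * c for c in row)) for row in Z)
    quad = sum((w - z) ** 2 for wr, zr in zip(W, Z) for w, z in zip(wr, zr))
    lin = sum(l * (w - z) for lr, wr, zr in zip(Lam, W, Z)
              for l, w, z in zip(lr, wr, zr))
    return group + mu * quad + lin


W = [[3.0, 4.0], [0.1, 0.0]]
Lam = [[0.0, 0.0], [0.0, 0.0]]
Z = z_update(W, Lam, mu=0.5)   # threshold 1/(2*mu) = 1
print(Z)  # first row shrunk to about [2.4, 3.2]; second row zeroed out
```

The second row has norm 0.1, below the threshold, so the corresponding feature is removed from both classifiers at once, which is the group-wise selection effect described in Sect. 2.2.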

3 Experiments

In the following experiments, we select from the ADNI dataset 70 MCI-C subjects whose AD onset falls in the middle of the longitudinal image sequence and 81 MCI-NC subjects who stay in the MCI stage until the last scan in the latest ADNI dataset. Of all subjects, 95.3 % have 4 follow-ups at 6-month intervals, and the remaining 4.7 % have more than 4 follow-ups. Specifically, of the 70 MCI-C subjects, 11.1 % are diagnosed with AD at 6 months, 31.8 % at 12 months, and 25.3 % at 18 months after the baseline scan, while the remaining 31.8 % are diagnosed with AD more than 24 months after the baseline scan. We compare our proposed TS-SVM based early detection method with the standard SVM based method. Furthermore, we evaluate the importance of feature selection in both the TS-SVM and the standard SVM method. Thus, we compare the classification performance of four methods in total, denoted by SVM, SVM+FS, TS-SVM, and TS-SVM+FS, respectively. In all experiments, we split the data into 10 non-overlapping folds and report the classification accuracy averaged over 10-fold cross validation. The parameters are tuned by grid search on the training data only.
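A minimal sketch of such a split is given below. It assumes that folds are formed at the subject level, so that all partial image sequences of one subject fall into the same fold; the text does not state this explicitly, and the function names, the integer subject IDs, and the fixed seed are hypothetical.

```python
import random


def subject_folds(subject_ids, n_folds=10, seed=0):
    """Shuffle subjects and deal them into n_folds non-overlapping folds."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    return [ids[k::n_folds] for k in range(n_folds)]


def split_for_fold(folds, k):
    """Return (train_ids, test_ids) when fold k is held out."""
    test_ids = list(folds[k])
    train_ids = [s for j, fold in enumerate(folds) if j != k for s in fold]
    return train_ids, test_ids


folds = subject_folds(range(151))        # 70 MCI-C + 81 MCI-NC subjects
train_ids, test_ids = split_for_fold(folds, 0)
# Grid search would use train_ids only; test_ids is touched once, for evaluation.
print(len(train_ids), len(test_ids))     # 135 16
```

Splitting by subject rather than by partial sequence avoids placing overlapping scans of the same person on both sides of the split.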

Performance of AD Early Detection.

In each cross-validation round, we train our TS-SVM on the training data and sequentially apply the trained classifier to the image sequence of each testing subject, starting from the first follow-up. Since the time of conversion to AD after the baseline scan varies across MCI-C subjects, we show the detection accuracy for MCI-C subjects converting to AD 12 months, 18 months, and 24 months after the baseline scan separately in Tables 1, 2 and 3. Our TS-SVM outperforms the standard SVM by more than 10 % in classification accuracy, which shows the advantage of the temporal consistency and monotony constraints in our proposed method. Feature selection is also very important for detection accuracy: SVM+FS and TS-SVM+FS obtain average increases of 3.8 % and 2.9 % over SVM and TS-SVM, respectively. In brief, our full method (TS-SVM+FS) can detect AD 6 months prior to AD onset with 86.8 % accuracy, 12 months prior with 82.5 % accuracy, and 18 months prior with 76.5 % accuracy. Note that the early detection performance in Table 3 is worse than that in Tables 1 and 2 at the corresponding pre-diagnosis windows. The reason is that the subjects in Table 3 mostly have 5 time points and have AD onset exactly at the last time point. Thus, the imbalance between partial image sequences before and after AD onset makes it harder to learn robust classifiers.
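The sequential test-time procedure can be sketched as follows. The exact alarm criterion is not specified in the text, so this example assumes that an alarm is raised at the first partial sequence for which the MCI-C score exceeds the MCI-NC score by more than a hypothetical `threshold`; the function name and toy values are likewise illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def first_alarm(w_x, w_y, partial_features, threshold=0.0):
    """Return the shortest sequence length b that triggers an alarm, or None.

    partial_features: dict {b: h(b)} for the test subject, b = 2..T_z.
    With scans every 6 months, length b corresponds to 6 * (b - 1) months after baseline.
    """
    for b in sorted(partial_features):
        h = partial_features[b]
        if dot(w_y, h) - dot(w_x, h) > threshold:
            return b
    return None


w_x, w_y = [0.0], [1.0]                       # toy one-dimensional classifiers
subject = {2: [-0.5], 3: [0.2], 4: [1.0]}     # toy features for 3 partial sequences
print(first_alarm(w_x, w_y, subject))                 # 3
print(first_alarm(w_x, w_y, subject, threshold=0.5))  # 4
print(first_alarm(w_x, w_y, {2: [-1.0]}))             # None
```

Raising the threshold trades earlier alarms for higher confidence, which mirrors the balance between early detection and accuracy discussed in Sect. 1.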

Table 1. Accuracy of AD early detection at 6 months and 0 months before AD onset for the MCI-C subjects converting to AD 12 months after the baseline scan.
Table 2. Accuracy of AD early detection at 12 months, 6 months, and 0 months before AD onset for the MCI-C subjects converting to AD 18 months after the baseline scan.
Table 3. Accuracy of AD early detection at 18 months, 12 months, 6 months, and 0 months before AD onset for the MCI-C subjects converting to AD 24 months after the baseline scan.

Critical Brain Regions Related with AD Progression.

Since our method jointly selects morphological features while training TS-SVM, it is interesting to examine the critical brain regions whose morphological features contribute significantly to detecting AD progression via longitudinal tracking. Figure 2 shows the top 20 regions selected by our TS-SVM+FS method. The selected brain regions are located in AD-related sub-cortical regions (such as the putamen, thalamus, and hippocampus) and cortical areas (such as the orbitofrontal cortex, medial/lateral temporal lobe, and medial/lateral parietal lobe), which agrees with neuroimaging observations in the literature [10, 11]. We also compared the top selected ROIs for short-term and long-term detection and found that the cortical regions contribute more to short-term detection, whereas the sub-cortical regions, such as the putamen, thalamus, and hippocampus, contribute more to the detection of long-term converters. This may indicate that changes in the sub-cortical regions are more significant than those in the cortical regions at the earlier stage of AD progression. We do not visualize this result due to the page limit.

Fig. 2. The top 20 critical brain regions that contribute to AD early detection.

4 Conclusion

In this paper, we present a novel early AD diagnosis method using a temporally structured SVM. To avoid inconsistent and unrealistic classification results, we impose monotony on the output of the SVM, since AD progression is generally non-reversible. To achieve early alarm of AD onset, we adjust the classification margin such that the confidence of detecting AD progression increases as more follow-up scans are examined. Furthermore, we jointly perform feature selection and training of TS-SVM, so that the selected features work well with the trained classifiers.