Patients with mild cognitive impairment due to Alzheimer’s disease (MCI) [1], also known as mild neurocognitive disorder [2], are considered to be at an early stage of dementia. There are now multiple published criteria sets for identifying these individuals at high risk of progression [1–3], all of which include at least: 1) subjective concern; 2) objective cognitive impairment on formal neuropsychological testing in one or more cognitive domains, typically including memory; 3) preservation of functional independence; and 4) no dementia.

Although these criteria have been a major step forward in the conceptualization of MCI, they leave room for considerable ambiguity, particularly regarding the operational definition of objective cognitive impairment. A number of cognitive tests have been proposed that may be useful for identifying objective episodic memory impairment in MCI, specifically measures that assess both immediate and delayed recall, such as word-list learning or paragraph recall [1, 4]. These suggestions are very useful in providing common ground for clinicians and researchers working with MCI cohorts. However, three critical issues remain.

First, it is unclear which cutoff scores should be used to define impairment. Studies examining MCI patients typically report test performance in the range of one to two standard deviations (SD) below age-adjusted and/or education-adjusted norms. However, using a −1 SD cutoff may be overly inclusive, as cognitive performance in healthy older adults often falls below this limit [5] for a variety of non-pathological reasons (e.g., fatigue, anxiety). Conversely, using a −2 SD cutoff may underestimate the number of individuals who are in the earliest phases of the disease process.

Second, it is unclear how many measures should be used in assessing cognition. In memory clinics, diagnosis is typically based on results of a battery of neuropsychological tests including more than one test probing the same cognitive domain. Longitudinal evidence confirms that using at least two tests to establish impairment greatly increases diagnostic accuracy [6]. In research settings, however, MCI diagnosis is often based on a single test. This is potentially problematic, as research has shown that more than one quarter of healthy elderly adults who are tested using a single memory measure obtain scores in impaired ranges (< −1.5 SD), while this number is reduced to 14.1 % when a second test is added [5]. As mentioned above, impaired performance on a single test in otherwise healthy normal adults may be explained by numerous factors such as anxiety, depression, fatigue, or inattention. Thus, this single-test procedure may not be adequate for identifying individuals who are at highest risk of dementia.

Third, it is unclear which cognitive domain(s) should be assessed, if any, in addition to episodic memory. Originally, Petersen’s [3] diagnostic criteria recommended that a distinction be made between single-domain and multiple-domain MCI, with the assumption that this classification would be of heuristic value in determining the probable etiology of the disorder. This recommendation is echoed in Albert and colleagues’ [1] revised criteria as well. Indeed, some longitudinal evidence suggests that these subtypes evolve differently over time [7], suggesting distinct etiological processes. However, the most recent DSM-5 criteria for mild neurocognitive disorder [2] do not discriminate between single-domain and multiple-domain cognitive impairment. Many research studies also do not make this distinction.

In addition, recent guidelines for diagnosing MCI have emphasized the importance of using genetic and imaging biomarkers in addition to neuropsychological testing. The presence of one or two copies of the epsilon 4 allele (ε4) in the apolipoprotein E (APOE) gene is one commonly accepted genetic characteristic believed to increase the risk of development of dementia due to Alzheimer’s disease (AD) [8]. Additionally, metrics obtained from structural magnetic resonance imaging (MRI) that assess neuronal injury, such as total brain atrophy [9, 10], ventricular enlargement [11–13], hippocampal (HP) volume loss [14, 15], medial temporal lobe atrophy [16], and possibly the presence of small vessel disease [17], may be informative predictors for the development of AD dementia.

Using data obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the purpose of this study is to determine whether prediction of development of clinical dementia among non-demented participants is improved by: 1) using cutoff scores of −1.0, −1.5 or −2.0 SD to define cognitive impairment; 2) assessing episodic memory using one or two tests; 3) assessing additional non-memory domains; and 4) accounting for commonly used neuroimaging and genetic biomarker data. It was hypothesized that the identification of individuals at risk for the development of dementia would best be predicted by defining objective impairment as performance < −1 SD on two episodic memory tests. Furthermore, it was anticipated that the ability to predict the development of AD would be further optimized by considering performance in at least one other, non-memory domain. Finally, it was expected that the inclusion of imaging and genetic biomarkers known to be associated with AD would further improve prediction.

Materials and methods

Data used in the preparation of this article were obtained from the ADNI database on 3 February 2015. The ADNI was launched in 2003 as a public–private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD.


Of the 819 participants enrolled in ADNI-1, those who had neuropsychological and genetic data available at baseline and 24-month follow-up were selected for this study (n = 630). A 24-month follow-up period was selected to maximize statistical power and to ensure that harmonized imaging outcome measures were available for the majority of the sample. Of these 630 participants, those with a diagnosis of probable AD at baseline were excluded (n = 136). Individuals with a history of neurological or psychiatric illness or substance abuse, or without a study partner able to corroborate reports of functioning, were not eligible for ADNI; complete eligibility criteria for the ADNI study as a whole are described in the ADNI study documentation. The final sample consisted of the remaining 494 non-demented participants. According to the assigned diagnoses in the ADNI database, 294 of these participants were classified as having MCI, and the remaining 200 were classified as cognitively normal. All participants (201 women, 293 men) were 55–89 years old at baseline (mean = 75.3 ± 6.4) and had 6–20 years of education (mean = 15.9 ± 2.8).


Cognitive measures

A neuropsychological battery was administered to all participants upon admission to ADNI, and raw scores were downloaded from the ADNI Neuropsychological Battery table. Of interest in the present study are tests that measure general cognition (Mini-mental state exam (MMSE)), episodic memory (Logical memory story A delayed recall (LM-II), Rey auditory verbal learning test (AVLT)), language (Category fluency, Boston naming test (BNT)) and executive functioning (Trails A and B). A derived Trails B/Trails A ratio was calculated to obtain a relatively independent measure of executive control, as has been suggested by other authors [18]. Raw scores were transformed to standardized scores (z scores or scaled scores (SS)) based on published age-adjusted norms for the AVLT [19], Category fluency [20], BNT [21] and Trails A & B [18]. Education-adjusted z scores for LM-II story A were obtained using a web-based calculator [22] based on data from a large published report [23]. Higher z scores or SS represent better performance, with the exception of Trails A and B in which higher scores represent poorer performance (i.e., longer time to complete the test).
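As a minimal sketch of the standardization step described above, the snippet below converts a raw score to a z score and derives a Trails B/A ratio. The function names and the normative mean and SD are hypothetical placeholders, not values from the cited norms, and the ratio is assumed here to be computed from completion times in seconds.

```python
def z_score(raw, norm_mean, norm_sd):
    """Standardize a raw score against a normative mean and SD."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (raw - norm_mean) / norm_sd


def trails_ratio(trails_b_seconds, trails_a_seconds):
    """Trails B / Trails A ratio (assumed here to use completion times)."""
    if trails_a_seconds <= 0:
        raise ValueError("trails_a_seconds must be positive")
    return trails_b_seconds / trails_a_seconds


# Hypothetical participant and hypothetical norms (illustration only).
example_z = z_score(raw=5, norm_mean=9.0, norm_sd=3.2)
example_ratio = trails_ratio(trails_b_seconds=120.0, trails_a_seconds=40.0)
print(example_z, example_ratio)
```

In practice the normative mean and SD would be looked up by age (or education) band in the published norms for each test.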

Outcome measure

The presence or absence of clinically probable AD was assessed at 24 months and defined as: 1) MMSE <26; 2) Clinical Dementia Rating (CDR) ≥0.5; and 3) positive NINCDS/ADRDA criteria for probable AD [24].
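The three-part outcome definition can be written as a single conjunction. The sketch below assumes that all three conditions must hold together and that the NINCDS/ADRDA judgment is already available as a yes/no flag; the function and argument names are illustrative only.

```python
def meets_probable_ad_outcome(mmse, cdr, nincds_adrda_probable_ad):
    """Return True when all three 24-month outcome criteria are met.

    mmse: Mini-mental state exam total score
    cdr: Clinical Dementia Rating global score
    nincds_adrda_probable_ad: True if NINCDS/ADRDA criteria for probable AD are met
    """
    return bool(mmse < 26 and cdr >= 0.5 and nincds_adrda_probable_ad)


print(meets_probable_ad_outcome(mmse=24, cdr=1.0, nincds_adrda_probable_ad=True))
print(meets_probable_ad_outcome(mmse=27, cdr=0.5, nincds_adrda_probable_ad=True))
```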

Imaging and genetic biomarkers

Neuroimaging-based biomarkers were obtained from downloaded ADNI database tables (hierarchical parcellation of MRI using multi-atlas labeling methods (UPENN); white matter hyperintensity volumes (UCD)). Whole brain atrophy was assessed using the brain parenchymal fraction (BPF), which was calculated as a ratio of total parenchymal volume (gray matter (GM) and white matter (WM)) to total cranial vault (TCV) volume as follows:

$$ \mathrm{BPF} = \left(\mathrm{GM} + \mathrm{WM}\right)/\mathrm{TCV}. $$
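For illustration, the BPF ratio can be computed as below. The volumes are made-up numbers in millilitres, chosen only to show the arithmetic; they are not ADNI data.

```python
def brain_parenchymal_fraction(gm_ml, wm_ml, tcv_ml):
    """BPF = (gray matter + white matter) / total cranial vault volume."""
    if tcv_ml <= 0:
        raise ValueError("tcv_ml must be positive")
    return (gm_ml + wm_ml) / tcv_ml


# Hypothetical volumes (illustration only).
bpf = brain_parenchymal_fraction(gm_ml=580.0, wm_ml=470.0, tcv_ml=1400.0)
print(bpf)
```

Because the numerator and denominator share the same units, the ratio is unitless and already corrected for head size.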

To assess medial and focal atrophy, head-size-corrected ventricular cerebrospinal fluid (vCSF) and HP volume were automatically segmented using previously published and validated methods [11, 14]. Small vessel disease burden was assessed using whole brain white matter hyperintensity (WMH) volumes [25]. Full segmentation methodological details can be obtained from ADNI (see ADNI1_Methods_UCD_WMH_Volumes_Methods.pdf and ADNI_Total_Cranial_Vault_Segmentation_Method_20121108.pdf). In addition, the presence of one or two copies of the APOE ε4 allele was determined for all participants as per standard ADNI protocol.

Statistical analyses

Six binary variables were created based on scores < −1.0, −1.5 or −2.0 SD on one (LM-II or AVLT delayed recall) or two (LM-II and AVLT delayed recall) memory tests, and participants were classified as above or below each cutoff. The predictive accuracy of these six cutoffs was tested using the area under the curve (AUC) for receiver operating characteristic (ROC) analysis. The minimum value for an AUC to be considered clinically significant was >0.75 [26]. Hanley and McNeil’s [27] method was used to test for statistical differences between AUC values. Cutoff scores with AUC values >0.75 were then entered into separate binary logistic regression analyses with hierarchical designs, with probable AD at 24 months as the binary (yes/no) dependent variable. In all models, age, sex, education, MMSE and the selected cutoff score were entered in a first block. A second block included performance on non-memory cognitive measures, specifically standardized Category fluency, BNT, and Trails B/A-derived scores. A third block assessed the potential added predictive value of biomarkers that are known to be associated with probable AD: BPF, vCSF volume, total HP volume, WMH volume, and APOE ε4 status. We verified that all variables met multicollinearity and linearity assumptions.
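The six binary cutoff variables could be derived along the following lines. This is a sketch under one stated assumption: "one test" is read as a score below the cutoff on at least one of the two measures, which is one plausible reading of the text. An LM-II-only definition of the single-test cutoff would need only the first operand. The names are illustrative.

```python
CUTOFFS = (-1.0, -1.5, -2.0)


def impairment_flags(lm2_z, avlt_z):
    """Return six flags keyed by (number of tests, cutoff in SD units).

    Assumption: the one-test flag is True when either score is below the
    cutoff; the two-test flag requires both scores to be below it.
    """
    flags = {}
    for cutoff in CUTOFFS:
        flags[(1, cutoff)] = (lm2_z < cutoff) or (avlt_z < cutoff)
        flags[(2, cutoff)] = (lm2_z < cutoff) and (avlt_z < cutoff)
    return flags


# Hypothetical participant: LM-II z = -1.2, AVLT delayed recall z = -0.4.
example_flags = impairment_flags(lm2_z=-1.2, avlt_z=-0.4)
for key, value in sorted(example_flags.items()):
    print(key, value)
```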

Last, in order to see whether participants whose performance fell above and below the best selected cutoff scores were phenotypically different, multivariate analysis of covariance (MANCOVA) was used to compare cognitive and neuroimaging characteristics between these two groups, with age, sex and education entered as covariates. Highly skewed variables exhibiting non-normal distributions were log-transformed (WMH, vCSF) or inverse-transformed (Trails B/A ratio) prior to analysis. Category fluency scores did not meet the equal variance assumption and were therefore log-transformed. Dichotomous variables were compared using the chi-square test.

Results


At 24 months post-baseline, 112 participants (22.7 %) had received a diagnosis of AD. Sensitivity, specificity and accuracy of the different cutoff scores are illustrated in Fig. 1. On ROC analysis there were three cutoffs with AUC values >0.75. A cutoff of < −1 SD on two memory tests (AUC = 0.80, standard error (SE) = 0.02, 95 % CI 0.75, 0.84) had 75.91 % accuracy in correctly identifying patients who would later develop probable AD (97 true positives) and those who would not (278 true negatives). A cutoff of < −1.5 SD on one memory test (AUC = 0.77, SE = 0.02, 95 % CI 0.73, 0.81) had 66.60 % accuracy (108 true positives, 221 true negatives). A cutoff of < −2 SD on one memory test (AUC = 0.77, SE = 0.03, 95 % CI 0.72, 0.82) had 76.52 % accuracy (87 true positives, 291 true negatives). The AUC values for the three cutoff scores were not statistically different (all comparisons p >0.05, one-tailed).
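The reported accuracies follow directly from the true positive and true negative counts, given 112 converters and 382 non-converters in the sample of 494. The sketch below recomputes them and also shows that, for a single binary cutoff, the ROC curve has one operating point, so its area equals the mean of sensitivity and specificity. The authors' statistical software may have computed AUC differently, but the values agree with those reported to two decimals.

```python
N_TOTAL = 494
N_CONVERTERS = 112
N_NONCONVERTERS = N_TOTAL - N_CONVERTERS  # 382

# (true positives, true negatives) as reported for each cutoff.
COUNTS = {
    "< -1 SD on two tests": (97, 278),
    "< -1.5 SD on one test": (108, 221),
    "< -2 SD on one test": (87, 291),
}


def summarize(tp, tn):
    """Sensitivity, specificity, accuracy and single-point AUC."""
    sensitivity = tp / N_CONVERTERS
    specificity = tn / N_NONCONVERTERS
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / N_TOTAL,
        "auc": (sensitivity + specificity) / 2,
    }


for label, (tp, tn) in COUNTS.items():
    s = summarize(tp, tn)
    print(f"{label}: accuracy={s['accuracy']:.2%}, AUC={s['auc']:.2f}")
```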

Fig. 1 Sensitivity, specificity and accuracy of different cutoff scores in 494 non-demented participants at baseline. AD Alzheimer’s disease, LM-II Logical memory story A delayed recall, AVLT Rey auditory verbal learning test

Seven participants were excluded from subsequent analyses because they had missing data (two had missing WMH data, two had missing Trails B data, one had missing BNT data, and two had missing Trails B and BNT data). In the first logistic regression model, which tested the added value of non-memory measures and biomarkers in addition to a cutoff of < −1 SD on two memory tests (B = 2.55, SE = 0.33, p <0.001), MMSE was a significant predictor of future AD (B = −0.34, SE = 0.08, p <0.001). Only the presence of two APOE ε4-positive alleles (B = 1.10, SE = 0.45, p = 0.016) further improved prediction. Altogether, this model accounted for 83.4 % of the variance in risk of probable AD (Table 1).

Table 1 Variables predicting AD in addition to < −1 SD on two episodic memory tests

In the second model, in addition to a cutoff of < −1.5 SD on one memory test (B = 3.09, SE = 0.54, p <0.001), significant predictors of probable AD were MMSE (B = −0.32, SE = 0.07, p <0.001) and the Trails B/A ratio in the non-memory cognitive measures block (B = 0.27, SE = 0.13, p = 0.033). Biomarkers that significantly improved prediction included BPF (B = −16.58, SE = 7.64, p = 0.030) and presence of two APOE ε4-positive alleles (B = 1.05, SE = 0.45, p = 0.021). This model accounted for 82.3 % of the variance in risk of probable AD (Table 2).

Table 2 Variables predicting AD in addition to < −1.5 SD on one episodic memory test

In the third model, in addition to a cutoff of < −2 SD on one memory test (B = 2.04, SE = 0.28, p <0.001), significant predictors of probable AD were MMSE (B = −0.40, SE = 0.08, p <0.001) and the Trails B/A ratio in the non-memory cognitive measures block (B = 0.31, SE = 0.13, p = 0.017). Presence of two APOE ε4-positive alleles (B = 1.07, SE = 0.46, p = 0.019) further improved prediction. This model accounted for 81.9 % of the variance in risk of probable AD (Table 3).

Table 3 Variables predicting AD in addition to < −2 SD on one episodic memory test
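The logistic regression coefficients reported for the three models can be read as odds ratios by exponentiating B, with an approximate 95 % Wald interval from B ± 1.96 × SE. The sketch below applies this to two coefficients from the first model. The resulting odds ratios are derived here from the reported B and SE values and are not quoted from the tables.

```python
import math


def odds_ratio_with_ci(b, se, z=1.96):
    """Return (odds ratio, lower bound, upper bound) from a logit coefficient."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


# Coefficients reported for the first model (< -1 SD on two memory tests).
apoe_or, apoe_lo, apoe_hi = odds_ratio_with_ci(b=1.10, se=0.45)
mmse_or, mmse_lo, mmse_hi = odds_ratio_with_ci(b=-0.34, se=0.08)

print(f"Two APOE e4 alleles: OR={apoe_or:.2f} ({apoe_lo:.2f}, {apoe_hi:.2f})")
print(f"MMSE (per point):    OR={mmse_or:.2f} ({mmse_lo:.2f}, {mmse_hi:.2f})")
```

An odds ratio above 1 indicates higher odds of probable AD at 24 months, while the MMSE odds ratio below 1 reflects lower odds with each additional point.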

Participants who scored above (n = 291) and below (n = 196) a cutoff score of < −1 SD on two memory tests were compared using MANCOVA. Levene’s test indicated that both groups had equal variances (all variables p >0.05). As summarized in Table 4, those with episodic memory scores below the cutoff had poorer performance on Category fluency (F (4,482) = 14.23, p <0.001), BNT (F (4,482) = 25.60, p <0.001), and Trails B/A ratio (F (4,482) = 7.18, p <0.001). For brain morphology, patients below the cutoff had smaller BPF (F (4,482) = 49.02, p <0.001), smaller left (F (4,482) = 44.83, p <0.001) and right HP volumes (F (4,482) = 41.03, p <0.001), more vCSF (F (4,482) = 28.99, p <0.001) and larger WMH volume (F (4,482) = 8.69, p <0.001).

Table 4 Characteristics (mean (SD)) of participants above and below selected cutoffs

Participants who scored above (n = 223) and below (n = 264) a cutoff score of < −1.5 SD on one memory test were compared in a second MANCOVA. Two variables violated Levene’s test (Trails B/A ratio and left HP volume), likely due to the large sample sizes. Inspection of the data showed that the variance between both groups was highly similar (in the above-cutoff and below-cutoff groups, the respective variances were 0.010 and 0.016 for Trails B/A ratio, and 0.001 and 0.001 for left HP volume), and therefore parametric analyses were retained. Results revealed that individuals with episodic memory scores below the cutoff had poorer performance on Category fluency (F (4,482) = 14.24, p <0.001), BNT (F (4,482) = 24.00, p <0.001), and Trails B/A ratio (F (4,482) = 3.81, p = 0.005). They also had smaller BPF (F (4,482) = 45.00, p <0.001), smaller left (F (4,482) = 27.38, p <0.001) and right HP volume (F (4,482) = 33.42, p <0.001), more vCSF (F (4,482) = 28.94, p <0.001) and larger WMH volume (F (4,482) = 8.90, p <0.001).

Participants who scored above (n = 313) and below (n = 174) a cutoff score of < −2 SD on one memory test were compared in a third MANCOVA. Trails B/A ratio violated Levene’s test of equality of error variances, but again inspection of the data showed highly similar variances between the above-cutoff (0.010) and below-cutoff (0.016) groups. Parametric analyses were thus retained. Individuals with episodic memory scores below the cutoff had poorer performance on Category fluency (F (4,482) = 11.61, p <0.001), BNT (F (4,482) = 19.23, p <0.001), and Trails B/A ratio (F (4,482) = 3.40, p = 0.009). They also had smaller BPF (F (4,482) = 45.07, p <0.001), smaller left (F (4,482) = 31.79, p <0.001) and right HP volume (F (4,482) = 35.16, p <0.001), more vCSF (F (4,482) = 28.72, p <0.001) and larger WMH volume (F (4,482) = 9.33, p <0.001).

Discussion


This study aimed to assess how various cognitive, neuroimaging and genetic measures collected at baseline can be used to predict the development of probable AD dementia at 24 months in a sample of elderly participants obtained from ADNI. By assessing a series of normative cutoff scores from cognitive test results, the number of episodic memory and non-memory tests used to assess cognitive performance, and other commonly used neuroimaging and genetic biomarkers, a set of recommended criteria was established which may be used in future investigations to improve prediction for the development of probable AD in the elderly.

Consistent with our initial hypotheses, performance < −1 SD on two memory tests (LM-II and AVLT delay) had the best trade-off between sensitivity and specificity for predicting probable AD, followed by performance < −1.5 SD and < −2 SD on one memory test (LM-II). These results suggest that to maximize diagnostic certainty, a minimum of two measures should ideally be used to assess episodic memory performance and impairment should be defined as scores at least 1 SD below appropriate normative references on both measures. Jak and colleagues [28] were among the first to recommend establishing impairment on at least two measures within a cognitive domain as the best way to increase sensitivity while maintaining reliability, and other authors have since corroborated the value of this approach [6, 29–31]. Our results further indicate that clinicians or researchers with limited resources who administer only a single memory test should opt for a much more stringent cutoff (i.e., −2 SD below normative reference data) to determine episodic memory impairment with comparable accuracy to two measures. Applying a −1.5 SD cutoff to a single test should be avoided when possible, as it remains highly prone to false positive diagnostic errors (cf. [30, 31]), which reached nearly one-third of the sample (32.6 %) in the present study.
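The false positive figure quoted above can be recovered from the counts given in the Results. The sketch below derives false positives and false negatives for each cutoff, assuming 112 converters among the 494 participants; the function name is illustrative.

```python
N_TOTAL = 494
N_CONVERTERS = 112
N_NONCONVERTERS = N_TOTAL - N_CONVERTERS  # 382


def error_counts(tp, tn):
    """Return (false positives, false negatives, false positives as share of sample)."""
    fp = N_NONCONVERTERS - tn
    fn = N_CONVERTERS - tp
    return fp, fn, fp / N_TOTAL


for label, tp, tn in [
    ("< -1 SD on two tests", 97, 278),
    ("< -1.5 SD on one test", 108, 221),
    ("< -2 SD on one test", 87, 291),
]:
    fp, fn, fp_share = error_counts(tp, tn)
    print(f"{label}: FP={fp}, FN={fn}, FP share of sample={fp_share:.1%}")
```

The −1.5 SD single-test cutoff misses the fewest converters but flags the most non-converters, which is the trade-off the paragraph above describes.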

The only variable that improved prediction above and beyond episodic memory testing using two measures was APOE status, consistent with previous research recognizing APOE ε4-positive status as a major risk factor for subsequent AD (see [32] for a review). When only one test was used to assess episodic memory, prediction of dementia was improved using a non-memory test, specifically the ratio of Trails B/A, considered to be a measure of executive control [18]. Predictive accuracy was further increased using APOE ε4 status and whole-brain atrophy (as indexed by brain parenchymal fraction). These interesting results suggest that thorough episodic memory testing using several measures is successful in predicting subsequent dementia with at least as much accuracy as using one memory test plus additional non-memory tests and biomarkers. It has previously been reported that the use of sensitive neuropsychological instruments is at least as effective in predicting AD as imaging biomarkers [33–36]. Other authors have also reported that the use of a single memory test is not optimal in predicting AD, and that adding information on brain atrophy and/or cerebrospinal fluid biomarkers is necessary to improve predictive accuracy in regression models [35, 37, 38]. We corroborate these findings, and extend them to specify that “impairment” should be defined as performance more than 1 SD below normative data.

Certain limitations must be considered in interpreting these data. First, the ADNI study specifically set out to recruit patients who represented relatively pure cases of MCI and dementia of the Alzheimer’s type, who are appropriate for clinical trials; this is evident in patients’ relatively low burden of WMH [39] (thought to reflect underlying vascular disease [40]). As such, the sample primarily includes individuals whose suspected etiology is AD, and whose primary (and often only) cognitive deficit involves memory. While ADNI provides a large and rich database to study individuals who are at high risk of developing AD, findings generated from these data have limited generalizability to real-world patient populations [39]. Other, more inclusive cohorts of individuals with MCI are needed. In addition, the standardized scores used in this study were derived from published age-adjusted norms for each test. It is possible that the use of local norms may produce different results (e.g., see [41]).

We have shown that diagnostic accuracy can be improved by approximately 10 % by administering an extra memory test to evaluate memory capacities in persons suspected of MCI. This improved accuracy is mostly the result of reducing false positive results, which other authors have shown are inflated when using a single test [31]. Although adding a test to the diagnostic battery resulted in some patients being missed at baseline, who went on to develop AD at 24 months, our findings suggest that this trade-off is altogether fair. An incorrect diagnosis of AD has serious implications for research and clinical practice. First, studies that employ only LM-II to test for memory impairment in participants are effectively pooling true MCI cases with those who are likely cognitively normal, thus potentially weakening the robustness of the research findings and limiting their generalizability. Clinically, the consequences of an incorrect diagnosis include needless testing, pharmacotherapy, and anxiety incurred by the patient and family. Also, inaccurate diagnosis implies that alternative (potentially reversible) causes of cognitive changes are not being investigated.

In closing, we must acknowledge that expanding cognitive batteries to include an extra memory test has some disadvantages. Namely, more clinician time and additional test materials are required, and research protocols will be slightly lengthened. However, we believe that these caveats are greatly outweighed by the benefit of improved accuracy, and that an additional memory measure should be added to clinical and research cognitive batteries to the extent that it is feasible.

Conclusions


The findings of our study in the ADNI cohort suggest that neuropsychological testing can predict decline with high accuracy regardless of biomarkers, when memory is assessed using delayed recall of a short story and a word list, with impairment defined as performance more than 1 SD below normative references on both tests. This criterion provides the optimal trade-off between specificity and sensitivity for predicting conversion to AD at two years. The increased accuracy that this criterion provides decreases the probability of misdiagnosing a patient and avoids needless testing, pharmacotherapy and anxiety, and provides a high-accuracy, low-cost strategy for identifying individuals at highest risk of dementia. In situations where it is only feasible to administer a single memory test, collecting information on non-memory performance and imaging or genetic biomarkers is necessary to optimize diagnostic accuracy.