Introduction

Intense research efforts have been conducted to develop and validate biomarkers to predict, detect, and follow up the progression of the disease or impact of potential new disease-modifying treatments of Alzheimer’s disease (AD). Brain imaging and biologic biomarkers are now included in the recommended AD diagnostic criteria[13]. Various scales or tools are available to physicians involved in AD research for better interpreting the positive or negative results of biomarkers for establishing a clinical diagnosis, such as the detection of hippocampal atrophy with the Scheltens’ scale[4] and the positivity of amyloid load on functional brain imaging via the Jack’s visual scale[5].

Although biologic cerebrospinal fluid (CSF) biomarkers are among the most studied and validated in clinical practice[68], no sort of “visual” scale concerning biologic biomarkers is available. Numerous studies did indeed show that AD patients display characteristic CSF changes with decreased levels of β-amyloid1-42 (Aβ42) and elevated levels of total tau protein (tau) and its phosphorylated form at threonine 181 (p-tau)[7, 9, 10]. Interestingly, these biomarkers are directly related to neuropathologic changes[11] present in the disease. Therefore, one would expect that the presence of these three biomarkers would be highly indicative of AD, whereas their absence would strongly disqualify this diagnosis. Other biomarkers represented by isoforms of Aβ (Aβ40, Aβ38) or linked to oxidative stress and inflammation[12, 13] may also contribute to the diagnosis of AD, in particular when combined with tau or Aβ42. Apart from familial forms of AD, genetic profiles[14], in particular apolipoprotein E status, which represents the most prominent risk factor, is not yet used as a diagnostic tool.

So, today we can only rely routinely on the commonly measured CSF biomarkers Aβ42, tau, and p-tau to help us with AD diagnosis. In clinical practice, after control of preanalytic biases[1517] and standardization of procedures[1820], the performance of CSF biomarkers is satisfactory with a coupled sensitivity/specificity as high as 80% when used alone or combined. The combination of these biomarkers increases their performance, as demonstrated for Aβ42 and tau[21] or Aβ42 and p-tau[22, 23]. In most AD biomarker studies, results are represented by the best sensitivity/specificity and area under the ROC curves (AUC). It is, however, difficult to use this information directly when biomarkers are being used in routine practice to help physicians with AD diagnosis[68]. We need a way to convert the variation of the three biomarkers into a probability scale for AD. A leading study from Spies et al.[24] proposed a logistic regression of logarithmic-transformed values of Aβ42 and p-tau values and sex defined classes of AD probability. Logistic regressions already demonstrated their relevance in differentiating AD patients from non-AD patients[7, 23].

In the present study, we conducted an evaluation, based on multicenter data from our PLM cohort[25], of a simpler scale based solely on the numbers (from 0 to 3) of pathologic biomarkers (Aβ42, tau, and p-tau) identified. This intuitive scale was very efficient for generating categories of patients seen in memory clinics with a refined probability of AD, and it will represent a valuable tool for facilitating the interpretation and routine use of multivariate CSF data for stratifying patients into clinical research trials.

Material and methods

Study design and subjects

Patients (1,273) who had a lumbar puncture were recruited between January 2008 and December 2011 from the three initial PLM centers (Paris-North, Lille, and Montpellier) specialized in the care management of patients with cognitive disorders (Table 1). These centers used the same diagnostic procedures and criteria[25]. All patients had a thorough clinical examination, including biologic laboratory tests, neuropsychological evaluations, and brain imaging. Patients were classified into two groups: AD (as defined by the NINCDS-ADRDA criteria[26]), and non-AD (NAD) patients. NAD diagnosis (that is, frontotemporal lobar degeneration, semantic dementia, Lewy body and Parkinson diseases, progressive supranuclear palsy, amyotrophic lateral sclerosis, normal-pressure hydrocephalus, and psychiatric disorders), were defined by the commonly validated international criteria. Important for this study, the used diagnostic criteria for AD do not include CSF biomarkers (otherwise, that would bias the evaluation of the interest of the scale, which was based on biomarker values). Mild cognitive impairment, as well as phenotypes that were mixed with AD or might correspond to specific/early forms of AD, were excluded from the cohorts (mixed dementia, primary progressive aphasia, amyloid angiopathy). CSF was collected by using standardized collection, centrifugation, and storage conditions in different centers[25].

Table 1 Population demography and biomarker values

A second set of data from 408 patients with the same clinical characteristics was generated from the three new PLM centers (Rouen, Strasbourg, and Besançon) also specialized in the care management of patients with cognitive disorders.

CSF samples and assays

CSF Aβ42, tau, and p-tau concentrations were measured by using standardized commercially available INNOTEST sandwich ELISA, according to the manufacturer’s procedures (Fujirebio Europe NV, formerly Innogenetics NV). As the data were generated in the different centers through the routine activity of their laboratories, the lots of assay kits were variable within and in between laboratories. The quality of the results was ensured by the use of validated standard operating procedures and internal quality controls (QCs). The range of the QC coefficient of variation for Aβ42 across the different lots for the six laboratories was 5% to 11%. For tau and p-tau, the range was 8% to 14% and 6% to 14%, respectively. The use of external QC ensured also the quality of the results and the validity of the intersite data comparison[19]. The Paris, Lille, and Montpellier centers contributed with two sets of data labeled “-1” or “-2” generated from different collection tubes (for the −1 cohort in Montpellier, Greiner, catalog number 18 82 81; in Lille, Becton Dickinson, catalog number Falcon 35 2097; in Paris, CML, catalog number TC10PCS, and for the −2 cohorts: Sarstedt, catalog number 62.610.201)[20]. These two sets of data differed in their optimal cutoffs (≤506 and ≤834 for Aβ42; >343 and >340 for tau; >64 and >62 for p-tau; see Additional file1) in relation with the preanalytic properties of the tube that affected mostly Aβ42 levels[27].

We pooled the data from the three additional PLM centers (Rouen, Strasbourg, and Besançon), which used standard cutoff biomarker values in relation to the type of collection tube used[20]. The lumbar puncture, the CSF biomarkers measurement, and the clinical examination (including the diagnosis) were part of standard care. This observational study on routine biologic analyses is not considered in France to be “biomedical research,” and it does not necessitate informed consent or ethical approval. Authorization for handling personal data has been granted by the French Data Protection Authority (CNIL) under the number 1709743 v0.

Statistical analysis

Statistical analyses were computed with the MedCalc software (11.3). The logistic regression was performed by following a method similar to that of Spies et al.[24]. In brief, log-transformed CSF biomarkers and sex were entered with backward stepwise selection by using a significance level of 0.10. The logistic regression generates the coefficients of a formula to predict the logistic transformation of the probable presence of AD (p(AD)) as follows:

p(AD) = 1/(1 + 1 e – [intercept + score])

where score = coefficient × ln(Aβ42) + coefficient × ln(p-tau) + coefficient × Sex. Probability classes to discriminate between AD and non-AD patients were obtained by selecting the following ranges of p(AD): class 0, 0 to 0.1; class 1, 0.1 to 0.5; class 2, 0.5 to 0.9; and class 3, 0.9 to 1.0.

We did not perform the logistic regression analysis in one training population with the goal to apply the resulting classification to the other populations, but we rather recalculated the coefficients in each cohort to generate best-fitted models. Receiver operating characteristic (ROC) curves were used to represent sensitivity and specificity for AD detection. ROCs are generated from continuous diagnostic variables. The CSF biomarkers and their ratio are continuous, as well as the p(AD) values obtained after the logistic regression. Regarding the scales, we plotted the curve by using the four values 0, 1, 2, and 3 applied to the different samples.

To compare the classifications of patients, we used the net reclassification index (NRI)[28]. The NRI is based on reclassification constructed separately for participants with and without the event of interest (that is, AD or non-AD diagnosis), and quantifies the correct movement into classes, upward for events and downward for nonevents. At first, the following probabilities were calculated: p(up_AD) = (number of cases in which the class was moving up between two classifications of AD patients)/(number of AD patients); p(down_AD) = (number of cases in which the class was moving down between two classifications of AD patients)/(number of AD patients); p(up_NAD) = (number of cases in which the class was moving up between two classifications of NAD patients)/(number of NAD patients); p(down_NAD) = (number of cases in which the class was moving down between two classifications of NAD patients)/(number of NAD patients). We assumed that correctly classifying an AD patient was as important as correctly classifying an NAD patient, and therefore we computed the NRI by using the formula: NRI = (p(up_AD)-p(down_AD))-(p(up_NAD)-p(down_NAD)).

Results

Demographics and biomarker values for the different cohorts are presented in Table 1. As expected, differences in individual biomarker concentrations between AD and NAD were apparent across all cohorts. ROC curves for Aβ42, tau, and p-tau were used to compute AUCs (Table 2) and optimal cutoff values (Additional file1). AUCs were also calculated for Aβ42/tau and Aβ42/p-tau ratios, as well as the Aβ/tau index (IATI)[21] (Table 2). The Paris, Lille, and Montpellier PLM centers provided data from two independent cohorts (−1 and −2) that differed by the type of collection tube used (see Material and methods section). This explains the variations in cutoff values; the second tube having been selected for its low Aβ42 absorption[27], therefore resulting in higher cutoff values[20]. All patients of these three centers for whom samples were collected in tube 1 or 2 were also differentiated into two additional populations called PLM-1 and PLM-2. This was one way to evaluate the AD scale on multicenter cohorts after standardization of biologic and clinical practices. Data from these populations were used to compare predictions based on a logistic regression performed as shown by Spies et al.[24]. Interestingly, sex was never retained in the model used for our cohorts (Additional file1). The models were very efficient in differentiating AD from non-AD (NAD) patients for all cohorts with AUCs close to 0.9 in most cases (Table 2).

Table 2 AUCs

The distribution of AD and NAD patients was then evaluated according to four classes: between p(AD) 0 to 0.1 for class 0; 0.1 to 0.5 for class 1; 0.5 to 0.9 for class 2; and 0.9 to 1 for class 3 (Figure 1A,B and Additional file2). The percentage of AD in each class was also computed (Figure 1C). As expected, in class 0, we found only around 10% of AD patients, whereas in class 3, this number was close to 90%, with slight variations related to differences in AD prevalence in these populations (Table 1).

Figure 1
figure 1

Distribution and percentage of the AD and NAD patients between the different classes. (A, B) The mean (±SD) was plotted of the distribution of the AD (A) and NAD (B) patients in the Paris, Lille, and Montpellier cohorts and across the four classes based on the logistic regression or the PLM scale. Significant differences (Student t test) were found between the percentage of AD patients in classes 2 and 3 (A), as well as between the percentage of NAD patients in classes 0 and 1 (B). (C) The mean (±SD) of the percentage of AD patients in each class was plotted. No significant difference was apparent. (D, E) The distribution of AD (dark gray bars) and NAD (light gray bars) patients in the populations of the Rouen, Strasbourg, and Besançon (RSB) centers. The percentage of AD in each class is also plotted (black dots linked a dotted line). Data obtained by using the PLM scale (C) show more AD patients in class 3 and NAD in class 0 than for the logistic regression (E).

The PLM scale composed of four classes was then designed, based on a very simple and intuitive rule: class 0, corresponding to no pathologic biomarkers (below cutoff for Aβ42, above cutoff for tau and p-tau); class 1, corresponding to one pathologic biomarker of three; class 2, corresponding to two pathologic biomarkers of three; and class 3, with all three biomarkers being pathologic. We calculated the sensitivity and specificity of this scale, with classes 0 and 1 grouped on one side, and classes 2 and 3, on the other side for NAD and AD diagnoses, respectively. Hence, a sample was defined as positive for AD if two or three biomarkers were pathologic (class 2 or 3). The performance of this simple rule was very high, as demonstrated by the sensitivity and specificity reached across the different cohorts (see Additional file2). Moreover and interestingly, when all AUCs were compared (Table 2), the PLM scale was second best after the logistic regression.

Distributions of AD and NAD patients, as well as percentage of AD patients in each class, were then computed and plotted for the PLM scale (Figure 1A-C; Additional file2). The percentage of AD patients in each class, which actually corresponded to the positive predictive value (PPV), that is, the percentage of true positives among all samples in the class, was compared between the two prediction models (Figure 1C). Predictive values were not significantly different between the logistic regression and PLM scale (Figure 1C). For the PLM scale, the average predictive values for AD in these cohorts were for class 0, 9.6% (95% CI, 6.0% to 13.2%); for class 1: 24.7% (95% CI, 18.0% to 31.3%), for class 2: 77.2% (95% CI, 67.8% to 86.5%), and for class 3, 94.2% (95% CI, 90.7% to 97.7%) (Figure 2).

Figure 2
figure 2

Graphic illustration of the AD scale.

The difference in patient distribution between the two prediction models was, however, significantly different, as the PLM scale had more AD patients in class 3 as well as more NAD patients in class 0 compared with the prediction obtained with the logistic regression (Figure 1A, B). This distribution of AD and NAD patients in different classes was also evaluated in an additional dataset from Rouen Strasbourg and Besançon (RSB) centers (Table 1). We used the standard cutoffs and logistic regression coefficients (Additional file2) to compute and represent in Figure 1 (panels D and E) the distribution of AD and NAD patients across the different classes, as well as the percentage of AD patients in each class. With the PLM scale, profiles were comparable to those obtained before, with a better distribution of patients (more AD patients in class 3, and more NAD patients in class 0) and similar percentages of AD patients in each class.

To compare the classification resulting from the logistic regression and the PLM scales, we calculated the NRI (see Material and methods section) on the PLM-2 population. We observed that the PLM scale resulted in 23.25% more patients better classified (that is, having a higher scale value in the case of AD patients, and a lower value for non-AD patients). Similar results were obtained when the NRI was applied to the different cohorts (data not shown).

Discussion

Diagnostic measures of AD include clinical observation, assessment of cognitive functions, including memory testing as well as advanced neuroimaging examinations such as magnetic resonance imaging (MRI) or positron emission tomography (PET), and CSF biomarkers[29]. In the present study, we validated the relevance of CSF biomarkers when used individually or within ratios[7, 2123]. Identifying one or several biomarkers or ratios as pathological is clinically relevant for the diagnosis of AD. However, the question remains as to how to use this element, which is often heterogeneous (not all biomarkers being pathological) in clinical practice. This complexity renders problematic the interpretation of biomarker results, and it differs between practitioners or centers.

Different rating systems or scales have being designed to help clinicians evaluate dementia. The clinical dementia rating (CDR), which is a 5-point scale based on cognitive and functional performances, does provide valuable clinical information[30]. The medial temporal lobe atrophy visual rating scale introduced by Scheltens et al.[31] is also a good example of a biomarker scale facilitating the diagnosis of AD. The challenge is to transform a multivariate panel into a score that predicts clinical outcomes. This was recently done to predict the conversion of mild cognitive impairment (MCI) patients into AD by using CSF Aß42, MRI, and PET data[32]. However, this model cannot be applied to our paradigm, which focuses on memory clinic cohorts suspected of having AD or other differential dementia diagnoses. Very relevant in this context is the leading work of Spies et al.,[24] who used a logistic regression on CSF biomarkers and sex to define a predictive model for AD. This model reached the highest performance for the diagnosis of AD, and it provides a continuous predictive scale made of 10 classes of AD probability (that is, none to 10%, 10% to 20%).

When we reproduced this approach for our datasets, the logistic regression did not retain sex in the models. The higher percentage of women in the AD population in the Spies cohort was probably responsible for this difference. In any case, the performance of the logistic regression was outstanding, as it resulted in the highest AUCs among individual or biomarker ratios (Table 2). To match the reports already used in clinical practice, we wanted to generate a simple rating with classes bearing very low (<10%), low (<25%), high (>75%), and very high predictive values (>90%) for AD. Four logistic regression classes with p(AD) ranging from 0 to 0.1, 0.1 to 0.5, 0.5 to 0.9, and 0.9 to 1 were therefore generated. This resulted in classes having an expected percentage of AD prediction (Figure 1C). We then compared this logistic regression approach with an intuitive classification based solely on the numbers (from 0 to 3) of pathological CSF biomarkers (Aß42, tau, and p-tau) (Figure 2).

The performance of this simple scale was evaluated in the different cohorts by using as a criterion of AD the presence of two or three pathologic biomarkers. Surprisingly, this simple rule resulted in high sensibility/specificity/AUCs. The predictive values of this “PLM scale” for the different classes were also similar to the ones obtained with the logistic regression (Figure 1C). Differences in the distribution of AD and NAD patients between classes were, however, noticeable, with significantly more patients correctly classified with the PLM scale. This apparent superiority of the PLM scale was validated by computing the NRI, which is a statistical tool designed to assess improvement in model performance offered by a new classification. This result could be explained by the “discontinuous” nature of the PLM scale, which is better suited to the four probability classes that we had targeted. Of note, a discrepancy between predicted and observed percentage of AD in regression classes was already observed by Spies et al.[24]. This triggered these authors to generate a second model with an extensive correction based on a second regression. This fine-tuning might overfit the models based on the individuals within a cohort, and it is therefore not adapted for the use of the scale in a predictive way for the classification of new patients in memory clinics.

In any case, the PLM scale outperformed the logistic regression and has the advantage of being used without complex mathematical adjustments. In addition, this classification gave a direct access to the percentage of discarding profiles, that is, AD patients with none of the biomarkers being positive (class 0), and conversely, NAD patients with three biomarkers being positive (class 3).

One of our study’s limits resides in using as reference the final clinical diagnosis, which, in the absence of neuropathologic information ,can sometimes remain uncertain. However, diagnoses were established by experienced multidisciplinary teams based on clinical, neuropsychological, and imaging data. The cross-validation of the results in the different cohorts also guarantees the reliability and relevance of our conclusions. Of note, the scale relies on the use of biomarker cutoffs that might differ in the different centers. The continuous improvement of the quality of the assays, thanks to the use of quality control, and the homogenization of preanalytics and operating procedure[1520] may result in a scale that will use “unified” cutoffs. It is important to underline that the predictive value of the PLM scale is valid only to evaluate whether memory impairments or dementia is due to AD for patients seen in memory clinics. In other situations, and in particular for assessing the conversion from MCI to AD, the design and/or predictive value of the PLM scale should be revaluated.

We also must keep in mind that if we expect the distribution of AD and NAD patients across classes to be comparable between cohorts, the predicted percentage of AD patients in a given class depends itself on the prevalence of AD patients in the studied cohort. In the future, this scale might be improved by adding new variables and, in particular, the apolipoprotein E status, which today is not used routinely.

Conclusion

We developed a simple and intuitive prediction model that demonstrated its relevance in identifying groups of patients with different predictive values for AD. This new scale can be used to facilitate the interpretation and routine use of multivariate CSF data and to stratify patients in clinical research trials.