Background

Screening cognitively normal older individuals for elevated cerebral amyloid-beta protein (“Aβ+”) for inclusion in secondary prevention trials for Alzheimer’s disease (AD) is invasive, expensive and slow. The current gold standards for measuring amyloid-β in the brain require either positron emission tomography (PET) or a cerebrospinal fluid (CSF) assay. For example, the Anti-Amyloid Treatment in Asymptomatic Alzheimer’s disease (A4) trial conducted amyloid PET on 4,486 individuals to identify 1,323 Aβ+ individuals, an amyloid PET screen fail rate of 71% (1). The Number Needed to Screen (NNS) to identify each Aβ+ individual was therefore 4,486/1,323 = 3.39 individuals.
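The screen fail rate and NNS quoted above follow directly from the two A4 counts. A minimal sketch of that arithmetic, with variable names chosen here for illustration:

```python
# Screening counts reported for the A4 trial in the text above.
n_screened = 4486   # participants who received an amyloid PET scan
n_positive = 1323   # participants classified as amyloid positive

# Fraction of scanned participants who were not amyloid positive.
screen_fail_rate = 1 - n_positive / n_screened

# Number of scans needed, on average, to find one amyloid-positive participant.
nns = n_screened / n_positive

print(f"screen fail rate: {screen_fail_rate:.0%}")
print(f"NNS: {nns:.2f}")
```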

Trial-Ready Cohort in Preclinical/Prodromal Alzheimer’s Disease (TRC-PAD) is a research program that was initiated to find solutions to these challenges in trial recruitment and site management, as described by Aisen et al. (submitted) (2). The TRC-PAD platform comprises three elements: the Alzheimer’s Prevention Trial (APT) webstudy (aptwebstudy.org), the Site Referral System (SRS) and the Trial Ready Cohort (TRC). The APT webstudy invites participants to enroll into the study. At the time of enrollment, participants are asked for demographic, medical and lifestyle information. They are asked to complete longitudinal web-based cognitive testing and symptom questionnaires. With these data, we aim to estimate the likelihood that an individual is Aβ+ before they are invited to participate in a secondary prevention trial. The SRS refers the APT participants deemed most likely to be Aβ+ to in-clinic assessments, where they proceed with TRC screening. During the TRC screening phase, participants are administered additional testing, including the Preclinical Alzheimer’s Cognitive Composite (PACC) (3) and genotyping, before their eligibility for an amyloid test is assessed.

In this paper, we describe how the prediction models and algorithms used in TRC-PAD were derived from A4 screening data. We anticipate blood-based biomarkers will greatly improve predictions of amyloid positivity, and this is a focus of future work and an aim of TRC-PAD. Predictors in the current analysis are limited to demographics, cognitive and functional assessments, and APOE genotype.

Methods

Population and Study Design

The study design and screening data for A4 have been previously described (7, 8) and Institutional Review Boards have approved both the A4 and TRC-PAD studies. The A4 screening dataset contains N=4,486 participants, of whom 1,323 (29%) were classified as Aβ+. Amyloid PET imaging was conducted with florbetapir F18 and summarized by mean cortical standardized uptake value ratio (SUVR) relative to the whole cerebellum. Participants were considered eligible to continue screening for A4 based on an algorithm combining both quantitative SUVR (≥1.15) and a qualitative visual read performed at a central laboratory. An SUVR between 1.10 and 1.15 was considered to be elevated amyloid only if the visual read was considered positive by a two-reader consensus determination. Participants who were considered Aβ+ were slightly older, with a mean/standard deviation (SD) age of 72.10/4.89 in the Aβ+ group and 70.95/4.53 in the Aβ- group. However, there were no observed differences in sex and education. Aβ+ participants were more likely to have a family history of dementia and at least one APOEe4 allele. In addition, Aβ+ participants performed worse on the screening Preclinical Alzheimer Cognitive Composite (PACC) and had higher scores on the Cognitive Function Index.
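The eligibility rule described above can be sketched as a small function. This is an illustration only: the function and argument names are hypothetical, and it assumes that an SUVR of 1.15 or more counts as elevated regardless of the visual read, which the text does not state explicitly.

```python
def is_amyloid_elevated(suvr: float, visual_read_positive: bool) -> bool:
    """Sketch of the A4 eligibility rule as described in the text.

    Assumption: SUVR >= 1.15 is treated as elevated on its own. The text
    only spells out the role of the visual read for the 1.10 to 1.15 band.
    """
    if suvr >= 1.15:
        return True
    if 1.10 <= suvr < 1.15:
        # Borderline band: elevated only with a positive consensus visual read.
        return visual_read_positive
    return False


print(is_amyloid_elevated(1.20, visual_read_positive=False))
print(is_amyloid_elevated(1.12, visual_read_positive=True))
print(is_amyloid_elevated(1.12, visual_read_positive=False))
```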

Variables

Table 1 describes the collections of predictors that we considered to train different predictive algorithms. All screening data for the A4 Study were collected during supervised clinic visits. However, some components of the A4 screening battery are being captured remotely in the APT webstudy, including demographic, Cogstate brief battery (9), family history (sibling or parent with Alzheimer’s), and Cognitive Function Instrument (10) (CFI) variables indicated in Table 1. We consider predictive algorithms using these “remote” variables only, as well as a more thorough battery that would require a supervised clinic visit with an administration of the PACC (3). In all, we considered 6 models: (1) remote battery without APOE, (2) remote battery with APOE, (3) in-clinic battery without APOE, (4) in-clinic battery with APOE, (5) in-clinic battery with individual PACC component scores without APOE, and (6) in-clinic battery with individual PACC component scores with APOE. The PACC component scores include the Mini-Mental State Exam (MMSE) (11), Wechsler Memory Scale-Revised Logical Memory, Digit Symbol Substitution (DSST), and Free and Cued Selective Reminding Test (FCSRT) (12).

Table 1 Predictors Considered

Statistical Analysis

Extreme Gradient Boosting (XGBoost) (4) is a decision tree-based machine learning technique (6). A single decision tree, or regression tree, is easy to interpret but provides relatively poor prediction. Aggregating a large number of trees can improve prediction accuracy. Boosting is a technique in which models are trained in sequence, with each new model making cumulative improvements. At each iteration, data points that the current ensemble predicts poorly receive more emphasis; in gradient boosting this is achieved by fitting each new tree to the gradient of the loss, which for squared error is the residual. XGBoost is a scalable tree boosting algorithm that is optimized and designed to be highly efficient, flexible, and portable.
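To make the idea of sequential improvement concrete, the sketch below implements a bare-bones boosting loop in plain Python. It is not XGBoost: it uses single-split trees (stumps), one predictor, squared-error loss, and the residual-fitting form of gradient boosting, with no regularization. The toy ages and SUVR values are invented for illustration.

```python
def fit_stump(x, residuals):
    """Return the one-split tree on x that minimizes squared error of the residuals."""
    best = None
    # Candidate thresholds exclude the largest value so both sides are non-empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Fit stumps in sequence, each one to the residuals of the current ensemble."""
    base = sum(y) / len(y)  # start from the mean prediction
    stumps = []
    preds = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(model, x, y):
    """Mean squared error of a fitted model on (x, y)."""
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Invented toy data: age (years) and amyloid PET SUVR.
age = [60, 65, 70, 75, 80, 85]
suvr = [1.00, 1.02, 1.05, 1.10, 1.18, 1.25]

for rounds in (0, 1, 50):
    print(rounds, mse(boost(age, suvr, n_rounds=rounds), age, suvr))
```

The training error shrinks as rounds are added, which is also why the hyper-parameters that limit the number and depth of trees matter for out-of-sample performance.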

XGBoost supports monotone constraints and customized objective functions. We applied monotone constraints to predictors such as age, number of APOEe4 alleles (0, 1 or 2), and assessment scores that we expect to have a generally monotonic relationship with amyloid PET SUVR (Supplemental Figure 1). The default XGBoost objective function is mean squared loss, meaning decision trees are selected to minimize the residual sum of squares. Because XGBoost does not provide confidence intervals with mean squared loss, we applied the Quantile Regression loss function to estimate the 50%, 2.5%, and 97.5% quantiles of the predictions.

The XGBoost model has a number of hyper-parameters that control the bias-variance trade-off (13). Hyper-parameters are fixed before the model is fitted and are not learned from the data. We used 10-fold Cross-Validation (CV) to assess the out-of-sample bias and variance for given hyper-parameter values, and Bayesian Optimization (14) to optimize the hyper-parameter selection. We used SHapley Additive explanation (SHAP) (15) values to summarize the importance of each predictor to the overall predictive accuracy of each model. More details about the model fitting procedures are provided in the supplemental material (Supplemental Table 1).

Our main interest lies in the predictive accuracy of the models. To assess this, we split the data randomly into 80% training and 20% test sets. After fitting the models with the training data, we assessed their predictive accuracy with the independent test data. Analyses were conducted with R version 3.6.2 (r-project.org) with packages xgboost (4) version 0.90.0.2 and mlrMBO (16) version 1.1.2.
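The quantile regression loss referred to above is commonly defined as the "pinball" loss. The sketch below shows that standard definition in plain Python and checks, on invented SUVR values, that the constant minimizing it moves from the lower tail to the upper tail as the quantile level changes. It does not reproduce the custom XGBoost objective used in the analysis, whose exact form is not given here.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean quantile (pinball) loss for quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y_true)


def best_constant(y_true, tau, candidates):
    """Candidate constant prediction with the smallest pinball loss."""
    return min(candidates,
               key=lambda q: pinball_loss(y_true, [q] * len(y_true), tau))


# Invented SUVR values for illustration only.
suvr = [0.95, 1.00, 1.02, 1.05, 1.08, 1.10, 1.15, 1.22, 1.30, 1.45]
grid = [round(0.90 + 0.01 * i, 2) for i in range(61)]  # 0.90, 0.91, ..., 1.50

lower = best_constant(suvr, 0.025, grid)
median = best_constant(suvr, 0.5, grid)
upper = best_constant(suvr, 0.975, grid)
print(lower, median, upper)
```

Fitting one model per quantile level in this way yields a central prediction (50%) together with lower (2.5%) and upper (97.5%) bounds.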

Figure 1

Contribution of 5 best predictors in each model

Using the model training data, we see the contribution to prediction accuracy expressed in terms of the mean absolute SHAP value (mean |SHAP|). Abbreviations: SHAP, SHapley Additive explanation; OCL, One Card Learning; OBR, One Back Reaction; DER, Detection Reaction; IDR, Identification Reaction; FH, Family History; FH P, FH Parent; FH S, FH Sibling; CFI, Cognitive Function Index; CFI Pt, CFI Participant; CFI SP, CFI Study Partner; ADL, Activities of Daily Living; ADL Pt, ADL Participant; ADL SP, ADL Study Partner; PACC, Preclinical Alzheimer Cognitive Composite

Results

Figure 1 shows the relative contribution, in terms of SHAP values, of each predictor to the predictive accuracy of each model. As expected, when available, APOE genotype is the most important predictor for these cross-sectional models. Age, CFI, education, and family history also enter the top 5 most valuable predictors in some models. Figure 2, which shows the Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) for the 6 models, also demonstrates the relative value of APOE. The dashed lines are models fitted without the APOEe4 variable and the solid lines are for models that include APOEe4. The ROC curves were generated using a cut point SUVR value of 1.15 for a binary separation between amyloid positive and negative. In general, we see AUCs in the range 0.60 (without APOE) to 0.73 (with APOE).
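As an illustration of how an AUC can be obtained from continuous SUVR predictions, the sketch below dichotomizes observed SUVR at the 1.15 cut point and computes the AUC as the proportion of positive/negative pairs that the prediction ranks correctly. The data are invented and the function name is hypothetical; it is not the code used to produce Figure 2.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive case scores higher than
    a random negative case, with ties counted as one half."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented values for illustration only.
observed_suvr = [1.02, 1.30, 1.08, 1.21, 1.12, 1.40]
predicted_suvr = [1.05, 1.18, 1.10, 1.09, 1.07, 1.25]

# Binary amyloid status from the observed SUVR at the 1.15 cut point.
amyloid_positive = [s >= 1.15 for s in observed_suvr]

print(round(auc(predicted_suvr, amyloid_positive), 3))
```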

Figure 2

ROC curves and AUCs

ROCs and AUCs for each model are determined using the independent test set and Aβ+ set to SUVR ≥ 1.15. The colors represent the setting type: Remote (red), In-Clinic (blue) and PACC components (blue). Abbreviations: ROC, Receiver Operating Characteristic; AUC, Area Under the Curve; PACC, Preclinical Alzheimer Cognitive Composite

Figure 3 expresses prediction accuracy in terms of screening for a clinical trial. The top panel shows 1/Positive Predictive Value (PPV), which is equivalent to the number needed to screen (with amyloid PET) to identify one eligible participant. In this figure, movement along the horizontal axis represents varying the threshold applied to SUVRs predicted from each model. The bottom panel provides the required number of potential participants (e.g. webstudy participants) in order to identify 1,000 Aβ+ participants.
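The quantities plotted in Figure 3 can be computed from paired predicted and observed SUVR values. The sketch below is one way to do so, with invented data and hypothetical names. The registry size calculation assumes that a future screening pool resembles the data at hand and that everyone at or above the threshold is scanned.

```python
def screening_metrics(predicted_suvr, observed_suvr, threshold,
                      cut_point=1.15, target_positives=1000):
    """PPV, NNS (1/PPV), sensitivity, and the pool size needed to find
    `target_positives` amyloid-positive participants at a given threshold."""
    selected = [p >= threshold for p in predicted_suvr]   # sent for amyloid PET
    positive = [o >= cut_point for o in observed_suvr]    # truly amyloid positive
    true_pos = sum(s and a for s, a in zip(selected, positive))
    if true_pos == 0:
        raise ValueError("no amyloid-positive participants selected at this threshold")
    ppv = true_pos / sum(selected)
    return {
        "ppv": ppv,
        "nns": 1 / ppv,
        "sensitivity": true_pos / sum(positive),
        # Pool members needed per positive found, scaled to the target.
        "registry_size": target_positives * len(predicted_suvr) / true_pos,
    }


# Invented values for illustration only.
observed = [1.02, 1.30, 1.08, 1.21, 1.12, 1.40]
predicted = [1.05, 1.18, 1.10, 1.09, 1.07, 1.25]

metrics = screening_metrics(predicted, observed, threshold=1.08)
print(metrics)
```

Raising the threshold generally lowers the NNS but also lowers sensitivity, so a larger pool is needed to reach the same number of amyloid-positive participants.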

Figure 3

Number needed to screen and required registry size

The top panel shows the number needed to screen (which is equivalent to 1 / PPV) with amyloid PET to identify one Aβ+ individual by applying the given SUVR threshold to the values predicted from each model. The middle panel shows sensitivity. The models not containing APOEe4 all have lower sensitivity. The bottom panel shows the size of the screening pool (e.g. web-based registry) that would be required to recruit 1,000 Aβ+ individuals by applying the given SUVR threshold to values predicted from each model.

Abbreviations: PPV, Positive Predictive Value; SUVR, Standardized Uptake Value Ratio; PACC, Preclinical Alzheimer Cognitive Composite; PET, positron emission tomography

Table 2 reports operating characteristics from several screening algorithm scenarios. The top half provides operating characteristics when a threshold is selected to provide 50% prediction prevalence (i.e. select half the participant pool to receive amyloid PET scans). With 50% prediction prevalence, the NNS is about 2.5 participants with APOE and 3.0 participants without APOE. When the threshold for predicted amyloid PET is increased to 1.15, the NNS is reduced to about 1.7 participants with APOE and 2.5 participants without APOE. However, this results in much lower sensitivity, and as we can see from Figure 3, a threshold of 1.15 would be practical only with participant registries of 10,000–13,000 to identify 1,000 Aβ+ participants.
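A threshold that yields a chosen prediction prevalence, such as the 50% scenario above, can be read off the ranked predictions. The sketch below is a minimal illustration with invented values and a hypothetical function name; tied predictions at the threshold could select slightly more than the requested share.

```python
def threshold_for_prevalence(predicted_suvr, prevalence):
    """Predicted SUVR value such that roughly `prevalence` of the pool is at or above it."""
    ranked = sorted(predicted_suvr, reverse=True)
    n_select = max(1, round(prevalence * len(ranked)))  # how many to send for PET
    return ranked[n_select - 1]


# Invented predictions for illustration only.
predicted = [1.05, 1.18, 1.10, 1.09, 1.07, 1.25]

threshold = threshold_for_prevalence(predicted, 0.5)
n_selected = sum(p >= threshold for p in predicted)
print(threshold, n_selected)
```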

Table 2 Operating characteristics of screening algorithms using the test data with Aβ+ set to SUVR ≥ 1.15
Table 3 Demographic characteristics of amyloid positive selections from the test data with Aβ+ set to SUVR ≥ 1.15

Discussion

This work, in the context of the TRC-PAD platform, can facilitate the development of participant selection algorithms. TRC-PAD has two main selection points: the first is from the APT webstudy to in-clinic assessment (stage 1) and the second is from in-clinic assessment to amyloid testing (stage 2). In stage 1, consented webstudy participants are referred to their nearest TRC-PAD site, identified via self-reported zip codes. They are then ranked based on their SUVR prediction. In addition to this predicted SUVR, the selection process considers demographics, to achieve diversity, and whether the participant has known prior amyloid testing and results. During the first in-clinic visit of the participants referred in stage 1, additional cognitive testing, in the form of the PACC, and APOE genotyping are performed. With this additional information, the SUVR predictions are updated and presented for central authorization of amyloid testing.

This work has shown that by collecting relatively simple demographics, cognitive and functional assessments remotely, via the webstudy, we will be able to reduce screen fail rates and improve enrollment. Even small improvements in NNS can have a large impact on the expense of screening for Preclinical AD clinical trials. For example, assuming a conservative estimate of 3,500 US Dollars (USD) per scan, the A4 study spent a total of about 4,486 × 3,500 USD = 15,701,000 USD on screening amyloid PET scans alone to identify 1,323 Aβ+ individuals (NNS=3.39). Reducing the NNS from 3.39 to 2.62, which seems plausible with the simplest remote battery, would have reduced this cost by 3,569,090 USD to 1,323 × 2.62 × 3,500 USD = 12,131,910 USD. In addition to the remote data setting, this work considered the value of APOE genotyping and collection of the PACC during an in-clinic screening. Adding APOE genotype might reduce NNS to below 2.00, for a total PET screening cost of 1,323 × 2.00 × 3,500 USD = 9,261,000 USD. The financial impact would be less with a cerebrospinal fluid (CSF)-based, or blood-based, amyloid screen, but the impact on subject and site burden would remain significant. From a statistical aspect, we have demonstrated the use of machine learning techniques both to optimize hyper-parameters, via Bayesian Optimization, and to produce predictive models using XGBoost. We have also illustrated how the SHAP metric can be used to make inferences from a modelling approach that is primarily used for prediction.
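The cost figures above can be reproduced with a few lines of arithmetic. The sketch below uses the per-scan cost and NNS values quoted in the text; the function and constant names are chosen here for illustration.

```python
COST_PER_SCAN_USD = 3500   # conservative per-scan estimate from the text
N_POSITIVE = 1323          # amyloid-positive participants identified in A4


def pet_screening_cost(nns, n_positive=N_POSITIVE, cost_per_scan=COST_PER_SCAN_USD):
    """Total PET screening cost to find n_positive participants at a given NNS."""
    return n_positive * nns * cost_per_scan


a4_cost = 4486 * COST_PER_SCAN_USD          # actual number of A4 screening scans
remote_cost = pet_screening_cost(2.62)      # simplest remote battery
apoe_cost = pet_screening_cost(2.00)        # with APOE genotype added

print(f"A4 screening cost:        {a4_cost:,.0f} USD")
print(f"Remote battery (NNS 2.62): {remote_cost:,.0f} USD")
print(f"Saving vs A4:             {a4_cost - remote_cost:,.0f} USD")
print(f"With APOE (NNS 2.00):     {apoe_cost:,.0f} USD")
```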

One limitation of these pre-screening algorithms is that the cohort characteristics will be impacted. For example, we would expect the algorithms to produce an older cohort with an even greater proportion of APOEe4 carriers than a cohort selected without a pre-screen. This could be mitigated by stratifying the screening process to ensure an adequate sample of younger APOEe4 non-carriers, but with adverse effects on the NNS. Another consideration is the inability of these models to extrapolate beyond the range of the data in continuous variables such as age. A second potential limitation is bias in the training data. As we start using these models in TRC-PAD and collect additional data, we will assess whether the models are biased against any additional covariates collected.

Future work will focus on utilizing longitudinal cognitive and functional change and/or the use of blood-based biomarkers to improve the performance of these predictive models and algorithms. We anticipate, based on analyses of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (5), that longitudinal change may be a valuable predictor of amyloid status. In addition, we will incorporate plasma amyloid peptide ratios (currently in validation testing) into the final stage of prediction and expect a large improvement in prediction.