Resolving the genetic background of Alzheimer’s disease (AD) has proven to contribute greatly to our understanding of underlying disease processes, for instance with the discovery of APP [25], PSEN1 [53], and PSEN2 [52] in family-based studies, leading to the amyloid cascade theory [38]. In addition, genome-wide association studies (GWAS) in AD have convincingly highlighted the importance of microglia [33, 56], a finding previously supported by research from other scientific fields [18, 72, 75], and now also widely accepted as a genetic cause rather than a result of AD pathogenesis. Further exploration of genetic risk factors contributing to AD development and pathogenesis might reveal more biological insights, an important step in the quest for AD treatment that will slow down or even halt disease progression.

GWAS of clinically diagnosed AD patients have been successful, and current efforts largely focus on increasing sample size to improve the statistical power to detect genetic variants [6, 70]. An alternative approach is to study effects of genetic variants on pathophysiological features of AD. The strength of such studies is based on the assumption that more objective measurable biological properties are more strongly associated with the underlying AD pathology than the clinical diagnostic classifications (e.g., misclassifications or symptoms not manifested yet), thereby allowing to detect larger effects by reducing heterogeneity [26]. The use of biomarkers further enables to identify genetic effects specific for certain AD-related biological mechanisms. This is an advantage over the conventional GWAS approach for clinical AD diagnosis, where it generally remains unclear through what causal gene or cellular process a locus is associated to AD.

It is possible to measure levels of amyloid-beta-42 (Aβ42) and (phosphorylated) tau (pTau and Tau) in cerebrospinal fluid (CSF), the two major proteins implicated in the AD pathological process. Aβ42 pathology in the brain is negatively correlated with CSF Aβ42 levels, where a decrease in CSF Aβ42 is indicative of AD [49, 58]. CSF (p)Tau is positively correlated with (p)Tau pathology in the brain, and therefore higher CSF (p)Tau levels are observed in patients with AD. CSF pTau is presumed to reflect AD-type tau-tangles more specifically than total tau [49, 58]. Previous studies on CSF amyloid-beta and (p)Tau have identified genetic risk loci, the most recent one including 3,146 individuals [17]. Some of the 8 discovered loci had not been previously associated with AD, emphasizing the potential of endophenotypes to reveal novel genetic risk factors. Our current study aimed to further define the genetic background of AD by studying the genetic effects on CSF Aβ42 and pTau levels in a total of 13,116 individuals.

Materials and methods


We combined data from 16 European cohorts, encompassing a total of 8074 individuals (Table 1; Online Resource 1—Table 1; Online Resource 2—Fig. 1) with both genotype data and CSF measurements. The majority of these cohorts (82%) are part of the EADB consortium [6], and included the full spectrum of clinical severity potentially leading to AD, from subjective cognitive decline, mild cognitive impairment, to dementia. Written informed consent was obtained from study participants or, for those with substantial cognitive impairment, from a caregiver, legal guardian, or other proxy. Study protocols for all cohorts were reviewed and approved by the appropriate institutional review boards.

Table 1 Demographic information on cohorts of stage 1 discovery analysis

For replication, 15 cohorts totaling 5042 individuals (Online Resource 1—Table 2) were available to attempt replication of the association signals to Aβ42 and pTau, for the variants with P value < 1e–5 in the discovery analysis. Data from all cohorts, except one (NorCog from University of Oslo, Norway), were obtained through collaboration with the previous largest GWAS on CSF Aβ42 and pTau, mostly including cohorts originating from the United States [17]. Basic demographics are described in Online Resource 1—Table 2, more detailed cohort information is described elsewhere [17].

CSF measurements

Due to the multi-center approach, CSF protein levels were measured with various CSF protein assays (Online Resource 1—Table 2). Aβ42 was measured with ELISA, Lumipulse or V-PLEX, and pTau with ELISA or Lumipulse. For details on specific lab procedures, see the original studies [2,3,4, 17, 21, 23, 27, 34, 36, 42, 47, 50, 57, 62, 78]. Protein levels were log10 transformed and normalized within cohorts and CSF assay type (if multiple assays were used within a single cohort) to approximate a normal distribution to correct for the application of various CSF assays across different studies. Then, the normalized protein levels were used as continuous phenotypes in the association analyses. The distribution of raw and normalized CSF protein levels are displayed in Online Resource 2—Figs. 2–13, for the cohorts for which we have individual-level data available (indicated in Online Resource 1—Table 1; column D). For the stratified analyses, two subgroups of individuals, amyloid normal and amyloid abnormal, were defined based on their Aβ42 status. Individuals with an untransformed Aβ42 level below a threshold were assigned to the abnormal amyloid level group. The thresholds were defined by the individual research groups as it depends on technical circumstances, and are displayed in Online Resource 1—Table 2.

Genotyping, quality control and imputation

The genetic data for the EADB cohorts have been processed in a homogeneous approach (Online Resource 1—Table 1), in which the Illumina Infinium Global Screening Array (GSA, GSAsharedCUSTOM_24 + v1.0) was predominantly used for data generation. Additional arrays included the Axiom 815 K Spanish biobank array (Thermo Fisher) for ACE (Barcelona, Spain) and Valdecilla (Santander, Spain) cohorts, and the Illumina Neurochip array (Gothenburg, Sweden). Standard quality control (QC) procedures were performed to exclude individuals and variants with low quality, in general followed by imputation with the Trans-Omics for Precision Medicine (TOPMed) reference panel [13, 59]. For the EADB cohorts for which GSA genotype level data were available, the details on QC steps and imputation with the TOPMed reference panel were previously described [6].

For the Spanish ACE and Valdecilla cohorts, QC procedures are described in another study [16], followed by imputation with the TOPMed reference panel. For the Gothenburg H70 Birth Cohort studies and clinical AD samples from Sweden, the QC and imputation procedures were described elsewhere [48]. Post‐imputation QC only included variants with a high imputation quality (RSQ [imputation quality] > 0.8). The UCSC LiftOver program ( and Plink v2.0 ( [10] were used to lift the GRCh37 genomic positions to GRCh38, the genomic build for all other datasets. All genotypes were hard called using the default Plink v2.0 ( settings.

Heritability and genetic correlation

For the estimation of the SNP-heritability, two distinct tools were used. With LD score regression (LDSC), it was possible to perform the calculations with the full number of samples as the input for this analysis is the summary statistics. Besides heritability estimates, genetic correlations were also calculated for Aβ42, pTau, tau (to test the similarity in genetic background to pTau), and two previously published AD summary statistics [33, 43]. Precalculated LD scores from the 1,000 Genomes European reference population were obtained online. All estimates were based on HapMap3 SNPs only to ensure high-quality LD score calculations ( As a rule of thumb, LD Score regression tends to yield very noisy results when applied to datasets with fewer than 5000 individuals ( [9]. The summary statistics for the stratified analyses were, therefore, not considered.

For comparison to SNP-heritability estimates of previous studies for Aβ42 and pTau, GCTA v1.9 [74] was applied to the individual-level genotype data of the largest dataset (Netherlands). Other datasets were not considered as the sample size was too low for small standard errors, thereby impossible to draw any meaningful conclusions from the estimates. The restricted maximum likelihood (REML) analysis was performed for the log10-transformed normalized CSF Aβ42 and pTau adjusted for gender, age, and the first 10 principal components. Variance explained could not be calculated for significant loci only as p-values from GWAS results of a large independent sample are unavailable, and calculation in the Dutch sample would be hampered by winners-curse, causing inflation.

Single-marker association

Genome-wide association analysis for each cohort was performed in PLINK v2.0 [10], using linear regression for the continuous phenotypes Aβ42, tau and pTau. Association tests were adjusted for gender, age, assay type (if applicable), and ten ancestry principal components. Only variants with a minor allele frequency threshold above 0.01 were tested. For smaller cohorts (n < 250 individuals) this threshold was set to 0.05 to avoid false positive findings.

Association analyses were repeated for subgroups, stratified according to APOE4 status (based on the high-quality (R2 > 0.8) imputed variants rs429358 and rs7412) or dichotomous Aβ42 status, resulting in the following groups: (1) APOE4 (hetero- and homozygous) carriers; (2) APOE4 non-carriers; (3) individuals with abnormal Aβ42 levels; and (4) individuals with normal Aβ42 levels. After stratification, cohorts with a minimal sample size of 100 individuals were included. Covariates were those described for the main analyses above.

Independent replication

A total of 5042 samples from 15 cohorts were included for the replication analysis. The genetic data for NorCog were generated with 2 different genotyping assays. Extensive QC procedures which are detailed elsewhere [6], allowed for joined genetic analyses of these sub-datasets. Variant association testing was performed according to the association analysis section above. For all other replication cohorts, QC, imputation and association testing procedures are described elsewhere [17]. In short, individual and variant QC standards were met, and imputation was performed using the 1000 Genomes Project Phase 3 reference panel. Each dataset was QCed and imputed independently. The additive linear regression model in PLINK v1.9 [10] was used for single-variant analyses.


METAL [71] was used for meta-analyses in stages 1–3 of the per-cohort association results, applying the default approach that utilizes P value and direction of effect, while weighted according to sample size. For stage 1, we used the genome-wide threshold for significance of P < 5 × 10−8, and a suggestive threshold of P < 1 × 10−5 to select variants to study in Stage 2. Stage 2 variants were considered a replication with P < 0.05 and same direction of effect in comparison to stage 1. The genome-wide threshold for significance of P < 5 × 10−8 was used to defined GWAS hits in stage 3.


All variants within 1.5 megabases (Mb) of the lead variant of each genomic risk loci were used in the colocalization analysis. The stage 1 GWAS summary statistics were used for the CSF loci aiming consist sample sizes across the variants. Colocalization comparisons were performed to eQTL data for brain and immune-related tissues and cell-types, which were obtained from the eQTL catalog [39] release 5. The microglia data were obtained from Young et al. [77]. GWAS summary statistics for loci comparison to other GWAS studies were obtained from Kunkle et al. [43] for AD, and from Vojinovic et al. [65] for brain ventricular volume. The GWAS data and eQTL data were trimmed so that all variants overlap. Colocalization was performed with the Coloc R package [24], using the coloc.abf function for the approach assuming a single causal variant, and the runsusie and coloc.susie functions to test for colocalization relaxing this assumption to multiple variants. The latter approach was performed for the comparisons to AD and brain volume loci, and LD matrices were calculated with our own individual-level data for the significant CSF loci using LDstore R package [7]. Default priors were used for prior probability of association with the GWAS data and eQTL data. The prior probability of colocalization was set as 5 × 10−6 as recommended [67]. Nominal P, sample size and MAF were used when beta and variance of beta were not available for the GWAS data or eQTL data. Colocalizations with a posterior probability > 0.8 were considered successful colocalizations. Comparisons were visualized with the R package LocusCompareR (

Gene-based analysis

Gene-based and gene-set association tests were performed using MAGMA v1.08 [14], which was implemented by FUMA [69]. The per variant association summary statistics for the main results served as the input, where variants were selected if mapped within 18,870 protein-coding genes (with unique ensembl ID). The mean SNP-wise model was implemented. The Bonferroni-corrected significance threshold was set to P < 2.65 × 10–6, based on the number of tested genes.

Gene mapping

The genome-wide significant loci of the main results were further explored for promising causal AD genes using FUMA [69], after lifting over the results with genomic build GRCh38 to GRCh37 with the UCSC LiftOver Program ( Two gene mapping strategies were used:

  • Positional mapping maps SNPs to genes based on physical distance (within a 10-kb window) from known protein-coding genes in the human reference assembly (GRCh37/hg19).

  • eQTL mapping maps SNPs to genes with which they show a significant eQTL association (that is, allelic variation at the SNP is associated with the expression level of that gene). eQTL mapping uses information from 85 brain- and immune-related tissue types in 11 data repositories (BIOSQTL, BloodeQTL, BRAINEAC, CMC, DICE, eQTLcatalogue, eQTLGen, GTEx, PsychENCODE, scRNA_eQTLs, xQTLServer), and is based on cis-eQTLs which can map SNPs to genes up to 1 Mb apart. We used a false discovery rate of 0.05 to define significant eQTL associations.

Phenome-wide association studies (PheWAS)

We conducted phenome-wide association studies (PheWAS) on the top SNPs, rs4844610, rs429358, rs744373, rs9877502, rs4843559. A PheWAS starts out with a single to a few variants of interest that are systematically being tested for association to many phenotypes. We used the ‘phewas’ function of the R-package ‘ieugwasr’ [19, 31]. Using this function, we searched traits that associate with the list of SNPs with P < 1 × 10–7 in all GWAS harmonized summary statistics in the MRC IEU OpenGWAS data infrastructure [31]. In short, this enables us to screen for other traits to which these SNPs are associated. The database (May 2021) includes the GWAS summary statistics of 19,649 traits.

Association with CSF proteomics

We associated the lead variants near GMNC (rs9877502) and in C16orf95 (rs4843559) with CSF proteomics data of two different sources (EMIF-AD MBD and Knight-ADRC). For the EMIF-AD MBD data, a total of 2,136 proteins were quantified centrally using 11-plex tandem mass tag spectrometry in 366 individuals from the EMIF-AD MBD study [61] (subset of Amsterdam Dementia Cohort within EADB). We selected proteins with a maximum of 50% missing values. For related proteins that had identical values due to fragment aspecificity, we randomly selected one protein for analysis (52 proteins were excluded). Out of the 2136 proteins quantified, 1282 (55.4%) proteins respected these criteria and were included in the study.

For the Knight-ADRC data, levels of 1305 proteins were quantified using the SOMAscan assay, a multiplexed, aptamer-based platform CSF (n = 717) [73]. Quality control was performed at the sample and aptamer levels using control aptamers (positive and negative controls) and calibrator samples. As described in detail [73], additional quality control was performed that included limit of detection cut-off, scale factor, coefficient of variation, and outlier variation. Only proteins with a call rate higher than 85% call rate were included. A total of 713 proteins passed quality control. pQTL analyses was performed and reported in previous studies [73].

g:Profiler and Enrichment map [44], a Cytoscape App, were used to perform pathway enrichment analyses on proteins with a certain level of association (EMIF-AD MBD: P < 0.05; Knight-ADRC: P < 0.004, corresponding to proteins with a similar effect size as in EMIF-AD MBD). The results are shown as functionally grouped networks. We used GO biological processes and Reactome as ontology sources. For this explorative analysis, only pathways with P < 0.05 (corrected for multiple testing) are shown.

Effects of AD-associated variants on Aβ42 and pTau

We assessed the most recent GWAS [6] for AD and extracted the top loci of 83 variants (excluding APOE ɛ4 and APOE ɛ2) that showed genome-wide significant association with AD [6]. We extracted Z-scores and P values and plotted them in a heatmap. Rows and columns were clustered using Euclidean distances and average hierarchical clustering. We performed a gene-set enrichment analysis to find molecular pathways enriched within each cluster. The SNP-gene assignment corresponds to the one described in the recent main EADB GWAS [6], including several annotation strategies. When multiple genes were reported to associate with the same SNP (rs12590654 near SLC24A4/RIN3, rs7225151 near SCIMP/RABEP1, rs6846529 near CLNK/HS3ST1, rs7384878 near ZCWPW1/NYAP1 and rs10437655 near CELF1/SPI1), we considered both genes for the gene-set enrichment analysis. In addition, for SNP rs6605556, located in the complex HLA region, we considered HLA-DRB1 gene (eQTL in blood with rs6605556), and for SNPs rs7157106 and rs10131280, both located in the gene-dense IGH region, we considered IGHG2 and IGHV2-70 (eQTLs in blood with rs7157106 and rs10131280, respectively). The gene-set enrichment analysis was performed specifying Biological Processes from Gene Ontology [1, 5] as gene-set and correcting P values with Bonferroni. Biological pathways were considered significant at corrected P < 0.01. To help with the interpretation of each cluster’s function, we plot the most recurring words of the significant terms underlying each cluster using wordclouds. The following R packages were used for these analysis: gprofiler2 [41] and wordcloud2 (


The overview of the study design is illustrated in Online Resource 2—Fig. 1. The GWA results from 16 studies were combined in stage 1. Variants that reached a suggestive level of significance (P < 1 × 10–5) were subsequently evaluated in an independent sample from 15 studies in stage 2. Finally, the results of stage 1 and stage 2 analyses were combined in stage 3. Detailed information on study participants, CSF acquisition and genotyping is provided in Table 1 and Online Resource 1—Tables 1 and 2. The results for tau and pTau are strongly correlated (rg = 0.94; P = 1.86 × 10–118), and therefore only pTau findings are reported.

Genetic architecture

The fraction of variance in Aβ42 and pTau protein levels that could be explained by the additive effect of the genetic variants tested, was estimated on 0.13 (SE = 0.06) and 0.21 (SE = 0.07) by LDSC, respectively. These SNP-heritabilities are substantially higher than the 0.07 previously estimated with LDSC for the diagnosis AD [80], or similarly reported for AD by this study using the same LDSC method on more recent public GWAS summary statistics of AD [33, 43] (Online Resource 1—Table 3). GCTA estimated the SNP-heritability to be higher for both Aβ42 and pTau, namely 0.27 (SE = 0.13) and 0.34 (SE = 0.12), respectively. Both methods are reporting a higher SNP-heritability for pTau than for Aβ42. Genetic correlation estimates with AD GWAS summary statistics are described in Online Resource 1—Results and Online Resource 1—Table 4.

GWAS variants associated with CSF Aβ42 and pTau

The stage 1 meta-analyses (QQ plots and lambda shown in Online Resource 2—Fig. 14) identified 4 independent significant variant associations, 1 for Aβ42 and 3 for pTau (Table 2). The strongest associations were observed for both Aβ42 and pTau in the APOE locus (Table 2). The variant that determines the APOE ɛ4 allele (rs429358-C) decreased Aβ42 (Z = − 36.29; P = 2.0 × 10–288) and increased pTau (Z = 18.31; P = 6.87 × 10–75) in CSF. In contrast, the variant determining the APOE ɛ2 haplotype (rs7412-T) increased Aβ42 (Z = 11.97; P = 5.09 × 10–33) and decreased pTau (Z = − 6.59; P = 4.49 × 10–11). The APOE ɛ2 association was replicated in stage 2 (Aβ42: Z = 7.27; P = 3.73 × 10–13, and pTau: Z = − 6.43; P = 1.26 × 10–10), and APOE ɛ4 with rs4420638-G for Aβ42 (Z = − 25.51; P = 1.57 × 10–143), and with rs769449-A for pTau (Z = 13.83; P = 1.66 × 10–43), both variants in high linkage disequilibrium with rs429358, as the original APOE ɛ4 variant was not genotyped or imputed in the replication datasets.

Table 2 Meta-analysis association results for the 3 stages

In stage 1, no other significant loci were observed for Aβ42. For pTau, we further identified significant associations mapping to two chromosomal regions at 3q28 and 16q24.2 (Table 2). The 3q28-locus (Z = 9.06; P = 1.28 × 10–19) also known as the GMNC locus, was reported previously for its association with pTau [17]. The cohorts from this previous study are the replication cohorts of the current study, thereby logically GMNC was replicated (Z = 8.97; P = 3.00 × 10–19). The 16q24.2 locus (Z = 6.41; P = 1.49 × 10–10), which is novel for pTau, was replicated in the stage 2 meta-analysis (Z = 3.88; P = 1.03 × 10–04).

Subsequently, the results from all individual studies were combined in the stage 3 meta-analysis (N = 13,116). In stage 3, two well-known AD loci showed additional genome-wide significant associations (Fig. 1, Table 2) with Aβ42 in chromosomal region 1q32.2 (Z = − 6.01; P = 1.84 × 10–9, CR1; Fig. 1a), and with pTau for the region 2q14.3 (Z = 6.15; P = 7.88 × 10–10, BIN1; Fig. 1b). The per-cohort and zoomed in genomic location details of all significantly associated loci of stage 3 are visualized in Online Resource 2—Figs. 15–20. Colocalization analyses (results detailed in Online Resource 1—Table 5; Online Resource 2—Fig. 21 and 22) showed colocalization of the CR1 (posterior probability = 0.97) and BIN1 (posterior probability = 0.82) loci to these loci in a recent AD GWAS [43]. For the BIN1 locus of the AD GWAS two independent causal signals were observed, of which our BIN1 locus colocalized (posterior probability = 0.84) with the first signal that was tagged by rs6733839, which is the most significant variant of the BIN1 locus of the AD GWAS.

Fig. 1
figure 1

Manhattan plots of the stage 3 GWAS results. a Results visualized for CSFAβ42; b Results visualized for CSF pTau. The y-axes are limited to visualize the non-APOE loci. The lowest P values for APOE are 4.07 × 10–355 and 3.74 × 10–94 for Aβ42 and pTau, respectively

Explorative meta-analyses were repeated stratified for APOE (APOE ɛ4 carriers (n = 3240) vs. APOE ɛ4 non-carriers (n = 3201)) and amyloid status (Amyloid normal levels (n = 3182) vs. amyloid abnormal levels (n = 3775)) for stage 1 (QQ plots and lambda shown in Online Resource 2—Figs. 23 and 24), of which the results are visualized in Online Resource 2—Figs. 25 and 26, and detailed in Online Resource 2—Results and Online Resource 1—Table 6. Besides the APOE and GMNC loci, two novel loci are observed that have previously not been linked to any AD phenotype.

Functional interpretation

To interpret the functional effects of the identified variants beyond AD, we performed gene prioritization (based on positional mapping, gene-based association results, and brain and immune eQTL annotations) using FUMA [69], colocalization analyses and PheWAS. The results of the FUMA annotation are detailed in the Online Resource 2—Results and Online Resource 1—Table 7. Ten of our CSF Aβ42 and pTau loci colocalized with one of the brain or immune eQTLs from the 63 tested datasets, which are reported in Online Resource 1—Table 8. The APOE locus for both Aβ42 and pTau colocalized with an eQTL for NKPD1 in a specific immune helper T cell. The CR1 locus for AB42 colocalized with an eQTL for the CR1 gene in 6 brain tissues (hippocampus, caudate, putamen, dorsolateral prefrontal cortex, frontal cortex, cortex). Additionally for the dorsolateral prefrontal cortex, the CR1 locus also colocalized with an eQTL for the AL137789.1 gene. The BIN1 locus colocalized with an eQTL for BIN1 in lymphoblastoid cell lines.

For the PheWAS, using data from publicly available genome-wide association studies (N = 19,649) of the five top variants yielded 529 associations at P < 1 × 10–7 (Online Resource 1—Table 9). The majority is the known wide range of 490 traits associations with the APOE ɛ4 allele. For the other variants 39 associations were reported for 27 unique traits. These traits can be categorized in three groups: traits related to brain ventricular volumes in particular the lateral-ventricle (GMNC and C16orf95), Alzheimer’s disease diagnosis (BIN1 and CR1), and measures of blood cell/lymphocyte counts (CR1). The regional pTau associations of GMNC and C16orf95 overlapped with ventricular volume (Fig. 2). Colocalization results (Online Resource 1—Table 5; Online Resource 2—Figs. 27 and 28) imply the same causal variant for C16orf95 (posterior probability = 0.93). The GMNC locus cannot be convincingly explained by the same causal variant (posterior probability = 0.63). The coloc method combined with Susie reports only 1 causal signal for both GWAS. Of note, the colocalization probability improves to 0.74 when first reducing the GMNC-locus variants to a credible set of variants. Although these results are subthreshold to the predefined posterior probability limit of 0.8, they do imply that the same causal variant is likely underlying the GMNC association signals for pTau and AD.

Fig. 2
figure 2

LocusZoom plots showing variant association results for a GMNC and b C16orf95 loci. In black, the pTau association signals of this study; and in orange, the lateral ventricular volume (LVV) association signals observed in other studies

Because of the overlap in effect of GMNC and C16orf95, we hypothesized that these two loci affect the same biological pathways. We explored this hypothesis using CSF proteomics datasets of the EMIF consortium with 1284 quantified proteins and of Knight-ADRC with 696 quantified proteins (of which 42% overlap with the EMIF-AD MBD proteins). For GMNC there were 279 (22%) proteins associated in the EMIF-AD MBD data (Online Resource 1—Table 10) and 255 (36%) proteins in the Knight-ADRC data (Online Resource 1—Table 11). C16orf95 could only be tested in the EMIF-AD MBD data in which 73 (6%) proteins were associated (Online Resource 1—Table 10). Only 2 proteins (CDH9 and DPP6) overlapped between the 2 loci. We studied the overlap in affected pathways between the associated protein lists. For GMNC, consistent functional group networks between the 2 tested datasets were axon guidance and ephrin signaling (Online Resource 2—Fig. 29, Online Resource 1—Tables 12 and 13), while for C16orf95 (only based on the EMIF-AD MBD data) glycosaminoglycan metabolism and ECM organization were overrepresented functional groups (Online Resource 2—Fig. 30, Online Resource 1—Table 14). There was little overlap between the loci in the pathways that emerged from the protein lists.

Relation to AD-associated genetic variants

Because of the evident overlap in etiology with clinical AD dementia, we examined the association of all known AD loci (excluding the APOE locus) with CSF Aβ42 and pTau. This analysis has an explorative character as the small contribution of each individual AD loci to CSF amyloid and pTau is in general reflected in moderate effect sizes and association signals. However, patterns of reasonable signal could facilitate the generation of biological hypotheses to test in future experiments. The results are shown in the heatmap of Fig. 3 and Online Resource 1—Tables 15 and 16. The variants could be clustered in 4 groups of AD-associated genes based on their associations with Aβ42 and pTau. The first cluster of 14 variants showed strong association with both decreased levels of Aβ42 and increased levels of pTau in CSF. A pathway enrichment analysis of the genes associated with the variants showed 29 GO terms enriched and ‘amyloid’ is the common denominator in the names of these terms, of which the signal is mostly driven by BIN1, PICALM ABCA7 and CLU. The second cluster contained 21 variants and included genes that have also been related to other dementia types (e.g., GRN, TMEM107B, SNX1, MAPT, CTSB and CTSH). This cluster was associated with decreased pTau levels, an no general effect on Aβ42 levels. Pathway analysis of the genes suggests an enrichment for 8 GO terms of which the names have ‘immune’ as a common denominator that is mostly driven by 12 genes of which TREM2 and GRN are the most frequent contributors. The third cluster consisted of 22 variants, which were related to decreased levels of Aβ42 but not increased levels of pTau. Nine GO terms were enriched and ‘migration’ and ‘tyrosine’ are the words that occur most often in these terms, though each word only based on 2 GO terms each. The last cluster of 20 variants group because they increased pTau, but did not decrease Aβ42 levels. No GO terms are significantly enriched in this gene cluster.

Fig. 3
figure 3

The effects of all AD-associated loci. The names of the loci are named according to their linked gene names in Bellinguez et al. (2021). Hierarchical clustering was performed on the rows and columns using Eucledian distances and the method ‘average’ for clustering Pathway enrichment analyses was performed on the four first clusters. The enrichment analyses are in Online Resource 1—Table 9. The upper bar shows the odds ratio for AD, where alleles for variants with protective effects have been flipped to show AD-risk increasing effects for all variants. The increases in Aβ42 and pTau (positive Zscores) are shown in red and decreases in AB42 and PTau (negative Z scores) are shown in blue. * = P value < 0.05, X = 0.05 < P value < 0.001, # = 0.001 < P value < 5 × 10–8, $ = P value < 5 × 10–8


We identified 2 loci (CR1 and APOE) for Aβ42, and 4 loci (BIN1, GMNC, C16orf95 and APOE) for pTau in a total of 13,116 individuals (discovery n = 8074; replication n = 5402 individuals). In concordance with previous GWAS studies [12, 17], both proteins showed the strongest association for APOE, where APOE ɛ4 decreased amyloid-beta levels and increased pTau levels, while APOE ɛ2 had the opposite effect. We confirmed GMNC as a risk factor for CSF pTau levels. We identified CR1 as a novel locus for CSF Aβ42 levels, and we observed 2 novel loci (BIN1 and C16orf95) for CSF pTau levels. So other than APOE, no risk loci overlap is observed for Aβ42 and pTau, implying at least partly separate genetic backgrounds for both pathological hallmarks. Amyloid-beta appears to be dominated by the effect of APOE, while pTau is influenced by multiple genetic components. Such a divergence in genetic influences is not in concordance with a genetic etiology where accumulation of Tau tangles is a direct downstream effect of amyloid plaque formation, as proposed by the amyloid cascade theory [29, 30]. Rather, it seems that these pathologies have an independent component in its origin, as already extensively reviewed [64], thereby implying a dual-cascade hypothesis in which sporadic AD pathogenesis is caused by defects in correlated but independent cellular processes. This is highly relevant biological knowledge for the development of potential AD treatments. The limited clinical efficacy of agents that aim to reduce beta-amyloid plaques might potentially be due to this [76].

In line with this implication is the observed difference in genetic subgroups based on CSF protein patterns as proposed by the explorative cluster analysis, suggesting multiple Aβ42 and pTau related biological pathways to be involved in the etiology AD. The first genetic subgroup is defined by the word ‘amyloid’, where BIN1, PICALM, ABCA7 and CLU contribute most to this signal. The involvement of these genes in amyloid-beta pathology is supported by previous studies [15, 46, 63, 79]. ‘Immune’ is labeling the second subgroup, thereby implying that the genes included in this cluster are influencing AD by altering immune responses. The largest genetic drivers of this signal are TREM2 and GRN which have been previously described for their functional role in AD via the immune system [45, 54]. A substantial involvement of the immune system in AD is a generally accepted concept, supported by clinical, functional and genetic research [8]. As mentioned in the Results section, this second cluster includes genes that have also been related to other dementia types (e.g., GRN, TMEM107B, SNX1, MAPT, CTSB and CTSH). This is in line with more recent insights where dysfunction of the immune response is seen as a common cause for multiple neurodegenerative diseases [28]. The third and fourth clusters are less straightforward to interpret as many terms are similarly enriched, or no significant enrichment is observed. The definition of the first 3 subclasses in its current state is the best approximate within reach, though presumably not a perfect reflection of the genuine genetic etiology of AD. Subsequent genetic studies with larger sample sizes are anticipated to facilitate an improved understanding of the biological implications of the underlying genetic subclasses.

The variety in subclasses of genetic contributors for AD etiology could mean that different patient groups might benefit from distinct AD treatment depending on the biological pathway that is affected. Although, our genetic results in its current state are not applicable for clinical trials, we believe by improving the definition of the amyloid-tau clusters (e.g., by increasing sample sizes), it will be possible to assign individuals to one of the groups based on the distribution of their AD risk variants. For example, individuals with relatively more AD risk variants in the first cluster should preferably be included in a clinical trial addressing amyloid formation, while individuals with higher sub PRS scores for the second immune cluster, would have a higher probability benefitting from a clinical trial targeting the immune system. Alternatively, when sample sizes are increasing for future CSF amyloid and tau GWAS, an amyloid or tau PRS could be calculated to identify individuals at high genetic risk for amyloid and tau, thereby defining the individuals that should be included in trials aiming to reduce amyloid plagues or tau neurofibrillary tangles, respectively.

As mentioned above, APOE is the strongest locus for both CSF Aβ42 and pTau. The functional implications of the relation between APOE and the 2 pathological hallmarks has been excessively studied and reviewed in the scientific field of Alzheimer’s disease [20, 32, 60, 64]. In short, studies in human and mice models have shown higher Aβ42 levels and plaque load for APOE4 carriers. Whether this is a result of gain of toxic effect (e.g., enhanced ability of the APOE4 isoform to bind to Aβ42 [40]), or loss of defensive mechanism (e.g., less effective microglial response[20]), or a combination of both, remains to be elucidated. For Tau, studies have shown hyperphosphorylation and faster accumulation for APOE4 carriers, both in mice and human models [32, 64]. Again, the functional route through which APOE is affecting the Tau aggregates is unknown, and multiple cellular processes are being proposed as mediators, including neuronal endocytosis, lipid metabolism and glial function. An important functional observation though is the influence of APOE4 on tau accumulation in absence of Aβ42 pathology [55, 68], which is in agreement with the genetic implications of this current study. An unexpected observation for the APOE locus in this study is the successful colocalization with an eQTL for the NKPD1 in a specific type of T cells. However, due to the convincing functional involvement of APOE in AD pathogenesis via the presence of cysteine or arginine at APOE amino acid residuals 112 and 158, that we anticipate this successful NKPD1 eQTL colocalization result to be a false positive.

The CR1-locus findings are in agreement with the well-known observed association for AD risk, where rs6656401 (R2 = 0.88 with lead SNP of current study) carriers are more susceptible for AD. The similarity in these association signals is strengthened by the convincing colocalization of the CR1 locus between our CSF pTau observation and the CR1 locus of Kunkle et al. [43]. We furthermore observe a successful colocalization with an eQTL for the CR1 gene in multiple brain regions. In concordance with this observation is functional work on the effect of CR1 further which suggests that CR1 is involved in AD pathogenesis by regulating Aβ42 clearance in the brain itself, but also peripherally in blood cells [81]. More recent research in red blood cells of AD patients showed deficient CR1 immunoreactivity, including CR1-mediated capture of circulating amyloid-beta [35]. They observed decreased CR1 protein levels in red blood cells for CR1 SNPs that associate with higher AD risk. The second novel locus in this study is BIN1 for CSF pTau, for which we observe a successful colocalization for an eTQL for the BIN1 gene in lymphoblastoid cell lines, implying BIN1 to be the causal gene for this locus. The BIN1-locus furthermore localizes with the same locus in the AD GWAS, implying that a causal variant of the BIN1-locus for AD contributes to AD pathogenesis via tau pathology. BIN1 has already been linked to Tau pathology in several functional studies, first shown in fruit flies where a decrease in the BIN1 ortholog gene expression suppressed Tau-mediated neurotoxicity [11]. More recently, research in mice showed physical protein interaction between BIN1 and Tau [51], and BIN1 involvement in Tau-dependent hyperexcitability in AD [66]. In human subjects, BIN1-carriers were associated with lower memory performance, mediated by higher tau-PET levels [22]. Our observed BIN1-Tau association contributes valuable knowledge on this topic, by observing the same trend in a substantial larger study (89 vs 13,118 individuals) using different techniques to measure Tau pathology (PET vs. CSF). We provide in vivo confirmation that CR1 is associated with AD via Aβ42, while BIN1 relates to Tau pathology.

The third novel locus in this study is the region on genomic location 16q24.2 for pTau, of which the strongest associated variants are located within intronic regions of C16orf95. This locus has not been linked to (p)Tau pathology or AD in previous research, though it associated to lateral ventricular volume in the CHARGE study, including 23.5 k healthy individuals [65]. Similarly, the 3q28-locus for pTau from our findings colocalizes inconclusively (posterior probability of 0.73) to lateral ventricular volume by the CHARGE study, implying that the same genetic risk factors contribute to both phenotypes, strengthening the notion that neurodegeneration and (p)Tau pathology are highly correlated. 3q28 has been linked to (p)Tau by previous CSF studies in dementia cohorts and was identified as the GMNC locus [12, 17]. In comparison to the latest GWAS of Deming et al. [17], increasing the sample size with a small 10 k individuals in this study strengthens the association of this locus from 3.07 × 10–11 to 1.19 × 10–36, thereby turning it into a well-established locus for pTau pathology (similar for Tau: Deming P = 3.07 × 10−11 (n = 3146); current P = 9.65 × 10–37 (n = 12,540). The C16orf95-locus convincingly colocalized with the same locus in ventricular volume. The formerly reported directions of effect of GMNC and C16orf95 for ventricular volume are counterintuitive. For both loci the allele that associated with an increase in pTau pathology in our dementia cohorts associated with a smaller ventricular volume, implying less neurodegeneration. We explored if these loci work through the same biological pathways using CSF proteomics data. The consistently highlighted functional groups for the GMNC locus (axon guidance and ephrin signaling) were different than for the C16orf95-locus (extracellular matrix components), thereby implying at least partly distinct functional routes via which they influence pTau protein levels in CSF. The power of the proteomics dataset is rather limited. We anticipate that these analyses would benefit and potentially find overlapping biological pathways, when AD proteomics datasets at hand will increase in sample size, or apply the same proteomics assay (rather than using different ones with little overlap in proteins).

We furthermore identified two novel loci in the stratified analyses for Aβ42 levels: 7q11.22 for APOE ɛ4 non-carriers and 12q13.3 in individuals with abnormal amyloid levels. We were unable to test for replication of these loci in independent replication datasets, as such analyses were not performed. Neither loci have been previously linked to AD, or any other trait. The lead SNP for the locus on chromosome 7 is a common intronic variant for the lncRNA LOC105375341, which according to GTEx is only expressed in testis and prostate, and thereby not a promising causal gene for AD or AD-related phenotypes. The locus on chromosome 12 consists of just one rare intronic variant that was only detected in the Dutch cohort with the largest sample size (n = 498). Future studies including more cohorts of large sample sizes are required to study this rare variant in more detail.

Besides APOE and GMNC, no other loci from the latest GWAS of CSF amyloid and tau [17] were replicated by this study. For Aβ42, it was not possible to test the GLIS1-locus as this variant was not observed in our data, which is in concordance with the gnomAD browser [37] reporting extremely low coverage and a MAF of 0.007 for rs185031519, the strongest GLIS1 SNP in Deming et al. For the other loci that we were unable to replicate (SERPINB1 for Aβ42; and GLIS3, PCDH8, CTDP1 for pTau), there were no significant associations (P > 0.05) despite substantial sample sizes (n > 7000). These differences in findings might be due to differences in study design, for example inclusion of cohorts with other diagnoses and/or differences in analysis strategies.

The GCTA-based SNP-heritability estimates of 27% and 34% for Aβ42 and pTau, respectively, are in a similar range to the estimates calculated by Deming et al. [17] (36% for Aβ42 and 25% for pTau). Notable is that the previous study estimated Aβ42 to be most heritable, while we observed the highest estimate for pTau, though the standard errors were about 10% for both studies. Our LDSC-based SNP-heritabilities show a similar trend with higher estimates for pTau. Furthermore, these LDSC-based 13% for Aβ42 and 21% for pTau protein levels are considerably higher than the 7% previously observed for the diagnosis AD. The higher heritability for the tested CSF protein levels strengthens the assumption that more objective measurable biological properties are more strongly associated with AD pathogenesis than the diagnostic classifications.

Due to this assumption, the number of risk loci identified in this study might be lower than anticipated. However, our study design is suboptimal for obtaining the most powerful model for genetic association identification. Meta-analyzing 31 heterogeneous cohorts (differences in genotyping array, imputation references, CSF protein assays, and patient/control ratios) is more challenging rather than analyzing a homogeneous dataset of similar sample size. Alternatively, increasing the sample size by adding more cohorts to increase power is presumably a more feasible approach for the future.

In conclusion, the current findings clearly show that studying the genetic effects of AD-related endophenotypes has the potential to reveal novel associations, and highlight important biological insights. The clear distinction in genetic findings for amyloid-beta and tau emphasizes the (partly) genetic independence of these two biological mechanisms in AD pathogenesis. Moreover, the identification of CR1 and BIN1, which are the second and third strongest associated AD locus after APOE, furthermore implies that by increasing sample size of genetic analysis in CSF biomarkers it will become more apparent through which biological mechanisms certain AD loci have their effect on AD pathogenesis. Even larger collaborative efforts with more homogeneous sample definitions are, therefore, encouraged to be undertaken to enhance our genetic understanding of AD, ultimately leading to improved biological knowledge for the development of drug treatment.