Genome-wide Association Studies in Alzheimer’s Disease: A Review

Over the past decade, research aiming to disentangle the genetic underpinnings of late-onset Alzheimer’s disease has mostly focused on the identification of common variants through genome-wide association studies. The identification of several new susceptibility genes through these efforts has reinforced the importance of amyloid precursor protein and tau metabolism in the cause of the disease and has also implicated immune response, inflammation, lipid metabolism, endocytosis/intracellular trafficking, and cell migration. Ongoing and future large-scale genome-wide association studies, translational studies, and next-generation whole genome or whole exome sequencing efforts hold the promise to map the specific causative variants in these genes, to identify several additional risk variants, including rare and structural variants, and to identify novel targets for genetic testing, prevention, and treatment.


Introduction
Late-onset Alzheimer's disease (LOAD) typically begins with the onset of symptoms after the age of 60 years and evolves slowly from mildly impaired memory to severe cognitive loss. At death, the most frequent pathological manifestations in the brain include extracellular β-amyloid protein (Aβ) in diffuse and neuritic plaques and intracellular deposits of hyperphosphorylated tau protein, a microtubule assembly protein, in the form of neurofibrillary tangles. Widespread loss of both neurons and synapses also occurs [1].
An estimated 4.5 million Americans have LOAD. The annual incidence of LOAD increases from 1 % at the age of 60-70 years to 10-30 % at 85 years and older [2]. As the US population ages, it is expected that the number of LOAD cases will increase to 16 million to 20 million by 2050, with one in 45 Americans affected [3,4]. A critical barrier to lessening the impact of this disease is the limited development of drugs to prevent or treat LOAD, which is mostly attributable to incomplete characterization of the basic underlying pathologic mechanisms. Determining which genes and gene networks contribute to LOAD risk would reveal basic pathogenic mechanisms, highlight key proteins and pathways for drug development ("druggable targets"), and inform the development of genetic testing methods for identifying those at greatest risk of LOAD when preventive measures become available.
In recent years, the genetic analysis of LOAD has focused on identification of common variants through genome-wide association studies (GWAS) and has identified several novel susceptibility genes implicating specific pathways in the disease. This article reviews these studies, discusses their potential and limitations, and provides suggestions for future research.

Data Source and Study Selection
The primary sources of the studies addressed in this review were full-text articles and abstracts published in English in the PubMed database between 2010 and February 2013. The keywords used for searching PubMed were "dementia," "Alzheimer's disease," "gene," "genetics," "epigenetics," "endophenotype," and "genome-wide association study." The abstracts retrieved were read to identify studies addressing the topics included in this review. We also performed a manual search of references cited in published articles. The studies were read in their entirety to assess their appropriateness for inclusion in this article.

Genetic Epidemiology of LOAD
A family history of dementia is one of the most important risk factors for LOAD [5,6]. Families multiply affected by LOAD are at increased risk of dementia, but the distribution of secondary cases is not consistent with Mendelian inheritance. LOAD is more frequent among monozygotic twins than dizygotic twins [7][8][9], and first-degree relatives of patients with LOAD have approximately twice the expected lifetime risk of developing the disease. Heritabilities of 58-79 % for LOAD indicate that in spite of progress made in identifying the underpinnings of the disease, a substantial fraction of LOAD is attributable to unknown genetic factors.

Apolipoprotein E Region
For more than a decade, only one genetic risk factor, the APOE ε4 allele, located on chromosome band 19q13, was an unequivocally established "susceptibility" gene in non-Hispanic Whites of European ancestry. Apolipoprotein E (ApoE) is a lipid-binding protein and is expressed in humans as three common isoforms coded for by three alleles, ε2, ε3, and ε4. A single APOE ε4 allele is associated with a twofold to threefold increased risk; having two copies is associated with a fivefold or more increased risk [10]. In addition, each inherited APOE ε4 allele lowers the age at onset by 6-7 years [11][12][13][14][15][16][17][18]. APOE ε4 is also associated with lower cognitive performance, in particular the memory domain, is associated with mild cognitive impairment [19][20][21][22], and is associated with progression from mild cognitive impairment to dementia [19][20][21][22][23][24][25][26][27][28][29]. Although the population attributable risk of APOE ε4 is estimated at 20-50 % [30], the presence of ε4 is neither necessary nor sufficient for development of the disease [31]. In ethnic groups other than non-Hispanic Whites, the association between APOE and LOAD was largely inconsistent across studies.

Findings from GWAS
At the beginning of the century, thousands of candidate-gene-based association studies aiming to identify additional susceptibility loci were performed, but only one gene, the sortilin-related receptor (SORL1) [32], which is implicated in intracellular trafficking of amyloid precursor protein (APP), could be consistently replicated in independent datasets and implicated in the disease. The main reasons for these inconsistencies between studies are sample heterogeneity with differences in linkage disequilibrium (LD) patterns and allele frequencies, and small sample sizes, leading to limited power to detect small or moderate effect sizes. In the past 5 years, technological advances in high-throughput genome-wide arrays have allowed the hypothesis-free simultaneous examination of thousands to millions of polymorphisms across the genome, and large collaborative efforts capitalizing on this technology have significantly advanced knowledge of the genetic underpinnings of LOAD and the pathways involved by identifying several novel risk loci.
The major GWAS contributing to this gained knowledge are summarized in Table 1. Most were performed in non-Hispanic Whites of European ancestry. The first set of studies identified CLU, PICALM, CR1, and BIN1 as susceptibility loci [33][34][35]. Clusterin (Clu), also known as apolipoprotein J, is a lipoprotein highly expressed in both the periphery and the brain [36]. Like ApoE, it is involved in lipid transport [37]. Clu is also hypothesized to act as an extracellular chaperone that influences Aβ aggregation and receptor-mediated Aβ clearance by endocytosis [38]. Unlike for APOE, there are no known coding variants that account for the observed genetic association with CLU, suggesting that genetic variation in expression levels may be responsible for the altered risk of LOAD [38]. BIN1 (amphiphysin II) is a member of the Bin1/amphiphysin/RVS167 (BAR) family of genes that are involved in diverse cellular processes, including actin dynamics, membrane trafficking, and clathrin-mediated endocytosis [39], which affect APP processing and Aβ production or Aβ clearance from the brain. Phosphatidylinositol-binding clathrin assembly protein (PICALM) is also involved in clathrin-mediated endocytosis and recruits clathrin and adaptor protein complex 2 to sites of vesicle assembly [40]. CR1 is a cell-surface receptor that is part of the complement system. It has binding sites for complement factors C3b and C4b and is involved in clearing immune complexes containing these two proteins. Since Aβ oligomers can bind C3b, CR1 may participate in the clearance of Aβ. CR1 may also play a role in neuroinflammation, which is a prominent feature in Alzheimer's disease [41]. Interestingly, Clu may play a role in this process as an inhibitor [42]. In summary, this first set of GWAS identified loci mainly clustering in four pathways, namely, immune response, APP processing, lipid metabolism, and endocytosis/intracellular trafficking.
The second set of large GWAS identified additional susceptibility genes (CD33, MS4A4A/MS4A4E/MS4A6E cluster, ABCA7, CD2AP, and EPHA1) [43••, 44••]. In line with the pathways identified by the first set of GWAS, all five of these loci are likely involved in the immune system, whereas ABCA7 is in addition involved in lipid metabolism and APP processing (Table 2). The CD33 gene encodes a protein that is a member of a family of cell-surface immune receptors that bind extracellular sialylated glycans and signal via a cytoplasmic domain called the immunoreceptor tyrosine inhibitory motif [45,46]. CD33 has primarily been studied in the peripheral immune system, where it is expressed on myeloid progenitors and monocytes, and it is also expressed in the brain. In the periphery, CD33 appears to inhibit proliferation of myeloid cells [47]. The MS4A4A/MS4A4E/MS4A6E locus is part of a cluster of 15 MS4A genes on chromosome 11 and encodes proteins with multiple membrane-spanning domains that were initially identified by their homology to CD20, a B-lymphocyte cell-surface molecule. Little is known about the function of MS4A4A gene products; however, like CD33, MS4A4A is expressed on myeloid cells and monocytes and likely has an immune-related function. EPHA1 encodes a member of the ephrin family of cell-surface receptors which interact with ephrin ligands on adjacent cells to modulate cell adhesion, migration, axon guidance, and synapse formation and plasticity. Although there is a substantial body of research on the function of ephrin receptors in general, little is known about the EPHA1 gene product. Like other ephrin receptors, it regulates cell morphology and motility [48], and early work implicated this receptor in regulating vascular morphogenesis and angiogenesis [49]. EPHA1 knockout in mice results in abnormal tail and reproductive tract development [50], but no effects on the brain. Consistent with this notion, in mice, expression is restricted to epithelial tissue. In humans, EPHA1 is expressed by CD4+ T lymphocytes [51], monocytes [52], intestinal epithelium, and colon. Combined with the lack of evidence for brain expression, this may suggest that, like CD33, CR1, and MS4A4/MS4A6E, the role of the EPHA1 gene product in Alzheimer's disease may be mediated through the immune system. The CD2-associated protein gene (CD2AP) encodes a scaffolding protein that binds directly to actin [53], nephrin, and other proteins involved in cytoskeletal organization. In the immune system, CD2AP is required for synapse formation [54] in a process that involves clathrin-dependent actin polymerization. ABCA7 is an integral transmembrane ATP-binding cassette (ABC) transporter belonging to the ABC family of proteins that mediate the biogenesis of high-density lipoprotein with cellular lipid and helical apolipoproteins [55]. It binds apolipoprotein A-I and functions in apolipoprotein-mediated phospholipid and cholesterol efflux from cells [56]. In addition, ABCA7 affects the transport of other important proteins, including APP [56], through the cell membrane and is involved in host defense through effects on phagocytosis by macrophages of apoptotic cells [55].
In these large-scale GWAS performed in non-Hispanic Whites of European ancestry, the most strongly associated single-nucleotide polymorphisms (SNPs) at each locus other than APOE demonstrated population attributable fractions between 1.0 and 8.0 %, with effect sizes ranging from odds ratios of 1.16 to 1.20, i.e., much smaller than for APOE [57]. In the largest GWAS performed to date in Caribbean Hispanics [58], associations in CLU, PICALM, and BIN1 were replicated, and several additional loci on 2p25.1, 3q25.2, 7p21.1, and 10q23.1 were observed; these loci could be replicated in an independent cohort of non-Hispanic Whites of European ancestry from the National Institute on Aging Late-Onset Alzheimer's Disease Family Study (NIA-LOAD). Finally, in the largest GWAS of African Americans performed, Reitz et al. [59••] identified ABCA7 as a major susceptibility locus in this ethnic group. Interestingly, in contrast to all GWAS loci identified in Caucasians, in African Americans the ABCA7 locus had an effect size as strong as that of APOE ε4 (i.e., a 70-80 % increase in risk compared with a 10-20 % increase in risk through the GWAS loci observed in Whites). Although this finding may represent a winner's curse (i.e., inflation of the estimated effect in a discovery set in relation to follow-up studies) and needs to be confirmed by independent studies in African Americans and functional methods, it may have major implications for developing targets for genetic testing, prevention, and treatment in this ethnic group if proven true. In addition, this study confirmed APOE as a susceptibility gene in this ethnic group, evidence for which prior to this study had been inconsistent across studies, and also replicated CR1, BIN1, EPHA1, and CD33.
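To make the relationship between a modest odds ratio and a population attributable fraction of a few percent more concrete, the following minimal sketch applies Levin's formula, PAF = p(RR - 1) / [1 + p(RR - 1)]. It is an illustration only, not the method used in the cited studies: the carrier frequency and odds ratio are hypothetical values chosen to fall in the ranges quoted above, and the odds ratio is used as a stand-in for the relative risk, an approximation that is weaker for a disease as common in the elderly as LOAD.

```python
def levin_paf(exposure_prevalence: float, relative_risk: float) -> float:
    """Population attributable fraction from Levin's formula.

    exposure_prevalence: proportion of the population carrying the risk allele.
    relative_risk: risk in carriers relative to non-carriers (here approximated
                   by an odds ratio, which is an assumption of this sketch).
    """
    if not 0.0 <= exposure_prevalence <= 1.0:
        raise ValueError("exposure_prevalence must lie between 0 and 1")
    excess = exposure_prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical inputs for illustration; not taken from any study cited here.
carrier_frequency = 0.30
odds_ratio = 1.18

paf = levin_paf(carrier_frequency, odds_ratio)
print(f"PAF = {paf:.1%}")  # prints "PAF = 5.1%"
```

With these illustrative inputs the attributable fraction is about 5 %, which shows how a common variant with a small effect can still account for a non-trivial share of cases at the population level.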

Discussion
The recent GWAS for LOAD using large numbers of cases and controls identified several novel susceptibility loci that are biologically plausible, cluster in specific pathways, and have significantly advanced the understanding of the pathogenic mechanism underlying the disease. Common to all novel loci in non-Hispanic Whites of European ancestry is the modest effect size, with odds ratios ranging from 1.1 to 1.5, leaving the APOE ε4 allele by far the strongest risk factor. In contrast, in the largest GWAS performed to date in African Americans, the ABCA7 locus was observed to have an effect size similar to that of APOE (70-80 % increase in risk). The population attributable risk of each of the non-APOE loci is estimated to be 1-8 %. However, this estimate will change with elucidation of the number, allele frequencies, and risk effects of the true functional variants at each locus, and the detection of additional common and rare risk variants and patterns of epistasis.
Replication in independent datasets (if possible across different ethnic groups) and functional validation of the loci identified by GWAS are crucial for several reasons. First, GWAS are not designed to identify the specific causative variants, but rather are designed to screen the genome, capitalizing on the LD between genotyped SNPs and the potentially causative variants [60]. As LD can extend over large intervals, the true genetic effectors may be located considerably far away from the SNP showing the disease association, limiting the ability to detect true associations from GWAS. The development of high-throughput genotyping arrays, which have increased the number of genotyped markers to several millions, has decreased this problem to some extent, but not entirely, depending on the LD pattern in the region. Second, signals selected on the basis of statistical significance thresholds in underpowered settings are often subject to the winner's curse (bias away from the null in the estimated effect of a newly identified allele on disease) [61][62][63], and replication can help produce a more accurate, unbiased estimate of the genetic effect of a locus. Third, the probability that an observed association truly exists depends on the power to detect the association, which in turn is a function of minor allele frequency, effect size, sample size, and the observed p value. The distribution of effect sizes of true associations in complex diseases is unknown, but it is likely that most of the large effects in LOAD GWAS have been identified, whereas most of the smaller effects remain to be discovered. The significance threshold needed to preserve the genome-wide type I error rate in studies of individuals with European ancestry is estimated at 5 × 10⁻⁷ to 1 × 10⁻⁸ [64,65]. This threshold is even lower in ethnic groups with greater genetic diversity such as Hispanics, Africans, and African Americans and, consequently, most individual GWAS do not have enough power to distinguish false positives from false negatives. Finally, replication in a population with different environmental or genetic backgrounds may, if assessed in a population with a lower extent of LD such as Africans, help narrow down the location of the causative variant. In addition, it allows one to determine the generalizability of the observed association. However, when the aim is to replicate an observed association, it has to be kept in mind that there are several reasons for the observation of no association, including differences in allele frequencies or LD patterns across populations, or allelic or locus heterogeneity.
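The dependence of the genome-wide significance threshold on the number of tests can be illustrated with a simple Bonferroni correction. This is a sketch under stated assumptions rather than the derivation used in the cited work: the family-wise error rate of 0.05 and the numbers of effectively independent tests below are hypothetical choices, and the function name is introduced here only for illustration.

```python
def bonferroni_threshold(family_wise_alpha: float, n_independent_tests: int) -> float:
    """Per-test p-value threshold that keeps the family-wise type I error
    rate at or below family_wise_alpha across n_independent_tests tests."""
    if n_independent_tests <= 0:
        raise ValueError("n_independent_tests must be positive")
    return family_wise_alpha / n_independent_tests


# Assumed numbers of effectively independent common variants; populations
# with shorter LD blocks contain more independent tests, so the per-test
# threshold becomes stricter.
for n_tests in (100_000, 1_000_000, 5_000_000):
    threshold = bonferroni_threshold(0.05, n_tests)
    print(f"{n_tests:>9,d} independent tests -> p < {threshold:.1e}")
```

Under these assumptions, 100,000 and 5,000,000 independent tests reproduce the two ends of the range quoted above (5 × 10⁻⁷ and 1 × 10⁻⁸), which is one way to see why populations with greater genetic diversity require more stringent thresholds.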
There are several additional approaches that can address some of these issues inherent to GWAS. Reclassifying sample subjects into more homogeneous subgroups, for example, based on endophenotypes, can reduce phenotypic heterogeneity and increase the power to detect true associations.
Gene-based association studies, which consider association between a trait and all markers within a gene rather than each marker individually, can be more powerful than traditional individual-SNP-based GWAS. For example, if a gene contains more than one causative variant, then several SNPs within that gene might show marginal levels of significance that are often indistinguishable from random noise in the initial GWAS results. If the effects of all SNPs in a gene are combined into a test statistic and correction is made for LD, the gene-based test might be able to detect these effects. Similarly, genome-wide haplotype-based association studies can characterize loci not detected by univariate analyses. Such gene-based or haplotype-based analyses led to the discovery of NARS2, FRMD6, and FRMD4A as susceptibility loci [66,67], the latter of which is immediately adjacent to GAB2. Identification and examination of regions with runs of homozygosity (i.e., excess burden of homozygous markers) can help identify recessive causative genes. Evidence is accumulating that a substantial part of the missing genetic variability could be due to epistatic effects or gene-environment interactions. Thus, exploration of gene-gene and gene-environment interactions can identify novel variants not detected by individual testing of SNPs. However, such studies require large sample sizes and/or large effect sizes to achieve adequate power.
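The idea of combining SNP-level evidence within a gene while correcting for LD can be sketched as follows. The example sums squared single-SNP z-scores into one gene-level statistic and obtains its null distribution by simulating correlated z-scores from the LD (correlation) matrix. This is one possible approach shown for illustration, not necessarily the method used in the cited studies; the z-scores, the LD matrix, and all function names are hypothetical.

```python
import math
import random


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def gene_based_p(z_scores, ld_matrix, n_sim=20000, seed=0):
    """Empirical p-value for the sum of squared z-scores in one gene,
    with the null distribution simulated under the given LD structure."""
    rng = random.Random(seed)
    n = len(z_scores)
    observed = sum(z * z for z in z_scores)
    lower = cholesky(ld_matrix)
    exceed = 0
    for _ in range(n_sim):
        independent = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Multiply by the Cholesky factor to impose the LD correlation.
        correlated = [
            sum(lower[i][k] * independent[k] for k in range(i + 1))
            for i in range(n)
        ]
        if sum(c * c for c in correlated) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)


# Hypothetical gene with three SNPs in moderate LD (pairwise r = 0.3).
z_scores = [2.2, 2.0, 2.1]
ld_matrix = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 1.0 - 0.7, 1.0]]

snp_p_values = [two_sided_p(z) for z in z_scores]
gene_p = gene_based_p(z_scores, ld_matrix)
print("single-SNP p-values:", [f"{p:.3f}" for p in snp_p_values])
print(f"gene-based p-value: {gene_p:.4f}")
```

In this toy example each SNP is only nominally significant on its own, whereas the combined statistic gives a smaller gene-level p-value, because three modest signals are jointly less likely under the null even after their correlation is taken into account.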
Although the latest GWAS arrays include dense SNP maps of several million SNPs with minor allele frequency down to 1 % and novel functional exonic variants that were identified through sequencing of thousands of exomes, they are limited in their ability to detect associations with variants not tagged by the genotyped SNPs. In addition, they are limited in their ability to identify structural variants or rare variants with minor allele frequency of less than 1 %. However, both rare and structural variants are increasingly recognized as being implicated in complex disease [68]. In fact, two recent studies that performed genome sequencing followed by imputation of identified variants in independent datasets implicated the triggering receptor expressed on myeloid cells 2 gene (TREM2) in Alzheimer's disease by identifying a causative rare missense mutation (rs75932628) resulting in an R47H substitution affecting the gene's anti-inflammatory function [69•, 70•]. Additional sequencing studies identified rare causative variants in the nicastrin gene (NCSTN), which encodes an obligatory component of the γ-secretase complex involved in cleavage of APP [71], as well as in CLU [72]. Although individual rare variants may have an effect size large enough to cause disease, the accumulation of several rare variants, each with a small or modest effect size, may also cross the susceptibility threshold. Ongoing and future large-scale next-generation whole exome or whole genome sequencing techniques will fill this gap and further provide the means to identify the specific causative variants in the genes/regions identified by GWAS. Appropriate algorithms for the statistical and bioinformatic analysis of sequencing data, in particular for whole genome sequencing data and for whole exome or whole genome sequencing data derived from families, still need to be developed and implemented. Nevertheless, the recent identification of rare variants in CLU, NCSTN, and TREM2 in LOAD, which also cluster in amyloid processing and immune-response/inflammation pathways and were missed by the GWAS but identified by sequencing studies, clearly belies the common disease-common variant hypothesis and demonstrates the necessity of approaches with the ability to detect rare variants [69•, 70•, 71, 72]. Once causative variants are identified, functional studies can assess the pathogenic effects of the variants and characterize the molecular pathways in which they are involved or with which they interact, further implicating the gene in the disease and potentially providing targets for effective intervention.

Conclusions and Future Directions
Over the past 10 years, studies capitalizing on high-throughput genome technologies have significantly advanced knowledge of the genetic underpinnings of LOAD. GWAS have identified several susceptibility genes, and sequencing studies have identified specific causative variants in these genes, but have also provided invaluable evidence for an involvement of rare variants in this complex disease, overturning the common disease-common variant hypothesis that had long defined the genetic research of complex diseases. Ongoing and future large-scale next-generation sequencing approaches (both hypothesis-driven and hypothesis-free) are likely to disentangle a significant part of the missing heritability of LOAD, and have the potential to identify targets for genetic testing, prevention, and treatment.