Main

The ability to read is crucial for success at school and access to employment, information and health and social services, and is related to attained socioeconomic status1. Dyslexia is a neurodevelopmental disorder characterized by severe reading difficulties, present in 5–17.5% of the population, depending on diagnostic criteria2,3. It often involves impaired phonological processing (the decoding of sound units, or phonemes, within words) and frequently co-occurs with psychiatric and other developmental disorders4, especially attention-deficit hyperactivity disorder (ADHD)5,6 and speech and language disorders7,8. Dyslexia may represent the low extreme of a continuum of reading ability, a complex multifactorial trait with heritability estimates ranging from 40% to 80%9,10. Identifying genetic risk factors not only aids increased understanding of the biological mechanisms, but may also expand diagnostic capabilities, facilitating earlier identification of individuals prone to dyslexia and co-occurring disorders for specific support.

Previous genome-wide investigations of dyslexia have been limited to linkage analyses of affected families11 or modest (n < 2,300 cases) association studies of diagnosed children and adolescents12. Candidate genes from linkage studies show inconsistent replication, and genome-wide association studies (GWAS) have not found significant associations, although LOC388780 and VEPH1 were supported in gene-based tests12. Larger cohorts are vital for increasing sensitivity to detect new genetic associations of small effect. Here, we present the largest dyslexia GWAS to date, with 51,800 adults self-reporting a dyslexia diagnosis and 1,087,070 controls, all of whom are research participants with the personal genetics company 23andMe, Inc. We validate our association discoveries in independent cohorts, provide functional annotations of significant variants (mainly single-nucleotide polymorphisms (SNPs)) and potential causal genes, and estimate SNP-based heritability. Lastly, we investigate genetic correlations with reading and related skills, health, socioeconomic, and psychiatric measures, and evaluate the evidence for previously implicated dyslexia candidate genes in our well-powered results.

Results

Genome-wide associations

The full dataset included 51,800 (21,513 males, 30,287 females) participants responding ‘yes’ to the question ‘Have you been diagnosed with dyslexia?’ (cases) and 1,087,070 (446,054 males, 641,016 females) participants responding ‘no’ (controls). Participants were aged 18 years or over (mean ages of cases and controls were 49.6 years (s.d. 16.2) and 51.7 years (s.d. 16.6), respectively). We identified 42 independent genome-wide significant associated loci (P < 5 × 10−8) and 64 loci with suggestive significance (P < 1 × 10−6) (Fig. 1 and Supplementary Table 1). Genomic inflation was moderate (λGC = 1.18) and consistent with polygenicity (see Q–Q plot, Extended Data Fig. 1). We also performed sex-specific GWAS and age-specific GWAS (younger or older than 55 years) because dyslexia prevalence was higher in our younger (5.34% in 20- to 30-year-olds) than older (3.23% in 80- to 90-year-olds) participants. These subsample analyses showed high consistency with the main GWAS (of the full sample). Genetic correlation estimated by linkage disequilibrium (LD) score regression (LDSC) was 0.91 (95% confidence interval (CI): 0.86–0.96; P = 8.26 × 10−253) between males and females, and 0.97 (95% CI: 0.91–1.02; P = 2.32 × 10−268) between younger and older adults.

Fig. 1: Manhattan plot of the genome-wide association analysis of dyslexia.

The y axis represents the −log10 P value for association of SNPs with self-reported dyslexia diagnosis from 51,800 individuals and 1,087,070 controls. The threshold for genome-wide significance (P < 5 × 10−8) is represented by a horizontal grey line. Genome-wide significant variants in the 42 genome-wide significant loci are red. Variants located within a distance of <250 kb of each other are considered as one locus.

Of the 17 genome-wide significant variants in the female GWAS (Extended Data Fig. 2), all but four (rs61190714, rs4387605, rs12031924 and rs57892111) were significant in the main GWAS and, of these four, three were in LD with an SNP that approached significance (P < 3.3 × 10−7 or smaller) in the main analysis. Intergenic SNP rs57892111 (located between TFAP2B and PKHD1 on chromosome 6p) was not among the significant or suggestive SNPs of the main analysis, and so may represent a female-specific variant. There is no evidence from existing GWAS that this SNP is associated with any other human trait. Of the six genome-wide significant variants in the male GWAS (Extended Data Fig. 3), all were significant in the main GWAS.

In the main GWAS, all significant variants were autosomal, except rs5904158 at Xq27.3 (for regional association plots, see Supplementary Fig. 1). A total of 17 index variants were in high LD with published (genome-wide significant) associated SNPs in the NHGRI GWAS Catalog13 (15 were associated with cognitive/educational traits; Supplementary Tables 1 and 2). Thus, a total of 27 associated loci showed no evidence of published genome-wide associations with traits expected to overlap with dyslexia (for example, educational attainment, cognitive ability) and were considered new (Table 1).

Table 1 New SNP associations with dyslexia, including gene-based results, eQTL status, expression in brain and validation in three independent cohorts (GenLang Consortium, CRS and NeuroDys)

Of 38 associated loci (the 4 remaining were tagged by indels unavailable in validation cohorts), 3 (rs13082684, rs34349354 and rs11393101) were significant at a Bonferroni-corrected level (P < 0.05/38) in the GenLang consortium GWAS meta-analysis of reading (n = 33,959) and spelling (n = 18,514) ability14. At P < 0.05, 18 were associated in GenLang, 3 in the NeuroDys case-control GWAS12 (n = 2,274 cases), and 5 in the Chinese Reading Study (CRS) of reading accuracy and fluency (n = 2,270; Supplementary Note) (Table 1 and Supplementary Tables 3–6).

Gene-based tests identified 173 significantly associated genes (Supplementary Table 7) but no significantly enriched biological pathways (Supplementary Table 8). We estimated the LDSC liability-scale SNP-based heritability of dyslexia to be h2SNP = 0.152 (standard error = 0.006) using the 23andMe sample prevalence of 5%, and h2SNP = 0.189 (standard error = 0.008) using a 10% prevalence of dyslexia, which is more typical of the general population2,3.
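The dependence of the liability-scale estimate on the assumed prevalence can be illustrated with the standard observed-to-liability transformation for case-control data. This is a minimal sketch under assumptions: the text does not spell out the formula applied by LDSC, and the function name is hypothetical. Holding the observed-scale estimate and the sample case fraction fixed, moving from a 5% to a 10% population prevalence scales the estimate by roughly 1.24, which is consistent with the two reported values (0.152 and 0.189) up to rounding.

```python
from statistics import NormalDist


def liability_factor(pop_prev: float, sample_prev: float) -> float:
    """Multiplier from observed-scale to liability-scale heritability.

    Assumed formula: K^2 (1 - K)^2 / (P (1 - P) z^2), where K is the
    population prevalence, P is the case fraction in the sample, and z is
    the standard normal density at the liability threshold for K.
    """
    nd = NormalDist()
    threshold = nd.inv_cdf(1.0 - pop_prev)  # liability threshold
    z = nd.pdf(threshold)                   # normal density at the threshold
    return (pop_prev * (1.0 - pop_prev)) ** 2 / (
        sample_prev * (1.0 - sample_prev) * z ** 2
    )


# Case and control counts reported in the text.
cases, controls = 51_800, 1_087_070
sample_prev = cases / (cases + controls)

# The sample case fraction cancels in the ratio, so only K matters here.
ratio = liability_factor(0.10, sample_prev) / liability_factor(0.05, sample_prev)

# Rescale the reported estimate at K = 5% to K = 10%.
h2_at_10_percent = 0.152 * ratio
```

Because the reported 0.152 is itself rounded, the rescaled value agrees with 0.189 only to within about 0.001.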

Fine-mapping and functional annotations

Within the credible variant set (Supplementary Table 1), missense variants were the most common (55%) of the coding variants; Extended Data Fig. 4 summarizes all predicted variant effects. Variants predicted to be deleterious by SIFT (Sorting Intolerant From Tolerant) score were identified in R3HCC1L, SH2B3, CCDC171, C1orf87, LOXL4, DLAT, ALG9 and SORT1. Within the credible variant set, no genes were especially intolerant to functional variation (smallest LoFtool (Loss-of-Function) percentile was 0.39). For the 42 associated loci, the most probable gene target of each was estimated by the Overall V2G (Variant-to-Gene) score from OpenTargets (Supplementary Table 9). Two index variants (missense variant rs12737449 (C1orf87) and rs3735260 (AUTS2)) could be causal because they had combined annotation dependent depletion (CADD) scores suggestive of deleteriousness to gene function according to Kircher et al.15 (Supplementary Table 10). The RegulomeDB rank of 2b for the AUTS2 variant indicated a regulatory role; its chromatin state supported location at an active transcription start site16,17.

Of the 173 significant genes from genome-wide gene-based tests in MAGMA (see Supplementary Table 11 for their functions), 129 could be functionally annotated (Supplementary Table 12). Protein-coding and noncoding sequences are actively conserved in approximately three-quarters of these genes, 63% are more intolerant to variation than average and 33% are intolerant to loss-of-function mutations. Gene property analysis for general tissues and 13 brain tissues confirmed the importance of the brain and specific brain regions (Supplementary Tables 13 and 14). Levels of brain expression for 125 of the 173 significant genes from gene-based tests could be mapped in FUMA and are shown in Supplementary Table 15. A total of 20 genes showed high general brain expression levels and, of these, 3 (PPP1R1B, NPM1 and WASF3) were located near significant SNP associations. Of the 12 brain regions assessed, gene expression was generally highest in the cerebellar hemisphere, cerebellum, and cerebral cortex, consistent with the results of gene property analysis.

Partitioned heritability

SNP-based heritability of dyslexia partitioned by functional annotation showed significant enrichment for conserved regions and H3K4me1 clusters (Supplementary Table 16 and Extended Data Fig. 5). There was enrichment in genes expressed in the frontal cortex, cortex and anterior cingulate cortex (P < 4.17 × 10−3) (Supplementary Table 17 and Extended Data Fig. 6), but not for brain cell type (Supplementary Table 18 and Extended Data Fig. 7). Enrichment was seen in enhancer and promoter regions, identified by the presence of H3K4me1 and H3K4me3 chromatin marks, respectively, in multiple central nervous system (CNS) tissues (Supplementary Tables 19 and 20 and Extended Data Figs. 8 and 9). Reading, an offshoot of spoken language, is a uniquely human trait, but there was no enrichment for a range of annotations related to human evolution spanning the last 30 million to 50,000 years18 (Supplementary Table 21).

Genetic correlations and LDSC

Genetic correlations were estimated for 98 traits (Fig. 2 and Supplementary Table 22), including reading and spelling measures from GenLang (Fig. 3), and brain subcortical structure volumes, total cortical surface area and thickness from the Enhancing Neuro Imaging Genetics through Meta-Analysis (ENIGMA) consortium. A total of 63 traits showed genetic correlations with dyslexia at the Bonferroni-corrected significance threshold (P < 0.05/98; Fig. 2). Genetic correlations (rg) with quantitative reading and spelling measures ranged from −0.70 to −0.75 (lowest 95% CI of −0.60, highest 95% CI of −0.86), and were −0.62 (95% CI: −0.50, −0.74) and −0.45 (95% CI: −0.26, −0.64) with phoneme awareness and nonword repetition measures, respectively. The rg with childhood/adolescent performance (nonverbal) intelligence quotient (IQ) was weaker (−0.19; 95% CI: −0.08, −0.30) than that with adult verbal-numerical reasoning19 (−0.50; 95% CI: −0.45, −0.55) but similar to that with childhood IQ20 (−0.32; 95% CI: −0.21, −0.43) and educational attainment21 (−0.22; 95% CI: −0.15, −0.29). Traits showing positive rg included jobs involving heavy manual work21 (0.40; 95% CI: 0.34, 0.45), work-related/vocational qualifications21 (0.50; 95% CI: 0.41, 0.59), ADHD22 (0.53; 95% CI: 0.29, 0.77), equal use of right and left hands21 (0.38; 95% CI: 0.19, 0.57) and pain measures21 (average = 0.31; 95% CI: 0.21, 0.41). Of the 11 ENIGMA measures tested, only intracranial volume was significantly correlated with dyslexia (rg = −0.14; 95% CI: −0.06, −0.22). Targeted investigation of 80 structural neuroimaging measures from UK Biobank, including surface-based morphometry and diffusion-weighted imaging for brain circuitry linked to language, yielded no genetic correlations that were significant at a Bonferroni-corrected level for the number of independent traits. Phenotype independence was estimated by spectral decomposition of the phenotypic correlation matrix implied by the bivariate LDSC intercept from GWAS summary statistics of these traits, using the PhenoSpD toolkit23 (Supplementary Table 23).

Fig. 2: Genetic correlations of dyslexia with other phenotypes.

Significant (P < 5 × 10−4) genetic correlations (rg) between self-reported dyslexia diagnosis from 23andMe and other phenotypes from the LD Hub database and Enhancing Neuro Imaging Genetics Through Meta-Analysis (ENIGMA). We tested 98 traits but present only those that were significant after Bonferroni correction. Center points represent genetic correlations, and error bars represent standard errors around the estimate; exact values can be found in Supplementary Table 22. The vertical line indicates a genetic correlation of zero, and the horizontal lines divide groups of related traits. GCSE, General Certificate of Secondary Education; HNC, Higher National Certificate; HND, Higher National Diploma; NVQ, National Vocational Qualification.

Fig. 3: Genetic correlations between dyslexia and measures of reading, language and nonverbal IQ.

Genetic correlations (rg) between self-reported dyslexia diagnosis from 23andMe and measures of reading, language and performance (nonverbal) IQ in the GenLang consortium. Center points represent genetic correlations estimated in LDSC, and error bars represent standard errors around the estimate; exact values can be found in Supplementary Table 22.

Polygenic score analyses

Dyslexia polygenic scores (PGS) based on the 23andMe dyslexia GWAS were computed in four independent cohorts and, overall, higher PGS were associated with lower reading and spelling accuracy (Supplementary Table 24). In two Australian population-based samples (1,647 adolescents, 1,163 adults), the dyslexia PGS explained up to 3.6% of variance in the reading and spelling measures, being most predictive of lower performance on tests of nonword reading, an index of phonological decoding. Dyslexia PGS did not correlate with scores on tests of nonword repetition (considered a marker of phonological short-term memory). In developmental cohorts enriched for reading difficulties, the dyslexia PGS explained 3.7% (UKdys; n = 930) and 5.6% (CLDRC; n = 717) of variance in word recognition tests.
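As a rough illustration of what a polygenic score and its "variance explained" mean, the sketch below computes a score as a weighted sum of risk-allele dosages and reports the squared correlation with a quantitative trait. This is an assumption-laden toy: the section does not describe the scoring method used in the four cohorts (real pipelines typically account for LD and adjust for covariates), and the weights, dosages and reading scores are hypothetical values chosen only to make the example run.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (0-2) using per-variant effect sizes."""
    return sum(d * w for d, w in zip(dosages, weights))


def r_squared(x, y):
    """Squared Pearson correlation, used here as 'variance explained'."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical log odds ratios for three variants (not taken from the study).
weights = [0.08, -0.05, 0.04]

# Hypothetical dosages for four individuals, one row per individual.
dosages = [
    [0, 1, 2],
    [2, 0, 1],
    [1, 1, 0],
    [2, 2, 2],
]

# Hypothetical standardized reading scores for the same individuals.
reading = [102.0, 91.0, 99.0, 95.0]

scores = [polygenic_score(row, weights) for row in dosages]
variance_explained = r_squared(scores, reading)
```

In this toy data, individuals with higher scores have lower reading scores, matching the direction of association reported above.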

Analyses of dyslexia associations from the literature

Of 75 previously reported dyslexia associations, none showed genome-wide significance in our analyses (Supplementary Table 25). Of these targeted variants, 19 (in ATP2C2, CMIP, CNTNAP2, DCDC2, DIP2A, DYX1C1, FOXP2, KIAA0319L and PCNT) showed association surviving Bonferroni correction that accounted for LD (P < 0.05/68.7). In gene-based tests of 14 candidate genes from the literature24,25, association at a Bonferroni level (P < 0.05/14) was seen for KIAA0319L (P = 1.84 × 10−4) and ROBO1 (P = 1.53 × 10−3) (Supplementary Table 26). The CNTNAP2 association approached corrected replication-level significance (P = 0.004). Targeted gene set analysis of three pathways previously implicated in dyslexia (Supplementary Table 27) showed replication-level support (P = 2.00 × 10−3) for the axon guidance pathway (comprising 216 genes).

Discussion

In the largest GWAS of dyslexia to date (>50,000 self-reported diagnoses), we identified 42 significant independent loci. Of these, 27 represent new associations that have not been uncovered in GWAS of related cognitive traits; 12 of the new associations were validated in the GenLang consortium GWAS meta-analysis of reading/spelling in English and other European languages14, and 1 in a Chinese language cohort. Of the significant SNPs, 36% overlapped with variants from general cognitive ability GWAS, consistent with twin studies that find that genetic variation in reading disability is explained by general and reading-specific cognitive ability10. Similar to other complex traits, and consistent with high polygenicity, each significant locus showed small effects (odds ratios (ORs) ranging from 1.04 to 1.12). Our estimated SNP-based heritability of 19% (assuming a 10% dyslexia population prevalence) was equal to that reported in a smaller GWAS12, but lower than heritability estimates from twin studies (40–80%)26,27. This difference may be due partly to effects of rare and structural variants28, which have been implicated in reading and related traits29,30.

Whereas AUTS2 has been implicated in autism31, intellectual disability32 and dyslexia33, the variant we uncovered (rs3735260) represents the strongest AUTS2 SNP association with a neurodevelopmental trait to date. Amongst our findings were other known neurodevelopmental genes, such as TANC2 (implicated in language delay and intellectual disability34,35) and, especially, GGNBP2 (linked to neurodevelopmental delay36 and autism37) with variant rs34349354 supported in all our validation cohorts. However, rs34349354 is also associated with cognitive performance38, and based on expression quantitative trait loci (eQTL) evidence is more likely linked to ZNHIT3, colocalizing with molecular QTLs (opentargets.org). Notably, none of the more established candidate genes for dyslexia approached genome-wide significance in our results.

Like other human complex traits, partitioning of SNP-based heritability revealed enrichment in conserved regions39. We further observed enrichment in the histone mark H3K4me1 (which has also been reported for ASD40), and at H3K4me1 and H3K4me3 clusters in the CNS (marking enhancers and promoters, respectively). Since reading/writing systems are built on our capacities for spoken language, it is plausible that evolutionary changes on the human lineage helped shape the underlying genetic architecture41. However, we did not find enrichment of significant associations for curated annotations spanning different periods of hominin prehistory.

Our self-reported dyslexia diagnosis binary trait showed strong negative genetic correlations with quantitative reading and spelling measures, supporting the validity of this measure in the 23andMe cohort, and suggesting that reading skills and disorder are not qualitatively distinct. The positive genetic correlation between hearing difficulties and dyslexia is consistent with genetic correlations reported for childhood reading skill42, suggesting that hearing problems at an early age could affect acquisition of phonological processing skills.

Dyslexia showed moderately negative genetic correlations with adult verbal-numerical reasoning, but there was a lack of a strong genetic correlation of dyslexia with (nonverbal) performance IQ. This would be consistent with phenotypic observations that individuals with dyslexia are disadvantaged on verbal IQ tests43. Educational attainment correlations were also not strong, which might reflect school adjustments and other support that counteract disadvantage in academic learning.

There was little evidence of common genetic variation in dyslexia being related to interindividual differences in subcortical volumes, or structural connectivity and morphometry for brain regions implicated in language processing in adults. Thus, the phenotypic correlations previously reported between dyslexia and aspects of neuroanatomy may in large part reflect environmental shaping of the brain, perhaps through the process of reading itself44. Left-handedness and ambidexterity show small genetic overlap with each other45 yet are both phenotypically linked to neurodevelopmental disorders/cognitive abilities46,47. We report a significant genetic correlation between dyslexia and self-reported equal hand use, but not left-handedness, supporting theories linking ambidexterity and dyslexia48.

Dyslexia and ADHD5,6 often co-occur (24% reporting ADHD in our cases versus 9% in controls), and we show a moderate genetic correlation between the two, potentially reflecting shared endophenotypes like deficits in working memory and attention49. Although we did not find significant genetic correlations between dyslexia and ASD, the GWAS for the latter encompassed diverse neurodevelopmental phenotypes, including subgroups with varying educational attainment and IQ40. Genetic correlations with pain-related traits suggest that individuals with dyslexia may have a lower threshold for pain perception. Links between pain and other neurodevelopmental disorders have been reported50,51.

Dyslexia polygenic scores were correlated with lower achievement on reading and spelling tests in population-based and reading-disorder enriched samples, especially for nonword reading, a measure of phonological decoding that is typically impaired in dyslexia. Polygenic scores could become a valuable tool to help identify children with a propensity for dyslexia, enabling learning support before development of reading skills. However, a limitation of our study is the potential for collider bias arising from sample selection (that is, over-representation of people without dyslexia and of people from higher socioeconomic positions), which we were unable to quantify; thus, care should be taken in future research when using polygenic scores based on many variants52.

In summary, we report 42 new independent genome-wide significant loci associated with dyslexia, 27 of which have not been associated with cognitive-educational traits and should be prioritized for follow up as dyslexia candidates. Functional annotation of the variants highlights the importance of conserved and enhancer regions of the genome for this trait. Dyslexia shows positive genetic correlations with ADHD, vocational qualifications, physical occupations, ambidexterity and pain perception, and negative correlations with academic qualifications and cognitive ability; family-based methods are needed to dissociate pleiotropic and causal effects.

Methods

GWAS participants

Participants were drawn from the customer base of 23andMe, Inc., a consumer genetics company. Participants provided informed consent and participated in the research online, under a protocol approved by the external AAHRPP-accredited IRB, Ethical and Independent Review Services (www.eandireview.com). They included 51,800 (21,513 male, 30,287 female) participants who responded ‘yes’ to the question ‘Have you been diagnosed with dyslexia?’ (cases) and 1,087,070 (446,054 male, 641,016 female) participants who responded ‘no’ (controls). Age ranged from 18 to 110 years, with the prevalence of dyslexia higher for younger participants (5.34% in those aged 20–30 years) than older participants (3.23% in those aged 80–90 years). The negative linear relationship between dyslexia prevalence and participant age was expected given that screening for specific learning difficulties has only become commonplace in more recent decades. Moreover, this aligns with findings from the subsample (4.3%) of participants who reported age of diagnosis: younger participants were diagnosed at an earlier age (for example, 9.7 years (±4.7) for 20- to 30-year-olds) than older participants (for example, 22.4 years (±17.8) for 80- to 90-year-olds). The prevalence of dyslexia in our sample was similar for women (4.51%) and men (4.60%), although the slightly higher prevalence in males in this very large sample was statistically significant (P < 8.7 × 10−6). Such a prevalence lies at the lower end of the range typically reported in the US population3 and might represent the more severe cases of dyslexia given that a formal diagnosis was required; additionally, people with dyslexia might opt out of survey research that requires reading, further restricting the sample range.
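The sex-specific prevalences quoted above follow directly from the case and control counts. The short check below reproduces them; the function name is hypothetical, and the calculation assumes prevalence is simply cases divided by all respondents of that sex.

```python
def prevalence_percent(cases: int, controls: int) -> float:
    """Percentage of respondents who reported a dyslexia diagnosis."""
    return 100.0 * cases / (cases + controls)


# Counts reported in the text.
male_prev = prevalence_percent(21_513, 446_054)
female_prev = prevalence_percent(30_287, 641_016)
overall_prev = prevalence_percent(51_800, 1_087_070)

# The sex-specific counts should add up to the full sample.
assert 21_513 + 30_287 == 51_800
assert 446_054 + 641_016 == 1_087_070
```

The overall value falls between the two sex-specific prevalences, at roughly 4.5%.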

Genotyping and imputation

DNA was extracted from saliva samples and genotyped on one of five genotyping platforms by the National Genetics Institute (NGI). In the present analysis, only participants with European ancestry were included. Details about the genotyping arrays, quality control of samples and ancestry derivation can be found in Fontanillas et al.53 and the Supplementary Note. Phased genotypes were imputed to a combined reference panel of the 1000 Genomes Phase 3 haplotypes (May 2015) and the UK10K imputation reference panel using Minimac3 (see Das et al.54).

Association analysis

Association analysis was performed on genotyped and imputed SNP dosage data using logistic regression and assuming an additive model of allelic effects. For X-chromosome analysis, male genotypes were treated as homozygous diploid. Covariates included age, age squared, gender, the first five ancestry principal components and genotype platform. SNP significance was evaluated by a likelihood ratio test, and genome-wide significance was determined as P < 5 × 10−8 (suggestive significance level as P < 1 × 10−6). Only reliably imputed SNPs (r2 > 0.80) and those with minor allele frequency (MAF) > 0.01 are presented (n = 7,995,923). We define associated regions by first identifying all variants with P < 5 × 10−8, then grouping these variants into regions separated by gaps of at least 250 kb. Index variants are the variants with smallest P value within each associated region. We use the same approach for regions with suggestive associations, but by first identifying all variants with P < 10−5. Subsidiary genome-wide association analysis of separate male (n = 21,513 cases, 446,054 controls) and female (n = 30,287 cases, 641,016 controls) groups, and younger (below 55 years; n = 30,763 cases, 582,276 controls) and older (55 and above; n = 21,037 cases, 504,794 controls) groups was performed. The latter was to check whether reliability of diagnosis (assumed to be higher in the younger sample whose recall of diagnosis should be better and who would have been exposed to greater levels of dyslexia screening) affected the GWAS signal.
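The region definition described above (threshold, 250-kb gap rule, index variant with the smallest P value) can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the `Variant` type, the function name and the toy variants are hypothetical, and a looser threshold can be passed in for suggestive regions.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    name: str
    chrom: str
    pos: int   # base-pair position
    p: float   # association P value


def index_variants(variants: List[Variant],
                   p_threshold: float = 5e-8,
                   min_gap: int = 250_000) -> List[Variant]:
    """Group significant variants into regions and return one index per region.

    Variants passing the threshold are sorted by position; a new region starts
    whenever the chromosome changes or the gap to the previous significant
    variant is at least `min_gap` base pairs.
    """
    hits = sorted((v for v in variants if v.p < p_threshold),
                  key=lambda v: (v.chrom, v.pos))
    regions: List[List[Variant]] = []
    for v in hits:
        previous = regions[-1][-1] if regions else None
        if (previous is not None and previous.chrom == v.chrom
                and v.pos - previous.pos < min_gap):
            regions[-1].append(v)
        else:
            regions.append([v])
    # The index variant is the one with the smallest P value in each region.
    return [min(region, key=lambda v: v.p) for region in regions]


# Hypothetical toy data.
toy = [
    Variant("v1", "1", 1_000_000, 1e-9),
    Variant("v2", "1", 1_100_000, 3e-12),  # same region as v1, smaller P
    Variant("v3", "1", 1_400_000, 2e-8),   # 300 kb from v2, so a new region
    Variant("v4", "1", 1_450_000, 1e-4),   # not significant, ignored
    Variant("v5", "2", 1_050_000, 4e-10),  # different chromosome
]
toy_index = index_variants(toy)
```

For the toy data, the three regions are indexed by v2, v3 and v5.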

We also sought to independently validate our genome-wide significant variants within (1) a published GWAS meta-analysis of 2,274 dyslexia cases from nine European countries representing six different languages (NeuroDys) by Gialluisi et al.55; (2) a population sample (Chinese Reading Study; CRS) of children measured on quantitative traits of reading accuracy and reading fluency (n = 2,270; described in the Supplementary Note); and (3) the GenLang quantitative trait GWAS meta-analysis of word reading (up to n = 33,959) and spelling (up to n = 18,514) skills measured in cohorts of children and adolescents from Europe, the United States and Australia, and representing seven European languages, of which English was the most common14.

Genomic control

Top SNPs are reported from the more conservative GWAS results adjusted for genomic control (Fig. 1, Extended Data Figs. 1–4, and Supplementary Tables 1, 2, 9 and 10), whereas downstream analyses (including gene-set analysis, enrichment and heritability partitioning, genetic correlations, polygenic prediction, candidate gene replication) are based on GWAS results without genomic control.
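For readers unfamiliar with genomic control, the sketch below shows one common form of the adjustment: the inflation factor λGC is the median observed 1-d.f. χ² statistic divided by its expected median (about 0.455), and each test statistic is divided by λGC before P values are recomputed. This is an assumption about the general technique, not a description of the exact procedure applied to these data; function names are hypothetical, and the rule of never deflating statistics when λGC < 1 is a convention chosen for the example.

```python
from statistics import NormalDist, median

_ND = NormalDist()
# Median of a chi-square distribution with 1 d.f. (approximately 0.455).
CHI2_1DF_MEDIAN = _ND.inv_cdf(0.75) ** 2


def p_to_chi2(p: float) -> float:
    """Convert a two-sided P value to a 1-d.f. chi-square statistic."""
    return _ND.inv_cdf(p / 2.0) ** 2


def chi2_to_p(chi2: float) -> float:
    """Convert a 1-d.f. chi-square statistic back to a two-sided P value."""
    return 2.0 * _ND.cdf(-(chi2 ** 0.5))


def genomic_control(p_values):
    """Return (lambda_gc, adjusted P values) for a list of P values."""
    chi2 = [p_to_chi2(p) for p in p_values]
    lambda_gc = median(chi2) / CHI2_1DF_MEDIAN
    scale = max(lambda_gc, 1.0)  # assumed convention: do not deflate
    return lambda_gc, [chi2_to_p(c / scale) for c in chi2]


# Synthetic example: evenly spaced null P values, inflated by a factor of 1.18.
uniform_p = [(i + 0.5) / 1001 for i in range(1001)]
inflated_p = [chi2_to_p(1.18 * p_to_chi2(p)) for p in uniform_p]
lambda_gc, adjusted_p = genomic_control(inflated_p)
```

In this synthetic case the adjustment recovers the original null P values, because the inflation was applied uniformly to every statistic.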

Gene-based analyses

The GWAS results were used to calculate gene-based P values for association with dyslexia by performing the gene analysis in MAGMA v.1.08 (ref. 56) through the FUMA interface57 using standard settings. In total, 19,039 genes were tested, and P values were judged based on a Bonferroni-corrected significance threshold of P < 2.63 × 10−6. We also performed gene set analyses for association of biological pathways (all available gene ontology (GO) terms and curated gene sets from the Molecular Signatures Database (MsigDB)58,59) with dyslexia in MAGMA through the FUMA interface. The total number of pathways tested was 15,486, and P values were judged based on a Bonferroni-corrected significance threshold of P < 3.23 × 10−6.

Biological annotations

Genome-wide significant variants and nearby gene(s) were annotated using external reference data and evaluated for functional or regulatory impact. A 99% credible set of potentially causal variants for SNPs in significant regions was based on approximate Bayes factors (ABFs)60 assuming a prior variance of 0.1, and using the method of Maller et al.61 to define these sets. Variant effect prediction for the credible-set variants was done in ENSEMBL (release 104)62. For genome-wide significant variants, we considered: gene context (whether a variant is intergenic or located within a specific functional region within a gene locus); deleteriousness (Combined Annotation Dependent Depletion (CADD) score); functionality (RegulomeDB (RDB) category); chromatin state (minimum and common 15-core chromatin state); and SNP-trait associations reported in the NHGRI GWAS Catalog13.
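The credible-set construction can be sketched as follows, assuming the commonly used Wakefield form of the approximate Bayes factor and a single causal variant per region, so that posterior probabilities are proportional to the ABFs. These assumptions, the function names and the effect sizes in the toy region are illustrative only and are not taken from the study.

```python
import math


def log_abf(beta: float, se: float, prior_var: float = 0.1) -> float:
    """Log approximate Bayes factor in favour of association.

    Assumed Wakefield form: sqrt(V / (V + W)) * exp(z^2 * W / (2 (V + W))),
    with V = se^2, prior variance W and z = beta / se.
    """
    v = se * se
    shrink = prior_var / (v + prior_var)
    z2 = (beta / se) ** 2
    return 0.5 * math.log(1.0 - shrink) + 0.5 * z2 * shrink


def credible_set(region, coverage: float = 0.99, prior_var: float = 0.1):
    """Smallest set of variants whose posterior probabilities sum to >= coverage.

    `region` maps variant name -> (beta, se). Returns (names, cumulative prob).
    """
    logs = {name: log_abf(b, s, prior_var) for name, (b, s) in region.items()}
    largest = max(logs.values())  # subtract the max for numerical stability
    weights = {name: math.exp(value - largest) for name, value in logs.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, name) for name, w in weights.items()),
                    reverse=True)
    chosen, cumulative = [], 0.0
    for posterior, name in ranked:
        chosen.append(name)
        cumulative += posterior
        if cumulative >= coverage:
            break
    return chosen, cumulative


# Hypothetical region: effect estimates (log odds ratio) and standard errors.
toy_region = {
    "lead": (0.100, 0.012),
    "proxy": (0.095, 0.012),
    "weak": (0.020, 0.012),
}
toy_set, toy_coverage = credible_set(toy_region)
```

Here the lead variant alone carries a little under 99% of the posterior probability, so the proxy is added to reach the 99% coverage target while the weak variant is excluded.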

For each variant, the most probable gene target was identified using the Open Target Genetics portal63, which draws on evidence from QTL and chromatin interaction experiments, functional predictions and distance from a gene’s transcription start site. For genome-wide significant genes, we considered: loss-of-function intolerance (probability of loss-of-function Intolerance (pLI) score); variation intolerance (residual variation intolerance score, RVIS); variation intolerance in noncoding regions (noncoding RVIS, ncRVIS); evolutionary constraint of noncoding regions (noncoding genomic evolutionary rate profiling (ncGERP) score); evolutionary constraint of protein-coding regions (protein-coding genomic evolutionary rate profiling (pcGERP) score); deleteriousness across noncoding regions (noncoding CADD (ncCADD) score); combined functionality of variants in noncoding regions (noncoding genome-wide annotation of variants (ncGWAVA) score); and expression in 12 brain tissues (amygdala, anterior cingulate cortex, caudate basal ganglia, cerebellar hemisphere, cerebellum, cortex, frontal cortex, hippocampus, hypothalamus, nucleus accumbens basal ganglia, putamen basal ganglia and substantia nigra). All annotations were obtained through FUMA57 except RVIS, ncGERP, pcGERP, ncCADD and ncGWAVA, which were taken from Petrovski et al.64. Details of each annotation including original sources are in the Supplementary Note.

Partitioned heritability

We partitioned SNP heritability of dyslexia using stratified LDSC, as described by Finucane et al.39, to determine whether SNPs that share the greatest proportion of the heritability are also clustered in specific functional categories in the genome. Overall, we performed 266 different tests, which would give a very conservative Bonferroni-corrected significance level of 1.88 × 10−4, but because there will be overlap among annotation groups, we also report corrections to significance within different classes of annotation, each of which we now describe. Partitioning was performed for the 24 main functional annotations defined by Finucane et al.39. LD scores, regression weights and allele frequencies are from European ancestry samples and were retrieved from https://alkesgroup.broadinstitute.org/LDSCORE. Heritability estimates were considered statistically significant if the P value surpassed an α level of 2.08 × 10−3, derived by Bonferroni correction based on 24 tests.

We also estimated the enrichment for heritability of dyslexia for tissue-specific annotations, while controlling for the annotations in the baseline model, including gene expression in three brain cell types, gene expression in 12 brain regions, and chromatin marks H3K4me1 and H3K4me3 in multiple tissues (108 and 114, respectively) since these marks are enriched at enhancers65 and promoters66, respectively. Enrichment is the proportion of SNP heritability divided by the proportion of SNPs. For the brain cell types, we estimated enrichment for heritability of dyslexia for genes expressed in neurons, astrocytes, and oligodendrocytes using data from Cahoy et al.67. Enrichments were considered statistically significant if the P value surpassed an α level of 0.017, derived by Bonferroni correction based on three tests. The gene expression data used to estimate the enrichment of heritability in genes expressed in certain brain regions was from the GTEx database68, and the Bonferroni-derived α level for enrichment was 4.17 × 10−3 (based on 12 tests). Chromatin annotations include data from the Roadmap Epigenomics consortium17 and EN-TEx69,70. For H3K4me1, the Bonferroni-derived α level for enrichment was 4.63 × 10−4 (based on 108 tests) and, for H3K4me3, the Bonferroni-derived α level for enrichment was 4.39 × 10−4 (based on 114 tests).

Evolutionary annotations

Although reading and writing is a human cultural invention, it builds on fundamental pathways involved in language processing. Therefore, we investigated whether annotations related to human evolution were significantly enriched for heritability of dyslexia by applying an evolutionary analysis pipeline adapted from Tilot et al.18. These analyses capture a range of periods in an evolutionary timeframe on the lineage that led to humans, from approximately 30 million years ago to 50,000 years ago.

Enrichment of heritability was estimated in adult brain human gained enhancers (HGEs)71, fetal brain HGEs72, ancient selective sweep regions73, Neanderthal-introgressed SNPs74 and Neanderthal-depleted regions75 (see Supplementary Note for a description of each annotation), controlling for the baselineLD v.2 model from Gazal et al.76. Heritability enrichment in human adult and fetal HGEs was additionally controlled for adult and fetal brain active regulatory elements from the Roadmap Epigenomics resource17. Active regulatory elements were defined using chromHMM16. Enrichment P values were judged against an α level of 10−2, derived by Bonferroni correction based on five tests.

Genetic correlations

Genetic correlations within the 23andMe GWAS of dyslexia

Genetic correlations between self-reported dyslexia diagnosis in males and females, and between younger (<55 years old) and older (≥55 years old) adults, were calculated using LDSC77,78.

Genetic correlations of dyslexia with other traits

We present the pairwise genetic correlations of dyslexia with 98 traits. Summary statistics for most of these traits are publicly available through LD Hub77,78,79, a centralized database and web interface that automates the LDSC regression analysis pipeline. A selection of brain magnetic resonance imaging measures obtained from the ENIGMA-3 consortium80,81,82,83, and measures of reading and spelling accuracy and performance IQ from the GenLang Consortium14, were analyzed locally using LDSC. Word reading accuracy in GenLang was measured by the number of words read aloud correctly from a list under time-restricted or unrestricted conditions. Examples of tools that include this measure are the Test of Word Reading Efficiency (TOWRE), the British Ability Scales (BAS) and the Wide Range Achievement Test (WRAT). Spelling accuracy in GenLang was measured by the number of words correctly spelled orally or in writing; the words were dictated as single words or in a sentence. Examples of tools that include this measure are the BAS, WRAT and Wechsler Objective Reading Dimensions (WORD). Performance IQ in GenLang was based on subtests of IQ tests that did not depend on verbal cues, as included, for example, in the BAS and the Wechsler Intelligence Scale for Children (WISC). Trait descriptions and summary statistic sources are in Supplementary Table 22. Bonferroni correction for 98 tests gave an adjusted critical P value of 5.1 × 10−4.

Genetic correlations were further estimated in a targeted analysis of structural brain magnetic resonance imaging measures from UK Biobank, which were more comprehensive than those currently available from ENIGMA and offered further advantages such as hemisphere-specific data and greater homogeneity in cohort and scanning procedures. GWAS summary statistics for brain imaging-derived phenotypes from 33,000 participants were downloaded from the Oxford Brain Imaging Genetics Server84. Structural brain imaging traits encompassed both diffusion tensor imaging and surface-based morphometric phenotypes85 for which the selected tracts or regions of interest had a known link to language. For diffusion tensor imaging, fractional anisotropy values derived from both tract-based spatial statistics and probabilistic tractography were used for available tracts spanning the extended language network86. For surface-based morphometric (cortical volume, surface area and thickness) GWAS, summary statistics for regions of interest derived from the Desikan-Killiany atlas (white surface) were used, again selected for their relevance to language processing on the basis of previous literature87,88,89,90. To correct for multiple testing, phenotypic correlations between the UK Biobank imaging indices were derived and analyzed with PhenoSpD23 to obtain the effective number of independent variables (36.08) used for Bonferroni correction (adjusted critical P value of 1.39 × 10−3).

Polygenic score analyses

Dyslexia polygenic scores were based on increasingly large sets of SNPs selected by their association P values in the 23andMe GWAS (P < 5 × 10−8, P < 1 × 10−5, P < 0.001, P < 0.01, P < 0.05, P < 0.1, P < 0.5 and P ≤ 1). They were calculated in four independent cohorts. Two were general population cohorts from Australia: n = 1,640 (772 families) adolescents/young adults (Brisbane adolescents)91 and n = 1,165 (966 families) older adults (Brisbane adults)25. The other two were family-based samples selected for dyslexia: one from the United Kingdom (UKdys), n = 930 (595 families), and the other from the United States (Colorado Learning Disabilities Research Center, CLDRC), n = 717 (336 families)92. In the Australian samples, polygenic scores were calculated on 1000 Genomes Phase 3 (v.20101123) imputed genetic data using PLINK93. Only reliably imputed SNPs (R2 > 0.80) with a minor allele frequency (MAF) >0.01 were included, and the default clumping procedure was used, in which index SNPs formed a clump with other SNPs in LD (R2 > 0.1) within a 250 kb distance. In the UKdys and CLDRC samples, polygenic scores were calculated on Haplotype Reference Consortium imputed genetic data using PRSice94, with the same imputation quality and MAF exclusions for the base (23andMe GWAS) sample, and the same clumping parameters.
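
To make the P value thresholding concrete, the sketch below computes one score per threshold for a single individual as a weighted sum of effect-allele dosages. It is a simplified illustration, not the PLINK or PRSice implementation: it assumes that SNPs have already been clumped and quality-filtered, it ignores missing genotypes and score averaging, and the function name, data layout and toy values are hypothetical.

```python
# P value thresholds listed in the Methods; the last one includes all SNPs.
P_THRESHOLDS = [5e-8, 1e-5, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0]


def polygenic_scores(gwas, dosages, thresholds=P_THRESHOLDS):
    """Return {threshold: score} for one individual.

    gwas    -- dict mapping SNP id to (effect size, association P value)
    dosages -- dict mapping SNP id to effect-allele dosage (0 to 2)
    ASSUMPTION: SNPs are already clumped and QC-filtered.
    """
    scores = {}
    for threshold in thresholds:
        total = 0.0
        for snp, (beta, p_value) in gwas.items():
            # Strict inequality, except that the final threshold keeps every SNP.
            included = p_value <= threshold if threshold >= 1.0 else p_value < threshold
            if included and snp in dosages:
                total += beta * dosages[snp]
        scores[threshold] = total
    return scores


if __name__ == "__main__":
    # Hypothetical summary statistics and genotype dosages.
    toy_gwas = {"rs_a": (0.10, 1e-9), "rs_b": (-0.05, 2e-4), "rs_c": (0.02, 0.3)}
    toy_dosages = {"rs_a": 2, "rs_b": 1, "rs_c": 1}
    for threshold, score in polygenic_scores(toy_gwas, toy_dosages).items():
        print(f"P threshold {threshold:g}: score {score:.3f}")
```

More lenient thresholds add SNPs with weaker evidence of association, so each score is built from a superset of the SNPs used for the previous one.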

Polygenic scores were then used as predictors in linear models of quantitative trait outcomes. In the Australian samples, these were word, nonword (phonetic), irregular word (lexical) reading and spelling tests from an extended version of the Components of Reading Examination95, and two nonword repetition tests that are sensitive to developmental language disorders (Dollaghan and Campbell96; Gathercole and Baddeley97); in UKdys and CLDRC, the outcome was word recognition. All quantitative traits were preadjusted for sex, age and ancestry principal components (10 principal components in UKdys and CLDRC; 20 principal components in the Australian samples). Further adjustments were made for imputation run (separate runs for different genotyping arrays) in the Australian samples, for nonverbal IQ in all samples except the Australian adults, and for hearing difficulties in the Australian older adults. Because the cohorts included related family members (twins or siblings), linear mixed models (lme) were specified in RStudio98, with family membership modeled as a random effect and the dyslexia polygenic score as a fixed effect. Where monozygotic twins were present, their trait scores were averaged and the pair was treated as a single case.
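
The handling of monozygotic twins before model fitting can be sketched as a small preprocessing step. The example below is an illustration in Python rather than the R code used for the analyses, and it does not fit the mixed model itself. The record fields (`family_id`, `mz_pair_id`, `trait`, `pgs`) are hypothetical, and taking the polygenic score from the first twin assumes that co-twins have effectively identical scores.

```python
from collections import defaultdict
from statistics import mean


def collapse_mz_twins(records):
    """Average trait scores within each monozygotic pair.

    records -- list of dicts with keys 'family_id', 'mz_pair_id'
               (None for individuals who are not part of an MZ pair),
               'trait' and 'pgs'.
    Returns a new list in which each MZ pair contributes a single case.
    """
    singles = []
    mz_groups = defaultdict(list)
    for record in records:
        if record["mz_pair_id"] is None:
            singles.append(dict(record))
        else:
            key = (record["family_id"], record["mz_pair_id"])
            mz_groups[key].append(record)

    collapsed = []
    for (family_id, pair_id), members in mz_groups.items():
        collapsed.append({
            "family_id": family_id,
            "mz_pair_id": pair_id,
            "trait": mean(m["trait"] for m in members),
            # ASSUMPTION: co-twins share the same polygenic score.
            "pgs": members[0]["pgs"],
        })
    return singles + collapsed


if __name__ == "__main__":
    # Hypothetical records: one MZ pair, a sibling and an unrelated individual.
    sample = [
        {"family_id": "F1", "mz_pair_id": "P1", "trait": 10.0, "pgs": 0.4},
        {"family_id": "F1", "mz_pair_id": "P1", "trait": 12.0, "pgs": 0.4},
        {"family_id": "F1", "mz_pair_id": None, "trait": 9.0, "pgs": 0.1},
        {"family_id": "F2", "mz_pair_id": None, "trait": 11.0, "pgs": -0.2},
    ]
    for row in collapse_mz_twins(sample):
        print(row)
```

The collapsed records retain `family_id`, so family membership can still be modeled as a random effect for the remaining relatives.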

Evaluation of candidates from previous literature

We used the results of the 23andMe dyslexia GWAS to assess variants, genes and biological pathways previously associated with or implicated in dyslexia and/or variation in reading and spelling ability in past association studies, linkage analyses and other studies.

Previously reported variants

We assessed 75 previously reported variants within our summary statistics, adopting a replication/validation significance threshold of P < 7.28 × 10−4. This threshold was obtained by Bonferroni correction for 68.7 independent tests, a number estimated by matrix spectral decomposition to account for LD between variants (see Doust et al.25 for details on how these variants were selected). The sources for each variant are provided in Supplementary Table 26.

Dyslexia candidate genes

We evaluated gene-based results from MAGMA v.1.08 (ref. 56) for overrepresentation of genome-wide significant variants from the 23andMe dyslexia GWAS within the loci of 14 candidate genes from earlier literature: CMIP, CNTNAP2, CYP19A1, DCDC2, DIP2A, DYX1C1, GCFC2, KIAA0319, KIAA0319L, MRPL19, PCNT, PRMT2, S100B and ROBO1. The rationale for this selection is detailed by Luciano et al.24 and Doust et al.25. The critical P value, based on Bonferroni correction for 14 tests, was 3.57 × 10−3.

Candidate dyslexia gene sets

We performed a gene set analysis in MAGMA to test for overrepresentation of genome-wide significant variants within (1) a set of transcriptional targets of FOXP2, a highly conserved transcription factor linked to speech and language impairment99; and (2) two biological pathways previously suggested to play a role in dyslexia susceptibility100,101—axon guidance (GO:0007411: ‘chemotaxis process that directs the migration of an axon growth cone to a specific target site’; 216 genes) and neuron migration (GO:0001764: ‘movement of an immature neuron from germinal zones to specific positions where they will reside as they mature’; 145 genes). An adjusted critical P value of 0.017 was derived using Bonferroni correction based on three independent tests.

Ethical standards

Participants provided informed consent and participated in the research online, under a protocol approved by the external AAHRPP-accredited IRB, Ethical and Independent Review Services. Participants were included in the analysis on the basis of consent status as checked at the time data analyses were initiated.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.