An expanded sequence context model broadly explains variability in polymorphism levels across the human genome

  • Analysis

From Nature Genetics

Abstract

The rate of single-nucleotide polymorphism varies substantially across the human genome and fundamentally influences evolution and incidence of genetic disease. Previous studies have only considered the immediately flanking nucleotides around a polymorphic site—the site's trinucleotide sequence context—to study polymorphism levels across the genome. Moreover, the impact of larger sequence contexts has not been fully clarified, even though context substantially influences rates of polymorphism. Using a new statistical framework and data from the 1000 Genomes Project, we demonstrate that a heptanucleotide context explains >81% of variability in substitution probabilities, highlighting new mutation-promoting motifs at ApT dinucleotide, CAAT and TACG sequences. Our approach also identifies previously undocumented variability in C-to-T substitutions at CpG sites, which is not immediately explained by differential methylation intensity. Using our model, we present informative substitution intolerance scores for genes and a new intolerance score for amino acids, and we demonstrate clinical use of the model in neuropsychiatric diseases.

Figure 1: C-to-T substitution probabilities and methylation patterns in 7-mer CpG sequence contexts.
Figure 2: Posterior probabilities of all classes of nucleotide substitution in the intergenic noncoding genome, estimated using the 7-mer context model.
Figure 3: Prioritizing pathogenic variants and causal genes using constraint scores.
Figure 4: Application of gene and amino acid intolerance scores to de novo autism spectrum disorder mutational data.

References

  1. Hodgkinson, A. & Eyre-Walker, A. Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12, 756–766 (2011).

  2. Ehrlich, M. & Wang, R.Y. 5-methylcytosine in eukaryotic DNA. Science 212, 1350–1357 (1981).

  3. Rideout, W.M. III, Coetzee, G.A., Olumi, A.F. & Jones, P.A. 5-methylcytosine as an endogenous mutagen in the human LDL receptor and p53 genes. Science 249, 1288–1290 (1990).

  4. Arbiza, L. et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nat. Genet. 45, 723–729 (2013).

  5. Yang, Y. et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N. Engl. J. Med. 369, 1502–1511 (2013).

  6. Hwang, D.G. & Green, P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc. Natl. Acad. Sci. USA 101, 13994–14001 (2004).

  7. Blake, R.D., Hess, S.T. & Nicholson-Tuell, J. The influence of nearest neighbors on the rate and pattern of spontaneous point mutations. J. Mol. Evol. 34, 189–200 (1992).

  8. Neale, B.M. et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485, 242–245 (2012).

  9. Michaelson, J.J. et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442 (2012).

  10. Fromer, M. et al. De novo mutations in schizophrenia implicate synaptic networks. Nature 506, 179–184 (2014).

  11. Lawrence, M.S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).

  12. Samocha, K.E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).

  13. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).

  14. International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).

  15. Campbell, M.C. & Tishkoff, S.A. African genetic diversity: implications for human demographic history, modern human origins, and complex disease mapping. Annu. Rev. Genomics Hum. Genet. 9, 403–433 (2008).

  16. Schaffner, S.F. The X chromosome in population genetics. Nat. Rev. Genet. 5, 43–51 (2004).

  17. Nachman, M.W. & Crowell, S.L. Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304 (2000).

  18. Mugal, C.F. & Ellegren, H. Substitution rate variation at human CpG sites correlates with non-CpG divergence, methylation level and GC content. Genome Biol. 12, R58 (2011).

  19. Okae, H. et al. Genome-wide analysis of DNA methylation dynamics during early human development. PLoS Genet. 10, e1004868 (2014).

  20. Hovestadt, V. et al. Decoding the regulatory landscape of medulloblastoma using DNA methylation sequencing. Nature 510, 537–541 (2014).

  21. Walser, J.-C. & Furano, A.V. The mutational spectrum of non-CpG DNA varies with CpG content. Genome Res. 20, 875–882 (2010).

  22. Kamiya, H. et al. Mutagenicity of 5-formylcytosine, an oxidation product of 5-methylcytosine, in DNA in mammalian cells. J. Biochem. 132, 551–555 (2002).

  23. Deaton, A.M. & Bird, A. CpG islands and the regulation of transcription. Genes Dev. 25, 1010–1022 (2011).

  24. Levinson, G. & Gutman, G.A. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 4, 203–221 (1987).

  25. Panchin, A.Y., Mitrofanov, S.I., Alexeevski, A.V., Spirin, S.A. & Panchin, Y.V. New words in human mutagenesis. BMC Bioinformatics 12, 268 (2011).

  26. Lanfear, R., Welch, J.J. & Bromham, L. Watching the clock: studying variation in rates of molecular evolution between species. Trends Ecol. Evol. 25, 495–503 (2010).

  27. Kong, A. et al. Rate of de novo mutations and the importance of father's age to disease risk. Nature 488, 471–475 (2012).

  28. Bustamante, C.D. et al. Natural selection on protein-coding genes in the human genome. Nature 437, 1153–1157 (2005).

  29. Fu, W. et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220 (2013).

  30. Cooper, G.M. & Shendure, J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat. Rev. Genet. 12, 628–640 (2011).

  31. Stenson, P.D. et al. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum. Genet. 133, 1–9 (2014).

  32. Petrovski, S., Wang, Q., Heinzen, E.L., Allen, A.S. & Goldstein, D.B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).

  33. Georgi, B., Voight, B.F. & Bućan, M. From mouse to human: evolutionary genomics analysis of human orthologs of essential genes. PLoS Genet. 9, e1003484 (2013).

  34. Uddin, M. et al. Brain-expressed exons under purifying selection are enriched for de novo mutations in autism spectrum disorder. Nat. Genet. 46, 742–747 (2014).

  35. De Rubeis, S. et al. Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209–215 (2014).

  36. Epi4K Consortium & Epilepsy Phenome/Genome Project. De novo mutations in epileptic encephalopathies. Nature 501, 217–221 (2013).

  37. Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223–228 (2015).

  38. Hamdan, F.F. et al. De novo mutations in moderate or severe intellectual disability. PLoS Genet. 10, e1004772 (2014).

  39. Rauch, A. et al. Range of genetic mutations associated with severe non-syndromic sporadic intellectual disability: an exome sequencing study. Lancet 380, 1674–1682 (2012).

  40. de Ligt, J. et al. Diagnostic exome sequencing in persons with severe intellectual disability. N. Engl. J. Med. 367, 1921–1929 (2012).

  41. Ginsburg, D. & Bowie, E.J. Molecular genetics of von Willebrand disease. Blood 79, 2507–2519 (1992).

  42. Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014).

  43. Orosco, L.A. et al. Loss of Wdfy3 in mice alters cerebral cortical neurogenesis reflecting aspects of the autism pathology. Nat. Commun. 5, 4692 (2014).

  44. Eyre-Walker, A. & Eyre-Walker, Y.C. How much of the variation in the mutation rate along the human genome can be explained? G3 (Bethesda) 4, 1667–1670 (2014).

  45. Kimura, M. & Ohta, T. On some principles governing molecular evolution. Proc. Natl. Acad. Sci. USA 71, 2848–2852 (1974).

  46. Ségurel, L., Wyman, M.J. & Przeworski, M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014).

  47. Hussin, J.G. et al. Recombination affects accumulation of damaging and disease-associated mutations in human populations. Nat. Genet. 47, 400–404 (2015).

  48. Koren, A. et al. Genetic variation in human DNA replication timing. Cell 159, 1015–1026 (2014).

  49. Siepel, A. & Haussler, D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 21, 468–488 (2004).

Acknowledgements

We thank C. Brown, M. Bucan, P. Babb, K. Siewert, K. Johnson, S. Bumgarner and two anonymous reviewers for helpful comments on the manuscript. B.F.V. is grateful for support of the work from the Alfred P. Sloan Foundation (BR2012-087), the American Heart Association (13SDG14330006), the W.W. Smith Charitable Trust (H1201) and the US National Institutes of Health/National Institute of Diabetes and Digestive and Kidney Diseases (R01DK101478).

Author information

Contributions

V.A. and B.F.V. conceived and designed the experiments, developed the model, performed the statistical analysis, developed and contributed analysis tools, and wrote the manuscript. B.F.V. supervised the research.

Corresponding author

Correspondence to Benjamin F Voight.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Illustration of the intuition supporting our substitution probability model.

(a) Defining the non-Bayesian probability and Bayesian posterior probability of nucleotide substitution for a 7-mer context. Here we use the example CTACGAT, where position 4 is the polymorphic site and the three nucleotides located 5′ and 3′ constitute the remainder of that site’s local 7-mer sequence context. We count (i) the number of occurrences of that 7-mer context found in the reference genome and (ii) the number of times we observe a polymorphic substitution at position 4. The example shown here is a C-to-T substitution. To generate the posterior probabilities, we sum the observed counts of occurrences and substitutions with a count obtained from the modeled prior. We apply this mathematics to all 7-mer sequence contexts for all substitution classes and then merge the reverse-complementary pairs (the A-to-C class was merged with the T-to-G class, etc.). This results in a total of 24,576 parameters, each representing a unique 7-mer sequence context. (b) Illustration showing how the same 7-mer sequence context on different codon frames leads to different types of amino acid change. Depicted are three cases where a C-to-T substitution that occurs in the sequence context CTA[C/T]GAT at either position 1, 2 or 3 of a codon results in a synonymous, nonsynonymous or nonsense change in amino acid identity.
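The counting scheme in (a) and the codon-frame effect in (b) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the prior pseudo-counts passed to `posterior_probability` are placeholders, because the caption does not state how the modeled prior is parameterized. The sketch only shows that merging reverse-complementary pairs yields 24,576 parameters and that the example context CTA[C/T]GAT gives a different class of amino acid change in each codon frame.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Standard genetic code, with codons enumerated in TCAG order.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(("".join(c) for c in product("TCAG", repeat=3)),
                       AMINO_ACIDS))


def canonical(context, alt):
    """Merge reverse-complementary pairs: report every substitution on the
    strand where the central (polymorphic) base is A or C."""
    if context[len(context) // 2] in "AC":
        return context, alt
    return context.translate(COMPLEMENT)[::-1], alt.translate(COMPLEMENT)


def count_parameters(k=7):
    """Number of distinct (context, substitution) pairs after merging strands."""
    keys = set()
    for kmer in map("".join, product(BASES, repeat=k)):
        for alt in BASES:
            if alt != kmer[k // 2]:
                keys.add(canonical(kmer, alt))
    return len(keys)


def posterior_probability(n_sub, n_context, prior_sub, prior_context):
    """Observed counts summed with prior counts (prior values are hypothetical)."""
    return (n_sub + prior_sub) / (n_context + prior_context)


def codon_effect(context, alt, codon_position):
    """Classify the change when the central base sits at codon position 1, 2 or 3."""
    mid = len(context) // 2
    start = mid - (codon_position - 1)
    mutated = context[:mid] + alt + context[mid + 1:]
    ref_aa = CODON_TABLE[context[start:start + 3]]
    alt_aa = CODON_TABLE[mutated[start:start + 3]]
    if ref_aa == alt_aa:
        return "synonymous"
    return "nonsense" if alt_aa == "*" else "nonsynonymous"


print(count_parameters())  # 24576
for position in (1, 2, 3):
    print(position, codon_effect("CTACGAT", "T", position))
```

With the central C at codon position 1 the codon CGA becomes the stop codon TGA, at position 2 ACG (Thr) becomes ATG (Met), and at position 3 TAC and TAT both encode Tyr.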

Supplementary Figure 2 Scatter plot of nucleotide substitution probabilities for each 7-mer sequence context, inferred from 1000 Genomes and HapMap variants.

The substitution probabilities in both cases are strongly correlated with each other (R² = 0.91, P << 10⁻¹⁰⁰).

Supplementary Figure 3 Genome-wide nucleotide substitution probabilities are correlated across different human populations.

(a) The nucleotide substitution probabilities estimated from the 1-mer model for three human population groups (African, European and Asian) obtained from the 1000 Genomes Project. (b) The nucleotide substitution probabilities estimated from the 7-mer context in the same three populations. Because the x axis for this plot represents 24,576 sequence contexts, it was not practical to list them individually as was done in a. The contexts are represented graphically, sorted from lowest to highest nucleotide substitution probability, as observed in the African group. Data for the European and Asian groups were then represented according to the order obtained for the African group, to make comparison possible across the populations for any given sequence context.
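The ordering used for panel b can be sketched as follows. This is an illustrative assumption about the bookkeeping, not the authors' plotting code: the function name, the population labels and the toy probabilities are all hypothetical. Contexts are ranked once by their probability in the reference population, and every other population is read out in that same order.

```python
def order_by_reference(probabilities, reference):
    """Sort contexts by the reference population's probability and
    return every population's values in that shared order."""
    ref = probabilities[reference]
    order = sorted(ref, key=ref.get)
    aligned = {pop: [values[ctx] for ctx in order]
               for pop, values in probabilities.items()}
    return order, aligned


# Toy values for three contexts only (placeholders, not data from the paper).
toy = {
    "AFR": {"CTACGAT": 0.12, "AAATAAA": 0.01, "GGCAATC": 0.03},
    "EUR": {"CTACGAT": 0.10, "AAATAAA": 0.02, "GGCAATC": 0.03},
}
order, aligned = order_by_reference(toy, "AFR")
print(order)
print(aligned["EUR"])
```

Because the x axis is fixed by one population, any context where the other curves depart from a monotone increase marks a between-population difference.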

Supplementary Figure 4 Comparison of observed and expected C-to-T substitution probabilities within a 7-mer CpG sequence context.

Supplementary Figure 5 C-to-T substitution probabilities and methylation patterns.

Probabilities of C-to-T substitutions are shown for the following sequence contexts: CpG Me−, CpG 7-mer contexts that were unmethylated in all sperm samples; CpG Me+, CpG 7-mer contexts that were methylated in all sperm samples. ***P << 10⁻¹⁰⁰.

Supplementary Figure 6 Correlation between average methylation intensity and probability of C-to-T substitution in the CpG 7-mer context.

(a) Scatterplot of average methylation intensity in brain samples against substitution probability at each 7-mer CpG context. (b) Scatterplot of average methylation intensity in oocyte samples against substitution probability at each 7-mer CpG context. (c) Scatterplot of average methylation intensity in blood samples against substitution probability at each 7-mer CpG context. (d) Scatterplot of average methylation intensity in blastocyst samples against substitution probability at each 7-mer CpG context. In all cases, the substitution probability is moderately correlated (R² ~0.3) with methylation intensity at each 7-mer CpG sequence context.

Supplementary Figure 7 Substitution probabilities at 7-mer CpG sequence contexts and the distance of the contexts from genes.

Box-and-whisker plot of the distances between sequence contexts that contain a CpG site (C at polymorphic position 4, fixed G at position 5) and the gene nearest to that context found in the human reference genome. LOW plots the distances from sequence contexts with the smallest 1% of substitution probabilities in the C-to-T substitution class (n = 10). ALL represents the distances from all sequence contexts containing a CpG (n = 1,024). HIGH represents the distances from sequence contexts with the largest 1% of substitution probabilities in the C-to-T substitution class (n = 10). Each distribution is significantly different from the others (pairwise P << 10⁻¹⁰⁰ by Wilcoxon rank-sum test).
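The group sizes quoted in this caption follow from simple counting, which the sketch below reproduces. It is a hedged illustration: the helper names are hypothetical, the probabilities are placeholders, and rounding 1% of 1,024 down to 10 is an assumption that happens to match the stated n = 10. Fixing C at position 4 and G at position 5 leaves five free positions, hence 4⁵ = 1,024 contexts.

```python
from itertools import product


def cpg_contexts():
    """All 7-mers with C at position 4 and G at position 5 (1-based)."""
    return [f"{a}{b}{c}CG{d}{e}"
            for a, b, c, d, e in product("ACGT", repeat=5)]


def tails(prob_by_context, fraction=0.01):
    """Return the contexts in the lowest and highest `fraction` of
    probabilities (group size rounded down, an assumption)."""
    ranked = sorted(prob_by_context, key=prob_by_context.get)
    n = int(len(ranked) * fraction)
    return ranked[:n], ranked[-n:]


contexts = cpg_contexts()
# Placeholder probabilities: the index stands in for a real estimate.
placeholder = {ctx: i for i, ctx in enumerate(contexts)}
low, high = tails(placeholder)
print(len(contexts), len(low), len(high))  # 1024 10 10
```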

Supplementary Figure 8 Methylation intensity values in various sequence contexts containing a CpG site.

Box-and-whisker plot of methylation intensity values in various sequence contexts containing a CpG site. Methylation intensity represents the average intensity values across all sperm samples. Poly-CpG represents sequence contexts that segregate additional CpG dinucleotides beyond the CpG site at positions 4 and 5 (note that a 7-mer sequence context with a CpG site can segregate up to two additional CpG dinucleotides). Each distribution is significantly different from the others (pairwise P < 10⁻⁵ by Wilcoxon rank-sum test).
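The note that a CpG 7-mer can carry at most two additional CpG dinucleotides can be checked by enumeration. The sketch below is only a verification aid with a hypothetical function name. An extra CpG cannot overlap the central one, so it can sit at positions 1–2 or 2–3 on the 5′ side and at positions 6–7 on the 3′ side.

```python
from itertools import product


def extra_cpg_count(context):
    """Count CpG dinucleotides other than the central one at positions 4-5."""
    if context[3:5] != "CG":
        raise ValueError("expected C at position 4 and G at position 5")
    return sum(context[i:i + 2] == "CG" for i in range(6) if i != 3)


all_cpg_7mers = [f"{a}{b}{c}CG{d}{e}"
                 for a, b, c, d, e in product("ACGT", repeat=5)]
print(max(extra_cpg_count(ctx) for ctx in all_cpg_7mers))  # 2
print(extra_cpg_count("ACGCGCG"))                          # 2
```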

Supplementary Figure 9 Nucleotide substitution probabilities and recombination rate.

Scatterplot of nucleotide substitution probabilities inferred from only 1000 Genomes regions with a high recombination rate (>3 cM/Mb in the YRI population) and separately from regions with a low recombination rate (<0.05 cM/Mb in the YRI population) for each change in a 7-mer sequence context. The substitution probabilities in both cases are strongly correlated with each other (R² = 0.97, P << 10⁻¹⁰⁰).

Supplementary Figure 10 Human substitution probabilities are strongly correlated with human-chimpanzee and human-macaque divergence rates.

(a) Scatterplot of nucleotide substitution probabilities against nucleotide divergence rates between human and chimpanzee at each 7-mer sequence context. (b) Scatterplot of nucleotide substitution probabilities against nucleotide divergence rates between human and macaque at each 7-mer sequence context. In both cases, the substitution probabilities and divergence rates are strongly correlated with each other (R² = 0.96, P << 10⁻¹⁰⁰).

Supplementary Figure 11 Substitution probabilities across the variant frequency spectrum.

Scatterplot of nucleotide substitution probabilities inferred from only 1000 Genomes low to high frequency variants (MAF ≥1%) and separately from rare variants (singletons and doubletons only) for each change in a 7-mer sequence context. The substitution probabilities in both cases are strongly correlated with each other (R² = 0.98, P << 10⁻¹⁰⁰).

Supplementary Figure 12 Nucleotide substitution probabilities in the coding genome.

Posterior probabilities of nucleotide substitution for each type of amino acid substitution in the coding genome, estimated using the 7-mer coding context model. Sequence contexts are further stratified by color to indicate presence of a CpG (C at the polymorphic position 4 and G at position 5, for C-to-A, C-to-G and C-to-T substitution classes = CpG+; otherwise, CpG−) and where evidence of substitution was only observed in the intergenic region. The inset shows a magnified view specifically of the distribution for nonsense substitutions.

Supplementary Figure 13 Violin plot for trends in amino acid replacement types across different amino acids.

(a) Note that the mean probability is different for glycine and tyrosine substitutions, although the expected trend holds (synonymous > missense > nonsense). (b) Some amino acid substitutions deviate from this expected trend owing to the CpG context in the coding genome.

Supplementary Figure 14 The 7-mer context model improves power to detect pathogenic variants.

Log₁₀ ratios of substitution probabilities for the 3-mer model with codon context for coding sequences matched to noncoding sequences for each type of amino acid replacement. We consider all variants from the 1000 Genomes Project (African, yellow) or the Human Gene Mutation Database (HGMD; orange). Larger values indicate fewer substitutions in the coding genome than expected from matched noncoding sequences (intolerance), consistent with selective constraint acting on these replacements. **P < 10⁻⁵³; NS, not significant by Wilcoxon rank-sum test.
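One way to read the plotted quantity is as a log ratio of the matched noncoding probability to the coding probability. The sketch below assumes that orientation, which is inferred from the statement that larger values mean fewer coding substitutions than expected. The function name and the example numbers are hypothetical.

```python
import math


def log10_intolerance(p_noncoding, p_coding):
    """log10 of the matched noncoding substitution probability over the
    coding one; positive values mean fewer coding substitutions than expected."""
    if p_noncoding <= 0 or p_coding <= 0:
        raise ValueError("probabilities must be positive")
    return math.log10(p_noncoding / p_coding)


# Placeholder probabilities, not values from the paper.
print(log10_intolerance(0.01, 0.001))  # about 1: tenfold depletion in coding
print(log10_intolerance(0.01, 0.01))   # 0.0: no depletion
```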

Supplementary Figure 15 The gene scores calculated from 1000 Genomes or EVS (European populations) data sets are correlated with each other.

Supplementary Figure 16 Comparison and correlation of various gene score measures.

(a,b) Comparison of our presented gene score (Aggarwala) built from the 1000 Genomes African group using the coding 7-mer model with the scores presented by Petrovski et al. (a) and Samocha et al. (b). Note that in a, not all HGNC gene IDs could be mapped to Ensembl 75 genes, and in b only a subset of gene scores was publicly available.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–16 and Supplementary Note. (PDF 2799 kb)

Supplementary Tables 1–17

Supplementary Tables 1–17. (XLSX 16896 kb)

About this article

Cite this article

Aggarwala, V., Voight, B. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat Genet 48, 349–355 (2016). https://doi.org/10.1038/ng.3511

  • DOI: https://doi.org/10.1038/ng.3511

  • Springer Nature America, Inc.
