Abstract
Deep catalogs of genetic variation from thousands of humans enable the detection of intraspecies constraint by identifying coding regions with a scarcity of variation. While existing techniques summarize constraint for entire genes, single gene-wide metrics conceal regional constraint variability within each gene. Therefore, we have created a detailed map of constrained coding regions (CCRs) by leveraging variation observed among 123,136 humans from the Genome Aggregation Database. The most constrained CCRs are enriched for pathogenic variants in ClinVar and mutations underlying developmental disorders. CCRs highlight protein domain families under high constraint and suggest unannotated or incomplete protein domains. The highest-percentile CCRs complement existing variant prioritization methods when evaluating de novo mutations in studies of autosomal dominant disease. Finally, we identify highly constrained CCRs within genes lacking known disease associations. This observation suggests that CCRs may identify regions under strong purifying selection that, when mutated, cause severe developmental phenotypes or embryonic lethality.
Data availability
The segmental duplications can be found at ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/genomicSuperDups.txt.gz. The self-chains can be found at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chainSelf.txt.gz. The Pfam domains can be found at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ucscGenePfam.txt.gz. The Ensembl exons file can be found at ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz. The gnomAD file can be found at https://storage.googleapis.com/gnomad-public/release/2.0.1/vcf/exomes/gnomad.exomes.r2.0.1.sites.vcf.gz. The gnomAD coverage files can be found at the location indicated by the pattern below: https://storage.googleapis.com/gnomad-public/release/2.0.1/coverage/exomes/gnomad.exomes.r2.0.1.chr$chrom.coverage.txt.gz. The CADD files for both indels and SNPs can be found at http://krishna.gs.washington.edu/download/CADD/v1.3/InDels.tsv.gz and http://krishna.gs.washington.edu/download/CADD/v1.3/whole_genome_SNVs.tsv.gz. The GERP++ file can be found at http://mendel.stanford.edu/SidowLab/downloads/gerp/hg19.GERP_scores.tar.gz. The file for MPC can be found at ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/regional_missense_constraint/fordist_constraint_official_mpc_values.txt.gz. The whole-exome MTR file can be found, courtesy of the author, at http://mtr-viewer.mdhs.unimelb.edu.au:8079/mtrflatfile_1.0.txt.gz. The REVEL file can be found at https://rothsj06.u.hpc.mssm.edu/revel/revel_all_chromosomes.csv.zip. The file for pLI can be found at ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/manuscript_data/forweb_cleaned_exac_r03_march16_z_data_pLI.txt.gz. The ClinVar VCF file used in the analyses can be found at ftp://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2017/clinvar_20170802.vcf.gz. Lastly, the de novo variants file from ref. 41 can be found on our s3 server at https://s3.us-east-2.amazonaws.com/pathoscore-data/samocha/samochadenovo.xlsx.
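The gnomAD coverage files are given only as a pattern containing a `$chrom` placeholder. As a minimal sketch, the pattern can be expanded into one URL per chromosome. The helper name `coverage_urls` is hypothetical, and the default chromosome set (1 to 22, X, and Y) is an assumption, since the text does not list which chromosomes have coverage files.

```python
# Pattern copied from the data availability statement, with $chrom
# rewritten as a Python format field.
COVERAGE_PATTERN = (
    "https://storage.googleapis.com/gnomad-public/release/2.0.1/coverage/exomes/"
    "gnomad.exomes.r2.0.1.chr{chrom}.coverage.txt.gz"
)


def coverage_urls(chroms=None):
    """Return one coverage URL per chromosome label (hypothetical helper)."""
    if chroms is None:
        # Assumption: autosomes 1-22 plus X and Y.
        chroms = [str(i) for i in range(1, 23)] + ["X", "Y"]
    return [COVERAGE_PATTERN.format(chrom=c) for c in chroms]


if __name__ == "__main__":
    for url in coverage_urls(["1", "X"]):
        print(url)
```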
References
Wallis, W. A. The statistical research group, 1942–1945. J. Am. Stat. Assoc. 75, 320–330 (1980).
Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
Fu, W. et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220 (2013).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–D285 (2016).
Letunic, I., Doerks, T. & Bork, P. SMART 7: recent updates to the protein domain annotation resource. Nucleic Acids Res. 40, D302–D305 (2012).
Tatusov, R. L. et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41 (2003).
Klimke, W. et al. The National Center For Biotechnology Information’s Protein Clusters Database. Nucleic Acids Res. 37, D216–D223 (2009).
Haft, D. H., Selengut, J. D. & White, O. The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371–373 (2003).
Bailey, J. A. et al. Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002).
Cabanski, C. R. et al. BlackOPs: increasing confidence in variant detection through mappability filtering. Nucleic Acids Res. 41, e178 (2013).
Lister, R. et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462, 315–322 (2009).
Aggarwala, V. & Voight, B. F. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat. Genet. 48, 349–355 (2016).
Mugal, C. F. & Ellegren, H. Substitution rate variation at human CpG sites correlates with non-CpG divergence, methylation level and GC content. Genome Biol. 12, R58 (2011).
Carlson, J. et al. Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. Preprint at bioRxiv https://doi.org/10.1101/108290 (2017).
Yates, A. et al. Ensembl 2016. Nucleic Acids Res. 44, D710–D716 (2016).
Marfella, C. G. A. & Imbalzano, A. N. The Chd family of chromatin remodelers. Mutat. Res. 618, 30–40 (2007).
Van Houdt, J. K. J. et al. Heterozygous missense mutations in SMARCA2 cause Nicolaides-Baraitser syndrome. Nat. Genet. 44, 445–449 (2012).
Spataro, N., Rodríguez, J. A., Navarro, A. & Bosch, E. Properties of human disease genes and the role of genes linked to Mendelian disorders in complex disease aetiology. Hum. Mol. Genet. 26, 489–500 (2017).
Gibson, J., Tapper, W., Ennis, S. & Collins, A. Exome-based linkage disequilibrium maps of individual genes: functional clustering and relationship to disease. Hum. Genet. 132, 233–243 (2013).
Collins, A. The genomic and functional characteristics of disease genes. Brief. Bioinform. 16, 16–23 (2014).
Lelieveld, S. H. et al. Spatial clustering of de novo missense mutations identifies candidate neurodevelopmental disorder-associated genes. Am. J. Hum. Genet. 101, 478–484 (2017).
Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).
Gussow, A. B., Petrovski, S., Wang, Q., Allen, A. S. & Goldstein, D. B. The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes. Genome Biol. 17, 9 (2016).
Lee, M. P. et al. Low frequency of p57KIP2 mutation in Beckwith-Wiedemann syndrome. Am. J. Hum. Genet. 61, 304–309 (1997).
Romanelli, V. et al. CDKN1C (p57(Kip2)) analysis in Beckwith-Wiedemann syndrome (BWS) patients: genotype-phenotype correlations, novel mutations, and polymorphisms. Am. J. Med. Genet. A 152A, 1390–1397 (2010).
Higashimoto, K., Soejima, H., Saito, T., Okumura, K. & Mukai, T. Imprinting disruption of the CDKN1C/KCNQ1OT1 domain: the molecular mechanisms causing Beckwith-Wiedemann syndrome and cancer. Cytogenet. Genome Res. 113, 306–312 (2006).
Baran, Y. et al. The landscape of genomic imprinting across diverse adult human tissues. Genome Res. 25, 927–936 (2015).
Finn, R. D. et al. The Pfam protein families database. Nucleic Acids Res. 38, D211–D222 (2010).
Weckhuysen, S. et al. KCNQ2 encephalopathy: emerging phenotype of a neonatal epileptic encephalopathy. Ann. Neurol. 71, 15–25 (2012).
Tinel, N., Lauritzen, I., Chouabe, C., Lazdunski, M. & Borsotto, M. The KCNQ2 potassium channel: splice variants, functional and developmental expression. Brain localization and comparison with KCNQ3. FEBS Lett. 438, 171–176 (1998).
Ocorr, K. et al. KCNQ potassium channel mutations cause cardiac arrhythmias in Drosophila that mimic the effects of aging. Proc. Natl Acad. Sci. USA 104, 3943–3948 (2007).
Mark, M., Rijli, F. M. & Chambon, P. Homeobox genes in embryogenesis and pathogenesis. Pediatr. Res. 42, 421–429 (1997).
Stevenson, R. E. in GeneReviews (eds Adam, M. P. et al.) (Univ. Washington, 1993–2018).
Higgs, D. R. et al. Understanding α-globin gene regulation: aiming to improve the management of thalassemia. Ann. NY Acad. Sci. 1054, 92–102 (2005).
Baker, L. A., Allis, C. D. & Wang, G. G. PHD fingers in human diseases: disorders arising from misinterpreting epigenetic marks. Mutat. Res. 647, 3–12 (2008).
Musselman, C. A. & Kutateladze, T. G. PHD fingers: epigenetic effectors and potential drug targets. Mol. Interv. 9, 314–323 (2009).
Matthews, A. G. W. et al. RAG2 PHD finger couples histone H3 lysine 4 trimethylation with V(D)J recombination. Nature 450, 1106–1110 (2007).
Nishimura, K., Lee, S. B., Park, J. H. & Park, M. H. Essential role of eIF5A-1 and deoxyhypusine synthase in mouse embryonic development. Amino Acids 42, 703–710 (2012).
Samocha, K. E. et al. Regional missense constraint improves variant deleteriousness prediction. Preprint at bioRxiv https://doi.org/10.1101/148353 (2017).
de Ligt, J. et al. Diagnostic exome sequencing in persons with severe intellectual disability. N. Engl. J. Med. 367, 1921–1929 (2012).
Rauch, A. et al. Range of genetic mutations associated with severe non-syndromic sporadic intellectual disability: an exome sequencing study. Lancet 380, 1674–1682 (2012).
Lelieveld, S. H. et al. Meta-analysis of 2,104 trios provides support for 10 new genes for intellectual disability. Nat. Neurosci. 19, 1194–1196 (2016).
Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223–228 (2015).
Deciphering Developmental Disorders Study. Prevalence and architecture of de novo mutations in developmental disorders. Nature 542, 433–438 (2017).
Epi4K Consortium. et al. De novo mutations in epileptic encephalopathies. Nature 501, 217–221 (2013).
Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014).
De Rubeis, S. et al. Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209–215 (2014).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885 (2016).
Traynelis, J. et al. Optimizing genomic medicine in epilepsy through a gene-customized approach to missense variant interpretation. Genome Res. 27, 1715–1729 (2017).
Youden, W. J. Index for rating diagnostic tests. Cancer 3, 32–35 (1950).
Kosmicki, J. A. et al. Refining the role of de novo protein-truncating variants in neurodevelopmental disorders by using population reference samples. Nat. Genet. 49, 504–510 (2017).
Turner, T. N. et al. Genomic patterns of de novo mutation in simplex autism. Cell 171, 710–722 (2017).
Werling, D. M. et al. An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder. Nat. Genet. 50, 727–736 (2018).
Homsy, J. et al. De novo mutations in congenital heart disease with neurodevelopmental and other congenital anomalies. Science 350, 1262–1266 (2015).
Keinan, A. & Clark, A. G. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science 336, 740–743 (2012).
Zou, J. et al. Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nat. Commun. 7, 13293 (2016).
Villard, E. et al. Mutation screening in dilated cardiomyopathy: prominent role of the beta myosin heavy chain gene. Eur. Heart J. 26, 794–803 (2005).
Tan, A., Abecasis, G. R. & Kang, H. M. Unified representation of genetic variants. Bioinformatics 31, 2202–2204 (2015).
McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
Berg, J. S. et al. An informatics approach to analyzing the incidentalome. Genet. Med. 15, 36–44 (2013).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Mi, H., Muruganujan, A., Casagrande, J. T. & Thomas, P. D. Large-scale gene function analysis with the PANTHER classification system. Nat. Protoc. 8, 1551–1566 (2013).
Acknowledgements
We acknowledge W. Pearson, C. Feschotte, J. Seger, G. Marth, N. Elde, and S. Kravitz for insightful discussions that motivated some of the analyses presented in this manuscript. We also thank the investigators who contributed to and created the Genome Aggregation Database for openly sharing the genetic variation datasets that facilitated our research. A.R.Q. was supported by the US National Institutes of Health through grants from the National Human Genome Research Institute (R01HG006693 and R01HG009141), the National Institute of General Medical Sciences (R01GM124355), and the National Cancer Institute (U24CA209999). R.M.L. was supported by a K99 award from the National Human Genome Research Institute (K99HG009532).
Author information
Contributions
A.R.Q. conceived the research question and organized the study. J.M.H. led the research and analysis. J.M.H., B.S.P., R.M.L., and A.R.Q. designed the constrained coding region model and contributed to the analyses. J.M.H. and A.R.Q. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
Supplementary Figure 1 Evaluation of CCR models by sequencing coverage threshold.
Evaluation of CCR models constructed using different coverage thresholds and different thresholds for the percentage of gnomAD individuals meeting the minimum coverage depth. For example, ‘10x.5 CCR’ reflects a CCR model where every position in a CCR was required to have 10× coverage in at least 50% of gnomAD individuals. a, ROC curve based on the ClinVar variant set. b, PR curve based on ClinVar. True positives are pathogenic and likely pathogenic variants from ClinVar. True negatives are variants labeled as benign in ClinVar. The models perform very similarly; the ‘10x.5 CCR’ model imposed the most relaxed coverage requirement while exhibiting the highest performance, so its coverage threshold was chosen for the final model. The evaluation dataset comprised 24,554 pathogenic and 4,689 benign variants from ClinVar.
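The ‘10x.5’ rule described in this legend can be sketched as a per-region check. This is an illustration only, not the authors' implementation: the function name `passes_coverage` is hypothetical, and the input is assumed to be, for each position in a candidate region, the fraction of individuals with at least 10× coverage.

```python
def passes_coverage(frac_at_10x, min_fraction=0.5):
    """Return True if every position meets the coverage requirement.

    frac_at_10x: per-position fractions of individuals with >= 10x coverage
                 (assumed input format, one value per position in the region).
    min_fraction: required fraction of individuals (0.5 for the '10x.5' model).
    """
    fractions = list(frac_at_10x)
    # An empty region has no evidence of adequate coverage, so reject it.
    return len(fractions) > 0 and all(f >= min_fraction for f in fractions)


if __name__ == "__main__":
    print(passes_coverage([0.91, 0.75, 0.52]))  # every position at or above 0.5
    print(passes_coverage([0.91, 0.49, 0.80]))  # one position falls below 0.5
```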
Supplementary Figure 2 Correlation between exonic CpG density and genetic variation.
The sample size is the number of CCRs, which is 8,065,333 unique regions. Pearson’s correlation was used. a, Exonic CpG density compared to the density of exonic C>T or G>A transitions. b, Exonic CpG density compared to the density of all exonic variant types.
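For readers who want to reproduce this kind of comparison, Pearson’s correlation between two per-region densities can be computed as below. The function name and the example values are hypothetical and are not taken from the figure.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up per-region values: CpG density and C>T / G>A transition density.
    cpg_density = [0.01, 0.03, 0.05, 0.08]
    transition_density = [0.002, 0.004, 0.009, 0.012]
    print(pearson_r(cpg_density, transition_density))
```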
Supplementary Figure 3 Average exonic distance for adjacent gnomAD variants.
Distribution of the exonic distance between protein-changing (missense or LoF) variants in gnomAD without filtering regions by coverage, segmental duplications, or self-chains. The red dashed line is the average distance between protein-changing variants. The blue and black dashed lines represent the average length of CCRs in the 95th and 99th percentile, respectively.
Supplementary Figure 4 Correlation of constrained coding regions to other models of genic constraint.
The sample size is the number of CCRs ≥95%, which is 21,650 unique regions, and the number of genes with a Missense Z constraint score or pLI score is 18,225 genes for both sets. a, The correlation between a gene’s Missense Z metric (least to most constrained from left to right) and the number of CCRs in the 95th percentile or higher observed in the gene. b, The correlation between a gene’s pLI metric (least to most constrained from left to right) and the number of CCRs in the 95th percentile or higher observed in the gene.
Supplementary Figure 5 Total number of shared and unique genes across metrics for predetermined constraint metric cutoffs.
a,b, Comparison of genes covered by each metric’s cutoff for constraint (CCR ≥ 95 (a) or 99 (b), pLI ≥ 0.9, and missense depletion ≤ 0.4). The dark blue bar indicates how many genes are unique to a particular metric’s cutoff for constraint, and the light blue-green bar represents how many of the genes for that cutoff are shared with at least one of the other two metrics.
Supplementary Figure 6 Precision–recall (PR) curves for the developmental disorder de novo variant evaluation set.
The true positives are 3,400 missense-only de novo variants from patients with developmental disorders. The true negatives are 1,269 missense de novo variants from the unaffected siblings of autism patients. The dots indicate the score cutoff with the maximal Youden J statistic for each tool. Values in parentheses indicate the F1 score (the harmonic mean of precision and recall) at the J-score cutoff.
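The two summary statistics used in this legend can be made concrete with a short sketch. Youden’s J is sensitivity + specificity − 1, and F1 is the harmonic mean of precision and recall. The function names and the toy scores are hypothetical. The sketch assumes that higher scores indicate pathogenicity and that both classes are present in the labels.

```python
def confusion(scores, labels, cutoff):
    """Count (tp, fp, tn, fn), calling a variant positive when score >= cutoff."""
    tp = fp = tn = fn = 0
    for score, is_pathogenic in zip(scores, labels):
        predicted = score >= cutoff
        if predicted and is_pathogenic:
            tp += 1
        elif predicted:
            fp += 1
        elif is_pathogenic:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def youden_j(tp, fp, tn, fn):
    """Sensitivity + specificity - 1 (requires both classes to be present)."""
    return tp / (tp + fn) + tn / (tn + fp) - 1


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_cutoff(scores, labels):
    """Observed score that maximizes Youden's J (lowest such score on ties)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda c: youden_j(*confusion(scores, labels, c)))


if __name__ == "__main__":
    # Toy data: scores are percentiles, labels are 1 for pathogenic, 0 for benign.
    scores = [5, 20, 30, 50, 70, 85, 97]
    labels = [0, 0, 0, 1, 0, 1, 1]
    cutoff = best_cutoff(scores, labels)
    tp, fp, tn, fn = confusion(scores, labels, cutoff)
    print(cutoff, youden_j(tp, fp, tn, fn), f1_score(tp, fp, fn))
```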
Supplementary Figure 7 X-chromosome variant pathogenicity prediction comparison for CCR versus other metrics.
a, Enrichment of 166 pathogenic de novo mutations on the X chromosome in the most constrained X-CCRs and 43 benign mutations in the least constrained X-CCRs. The error bars represent 95% confidence intervals of 0.043–0.226 for the 0–20 bin, 0.46–2.07 for the 20–80 bin, 0.85–16.5 for the 80–90 bin, 0.69–41.1 for the 90–95 bin, and 1.35–77.2 for the 95–100 bin. b, ROC curve for the developmental disorder de novo variant evaluation set. The true positives are 166 missense-only de novo variants from patients with developmental disorders. The true negatives are 43 missense de novo variants from the unaffected siblings of autism patients. c, PR curve for X-CCR versus other metrics for the de novo set. The dots in b and c indicate the score cutoff with the maximal Youden J statistic for each tool. Values in parentheses indicate AUC and peak J score (respectively) for b and the F1 score (the harmonic mean of precision and recall) at the J-score cutoff for c.
Supplementary Figure 8 Odds ratio comparison between ExAC-based CCR and gnomAD-based CCR for the ClinVar variant set.
True positives are 24,554 pathogenic variants and likely pathogenic variants from ClinVar. True negatives are 4,689 variants labeled as benign from ClinVar. For ExAC v1, the 95% confidence intervals are 0.021–0.028 for the 0–20 bin, 20.5–29.6 for the 20–80 bin, 9.09–20.0 for the 80–90 bin, 11.8–47.4 for the 90–95 bin, and 14.1–36.8 for the 95–100 bin. For gnomAD, the 95% confidence intervals are 0.015–0.023 for the 0–20 bin, 23.9–36.6 for the 20–80 bin, 14.6–45.4 for the 80–90 bin, 22.8–1151.0 for the 90–95 bin, and 40.4–647.5 for the 95–100 bin.
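This legend does not say how the odds ratios and their 95% confidence intervals were computed. As an assumption, the sketch below uses the common Wald (log odds ratio) interval for a 2×2 table of pathogenic and benign variants inside and outside a percentile bin. The function name and the counts in the example are hypothetical.

```python
import math


def odds_ratio_ci(path_in, path_out, benign_in, benign_out, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    path_in / path_out: pathogenic variants inside / outside the bin.
    benign_in / benign_out: benign variants inside / outside the bin.
    All four counts must be positive; z=1.96 gives an approximate 95% interval.
    """
    if min(path_in, path_out, benign_in, benign_out) <= 0:
        raise ValueError("all four counts must be positive")
    odds_ratio = (path_in * benign_out) / (path_out * benign_in)
    # Standard error of log(OR) under the Wald approximation.
    se = math.sqrt(1 / path_in + 1 / path_out + 1 / benign_in + 1 / benign_out)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Made-up counts, not taken from the figure.
    print(odds_ratio_ci(20, 80, 5, 95))
```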
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–8
Supplementary Table 1
Genes with CCRs in the 99th percentile or higher
Supplementary Table 2
CCRs under purifying selection specifically in humans
Supplementary Table 3
CCR enrichment in Pfam domains
Supplementary Table 4
Highly constrained CCRs not covered by missense depletion
Supplementary Data
CCR percentile distributions for all Pfam domains
About this article
Cite this article
Havrilla, J.M., Pedersen, B.S., Layer, R.M. et al. A map of constrained coding regions in the human genome. Nat Genet 51, 88–95 (2019). https://doi.org/10.1038/s41588-018-0294-6
DOI: https://doi.org/10.1038/s41588-018-0294-6
- Springer Nature America, Inc.