We extracted and analysed a non-redundant dataset of 10,136,597 human missense variants and mapped 3,960,015 of them onto 18,874 experimental 3D structures and 84,818 3D structural models.
Using experimental structures alone, 8% of the human proteome (1715 proteins) could be mapped at least partially. For 51% of the proteome (10,511 proteins), no experimental structure was available and coverage was obtained from 3D models. For a further 23% (4742 proteins), structural coverage was obtained by combining experimental structures and 3D models. No structural data are currently available for 3453 proteins (17% of the human proteome; Fig. 1).
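The proteome-level percentages can be re-derived from the protein counts. The short Python check below is only a sketch: the category labels are ours, and the proteome size it uses (20,421 proteins) is not quoted in the text but is assumed to be the sum of the four counts.

```python
# Protein counts per coverage category, as reported in the text.
# The category labels are illustrative, not taken from the database.
coverage = {
    "experimental structure only": 1715,
    "3D model only": 10511,
    "experimental structure + 3D model": 4742,
    "no structural data": 3453,
}

# Assumption: the four categories are disjoint and together make up the proteome.
total_proteins = sum(coverage.values())

# Percentage of the proteome in each category, rounded to a whole percent.
percentages = {
    label: round(100 * count / total_proteins)
    for label, count in coverage.items()
}

for label, pct in percentages.items():
    print(f"{label}: {coverage[label]} proteins ({pct}%)")
print(f"total: {total_proteins} proteins")
```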
At the residue level, structural coverage is currently available for 5,837,443 (51%) amino acid residues: 1,708,531 (15%) from an experimental structure and the remainder (36%) from a 3D model. A further 2,034,664 residues (18% of all residues) lack structural coverage and are predicted to be disordered (Fig. 1).
At the variant level, 24% (n = 2,437,221) of human amino acid substitutions occur in a region predicted to be disordered, that is, to lack a single stable folded structure. Of the remaining 7,699,376 variants, 51.4% (3,960,015 amino acid substitutions) were mapped to an experimental (1,110,833 variants) or modelled (2,849,182 variants) structure, and the predictions of their effect at the atomic level are deposited in Missense3D-DB.
Overall, 14.4% (n = 568,548/3,948,327) of missense variants reported in GnomAD are predicted to cause structural damage, thus leading to protein instability or misfolding. Of these 568,548 variants, 2109 are common (MAF > 1%), 377,622 are rare (MAF < 1%) and 188,817 are extremely rare (no MAF available). Among the missense variants with a known pathogenic annotation, 34.2% (n = 6334/18,518) from ClinVar and 36.1% (n = 8509/23,588) from UniProt are predicted to cause structural damage. Many missense variants in ClinVar (n = 60,354) are annotated as a VUS (variant of uncertain significance) or as a variant with conflicting interpretations. 3D analysis could be performed for 32,717 of these variants, of which 19.2% (n = 6266) are predicted to cause structural damage.
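As a quick arithmetic check, the proportions of predicted damaging variants in the GnomAD, ClinVar pathogenic and UniProt pathogenic sets can be recomputed from the counts given above, together with the MAF breakdown of the damaging GnomAD variants. The snippet is a sketch; the dictionary names and set labels are ours.

```python
# (predicted damaging, variants analysed) for each set, as reported in the text.
damaging_sets = {
    "GnomAD": (568_548, 3_948_327),
    "ClinVar pathogenic": (6_334, 18_518),
    "UniProt pathogenic": (8_509, 23_588),
}

# Percentage predicted damaging in each set, to one decimal place.
damaging_pct = {
    name: round(100 * damaging / analysed, 1)
    for name, (damaging, analysed) in damaging_sets.items()
}

# MAF classes of the predicted damaging GnomAD variants.
maf_classes = {
    "common (MAF > 1%)": 2_109,
    "rare (MAF < 1%)": 377_622,
    "extremely rare (no MAF available)": 188_817,
}
maf_total = sum(maf_classes.values())

for name, pct in damaging_pct.items():
    print(f"{name}: {pct}% predicted damaging")
print(f"damaging GnomAD variants across MAF classes: {maf_total}")
```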
The Missense3D-DB web catalogue
We designed the Missense3D-DB database and its web interface (http://missense3d.bc.ic.ac.uk/) to give geneticists and the wider scientific community access to the results of the 3D structural analysis. The browser is free to use and can be queried by gene of interest. For each variant, the Results page reports the predicted structural effect (benign or damaging), with a concise description of the affected structural features identified by the analysis, e.g. a steric clash (Fig. 2). An in-depth structural interpretation, presented in a format accessible to non-experts in structural biology, is available on a dedicated page linked from the Results page. This page contains details of the structural analysis and an interactive viewer that displays the 3D coordinates of the wild-type and variant structures (Fig. 3). The 3D coordinates of the variant structure (mutant PDB structure) can be downloaded from this page.
To give the user a comprehensive overview of each variant, the Results page also reports the Ensembl transcript ID, the domain in which the variant occurs, the variant phenotype annotation according to the database from which it was extracted, the MAF from GnomAD or 1000Genomes, the variant dbSNP ID, and in silico variant effect predictions (including raw scores) from SIFT and PolyPhen-2 (Fig. 2). The change in folding free energy (ΔΔG) calculated using FoldX is also reported, as are CCR percentile scores. Although regions with CCR scores ≥ 95% are considered constrained (Havrilla et al. 2019), we did not see an enrichment of variants predicted to be structurally damaging in these regions. This is likely due to the limitations of our structure-based analysis. Unlike sequence-based methods, a structural analysis can only be performed when a 3D structure (experimental or modelled) is available and, at present, only ~50% of missense variants can be modelled. Moreover, our structural analysis is currently limited to the effect of the variant on a single protein chain and does not take into account additional deleterious effects, such as disruption of protein–ligand and protein–protein interaction sites or of functional sites (Vihinen 2015).
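The text does not state how enrichment in constrained regions was assessed. One common way to test such a question is a one-sided Fisher's exact test on a 2 × 2 table of predicted damaging versus not damaging variants, inside versus outside regions with CCR ≥ 95%. The sketch below implements that test with the Python standard library only; the function name and the counts in the usage example are made up for illustration and are not results from Missense3D-DB.

```python
from math import comb


def fisher_enrichment_p(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2 x 2 table

                        damaging   not damaging
        CCR >= 95%         a            b
        CCR <  95%         c            d

    Returns P(X >= a) under the hypergeometric null, i.e. the probability
    of seeing at least `a` damaging variants in the constrained regions
    if damage were independent of constraint.
    """
    constrained = a + b   # row total: variants in constrained regions
    damaging = a + c      # column total: predicted damaging variants
    n = a + b + c + d     # all variants in the table
    upper = min(constrained, damaging)
    favourable = sum(
        comb(constrained, k) * comb(n - constrained, damaging - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(n, damaging)


# Toy counts, invented purely to exercise the function.
p_value = fisher_enrichment_p(3, 1, 1, 3)
print(f"one-sided p-value: {p_value:.4f}")
```

A large p-value, as in this toy table, would be read as no evidence of enrichment. For real data a library routine such as `scipy.stats.fisher_exact` would normally be preferred.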
Results for the query variant, or for all variants mapped for a query gene, can be downloaded from the Results page.
A set of 31,283 variants for benchmarking new bioinformatics resources
A subset of our results, consisting of 31,283 variants with their structural predictions, is available in Supplementary Material 1. Variants in this file are non-redundant and are annotated as pathogenic (19,290) or benign (11,993) according to ClinVar, Humsavar or both databases. In addition, variants with MAF ≥ 1% in GnomAD were included and annotated as benign (Table S1 and Table S2 in Supplementary Material 2). Variants with conflicting annotations (e.g. benign in ClinVar but pathogenic in Humsavar) were excluded. This resource can be used by the scientific community to benchmark new bioinformatics algorithms.
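As an illustration of how such a benchmarking set might be used, the sketch below scores a predictor against pathogenic/benign labels using sensitivity, specificity and the Matthews correlation coefficient (MCC). The layout of the supplementary file is not described here, so the inputs are plain Python lists; the function name, the label strings and the toy data are assumptions made for the example, and the choice of metrics is ours.

```python
from math import sqrt


def benchmark(truth: list[str], predicted_damaging: list[bool]) -> dict[str, float]:
    """Compare predictions with reference labels.

    truth              -- "pathogenic" or "benign" for each variant
    predicted_damaging -- True if the method under test calls the variant damaging
    """
    tp = fp = tn = fn = 0
    for label, damaging in zip(truth, predicted_damaging):
        if label == "pathogenic":
            if damaging:
                tp += 1
            else:
                fn += 1
        else:
            if damaging:
                fp += 1
            else:
                tn += 1

    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Toy data (invented): 4 pathogenic and 4 benign variants.
truth = ["pathogenic"] * 4 + ["benign"] * 4
predicted = [True, True, False, False, True, False, False, False]
scores = benchmark(truth, predicted)
print(scores)
```

Because the set contains more pathogenic than benign variants (19,290 versus 11,993), a metric such as MCC that accounts for class imbalance is a reasonable complement to sensitivity and specificity.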