Background

"Sequence → Structure → Function" is the traditional view that amino acid sequences determine the structure of a protein molecule and that a definite protein structure is a prerequisite to biological function. This view has been amended by the finding that more and more proteins possess no definite ordered three-dimensional structure but are still involved in key biological processes, including cell cycle and gene regulation, molecular recognition, assembly of complexes, and signalling in general [1, 2]. Indeed, over 33% of eukaryotic proteins contain structure-lacking regions. This kind of protein is often named "intrinsically disordered proteins" (IDPs). Several studies have shown a strong correlation between disease-associated proteins and proteins containing significant amounts of intrinsic disorder [3], leading to the D2 concept of "disorder in disease" [4]. Complex diseases such as cancer, neurodegenerative diseases, cardiovascular diseases, and diabetes are often associated with IDPs [3], likely because errors in signalling and regulation arising from IDPs are important for these disease associations. It was found that mutations that cause disorder tendencies to flip to structure tendencies are the most likely mutations in disordered regions to be disease-causing [5].

Recent advances in genetic studies enabled the discovery of many genetic regions linked or associated to complex diseases, using array-based genotyping technology, or more recently, next generation sequencing technology. Many known or novel single nucleotide variations have been identified, and their potential roles on disease pathogenesis are unknown. Many bioinformatics tools including FastSNP [6], Panther [7], PolyPhen2 [8], SIFT [9], SNPs3D [10] and SPOT [11], have been developed; many of them prioritize the SNV functions based on their roles in affecting protein structures.

Of particular interest here is that one study reported that 114 out of 122 (93%) single amino acid polymorphisms (SAPs) located in disordered regions are associated with disease. Thus SAPs occurring in disordered regions are highly likely to affect the functions of the proteins and be associated with disease [12].

In the present study, we systematically evaluate the potential disease risk on the SNVs whose resultant amino acid changes, SAPs, can change their structure/disorder tendencies, based on the single nucleotide variants derived from the 1,000 Genome Project and distributed by the Genetic Analysis Workshop (GAW17).

Methods

The overall analysis workflow includes three major steps (Figure 1): 1. retrieving protein sequences for the genes of interest; 2. mapping genetic variations in nucleotide level into proteomic variations in amino acid level; and 3. assessing the capability of a genetic variation to change the disorder/structured tendencies. This workflow integrates different levels of bioinformatics approaches, and allows understanding how single nucleotide variations can change the predicted structure/disorder estimate of the stability of the protein structures, and how these changes affect disease risk.

Figure 1
figure 1

Schematics of the workflow.

Dataset

SNV genotypes for 697 individuals were obtained from the sequence alignment files identified in the 1,000 Genomes Project and distributed by the Genetic Analysis Workshop (GAW17). A total of 24,487 exonic SNVs within 3205 autosomal genes were included in the data, regardless whether they are synonymous or non-synonymous. In this study, we focus our analysis on the missense non-synonymous variations that can cause single amino acid polymorphisms.

Three quantitative phenotypes (i.e. Q1, Q2, Q4) are generated for each of the unrelated individuals, and a total of 200 simulations are available to us. The disease model of Q1 includes 39 SNPs in 9 genes from VEGF pathway; Q2 is influenced by 72 SNPs in 13 genes related to cardiovascular risk and inflammation; Q4 is not affected by any of the available SNPs. Liability to disease is defined as latent liability+Q1+Q2-Q4, where latent liability is determined by 51 SNPs in 15 genes involved in the VEGF pathway.

Retrieving protein sequences

Since UniProtKB/Swiss-Prot is a high quality manually annotated and non-redundant protein sequence database [13], we choose UniProtKB/Swiss-Prot to retrieve protein sequences. Furthermore, for every gene in the list, we retrieved both the canonical sequence and isoform data, downloaded from UniProtKB/Swiss-Prot database. We used the gene symbol in the GAW17 data set as the query to search for the amino acid sequences. Among the 3,205 genes provided in the data set, amino acid sequences of 2,893 genes were available in the UniProtKB/Swiss-Prot database, and retrieved for further analysis. Among the 24,487 SNVs provided in the GAW 17 data, 18,075 SNVs were mapped in the UniProtKB/Swiss-Prot database [13]. We exclude synonymous and nonsense non-synonymous mutations from further analysis, since the former causes no change in disorder score and the latter leads to truncation of the protein.

Converting genetic variation to proteomic variation

In order to map the single nucleotide variations into the appropriate amino acid sequences, we used BLASTX algorithm [14] downloaded from the National Center for Biotechnology Information. BLASTX program translates the query nucleotide sequences in all six possible reading frames and provides combined significance statistics for hits to different frames. The nucleotide sequences from human genome build 36 (hg18) were used to retrieve the nucleotide query for each gene, and the reference sequences were the amino acid sequences from UniProtKB/Swiss-Prot. For ~10% of the genes, one segment of amino acid sequence may be mapped to different parts of nucleotide coding regions. To ensure the accuracy of the mapping, we further confirmed that paired segments in the query nucleotide sequence and the resultant amino acid sequences are in the same sequential order. We found this dramatically increase the mapping accuracy.

Assessing the capability of a genetic variation to change disorder/structured status

To evaluate the potential of an SNV to alter the disorder or structure tendency, we first predict the disorder score for a SNV in the specific sequence. In this study, the per-residue disorder predictor PONDR-VSL2 [15] was used. VSL2 is composed of a set of support vector machines trained on datasets containing structured and disordered regions of various lengths. VSL2 provides one disorder prediction score between 0 and 1 for each residual. A score above or below 0.5 indicates that the target amino acid is located in a predicted region of disorder or structure, respectively. Overall, VSL2 achieves accuracy close to 80% correct and is one of the more accurate disorder predictors currently available. Although PONDR-VSL2 used in our study is one of the most accurate protein disorder predictors. There still exist two sources of uncertainty in disorder prediction, one is model uncertainty, and the other one is data uncertainty. Such work has been discussed in recent study [16].

We then use the following strategy to calculate the ability of one SNV to change the tendency of a local sequence region to form disorder or structure:

Δ D S = D S m i n - D S m a j
(1)

where DS min and DS maj represent the SNV's disorder scores for minor allele and major allele, respectively. A positive or negative ΔDS value indicates that the minor allele will be associated with an increased or decreased disorder potential, respectively.

Results and discussion

Disorder- and Structure-potential in SNVs

The hypothesis that we want to test is that SNVs that cause changes in the structure/disorder tendency are deleterious. Destabilization of a structured protein domain by a mutation is often harmful [17]. A positive ΔDS in a structured protein corresponds to this case. As indicated above 114/122 mutations in disordered regions were associated with disease [12], so we also need to consider the possibility that SNVs causing negative ΔDS might also be deleterious. Of course for a mutation, it would likely be the change in the structure ←→ disorder tendency,, ΔDS that would be important, not necessarily the absolute value.

Context-dependence of ΔDS values

Using the VSL2 software [15], we evaluated how a given amino acid change alters the potential for protein disorder by calculating the ΔDS as described above. The output of the VSL2 software depends not only on the amino acid at a given position but also on the amino acids surrounding that position. Thus, a particular ΔDS value will depend both on the amino acid change and also on the sequence context of the given amino acid change. Here we will compare two scenarios: when the same amino acid substitution occurs in two isoforms of related proteins and when the same amino acid substitution occurs in unrelated proteins.

To determine context-dependence for different isoforms, we identified SNVs in 1,170 genes that have isoform records in UniProtKB/Swiss-Prot database. In total 1,082 SNVs were found to potentially exist in 3,229 different isoforms. Pair wise comparison of all of the related isoforms yields a distribution showing how much each ΔDS value changes as the sequence-context changes in a different isoform. This distribution shows a very strong peak very close to a shift of 0.0003 in the ΔDS value followed by an extended tail, giving a average of ~ 0.01 for the data. Such a small context-dependent change for most of the sequences likely results from the very similar amino acid sequences of the different isoforms. Given such a small context-dependent change, it is not surprizing that only 0.2% of these amino acid substitutions lead to a change in the sign of the ΔDS value.

Determining the context dependence for the same amino acid change in different sequences is more involved than the comparison of isoforms. Overall, there are 380 possible non-synonymous amino acid changes. However, if only one base change is allowed per codon, then there are only150 possible changes. These 150 possible changes were all observed in the dataset. The average ΔDS values, determined from the multiple instances of the same amino acid change, range from close to zero to slightly more than 0.16 (Figure 2).

Figure 2
figure 2

Comparisons of average Δ DS scores changes. The ΔDS values the same amino acid change in the different proteins were collected and averaged. Shown are the results for all 150 different amino acid changes.

The amino changes giving the 10 largest values for + or - average ΔDS are given in Tables 1, 2, while the codon changes are given in Tables 3, 4. An amino change might have one or more than one codon changes. The changes with large + average ΔDS correspond to mutating a large, hydrophobic residue (such as W, F, I, Y) into a hydrophilic one (S, N) or even into one with an electrical charge (R, D, K). This would certainly cause a shift towards a greater tendency to be disordered. On the other hand, the changes with large - ΔDS correspond to just the reverse, with a charged (K, E, R, D) or hydrophilic (S, G) residue mutating into a structure-promoting one (I, W, V, F, Y or C). This sort of change identifies a shift towards a greater structure tendency. The rather large standard deviations indicate significant contributions from the surrounding sequences.

Table 1 Rank of top 10 amino acid (AA) changes with + mean ΔDS
Table 2 Rank of top 10 amino acid (AA) changes with - mean
Table 3 Rank of top 10 codon changes with + mean ΔDS
Table 4 Rank of top 10 codon changes with - mean ΔDS

As expected from the large standard deviations in Tables 1, 2, pairwise comparisons of the same amino acid changes in different sequences give a very broad distribution. This distribution has a very weak peak near 0.006 and mean value of ~ 0.036. Large sequence differences for unrelated proteins accounts for these much more significant context-dependent changes here for the ΔDS values for a particular type of amino acid change as compared to the much smaller changes observed for the protein isoform data as discussed above.

Minor allele frequencies and ΔDS

To illustrate the disorder/structured potential of SNVs, we focused our analysis on 10,254 missense non-synonymous SNVs, with4,345 and 5,909 showing positive and negative, ΔDS respectively.

Small values of ΔDS are expected to be less important, so partitioning the data into subsets having larger and smaller ΔDS values should be helpful. Examination of Figure 2 suggests that a threshold of |ΔDS| ≥ 0.04 would eliminate most of the small peaks in the figure. This value is also slightly larger than the mean value of ~ 0.036 for the context-dependent shifts of the ΔDS distribution as mentioned above. So, for both of these reasons, 0.04 was chosen as a threshold for our initial studies.

The positive and negative ΔDS sets were divided into subsets according to the magnitude of the ΔDS values (Table 5). As indicated, there are 1,572 SNVs with ΔDS ≥ +0.04 and 2,629 SNVs with ΔDS ≤ -0.04 (Table 5). The subsets with }ΔDS} ≥ 0.04 contain [(1,572 + 2,629)/10,254] →40% the data.

Table 5 Number of SNVs in synonymous/non-synonymous region

Next, the differences of the minor allele frequency (MAF) were compared for various subsets partitioned by their ΔDS scores. The MAF is the frequency of the SNV's less frequent allele in a given population. When ΔDS values are positive or negative but close to zero (Figure 3A), the two populations show similar MAF values as expected. The two populations are distinguishable, but just barely (with a p-value of only 0.163). On the other hand, as the ΔDS values become larger (Figure 3B), the MAFs of the two populations become more clearly distinct (p-value 1.58 × 10-4). To observe this significant change in the two populations for such a small change in ΔDS threshold value is remarkable and suggests that a change in the structure/disorder potential is indeed an important variable with regard to evaluating single amino acid changes in proteins.

Figure 3
figure 3

Average MAFs for SNVs with different Δ DS values. (A) Comparison of average MAFs for SNVs with |ΔDS| smaller than 0.04. (B) Comparison of average MAFs for SNVs with |ΔDS| greater than 0.04. *** stands for p-value<0.001

The effect of disorder/structured potential SNVs to protein and disease

Three traits namely Q1, Q2 and disease liability distributed by GAW17 are simulated to be affected by a particular set of non-synonymous SNVs within 9, 13 and 15 genes [18]. We further checked whether these genes also have SNVs that give significant changes (e.g. with |ΔDS| ≥ 0.04). Two, seven and six SNVs are found to have significant changes in the structure/disorder potential in trait Q1, Q2 and disease liability, respectively(Table 6).

Table 6 Summary of results on GAW17 simulated data

The details for these 2 + 7 + 6 = 15 examples are given in Table 7. In addition to the ΔDS values, also included are the reported amino change and the prediction score for the major allele.

Table 7 SNVs with |ΔDS |>0.04 in Q1, Q2 and disease liability

In a previous study of deleterious mutations in structured proteins, energy calculations on 3-D structures showed that the deleterious mutations were those that destabilized the structure [17]. A + ΔDS for a region of protein predicted to be structured (e.g. with a prediction score < 0.5) would correspond to the potential destabilization of a region 3-D structure. Interestingly, 6/7 of the examples with + ΔDS values have scores < 0.5, with the remaining example having a score = 0.51. It would be interesting to try this approach on real data to determine if this approach could substitute for that described previously [17] for the identification of deleterious mutations in structured proteins. The advantage here is that the 3-D structures would not be needed.

Of the 15 examples, 8 involved - ΔDS values. Five of these corresponded to regions likely to be structured (e.g. score < 0.5), so from the energetic point of view described above it is not clear how these changes could affect structure. On the other hand, the observed changes (P→L, D→Y and S→L) could certainly lead to protein malfunction.

The last three examples involved amino acid changes that would increase the structural tendency within a region of disorder. Interestingly, recent work suggests that mutations causing disorder tendencies to change to structure tendencies occur much more often than the reverse [5, 19]. Furthermore, a very high fraction of disease-associated mutations mapping to regions of disorder exhibit such tendencies [5] while a significant fraction of these mutations are also associated with the loss of sites of posttranslational modification [20]. Note that posttranslational modifications very commonly occur in disordered regions [1], perhaps much more often than in ordered regions, especially those modifications involving phosphorylation [21, 22].

Also note the earlier work cited above indicating that 114/122 mutations in disordered regions proved to be deleterious [12]. Thus, it would be worthwhile to obtain these data and determine the disorder prediction threshold with the best identification of these harmful mutations. Such a new result could then be included in any future work.

Conclusions

The use of mutation-induced changes in disorder prediction scores, called ΔDS, has been studied here. The overall idea is that a significant change in the tendency of a protein region to be structured or disordered could lead to malfunction of such a protein. The initial findings on the data provided by GAW17 give insight with regard to the directions to explore on real data. For example, with real data one could explore whether some particular threshold for ΔDS would give significant separation of harmless versus harmful mutations. If the data of reference [18] turns out to be a general finding, or even if only applicable to certain diseases, this observation, the discovery of which came out of this work, certainly points towards important new directions to try.

There are some limitations for the study. Single SNV may not change the disorder properties too much, since changes of disorder score depend more on changes of a segment than an individual mutation. So in our further work, we would try to investigate the combination or pattern of SNVs nearby the high ΔDS spot.