Background

Cancer, one of the most costly and heterogenous diseases, is ultimately caused by a build up of somatic mutations within oncogenes or tumor suppressors [1]. Typically, oncogenic mutations result in an increase of gene output or a destabilization of the the resulting protein while mutations within tumor suppressors lead to a reduction of gene activities that promote apoptosis or cell cycle regulation. Due to the relative ease of disrupting protein function as compared to restoring it, significant pharmacological progress has been made towards inhibiting oncogenic mutations as shown by [24]. Combined with the theory of oncogene addiction, that a small subset of so called driver genes result in runaway cellular replication and that the selective targeting of these genes can have a large impact on tumorigenesis [5, 6], the identification of such driver genes becomes critical due to the large translational benefit in the pharmacological space.

Due to the medicinal and biological importance of identifying these driver mutations, a large ensemble of methodologies have been developed. One popular approach is based on the hypothesis that driver mutations have a higher frequency of non-synonymous mutations when compared to the background mutation rate [7, 8]. Relatedly, several studies have shown that somatic mutations cluster within protein kinases [6, 810] and that these clusters may be a sign of positive selection for protein function and thus targets for therapeutic intervention [11, 12]. Such frequency based approaches at identifying driver mutations are often further augmented by accounting for a variety factors such as normalizing for gene length [13], accounting for tumor type and varying background mutation rates [13, 14], as well as considering the ratio of nonsynonymous (K a ) to synonymous (K s ) mutations [15].

In addition to the above methods, several machine learners have been designed to determine the impact of a specific mutation. For example, CHASM [16] endeavors to classify between driver and passenger mutations while Polyphen-2 [17] attempts to determine if a mutation is damaging or benign. Overall, the machine learning approaches utilize a large set of “features” such as sequence, size and polarity of the substituted residues, as well as whether the mutation occurred in a conserved region [18]. These features are used to build a set of rules which are then used to score each mutation. The value of the score then determines how detrimental is the mutation or is used to classify the mutation into a particular category, for example “driver” versus “passenger”. While some classifiers are designed to handle a large feature space, others are optimized to use only a subset of these features. For instance, SIFT only considers the degree of evolutionary conservation when determining whether an amino acid substitution affects protein function [19]. Once the feature set has been determined, a variety of statistical learners such as Random Forests [20], Support Vector Machines [21] and Bayesian Networks [22] are then used to build the model.

Although all of the above methods have shown success in determining whether a mutation is harmful, they nevertheless have limitations as well. Machine learners for example often require several sources of information that must be periodically updated, often at significant expense. Approaches that rely upon differentiating between the frequency of K a to K s over the entire gene may fail if selection only occurs upon a small region of the gene. Similarly, approaches such as those proposed by [14] lose accuracy if the background mutation rate can not be precisely calculated. Other algorithms, such as those proposed by [13, 15] do not distinguish between activating and non-activating mutations.

Using the hypothesis that activating mutations cluster in functionally significant protein regions, [2326] have developed several approaches to identify mutational clustering. Ye et al. [23] created Non-Random Mutational Clustering (NMC) by testing against the null hypothesis that non-synonymous amino acid mutations are distributed uniformly along the polypeptide. However, the algorithm is based upon order statistics and thus considers the protein as a linear sequence of amino acids without taking protein structure into account. To that end, iPAC [24] and GraphPAC [25] extended NMC to account for protein tertiary structure. While both approaches remapped the protein to one dimensional space before identifying clustering, iPAC utilized a global remapping via Multidimensional Scaling (MDS) while GraphPAC employed a local remapping via a graph theoretical approach. While both of these methods considered the protein tertiary structure when identifying clustering, they nevertheless required a remapping to one dimension which resulted in information loss. As such, SpacePAC [26] performed a simulation based analysis to identify clustering directly in 3D space. Despite the success of the above methods, they nevertheless only consider up to the protein tertiary structure and do not account for the large complexes that the protein subunits create in vivo when performing biological functions.

In this article, we extend the work done by iPAC, GraphPAC and SpacePAC to consider protein quaternary structure when identifying mutational clusters. This approach allows us to detect clusters that become apparent only when there are multiple polypeptide chains in the complex. For example, statistically significant clusters in structures 1SUV, 2GRN and 2YDR are identified only when the entire protein complex is considered (see ‘Sections iPAC identifies new proteins with clustering’, ‘ GraphPAC identifies new proteins with clustering’, and ‘ SpacePAC identifies new proteins with clustering’). Furthermore, QuartPAC detects additional mutational hotspots in proteins known to have clustering and thus expands the repertoire of pharmacological targets that can be investigated. We also evaluate the performance of QuartPAC when identifying mutations that are classified as damaging or driver mutations by PolyPhen-2 and CHASM, respectively. In all, by accounting for the highest level of protein complexity, we are able to discern clusters that are otherwise missed by algorithms that only consider the protein tertiary structure.

Methods

The QuartPAC methodology consists of three main parts. The first part obtains the mutational and structural data for each subunit in the quaternary complex (see Section ‘Obtaining mutational & structural data’). The next step is to reconcile the quaternary protein structural information with the mutational data so that the correct mutation is mapped onto the proper amino acid (see Section ‘Reconciling structural and mutational data’). The final step is to run the underlying clustering algorithm on the reconciled quaternary structure (Section ‘Identifying mutational clusters’). For this manuscript, we executed the algorithms presented in iPAC, GraphPAC and SpacePAC in order to identify statistically significant clusters. The software allows the user to specify which clustering algorithms they want to utilize. Lastly, although not part of the QuartPAC process, we correct for the multiple comparison penalty as we test many structures for clustering (see Section ‘Multiple comparison adjustment for structures’). We also note that we use the term “cluster” and “hotspot” interchangeably throughout this manuscript.

Obtaining mutational & structural data

The 70th version of the COSMIC database, the most recent as of when this article was drafted (available via http://cancer.sanger.ac.uk/cosmic), was used to retrieve the mutational data. In order for us to include a mutation in our analysis, it first needed to meet several criteria. First, only nonsynonymous missense mutations that were classified as a “confirmed somatic variant” or “Reported in another sample as somatic” were retained. Next, as all the clustering algorithms test against the null hypothesis that mutations are randomly and uniformly distributed along the polypeptide chain, in order to avoid selection bias, only mutations from whole genome or whole gene screens were kept. Further, as multiple studies often report or use the same mutational data from a single cell line, all the mutations were screened in order to remove duplicate mutations and avoid double counting specific variants. Finally, the gene on which the mutation occurred must of been properly labeled with a Uniprot Accession Number [27]. This allowed us to correctly match the mutation to the protein structure in the PDB (see “COSMIC Query.docx” in Additional file 1 for the entire SQL query).

The structural information was accessed from the PDB by cross-referencing the uniprots from the COSMIC database against those for which quaternary structural information was available. Since multiple structures are often available for the same protein subunits (or a subset of the same subunits), all relevant structures with matching Uniprot Accession Numbers were kept and a multiple comparison adjustment applied afterwards (see Section ‘Multiple comparison adjustment for structures’). In addition, as every amino acid is comprised of several atoms, the (x,y,z) coordinates of the α-carbon atom were used to represent amino acid positions. As shown in [25], using other backbone atoms such as the amide nitrogen or main chain carbonyl carbon is possible but has minimal effect. For a full listing of the 2267 structures considered for analysis, see Additional file 2: Structure files.xlsx in Supplementary materials.

We note that while each PDB entry was used once and only once in each analysis, proteins present in multiple PDB entries are analyzed multiple times. As a given protein can adopt different structures due to a variety of factors, such as variations in the amino acid sequence or the presence of other bound proteins or cofactors, it is important to consider all possible structures. Indeed, one specific structure may be the one that provides insight into the oncogenic process while the other structures do not. However, should only one structure per protein be considered our results would be even more significant as the multiple comparison penalty (see Section ‘Multiple comparison adjustment for structures’) would be reduced.

Reconciling structural and mutational data

As the residue numbering in the PDB database does not match the canonical residue numbering in the COSMIC database, a reconciliation is required in order to map the mutational data to the structural data. Similar to iPAC, GraphPAC and SpacePAC, a pairwise alignment was performed as detailed in [28]. Should the user so desire, a manual alignment is also possible. For full details on the pairwise alignment algorithm, consult the iPAC package available on Bioconductor (http://www.bioconductor.org/packages/release/bioc/html/iPAC.html). Successful alignment was obtained on 2156 quaternary protein structures for which applicable uniprot information was available. Structures for which there were fewer than two mutations were labeled as blank (since no clustering was possible). Refer to “Methodology Results.xlsx” in Additional file 3 for a full listing of the 2156 structures that had a successful alignment and were statistically analyzed.

Identifying mutational clusters

The underlying approach for QuartPAC is that it performs each of the clustering approaches specified in iPAC, GraphPAC and SpacePAC but on the quaternary protein structure. As such, the complexity of the methodology presented here stems from correctly handling the folded structure of the protein subunits when they come together to form a macromolecule. We describe briefly each of the clustering methodologies below and refer the reader to the original manuscripts for further details.

iPAC

The iPAC methodology remaps the protein from \(\mathbb {R}^{3} \rightarrow \mathbb {R}\) by minimizing the stress function defined as:

$$\begin{array}{@{}rcl@{}} \sigma_{1} = \sqrt{\frac{\sum_{i,j}\left[f(\delta_{i,j}) - d_{i,j}(\mathbf{X})\right]^{2}}{\sum_{i,j}d_{i,j}^{2}(\mathbf{X})}} \end{array} $$
((1))

In the equation above, δ i,j represents the distance between the α-carbon atoms of residues i and j in \(\mathbb {R}^{3}\) and d i,j (X) represents the distance between the residues in the lower dimensional space X. In our case, X is the line, \(\mathbb {R}\). Finally, f is used when the original space is not a metric space. Since the protein is in \(\mathbb {R}^{3}\), we simply have f to be the identity function. The denominator of the expression is used to ensure that the remapping is the same regardless of the units used to measure distance.

By performing a global minimization of σ 1, all pairwise \(\mathbb {R}^{3}\) distances are preserved, as best as possible, when the protein is mapped to \(\mathbb {R}\). Once in the lower dimensional space, the position of every mutation is utilized to build order statistics as shown in Fig. 1.

Fig. 1
figure 1

Order statistic construction. Suppose there are a total of seven mutations over three samples of the same protein. The protein is N amino acids long and the number in each box represents the amino acid position. A star above the box signifies a non-synonymous mutation. X (i)=j then signifies that the i-th mutation occurred on residue j [24]

Once the order statistics are calculated, a cluster is found between two mutations if P r(X (k)X (i))≤α for a significance level α where X (i),X (k) represent the i-th and k-th mutations, respectively, along the reordered amino acid sequence. Typically, α is set to be 5 % (as is the case for this manuscript as well as for [2426]), but can be set to whatever level of statistical significance is desired by the study authors. This probability is then calculated for all pairwise mutations and an appropriate multiple comparison adjustment is applied. For the purposes of this paper, a conservative Bonferroni multiple comparisons method was applied to account for all intra-protein comparisons.

GraphPAC

GraphPAC functions similarly to iPAC in that it also hinges on a mapping from \(\mathbb {R}^{3} \rightarrow \mathbb {R}\). However, GraphPAC performs a local minimization by only considering nearby residues when projecting down onto the lower dimensional space. For instance, as shown in Fig. 2, the iPAC methodology will allow for residues in Domain C to have an effect on the final position of residues in Domain A and vice versa. However, utilizing the GraphPAC approach, only nearby residues will effect the remapping process.

Fig. 2
figure 2

In this cartoon protein, the residues in domain A will be visited by the traveling salesman before any of the residues in Domain B or C. Thus the residues in domains B and C do not impact the remapping of domain A to \(\mathbb {R}\). Under iPAC however, every amino acid affects every other amino acid’s final position [25]

To achieve this “local-based” reordering, GraphPAC utilizes a graph theoretic algorithm. Specifically, the algorithm sets every residue to be a vertex and all vertices are then connected to one another forming a complete graph. The weight on the edge between vertices i and j is set to be equal to the Euclidean distance between amino acids i and j in \(\mathbb {R}^{3}\). A heuristic approach is then used to solve the traveling salesman problem in order to find the shortest Hamiltonian path through the protein. In particular, we attempt to heuristically identify the permutation π that solves:

$$ \min_{\pi} \sum\limits_{i=1}^{n} d(i, \pi(i)) $$
((2))

where π(i) represents the amino acid that follows residue i on a path through the protein. While there are many heuristic solutions to the TSP, the problem is NP-hard and there is no known solution that can be solved in polynomial time. However, as shown by [25], the results are remarkably consistent no matter what heuristic approach is used.

SpacePAC

Unlike iPAC and GraphPAC, SpacePAC attempts to identify clustering directly in \(\mathbb {R}^{3}\) by identifying the one, two and three non-overlapping spheres that cover the greatest number of mutations possible at different sphere radii lengths. This statistic is then compared to simulated values in order to come up with a p-value. As described in [26], the specific procedure is:

  • Let s be the number of spheres we consider; s∈{1,2,3}.

  • Let r be the radius considered. Here we consider, r∈{1,2,3,4,5,6,7,8,9,10} Ångstroms.

  • Simulate T(≥1000) distributions of mutation locations over the protein structure. Specifically, for each simulation, every mutation is randomly assigned to a residue i where 1≤iN and N is the total number of residues in the protein quaternary structure.

Next, let X i,s,r represent the number of mutations captured in simulation i (where i=0 represents the observed data), s∈{1,2,3} represents the number of spheres used and r represents the radius of each sphere. Then for a given {s,r} combination,

$$ \mu_{s,r} = \underset{1 \leq i \leq T}{\text{mean}}\{X_{\text{\textit{i,s,r}}}\}, $$
((3))
$$ \sigma_{s,r} = \underset{1 \leq i \leq T}{\text{std.\ dev.}}\{X_{\text{\textit{i,s,r}}}\} $$
((4))
$$ Z_{i} = \max_{s,r} \{ (X_{\text{\textit{i,s,r}}}- \mu_{s,r})/\sigma_{s,r} \} $$
((5))

Once the normalized statistics Z i are calculated, the p-value is estimated as \(1 - \left (\sum {\mathbf {1}_{Z_{0} > Z_{i}}}\right) / T\). Thus per every run of the simulation, there is only one p-value necessary to identify the statistical significance of up to s hot spots. A visual layout of the calculation of this statistic is shown in Fig. 3. It is also worth noting that given n positions and m spheres, there are \(n \choose m\) sphere orientations possible that must be checked under a brute force approach. See [26] for a more efficient approach, which is utilized in the analysis for this manuscript, that nevertheless identifies the globally optimum solution.

Fig. 3
figure 3

Statistic construction. Here we consider up to three spheres (s∈{1,2,3}) and radii of either 3 or 9 Å. The first step is to calculate μ and σ over each column and then a normalized statistic \(Z_{{i,s,r}} = \frac {X_{i,s,r }- \mu _{s,r}}{\sigma _{s,r}}\) for each cell. Then the maximum is taken over each row, specifically m a x s,r Z i,s,r , to obtain Z 0,…,Z 1000. One minus the percentage of cases where Z 0Z i , for i∈{1,…,1000}, is the p-value of our observed statistic Z 0. As 1000 simulations were run, if Z 0>Z i i, a p-value <1.00E−03 is reported [26]

Multiple comparison adjustment for structures

A multiple comparison adjustment was made to account for considering the 2156 successfully aligned protein quaternary structures. As multiple structures may be comprised of the same protein subunits, a Bonferroni adjustment was too conservative and an FDR approach was performed. Namely, a rough FDR (rFDR) [29] approach, which approximates the standard FDR methodology [30], was employed due to the large number of potentially positively correlated tests. For this paper, the cutoff was:

$$ rFDR = \alpha\left(\frac{k+1}{2k}\right) $$
((6))

where k=2156, the total number of structures in the study. Using an α=0.05, the r F D R≈0.025012. To be conservative, we rounded down and deemed all clusters with a p-value less than or equal to 0.025 to be significant. Further, for the rest of this manuscript we may refer to iPAC and GraphPAC as the “pairwise” approaches as they require a multiple comparison adjustment for each pair of mutations while SpacePAC does not.

Results and discussion

Of the 2156 structures considered, if blanks are removed1, approximately 1–5 % of the structures are identified to have clustering only when the protein quaternary structure is considered. Furthermore, approximately 1–3 % of the structures are identified to have clustering only when the protein tertiary structure is considered. For the vast majority of structures, both the tertiary and quaternary algorithms are concordant in whether they identify at least one statistically significant cluster in the structure. The results of each algorithm cross-classified by tertiary versus quaternary classification are shown in Fig. 4 below.

Fig. 4
figure 4

A cross-classification of the structures that were successfully aligned by each method after removing blanks. The colors represent the algorithm used to identify clustering: iPAC, GraphPAC or SpacePAC. An incremental count for “3D only” or “4D only” represents that at least one statistically significant cluster was found only when the tertiary or quaternary structure was considered, respectively. An incremental count for “Both” or “Neither” signifies that the results were concordant regardless of whether the tertiary or quaternary structure was considered

For structures that were identified under only the tertiary methodologies, it is likely that the significant clusters were close to the adjusted p-value threshold and when the entire protein complex was considered the additional multiple comparison penalty was high enough to negate the statistical significance. As such, if a quaternary structure is available, it would be statistically preferable to use in order to reduce potential false positives. For a detailed comparison of which structures were identified by the tertiary and quaternary methods, see “Quaternary vs Tertiary.xlsx” in Additional file 4.

In Fig. 5, we consider the correlation between each of these methods on a per structure basis. Because cluster counts are not directly comparable between SpacePAC and the other two approaches, we applied a nominal classification of three categories: 1) clustering detected, 2) no clustering detected and 3) blank. Cramer’s V [31], was then used to calculate the correlation coefficient between each approach. For reference, Cramer’s V \( = \sqrt {\frac {\chi ^{2}/n}{min(k-1, r-1)}}\) where χ 2 is the statistic from Pearson’s Chi-Squared Test, k is the number of columns, r is the number of rows, and n is the grand total number of observations of pairs (A i ,B i ). Here, A i =1 represents whether structure i had a statistically significant cluster under method A (otherwise A i =0) and B i represents whether structure i had a statistically significant cluster under method B (otherwise B i =0). For the purposes of this manuscript, as we are comparing all six pairwise methods over the 2156 structures, k=2 and r=2156 for every pairwise-algorithmic comparison.

Fig. 5
figure 5

A correlation plot between each of the six methods. A hierarchical clustering approach was applied to group the methods into two categories. As can be seen, the methods that rely upon the tertiary structure separate out from the methods that rely upon the quaternary structure. The number inside the circle is the value for Cramer’s V between the two methods. We note that all the values are statistically significant at the 5 % α-level

Figure 6 below presents a per structure view comparison between the two methods when the structures are considered in decreasing lexicographic order. We believe that a hierarchical reordering of the structures is not appropriate in this case due to the fact that we once again consider only the trinary outcome of “clustering”, “no clustering” and “blank”. However, from Fig. 6, it is clear that for many structures, what is considered a “blank” becomes a result with “no clustering” when the larger quaternary structure is considered. This is due to the case that when all the subunits in the quaternary structure are considered, it is more likely to observe at least two mutations. As such, the structure is no longer considered to be blank and whether there is clustering or not can now be determined. As can be seen from Fig. 6, this pattern of “blanks” being converted to “no clustering” is consistent for all three methods: iPAC, GraphPAC and SpacePAC. Please see Additional file 5 “Trinary Outcomes.xlsx” for the specific details for each structure.

Fig. 6
figure 6

A modified heat map that shows whether the structure contained statistically significant clusters or not. Structures with fewer than two mutations were considered as blanks

Table 1 shows the top five statistically significant structures found by each of the spatial methods when considering quaternary structure. As can be seen from the table, while there is significant overlap, there are differences between the algorithms in regards to which structures are identified. This is analogous to when the tertiary structure is considered and suggests that while one should look at the quaternary structure as opposed to the tertiary structure, looking at the macromolecule does not make one of the spatial approaches perform significantly better. Refer to “Methodology Results.xlsx” in the Additional file 3 for a full listing of all 2156 structures along with the clustering results when tertiary and quaternary structures are considered. While it is outside the scope of this paper to go through every protein structure identified to have clustering individually, we note that many of the complexes that we identify when we consider quaternary structure have biological implications. For example, structure 2YDR contains the TP53 subunit, one of the most common tumor suppressors that has been implicated in a large variety of human cancers [3234]. Alternatively, structure 4MNQ from Table 1 contains the HLA class I histocompatibility antigen which plays a significant functional role in the immune system and has recently been associated with lung cancer [35]. In Sections iPAC identifies new proteins with clustering, GraphPAC identifies new proteins with clustering and SpacePAC identifies new proteins with clustering, we cover three representative structures in further detail.

Table 1 Summary of the top five most statistically significant structures for each method when using the quaternary structure

Next, we considered the performance by iPAC, GraphPAC and SpacePAC when the quaternary structure is utilized as compared to PolyPhen-2 [17] and CHASM [16]. Both PolyPhen-2 and CHASM utilize a large set of features when evaluating each mutation while QuartPAC runs with vastly less a priori information. We note that in order to do a fair comparison, while the quaternary methodologies evaluated each structure, the machine learners evaluated all the protein subunits in each structure. Thus, if at least one subunit had a significant finding under the machine learning methodology, we counted it as a significant finding for the entire quaternary structure. Out of the 343 significant structures found by iPAC to contain mutational clustering when considering quaternary structure, PolyPhen-2 identifies 145 (42 %) structures as having damaging mutations while CHASM identifies 78 (23 %) structures containing driver mutations when using the standard FDR of 20 %. While GraphPAC identified 329 structures with significant clustering, PolyPhen-2 identified 131 (40 %) structures with potentially damaging mutations while CHASM identified 89 (27 %) structures. Of the 327 structures identified by SpacePAC as significant, 129 (40 %) and 74 (23 %) structures were identified by PolyPhen-2 and CHASM respectively. These results are summarized in Table 2 below.

Table 2 This table summarizes how many of the structures identified as having a significant cluster by one of the quaternary methodologies also had a subunit that had at least one damaging (in the case of Polyphen-2) or driver (in the case of CHASM) mutation

We note, that in [2426] the overlap between the machine learning approaches and the tertiary methodologies was larger. As the machine learners do not account for the other subunits in the folded protein structure, they flag fewer proteins as having damaging mutations due to the fact they do not leverage the information from the entire folded protein structure, but rather from one protein subunit. As such, the quaternary methodology may increase the chances of finding a critical mutational area when used in conjunction with other machine learning algorithms. See “Performance Evaluation.xlsx” in Additional file 6 for a breakout per structure.

Finally, we compared our results to the data in the OMIM (Online Mendelian Inheritance in Man) [36]. To do this, we cross-tabulated all the 2156 structures we considered and identified their matching entries on a per-gene level in the OMIM database. Each of these genes in the OMIM database was then classified as a binary “true” or “false” where “true” signifies that the gene was denoted to be either causal or related to a disease. This pairing was completed using the most up-to-date version of the OMIM database available as of January 16th, 2016. The results of this analysis, when considering structures found only by tertiary or quaternary methods, are shown in Table 3 below and further details are available in “OMIM classification.xlsx” in Additional file 7.

Table 3 The p-value represents the results of a one-sided binomial hypothesis test where H 0:p 0=p 1 and H a :p 0>p 1 where p 0 is the proportion of structures found that had a corresponding entry in OMIM when using the quaternary version of the method and p 1 is the proportion of structures with a corresponding entry in OMIM when using the tertiary version of the method

As can be seen from Table 3 there were significantly more structures found by the quaternary versions of iPAC and GraphPAC with related OMIM entries. While the difference was not statistically significant for SpacePAC, that was mainly due to the fact that SpacePAC had much less of a discrepancy between structures that were found only under quaternary and only under tertiary approaches. An expanded version of this table, which considers structures found by both tertiary and quaternary methods combined, is available in Additional file 7 “OMIM classification.xlsx” file. Further, we would like to mention two important observations when analyzing our results in comparison with the OMIM data. First, it is important to note that the OMIM database is not all-inclusive; namely there could very well be genes with hotspots that are oncogenic but which have not been added to the database as of yet. Second, the quaternary methodology described in this manuscript is meant to provide the wet-bench researcher with additional statistically significant clusters. While these clusters may be potential therapeutic targets, final confirmation lies further downstream in the development process and is beyond the scope of this text.

iPAC identifies new proteins with clustering

Under iPAC, there were 56 structures that were identified only when considering the protein quaternary structure. While it is outside the scope of this manuscript to go through each one in detail, we present an example from this set. Specifically, we will now consider 1SUV [37], the structure of human transferrin receptor-transferrin complex. This structure is composed of Transferrin Receptor Protein 1 (TFR1) as well as the C-lobe and N-lobe of serotransferrin. Transferrin proteins, which control the level of free iron, are plasma glycoproteins which are encoded by the TF gene [38, 39]. Recently, it was shown that elevated expression of TFR1 contributes to the oncogenic signaling performed by Sphingosine Kinase 1 (SK1), which in elevated levels enhances cell survival, proliferation and can induce neoplastic transformation. Moreover, by blocking TFR1 with a neutralizing antibody, SK1-induced abnormal cell growth is inhibited which suggests that TFR1 presents a potential therapeutic target for SK1-mediated tumorigenesis [40].

The statistically significant clusters are shown in Table 4 with the clusters referenced by their serial number within the structure file. We note that in addition to the oncogenic implications described above, cluster III also contains mutation G277S in the serotransferrin protein (Uniprot ID: P02787) which is associated with a reduction in total iron binding capacity and is a risk factor for iron deficiency anemia [41].

Table 4 Clusters identified by iPAC for structure 1SUV

The structure of 1SUV is shown below in Fig. 7 below with the boundaries displayed in Table 4 colored in yellow.

Fig. 7
figure 7

The residues at serial numbers 12570, 13099, 14345 and 15138 are labeled and highlighted in yellow. As cluster II is a subset of cluster I, we have colored the atoms between 12570 and 13099 blue, the atoms between 13099 and 14345 red and the atoms between 14345 and 15138 purple. The rest of the structure is colored in green

We note that had the entire structure not been considered, no significant clusters are found, signifying that the biological quaternary unit resulted in more mutations within close proximity than any one tertiary substructure alone.

GraphPAC identifies new proteins with clustering

We now proceed to consider structure 2GRN [42], one of the 43 structures found to be significant by GraphPAC only when the quaternary structure is considered. 2GRN is comprised of two molecules, Ubiquitin-conjugating enzyme E2I which is coded by UBE2I and Ran GTPase-activating Protein 1 which is coded by RANGAP1. Protein ubiquitination is a critical post-translational modification where ubiquitin is added to a substrate protein. This in turn can signal for protein degradation, alter cellular location as well as prevent or promote protein-protein interactions [4345]. RanGAP1 is a GTPase activator, converting the Ras-related nuclear regulatory protein Ran to its putatively inactive GDP-bound state [46]. Recently, it has been shown via comparative proteomic analysis that RanGAP1 is differentially expressed in diffuse large B-cell lymphoma (DBCL) and that a multikinase inhibitor induces cell death, hyperphosphorylation and mitotic cell arrest of RanGAP1 in DLBCL cell lines but not in normal B and T cells. This suggests a potential biomarker as well as therapeutic target for aggressive B-cell lymphoma [47].

For this structure there was one statistically significant cluster identified in Ran GTPase-activating Protein 1 (UniprotID: P46060) shown in Table 5 and Fig. 8.

Fig. 8
figure 8

The atoms with serial numbers 1352 and 1444 are labeled and highlighted in yellow. The amino acids between those two atoms are shown in blue

Table 5 Cluster identified by GraphPAC for structure 2GRN

It is worth noting that the cluster is nearby amino acid 442 which is phosphorylated at the onset of mitosis and is associated with RanBP2 regardless of its phosphorylation state. As such, the phosphorylation is believed to potentially effect RanGAP1’s catalytic activity or allow RanGAP1 to recruit specific SUMO target proteins to RanBP2’s catalytic domain [48].

SpacePAC identifies new proteins with clustering

Finally, we now consider structure 2YDR [49], one of the 21 structures identified by SpacePAC when considering the entire protein macromolecule. 2YDR consists of two protein fragments, one of which is tumor antigen P53 (TP53). TP53 is a well known tumor suppressor involved in cell cycle regulation and apoptosis [50, 51] and is responsible for encoding a transcription factor that is activated in response to cellular stress [52]. The majority of TP53 mutations (over 75 %) correspond to missense mutations [53], and approximately 30 % of all TP53 missense mutations occur in CpG dinucleotides [54]. TP53 somatic mutations have been associated with a wide variety of cancers including acute myeloid leukemia [55], colorectal cancer [56] as well as nonsmall cell lung cancer [57]. Moreover, TP53 germ-line mutations have been shown to be the underlying cause of Li-Fraumeni syndrome [58], a rare autosomal dominant hereditary disorder that predisposes the individual to cancer.

While clusters involving the TP53 protein were found in many of our structures when both the quaternary and tertiary structures were considered, the hotspots shown in Table 6 and Fig. 9 are unique only to the quaternary structure. Not only have mutations in that region occurred in sporadic cancers in the case of Li-Fraumeni syndrome, it is also worth noting that P151S (serial number 4627) is associated with squamous cell carcinomas [59]. It is worth noting that in recent years, significant resources, have been spent to drug the TP53 pathway in order to arrest further tumor growth [6062].

Fig. 9
figure 9

The atoms with serial numbers 4577 and 4627 are labeled and highlighted in yellow. The amino acids between those two atoms are shown in blue

Table 6 Clusters identified by SpacePAC for structure 2YDR

Conclusion

In this manuscript we expand upon several previous methodologies in order to account for protein quaternary structure. By utilizing the entire macromolecule that is comprised of several protein subunits we are able to identify several structures with statistically significant clusters that are otherwise missed. Moreover, we demonstrated several examples where the clusters identified may have a potential therapeutic benefit and in some cases, are already currently being targeted by the pharmaceutical and biotech industries. Furthermore, when considering individual protein subunits, many structures are blank in that they don’t have enough mutations to evaluate whether a cluster exists. As our approach considers the entire protein molecule, it is often able to classify whether or not a cluster occurs (even if all the individual subunits are “blank”) by leveraging mutations over all the subunits within the quaternary structure. This type of negative result can provide valuable insight for the wet-lab scientist when screening many compounds to decide which one requires further evaluation. Finally, although we consider larger structures in this approach, the impact on the running time of iPAC, GraphPAC and SpacePAC is negligible when compared to analyzing the tertiary structure. Most structures are analyzed within 10-15 minutes when the software is run on a consumer desktop with an Intel i7-2600k processor and 16 GB of RAM.

While utilizing the quaternary structure is a significant improvement, this methodology is still subject to some of the same limitations as the tertiary approaches. For example, our approach does not allow for unequal rates of mutagenesis in specific genome regions. To help minimize the impact of this assumption, we considered only missense substitution mutations due to the fact that many insertion and deletion mutations are dependent upon sequence location. Further research is required in order account for other genomic mutational hotspots such as CpG dinucleotides which may have mutational rates that are 10 times higher than other locations [63]. However, as most of the clusters identified are similar when considering the tertiary versus quaternary structures, the impact of such hotspots is limited as described by [24, 26]. Our approach also doesn’t account for differences in mutational position due to the type of mutation. For example, cigarette smokers often result in lung carcinomas with transversion mutations [23] while colorectal carcinoma pathologies often demonstrate transition mutations [64]. However, KRAS mutations, which are often present in both of these carcinomas, nevertheless have the vast majority mutations on residues 12, 13 and 61 for both cancers suggesting that the mutation type may only have a small impact on the uniformity assumption [25]. In all, while this approach may still be influenced by a variety of factors that we are unable to account for, it does suggest that utilizing the quaternary structure is beneficial when identifying statistical clusters.

In summary, QuartPAC provides a new (and as far as we are aware, only) tool for researchers to statistically identify mutational clustering when considering the multi-subunit quaternary structure. We show that many of the novel clusters identified have biological and potentially therapeutic relevance. Moreover, by considering the larger oligomeric structure, the additional information provided by the mutations in all the subunits may allow a scientist to definitively rule out a protein structure that would otherwise not have enough data to be classified, providing valuable time savings when many proteins need to be considered. Several promising areas of additional research are self evident such as loosening the requirement that mutations occur uniformly throughout the genome under the null hypothesis. Also, while we present the results here using human missense mutational clusters within proteins, the approach can also be directly applied to both DNA and RNA, as long as the structural data are available.

Ethics statement

Our work only involved information already published or publicly available via pdb.org and cancer.sanger.ac.uk. No human or animal data was collected. As such, our work did not need to be reviewed by an ethics committee.

Consent

This article is not a prospective human study nor does it present individual clinical data. All clinically relevant data is referenced to other already published articles.

Endnote

1 Blanks are defined as structures where there is at most one mutation. Thus, by definition, no clustering is possible.