Cancer, one of the most widespread and heterogeneous diseases, is at its most fundamental level a disease brought on by the accumulation of somatic mutations [1]. These mutations typically occur in either tumor suppressors or oncogenes. While oncogenic mutations either tend to deregulate or up-regulate the resulting protein behavior, mutations within tumor suppressors typically lower the activity of genes that prevent cancer. Pharmacological intervention has shown to be more effective with inhibiting activating oncogenes than with restoring functionality of tumor suppressing genes. Combined with the theory of "oncogene addiction", that many cancers are dependent upon a small set of key genes to drive their rapid cellular multiplication with the rest of the mutations simply being passenger mutations [2, 3], the identification of driver oncogenic mutations has become of critical importance in cancer research.

Due to the importance of this problem, several approaches have been proposed to detect naturally selected regions in which activating mutations occur. One general approach postulates that driver mutations will have a higher non-synonymous mutation rate as compared to the background level after normalizing for the length of the gene [46]. Similarly, assuming that the neutral rate of nucleotide substitution is surpassed when positive selection is acting on a specific region, one can check if the ratio of nonsynonymous (K a ) to synonymous (K s ) mutations per site is greater than 1 [7]. Relatedly, Ye et al.[8] and Ryslik et al.[9] showed that mutational clusters can be indicative of activating mutations and that finding such clusters is a way to reduce the driver mutation search space needing to be analyzed. An alternative approach relies on creating classifiers to categorize mutations. Machine learning algorithms such as Polyphen-2[10], which predicts whether a missense mutation is damaging, and CHASM[1113], which discriminates between known driver mutations and a set of passenger mutations, rely upon a set of rules developed using a variety of machine learning techniques such as Random Forests [14] and Support Vector Machines [15]. These rules can be used to calculate a score for each mutation based upon both sequence and non-sequence-based features such as evolutionary conservation, size and polarity of the substituted residue as well as accessible surface area [16]. Other classifiers, such as SIFT[17], use only a subset of these features, e.g. evolutionary conservation, for prediction.

While the methods based upon background mutational rates have had some success in identifying regions of positive selections or driver mutations, they nonetheless suffer from several shortcomings. First, many of these methods rely upon calculating the difference between synonymous and non-synonymous mutations but do not take into account that selection can act upon minute regions of the gene. Thus, when the mutations rates are averaged over the entire gene, the signal may be lost. Second, the methods proposed by Kreitman [7] and Wang [4] do not differentiate between activating gain-of-function mutations and inactivating loss-of-function non-synonymous mutations. Third, many of the machine learning methods require an extensive rule set that must first be trained using a well annotated database that is still limited. Until the requisite literature and information is developed, the machine learning algorithm is unable to create a well-performing classifier. Furthermore, the rules must be updated periodically to reflect updated knowledge and information. For a recent review of several popular methods that attempt to discern missense substitution effect on protein function see Gnad et al.[18] and Gonzalez-Perez et al.[19].

Building on the work of Bardelli et al.[5] and Torkamani and Schork [20], which stipulated that only a small number of specific mutations can activate a protein, Ye et al.[8] developed Non-Random Mutational Clustering (NMC) to identify potential activating mutations. NMC works on the hypothesis that absent any previously known mutational hotspot, a mutational cluster is indicative of a possible activating mutation. This is based on the observation that most amino acid substitutions are either neutral or incompatible with protein function, resulting in a concentration of activating mutations within a small subset of protein residues and domains [8]. For the null hypothesis that mutation locations are random in the candidate protein when represented in linear form, NMC identifies clustering by evaluating whether there is statistical evidence of mutations occurring closer together on the line than expected by chance. While NMC is able to implicate some cancer related genes, it is limited by the fact that it considers the protein as a linear sequence and does not take into account the tertiary protein structure. To account for protein structure information, Ryslik et al.[9] developed iPAC (identification of Protein Amino acid Clustering), which reorganizes the protein into a one dimensional space that preserves, as best as possible, the three dimensional amino acid pairwise distances using Multidimensional Scaling (MDS) [21]. As described by Ryslik et al.[9], utilizing the tertiary information is critical when identifying clustering as mutations that occur far apart when the protein is considered linearly can be very close together once the protein is folded in 3D space. The 3D proximity of such mutations might thus yield novel clusters. While it was shown that iPAC provides an improvement over NMC, the reliance upon a global method like MDS can potentially result in a distorted rearrangement of the protein, since distant residues will nevertheless have an impact on each other’s final position in one dimensional space.

In this manuscript, we provide an alternative method to iPAC by remapping the protein into one dimensional space via a graph theoretic approach. This approach allows for a more natural consideration of the protein, one that is sensitive to protein domains and linkers. We show that our methodology is effective in identifying proteins with mutational clustering that are missed by both iPAC and NMC such as NRP1 and MAPK24. We also show that for some proteins, GraphPAC identifies fewer clusters than inferred by both iPAC and NMC while for other proteins GraphPAC identifies more clusters than the other two methods. While both GraphPAC and iPAC are an improvement over NMC since they account for tertiary structure, the differences between GraphPAC and iPAC point to the fact that different rearrangements of the protein must be considered in order to better understand the mutational clustering landscape. We show that many of the clusters identified by GraphPAC are also classified as damaging by Polyphen-2 and as an activating mutation by CHASM. By providing a more complete picture of mutational clustering than iPAC or NMC individually, GraphPAC allows us to obtain a more accurate landscape of where potential activating mutations may occur on the protein.


GraphPAC uses a four step approach to identifying mutational clusters. The first step, as described in Sections ‘Obtaining mutational data’ and ‘Obtaining the 3D structural data’, retrieves mutational and positional data from COSMIC [22] and the PDB [23], respectively. After reconciling the mutational and positional databases (Section ‘Reconciling the structural and mutational data’), the residues are realized as a connected graph where each residue is a vertex whereupon the traveling salesman problem is heuristically solved in order to find the shortest path through the protein (Section ‘Traveling salesman approach’). Once the shortest path has been identified, the protein residues are reordered along this path providing a one dimensional ordering of the protein. The linear NMC algorithm is then used to calculate which mutations are closer together than expected by chance. Lastly, the clusters are unmapped back into the original space and the results reported back to the user. We detail each of the steps in the sections below.

Obtaining mutational data

The mutational positions were obtained from the 58th version of the COSMIC database that was downloaded via the following ftp site: The database was implemented locally using Oracle 11g. Only missense mutations that were classified as "Confirmed somatic variant" or "Reported in another cancer sample as somatic" were selected, with nonsense and synonymous mutations excluded. Moreover, we only considered mutations originating from studies that were classified as whole gene screens. Next, since multiple studies can report mutational data from the same cell line, mutational redundancies were removed to avoid double counting the mutations. Lastly, only the proteins with a UniProt Accession Number [24] were kept in order to correctly match the mutational and positional data, resulting in 777 proteins. See "COSMIC query" in Additional file1 for the SQL code required to generate the mutational data.

Obtaining the 3D structural data

The PDB web interface was used to obtain the protein tertiary information for each of the 777 proteins described in Section ‘Obtaining mutational data’. Since multiple structures are often available for the same protein, all structures with a matching UniProt Accession Number were used and an appropriate multiple comparisons adjustment (see Section ‘Multiple comparison adjustment for structures’) was performed afterwards. For proteins where the resolution provided alternative conformations, the first conformation listed in the file was used. Similarly, for structures where more than one polypeptide chain with a matching Uniprot Accession Number was available, the first matching chain listed in the file was used (typically chain A). Finally, after the chain and conformation were selected, the cartesian coordinates of all the α-carbon atoms were used to represent the tertiary backbone structure of the protein. While we only used the α-carbon location to represent the residue location in this paper, our methods are robust if any of the other backbone atoms are used including the amide nitrogen, main chain carbonyl carbon or the main chain carbonyl oxygen.

Also, while X-ray crystallography was used to determine many of the tertiary structures in the PDB, we note that molecular dynamics (MD) could in principle be used to model the protein structure in solution. However, taking into account the time complexity of such an approach for larger proteins as well as the number of structures that we consider, such a task is beyond the scope of this paper [25]. Further, as crystal structures are almost always representative of the correctly folded protein, using the current structural information is more than sufficient until MD simulations can be applied on much faster time scales. See "Structure Files" in Additional file2 for a full listing of all the 1,904 structure/chain combinations used.

Reconciling the structural and mutational data

In order to reference the same residue in the COSMIC and PDB databases, an alignment was performed to accommodate their different numbering systems. Like iPAC, GraphPAC allows two such reconciliations. The first is based upon a pairwise alignment as described in Pages et al.[26] while the second is based upon a numerical reconstruction from the structural information available in the PDB file. Due to the fact that the PDB file structure potentially changes depending upon the structure release date along with other technical complications, pairwise alignment was used for all the analysis described in this paper unless specifically noted. For further information on the alignment please see the documentation in the GraphPAC package available on Bioconductor. Protein/structure/chain combinations that resulted in only one mutation or no mutations on the residues for which tertiary information was available were dropped. Similar to iPAC, a successful alignment of the tertiary and mutational data was obtained for 140 proteins corresponding to 1100 unique structure/chain combinations. See "Structure Files" in Additional file2 for a full listing and description.

Traveling salesman approach

Since the NMC algorithm requires order statistics to identify clustering (see Section ‘NMC’), we need to map the protein from a three dimensional to a one dimensional space so that order statistics may be constructed. Contrary to iPAC, which employed MDS, a graph theoretic approach is used by GraphPAC. As discussed above, one major limitation of MDS is that the minimization of the stress function:

σ 1 = i , j [ f ( δ i , j ) - d i , j ( X ) ] 2 i , j d i , j 2 ( X )

results in every residue having an effect on the final position of every other residue. In Equation 1, δ ij represents the Euclidean distance between residues i and j in the original higher-dimensional space while di,j(X) represents the distance between them in the lower dimensional space X. Lastly, f, is used to account for situations where the proximity measures δi,j do not come from a true metric space. Since in our case, δ i , j R, f is the identity function. Minimization of σ1 may not capture that a protein is typically comprised of several domains and that only residues within a specific domain should influence each other’s final position in linear space (see Figure 1). Under the GraphPAC algorithm, we first construct a complete graph with each residue represented by a vertexa. We then create a linear ordering of the protein by finding a Hamiltonianb path through the graph. As the number of distinct Hamiltonian paths on a complete graph with N vertices is equal to ( N - 1 ) ! 2 , a direct consideration of all possible paths is computationally unfeasible. Further, selective pruning of the edges based upon edge distance is also often impractical due to the domain structure where many residues are close to each other. Because of these factors, we use a heuristic algorithm that solves the Traveling Salesman Problem (TSP) [27, 28] to find a linear path that is approximate of the shortest path through the protein. We then use this path as a representative reordering of the protein into one dimensional space. Unlike iPAC, which is based on a global remapping, this methodology takes into account only locally neighboring residues to remap the protein to one dimensional space.

Figure 1
figure 1

An example protein with three different domains. Under iPAC, the Domain A residues will influence the final positions of Domain C residues and vice versa, a result that is undesirable if the three domains are independent of each other. The residues in Domain A and Domain C will have no effect on each other’s final position via the graph theoretic approach.

While there are many heuristic solutions for the TSP (see Gutin and Punnen [29]), we consider three of the most common insertion methods [30]: cheapest insertion, farthest insertion and nearest insertion as described below. Specifically, the objective of the TSP is to find a cyclic permutation π of {1,2,3,…,n} that minimizes the total tour distance, namely:

min π i = 1 n d ( i , π ( i ) )

Here, d(i,j) represents the distance between residues i and j (with d(i,i) = 0) and π(i) represents the residue that follows residue i on the tour. The difference between the three insertion methods rests on how the next residue k is selected for insertion. Under cheapest insertion, the next k to be inserted into the tour is chosen such that the increase in tour length is minimal. Under nearest insertion, at each iteration, the k that is closest to a residue already on the tour is selected. Finally, under farthest insertion, the k that is farthest away from any residue already on the tour is selected.

These algorithms have different upper bounds on their tour lengths. For example, the farthest insertion algorithm creates tours that approach 3 2 of the shortest length while the nearest and cheapest insertion algorithms can be linked to the minimal spanning tree algorithm and thus have an upper bound of twice the shortest tour length when distances satisfy the triangular inequality [28]. Due to the varied nature of these methods and that there is no biological justification to favor one over the other, we consider all three methods when identifying clusters and then perform an appropriate multiple comparison adjustment to infer the statistical evidence of mutation clusters (see Section ‘Multiple comparison adjustment for structures’).

As can be seen from Figure 2, all the rearrangement options present a positive skew and are mostly consistent with each other. For the majority of the proteins, all three insertion approaches as well as the MDS approach result in little rearrangement. However, if one method results in radical rearrangement when the protein is mapped to 1D space, the other methods do so as well. This makes selection of a specific insertion method less critical and for the rest of this manuscript, unless otherwise specified, we use the insertion method with the most significant cluster for analysis. Please see "Distribution Summary" in Additional file3 for a full listing of each structure’s Kendall Tau distance, protein index and a high resolution plot.

Figure 2
figure 2

The amount of rearrangement performed under each of the three insertion methods described as well as MDS. Each column on the x-axis represents one of the 1100 structures considered, with structures from the same protein adjacent to one another and the protein order determined lexicographically by protein name. The y-axis shows the Kendall Tau distance, which is equivalent to the number of swaps required to sort the protein back into {1,2,3,…,..} order using bubble sort. The proteins with at least one rearrangement higher than 150,000 represent the DPP4, F5, IDE, MET, PIK3C α, SEC23A and TF proteins, from left to right, respectively.

Path lengths between nearby and distant residues are statistically different

We employed a statistical test to verify that the TSP algorithm yields a shorter path between residues that are close together in 3D space versus residues that are far apart. First, we selected 200 random protein structures from our data set. For each structure we then selected 100 random amino acids and categorized them as "close" versus "far" in 3D space (see Table 1 for more information on the classification). For structure i, we then calculated the average path distance between all pairwise close residues, denoted c ̄ i , and the average path distance between all pairwise far residues, denoted f ̄ i . Next, we calculated the average close and far path distance over all structures: c ̄ =( i = 1 200 c ̄ i )/200 and f ̄ =( i = 1 200 f ̄ i )/200. Finally, as c ̄ and f ̄ are averages, we applied the central limit theorem and performed a t-test with H 0 : c ̄ = f ̄ vs H a : f ̄ > c ̄ . This test was performed for each of the insertion methods described in Section ‘Traveling salesman approach’ and at various classifications of close and far. As shown in Table 1, the p-value is ≈0 for each combination of distance and insertion method, allowing us to conclude that the average path distance between residues that are far apart in 3D space is larger than the average path distance between residues that are close together in 3D space.

Table 1 This table shows the p values when testing the difference in path length between residues that are close versus far apart


The NMC algorithm as described by Ye et al.[8], and briefly reviewed here, was used to find the mutational clusters once the protein was remapped to 1D space. To begin, suppose we had m samples of a protein that was N residues long and that there were a total of n mutations over all m proteins. As shown in Figure 3, by collapsing over the m samples, we can construct order statistics for every mutation. Then, given order statistics X(k) and X(i) where i < k, we define a cluster to exist if P r(C ki  = X(k)-X(i)) ≤ α, for some predetermined significance level α. As shown in Ye et al.[8], while a closed form calculation of the above probability is possible, it often becomes computationally costly. To overcome this, we calculate C ki N and assume that the statistic is uniform on (0,1). Then in the limit, it can be shown that:

Pr C ki N = X ( k ) - X ( i ) N c = 0 c n ! ( k - i - 1 ) ! ( i + n - k ) ! y k - i - 1 ( 1 - y ) i + n - k dy = Pr ( Beta ( k - i , i + n - k + 1 ) c )
Figure 3
figure 3

An example constructing order statistics over 3 samples with 7 total mutations. The number inside the box indicates the residue number. A "*" above a residue signifies a non-synonymous missense substitution mutation for that residue. Figure from Ryslik et al.[9].

The above calculation is then performed on all pairwise mutations and an appropriate multiple comparison adjustment is then applied. For the remainder of this study, we use the more conservative Bonferroni correction [31, 32] to adjust for the intra-protein cluster p-values. See Section ‘Multiple comparison adjustment for structures’ for a description of how we account for the inter-protein multiple comparisons. Lastly, it is important to mention that the structural information obtained for each protein does not always contain the (x,y,z) coordinates for every residue in the protein. In such cases, in order to compare GraphPAC, iPAC and NMC on an equal basis, these missing residues are removed from the protein.

We also note that since we obtained our mutational data from COSMIC, some tissue types are more represented than others in the database. However, this scenario results in our analysis being more conservative and our findings even more significant. Assuming that mutations occur in different parts of the protein for different tissue types, when collapsing over all tissues a larger value of n is obtained while the values of i and k (as seen in Equation 2) for two specific mutations are not changed. This results in a larger p-value signifying that clusters found when collapsing over tissue types would be even more significant if only a unique tissue type was analyzed.

Multiple comparison adjustment for structures

In addition to the Bonferroni adjustment performed to account for multiple testing within a specific structure, we perform a second multiple comparison adjustment to account for testing all 1,100 structures. Since a single protein can have many structures that are similar to each other, a second Bonferroni adjustment is too conservative and an integrated Bonferroni-FDR approach was performed. Specifically, for a given protein, the Bonferrroni adjusted p-value of each cluster was multiplied by n ( n - 1 ) 2 to calculate p. Thus, p could be compared directly to an α-level of 0.05 in order to determine the cluster’s significance. Next, a rFDR[33] approach, which is a good approximation for the standard FDR method when there are a large number of independent or positively correlated tests, was used. Under this method, the expected value of α is estimated over all k tests and then used as the significance threshold. Setting k as the total number of structures under all three insertion methods, the mean alpha can be approximated by:

rFDR = α k + 1 2 k

where k = 3 × 1100 = 3300. Using α = 0.05, rFDR is calculated to be ≈0.025007. Rounding down, all the clusters for which p ≤ 0.025 were deemed to be significant. To avoid confusion in the rest of the paper, we only report the p-value (with the exception of Table 2). However, each cluster discussed in Section ‘Results’ is significant after the Bonferroni-FDR multiple comparison adjustment described here.

Table 2 A comparison of the 15 proteins that were found to contain significant clustering via GraphPAC , iPAC or NMC


In this section we compare the results between GraphPAC, iPAC and NMC in terms of the number of structures found (Section ‘Method comparison’) and describe the new proteins identified by GraphPAC (Section ‘GraphPAC identifies novel proteins with significant clustering’). We also show the results of our method in comparison to two machine learning methods along with a descriptions of whether our results overlap biologically relevant structures (Section ‘Cluster localization in relevant sites and performance evaluation’).

Method comparison

Using the GraphPAC algorithm, out of the 140 proteins analyzed, 9, 10 and 12 proteins with statistically significant clusters were found under the cheapest, nearest and farthest insertion methods, respectively. This corresponded to 223, 225 and 226 significant structures (out of the 1100 total structures considered) under the three methods. It is important to note that failure to utilize the tertiary information results in either an over or an underestimation of the number of clusters in approximately 70% of the structures analyzed (see Figure 4). Hence, failure to account for the protein structure provides either an overly complicated or overly simplified view of the mutational orientation. On the protein level, as shown in Table 2, eight proteins were identified as having significant clusters by GraphPAC, NMC and iPAC while 7 proteins were identified as having significant clusters by only a subset of these methods. We note that of these seven proteins, four of them were only identified via the GraphPAC methodology while two of them were identified only via the iPAC methodology. We further note that GraphPAC identifies the largest number of proteins with significant clustering at the same false discovery rate, thereby showing an increased power to detect mutational clustering. We also observe that there were no proteins found to have significant clustering under the linear NMC algorithm that were subsequently missed by the GraphPAC algorithm. See Section ‘Cluster localization in relevant sites and performance evaluation’ for a summary of cluster overlap with active biological sites along with a performance evaluation via machine learning methods.

Figure 4
figure 4

A comparison of GraphPAC , iPAC and NMC over all the structures that were found to be significant. Each of the 3D methods are considered: all three GraphPAC insertion methods and iPAC. The size of each colored block represents the number of structures with the relationship described. For instance, out of the 223 structures with significant clusters found under the cheapest insertion method of GraphPAC (top left), 94 structures had more clusters identified under the GraphPAC approach as compared to the NMC approach. Green is used to designate structures where the 3D and NMC methods identified 1 cluster while purple is used to designate structures where the 3D and NMC methods identified more than 1 cluster.

GraphPACidentifies novel proteins with significant clustering

GraphPAC identified four proteins with clustering that are missed by the iPAC algorithm: DPP4, MAP2K4, NRP1, and PSCK9. DPP4 is a serine protease that can modify tumor cell behavior and is a potential cancer therapeutic target [34]. Both MAP2K4 and NRP1 are well known to be associated with lung cancer [35, 36]. Finally, while PCSK9 mutations are well known in causing hypercholesterolemia [37], recent research shows that absence of PCSK9 can provide a protective benefit against melanoma due to lower circulating LDLc. This allows for a potential additional cancer therapy via PCSK9 inhibitors [38]. [38]. For a full listing of which structure-protein combinations were found significant, see "Results Summary" in Additional file4. Please see Sections ‘GraphPAC finds novel proteins compared to iPAC and NMC’, ‘GraphPAC identifies additional clusters compared to iPAC and NMC’ and ‘GraphPAC finds fewer clusters compared to NMC’ for an in-depth review of selected protein-structure combinations.

Cluster localization in relevant sites and performance evaluation

We note that 9 of the 13 proteins that GraphPAC identified as having significant clustering have their most significant cluster overlap a binding site, catalytic domain or kinase domain. Out of the remaining four proteins, three proteins have their most significant cluster fall within a previously identified biologically relevant region. For instance, IDE’s most significant cluster is located on residues 684–698, a denaturation-resistant epitope region [39]. For NRP1, which plays roles in angiogenesis [40] and axon guidance [41], the most significant cluster directly overlaps the F5/8 type C 1 domain - a domain in many blood coagulation factors. Finally, for PIK3C- α, the most significant cluster overlaps residue 1047 which has been shown to potentially increase the substrate turnover rate, a common oncogenic behavior [42]. For further detail on relevant biological site information, please see "Relevant Sites" in Additional file5.

Further, we evaluated the performance of GraphPAC via two well-known machine learning algorithms: CHASM and PolyPhen-2. It is critical to first note however, that the machine learning algorithms utilize a much more detailed set of features when evaluating the mutation. Thus these algorithms may identify mutations as significant while GraphPAC would not. Nevertheless, of all the mutations that fall within significant clusters identified by GraphPAC, 93% and 91% of them were also identified as significant (FDR ≤ 20%) by CHASM and PolyPhen-2 (respectively). We note that GraphPAC is only able to determine statistically significant clustering and not whether a mutation is truly damaging and/or activating. However, given the high percentages described above, the evidence supports the hypothesis that clustering is in fact indicative of potential driver mutations. Thus, via GraphPAC, the researcher has a fast and easily available tool to identify potential driver mutations for further study. The benefit of GraphPAC is that it is able to be executed with far less prior information as compared to the machine learning approaches. For further details, see "Performance Evaluation" in Additional file6.

Finally, we note that while GraphPAC provides an improvement in cluster identification compared to prior work, the algorithm is unable to distinguish between mutations that increase or decrease kinase activity nor between gain-of-function (GOF) or loss-of-function (LOF) mutations. As described by Lapenna and Giordano [43], Brognard et al.[44], Geiger et al.[45], Ahn et al.[36], Lisabeth et al.[46] and Linka et al.[47], a large body of literature suggests that inactivating loss-of-function mutations are more common than previously thought and often occur in regions that regulate kinase activation. Nevertheless, as described above, many of the clusters identified by GraphPAC contain mutations that are classified as driver and/or damaging by common machine learning algorithms. As such, GraphPAC provides a fast and easy method to identify such potential mutations, which can then be verified and analyzed via additional approaches. These approaches can range from the aforementioned machine learning algorithms to experimental approaches that test for GOF mutations as described by Fawdar et al.[48].


In this section we discuss in depth some of the clustering results presented in Section ‘Results’. Specifically, we review in detail three situations: 1) GraphPAC identifies novel proteins (Section ‘GraphPAC finds novel proteins compared to iPAC and NMC’), 2) GraphPAC finds additional clusters in proteins identified to contain clustering by other methods (Section ‘GraphPAC identifies additional clusters compared to iPAC and NMC’) and 3) GraphPAC finds fewer clusters compared to other methods (Section ‘GraphPAC finds fewer clusters compared to NMC’). In each of these sections, we discuss the biological relevance of our findings.

GraphPAC finds novel proteins compared to iPAC and NMC

As shown in Table 2, GraphPAC identified five additional proteins as compared to the linear NMC algorithm. In this section we will consider two of these proteins, both of which are directly related to cancer: EGFR, which is also identified by iPAC, and NRP1, which is not identified by iPAC.

EGFR is a cell-surface receptor for ligands in the epidermal growth factor family [49] and is present in a wide range of diseases such as glioblastoma multiforme [50], lung adenocarcinoma [51] and colorectal cancer [52]. The most significant cluster found was in the 2ITX structure [53] between residues 719–768 (see Figure 5) with a corresponding p-value of 0.0009. This cluster contains mutations G719S, T751I and S768I which are all found in non-small cell lung carcinomas (NSCLC) [5456] with mutation G719S well known for increased kinase activity [57]. It is also interesting to note that all three mutations within this cluster, which was identified purely through statistical clustering analysis, show a beneficial clinical response to either Erlotinib or Getfinib [55, 58, 59]. Exclusion of the tertiary information would have resulted in this cluster being missed. Finally, it is worth noting that recent research has shown that signaling by EGFR is dependent upon an allosteric interaction between two kinase domains in an asymmetric dimer as opposed to phosphorylation. As the formation of this asymmetric dimer is believed to activate all EGFR family members, it is likely that oncogenic activation of EGFR may differ from other protein kinases [60, 61].

Figure 5
figure 5

The EGFR ectodomain fragment structure (PDB ID 2ITX) where the 719–768 cluster is colored in blue. The three mutations, 719, 751 and 768 are displayed as purple spheres.

We now consider the NRP-1 protein, a coreceptor for the vascular endothelial growth factor (VEGF) which is upregulated in a large variety of cancers including lung tumors [35], gastrointestinal metasteses [62] and pancreatic carcinomas [63]. In NSCLC patients, it has been shown to be an independent predictor of cancer relapse and reduced survival as well as a cancer invasion enhancer [64]. Moreover, research has shown that NRP-1 inhibitors provide an additive effect to anti-VEGF therapy in reducing tumor progression. Monoclonal antibodies that attach to the b1-b2 domains, the domains responsible for VEGF binding, have already been created [65]. The b1 domain, which spans residues 275–424 almost exactly overlaps the most significant cluster found by GraphPAC, which consists of residues 277–432 (p-value 0.0158) in the 2QQI [66] structure (Figure 6). Finally, it is worth noting that mutations on residues 297 and 320 were recently found that completely disrupt VEGF binding, both of which also fall within the GraphPAC identified cluster of 277-432 in the 2QQI structure.

Figure 6
figure 6

The NRP-1 structure (PDB ID 2QQI) where the 277–432 cluster is colored in red. The mutations that disrupt VEGF binding, 297 and 320 are shown as orange spheres while the end-points of the cluster, 277 and 432, are shown as purple spheres.

GraphPAC identifies additional clusters compared to iPAC and NMC

A representative example where GraphPAC identifies additional clusters as compared to NMC and iPAC is in the KRAS protein for the 3GFT structurec[67] (Figure 7). KRAS, a GTPase, is one of the most pervasively activated oncogenes, with some estimates stating that between 17–25% of all human tumors contain an activating mutation of the gene [68]. Due to the large number of samples with mutations in this gene and the resulting strong statistical signal, GraphPAC, iPAC and NMC all identify that KRAS contains highly statistically significant mutational clusters. Nevertheless, GraphPAC identifies several novel clusters that are missed by iPAC and NMC. While all three methods identify clustering at residues 12–13, 12–61 and 12–146, only iPAC and GraphPAC identify two additional clusters at 1) 61–117 and 2) 117–146.

Figure 7
figure 7

The KRAS structure (PDB ID 3GFT) color coded by region: amino acids 13–22 are blue, 24–60 are red and 62–145 are yellow. Residues 12 and 13 which make up the most significant cluster are shown as purple spheres, while residues 23, 61, 117 and 146 are shown as brown spheres.

Moreover, only GraphPAC (under the cheapest and nearest insertion methods) identifies a statistically significant cluster for residues 12–23 and 23–61 as shown in Table 3. Considering the 12–23 cluster, we see that a sub-cluster of 12–13 is identified as well. This follows biological function as mutations on residues 12 and 13 appear in a large variety of cancers, such as breast, lung, bladder, pancreas and colon [6, 69, 70] while mutations on residues 22 and 23 appeared in colorectal/large intestine tissue samples in our data. It is interesting to note that germline mutations on residue 22 often result in developmental disorders such as Noonan Syndrome Type 3 (NS3) as well as Cardiofaciocutaneous Syndrome (CFC) [71, 72].

Table 3 P-value comparison of the three algorithms for several significant clusters

Finally, the majority of mutations in cluster 61–146 also segregate along pathological lines with all the mutations in our data either occurring in lung or gastrointestinal tract/large intestine carcinomas. Specifically, residue 61 mutations are typically found in colorectal and lung cancer [6, 73] while mutations K117N and A146T are found in colorectal cancer [6].

GraphPAC finds fewer clusters compared to NMC

As seen from Figure 4, between 25%–40% of the structures identified with significant clustering had fewer clusters under the GraphPAC methodology as compared to the linear NMC algorithm with the vast majority of these structures corresponding to BRAF, HRAS and TP53. Here we consider a representative example, the 4E26 structure [74] for BRAF when analyzed using the farthest insertion method (Figure 8). As iPAC identified even more clusters than NMC, we compare GraphPAC to NMC in this section when showing that fewer mutational clusters is of benefit. Further, as V600 is well known to be the most likely mutated position in BRAF, the most significant cluster identified by GraphPAC, iPAC and NMC is located only on that residue with a p-value of 2.12×10-129 under all three methods. In all, GraphPAC identifies 16 clusters while NMC identifies 22, with the differences shown in Table 4.

Figure 8
figure 8

The BRAF structure (PDB ID 4E26) color coded by segment: I) amino acids 464–599 are orange 2) amino acids 601–671 are green. The α-carbons of the mutated residues 464, 466, 469, 581, 596, 597, 601 and 671 are shown as purple spheres. Residue 600 is shown as a red sphere.

Table 4 A comparison of GraphPAC and NMC identified clusters for the BRAF structure

Although it is outside the scope of this manuscript to consider every difference between Tables 4a and 4b, we observe that three of the longest clusters 464–671, 466–671 and 469–671 are dropped by GraphPAC. Since after alignment of the protein structural data to the mutational data (see Section ‘Obtaining the 3D structural data’), tertiary information was available on residues 448–603 and 610–723, these clusters cover 77.0%, 76.3% and 75.2% of all the available residues, respectively. By considering the 3D structure via GraphPAC, the longest clusters are dropped and the remaining overlapping clusters focus almost exclusively on residues 464–600.

After structure and mutation alignment, the residue substitutions in significant clusters include: G464V, G466V, G469V, G469A, N581S, G596R, L597V, LV597R, V600E, V600K, K601N and R671Q. Since mutation R671Q does not have extensive literature and comes from a non-specified tissue sample in the COSMIC database, it will no longer be considered here. Thus, by considering the tertiary structure, we significantly narrow the window of which residues to consider for potential driver mutations and can partition the protein into three segments: I) 464–599 and II) 600 and III) 601. Segment I is primarily associated with lung and colorectal cancer as shown in [3, 7577]. Segment II represents the two most common mutations in BRAF, V600E and V600K. Overall, 95% of BRAF mutations occur on V600, with some studies showing that V600E occurs within 73% to 79% of patients while V600K occurs within 12% to 19% of patients [78, 79]. Mutations at this position result in the oncogene being constitutively activated with increased kinase activity and have been found in a wide range of cancers such as metastatic melanoma [80], ovarian serous carcinoma [81] and hairy cell leukemia [82]. Furthermore, recent inhibitors, such as Vemurafenib and GSK2118436 specifically target the V600E and V600E/K mutations (respectively), supporting the hypothesis that somatic clusters can provide pharmacological targets [83]. Lastly, segment III is comprised of the much less common K601N mutation which has been observed in myeloma cases along with V600E. Since these patients share the more common BRAF mutations as well, they may also potentially benefit from BRAF inhibitors [84].

Further, as shown in Section ‘Results’ and described above, GraphPAC finds fewer clusters for a significant percentage of the structures analyzed. Overall, the reduction in total clusters identified can result from two sources: the removal of some residues because no tertiary data was available or the cluster is no longer significant when using the traveling salesman algorithm to account for 3D structure. The first case, which is already rare, will become increasingly more so as additional studies result in more complete and detailed structural information. For the second case, if a cluster is not found to be significant under GraphPAC when compared to NMC, a near or overlapping cluster is usually found (see Tables 4a and 4b). For BRAF specifically, under every type of graph insertion method (cheapest, nearest and farthest), every "probably damaging" or "possibly damaging" mutation (as classified by PolyPhen-2) was still identified in at least one significant cluster for the structure. For a complete analysis, see "Potential Driver Loss" in Additional file7.


In this manuscript we provide an alternative method to utilize protein tertiary structure when identifying somatic mutation clusters. By employing a graph theoretic approach to restructuring the protein order, we identify both new clusters in proteins previously shown to have clustering as well as proteins that were not previously shown to have clustering. We have also provided several examples where we are able to identify clusters of mutations that may benefit from pharmacological treatment. Moreover, as GraphPAC uses the NMC algorithm to identify clusters rather than a fixed window size, we are able to detect clusters of varying lengths. Finally, the methodology is fast and robust with the overwhelming majority of structure/protein combinations taking under 10 minutes each to analyze on a consumer desktop.

The GraphPAC algorithm, while presenting a viable alternative to the MDS restriction of iPAC and an improvement over NMC, nevertheless contains several limitations. First, while no longer bound to the MDS requirement of iPAC, there is no closed form solution to the shortest path problem and our algorithm must appeal to heuristic approximations. Second, to satisfy the uniformity assumption, the mutation status of all residues must be known ahead of time. With the growth of high-throughput sequencing however, this issue is temporary. Next, unequal rates of mutagenesis along with hypermutability of specific genomic regions may violate the assumption that every residue has a uniform probability of mutation. To help ensure that this assumption holds, we only consider single residue missense substitutions and have removed insertions and deletions from the analysis since they tend to be sequence dependent. Further, research has shown that CpG dinucleotides may have a mutational frequency ten times or higher compared to other dinucleotides [85]. However, in the analyses presented in Sections ‘GraphPAC finds novel proteins compared to iPAC and NMC’, ‘GraphPAC identifies additional clusters compared to iPAC and NMC’, ‘GraphPAC finds fewer clusters compared to NMC’, only approximately 13% of the mutations used to identify clustering occurred in CpG sites. Relatedly, colorectal carcinomas [86] contain more transition mutations while cigarette use results in more transversion mutations in lung carcinomas [8]. Still, when considering KRAS, the overwhelming majority of substitutions occur on residues 12, 13, and 61 for both colorectal and lung cancer, implying that while the mutational landscape may vary, it does not have a significant effect on mutation location and thus would not violate the uniformity assumption. Hence, while this analysis is influenced by a variety of factors, as are previous studies, it nevertheless appears that the primary cause of clustering is selection for a cancer phenotype.

Several areas for future research are also directly evident. First, an approach that considers the protein directly in 3D space via simulation may be employed. However, such an approach would not be able to use the order statistic methodology to identify clustering and thus might not be as sensitive for small mutation counts. Moreover, while we only consider distance when finding the shortest path through the graph, future research can incorporate the physico-chemical properties of the specific residues or domains by appropriately increasing or decreasing edge length. The potential additive effect of multiple cancer mutations in the same protein, as discussed in the case of EGFR by Hashimoto et al.[87], can also be incorporated via additional refinement of the edge weights. Additional research is required in this area in order to incorporate these improvements.

Overall however, GraphPAC utilizes protein tertiary structure via a graph theoretic approach in identifying mutational clustering. We show that this method identifies new clusters that are otherwise missed and that in some cases, pharmaceutical targets for mutations in these clusters have already been found and therapies created. Specifically, Erlotinib and Getfinib are used to target mutations in EGFR significant clusters (see Section ‘GraphPAC finds novel proteins compared to iPAC and NMC’) while Vemurafenib is used to target mutations that occur within BRAF significant clusters (see Section ‘GraphPAC finds fewer clusters compared to NMC’). This helps confirm the hypothesis that mutational clustering may be indicative of driver mutations and as new protein structures become available, GraphPAC can provide a rapid methodology to identify such potential mutations.

Availability and requirements

Project Name: GraphPAC: Identification of Mutational Clusters in Proteins via a Graph Theoretical Approach.

Project Home Page:

Operating system(s): Platform independent

Programming Language: R

Other Requirements: R ≥2.15 (see homepage for R package requirements).

License: GPL-2


a Under a complete graph, every vertex is connected to every other vertex. The length of the edge between vertices i and j is set to be equal to the length between amino acids i and j in R 3 .

b A Hamiltonian path is a walk through the graph that visits every vertex once and only once.

c For this analysis, a manual reconstruction was performed in order to include residue 61 which is listed as a histidine under isoform 2B in the Uniprot Database and a glutamine in the COSMIC database. As the substitution of one amino acid in the structure would not have a significant impact on the spatial structure of the protein, and residue 61 is a highly mutated position, the residue was kept in the analysis. As a result, amino acids 1–167 are used.