Predicting functional consequences of mutations using molecular interaction network features

Variant interpretation remains a central challenge for precision medicine. Missense variants are particularly difficult to understand as they change only a single amino acid in a protein sequence yet can have large and varied effects on protein activity. Numerous tools have been developed to identify missense variants with putative disease consequences from protein sequence and structure. However, biological function arises through higher order interactions among proteins and molecules within cells. We therefore sought to capture information about the potential of missense mutations to perturb protein interaction networks by integrating protein structure and interaction data. We developed 16 network-based annotations for missense mutations that provide orthogonal information to features classically used to prioritize variants. We then evaluated them in the context of a proven machine-learning framework for variant effect prediction across multiple benchmark datasets to demonstrate their potential to improve variant classification. Interestingly, network features resulted in larger performance gains for classifying somatic mutations than for germline variants, possibly due to different constraints on what mutations are tolerated at the cellular versus organismal level. Our results suggest that modeling variant potential to perturb context-specific interactome networks is a fruitful strategy to advance in silico variant effect prediction. Supplementary Information The online version contains supplementary material available at 10.1007/s00439-021-02329-5.


Introduction
Advances in high-throughput sequencing technologies have resulted in the rapid accumulation of genomic data and allowed profiling of patient genomes in clinical settings. Such studies frequently uncover previously unobserved and uncharacterized genetic variants of ambiguous relevance to health, making variant interpretation an important challenge in precision medicine (Fernald et al. 2011). Missense mutations are particularly challenging as they only change a single amino acid in a protein sequence yet can have effects spanning no difference to complete loss of function. Numerous methods have been developed to prioritize functional missense variants (Adzhubei et al. 2010;Cooper and Shendure 2011;Hecht et al. 2015;Ioannidis et al. 2016;Kircher et al. 2014;Liu et al. 2020;Ng and Henikoff 2003;Pejaver et al. 2020;Ponzoni et al. 2020). Typically, these tools rely on protein sequence/structure information to predict variant effects at the protein level, and the scores they provide tend to capture coarse grained estimates of impact (e.g., damaging, benign, and tolerated).
Biological functions and cellular behaviors arise from interactions among proteins and other molecules within cells, and biological systems evolve to be robust to random error (Félix and Barkoulas 2015). Diseases are often associated with perturbations to protein interactions, and different perturbations can result in different phenotypes (Vidal et al. 2011), and the level of impact caused by mutations to the underlying molecular interaction network may determine the likelihood of generating a phenotype (Capriotti et al. 2019). For example, loss-of-function mutations were more likely to be tolerated when they affected proteins at the periphery of the interactome (Khurana et al. 2013). Similarly, variants that otherwise were predicted to have little effect were more likely to be deleterious if they had a large number of interaction partners (Yates et al. 2014) and de novo missense variants in autism probands with functional Polyphen2 predictions were enriched at protein interfaces of more central proteins relative to similar mutations in control siblings 1 3 ). Thus, a protein's location within the system provides biological context that may be important for understanding the effects of mutations (Ozturk et al. 2018).
Within proteins, different mutations may have different effects on protein functions (Sahni et al. 2013;Zhong et al. 2009). While destabilizing mutations at the core of a protein are likely to interfere with all protein activities, mutations on the surface could potentially interfere with specific protein activities while preserving others (Zhong et al. 2009). In this way, different mutations targeting the same protein might perturb its interactions differently, affecting different pathways that the protein is involved in, and resulting in different disease phenotypes (Engin et al. 2015). Indeed, analyses have demonstrated an unexpected enrichment of Mendelian mutations (David et al. 2012;Guo et al. 2013;Wang et al. 2012) and somatic mutations (Engin et al. 2016;Kamburov et al. 2015;Porta-Pardo et al. 2015;Raimondi et al. 2016) at protein interaction interfaces. Although protein structurederived features have long been integral to variant classification, some more recent features capturing 3D location of mutations within key protein regions including local density of mutation and location at interface regions have emerged (Iqbal et al. 2020;Laskowski et al. 2020;Tokheim et al. 2016;Tokheim and Karchin 2019). While these features begin to capture information about the potential variants to affect distinct interactions, they do not incorporate context about the importance of specific interactions within the larger interactome.
Based on the above, we sought to assess the potential for artificial intelligence-based methods for variant interpretation to derive new information from molecular interaction data. We first integrated structure and protein-protein interaction (PPI) networks to enable systematic annotation of proteins according to location and interactions (Fig. 1a). We mapped various germline variants and somatic mutations to network edges to describe their potential to impact biological function (Fig. 1b). We then designed features capturing information about proteins and amino acids in the context of their importance to the network architecture and evaluated them within a machine-learning variant classification framework (Fig. 1c). We found that network-based features capture orthogonal information to classical amino acid (AA) sequence/structure-based features and can improve variant classification, though they may be more informative for some variant classification tasks than others.

Disease-causing genes are central in PPI networks
The architectures of biological networks can provide important information for understanding the pathogenesis of mutations Ozturk et al. 2018). The scalefree topology of PPI networks suggests that they are more tolerant to random failures, but variants affecting higher degree nodes are more likely to disrupt function (Albert et al. 2000). Indeed, when we compared disease genes using a high-confidence human PPI network of experimentally verified interactions from STRING (Szklarczyk et al. 2015), cancer driver (Vogelstein et al. 2013) and Mendelian disease genes (Amberger et al. 2015) score higher with various centrality measures than other genes (Fig. 2). This suggests that the network niche of a gene provides information about the potential of an amino acid substitution to create deleterious phenotypes, a relationship that has proven robust to study  (Vinayagam et al. 2016); in our data, only node degree correlates weakly with the number of publications (Pearson r = 0.23, Fig. S1).

Creating a structurally resolved PPI network
While disease mutations target proteins more central in interaction networks (Fig. 2), protein-level descriptors of centrality are not capable of distinguishing the effects of different mutations within proteins. Investigation of residuespecific network perturbations requires mapping of mutations to 3D protein structures and interaction interfaces, so that we can model their potential to affect network edges (Fig. 1b). We constructed a structurally resolved PPI network (called SRNet from here on) comprising 6230 proteins and 10,615 PPIs using 3D structures and homology models (Fig. 1a, Table S1). This network contains annotations for 530,668 interface residues, defined here as the subset of amino acid residues that mediate physical contact between proteins. Otherwise, amino acids are annotated according to location at the surface or core based on relative solvent accessible surface area calculated from protein 3D structures (Materials and methods). SRNet is an updated and extended version of our previous structurally resolved PPI network (Engin et al. 2016).

Disease mutations frequently target interface or core residues
We further assessed the potential for SRNet to capture information about residue-based network-perturbation by analyzing location of mutations relative to core, surface, or interface regions. Similar to the finding in Engin et al. (2016), SRNet supports that somatic missense mutations in tumors (obtained from The Cancer Genome Atlas (TCGA) (Collins and Barker 2007)) target surface regions in oncogenes (OR 1.32, p = 1.4e−06) and other genes (OR

Network-based features for variant classification
As the above analyses support that both protein and amino acid-level information derived from networks is informative about disease-association, we hypothesized that network information would be useful for machine-learning-based variant classification. We designed and analyzed 16 features describing network-level effects of mutations, including 7 protein-level features (Fig. 4a) that estimate the significance of the target protein in the network, and 9 residue-level features (Fig. 4b, c) quantifying the potential of individual amino acid positions on the protein to impact network architecture. The residue-level features are based on comparing network measures before and after removing edges in SRNet potentially affected by a mutation (Materials and methods). These 16 features show potential to distinguish between  Table S2).

Utility of network features for classifying cancer driver mutations
To evaluate the benefit of using network features for somatic mutation classification, we trained a Random Forest to predict driver or passenger class labels using different combinations of features. We separately evaluated classifier performance when trained using all 16 network-derived features, only the 7 protein-level features or the 9 residue-level features alone, or in combination with 83 amino acid-level features obtained from the SNVBox database (Wong et al. 2011). As a training set, we used likely driver and likely passenger missense mutations from Tokheim et al., which they obtained from TCGA using a semi-supervised approach based on known cancer driver gene annotations and mutation rates with the goal of generating a more balanced training set consisting of both driver and passenger mutations in cancer genes (Tokheim and Karchin 2019). While passengers greatly outnumber drivers in practice, we constrained the ratio as 1:4 driver vs. passenger mutations for classifier training (Materials and methods). Generalization error was estimated using a fivefold cross-validation with gene hold out to prevent information leakage and consequent overfitting.
We measured performance using the area under the ROC (auROC) and precision-recall curve (auPRC) metrics, similar to prior variant effect prediction studies (Tokheim and Karchin 2019). We note that use of network features limits training and prediction to mutations that can be annotated by SRNet. For driver classification, protein-level network features performed better than residue-specific features (Figs. 5a, S3a), though performance for residue-level features was better for the top scoring ~ 20% of drivers (left edge of ROC curve). We note that residue-level features alone classify all surface non-interface mutations as passengers, since their feature values should all be the same (there is no change to network centrality measures when no edges are affected). Combining residue-level features with more classic amino acid-level features significantly boosts performance over residue-level features alone (Figs. 5b, S3b). Interestingly, network features alone slightly outperform amino acid-level features alone, pointing to the extreme centrality of driver genes. As residue-level features are likely to be most informative for mutations at interfaces, we further explored performance for interface mutations only (Figs. 5c,S3c). Here, we see that residue-specific network features perform considerably better as they are not hindered by misclassification of surface mutations (Figs. 5c,S3c). Overall, the combination of network-based and amino acid features displays the highest performance . Notably, precision-recall curves indicate that incorporating both network and classic AA features resulted in a significant drop in false-positive predictions relative to either type of feature alone (Fig. S3a-c). Odds ratios (OR) and 95% confidence intervals using Fisher's exact test are shown (*p < 0.05). Comparison of somatic mutations in oncogenes (OG), tumor suppressor genes (TSG), and other genes located at a core vs. surface residues, and b surface non-interface vs. surface interface residues. Comparison of pathogenic and neutral variants located at c core vs. surface residues, and d surface non-interface vs. surface interface residues. For a and c, an OR > 1 means that more mutations/variants were found at the surface. For b and d, an OR > 1 means more mutations/variants were found at interfaces

Incorporating in silico predicted interface residues
The restriction to analysis of mutations for which 3D structural information about interfaces is available is a problematic limitation. In silico prediction of interfaces can be used to augment interface coverage, as done for Interactome INSIDER (Meyer et al. 2018). To explore whether in silico predicted interfaces could boost mutation coverage without loss of performance, we repeated our analysis on an extended network with both structurederived and predicted interfaces. This resulted in improved performance overall (Figs. 5d-f, S3d-f), suggesting that improvements to interface features and the ability to train on a larger set of mutations enabled by higher coverage in the expanded network outweigh the introduction of noise caused by interface prediction error. We also noted a larger gain in precision for network features relative to AA features when using expanding the network (Fig. S3e). A more stringent comparison considering only proteins shared between the original and the expanded network found similar results (auROC is 0.832 and 0.871 for SRNet and the extended network for the classifier with Network and AA features, respectively). As we obtained our optimal performance using all features with the extended network, we used this classifier to evaluate feature importances. In Random Forest classifiers, feature importances are determined as the mean decrease in impurity when using that feature to split training examples according to class label during classifier training (Breiman 2001). Fourteen of the top 21 features were network derived, including the top 9 ( Fig. S2, Table S2). Protein-level network features were more informative than residue-level features, possibly reflecting the limitation of residue-level features to distinguish surface mutations. Simple 3D location annotating mutations to location at core, surface, or interface contributed less information, which may reflect its redundancy with other network features. We further investigated residue-level network features in the extended network by examining cases where the classifier was successful in differentiating between driver and passenger mutations occurring in the same proteins. Since residue-level features only vary within protein for interface mutations, we looked for cancer genes where both driver and passenger mutations at interfaces were correctly classified. We found seven cancer genes (EGFR, HRAS, KRAS, TP53, PIK3R1, CTNNB1, and PTEN) that contained both correctly classified interface driver and interface passenger mutations. Focusing on 212 correctly classified interface mutations in these genes, we observed a significant difference in distribution of residue-level features for the driver and passenger classes (Fig. 6), further supporting that residue-level network features provide information useful for within gene mutation classification.

Overall performance on benchmark datasets
We next sought to evaluate the improvement obtained from network features on independent studies of cancer mutations. We used our highest performing classifier which is trained on cancer mutations that map to the extended network and all network-based and classic amino acid features (Fig. 5e, Net and AA classifier with auROC = 0.880, Fig.  S2, Table S2). Since no 'gold standard' dataset exists for cancer, we evaluated classifier performance relative to bestin-class methods that do not use network-derived features on four external pan-cancer datasets constructed using different approaches: an in vivo screen: Kim et al. (2016), an in vitro assay: Ng et al. (2018), and two literature-derived datasets: MSK-IMPACT and CGC-recurrent, previously described in Tokheim and Karchin (2019). For each dataset, considering the mutations scored by all methods, classifier performance was evaluated using the area under the ROC (auROC) and PR curves (auPRC), accuracy, F1 score, and the Matthews correlation coefficient (MCC) ( Table S3).
We assessed the performance of our classifier relative to both cancer-specific: CHASM (Carter et al. 2009 (Fig. 7). Comparison was based on the set of benchmark mutations scored by all methods, and was on the basis of auROC, auPRC, accuracy, F1 score, and Matthew's correlation coefficient (MCC). We note the later three use discrete labels rather than continuous scores. We used provided labels or recommended cutoffs for all methods as possible, and used a cutoff of 0.5 otherwise (Materials and methods).
Our classifier performed well on all five metrics across the four benchmark sets relative to most other methods (Table S3). It had the highest auPRC (Fig. 7b), F1 score (Fig. 7d), and MCC (Fig. 7e), of all methods on the four benchmark sets, and also performed well on auROC and accuracy (Fig. 7a, c). In most cases, the difference in auROC relative to other methods was deemed significant by the DeLong test (Table S4). After our method, the next best-performing method was ParsSNP, after which there was considerable variation in what methods performed best by various measures. Overall, these results suggest that network-derived D r iv e r P a s s e n g e r P a th o g e n ic N e u tr a l  features that capture abstract information about the role of proteins in networks and the potential of mutations to perturb this role are helpful for driver classification across a variety of settings, though the gains over methods trained on classic amino acid-based features are modest.
We separately compared our approach to two methods that incorporate interactome related features: SuSPect (Yates et al. 2014) and CHASMplus (Tokheim and Karchin 2019). SuSPect includes protein degree as a predictive feature, while CHASMplus now includes a feature indicating the number of interactions affected if a mutation occurs at a protein interaction interface, along with other improvements relative to the original method. Location at an interface was reported as the second most informative feature after a feature describing within protein clustering of observed mutations (Tokheim and Karchin 2019). We noted improved performance relative to SuSPect and comparable performance to CHASMplus (Fig S4). We note that both our method and CHASMplus derive classic AA features from the SNVbox database (Wong et al. 2011). We further analyzed mutations that were correctly classified by our method but misclassified by SuSPect or CHASMplus to see whether the network features implemented relative to these methods explained the difference. These mutations were significantly enriched at interface regions compared to surface or core (OR 3.78, p = 1.45e−11) and they had significantly different distributions of all network features (Mann-Whitney U test, p < 0.05) apart from closeness change (p = 0.22), when compared to mutations misclassified by our method but correctly classified by SuSPect or CHASMplus, suggesting that even though some shared features exist between the methods, our classifier better reflects the network rewiring by mutations.
However, it should also be considered that our networkbased approach is dependent on the inclusion of proteins in the network and availability of annotations mapping amino acid residues to core, surface, or interface residues. This generally results in a smaller training set than other methods, and an inability to score some fraction of mutations. For the benchmark sets evaluated here, 95.77% of Kim et al.,72.03% of Ng et al.,78.82% of MSK-IMPACT, and 74.11% of CGCrecurrent dataset mutations could be scored, respectively. It is possible that better network and amino acid annotation coverage could further boost performance.

Utility of network features for classifying pathogenic germline variants
We next evaluated whether network features are also useful in the context of germline variation. We previously observed that inherited disease genes were less central than cancer genes and both pathogenic and neutral mutations were enriched at interface positions, though to different extents. We once again trained a Random Forest classifier to prioritize missense mutations that alter protein activity using the 16 network-based features and 83 amino acid descriptors, using a training set composed of pathogenic and neutral variants (Materials and methods).
For germline variants, residue-specific features yielded similar (with SRNet) or higher performance (with extended network) than protein-level features for all mutations (Figs. 8a, S5a), and for interface mutations only (Figs. 8c, S5c). But overall, network features are outperformed by non-network amino acid features (Figs. 8b, S5b) which is the opposite of the case with cancer driver classifier. This is consistent with proteins targeted by pathogenic germline variants being less central than cancer driver genes. Since proteins harboring germline pathogenic variants have fewer interaction partners, pathogenic variants in the protein core or at interfaces tend not to result in as extreme values of residue-level network features as driver mutations do (Fig. 4c), despite the observed enrichment for these variants in core and interface regions (Fig. 3c-d). The similar performance by network features in both SRNet and the extended network (Fig. 8) suggests that either increased coverage does not improve performance as much, or the noise introduced by interface prediction error counteracts the performance gained by higher coverage in this setting. A stricter comparison considering only shared proteins between networks once again showed similar performance (auROC is 0.835 and 0.849 for SRNet and the extended network for the classifier with Network and AA features, respectively). Precision-recall curves show similar results (Fig. S5).

Discussion
Understanding the functional consequences of protein coding variants remains a challenging task. Machine learning methods developed thus far to predict whether a mutation is likely to impair protein activity or cause a pathogenic phenotype have largely been protein-centric; however, a growing . c-f ROC curves for identifying cancer mutations targeting interface residues only, using all above-mentioned features. ROC curves using Net and AA features are bold. Performance is measured using auROC scores It is increasingly apparent that the role of proteins within molecular networks is a key determinant of the potential of variants to exert deleterious effects Khurana et al. 2013;Yates et al. 2014). Motivated by these studies, we investigate here how networkderived features can capture novel information about variant effects that is not already present in the classical amino acid features used by most variant classification methods, and show that combining both sets of features improves classifier performance.
Our approach relies on a structurally resolved PPI network that allows variants to be characterized according to their potential to affect network architecture by mapping them to their location on protein structures and protein-interaction interfaces. These mappings are used to capture the potential of variant positions to perturb information flow through the network. We developed protein-level features to capture the relative importance of a protein within the network, and residue-level features to capture the potential of mutations to alter the architecture of the network. Though protein-level network features are shared by all variants in a protein, they nonetheless can interact with other amino acid-level features to support classification; in all cases, combining both protein and residue-level network features with classic amino acid features outperformed combining only residue-level network features with classic amino acid features. Residue-level features were helpful for distinguishing between variants within proteins; however, because we designed them to capture the potential of mutations to alter the network architecture, all surface non-interface variants received the same value for these features. We also did not consider the possibility that surface mutations could generate a new edge in the network. Such mutations could be more common in cancer, where missense mutations have been reported to alter binding specificities for kinases and their substrates, thereby remodeling network architectures (Creixell et al. 2015).
Though network features show fairly different distributions for different classes of variants (Fig. 4) and are orthogonal to the features typically used for variant classification, the best classifier combining both feature types shows only modest gains over classifiers that use only classic amino acid features. This may result from more limited availability of training set mutations due to the requirement for structure and interface information to estimate network feature values. This requirement also constrained the coverage of benchmark set mutations that could be classified, though the values remained generally high, and over 70% in the worst case. Performance generally improved when we included predicted protein interactions from Interactome INSIDER (Meyer et al. 2018), suggesting that in silico approaches may be an effective strategy to boost performance until more complete experimentally derived interaction maps are available.
Network features were more informative in the context of somatic mutations than when classifying inherited variants, though performance gains were observed in both cases. This is perhaps expected, since inherited disease genes tended to be less central in PPI networks than cancer genes (Fig. 2), and location at an interface by itself was less discriminatory in the germline setting (Fig. 3d). We speculate that these differences may arise from different selective pressures acting on somatic versus germline variation. Because development at the organismal level is likely dependent on the integrity of molecular interaction networks, both pathogenic and neutral variants may be more constrained by network architecture; whereas in cancer, where selection operates at the cellular level and is predominantly positive, mutations may be better tolerated and be more advantageous in central network positions. We note also, however, that training sets for the driver versus passenger classification problems tend to be more gene-centric leading to concerns over whether cancer mutation classifiers distinguish primarily between genes (Raimondi et al. 2021). Although we made an effort to include both drivers and passengers in each driver gene to mitigate this, it may still be reflected in the higher utility of protein-level network features for driver classification.
In conclusion, our study suggests that information about molecular interaction networks can be incorporated into machine-learning-based variant interpretation frameworks. This opens future directions for the development of novel features capturing network information. Since networks can be constructed to model cell-type and condition specificity (Greene et al. 2015), it may be possible to build classifiers that can capture context-specific effects of variants. Furthermore, as studies have shown that different interfaces are associated with different protein activities, network-based features could make it possible for machine-learning methods to provide more insight into the potential for mutations within a protein to have distinct functional consequences. We anticipate that such advances will boost the utility of variant classification tools for precision medicine applications.

Materials and methods
Data and code are available at https:// github. com/ carte rcomp bio/ NetFe atures.

Source of protein interaction data
To analyze disease gene centrality, we obtained a human PPI network of 12,811 proteins that are involved in 97,376 experimentally verified undirected interactions with a confidence score higher than 0.4 from STRING v11.0 (Szklarczyk et al. 2015).

Disease genes
A list of 125 high-confidence cancer genes consisting of 54 oncogenes and 71 tumor suppressor genes was obtained from Vogelstein et al. (2013). We also obtained a list of 4524 Mendelian genes from the OMIM database (Amberger et al. 2019). These genes were used to evaluate disease gene centrality (Fig. 2).

Collecting structural protein and protein interaction data
We obtained human protein interaction data (complete set) from Interactome3D (Mosca et al. 2013), which contains a collection of a highly reliable set of experimentally identified human PPIs. We collected experimental co-crystal 3D structures for 5865 of these interactions from the Protein Data Bank (PDB) (Berman et al. 2003) and homology models for 5768 additional interactions (Mosca et al. 2013) making a total of 11,633 interactions between 6807 proteins with structural protein and interaction data.

Creating a structurally resolved PPI network
Amino acid residues were annotated as participating in a protein interaction interface based on KFC2 (Zhu and Mitchell 2011) scores, and we removed interactions containing fewer than five interface residues on either partner. Additionally, we calculated relative solvent accessible surface areas (RSA) using NACCESS (Hubbard and Thornton 1993) for all residues in each protein structure. Residues with RSA < 5% and RSA > 15% were designated as core and surface residues, respectively. Residues with RSAs between these thresholds were excluded from further analysis due to ambiguity. When multiple PDB chains were available for the same protein, we used the consensus designation as the final label. The mapping of PDB residue positions onto UniProt residue positions was performed via PDBSWS web server (Martin 2005). After this mapping, we created a structurally resolved PPI network (named SRNet) of 6230 proteins and 10,615 undirected protein-protein interactions with a total of 530,668 interface residues (Table S1). To extend coverage of the structurally resolved network, we defined an extended network based on the "High Confidence" dataset of Interactome INSIDER (Meyer et al. 2018), a human PPI network of 14,445 proteins with 110,206 undirected interactions containing in silico interface residue predictions in addition to those derived from 3D structures.

Source of somatic mutation data
To investigate structural location of cancer mutations on proteins (Fig. 3), we mapped more than 1.4 million somatic missense mutations from TCGA (Collins and Barker 2007) onto the structurally resolved PPI network using structural annotations. Only mutations mapping to canonical proteins were used. After this mapping, we identified a total of 56,667 interface residues of 5005 proteins that are involved in 9235 interactions as mutated.

Training set for cancer mutation prediction
We collected a set of cancer missense mutations designated as likely driver (n = 2051) and likely passenger (n = 623,992) from Tokheim et al. (Tokheim and Karchin 2019). Of these, 961 driver mutations from 32 genes and 28,043 passenger mutations from 2986 genes mapped to SRNet for a total of 29,004 mutations. All 32 genes with driver mutations also contain passenger mutations. To handle the driver vs. passenger mutation count imbalance in the training set by maintaining an approximate 1:4 driver vs. passenger mutation ratio similar to Carter et al. (2009) while not overrepresenting particular genes, we limited the number of passenger mutations for each gene to 16 (median per gene driver mutation count) and collected 4626 passenger mutations at random across all genes with passenger mutations. In the extended network, we mapped 1513 driver mutations from 52 genes and 118,777 passenger mutations from 4478 genes for a total of 120,290 mutations. Thirty-eight of fifty-two genes with driver mutations also contain passenger mutations. To maintain an approximate 1:4 driver vs. passenger ratio as described above, we limited the number of passenger mutations for each gene to 17 (median per gene driver mutation count) and collected 6549 passenger mutations at random across all genes with passenger mutations. Fig. 7 Comparison of classifier performance on benchmark datasets relative to established methods. Bar plots depict a the area under the ROC (auROC) and b the area under the PR curve (auPRC) scores, c accuracy, d F1 score, and e the Matthews correlation coefficient (MCC) results for each method. Mean category displays the mean of scores of each method across datasets. Methods are ordered based on their mean scores. All panels use the same color scheme ◂

Features
We designed 16 network-based features to quantify the potential impact of a mutation to the underlying network architecture, comprising 7 protein-level and 9 residue-level features. The 7 protein-level features are degree, betweenness, closeness, eigenvector and load centralities, clustering coefficient, and pagerank of the proteins within the PPI network. They are computed using the NetworkX package of Python and they aim to characterize the centrality of a protein in the network based on measures such as the number of nodes it is directly connected to (degree), the amount of shortest paths it is involved in (betweenness and load), the overall closeness to all other nodes (closeness), its embeddedness (clustering coefficient), and the centrality of its neighbors (eigenvector and pagerank). Nine residuelevel features describe mutation 3D location (core, interface, and surface) on the protein and changes in centrality of the protein within the PPI network resulting from mapping the mutation to network edges. Core mutations are assumed to affect all edges in the network, while interface mutations are mapped to corresponding edges in the network, and surface mutations retain all edges. Each interface mutation causes the removal of all edges that they are mapped to. The remaining eight residue-level features are based on this description of how the mutation perturbs the network by capturing degree change, betweenness change, closeness change, eigenvector change, clustering coefficient change, load change, pagerank change, and percent degree change. Non-network-related amino acid-based features (n = 83) obtained from the SNVBox database (Wong et al. 2011) describe substitution effects on amino acid biophysical properties, evolutionary conservation of variant sites, local sequence biases, and site-specific functional annotations. Pearson correlation coefficient was used to evaluate feature correlations. Our proposed classifier (used in Fig. 7) uses a total of 99 features consisting of all 16 network-based features (7 protein-level and 9 residue-level features) and 83 non-network-related amino acid-based features. The importance of each feature is computed as the normalized total reduction of the criterion brought by that feature (mean decrease in impurity), also known as the Gini importance ( Figure S2, Table S2).
The number of PubMed studies featuring each gene was obtained from the NCBI database (https:// ftp. ncbi. nih. gov/ gene/ DATA/ gene2 pubmed. gz) for all genes in SRNet with NCBI (Entrez) gene IDs. This was used to assess the potential for protein-level features to be affected by study bias.

Classifier training
We trained a Random Forest classifier (n_estimators = 1000, max_features = 'sqrt') on the training set using the scikitlearn Python package. To avoid classifier overfitting, we performed prediction using a fivefold gene hold out crossvalidation by dividing the training set into 5 random folds for cross-validation while ensuring a balanced number of disease and neutral mutations across the folds. All mutations occurring in the same gene were kept within the same fold.

Benchmark datasets
We obtained 4 pan-cancer benchmark sets of missense mutations consisting of an in vivo screen: Kim et al. (2016), an in vitro assay: Ng et al. (2018), and 2 literature-derived datasets: MSK-IMPACT and CGC-recurrent from Tokheim and Karchin (2019). The in vivo screen contains 71 mutations selected based on their presence in sequenced human tumors and screened in mice to assess oncogenicity and then labeled as 'functional' or 'neutral' based on their abundance . The in vitro assay consists of 747 mutations from a growth factor dependent cell viability assay annotated as 'activating' for increased cell viability, or as 'neutral' for the remaining, with the assumption that a mutation yielding higher cell viability indicates driverness (Ng et al. 2018

Comparison to other methods
Performance was compared to 24 state-of-the-art methods that do not use network-based information, 4 cancer-focused methods: CHASM (Carter et al. 2009 . We also obtained scores on benchmark datasets for two additional methods that use network features: SuSPect (Yates et al. 2014) from (http:// www. sbg. bio. ic. ac. uk/ suspe ct) and CHASMplus from Tokheim and Karchin (2019) (Fig. S4). Classifier performance was compared using the area under the ROC (auROC) and PR curves (auPRC), accuracy, F1 score, and the Matthews correlation coefficient (MCC) ( Table S3). Only the mutations scored by all methods were considered for comparison. Significance of difference of auROC measures is evaluated by DeLong test (Table S4). auROC and auPRC values were computed using the predicted scores; while accuracy, F1 score, and MCC were estimated based on the predicted labels (positive vs. negative). For label assignments, we used the provided labels from dbNSFP for SIFT, PolyPhen, ClinPred, DEOGEN2, FATHMM, LIST-S2, LRT, M-CAP, MetaLR, MetaSVM, MutationAssessor, MutationTaster, and PROVEAN. Methods that did not provide labels directly typically provided a score between 0 and 1 (except CADD and MPC), and we ensured that a higher score indicated a more damaging mutation. Where specified we used recommended score cutoffs (0.1 for ParsSNP, 0.7 for MVP, and 50 for SuSPect) for label assignments when evaluating F1 score, accuracy, and MCC results. When no threshold was suggested (or if the suggestion was 0.5), we used a cutoff of 0.5 (our classifier Network&AA, CHASM, VEST, CanDrA, TransFIC, CADD, DANN, MPC, MutPred, REVEL, and CHASMplus). It is important to note that while auROC and auPRC results are independent of the predicted class labels; accuracy, F1 score, and MCC results are dependent on the labels and the threshold used for their assignment; therefore assuming a cutoff of 0.5 could underestimate accuracy, F1, and MCC for some methods.

Statistical analysis
Distributions are compared using a Mann-Whitney U test. Correlations are evaluated using the Pearson correlation coefficient. Odds ratios are calculated using Fisher's exact test. auROC scores are compared using the DeLong test.
Author contributions Original concept and project supervision by HC. Project planning, design, and method development by KO and HC. Data acquisition, processing, and analysis by KO. Preparation of manuscript by KO and HC.
Funding This work was supported by SDCSB/CCMI Systems Biology training Grant (GM085764 and CA209891) to K.O. and NIH Grant DP5 OD017937 and CIFAR award FL-000655 to H.C. NIH Grant 2P41GM103504-11 provided access to computational resources.

Data availability
The datasets generated and analyzed during the current study are available at https:// github. com/ carte rcomp bio/ NetFe atures.

Conflict of interest
The authors declare that there are no competing interests.

Animal research (ethics) Not applicable.
Consent to participate (ethics) Not applicable.

Consent to publish (ethics) Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.