Background

Interactions between proteins drive the majority of cellular mechanisms, including signal transduction, metabolism, and senescence, among others. The identification of the surface residues mediating these processes, known as protein-protein interaction sites (PPISs), holds great therapeutic potential for the rational design of molecules modulating or mimicking their effects. Further, knowledge of the interacting sites can aid in other domains of computational biology, including PPI network construction and simulated docking. However, biochemical identification methods, such as experimental alanine scanning mutagenesis and crystallographic complex determination, are costly and time-consuming [1,2]. Such methods also consider only sites of the complex under examination, disregarding disparate sites involved in other interactions [3,4]. In response to these shortcomings, computational methods for the prediction of PPISs have been developed, starting with Jones and Thornton’s pioneering analysis of surface patches [5,6], and many predictors have since been published [7-43], utilizing a wide variety of algorithmic approaches to the problem.

This review will first provide a systematic analysis of the features and datasets used in PPIS prediction, from both a theoretical and application-oriented standpoint. An examination of the algorithms used in a selected set of the most recent PPIS predictors is also given, showcasing the diversity of the latest methods employed in this endeavour, as well as the potential for combining or extending them. This is complemented by a comparative evaluation of the performance of these predictors. Because it most accurately simulates the missing information inherent in real-world applications, we focus on the general case of PPIS prediction: using only a single unbound protein structure, without knowledge of an interacting partner, to predict the binding site of that protein at the amino acid scale. Finally, the applications of PPIS prediction and promising areas for future improvements are also discussed.

Interaction types

Virtually all cellular machinery is composed of proteins, whose functions are mediated through biomolecular interactions; these serve to transmit signals and traffic molecular materials throughout the cell, as well as to form larger multimeric complexes capable of more complex behaviour [44,45]. These interactions occur predominantly at conserved interfaces on the surfaces of the folded protein structures, often resulting in allosteric changes in the flexible conformations of the partners that alter their functions [46]. The potential biomedical utility of interface identification makes the prediction of PPISs a critical endeavour, which necessitates theoretical knowledge of the various types of protein-protein interaction sites.

One major distinction is between obligate complexes, which, by definition, do not form their characteristic structure in vivo unless bound, and non-obligate complexes, which can exist as stable monomers [47,48]; complexes are also divided along a continuum between transient and permanent interactions [49], based on temporal length or energetic strength [48,50-52]. Many methods are designed to predict transient interfaces (TIs) [8,17,22,28,35-37,53,54], as they have greater pharmacological relevance, particularly for signal transduction cascades [50,52]. However, TIs tend to be more difficult to predict than permanent interfaces [18,20,23,27,33,39,52,55], possibly because the weaker nature of the interaction manifests as a weaker signal in the properties defining the interacting residues [27,33,52]. The scarcity of training examples caused by data-gathering difficulties may also play a role [47,49,56-58]. In general, TIs are less evolutionarily conserved than permanent interfaces [50,59-61], but more conserved than the rest of the protein surface [48]. Further, TIs tend to be more compact [51] and richer in water (i.e. more prone to water-mediated binding) [51,62,63]. They also differ in residue propensities [18], including fewer hydrophobic [64] and more polar residues [65]. Thus, unsurprisingly, training on one interface type to predict on the other tends to decrease scores [18,33], though this is sometimes not the case [13]. Generally, analysis of transient versus permanent complexes uses predefined sets [13,60,66] or programs designed to separate them [52,67-69].

All interfaces have distinctive “core” and “rim” regions, with core regions exhibiting lower sequence entropy (higher conservation) than rim [70], as well as reduced tolerance for water and decreased polarity [71-73]. The core of interfaces may be more readily predictable than the rim [7,33], likely for the same reason (i.e. stronger characterizing signal) that permanent PPISs are easier to predict than TIs.

Upon binding, many proteins undergo conformational changes [51,74], which some interface predictors take into account [4,37,38,75]. Large-scale conformational change, such as a disorder-to-order shift [76], is believed to make prediction more difficult for computational protein-protein docking [77,78]; some PPIS predictors also have this difficulty [7,16,25,28,34,35], though several do not [12,32,42].

Datasets

Sources of training data

The majority of predictors based on machine learning (ML) rely on sets of structural information to train their learners, mainly curated from the PDB [79]. However, in the process of mining this database, it is necessary to filter out molecules that are not of sufficient quality or utility for use in the training set [37,80]. An overview of these filters is presented in Table 1.

Table 1 Filters used to curate protein datasets for use in training PPIS predictors, including the reasoning behind their use, the methods and specific software used to implement them, as well as references detailing the predictors making use thereof

Performance benchmark datasets

Due to the wide range of techniques used by existing predictors, an objective performance evaluation requires the use of standardized datasets that encompass as much of the diversity of proteins and interfaces as possible [75,104,105]. This includes sets such as the Docking Benchmark set, created in 2003 [106] and updated three times since its inception [77,107,108], which has seen significant use among PPIS predictors in its original, unedited form [12,18,28,103,109], as well as in modified forms [12,42,109]. All widely used modern testing sets are presented in Table 2.

Table 2 Datasets used to evaluate predictors in Table 4, including the source from which they were derived, as well as the publication in which they were created using the requirements in the “Description” column

Features

Characterizing features have been used to predict PPISs since the founding of the field [5], and have since been combined with ML algorithms of increasing sophistication. While no single feature appears to possess sufficient information to allow prediction on its own, certain attributes have been consistently favoured, such as conservation and hydrophobicity. Recently, the use of precalculated features from databases such as AAIndex [114,115] and Sting [116], instead of on-the-fly calculations, has gained popularity. The most popular features, as well as references detailing their utility and procurement, are presented in Table 3.

Table 3 Compilation of selected software/methods used to compute features described in Features section, including the predictors utilizing each and the publications describing their utility and recommending their use

Evolutionary metrics

Conservation scores

It has been clearly established that measures of evolutionary rate and conservation carry information about ligand-binding and catalytic sites on proteins [153-155]. However, PPIS conservation has been debated, with some claiming that interface residues are not highly differentially conserved compared to the rest of the protein [156,157] or that evolutionary measures alone are of limited predictive accuracy [22,39,158,159]. Some PPIS predictors do not use conservation [7,8] or note that its use makes little difference [27,39]. On the other hand, a number of analytical studies have found greater conservation among interface residues [16,30,55,160] and that conservation holds predictive utility [40,161]. Such conflicting conclusions may be a result of the different datasets and methods used to compute conservation.

Regardless, numerous PPIS predictors have made heavy use of residue conservation features [13,16,21,22,33,35-38,40,41,162], including several based almost solely on evolutionary metrics [12,20,23,52,163], suggesting such measures do indeed have significant predictive power. In terms of generality, evolutionary methods can be more widely applicable than physicochemical ones, as the latter characteristics may differ markedly between functional sites, whereas conservation patterns in the former may be more easily recognizable [164]. Further, sequence conservation-based predictors do not require structural information [14,22,36]. However, since evolutionary methods are non-specific enough to be used for identification of any type of functional site [165-169], this lack of specialization may reduce performance for PPIS prediction specifically [4]. Unsurprisingly, conservation is not helpful for interfaces selected by non-evolutionary means, such as antigen-antibody complexes [16,102] (which require separate specialized predictors [170]).

Predictors employ a number of methods to extract evolutionary features. Commonly, a multiple sequence alignment (MSA) from BLAST [84] or PSI-BLAST [127] is used to derive conservation scores, in conjunction with substitution matrices (e.g. Dayhoff [171], BLOSUM [172], or position-specific) [13,14,33,38,41], often in combination with information theoretic measures (generally based on Shannon entropy [173]) [18,27,33,39]. Some predictors [14,37,40] use previously developed programs for computing conservation [128-130,174-176].
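To make the entropy-based scoring concrete, the following is a minimal sketch of a per-column Shannon-entropy conservation score computed from an MSA; it is illustrative only, omitting the gap weighting, sequence weighting, and substitution-matrix corrections that the cited tools apply.

```python
# A minimal sketch of per-column Shannon-entropy conservation from an MSA.
# Lower entropy means higher conservation; scores here are 1 - entropy.
import math
from collections import Counter

def column_entropy(column, base=20):
    """Shannon entropy of one MSA column (gaps ignored)."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 1.0  # all-gap column: treat as maximally variable
    counts = Counter(residues)
    total = len(residues)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p, base)
    return entropy  # 0 = fully conserved, 1 = maximally variable

def conservation_scores(msa):
    """msa: list of equal-length aligned sequences; one score per column."""
    return [1.0 - column_entropy(col) for col in zip(*msa)]

# Example: the first and third columns are invariant, the second is not.
print(conservation_scores(["MKV-", "MRV-", "MKVA"]))
```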

Sequence homology

Predictors that are more reliant on evolutionary metrics tend to have more complex methods of deriving evolutionary conservation scores [12,20,23,52]. JET [23] builds on the evolutionary trace approach [177-180], constructing phylogenetic distance trees to extract a residue ranking based on evolutionary importance, then using this ranking to derive likely PPISs. Similarly, BindML [12] uses local MSAs, specialized substitution matrices, and phylogenetic tree construction to obtain the likelihood that a given patch MSA belongs to a PPIS (described below). HomPPI [20] derives linear combinations of BLAST statistics correlated with conservation scores in known interfaces, allowing an estimation of conservation in unknown interfaces.

Structural homology

While the above methods are largely dependent on sequence-based measures of evolutionary information, structural conservation has also been shown to be useful for distinguishing protein binding sites [31,160,181,182] and even general functional sites [183]. Importantly, structure evolves more slowly than sequence and may have more powerful signals for conservation [184,185]. Predictors using structural homology have verified this proposition [15,21,31,32,66,162,186].

The main issue with the use of structural homologs, or with predictors requiring structural information in general, is the paucity of usable structures [15,47,105,187], particularly when considering the relatively small size of the PDB (∼80,000 structures, including redundancy) compared to the number of sequences known (∼17 million non-redundant sequences) [36,188], though this can be partly circumvented by using local (rather than global) structural homologies [34]. However, studies on the properties of the interfaces themselves have found that their structural space is degenerate [189], that templates for the majority of known interactions exist [185,189], and that interfaces are conserved across structure space [31]. This bodes well for structure-based PPIS prediction, as it suggests that numerically limited interface examples can cover most potential queries, despite incomplete structural [187,188,190] and interaction type (or quaternary fold) [3,191] coverage. The potential contribution of homology models [190,192-194] and growing structural coverage [188] further justify using structure.

Physicochemical characteristics

Physicochemical properties have long been used for interface prediction [102]. The observed trend of increased hydrophobicity in interfaces compared to non-interacting surface patches [5,195-197], as well as the proposed pattern of a highly hydrophobic central region surrounded by polar residues [71,198], has been used with success in several recent predictors [8,33,37,40,75]. Additionally, electrostatic potential [13,42] and energy of desolvation [9] have shown utility as discriminative properties of PPISs [13,37,42]. B-factors (Debye-Waller temperature factors) [15,40] and disorder measures [37,38] have also proven useful in prediction software, with interface sites shown to be more disordered [38,140,141] yet also exhibiting lower B-factors than other surface patches [15,28,35,54,150]. Further, the decreased flexibility implied by lower B-factors is consistent with the observation that interface residues minimize the entropic cost of complex formation [28] by avoiding the sampling of alternative side-chain rotamers [40,150].

Statistical measures

Propensity

Residue propensity, which measures PPIS amino acid composition, has been used in the characterization of interface types (e.g. homodimers versus heterodimers [16,199], permanent versus transient [18,80], biological versus non-biological [200]) and subtypes (e.g. core versus rim [72,201]), in hotspot prediction [160,202] and evolutionary analysis [203], as well as in PPIS prediction itself [13,28,33,35,37,41,204,205]. In general, polar amino acids are statistically disfavoured in interface sites, with the exception of arginine [18,35]. Further, while propensity differences are relatively minor between complex types, greater favouring of cysteine and leucine has been observed in permanent (but not transient) interfaces [18].

In general, predictors [13,35,37] use a variant of the following equation to compute propensity based on amino acid frequencies:

$$ prop(r) = \frac{{count}_{int}(r)/|Interface|}{{count}_{sur}(r)/|Surface|} $$

where \(count_{int}(r)\) and \(count_{sur}(r)\) are the frequencies of occurrence of residue \(r\) in interfaces and on protein surfaces, respectively, and \(|Interface|\) and \(|Surface|\) denote the sizes of these sets. More sophisticated extensions include using combinations of residues [33] (combinatorially expanding the number of possible \(r\) values, though amino acid categories can reduce this [206]), weighting by accessibility [28,205], or computing binary profiles [18].
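As an illustration, the following sketch computes the propensity equation above from precomputed residue lists; real predictors derive these counts from large curated structural datasets rather than the toy inputs shown here.

```python
# A minimal sketch of the interface propensity equation above, assuming
# precomputed residue lists for the interface and surface sets.
from collections import Counter

def propensities(interface_residues, surface_residues):
    """prop(r) = (count_int(r)/|Interface|) / (count_sur(r)/|Surface|)."""
    count_int = Counter(interface_residues)
    count_sur = Counter(surface_residues)
    n_int, n_sur = len(interface_residues), len(surface_residues)
    return {r: (count_int[r] / n_int) / (count_sur[r] / n_sur)
            for r in count_sur if count_int[r] > 0}

# prop > 1 marks residues enriched in interfaces relative to the surface.
print(propensities(list("LLIVRA"), list("LLIVRAKKKEEEDDDSSS")))
```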

Atomic contact probability density maps

The packing preferences and geometries of atomic protein structures have long been studied as a characterizing feature of association [207], via probability density maps (PDMs) describing likelihoods of contacts [7,151,152]. Advantageously, PDMs can be derived from the intra-molecular contacts in the protein interior and are hence less limited by the structural information available [7,152]. While contacts in the protein interior differ from those in interfaces (e.g. artefactual interactions from structural constraints [152] or greater contribution of electrostatics to folding than binding [208]), these differences appear to be relatively minor [196,209-211], particularly when the interface core is mainly considered [137].

Recently, 3D PDMs were used as input features for PPIS prediction [7]. By projecting onto a previously described coordinate system [152], interacting contacts can then be added to the density map, preserving both magnitude and direction. Based on a similar method applied to protein folding [212], “co-incidental” interactions (due to proximal atoms forced together by structural constraints) [7,151] were filtered out.

These density maps can then serve as PPIS prediction features: given a query protein \(P\), a feature vector \(v_i = \langle a_{i1},\ldots,a_{in} \rangle\) is computed for every atom \(i \in P\), where \(a_{ij}\) represents the distance-weighted normalized sum of PDM values of atom \(i\) with interacting atom type \(j\), and \(n\) is the number of interacting atom types (31, based on current works [7,151]). The resulting attribute vector set \(V = \{v_i \mid i \in P\}\) is amenable to ML-based PPIS prediction.
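The sketch below illustrates the shape of this computation for one atom; the `pdm_value` lookup, the 1/d distance weighting, and the vector normalization are simplifying assumptions for illustration, not the published implementation.

```python
# A hedged sketch of assembling one per-atom PDM feature vector. The
# `pdm_value` callable is a hypothetical density-map lookup.
import numpy as np

N_TYPES = 31  # number of interacting atom types in the cited works

def atom_feature_vector(atom, neighbours, pdm_value):
    """v_i: distance-weighted, normalized sums of PDM values per atom type j."""
    v = np.zeros(N_TYPES)
    for nb in neighbours:
        j, dist = nb["type"], nb["dist"]
        v[j] += pdm_value(atom, j) / max(dist, 1e-6)  # closer contacts weigh more
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

demo = atom_feature_vector(
    atom="CA",
    neighbours=[{"type": 3, "dist": 4.2}, {"type": 7, "dist": 5.8}],
    pdm_value=lambda atom, j: 1.0)  # flat dummy density map
print(demo[:8])
```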

Structural geometry

Solvent accessible surface area

Often chosen as one of the most discriminatory features by a wide array of predictors [8,23,29,31-33,37-39], the Solvent Accessible Surface Area (SASA) of residues is thought to be of significant importance in PPIS prediction. Jones and Thornton [5,6] were among the first to suggest that high solvent accessibility is indicative of a residue’s participation in an interaction site. The relationship was further validated by Chen and Zhou on a set of 1256 nonhomologous protein chains [16], as well as by Hoskins et al. [124], who also accounted for the relative contributions of both the main and side chains (polar and non-polar) of the protein.

3-D characteristics

PPISs have long been thought to possess distinct 3-dimensional characteristics that allow them to be distinguished from the rest of the protein surface [5]. In particular, curvature has been singled out as an important 3D structural characteristic [33,37,38,186], with interface sites thought to be significantly more concave than the rest of the protein surface, lending stability and specificity to the interface [38].

This has, however, been disputed in the literature [11], with the prevailing opinion appearing to favour shape complementarity, in which one of the proteins in a complex contains a concave binding site, while the interaction site on its partner exhibits convexity, in order to bind “snugly” [213]. Interestingly, Nooren and Thornton [48] showed that transient PPISs not only tend to be more planar than their permanent counterparts, but that there is a gradient even within transient sites, with “stronger” sites exhibiting greater curvature than those in “weaker” transient interactions.

Similarly, secondary structural characteristics have been used in several predictors [8,22,38,102], but have also elicited dispute regarding their utility and biological interpretation. Specifically, some studies [35,124] have found that β-sheets are favoured in interface sites while α-helices are more prevalent over the rest of the protein surface, though others disagree [38,102].

Depth and protrusion indices

As discussed previously, interfaces are rich in hydrophobic residues, which is superficially incongruous with the finding that interfaces tend to have a higher solvent accessibility than non-interface patches. This has been explained by Li et al. [38], who, along with Sikic et al. [8] and Segura et al. [40], found the depth and protrusion features to be the most highly discriminatory features of their respective predictors. Thus, interface residues tend to have a higher average depth index (are more deeply buried), while maintaining a higher side-chain protrusion (leading to the observed increase in solvent accessible surface area) [38].

Algorithmic approaches

While there exist previous reviews on PPIS prediction [4,75,109,214,215], these did not have access to the most recent algorithmic advances. As such, we provide a systematic overview of these techniques, in the hope that future predictors can employ these methodologies as a foundation for the creation of more sophisticated predictors.

Feature selection

Feature selection is an indispensable part of ML, in which redundant and irrelevant attributes are removed from the feature set to ensure predictor efficacy [216]. Redundant features provide no new information (while potentially adding noise), waste computation, and overweight the contribution of the information they duplicate, leading to overfitting and thus lower prediction scores.

Genetic-race search

The computational power required to check all possible combinations of features renders such an approach impractical. To efficiently search this space, a method of feature selection termed Genetic-Race Search (GRS) [37] was recently used, combining a genetic algorithm [217] with RACE search [218].

Each feature set (“individual”) is represented as a bitstring, and the fitness of the individual is defined as its Matthews Correlation Coefficient (MCC) [219] after leave-one-out cross-validation (LCV). A population of these individuals is then iteratively altered by three operations: mutation (with preference for less fit individuals), selection, and crossover (preferentially choosing higher scoring individuals) [217].

At every iteration, the top \(k\) individuals (“elites”) are saved and used to improve evaluation efficiency via Hoeffding races [220]. As LCV is performed for a given individual, its empirical mean \(\bar{x}\) is continuously updated as each protein is analyzed. The fitness score \(f_{E_k}\) of the \(k\)th (i.e. least fit) elite is continually compared to \(\bar{x}\) as LCV proceeds. If (1−δ)% certainty that \(f_{E_k} > \mu\) is reached, evaluation of the current individual can be halted, as it is statistically incapable of entering the elites.

To compute this certainty, the two-tailed symmetric distribution derived from Hoeffding’s original bounds [221] can be used [220]:

$$ P(|\bar{x}-\mu| > \epsilon) < 2 \exp\left({\frac{-2n \epsilon^{2}}{B^{2} }}\right) $$

where the random variable (MCC) is bounded with range B and n is the number of samples. Clearly, (1−δ)% certainty that the maximum distance between \(\bar {x}\) and μ is less than ε requires \(P(|\bar {x}-\mu | > \epsilon) < \delta \). Thus:

$$ \epsilon = \sqrt{\frac{B^{2}}{2n}\ln\left(\frac{2}{\delta}\right)} $$

Hence, given \(n\) measurements, \(\mu\) is within \(\epsilon\) of \(\bar{x}\) with (1−δ)% certainty. Notably, Hoeffding bounds do not rely on a particular underlying probability distribution and are thus widely applicable to different fitness measures, albeit at the cost of high conservatism (which relaxing δ can mitigate).
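The following sketch illustrates this racing procedure for one individual; the fold scores, δ, and the elite fitness are placeholders, and B = 2 since MCC is bounded in [−1, 1].

```python
# A minimal sketch of the Hoeffding-race early stop used in GRS.
import math

def hoeffding_eps(B, n, delta):
    """epsilon such that |xbar - mu| < eps with (1 - delta) certainty."""
    return math.sqrt(B**2 / (2 * n) * math.log(2 / delta))

def race_individual(fold_scores, f_elite_k, B=2.0, delta=0.05):
    """Evaluate LCV folds one at a time; halt once the individual is
    statistically incapable of beating the k-th elite's fitness."""
    total = 0.0
    for n, score in enumerate(fold_scores, start=1):
        total += score
        xbar = total / n
        eps = hoeffding_eps(B, n, delta)
        if xbar + eps < f_elite_k:     # mu < f_elite_k with high certainty
            return None                # pruned: cannot enter the elites
    return xbar                        # survived the full evaluation

# A weak individual is pruned long before all 60 folds are evaluated.
print(race_individual([0.10, 0.12, 0.08] * 20, f_elite_k=0.60))
```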

MRMR-IFS

The minimum redundancy maximal relevance (MRMR) method [222,223] ranks features by importance via mutual information [224], which measures the non-linear dependence between random variables. The ranked variables are then combined with incremental feature selection (IFS), which chooses an attribute set by stepwise construction of unique feature subsets [225,226]. For example, in the recent PPIS prediction method by Li et al. [38], MRMR-IFS was used to reduce a set of over 700 features to only 51.

First, the mutual information I(X,Y) between two features X and Y, which serves as a measure of non-linear correlation, is defined as follows:

$$ I(X,Y)= \iint\limits_{X\,Y} P(x,y) \log_{b}\left(\frac{P(x,y)}{P(x)P(y)}\right)dy\,dx $$

where P(x,y) is the joint and P(x) and P(y) are the marginal probability density functions.

To calculate the relevance D of a feature f to the class c being predicted, one can compute D=I(f,c). Then, given a feature set Ω, the redundancy R of f to the values in Ω is defined as the average information held by f that is already in Ω:

$$ R = \frac{1}{|\Omega|} \sum\limits_{f_{i} \in \Omega} I\left(f,f_{i}\right) $$

MRMR then orders the feature set \(\Omega\) by sequential addition to a set of already selected attributes, \(\Omega_s\), drawn from the “remainder” set \(\Omega_t = \Omega \setminus \Omega_s\). At every step, the best feature \(f_b\) to add to \(\Omega_s\) is the one that best balances high \(D\) with low \(R\):

$$ f_{b} = \underset{f_{j}\in \Omega_{t}}{\arg\max} \left[ I(f_{j},c) - \frac{1}{|\Omega_{s}|} \sum\limits_{f_{i}\in\Omega_{s}} I\left(f_{j},f_{i}\right) \right] $$

The order of placement into \(\Omega_s\) is the output ranking of MRMR.

To apply IFS to the MRMR ranking, Li et al. [38] then built subsets of features by iteratively adding attributes in the order of the MRMR ranking and chose the feature subset with the maximal MCC score via ten-fold cross-validation.
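As a sketch of the ranking step, the following uses scikit-learn's discrete mutual information estimator on pre-binned features; it follows the definitions above but does not reproduce Li et al.'s exact estimator or parameters.

```python
# A minimal MRMR ranking sketch: relevance I(f, c) minus mean redundancy
# to already selected features, computed greedily.
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_rank(X_binned, y):
    """X_binned: (samples, features) integer-binned matrix; y: class labels.
    Returns feature indices in MRMR order."""
    n_feat = X_binned.shape[1]
    relevance = [mutual_info_score(y, X_binned[:, j]) for j in range(n_feat)]
    selected, remaining = [], list(range(n_feat))
    while remaining:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = np.mean([mutual_info_score(X_binned[:, j], X_binned[:, i])
                                  for i in selected])
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(300, 6))
y = (X[:, 2] + rng.integers(0, 2, size=300) > 2).astype(int)
print(mrmr_rank(X, y))  # feature 2 should rank near the top
```

IFS then evaluates the nested subsets built in this order and retains the one with the highest cross-validated MCC.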

Principal component analysis

Alternatively, principal component analysis (PCA), a method of information-preserving dimensionality reduction, can be used [227,228]. The number of principal component (PC) vectors required to account for any amount of the original variance in the data can be calculated (via the PCA eigenvalues), allowing control of the trade-off between high dimensionality and relevance to the predicted class. Further, the PCs are orthogonal and hence linearly uncorrelated, greatly reducing feature redundancy. The features of this new space (with the PCs as basis vectors) can then be used as input to a machine learner.

There are a variety of criteria for selecting the appropriate number of eigenvectors to use, including the widely used method of accounting for an arbitrarily chosen amount of variance sufficient to cover the majority of information required for the prediction task [229]. The PPIS predictor by de Moraes et al. [42], for example, chooses the number of PCs necessary to account for 95% of the variance of the data, after removing variables with excessively high linear correlation to each other.
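In scikit-learn, this criterion is a one-liner, as the sketch below shows on placeholder data: passing a fraction as n_components retains just enough PCs to reach that variance threshold.

```python
# A brief sketch of the 95%-variance criterion with scikit-learn, which
# selects the component count automatically when n_components is a fraction.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 50))  # placeholder feature matrix
pca = PCA(n_components=0.95)          # keep PCs explaining 95% of variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```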

Surface-interface size relation

The percentage of interacting surface residues is not constant with respect to protein size; rather, it has long been known that it follows a non-linear distribution (e.g. exponential regression line) [6,230,231]. However, this information has seldom been applied to PPIS prediction [16,23], and has sometimes been treated merely as a linearly changing proportion [13,35], despite results showing that this “size bias” carries significant predictive power on its own [103]. One potential application to ML-based predictors is dynamically setting the prediction threshold such that the proportion of predicted interacting residues matches the estimated prior distribution for the query protein (see the sketch below). Such intelligent biasing of the learner might even be extended by looking for patterns in the interface-surface residue ratio across different types of proteins.
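A minimal sketch of this idea follows; the exponential size prior and its constants are illustrative placeholders, not a regression fitted to the cited data.

```python
# A hedged sketch of a size-aware dynamic threshold: the classifier cutoff
# is set so that the predicted-positive fraction matches an assumed prior.
import numpy as np

def estimated_interface_fraction(n_surface_residues, a=0.9, b=0.004, floor=0.1):
    """Hypothetical size prior: larger proteins bury a smaller fraction."""
    return max(a * np.exp(-b * n_surface_residues), floor)

def dynamic_threshold(scores, n_surface_residues):
    frac = estimated_interface_fraction(n_surface_residues)
    return np.quantile(scores, 1.0 - frac)  # top `frac` of residues are positive

scores = np.random.default_rng(1).uniform(size=300)
t = dynamic_threshold(scores, 300)
print(t, (scores >= t).mean())
```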

Homology-based predictors

PredUs

The PredUs algorithm is based on the observation that PPISs are conserved across structural space [31,32], allowing known sites of structural homologs to be mapped onto query proteins. The initial version of PredUs [31] used an interfacial score ς with an empirically derived threshold. First, for a given query protein \(Q\), a set of structural neighbours \(N_i\) was derived using Ska [232,233], where each \(N_i\) is in complex with a partner \(P_i\). By bringing \(P_i\) into the coordinate space of \(Q\) via superposition for every \(N_i\), the sum of contacts in the new space gives a frequency \(f_r\) for every residue \(r \in Q\). The interfacial score for each residue is then given by:

$$\varsigma(r) = \frac{1}{1+\exp\left(\frac{-f_{r} + f_{max}/2}{f_{max}/10} \right) } $$

where \(f_{max}\) is the maximum contact frequency across the whole structure.
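The score itself is a one-line sigmoid, transcribed directly below; the contact frequencies in the example are fabricated for illustration.

```python
# A direct transcription of the PredUs interfacial score above, given
# per-residue contact frequencies mapped from structural neighbours.
import math

def interfacial_score(f_r, f_max):
    return 1.0 / (1.0 + math.exp((-f_r + f_max / 2.0) / (f_max / 10.0)))

freqs = {"A12": 0, "R13": 4, "D14": 9}  # made-up contact frequencies
f_max = max(freqs.values())
print({r: round(interfacial_score(f, f_max), 3) for r, f in freqs.items()})
```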

The second version of PredUs added an ML layer to the homology-based method [32]. For a given \(Q\), a map of contact frequencies was computed for each surface residue as shown above. Then, for every \(r \in Q\), a surface patch consisting of \(r\) and its closest spatial neighbours was constructed. For each patch, a feature vector of contact frequencies and SASA values (per amino acid) was derived, and a support vector machine (SVM) [234] was used to segregate interacting and non-interacting residues.

PrISE

The PrISE (Predictor of Interface Residues using Structural Elements) algorithm addresses several limitations of using whole-protein structural similarity, including the coarseness of global homology measures and the requirement for sufficient numbers of structural neighbours, via local structural conservation information [34].

First, a set \(S(Q)\) of “structural elements” (SEs) is extracted from a query protein \(Q\). Every SE \(q_r \in S(Q)\) contains the surface residue \(r\) as its central residue and all of \(r\)'s surface neighbours, and is represented by an atomic composition frequency histogram of its constituents. A database containing more than 34 million SEs [91] was created to isolate sets of SEs homologous to any \(Q\), denoted \(H_q\), from a wide variety of proteins, based on comparison of their atomic frequency histograms.

For the global predictor, weights \(w(p,q)\) were calculated for every pair of elements \(p \in H_q\) and \(q \in S(Q)\), based on the similarity of the entire protein \(Q\) to the protein from which \(p\) was extracted, denoted \(\pi(p)\). To evaluate this similarity, the contribution \(cont(P,R)\) was defined as the number of SEs in \(R\) present in \(P\). Thus, the weight for the global predictor is defined as

$$w_{G}(p,q) = cont\left(\pi(p),Z_{Q}\right) $$

where \(Z_Q\) is the union of the sets \(H_q\) over all \(q \in S(Q)\).

This was coupled with a local predictor, where the weight was computed as

$$w_{L}(p,q) = cont\left(\pi(p),N_{q}\right) $$

where \(N_q\) is the set of SEs with local structural homology to the SE in question, \(q\). This is computed by examining every residue \(r_i\) in \(q\), which has its own associated SE, \(S_{r_i}\). Next, the set of SEs most similar to \(S_{r_i}\), denoted \(R_{r_i}\), is drawn from the repository of all SEs. Each \(s_i \in R_{r_i}\) is locally homologous to some part of \(q\), since \(r_i\) is a residue in \(q\) and every SE \(s_i \in R_{r_i}\) is homologous to the SE around \(r_i\) (i.e. \(S_{r_i}\)). Thus, \(N_q\) is defined as:

$$ N_{q} = \bigcup\limits_{r_{i} \in q}R_{r_{i}} $$

The final predictor, called PrISE\(_C\), combined local and global information via a combined weight \(w_C(p,q) = w_G(p,q) \cdot w_L(p,q)\). Known PPIS participation of central SE residues was used to compute weights \(W_{C+}\) and \(W_{C-}\) by summing across all \(w_C(p,q)\) in which \(p\) is known to be in an interface (\(p+\)) and known to not be in an interface (\(p-\)), respectively. Thus,

$$W_{C+q} = \sum_{p+ \in H_{q}}w_{C}(p,q) \;\;\; \& \;\;\; W_{C-q} = \sum_{p- \in H_{q}}w_{C}(p,q) $$

The probability of a residue interacting is derived from the weights via:

$$P_{C+}(q) = \frac{W_{C+q}}{W_{C+q} + W_{C-q}} $$
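The final voting step can be sketched compactly; the weights below are placeholders, and the computation of \(w_G\) and \(w_L\) themselves (which requires the SE repository) is omitted.

```python
# A minimal sketch of the PrISE_C vote: combine global and local weights
# for each homologous element p of query element q, then convert the
# interface/non-interface weight sums into a probability.
def prise_probability(homologs):
    """homologs: iterable of (w_global, w_local, is_interface) per p in H_q."""
    w_pos = sum(wg * wl for wg, wl, is_int in homologs if is_int)
    w_neg = sum(wg * wl for wg, wl, is_int in homologs if not is_int)
    if w_pos + w_neg == 0:
        return 0.0  # no evidence either way
    return w_pos / (w_pos + w_neg)

print(prise_probability([(3, 2, True), (1, 4, False), (2, 2, True)]))
```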

HomPPI

The foundation of the HomPPI predictor family is the evolutionary conservation of interface residues, derived solely from sequence information. The two HomPPI predictors [20], NPS-HomPPI (Non-Partner Specific) and PS-HomPPI (Partner Specific), depend on the correlation, discovered via PCA, between interface conservation and several BLAST alignment statistics of sequence pairs. Conservation is calculated as the correlation coefficient of a prediction made by mapping all interacting residues of one protein in a pair onto the corresponding residues of its sequence homolog. The BLAST statistics log(EVal), Positive score, and log(LAL) were found to be highly correlated with conservation in non-partner-specific interfaces, where EVal is the expectation score, LAL is the Local Alignment Length, and the Positive score is the number of positive matches in the alignment.

The information obtained by PCA was used to create a linear scoring function, the Interface Conservation (IC) score, with the NPS predictor using

$$\begin{aligned} {IC}_{NPS}=\beta_{0} &+ \beta_{1} \log(EVal) \\ &+ \beta_{2} PosS + \beta_{3} \log(LAL) \end{aligned} $$

where the \(\beta_i\) were chosen to correlate best with the correlation coefficient described above.

This \(IC_{NPS}\) score was used to rank homologs of the query protein by their predicted conservation; the top ten were then subjected to a form of majority vote, in which each residue received a score based on the ratio of positive to negative votes. A threshold on this score determined the interacting residues of the query (see the sketch below).
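A hedged sketch of this pipeline follows; the β coefficients and the vote threshold are illustrative placeholders rather than the fitted values from [20].

```python
# A hedged sketch of NPS-HomPPI: score homologs with the linear IC
# function, keep the best, and let their mapped interface labels vote.
import math

def ic_nps(e_value, positive_score, lal, betas=(0.1, -0.5, 0.02, 0.8)):
    b0, b1, b2, b3 = betas  # placeholder coefficients
    return b0 + b1 * math.log(e_value) + b2 * positive_score + b3 * math.log(lal)

def vote_interface(homolog_labels, threshold=0.5):
    """homolog_labels: list over top homologs of per-residue 0/1 interface maps."""
    n_res = len(homolog_labels[0])
    votes = [sum(labels[i] for labels in homolog_labels) / len(homolog_labels)
             for i in range(n_res)]
    return [v >= threshold for v in votes]

print(round(ic_nps(1e-30, 120, 250), 2))
top_homologs = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1]]
print(vote_interface(top_homologs))
```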

BindML

The Binding site prediction by Maximum Likelihood (BindML) approach is based on sequence-derived evolutionary information, though it does use an input structure to choose patches of the query protein to target [12,52]. The first step involves the construction of two amino acid substitution matrices: one describing PPISs or protein binding interfaces (PBIs), and the other describing non-protein binding interfaces (NPBIs) or non-PPISs (NPPISs), via MSAs from iPFAM [94]. These matrices, \(M_{PBI}\) and \(M_{NPBI}\) respectively, are computed by counting substitutions with pairwise alignment sets [235], followed by construction with the BLOSUM method [172].

Then, for a given query protein \(Q\) with surface residues \(S_i\), a set of patches \(\{P_i\}\) is produced, one for every \(S_i\), based on an empirically chosen radial distance cutoff. For every \(P_i\), the corresponding residues (and the aligned columns within the family MSA to which \(Q\) belongs) are concatenated. This patch MSA is then used to produce two phylogenetic trees for each \(P_i\), \(T_{PBI}(i)\) and \(T_{NPBI}(i)\), using \(M_{PBI}\) and \(M_{NPBI}\), respectively. This is done with the BIONJ method [236], an extension of the neighbor-joining algorithm [237] that takes the variance of the evolutionary distances into account. A modified version of the PHYML algorithm [238] is then used to compute the log likelihood of the patch and the tree (built with \(M_{PBI}\) or \(M_{NPBI}\)) as follows:

$$ \begin{aligned} L_{PBI}(i) &= \log\left(\text{P}(P_{i},T_{PBI}(i)|M_{PBI})\right)\\ L_{NPBI}(i) &= \log\left(\text{P}(P_{i},T_{NPBI}(i)|M_{NPBI})\right) \end{aligned} $$

A difference score, \(dL(i) = L_{NPBI}(i) - L_{PBI}(i)\), can then be computed for every surface residue, combined and recast into Z-scores, and finally thresholded to determine whether a given residue is interacting or not.
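The final decision step can be sketched as follows; the likelihood values, the Z-score cutoff, and the sign convention (the PBI model fitting better implies an interface) are assumptions for illustration, with tree construction and PHYML scoring omitted.

```python
# A minimal sketch of BindML's final step: per-patch likelihood
# differences recast as Z-scores and thresholded.
import numpy as np

def bindml_calls(l_pbi, l_npbi, z_cutoff=1.0):
    """l_pbi, l_npbi: arrays of per-patch log likelihoods under each model."""
    dL = np.asarray(l_npbi) - np.asarray(l_pbi)   # dL(i) = L_NPBI(i) - L_PBI(i)
    z = (dL - dL.mean()) / dL.std()
    return z < -z_cutoff  # PBI model fits much better -> predicted interface

print(bindml_calls([-50, -42, -61, -55], [-48, -45, -52, -54]))
```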

Machine learning-based techniques

Li et al. 2012 method

The Li et al. method [38] uses machine learning on feature vectors derived from sequence windows. These data vectors include amino acid properties (from AAIndex [114,115]), conservation data, solvent accessibility values, and structural information. First, for every residue in each protein, a peptide is extracted with the residue in question serving as its center, accounting for the local environment of each amino acid. The peptides are labeled as active or inactive based on the label of their central residue and filtered for homology. Then, large feature vectors are extracted to represent each peptide, by extracting and concatenating a variety of attributes per residue, in the order of the peptide. Their dimensionality \(D\) is given by \(D = N \cdot L\), where \(N\) is the number of residue-level features and \(L\) is the length of each peptide (here, 34 and 21, respectively [38]). To remove irrelevant and redundant features from this high-dimensional feature space, the MRMR-IFS feature selection procedure is applied (see Feature selection). In the reduced space, the attribute vectors of the peptides are then used to construct a Random Forest (RF) classifier [239], an ensemble learner that combines the output of multiple decision trees.
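The window encoding can be sketched as follows; the feature values are placeholders, and zero-padding at the chain termini is an assumption (the published method's boundary handling may differ).

```python
# A minimal sketch of the window encoding: each residue's peptide of
# length L is represented by the concatenation of N per-residue features,
# giving D = N * L dimensions.
import numpy as np

def window_vectors(per_residue_features, L=21):
    """per_residue_features: (n_residues, N) matrix; returns (n, N*L) windows
    centered on each residue, zero-padded at the chain termini."""
    n, N = per_residue_features.shape
    half = L // 2
    padded = np.vstack([np.zeros((half, N)),
                        per_residue_features,
                        np.zeros((half, N))])
    return np.stack([padded[i:i + L].ravel() for i in range(n)])

X = window_vectors(np.random.default_rng(2).normal(size=(100, 34)))
print(X.shape)  # (100, 34 * 21) = (100, 714), i.e. "over 700 features"
```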

PresCont

The PresCont algorithm [33] combines local residue features with environmental information as input to an SVM. The creators of PresCont note their belief that interface prediction will not benefit from the use of a large number of noisy features, and thus make use of just a few important attributes, namely SASA, hydrophobicity, propensity and conservation. Additionally, the weighted average of each of these features over the neighbouring residues using a Euclidean distance cutoff is used. The features are scaled to the range [0,1] and used as input to an SVM with the radial basis kernel function, which constructs a hyperplane capable of optimally separating PPIS and NPPIS feature vectors.
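A sketch of such a pipeline, using scikit-learn with placeholder data, might look as follows; the kernel parameters and features are illustrative, not PresCont's fitted model.

```python
# A hedged sketch of a PresCont-style classifier: four per-residue
# features plus their neighbour-weighted averages, scaled to [0, 1] and
# fed to an RBF-kernel SVM.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 8))       # SASA, hydrophobicity, propensity,
y = rng.integers(0, 2, size=400)    # conservation + 4 neighbour averages
clf = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X, y)
print(clf.predict_proba(X[:3]))
```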

To test the utility of accounting for core-rim differences, the authors used the Intervor [240] algorithm (which computes distance to the interface rim via Voronoi shelling order) with PresCont, detecting differences in propensities (as found previously [198]), but ultimately not recommending use of core residues alone for training when the full PPIS is desired.

RAD-T

The Residues on Alternating Decision Trees (RAD-T) algorithm combines supervised ML with representative characteristics from all major feature types [37]. Training data was produced from proteins crystallized separately as monomeric structures, with interface labels mapped back from their complexes. This was done to train the learner on proteins in their monomeric conformation, rather than their complexed one, as PPIS predictors tend to be run on monomeric proteins of interest, produced by crystallization or by modelling, for which the partners or complexes are not known.

The class imbalance problem, caused by the numerical disparity between PPIS and NPPIS training examples, was addressed by resampling the abundant non-interacting examples to match the number of interacting ones in a 1:1 ratio, based on empirical testing of different positive-to-negative ratios over a variety of machine learners (see the sketch below). The resulting set of feature vectors can then be used to optimize a given machine learner by GRS, which finds the feature subset optimally discriminative of PPIS participation for a given learner and dataset.
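The resampling step referenced above can be sketched as follows; the arrays are placeholders.

```python
# A minimal sketch of 1:1 undersampling: the abundant non-interface
# examples are randomly subsampled to match the interface count.
import numpy as np

def balance_1to1(X, y, rng=np.random.default_rng(4)):
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = rng.permutation(np.concatenate([pos, keep_neg]))
    return X[idx], y[idx]

X, y = np.arange(20).reshape(10, 2), np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0])
Xb, yb = balance_1to1(X, y)
print(yb.mean())  # 0.5 by construction
```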

For ML, RAD-T then uses an alternating decision tree (ADTree) [241,242], an extension of the classical decision tree that integrates across multiple paths in the tree and makes use of boosting, in which multiple weak learners are used to build a single strong one.

VORFFIP

The limited information contained in single residues has often been supplemented with environmental or neighbourhood information [8,38,39]. The VORFFIP (Voronoi Random Forest Feedback Interface Predictor) algorithm [40] makes use of atom-resolution 3D Voronoi diagrams [243] to identify spatially neighbouring surface residues, for which features are assigned using structural, energetic, evolutionary, and experimental data. The use of Voronoi diagrams may be better than other approaches as it is based on an implicitly defined “visibility” between residues (avoiding choice of falloff rates and threshold values), and it allows weighting environmental contributions with greater resolution [40].

In addition, the method discriminates and removes outliers, as residues with high probability of PPIS participation are generally in contiguous patches and not surrounded by low probability residues. This is accomplished with the use of a 2-step RF ensemble classifier, each instantiation of which consists of several hundred discrete decision trees, with the first RF utilizing structure-, energy-, and evolutionary-based features for each surface residue, as well as environmental information, and the second RF making use of the scores from the previous step, along with further environmental score-derived metrics.

Both RFs include a weighting function that accounts for the strength of contact \(c_{ij}\) between amino acids \(a_i\) and \(a_j\) based on their atoms' positions in the Voronoi diagram, giving greater influence to neighbouring residues with more atomic contacts. This weight is defined as \(c_{ij} = N_{ij}/N_i\), where \(N_{ij}\) is the number of contacts (shared facets in the Voronoi diagram) between the atoms of \(a_i\) and \(a_j\), and \(N_i\) is the sum of all atomic contacts made by \(a_i\). Once the first RF calculates the probability of residues being involved in interaction sites, several further metrics are created from these predictions. These include the environmental score \(es_i\), which weights the scores from the first RF for each neighbouring residue, and Contact Score Vector (\(csv\)) features, which account for the contributions of each residue type:

$${es}_{i} = \sum\limits_{j=1}^{n}c_{ij}s_{j} \;\;\;\;\; \& \;\;\;\;\; {csv}_{l} = \sum\limits_{a_{j} \equiv {type}_{l}}c_{ij}s_{j} $$

where \(s_j\) is the score of the neighbouring residue, \(l\) is one of the 20 residue types, and residue \(a_j\) has type \(l\). Once calculated, these scores are added to those from the first-step RF for each residue as input to the second-step RF, generating a revised prediction with reduced outliers, better environmental accounting, and improved prediction performance.
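The second-stage features can be sketched directly from these definitions; the facet counts, scores, and residue types below are placeholders.

```python
# A minimal sketch of the second-stage VORFFIP features: contact weights
# c_ij from Voronoi facet counts, the environmental score es_i, and the
# per-residue-type contact score vector csv.
import numpy as np

def second_stage_features(facet_counts, first_rf_scores, residue_types, n_types=20):
    """facet_counts: (n, n) shared-facet counts N_ij; first_rf_scores: s_j;
    residue_types: integer type index of each residue."""
    N_i = facet_counts.sum(axis=1, keepdims=True)
    c = facet_counts / np.maximum(N_i, 1)           # c_ij = N_ij / N_i
    es = c @ first_rf_scores                        # es_i = sum_j c_ij * s_j
    csv = np.zeros((len(first_rf_scores), n_types))
    for l in range(n_types):                        # csv_l sums over type-l residues
        mask = (residue_types == l)
        csv[:, l] = (c[:, mask] * first_rf_scores[mask]).sum(axis=1)
    return es, csv

counts = np.array([[0, 3, 1], [3, 0, 2], [1, 2, 0]])
es, csv = second_stage_features(counts, np.array([0.9, 0.2, 0.6]),
                                np.array([0, 1, 0]), n_types=2)
print(es)
```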

Sting-LDA-WNA

The method by de Moraes et al., designated Sting-LDA-WNA, combines PCA-based recombination of neighbourhood-averaged feature vectors with amino acid-specific linear classifiers [42]. The Sting database [116] was used to extract a residue-level feature vector for each amino acid in every protein. To consider the local environment of each residue, weighted neighbour averages were utilized via the two approaches of Porollo and Meller [39]:

$$V^{S}_{\text{WNA}} = \sum\limits_{i=0}^{N} V_{i} \text{RSA}_{i} \;\;\;\;\;\&\;\;\;\;\; V^{d}_{\text{WNA}} = V_{0} + \sum\limits_{i=1}^{N} \frac{V_{i}}{d_{i}} $$

where \(N\) is the number of neighbouring residues (within a sphere of 15 Å, based on previous work [39,244]), \(V_i\) is the feature value of the \(i\)th neighbour, \(d_i\) is its distance from the target residue, and \(RSA_i\) is its relative solvent accessibility. Finally, feature selection was performed by (1) removal of attributes with high linear correlation and (2) PCA, with sufficient principal components retained to explain 95% of the variance (see Feature selection), resulting in a feature space with low dimensionality and redundancy.
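Both averages transcribe directly into code, as sketched below on placeholder values.

```python
# A direct sketch of the two weighted-neighbour averages above for one
# target residue; values, RSAs, and distances are placeholders.
import numpy as np

def wna_rsa(values, rsas):
    """V^S_WNA = sum_{i=0..N} V_i * RSA_i, with i = 0 the target residue."""
    return float(np.dot(values, rsas))

def wna_dist(v_self, v_neigh, d_neigh):
    """V^d_WNA = V_0 + sum_{i=1..N} V_i / d_i (distance-damped neighbours)."""
    return float(v_self + np.sum(np.asarray(v_neigh) / np.asarray(d_neigh)))

print(wna_rsa([0.8, 0.5, 0.2], [0.3, 0.6, 0.1]))   # 0.56
print(wna_dist(0.8, [0.5, 0.2], [4.0, 9.0]))       # 0.8 + 0.125 + 0.022...
```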

Linear discriminant analysis (LDA) was then used to derive a hyperplane capable of separating input vectors based on the class labels of the training data [245,246]. Importantly, the authors used amino acid-specific classifiers (i.e. a different LDA classifier was built and then applied for each of the canonical amino acid types), leading to higher predictive ability.

Chen et al. 2012 method

Unlike most other predictors, the method by Chen et al. operates on the atomic level, utilizing PDMs (encoding the probabilistic “strength” of interaction of a given protein atom with every other potential atom type) to produce atom-level feature vectors [7]. To account for neighbourhood information, Chen et al. also add the distance-weighted atomic density values for the neighbouring atoms, followed by normalization. Along with a measure of the unoccupied van der Waals volume around a given atom, this gives a set of per-atom feature vectors for any protein.

These vectors were then used for ML via a feed-forward artificial neural network (ANN) with a sigmoid transfer function [247,248] and the resilient back-propagation algorithm [249]. Separate ANN classifiers for each protein atom type were trained and their outputs combined, with training designed to maximize MCC. Further, the ensemble-based bootstrap aggregation (bagging) algorithm was employed to counter the class imbalance [250].

The final ensemble of atom-specific, bagging ANNs could then predict the interface atoms of a given query protein surface. To turn these into residue-level predictions, high-confidence atomic predictions were treated as “seeds”, and any surrounding atoms of even moderate confidence were assigned to an atomic interacting patch. Any residue with a high proportion of its atoms belonging to such a patch was considered interacting (see the sketch below).
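A hedged sketch of this seed-and-expand post-processing follows; the confidence cutoffs and the residue-level proportion are illustrative, not the published thresholds.

```python
# A hedged sketch of seed-and-expand: high-confidence atoms seed patches,
# moderate-confidence neighbours join them, and residues are called by
# the fraction of their atoms inside a patch.
def residues_from_atoms(atom_conf, atom_to_res, neighbours,
                        seed_t=0.8, grow_t=0.5, res_frac=0.5):
    """atom_conf: dict atom -> ANN confidence; neighbours: atom adjacency."""
    patch = {a for a, c in atom_conf.items() if c >= seed_t}      # seeds
    frontier = list(patch)
    while frontier:                                               # grow patch
        a = frontier.pop()
        for nb in neighbours.get(a, ()):
            if nb not in patch and atom_conf.get(nb, 0) >= grow_t:
                patch.add(nb)
                frontier.append(nb)
    residues = {}
    for a, r in atom_to_res.items():                              # per-residue vote
        hits, total = residues.get(r, (0, 0))
        residues[r] = (hits + (a in patch), total + 1)
    return {r: hits / total >= res_frac for r, (hits, total) in residues.items()}

conf = {"a1": 0.9, "a2": 0.6, "a3": 0.2, "a4": 0.7}
adj = {"a1": ["a2"], "a2": ["a1", "a3"], "a4": []}
a2r = {"a1": "R1", "a2": "R1", "a3": "R2", "a4": "R2"}
print(residues_from_atoms(conf, a2r, adj))  # {'R1': True, 'R2': False}
```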

Ensemble predictors

Given the disparate approaches and information sources applied to PPIS prediction, it is natural that combining multiple methods should increase scores. For example, using amino acid-specific [42] and atom-specific [7] classifier ensembles permits the reduction of noise and better separation between residues/atoms that likely have different properties differentiating them in interface sites. Similarly, the combination of local and global structural homology-based predictors in PrISE [34] and the two-step RF of VORFFIP [40] both increase scores over the independent counterparts. Some predictors also find increased scores upon combination with previous methods, as in WHISCYMATE [41] and Combined BindML [12].

More integrative approaches utilize meta-predictors, which combine multiple predictors into a single output score. Meta-PPISP [25] combines cons-PPISP [16], Promate [35], and PINUP [28] via a linear regression model taking local environment into account.

By adding SPPIDER [39] and PPI-PRED [13] to that list, metaPPI [24] uses the frequency with which its constituent predictors consider a residue interacting to predict contiguous PPIS patches, by iteratively adding surface vertices. More recently, CPORT [17] uses a consensus approach, adding residues that pass any constituent predictor's specific threshold. In all cases tested so far, these meta-predictors see an increase in score and robustness compared to their individual constituent predictors.

Evaluation

A comparative evaluation of PPIS predictors has been performed in previous reviews [109,214]; as such, we focus on information regarding only the most recent predictors, shown in Table 4.

Table 4 Comparative evaluation of recent predictors

Objective evaluation of PPIS predictor performance is made difficult by the varying definitions of interaction sites and accessible surface residues in the literature, the lack of available servers for all predictors, the adoption of varying training and testing datasets, and the different metrics used for evaluation [4,47]. We partially circumvent these problems by considering the performance of each predictor across a variety of test sets based on literature values.

Assessment measures

PPIS predictors are generally judged with a number of standard performance metrics, including accuracy, sensitivity (also termed recall, true positive rate, or coverage), precision, and specificity:

$$\begin{aligned} & Accuracy &&= \frac{T_{P}+T_{N}}{T_{P}+F_{P}+T_{N}+F_{N}} \\ & Precision &&= \frac{T_{P}}{T_{P}+F_{P}} \\ & Sensitivity &&= \frac{T_{P}}{T_{P}+F_{N}}\\ & Specificity &&= \frac{T_{N}}{T_{N}+F_{P}} \end{aligned} $$

where \(T_P\), \(T_N\), \(F_P\), and \(F_N\) denote true positives, true negatives, false positives, and false negatives, respectively.

Measures designed to balance between false negative and false positive rates include \(F_1\) and MCC [219]:

$$\begin{aligned} F_{1} &= 2\frac{precision \cdot recall}{precision + recall} \\ \\ MCC &= \frac{T_{P}\cdot T_{N} - F_{P}\cdot F_{N}}{\alpha} \end{aligned} $$

where:

$${\alpha=\sqrt{(T_{P}+F_{P})(T_{P}+F_{N})(T_{N}+F_{P})(T_{N}+F_{N})}} $$

Similarly, the receiver operating characteristic (ROC) curve, a plot of sensitivity versus \(1 - specificity\) derived by varying the classifier prediction threshold, can be used to compute the area under the ROC curve (AUROC/AUC) [251], which is especially useful for identifying artificially “inflated” performance (e.g. higher sensitivity at the expense of specificity) and for being independent of the decision threshold [252].
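For reference, these metrics compute directly from confusion counts, as sketched below; AUROC is obtained from raw scores, here via scikit-learn.

```python
# A minimal sketch of the standard metrics above from confusion counts;
# AUROC is computed from raw scores for comparison.
import math
from sklearn.metrics import roc_auc_score

def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    alpha = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / alpha
    return dict(precision=precision, recall=recall,
                specificity=specificity, f1=f1, mcc=mcc)

print(metrics(tp=30, tn=150, fp=20, fn=10))
print(roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```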

In general, similar to the lack of consensus in interface definitions and datasets, there are no standard criteria for performance assessment [47]. Given that some false positive predictions may in fact be correct (due to the paucity of crystallized complexes), patch-specific performance metrics (i.e. assessing correctness within a local patch around the interface in question, such as with the Sørensen-Dice index [253,254]) may be used, though such metrics poorly account for false positives. While other evaluation methods have been devised [16,35], computing the statistics above per residue and averaging across the dataset appears to be the most objective and easily comparable method.

The authors note that even the more balanced measures should not be solely relied on (e.g. MCC may favour overprediction in PPIS prediction [214] and underprediction elsewhere [255]) and that predictor performance should be viewed holistically across as many metrics as possible, as balancing performance metrics is domain-dependent [47,255]. When considering PPIS prediction for mimetic drug design, slight underprediction may be desirable, as it will likely find the better discriminated core residues [7,33], from which the remaining PPIS can be inferred (rather than “guessing” which of many allegedly “active” residues is even interacting).

Comparative evaluation

While it is difficult to draw conclusions from the differing performance of the predictors, we can nevertheless observe some trends that may be explained by the biological theory discussed previously. For example, while transient datasets (such as TransComp_1) generally garner lower scores than permanent ones (such as PlaneDimers), this is not perfectly followed (Table 4), possibly due to the difficulty in defining a threshold on the transient-permanent continuum. Some sets (e.g. S149) may be intrinsically more predictable, as evidenced by higher scores across all predictors; others achieve better results only on certain types of predictors (e.g. DB3-188 on structural homology-based predictors). To achieve high scores on specialized testing datasets, predictors often require either specializations of their own, or inherent characteristics that permit accurate classification (e.g. ANCHOR’s [10] specialization for disordered proteins and HomPPI’s [20] lack of requirement for structural information allow them both to successfully predict on the S1/2 disordered sets). Theoretically, unbound structures are more difficult to predict on than bound monomers (due to the conformational disparity between the two sets); this is largely confirmed by differing results on the DS56B/U sets, as well as generally lower scores on unbound sets (Table 4). Overall, we find that there has been significant progress in the predictive abilities of the predictors over the last decade across diverse interaction types and datasets.

Future directions

Though the field of PPIS prediction has been steadily improving in accuracy and sophistication over time, challenges remain before scores sufficient to permit its many potential applications can be achieved.

Accounting for interface type and subclass

The classification of interfaces between transient, permanent, and obligate, as well as within interfaces as core and rim structures, has been extensively studied from a theoretical standpoint. However, with some exceptions [20,33,52], this information has been algorithmically underutilized for PPIS prediction. Focusing on datasets of a particular complex type, combining interface classification with interaction likelihood prediction, and integrating learning of the different properties of the interface core and rim into PPIS prediction are just a few examples of promising areas that are currently under investigation.

Closer examination of datasets

As with much biological data, protein structural datasets are non-standardized, and virtually all structure-based predictors apply varying criteria for processing and filtering this data. Neither the types of proteins that are least predictable nor the trade-off between large but heterogeneous training sets and smaller but cleaner ones has been comprehensively examined. Specialized training sets per query protein, whether selected by structural, sequence-level, or functional data, are worth exploring as well. Potentially, unsupervised learning could be applied to extract hidden patterns within the data, contributing to further analysis of the relation between interaction types, protein categories, and the feature space. Further, methods accounting for systemic biases in the PDB (e.g. under-representation of membrane and disordered proteins) may also improve robustness [256,257].

Improved benchmarking

More comprehensive and standardized benchmarking, as noted in other reviews [75], is essential to advancing the field. Additionally, to facilitate comparative evaluation of performance, as well as to ensure that improvements are not statistical anomalies, the authors suggest that significance testing be applied to future published work.

Utilizing ensemble approaches

The numerous methods developed for PPIS prediction often utilize diverse sources of information and varied techniques, suggesting that methods of combining approaches are promising. Current meta-predictors tend to use relatively straightforward methods of combining their constituents [17,24,25]; ensemble techniques that take spatial relations into account, such as graph-based or random field models, could be applied for this purpose [258,259]. The strengths, weaknesses, and specializations of the various approaches should also be taken into account. For instance, homology-based predictors (e.g. PredUs [32] or HomPPI [20]), which can predict exceedingly well when homologues are available but fail otherwise, could be complemented by a more general, machine learning-based predictor in cases where homology information is missing. On a smaller scale, ensemble methods utilizing a set of specialized predictors (e.g. residue-specific [42], atom-specific [7], bagging-based [7]) appear to be useful for reducing noise and making better use of the available information contained within the protein.

Integrating other areas of computational biology

Greater integration of PPIS prediction with other areas of computational biology also holds promise. The relatively small number of crystallized structures means that predictors relying on structural characteristics cannot be applied when such information is absent; however, molecular modeling could prove useful in mitigating this problem. Other areas of bioinformatics are also being applied to assist PPIS prediction, such as molecular docking [11,19,53,260].

Further application of computational techniques

Several areas of PPIS prediction could benefit from the use of more sophisticated computational learning techniques. For feature selection and extraction, current methods can be combined (e.g. MRMR or PCA with GRS) or extended (e.g. empirical Bernstein bounds [261] with GRS, non-linear component analysis [262], or autoencoders [263]). The class imbalance problem has been recently circumvented via bagging [7], but semi-supervised learning could also be applied, as seen for hotspot [264,265] and pairwise protein interaction prediction [266], with minimal changes to the features used. Indeed, as “non-interacting” residues may be mislabelled (since every possible protein complex is certainly not known), semi-supervised learning methods for handling this problem are even more applicable [267,268].

Recently, methods for accounting for neighbourhood information have become more prevalent, including averaging across the local environment [7,33], Voronoi diagrams [40], and feature concatenation [8,38], each providing gains in predictive ability. This suggests that methods for multi-scale machine learning could prove effective [258,259]. Machine learning techniques for multi-class classification [269,270] (e.g. separating core vs. rim vs. surface) also hold potential for improvement.

Applications

The identification of interacting residues on protein surfaces holds potential for use in diverse fields across biology and medicine. One of the most closely related problems is the elucidation of the residues inside PPISs that account for the major change in binding free energy upon complex formation, known as hotspots (HSs) [271,272]. ML-based PPIS predictors have already been successfully used for HS prediction by simply altering the training set [161]. Interface predictions may be used to narrow the search space of HS predictors, as HSs tend to localize to the interface core [33,273]. For the same reason, putative HS residues could be used to “seed” interface site predictions. Similarly, knowledge of a PPIS can be used to guide mutagenesis experiments toward more promising sites, reducing the expense and time required for a whole-protein analysis.

Most importantly, knowledge of PPISs and HSs can be used for the rational design of therapeutics and biomolecules by serving as a template for the de novo creation of small molecules with enhanced efficacy and selectivity. Mimetics of the interaction sites of well-known molecules have been successfully built [274,275], a process that could be significantly expedited by knowledge of putative interaction sites, as it would allow rational construction of a mimetic compound without requiring mass screening. The design of novel interface sites has also shown promise for the construction of new functional biomolecules [276,277].

Another interesting application is in assisting computational protein-protein docking. Docking without prior knowledge of the interaction sites of the proteins in question (i.e. ab initio docking) has been shown to be more difficult due to the staggering dimensionality of the search space involved [278]. Information-driven docking [279] can mitigate this problem by utilizing mass spectrometry and interaction data, the latter of which is often difficult to obtain experimentally but can be provided by PPIS prediction. This approach has already proven successful in both high-resolution docking [17,24,41] and coarse, large-scale docking experiments [280]. Additionally, PPIS-driven coarse docking studies could assist in large-scale PPI network creation [281], as well as the alignment of such networks for functional ortholog identification [282-284].

Conclusion

The field of protein-protein interaction site prediction has grown significantly since the pioneering work of Jones and Thornton, and is now poised to bring great benefit to other problems in biomedical science, particularly rational drug design. This growth has, however, brought several issues to the forefront, including the need for standardized testing sets and evaluation metrics to ensure that objective comparisons of performance can be carried out. The field has seen numerous algorithmic advances over the past few years, building on decades of theoretical biology, and we expect that combining and extending these algorithms will be a considerable source of improvement in the future. As such, we have undertaken an exhaustive analysis of the state-of-the-art algorithms presently in use and their performance, as well as an exploration of the datasets and features employed by current predictors. We believe that future advances will bring predictors capable of significantly contributing to biomedical and pharmaceutical science.