An overview of the evoKGsim methodology is shown in Fig. 10. In a first step, the semantic similarities corresponding to each semantic aspect are computed for each protein pair in our input data. In a second step, GP evolves a good (hopefully the best) combination of the different SS aspects to support PPI prediction. Finally, the quality of the classifications obtained on the test set, using the evolved combination, is evaluated.
The implementation of our methodology takes as input an ontology file, a protein annotation file and a list of protein pairs. The Semantic Measures Library 0.9.1 [44] is used to compute the SSMs using GO and GO annotations. Two machine learning and GP libraries are used in the second step: scikit-learn 0.20.2 [34] and gplearn 3.0 (https://gplearn.readthedocs.io).
Data sources
Data sources are organized into a KG and benchmark datasets, which are described in the next subsections.
Knowledge graph
The KG used in this work is composed of the GO and GO annotations. GO [5] (dated January 2019) contains 45006 ontology terms, subdivided into 4206 cellular component terms, 29689 biological process terms, and 11111 molecular function terms. Only is-a relations are considered. GO annotations were downloaded from the Gene Ontology Annotation (GOA) database [45] (dated January 2019) for different species. These annotations link Uniprot protein identifiers to the GO terms that describe them.
GO [5] is the most widely used biological ontology. GO defines the universe of concepts (also called "GO terms") associated with gene product functions and how these functions are related to each other with respect to three aspects: (i) biological process (BP), which captures the larger process, accomplished by multiple molecular activities, in which the gene product is active; (ii) molecular function (MF), the biochemical (or molecular-level) activity of a gene product; (iii) cellular component (CC), the location relative to cellular structures in which a gene product performs a function. GO terms and their semantic relations form a hierarchical directed acyclic graph (DAG) where the three GO aspects are represented as root nodes of the graph. The ancestor terms in the hierarchy subsume the semantics of descendant terms.
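As a toy illustration (not part of the original implementation), the following Python sketch gathers the ancestors of a term in a small is-a hierarchy; this is the operation underlying the subsumption relation just described, and the term names are merely illustrative:

```python
def ancestors(term, parents):
    """Return all ancestors of `term` (excluding the term itself)."""
    result = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

# Toy is-a hierarchy: each term maps to its direct parents.
parents = {
    "protein catabolic process": ["protein metabolic process"],
    "amyloid precursor protein metabolic process": ["protein metabolic process"],
    "protein metabolic process": ["biological_process"],
    "protein stabilization": ["biological_process"],
}

print(ancestors("protein catabolic process", parents))
# {'protein metabolic process', 'biological_process'}
```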
A GO annotation associates a specific gene product with a specific term in the GO, identifying some aspect of its function. For instance, in Fig. 1 the gene product for ACES_HUMAN is annotated with the GO term amyloid precursor protein metabolic process. A single gene product may be annotated with several terms across all semantic aspects of GO.
Benchmark protein-protein interaction datasets
For evaluation and comparison, we use benchmark PPI datasets of different species. These datasets were produced by other works and have been used by several others to evaluate PPI approaches (see Table 6). The positive data (interacting protein pairs) of these datasets were collected from existing databases. The negative data is obtained by random sampling of protein pairs, since experimental high-quality negative data (non-interacting protein pairs) is scarce. Random sampling is based on the assumption that the expected number of negatives is several orders of magnitude higher than the number of positives, such that the negative space is sampled with much larger probability than the positive space [43]. In most of the datasets, negative data is generated by randomly creating protein pairs that are not reported to interact. The GRID/HPRD-bal-HS dataset employs a different strategy to achieve balanced random sampling: the number of times each protein appears in the negative set is equal to the number of times it appears in the positive set, with the negative set still being composed of protein pairs that are not known to interact.
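The balanced strategy can be sketched as follows. This is an illustrative reconstruction, not the code used to build GRID/HPRD-bal-HS; `max_tries` is an arbitrary safeguard against pathological inputs:

```python
import random

def sample_balanced_negatives(positive_pairs, max_tries=100000):
    """Pair up proteins so each occurs as often in negatives as in positives."""
    known = {frozenset(p) for p in positive_pairs}
    # Multiset of proteins, preserving how often each occurs in positives.
    pool = [p for pair in positive_pairs for p in pair]
    for _ in range(max_tries):
        random.shuffle(pool)
        candidate = list(zip(pool[0::2], pool[1::2]))
        # Reject pairings that recreate a known interaction or pair a protein with itself.
        if all(a != b and frozenset((a, b)) not in known for a, b in candidate):
            return candidate
    raise RuntimeError("no valid balanced pairing found")
```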
Table 6 PPI benchmark datasets, with number of positive interactions (PI) and number of negative interactions (NI)

The species and the number of interactions for each dataset are provided in Table 6. Given the evolving nature of GO annotations, some benchmark proteins are no longer found in current GOA files. Consequently, we removed all pairs that failed to meet the following criterion: both proteins have at least one annotation in at least one semantic aspect. Furthermore, the yeast datasets do not use Uniprot identifiers, so we used the Protein Identifier Cross-Reference (PICR) tool [46] web application to map protein identifiers to the corresponding UniProt accession numbers. PICR provides programmatic access through a Representational State Transfer (REST) interface, which only requires building a well-formatted URL. However, not all identifiers could be mapped to Uniprot, and the corresponding pairs were removed.
Table S1 of Additional file 1 provides the number of interactions for each dataset before excluding the pairs that did not meet the above criteria.
Semantic similarity measures
A SSM is a function that, given two ontology terms or two sets of terms annotating two entities, returns a numerical value reflecting the closeness in meaning between them. Thus, SS can be calculated for two ontology terms, for instance calculating the similarity between the GO terms protein metabolic process and protein stabilization; or between two entities each annotated with a set of terms, for instance calculating the similarity between APBB1_HUMAN and ACES_HUMAN. In the case of proteins annotated with GO, SS can be interpreted as a measure of functional similarity between proteins.
Many SSMs applied to biomedical ontologies have been proposed, see for instance [14, 47, 48] and references therein. Early approaches for term semantic similarity have used path distances between terms, assuming that all the semantic links have equal weight. More recent approaches explore the notion of information content (IC), a measure of how specific and informative a term is. This gives SSMs the ability to weight the similarity of two terms according to their specificity. IC can be calculated based on intrinsic properties, such as the structure of the ontology, or using external data, such as the frequency of annotations of entities in a corpus. Taking Fig. 1 as an example, this allows SSMs to consider protein catabolic process and amyloid precursor protein metabolic process more similar than protein metabolic process and protein stabilization.
Entity SSMs typically employ one of two approaches: (1) pairwise: where pairwise comparisons between all terms annotating each entity are considered; (2) groupwise: where set, vector or graph-based measures are employed, circumventing the need for pairwise comparisons. Figure 11 illustrates how two proteins are represented by their GO terms when some terms annotate only one protein while others annotate both proteins.
In this work, the SS between two proteins is computed using three different SSMs (SimGIC, ResnikMax and ResnikBMA), summarized in Table 7. SimGIC is a groupwise approach proposed by Pesquita et al. [49], based on a Jaccard index in which each GO term is weighted by its IC and given by
$$ \text{simGIC}(p_{1},p_{2}) = \frac{ \sum_{t \in \{\text{GO}(p_{1}) \cap \text{GO}(p_{2})\}}\text{IC}(t)}{ \sum_{t \in \{\text{GO}(p_{1}) \cup \text{GO}(p_{2})\}}\text{IC}(t)} $$
(1)
where GO(pi) is the set of annotations (direct and inherited) for protein pi.

Table 7 Summary of SSMs used to calculate the SS between gene-products
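For concreteness, Eq. (1) can be transcribed directly into Python. Here `go` and `ic` are assumed mappings from proteins to their annotation sets and from terms to IC values; in our pipeline, the actual computation is performed by the Semantic Measures Library:

```python
def sim_gic(p1, p2, go, ic):
    """SimGIC (Eq. 1): IC-weighted Jaccard index over annotation sets."""
    union = go[p1] | go[p2]
    if not union:
        return 0.0  # guard: no annotations in this aspect
    common = go[p1] & go[p2]
    return sum(ic[t] for t in common) / sum(ic[t] for t in union)
```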
ResnikMax and ResnikBMA are pairwise approaches based on the term-based measure proposed by Resnik [50] in which the similarity between two terms corresponds to the IC of their most informative common ancestor. This pairwise approach is used with two combination variants, maximum
$$ \begin{aligned} &\text{Resnik}_{\text{Max}}(p_{1},p_{2}) = \\ &\hspace{5mm}\max{\{\text{sim}(t_{1},t_{2}): t_{1} \in \text{GO}(p_{1}), t_{2} \in \text{GO}(p_{2})\}} \end{aligned} $$
(2)
and best-match average
$$ \begin{aligned} \text{Resnik}_{\text{BMA}}(p_{1},p_{2}) = & \frac{\sum_{t_{1} \in \text{GO}(p_{1})}\max_{t_{2} \in \text{GO}(p_{2})}{\text{sim}(t_{1},t_{2})}}{2|{\text{GO}(p_{1})}|} + \\ & \frac{\sum_{t_{2} \in \text{GO}(p_{2})}\max_{t_{1} \in \text{GO}(p_{1})}{\text{sim}(t_{1},t_{2})}}{2|{\text{GO}(p_{2})}|} \end{aligned} $$
(3)
where |GO(pi)| is the number of annotations for protein pi, and sim(t1,t2) is the SS between GO terms t1 and t2, defined as
$$ \text{sim}(t_{1},t_{2})= \max{\{ \text{IC}(t) : t \in \{\mathrm{A}(t_{1}) \cap \mathrm{A}(t_{2})\}\}} $$
(4)
where A(ti) is the set of ancestors of ti.
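Eqs. (2)-(4) admit a similarly direct transcription. In this sketch, `anc` is an assumed mapping from each term to its ancestors (here taken to include the term itself, a common convention so that sim(t,t) = IC(t)), and `go` and `ic` are as in the SimGIC sketch above:

```python
def term_sim(t1, t2, anc, ic):
    """Resnik term similarity (Eq. 4): IC of the most informative common ancestor."""
    common = anc[t1] & anc[t2]
    return max((ic[t] for t in common), default=0.0)

def resnik_max(p1, p2, go, anc, ic):
    """Eq. (2): maximum over all term pairs."""
    return max(term_sim(t1, t2, anc, ic)
               for t1 in go[p1] for t2 in go[p2])

def resnik_bma(p1, p2, go, anc, ic):
    """Eq. (3): best-match average over both annotation sets."""
    best1 = [max(term_sim(t1, t2, anc, ic) for t2 in go[p2]) for t1 in go[p1]]
    best2 = [max(term_sim(t1, t2, anc, ic) for t1 in go[p1]) for t2 in go[p2]]
    return sum(best1) / (2 * len(best1)) + sum(best2) / (2 * len(best2))
```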
These measures were selected because SimGIC and ResnikBMA represent high-performing groupwise and pairwise approaches in predicting sequence, Pfam, and Enzyme Commission similarity [49], whereas ResnikMax may help elucidate whether a single source of similarity is enough to establish interaction.
The IC of each GO term is calculated using a structure-based approach proposed by Seco et al. [51] based on the number of direct and indirect descendants and given by
$$ \text{IC}_{\text{Seco}}(t) = 1 - \frac{\log{\bigl[\text{hypo}(t)+1\bigr]}}{\log{\bigl[\text{maxnodes}\bigr]}} $$
(5)
where hypo(t) is the number of direct and indirect descendants of term t (including term t itself) and maxnodes is the total number of concepts in the ontology.
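In code, Eq. (5) reduces to a one-liner; `hypo` is an assumed precomputed mapping from each term to its descendant count:

```python
import math

def ic_seco(t, hypo, maxnodes):
    """Seco et al. intrinsic IC (Eq. 5): more descendants means lower IC."""
    return 1.0 - math.log(hypo[t] + 1) / math.log(maxnodes)
```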
Genetic programming and supervised learning
GP [33] is one of the methods of evolutionary computation [52–54] that is capable of solving complex problems by evolving populations of computer programs, using Darwinian evolution and Mendelian genetics as inspiration. GP can be applied to supervised learning problems [33, 55], including several in the biomedical domain (e.g. [56–58]).
Figure 12 illustrates the basic GP evolutionary cycle. Starting from an initial population of randomly created programs/models representing potential solutions to a given problem (e.g., combinations of SS aspects to predict PPI), GP evaluates each of them and attributes a fitness value quantifying how well the program/model solves the problem (e.g., the F-measure obtained). New generations of programs are iteratively created by selecting parents based on their fitness and breeding them using (independently applied) genetic operators like crossover (swapping randomly chosen parts between two parents, thus creating two offspring) and mutation (modifying a randomly chosen part of a parent, thus creating one offspring). The fitter individuals are selected more often to pass their characteristics to their offspring, so the population tends to improve in quality over successive generations. This evolutionary process continues until a given stop condition is verified (e.g., maximum number of generations, or fitness reaching some threshold), after which the individual with the best fitness is returned as the best model found.
Theoretically, GP can solve any problem whose candidate solutions can be measured and compared. It normally evolves solutions that are competitive with those developed by humans [59], and sometimes surprisingly creative ones. GP implicitly performs automatic feature selection, as selection promptly discards the unfit individuals, keeping only the ones that supposedly contain the features that warrant a good fitness. Unlike other powerful machine learning methods (e.g., Deep Learning), GP produces 'white-box' models, potentially readable depending on their size. For PPI prediction, the models evolved by GP are simply combinations of the SS of the three semantic aspects. In tree-based GP (the most common type), these models are represented as parse trees that are readily translated into readable strings. Figure 13 shows the parse tree of one of the simplest combinations evolved in our experiments, here translated as
$$ \max{(BP,CC)} \times \max{(BP,MF)} $$
(6)
where the SS aspects BP, CC and MF are the variables X0, X1, and X2, respectively. These three variables constitute what is called the terminal set in GP, as they are only admitted as terminal nodes of the trees. In contrast, the function set contains the operators that can be used to combine the variables, and can only appear in internal nodes of the trees. The function set is a crucial element in GP. Together with the fitness function and the genetic operators, it determines the size and shape of the search space.
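For illustration, the model in Eq. (6) translates directly into an executable function; the input values below are made up:

```python
def evolved_model(X0, X1, X2):
    """Eq. (6): max(BP, CC) * max(BP, MF), with X0=BP, X1=CC, X2=MF."""
    return max(X0, X1) * max(X0, X2)

print(evolved_model(0.8, 0.3, 0.6))  # 0.8 * 0.8 = 0.64
```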
Given the free-form nature of the models evolved by GP, its intrinsic stochasticity, and the size of the search space where it normally operates, there is high variability among the raw models returned in different runs, even when using the same settings and the same dataset. Even after simplification, these models normally remain structurally very different from each other, while possibly exhibiting similar behavior, i.e., returning similar predictions. This makes the GP models harder to interpret, even when they are fully readable. In any case, it is always advisable to run GP more than once for the same problem, to avoid adopting a sub-optimal model resulting from a less successful search in such a large space.
We have used a “vanilla” tree-based GP system, with no extras to boost the performance. The parameters we have set are listed in Table 8. All others were used with the default values of the gplearn software and are listed in Table S2 of Additional file 1. The parsimony coefficient is a non-standard parameter, specific to gplearn, and consists of a constant that penalizes large programs by adjusting their fitness to be less favorable for selection. It was set to 10−5, a value experimentally found to reduce the size of the evolved models without compromising their fitness. The function set contained only the four basic arithmetic operators (+,−,×, and ÷, protected against division by zero as in [60]), plus the Maximum (max) and Minimum (min) operators. Although there is a vast array of tunable parameters even in the most basic GP system, normally they do not substantially influence the outcome in terms of best fitness achieved [61].
For binary classification, it is fairly standard to use GP in a regression-like fashion, where the expected class labels are treated as numeric expected outputs (0 for no interaction; 1 for interaction), and the fitness function that guides the evolution is based on the error between the expected and predicted values [62]. We used this setup in our experiments, with the Root Mean Squared Error (RMSE) as the fitness function [63]. However, when reporting the performance of evoKGsim, we first transform the real-valued predicted outputs into class labels by applying the natural cutoff of 0.5.
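A minimal gplearn setup along these lines might look as follows. The random training data is a placeholder, and the population size and number of generations are illustrative; the actual configuration is given in Table 8 and Table S2 of Additional file 1:

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor

X_train = np.random.rand(100, 3)                       # columns: BP, CC, MF similarities
y_train = (X_train.mean(axis=1) > 0.5).astype(float)   # stand-in 0/1 interaction labels

gp = SymbolicRegressor(
    function_set=('add', 'sub', 'mul', 'div', 'max', 'min'),
    metric='rmse',                  # fitness: root mean squared error
    parsimony_coefficient=1e-5,     # penalizes large programs, as described above
    population_size=500,            # illustrative value
    generations=50,                 # illustrative value
    random_state=0,
)
gp.fit(X_train, y_train)

# Real-valued outputs are turned into class labels with the 0.5 cutoff.
y_pred = (gp.predict(X_train) >= 0.5).astype(int)
print(gp._program)                  # the evolved model as a readable expression
```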
Performance measures
The classification quality is evaluated using the weighted average of F-measures (WAF). This metric accounts for class imbalance by computing the F-measure for each class and then averaging the F-measures, weighted by the number of instances of each class:
$$ \text{WAF} = \frac{\sum_{c \in C} \text{F-measure}_{\text{c}} \times \text{Support}_{\text{c}}}{\sum_{c \in C}\text{Support}_{\text{c}}} $$
(7)
where C is the set of classes, F-measurec is the F-measure computed for class c, and Supportc is the number of instances in class c.
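Eq. (7) corresponds to scikit-learn's weighted-average F-measure; a minimal check with made-up labels:

```python
from sklearn.metrics import f1_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
# average='weighted' computes the per-class F-measure and weights it by support,
# exactly as in Eq. (7).
waf = f1_score(y_true, y_pred, average='weighted')
print(round(waf, 3))
```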
In each experiment, we perform stratified 10-fold cross-validation, using the same folds throughout all experiments. At the end of each fold, we evaluate the WAF on the respective test set, and we report the median over the 10 folds.
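The evaluation protocol can be sketched as follows; `make_model` is a hypothetical factory standing in for a configured GP run (e.g., the SymbolicRegressor above):

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def evaluate(X, y, make_model, seed=0):
    """Stratified 10-fold CV; returns the median test-set WAF across folds."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    wafs = []
    for train_idx, test_idx in skf.split(X, y):
        model = make_model().fit(X[train_idx], y[train_idx])
        y_pred = (model.predict(X[test_idx]) >= 0.5).astype(int)
        wafs.append(f1_score(y[test_idx], y_pred, average='weighted'))
    return np.median(wafs)
```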