Introduction

Graphical networks can depict many complex systems involving biological, social and informational connections between entities. At the most abstract level, these networks are modelled by graphs in which nodes represent individuals or agents and links denote the interactions or relationships between nodes. Structural properties of biological networks are of great interest as they directly correlate with biological function (Qi and Ge 2006; Wuchty et al. 2003). Various attempts have been made to understand the topological evolution of networks (Albert and Barabási 2002; Dorogovtsev and Mendes 2002). The evolution of networks involves two processes: i) the addition or deletion of nodes and ii) The addition or deletion of edges (links) between nodes. The second process of topological evolution particularly when new connections are added to the existing network has not yet been concretely formalised and revolves around the link-prediction problem. Many applications utilize link prediction to identify new links in large, sparse networks armed only with knowledge of network topology. Therefore, improvements in link prediction accuracy will be of great significance in both science and engineering applications. Meanwhile, link-prediction also reflects the extent to which the evolution of a network can be modelled by topological features intrinsic to the network itself.

The link prediction task can be stated as follows: given a network, or a graph, predict what edges will form between nodes in the future. Alternatively, in domains where data collection is costly and the resulting networks are noisy and incomplete, link prediction can be used to identify unobserved edges. In such cases, the problem is also known as the missing link problem.

The objective of this work is to better identify undiscovered (missing and suspicious) links between pairs of nodes in a protein-protein interaction (PPI) network. Link prediction uses the existing protein interaction topology to predict missing links. Discovery of links in biological networks such as gene networks, protein-protein interaction networks, metabolic networks etc. are very costly and time-consuming if done via laboratory experiments and hence the known connections within these networks remains largely incomplete (Martinez et al. 1999; Sprinzak et al. 2003). Instead of identifying links between all possible pairs of nodes, predictions that focus on already known interactions and are accurate enough can sharply reduce the experimental costs. Discovering protein protein interactions is a pivotal task for understanding the underlying biological processes behind tasks such as protein function prediction, drug delivery control and disease diagnosis.

Researchers have formulated link prediction as a binary classification problem, where class labels indicate the presence or absence of a link (referred to as positive links or negative links respectively) between pairs of nodes in the network. In this approch, features based on network topology such as common neighbors, Jaccard coefficient, etc. of the two nodes under consideration are fed to the classifier which predicts the presence or absence of a link. This paper also formulates the link prediction problem as a binary classification problem based on topological features, with a view to improve classification performance. Recently it was found that local community-based features were most effective for link prediction in biological networks both in monopartite (Cannistraci et al. 2013b) and bipartite networks (Daminelli et al. 2015). Therefore, we have included these in our feature list for the PPI network datasets under consideration.

An unresolved issue with formulating the link prediction problem as a classification problem is the label noise present in the training data. Typically, a set of positive and negative links are randomly chosen from the existing graph and are used for training a classifier, which is then used to predict links on the remaining network. However, the absence of a link in the network does not necessarily mean it is a negative link; it may be the case that the link exists but is undiscovered (as commonly occurs with PPI networks). Therefore, to include this pair of nodes in the training data as a negative link may introduce label noise and bias the resulting classifier. In this paper, we claim that by using anomaly detection on the negative links of the training data, and by subsequently filtering out the detected anomalous negative links from training data, we can obtain better classifiers that yield superior link prediction performance. The suggested approach is evaluated on five different PPI networks, with four different classifiers. A comparison with classification with and without anomaly detection is provided and results demonstrate that utilizing anomaly detection for filtering suspicious negative links yields superior classifier performance on test data.

Related work

General purpose neighborhood based methods have been proposed for link prediction in different kinds of networks: collaboration, social, citation, roadmaps, etc. (Liben Nowell and Kleinberg 2007; Zhou et al. 2009). Various bio-inspired methods were created to either assess reliability of interactions in PPI networks such as Interaction Generality (IG1) (Saito et al. 2002), IG2 (Saito et al. 2003) and IRAP (Chen et al. 2005) or predict protein function such as the Czekanowski-Dice Dissimilarity (CDD) (Brun et al. 2003) and FSW (Chen et al. 2006). Later, these techniques were applied to protein interaction prediction (Cannistraci et al. 2013b; Chua et al. 2006). Both approaches rely on the number of neighbors that two non-directly connected nodes have and assign a likelihood score to this pair of nodes.

The simplest techniques are Jaccard’s coefficient (Jaccard 1912), Common Neighbors and Preferential Attachment (Newman 2001). Jaccard’s coefficient assigns higher likelihood scores to the node pairs for which the set of common interactors as a proportion of all available neighbors is higher and Common Neighbors does the same for pairs of nodes that simply share more interactors. Preferential Attachment, on the other hand, gives high scores when both nodes have a large number of neighbors: if one of the nodes has a low number of interactors, the score is reduced. In contrast, Adamic and Adar (2003) and Resource Allocation (Ou et al. 2007) are two similar indices that give more importance to Common Neighbors with low degree.

Various other methods have been proposed to assess the reliability of high-throughput protein interaction data. In 2009, Kuchaiev et al. (2009) proposed a method for geometric denoising of PPI networks. Cannistraci et al. in (2010) proposed topology-based link prediction method using minimum curvilinear embedding. In 2013, Cannistraci et al. (2013a) proposed a new valid variation of minimum curvilinear embedding, named non-centred minimum curvilinear embedding. Alanis-Lobato et al. in (2013) utilized several measures for the proximity of genes based on the common neighborhood structure of a GI network. However these methods do not explicitly utilize a classification based approach to the problem of identifying missing interactions.

Hasan et al. (2006) formulated the link prediction problem into binary classification problem. The method extracted a set of topological features of the network as input for supervised learning for link prediction. A binary classification approach integrated information from multiple measures to get a better prediction. In 2011 Fire M et al. (2011) utilized topological features for supervised learning, and ranked the importance of each feature. They proposed a set of simple, computationally efficient topological features that could be analyzed to identify missing links. In 2013 Cannistraci et al. (2013b) proposed a new paradigm to support link formation called the Local Community Paradigm (LCP), which emphasizes the role of the local network community structure in link formation. They proposed local community-based Cannistraci features for link-prediction in PPI networks. Yu et al. (2006) in 2006 predicted missing links in PPI networks by completing defective cliques. Some methods have been reviewed in Lü and Zhou (2011) and some have been successfully applied for link detection in PPI networks.

Several anomaly detection techniques have been proposed for detecting outlier nodes, edges or substructures in graph data. The techniques may broadly be classified as: i) Feature-based approaches which utilize structural graph-centric features for outlier detection in the constructed feature space. Essentially, these methods transform the graph anomaly detection problem to the well-understood outlier detection problem (Akoglu et al. 2010; Henderson et al. 2011). ii) Proximity-based approaches that exploit the graph structure to measure closeness (or proximity) of objects in the graph. These methods capture the simple autocorrelation between these objects, where similar objects are likely to belong to the same class (Jeh and Widom 2002; Brin and Page 1998). iii) Community-based approaches that utilize clustering methods for graph anomaly detection and rely on finding densely connected groups of ’close-by’ nodes in the graph to discover anomalies that have connections across communities (Chakrabarti 2004; Sun et al. 2005; Tong and Lin 2011). iv) Relational learning based approaches consist of network-based collective classification algorithms, the main idea of which is to exploit the relationships between the objects to assign them into classes, where the number of classes is often two: anomalous and normal (Getoor et al. 2001; Jensen et al. 2004). Further details on these approaches can be found in a thorough survey (Akoglu et al. 2015). In this paper we use feature-based anomaly detection techniques to discover suspicious negative links, thereby reducing the impact of label noise introduced by assigning undiscovered positive links to the class of negative links in the training data.

Materials and methods

Network Datasets

We used four protein-protein interaction (PPI) network datasets: Caenorhabditis elegans, Mus musculus, Arabidopsis thaliana and Rattus norvegicus. These are publicly available and were collected from the Protein Interaction Network Anaysis (PINA) platform. The platform integrates data from six curated databases and builds a complete, non-redundant dataset for the model organisms 1. Since, only interactions reported across multiple datasets were considered after careful curation, in this paper we assume that the reported interactions are relatively noise free. A brief summarization of the nework characteristics is provided in Table 1:

Table 1 Network datasets

Methods

Our objective was to minimize the classification bias arising due to currently undiscovered edges (positive links) being incorrectly labeled as negative links. To address this bias, we use anomaly detection for removing suspicious negative links (which may be undiscovered positive links) from the training set before classifier training. Finally, we train a link classifier on the filtered dataset after removal of these detected suspicious negative links. Since we focused on predicting links based only on network topology, we extracted a set of features for node-pairs (edges) from the corresponding PPI network with the goal of developing a network topological feature-based classifier. We then performed supervised learning, using different machine learning classifiers. The network topology based features utilized for classification are described here.

Topology-based Measures

We briefly describe the set of topology-based measures or features that were used during our experiment. A graph theoretic approach is used to model the protein-protein interaction as a network. In this method, a PPI network is represented by an undirected graph G=(V,E), with a set of nodes or vertices V and a set of links or edges E, where vertices represent proteins and edges represent interactions between proteins respectively. In this paper, G will always be an un-weighted, undirected graph. Graphs can be characterized by many different topology-based measures, each one reflecting some particular traits of the studied structure. The topology-based measures were chosen based on their successful application in prior work on link prediction (Cannistraci et al. 2013b; Fire et al. 2011; Zhou et al. 2009).

Node-based measures: Let N(v) denote a neighborhood (or open neighborhood) of a node v in a graph G. N(v) is the set of all the nodes adjacent to v. The closed neighborhood of a node v, denoted by N[ v] is simply the set {v}∪N(v). The Formal definitions of neighborhoods that were used in this study to extract topological measures are:

$$ \begin{aligned} N(v) &= \{ u \mid (u,v) \in E \; or \; (v,u) \in E\}\\ N[\!v] &= N(v) \cup \{v\} \end{aligned} $$
(1)

Based on the above definition, neighborhood-subgraph of v which induced by the neighborhoods of v are defined as:

$$ \begin{aligned} nbhd-subgraph(v)&=\{(x,y) \in E \mid x,y \in N(v)\}\\ nbhd-subgraph[\!v]&=\{(x,y) \in E \mid x,y \in N[\!v]\} \end{aligned} $$
(2)

Note that the Induced subgraph of the open and closed neighborhoods of a node are very different with respect to their topological properties.Following measures for a node are created using the above neighborhood definitions:

  • Node degree : The degree of a node in a network is the number of links the node has to other nodes. For an undirected network, degree of a node is defined as:

    Let vV and

    $$ deg(v)= |N(v)| $$
    (3)
  • Node subgraphs: This measure denotes the number of links within the open and closed nbhd-subgraphs for each node v, which is defined as:

    $$ \begin{aligned} subgraph-edge-no(v) &= |nbhd-subgraph(v)|\\ subgraph-edge-no[\!v] &= |nbhd-subgraph[\!v]| \end{aligned} $$
    (4)

    Density of subgraph is defined as:

    $$ \begin{aligned} density-nbhd-subgraph(v)&= \frac {deg(v)} {nbhd-subgraph(v)}\\ density-nbhd-subgraph[\!v]&= \frac {deg(v)} {nbhd-subgraph[\!v]} \end{aligned} $$
    (5)

Note that the formal density of a graph is defined differently, however, the aim of this feature and all other features used in the paper is to be as straight forward and simple as possible. Therefore, we used a somewhat different density that is more related to a vertex v.

Edge-based measures: Let u,vV where u,vE. Using the neighborhoods of u and v we extract various measures. These measures help to determine the likelihood that a link between u and v exists.

  • Common-Neighbors (CN): The common neighbors (CN) of u and v refers to the number of common neighbors of u and v. Two vertices u and v are more likely to connect if they have bigger number of common neighbors. It is defined as Newman (2001):

    $$ CN(u,v)=|N(u) \cap N(v)| $$
    (6)
  • Total-Neighbors (TN): The total neighbors (TN) of u and v measure the number of distinct neighbors of u and v. which refers to the total number of neighbors u and v have together. The formal definition of TN is:

    $$ TN(u,v)=|N(u) \cup N(v)| $$
    (7)
  • Jaccard’s Coefficient (JC): Jaccard’s coefficient (JC) normalizes the size of common neighbors by total neighbors. This gives higher weight to those pairs of nodes which share a higher proportion of common neighbors relative to the total number of neighbors they have. The formal definition of JC is (Jaccard 1912):

    $$ JC(u,v)=\frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|} $$
    (8)
  • Adamic-Adar Coefficient (AA): This metric refines the simple counting of common neighbors by assigning higher likelihood scores to neighbors that are not shared with many others. It is defined as (Adamic and Adar 2003):

    $$ AA(u,v)=\sum_{z \in N(u) \cap N(v)} \frac{1}{log(|N(z)|)} $$
    (9)
  • Resource allocation Coefficient (RA): The RA coefficient and AA coefficient have very similar forms the only difference being that the RA coefficient punishes the high degree common neighbors more heavily than the AA coefficient. It is defined as (Ou et al. 2007):

    $$ RA(u,v)=\sum_{z \in N(u) \cap N(v)} \frac{1}{|N(z)|} $$
    (10)
  • Preferential Attachment (PA): This measure assigns higher likelihood scores to those pairs of nodes for which one or both nodes have a high degree. The formal definition of PA is (Newman 2001):

    $$ PA(u,v)=|N(u)|.|N(v)| $$
    (11)
  • LCP-based measures and Cannistraci variants: The local community paradigm suggests that two nodes are more likely to link together if their common-first-neighbors are members of a strongly inner-linked cohort or local-community. The Cannistraci (LCP-based) variants of classical neighborhood methods (CN, PA, AA, RA, JC) are defined as (Cannistraci et al. 2013b):

    $$\begin{array}{@{}rcl@{}} CAR(u,v)= CN(u,v).LCL(u,v) =CN(u,v).\sum_{z \in N(u) \cap N(v)}\frac{|\gamma(z)|}{2} \end{array} $$
    (12)
    $$\begin{array}{@{}rcl@{}} CPA(u,v)= e_{u}.e_{v} + e_{u}.CAR(u,v) + e_{v}.CAR(u,v) + CAR(u,v)^{2} \end{array} $$
    (13)
    $$\begin{array}{@{}rcl@{}} CAA(u,v)=\sum_{z \in N(u) \cap N(v)}\frac{|\gamma(z)|}{{log}_{2}(|N(z)|)} \end{array} $$
    (14)
    $$\begin{array}{@{}rcl@{}} CRA(u,v)=\sum_{z \in N(u) \cap N(v)}\frac{|\gamma(z)|}{|N(z)|}\!\!\qquad\qquad\qquad\qquad\qquad\quad \end{array} $$
    (15)
    $$\begin{array}{@{}rcl@{}} CJC(u,v)=\frac{CAR(u,v)}{|N(u) \cup N(v)|}\!\!\qquad\qquad\qquad\qquad\qquad\quad \end{array} $$
    (16)

    Where γ(z) refers to the sub-set of nodes in the neighborhood of z that are also common neighbors of of u and v, thus |γ(z)| is the local community degree of z; e u refers to the external degree of u, and is computed considering the nodes in the neighborhood of u that are not common neighbors of u and v.

  • Friends Measure (FM): Friend Measure (FM) of u and v measures the total number of links between the neighborhoods of u and v. Here we assume that two nodes have higher chance to get connected if their neighborhoods have more links with each other. The formal definition of FM is (Fire et al. 2011):

    $$ FM(u,v)=\sum_{x \in N(u)} \sum_{y \in N(v)} \delta(x,y) $$
    (17)

    Where

    $$ \delta(x,y)= \left\{\begin{array}{cl} 1 & if\ x=y \;or\; (x,y) \in E \;or\; (y,x) \in E\\ 0 & otherwise \end{array}\right. $$
    (18)

Edge Subgraph-based measures: The following subgraphs are defined by using the neighborhoods definitions (Fire et al. 2011):Let u,vV

$$ \begin{aligned} nbhd-subgraph(u,v)=\{(x,y) \in E \mid (x,y \in N(u) \cup N(v))\}\\ nbhd-subgraph[\!u,v]=\{(x,y) \in E \mid (x,y \in N[\!u] \cup N[\!v])\} \end{aligned} $$
(19)

The above subgraph equations contain information about the number of links between the neighborhood of u and v including the inner connections or links between each node neighborhood. The following subgraph equation represents the inner-connection subgraph:

$$ \begin{aligned} inner-subgraph(u,v)=\{(x,y) \in E \mid (x \in N(u) \; and \; y \in N(v)) \; or \; \\ (y \in N(u) \; and \; x \in N(v))\} \end{aligned} $$
(20)
  • Edge Subgraphs Edges Number: This measure counts the number of links in the above subgraphs:

    $$ \begin{aligned} |nbhd-subgraph(u,v)|\\ |nbhd-subgraph[\!u,v]|\\ |inner-subgraph(u,v)| \end{aligned} $$
    (21)

In this study, we extracted a total of 25 features for each PPI network.

Anomaly detection

We attempt to apply multiple anomaly detection techniques such as Parzen Windows, Principal Component Analysis (PCA), Nearest Neighbor (a distance-based method) and a one-class Gaussian process for removing anomalous negative links from the training data. The details of these methods can be found in (Clifton 2007; 2009; Pimentel et al. 2014). We utilize the link prediction feature set for training the anomaly detector, described in an earlier subsection. Note that all the methods presented below require only normal data for training, however abnormal data is used for validating the models. In that sense the methods below may be considered unsupervised. as these methods do not require anomalous data for training. After experimentation, we found that the Gaussian Process based anomaly detection gave the most reliable results. Hence, we chose the Gaussian Process model as our anomaly detector for our classification experiments. Next, we present a brief introduction to all of the methods considered.

Parzen window method: The Parzen window kernel density estimator method (Parzen 1962) is the model adopted here to estimate the probability density function (pdf), p(x), for the training (normal) data. With this method (Bishop 2006), p(x) is estimated using the following steps:

  1. 1.

    Locate a hyperspherical Gaussian window, or kernel, with width σ, on each of the D-dimensional feature vectors in the training dataset, x i , where i = 1, …, N.

  2. 2.

    Evaluate the sum of the Gaussian distributions using the squared Euclidean distances between the test feature vector x and the training vectors x i , normalized by a factor that ensures p(x) integrates to 1.

This gives the following formula for the estimate of p(x):

$$ p(x)=\frac{1}{N(2\pi)^(D/2)(\sigma)^{D}} \times exp\left(-\frac{|| x - x_{i}||^{2}} {2.\sigma^{2}}\right) $$
(22)

By placing a Gaussian kernel over each feature vector x i in our training dataset, we construct a probability density estimate of p(x) that will have a higher value of p where the concentration of training data is greatest. Points in the test set with values of p(x) are classified as anomalies.

PCA method: PCA is an orthogonal transformation for transforming the raw data into a space such that the new basis vectors (principal components) are linear combinations of the original basis vectors, are linearly uncorrelated and correspond to the directions of maximal variance of the data, where the first principal component is in the direction of the highest variance, the second in the direction of the highest remaining variance and so on. Anomaly detection is performed with PCA under the assumption that normal data would be best explained by looking at the first few principal components whereas abnormal data would be captured by the remaining principal components (Bishop 2006; Chiang et al. 2001; Marsland 2003). Thus points in the data that have high coefficients for the last few principal components would correspond to anomalous data.

Nearest neighbor method: These approaches rely on the intuition that normal points will have normal neighbours in their vicinity and abnormal points would conversely have fewer normal points in their neighborhood (Hautamäki et al. 2004). Assuming that normal data is partitioned into clusters, the Novelty score z(x) of a data point x for some cluster width σ k k is given by:

$$ z(x)= \frac{1}{\sigma_{k}} \lVert x, \mu_{k} \rVert_{2} $$
(23)

where,

$$\sigma_{k} = \sqrt{\frac{1}{N_{k}} \sum_{j \in k} (x_{j} - \mu_{k})^{2}}$$

and μ k is the centre of cluster k, and σ k is defined to be the standard deviation of intra-cluster distances (Clifton 2009). Now, points with high Novelty scores for all clusters are regarded as anomalous.

Gaussian process: Given a training set \(D =\{(x_{i},y_{i})\}_{i=1}^{n} =(X,y)\) where x i XR d denotes feature vector and y denotes a scalar output or target. We are interested in identifying the target y for a new sample x . The objective of regression is to find the association between inputs x and target y. To identify the association between the input and target, we modelled the mapping in terms of y=f(x)+ε, where f is an unknown function, and ε denotes a noise term. To do this, one approach is to assume that f is a parametric function f(x;θ) where the parameters θ are tuned based on the training data. But, the major pitfall of this kind of approach is that, if in case, a wrong form of the function is chosen, it can lead to poor predictions. Another approach, based on Gaussian process takes care of this problem by assigning a priori probability to all possible functions, which are more likely to be sampled. The process is based on the assumption that these functions are drawn from a specified probability distribution. This method requires a training set and may be considered supervised.

The core of GP regression lies in the selection of a prior probability distribution over latent function which are sampled from a Gaussian process i.e., \(f \sim \mathcal {GP} (m(x),\kappa (x,x'))\). Where m(x) and κ(x,x ) are mean and covariance function respectively. Without any prior knowledge about the underlying data, the most common choice is to choose a GP with mean zero. Gaussian Process can be described as a generalization of multivariate Gaussian distribution, where the dimensions can extend to infinity. The latent function f is said to follow a Gaussian process, if and only if every finite subset of function values is multivariate Gaussian distributed. Therefore, the function values f obey the model below:

$$ f \mid X \sim \mathcal{N} (m(X),\kappa(X,X)) $$
(24)

Furthermore, we assume the noise ε to be Gaussian distributed with mean zero and standard deviation σ n i.e., \(\epsilon \sim \mathcal {N} (0,\sigma _{n}^{2})\). As a result, now output value y for test sample x can be deduced in a Bayesian manner by marginalizing over latent function f. Given training data D, the predictive distribution of y is normally distributed i.e.,

$$ y_{*} \mid D,x_{*} \sim \mathcal{N} (\mu_{*},\sigma_{*}^{2}) $$
(25)

Where moments μ and \(\sigma _{*}^{2}\) can be given in closed form expressions. More details about GP framework, can be found in (Williams and Rasmussen 2006).

In 2010, Kemmler et al. (2010) have shown how GP regression can be employed for one-class classification problems. They proposed using both the predictive mean μ (GP-Mean) and negative variance \(-\sigma _{*}^{2}\) (GP-Var) as one-class scores applied to training data with labels y = 1:

$$ \mu_{*} = k_{*}^{T}(K + \sigma_{n}^{2}.I)^{-1}.1 $$
(26)
$$ - \sigma_{*}^{2} = -(k_{**} - k_{*}^{T}(K + \sigma_{n}^{2}.I)^{-1}k_{*} + \sigma_{n}^{2}) $$
(27)

Where K=κ(X,X) denotes the kernel matrix of the training set, k =κ(X,x ) represents the vector of kernel values between training set and test input and k ∗∗=κ(x ,x ) is the kernel values of the test input. The correlation of function values using the similarity of input samples are calculated by the radial basis function (rbf): \(\kappa (x,x')=exp\left (-\frac {|| x - x^{'}||^{2}} {2.\sigma ^{2}}\right)\).

Experimental setup

Since the number of known links are few, we oversample the positive links in the datasets to generate sufficient positive links from each network when required. The set of negative links is much larger and it has been shown that the subset sampling method used to generate the negative training links impacts the performance of the resulting classifier (Yu et al. 2010). Two predominant sampling methods have been proposed for the negative set sampling in PPI networks, namely balanced random sampling and simple random sampling. In simple random sampling care is taken to ensure the proteins in the positive set must also appear in the negative set. In balanced random sampling the proteins must occur with the same frequency in both sets. It has further been shown that protein pairs with higher number of common neighbours are more likely to interact (Alanis-Lobato 2015), therefore by choosing non interacting pairs within 2 hops of each other we are in effect constructing a negative set that is harder to classify. To ensure no bias is introduced due to the sampling method we experimented with both balanced random and simple random sampling for choosing our negative set.

Figure 1 shows the workflow of the proposed method. The methodology is configurable into two phase. In the first phase, we construct a dataset to train the anomaly detector to filter out the anomalous negative links from training data of each PPI network. In the second phase, we construct another dataset (disjoint from the dataset in Phase-I) to train a classifier to classify a pair of nodes as a positive link or a negative link for each PPI network. To this end, we construct first and second phase as follows:

Fig. 1
figure 1

Experimental setup overview. a Training and Testing of Anomaly Detection Methods. b Classifiers prediction on the unfiltered and filtered dataset

Phase-I: In this phase, First, we construct the dataset to train the anomaly detector. Then, we train and evaluate performance of different anomaly detection methods on each PPI network.

  1. 1.

    We extract positive links from the network, and divided them into a validation and test set in the ratio 50:50. Note that positive links are not used for training the anomaly detector but only for validation.

  2. 2.

    We extract negative links from the network, such that the vertices are within two hops of each other. These are divided into training, validation and test set in the ratio 60:20:20.

  3. 3.

    Topological features are extracted for the above training, validation and test sets.

  4. 4.

    We train and evaluate different anomaly detection methods and select the best performing anomaly detection method from these methods. We use this trained anomaly detector model in phase-II.

Phase-II: In this phase, We construct another dataset to train the classifiers. Then, we train and evaluate the performance of different classifiers on each PPI network.

  1. 1.

    We extract positive links from the network, and divided them into a training set, validation set and a test set in the ratio 60:20:20.

  2. 2.

    In order to introduce synthetic noise we mislabel a fraction of the positive links and assign them labels corresponding to negative links.

  3. 3.

    We extract negative links from the network using simple random sampling, such that the vertices are within two hops of each other.

  4. 4.

    The mislabelled negative links generated in step 2 are merged with the negative links in the training set from step 3 to allow for creation of a noisy dataset with synthetic ground truth. This is divided into a training, validation and test set in the ratio 60:20:20 such that the positive and negative datasets are balanced.

  5. 5.

    Topological features are extracted for the above training and test sets. We call this the unfiltered training dataset.

  6. 6.

    Next we generate a filtered version of the dataset using anomaly detection to filter out the noisy negative links we had generated in step 4.

  7. 7.

    We evaluate different machine learning classifiers on both the filtered and unfiltered training datasets.

  8. 8.

    Prediction accuracy is compared across classifiers trained on the filtered vs. the unfiltered dataset.

Results

Performance evaluation of different anomaly detector

We trained different anomaly detection methods on the training set of each PPI network and measured the performance on corresponding test set, where the training and test sets are constructed as described in Phase-I of the previous subsection. We trained the anomaly detection methods on negative links (normal class) only and utilized the positive links (abnormal class) for validation and test purpose. Each method yields an anomaly score on the validation set and a threshold is chosen for detection based on minimizing the false positive and false negative rate on the validation set. We report accuracy metrics on the test set using the chosen optimal threshold in Table 2. We notice that One Class Gaussian Process (gpoc) anomaly detection technique has a better score than the other anomaly detectors. Since this was consistent across the datasets, hence in this paper we used One Class Gaussian Process technique for anomaly detection.

Table 2 Anomaly detection techniques comparison under different metrics

Now we focus our attention on the dataset described earlier for Phase-II. We apply the anomaly detector trained above only on negative links of noisy Phase-II training set. We validate the performance of the GP anomaly detector (one class gaussian process) using True Positive Rate (TPR) and True Negative Rate (TNR). Results are provided in Table 3. The TNR is slightly lower than TPR because of uncertain labels of negative links i.e. some of the negative links may be positive links and may be detected as outliers.

Table 3 Anomaly detector performance measure

Gene Ontology (GO) validation of anomaly detection

We also elucidated the biological significance of anomaly detection using the Gene Ontology (GO) scores of protein pairs in different PPI networks. Since proteins which are involved in the same biological function or share the same biological pathway are more likely to interact with each other compared to proteins which belong to other pathways, hence this statistics is a better measure to test the quality of our prediction. We calculated the GO score corresponding to each of the gene ontology classes i.e. biological process, cellular components and molecular functions of protein pairs using the Protein Interaction Network Analysis Platform (PINA) 2. As we saw in Table 3 TNR is lower than the TPR this is because our anomaly detector extracts some non interacting protein pairs as anomalies. These may be undiscovered interactions and to validate this hypothesis we look at the GO scores of these anomalous protein pairs. In Mus musculus, the anomaly detection extracts 1360 proteins pairs as anomalies out of which 612 protein pairs have a GO score greater than 0.5, and out of these 268 protein pairs were found to interact with different public databases. In Rattus norvegicus, 707 protein pairs were extracted out of which 543 protein pairs had GO scores greater than 0.5. In Caenorhabditis elegans, 2826 protein pairs were extracted as anomalies out of which 1254 protein pairs had GO scores greater than 0.5. Thus, a high proportion of the protein pairs filtered by the anomaly detection technique outlined in this paper appear to have significant GO scores and may potentially have undiscovered interactions. We further validated the discovered anomalies against the Negatome Database which contains experimentally supported non-interacting protein pairs. On matching the resuts not a single anomalous negative link discovered by the anomaly detector was found to lie in the Negatome database, further validating the fact that the interactions discovered by the anomaly detector.

Performance evaluation of different classifiers

After we removed the anomalies and generated the two training sets before and after filtering out the suspicious negative links, we trained four standard classifiers on both training sets. We evaluated the different machine learning classifiers (SVM, C5.0, KNN and Naive Bayes) on each PPI network. We used three standard metrics Accuracy, F-measure and Area Under the ROC curve (AUC) to measure the performance of each classifier. The F-measure indicates the trade off between precision and recall score of a classifier for a particular threshold setting whereas the AUC is independent of the threshold. It is an evaluation of the classifier as threshold varies over all possible values. In evaluation terminology, we denote the set of true positives as TP, the set of true negatives as TN, the set of false positives as FP, the set of false negative as FN. Various evaluation metrics is defined as:

$$ Recall \quad or \quad True \enspace Positive \enspace Rate \quad or \quad Sensitivity = \frac{TP}{TP+FN} $$
(28)
$$ True \enspace Negative \enspace Rate \quad or \quad Specificity = \frac{TN}{TN+FP} $$
(29)
$$ Precision = \frac{TP}{TP+FP} $$
(30)
$$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$
(31)
$$ F-measure=\frac{2*Precisions*Recall}{ Precisions + Recall} $$
(32)
$$ False \enspace Positive \enspace Rate = \frac{FP}{FP+TN} $$
(33)
$$ False \enspace Negative \enspace Rate = \frac{FN}{FN+TP} $$
(34)

As mentioned earlier we experiment with both simple random and balanced random sampling for constructing our negative set, please refer to Tables 4 and 5 for a comparitive analysis. Results indicate that the accuracies of the models do not change significantly using either approach, so for all further experiments we chose simple random sampling as our sampling method.

Table 4 Classification comparison Under different metrics using simple random sampling
Table 5 Classification comparison Under different metrics using balanced random sampling

We repeat our experiments ten times by randomly selecting the training and test set to remove any statistical bias. Then we used the t-test for validating the statistical significance of differences between the Accuracy scores obtained with and without anomaly detection. In all cases barring that of the C5.0 classifier for the Arabidopsis thaliana dataset, we get a p-value < 0.0001 which shows that this difference may be considered to be extremely statistically significant. It can be seen that naive Bayes classifier is a weak classifier without anomaly detection technique, but improves most significantly after using anomaly detection technique. The reason for this is that the Naive Bayes classifier has more room for improvement after filtering the data via anomaly detection due to prior poor performance. The remaining classifiers SVM, C5.0, and KNN exhibit good classification performance before anomaly detection, but these too show significant performance improvement when anomaly detection is used for filtering the training set. The sole exception is the C5.0 classifier which yields 99.36 % accuracy on the Arabidopsis thaliana dataset without anomaly detection and therefore has very little margin for improvement after anomaly detection (this is marked with a * in Table 4). A comparison of the accuracies of the different classifiers on each PPI network is shown in Fig. 2. The results shown in Table 4 illustrate the classification performance measures in terms of Accuracy, F-measure, and AUC. In Table 4, we can see that all three performance measures (Accuracy, F-measure, AUC) improve with anomaly detection. Thus, it appears that classification performance improves after filtering via anomaly detection.

Fig. 2
figure 2

Accuracy Comparison of different classifiers with and without anomaly detection technique using simple random sampling

Feature importance

In order to understand the contribution from each feature for link prediction in the PPI network, we comparatively analyzed the predictive power of the features. To measure the relative importance of different features, we analysed the information gain with respect to each feature. Information gain is based on the decrease in entropy after a dataset is split on an attribute. An attribute with highest information gain is selected for the split. We obtained the information gain of an attribute as follows: Information gain=(Entropy of distribution before the split) - (entropy of distribution after the split)

Where, entropy of a discrete probability distribution p on the countable set {x 1,x 2,x 3,...}, with p i =p(x i ), is defined as:

$$ h(p)= - \sum_{i \geq 1} p_{i}*log(p_{i}) $$
(35)

By comparing the entropy before and after the split, we obtain a measure of information gain (Han and Kamber 2006). Now, we ranked all the features based on its information gain. Table 6 presents the Information gain on the training sets of all the PPI networks. It can be seen that Common-Neighbors, Adamic Adar Coefficient, Resource allocation Coefficient, Cannistraci-based Preferential Attachment, Friends Measure, number of links in inner-subgraph and number of links in neighborhoods-subgraph are higgly influential for almost all of the PPI networks. In PPI networks, we know that proteins that form complexes display common functions. So, if proteins A, B, and C share the same function and protein A interacts with B and C, it is very probable that B and C also interact. Thus it is expected that Common-Neighbors would be an influential feature for link prediction in PPI networks (Alanis-Lobato 2015). We also know that proteins which are grouped together into cliques and quasi-cliques in PPI networks share identical functions and hence have greater probability of link formation in a densely connected group of proteins. The number of links in inner-subgraph and in neighborhood-subgraph are thus also highly influential features for link prediction in PPI networks. It is also noteworthy that more nuanced neighbor counting features like Adamic-Adar and Resource Allocation are more predictive than features that rely only on the number of common neighbors.

Table 6 InfoGain values of different features for all PPI networks

Performance evaluation across datasets

To evaluate the generalization of our method across different datasets we conducted experiments where a model trained on one dataset is tested on all the other datasets. Tables 7, 8, 9 and 10 tabulate the results for models trained on Arabidopsis thaliana, Caenorhabditis elegans, Mus musculus and Rattus norvegicus datasets respectively. The results demonstrate that for the most part the models derived from topological features on one dataset using anomaly detection show a gain over models learned without anomaly detection. The one exception is the translation of performance to C elegans which suggests that this network may be topologically somewhat different from the rest. Interestingly, the models trained on C elegans with anomaly detection do exhibit a strong gain in performance on other datasets. Overall though the method does seem to translate well across datasets.

Table 7 Results for Classifier trained on Arabidopsis thaliana and tested on remaining datasets
Table 8 Results for Classifier trained on Caenorhabditis elegans and tested on remaining datasets
Table 9 Results for Classifier trained on Mus musculus and tested on remaining datasets
Table 10 Results for Classifier trained on Rattus norvegicus and tested on remaining datasets

Discussion

This paper presents a technique for filtering graphical link training data by using anomaly detection for the purpose of link prediction in PPI networks. The performance of the resulting predictor compares favourably with the classifier trained on unfiltered data. The central idea is to have a filtering step before the classification step where suspicious links are removed from the training data. One issue that needs emphasis here is that the choice of anomaly detection technique plays a critical role in the success of the resulting classifier. If the anomaly detector is inaccurate, then the classifier may not yield optimum performance. One way to ascertain the efficacy of the anomaly detection is by deliberately mislabeling the positive links and checking if the anomaly detection algorithm can detect them, which is how we have selected our Gaussian Process algorithm. Additionally, this technique shows most improvement in performance when the link prediction accuracy is not particularly high before filtering, as this allows for greater room for classification improvement. Additionally, this technique needs to be extended to link prediction in networks with directed edges (metabolic networks), weighted edges (neural networks). While the given technique is useful for detecting missing links more efficiently, it may have to adapt to work for evolving networks where the links are constantly changing. These ideas are the focus of our future work.

Endnotes

1 Downloaded from: http://cbg.garvan.unsw.edu.au/pina/interactome.stat.doon February 10, 2015.

2 http://cbg.garvan.unsw.edu.au/pina/interactome.goSimForm.do