Materials
The protein interaction and gene expression data used in this study were obtained from the PPIN and gene expression data sets described below.
The protein–protein interaction network (PPIN)
We used the human PPIN from the protein interaction network analysis database (PINA, http://cbg.garvan.unsw.edu.au/pina/interactome.stat.do, version October 2011; Online Resource 1; Wu et al. 2009). PINA is an integrated platform of PPIN data extracted from six public databases: IntAct, MINT, BioGRID, DIP, HPRD, and MIPS/MPact. It includes self-interactions, interactions predicted by computational methods, and interactions between human proteins and proteins from other species. Moreover, it has recently been used in similar studies (Xia et al. 2011; Laakso and Hautaniemi 2010).
Besides the PINA network, we used two additional PPINs to check that similar outcomes were obtained. The first was the Human Protein Reference Database (HPRD, http://www.hprd.org/download, version April 2010), which contains pairs of interacting human proteins based on experimental evidence from the literature and has been used in several studies (Teschendorff and Severini 2010; West et al. 2012). The second was the Human Integrated protein–protein interaction rEference (HIPPIE, http://cbdm.mdc-berlin.de/tools/hippie/download.php, version September 2014), which provides a human PPI dataset with a normalized scoring scheme, integrating data from HPRD, BioGRID, IntAct, MINT, Rual05, Lim06, Bell09, Stelzl05, DIP, BIND, Colland04, Lehner04, Albers05, MIPS, Venkatesan09, Kaltenbach07 and Nakayama02. We selected the interactions from these PPINs with a curated score above 0.73 in order to be confident that the pairs of proteins interact (Schaefer et al. 2012).
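The score-based selection described above can be sketched as follows. This is a minimal illustration that assumes the interactions are available as (protein, protein, score) tuples; the function name and the toy records are hypothetical, and only the 0.73 cutoff comes from the text.

```python
from typing import Iterable, List, Tuple

Interaction = Tuple[str, str, float]  # (protein_a, protein_b, confidence score)

SCORE_THRESHOLD = 0.73  # cutoff stated in the text


def select_confident(interactions: Iterable[Interaction],
                     threshold: float = SCORE_THRESHOLD) -> List[Interaction]:
    """Keep only interactions whose score is strictly above the threshold."""
    return [(a, b, score) for a, b, score in interactions if score > threshold]


# Toy records (hypothetical values, not taken from any database).
toy_interactions = [("P1", "P2", 0.95), ("P2", "P3", 0.73), ("P3", "P4", 0.40)]
confident = select_confident(toy_interactions)
```

Because the text specifies scores "above 0.73", the comparison is strict, so an interaction scoring exactly 0.73 is dropped in this sketch.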
Gene expression data sets
Measuring gene expression with microarrays is now a common molecular biology approach in biomedicine, making it possible to simultaneously measure the relative expression of thousands of genes under different experimental conditions (Current Topics in Computational Molecular Biology, 2002). Thousands of gene expression data sets are available in public databases, each containing a description of the biomedical origin of the sample, the analytic procedures followed and the experimental results in terms of expression (i.e., the amount of RNA produced for each gene in the genome).
Raw experimental gene expression data (CEL files) for the Ovarian, Colon, Liver and Kidney datasets were downloaded from the Barcode human transcriptome repository (Gene Expression Barcode, http://barcode.luhs.org/). Those for the SCZ and AD datasets were downloaded from the NCBI Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo) and the Stanley Medical Research Institute Online Genomics Database (SMRI, https://www.stanleygenomics.org; Online Resource 2). Importantly, each dataset corresponds to a collection of disease and control samples. For the analysis, we filtered out the datasets with too few disease or control samples (fewer than 9) and we only used those produced on the same platform (Affymetrix array GeneChip Human Genome U133 Plus 2.0), rendering information on 23,945 human genes. This platform has been widely used, and using the same platform for all data sets facilitates comparative studies and reduces potential experimental errors.
Methods
In order to study the stability of the PPIN in cancer, neurological and normal samples, we implemented an original method inspired by the well-known deterministic simulated annealing (DSA) approach, customized to study neighbor-energy (nE). Here, stability describes a network state that is not significantly altered, even when fundamental properties have changed or perturbations have been introduced. From a biological point of view, network instability could reflect a situation where mutations in a key protein involved in many interactions alter several associated biological processes.
A filtered PPIN (Sect. 2.2.1), and preprocessed and normalized gene expression data (Sect. 2.2.3) for three different conditions (cancer, normal and neurological disorders), were the inputs for our approach (Sect. 2.2.4). A scheme of the workflow is presented in Fig. 1, where preprocessing and filtering are clearly represented as two separate modules.
Protein–protein interaction network filtering
Data from the PINA network were filtered by requiring experimental evidence for PPIs, and by removing redundancy, self-interactions and interactions involving proteins that were not from Homo sapiens. In addition, we only considered interactions between proteins whose genes were also detected in the Human Genome U133 Plus 2.0 microarray platform. The resulting filtered PINA network contains 10,650 proteins with 63,119 interactions. Each node denotes a protein encoded by a gene and each edge denotes an interaction between two proteins (Fig. 2a).
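The filtering criteria above can be sketched as a single pass over the interaction records. This is only an illustration under stated assumptions: the record layout (gene, gene, taxon, taxon, experimental flag), the function name and the toy data are hypothetical and do not reproduce the PINA file format; 9606 is the NCBI Taxonomy identifier for Homo sapiens.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical record layout: (gene_a, gene_b, taxon_a, taxon_b, has_experimental_evidence)
Record = Tuple[str, str, int, int, bool]

HUMAN_TAXON = 9606  # NCBI Taxonomy identifier for Homo sapiens


def filter_ppin(records: Iterable[Record],
                platform_genes: Set[str]) -> List[Tuple[str, str]]:
    """Apply the four filters described in the text and return unique edges."""
    kept = set()
    for gene_a, gene_b, taxon_a, taxon_b, experimental in records:
        if not experimental:
            continue  # require experimental evidence
        if taxon_a != HUMAN_TAXON or taxon_b != HUMAN_TAXON:
            continue  # drop interactions with non-human proteins
        if gene_a == gene_b:
            continue  # drop self-interactions
        if gene_a not in platform_genes or gene_b not in platform_genes:
            continue  # both genes must be present on the microarray platform
        kept.add(tuple(sorted((gene_a, gene_b))))  # undirected edge, removes redundancy
    return sorted(kept)


# Toy data (hypothetical).
toy_records = [
    ("A", "B", 9606, 9606, True),
    ("B", "A", 9606, 9606, True),    # redundant with the first record
    ("A", "A", 9606, 9606, True),    # self-interaction
    ("A", "C", 9606, 10090, True),   # partner from another species
    ("B", "D", 9606, 9606, False),   # no experimental evidence
    ("B", "E", 9606, 9606, True),    # E is not on the platform
    ("B", "C", 9606, 9606, True),
]
filtered_edges = filter_ppin(toy_records, platform_genes={"A", "B", "C", "D"})
```

Storing each edge as a sorted pair treats the network as undirected, which is an assumption of this sketch.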
Sub-network related to the synaptic vesicle cycle
A sub-network of proteins encoded by genes related to the synaptic vesicle (SV) cycle was analyzed, retrieving the proteins in the SV cycle from the KEGG pathway (http://www.genome.jp/dbget-bin/www_bget?pathway:hsa04721, version September 2014). The SV cycle pathway involves 63 genes, 50 of which were detected in the microarrays. The resulting sub-network contains 50 proteins and 3815 interactions.
Microarray gene expression preprocessing
Handling microarrays requires the preprocessing of each individual microarray to estimate the expression of each gene in the array. Gene expression data from Ovarian, Colon, Liver and Kidney cancers, and from SCZ and AD samples, were normalized by frozen Robust Multiarray Analysis (fRMA; McCall et al. 2012) from the R Affy package (Gautier et al. 2004). Background-corrected gene intensities were obtained by applying fRMA to each array individually, accounting for probe variability, batch effects, probe effects, array-to-array variability and background noise. The samples were then processed using Barcode (McCall et al. 2012) in order to convert gene intensities into estimates of gene expression (Z-score, Fig. 2b). Additionally, gene intensities were mapped into a binary vector of “ones” and “zeros” that denote whether a gene was expressed (1, when the Z-score is higher than a threshold value: 4.98 by default) or not (0) in each sample (Fig. 2b and Supplementary Material; McCall et al. 2011; Zilliox and Irizarry 2007). These binary values were used in Eq. 1, in which it is necessary to specify whether a gene is expressed or not.
To compare the Z-scores between these diseases, we normalized them using the pnorm function of the R stats package, which calculates the normal cumulative distribution function of each Z-score. This normalization step is commonly employed to prevent values in a given range from dominating other values. High Z-scores indicate intense gene expression, while small Z-scores correspond to weak expression. For expressed genes, the normalized Z-score, S = pnorm(Z-score), represents the probability of the gene being expressed. When the gene is not expressed, S = 1 \(-\) pnorm(Z-score) indicates the probability of the gene not being expressed. These S values were used in Eq. 2. Hence, each state in the system represents the significance (S) of the expression of each gene (Fig. 2b). In summary, for each disease we associated a binary value reflecting whether or not the gene is expressed (one or zero, respectively), attributing a value and a significance to the expression of each of the 10,650 genes in the network (Fig. 2b).
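The binary mapping of Z-scores can be illustrated with a short sketch. It assumes the Z-scores for one sample are held in a dictionary keyed by gene; the function name and the toy values are hypothetical, and only the default threshold of 4.98 comes from the text.

```python
from typing import Dict

Z_THRESHOLD = 4.98  # default threshold stated in the text


def binarize(z_scores: Dict[str, float],
             threshold: float = Z_THRESHOLD) -> Dict[str, int]:
    """Map each Z-score to 1 (expressed) if it is higher than the threshold, else 0."""
    return {gene: int(z > threshold) for gene, z in z_scores.items()}


# Toy Z-scores for one sample (hypothetical values).
toy_z = {"G1": 6.20, "G2": 4.98, "G3": -0.50}
expressed_flags = binarize(toy_z)
```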
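As a sketch of this normalization outside R, the standard normal cumulative distribution function from Python's standard library plays the role of pnorm with its default mean of 0 and standard deviation of 1. The function name is hypothetical, and the expressed flag is assumed to come from the binary mapping described earlier.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, sd 1, matching the defaults of R's pnorm


def significance(z_score: float, expressed: bool) -> float:
    """Return S: pnorm(Z) for an expressed gene, 1 - pnorm(Z) otherwise."""
    p = _STD_NORMAL.cdf(z_score)
    return p if expressed else 1.0 - p


s_expressed = significance(6.0, expressed=True)        # close to 1
s_not_expressed = significance(-1.0, expressed=False)  # 1 - pnorm(-1)
```

With this definition, a strongly expressed gene and a clearly non-expressed gene both receive S values close to 1, so S measures confidence in the assigned state rather than the expression level itself.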
Adapted simulated annealing approach
To study network stability we adopted an approach based on the simulated annealing (SA) concept, a probabilistic method for finding the global minimum of a generic cost function (Kirkpatrick et al. 1983; Cerny 1985). This procedure reproduces the way the structure of a solid reaches its minimum energy configuration through cooling, becoming “frozen” at this minimum energy.
A full description of the DSA is included in Online Resource 3 (Duda et al. 2007; Haykin 1994), which also follows a physical analogy based on a set of interconnected nodes, each one with its associated state. During the cooling process forces between interconnected nodes act on the structure, which evolves until each node reaches a stable state. Thus, the nodes interacting with other nodes within the system influence one another with a defined weight.
Our algorithm is inspired by the definition of an nE function that measures the stability of the network, as well as by the general deterministic approach whereby a lower nE is related to greater stability. In our case, using an nE function that decreases as a function of the iterations or over time does not make sense given the characteristics of the biological problem. Indeed, our approach does not evolve through iterations or time and thus, this part of the algorithm was not considered.
Our system is represented by a PPIN in which nodes represent proteins associated with the expression of the corresponding gene (\(S_{i}\) describes the significance (S) of a gene \(i\) being expressed or not). The DSA approach is applied to estimate the dynamic structures in the PPIN (Fig. 2c), where \(S_{i}\) represents the state of the node in the original DSA approach and the edges reflect the interactions existing between proteins. Each \(W_{ij}\) represents the weight required (Eq. 1), where \(W_{ij}\) is inversely associated with the existence of the interaction between two proteins. If the two genes \(i\) and \(j\) are both expressed, then the two corresponding proteins can interact (\(W_{ij} = -1\)). The value of \(W_{ij}\) is \(+1\) if the interaction is not possible because at least one of the two genes is not expressed.
$$\begin{aligned} W_{ij} =\left\{ {\begin{array}{ll} -1 &amp; \text{if } i \text{ expressed},\ j \text{ expressed} \\ +1 &amp; \text{if } i \text{ or } j \text{ not expressed} \\ +1 &amp; \text{if } i \text{ not expressed},\ j \text{ not expressed} \\ \end{array}} \right. \end{aligned}$$
(1)
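Eq. 1 can be written as a small function. In this sketch the function name is hypothetical, and the expression states are assumed to be the binary values obtained during preprocessing.

```python
def interaction_weight(expressed_i: bool, expressed_j: bool) -> int:
    """Eq. 1: -1 when both genes are expressed, +1 in every other case."""
    return -1 if (expressed_i and expressed_j) else 1


# The four possible combinations of expression states.
weights = {
    (True, True): interaction_weight(True, True),
    (True, False): interaction_weight(True, False),
    (False, True): interaction_weight(False, True),
    (False, False): interaction_weight(False, False),
}
```

Cases 2 and 3 of Eq. 1 both yield +1, so a single condition on joint expression is enough.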
Consistent with the main idea of the SA algorithm, the local_nE is defined as the sum of the energy from all the nodes connected to a given node \(i\). This influence is calculated by multiplying the expression of each gene (normalized value of expression, S) by the associated weights of the connected nodes (\(W_{ij}\)), as summarized in Eq. 2.
$$\begin{aligned} local\_nE(i)=-\sum _j {W_{ij} *S_i *S_j} \end{aligned}$$
(2)
According to the definition in Eq. 2, the local_nE is maximal when \(W_{ij} *S_i *S_j \) is at its minimum, representing active connections between nodes of expressed genes (Eq. 1, case 1) and indicating that any alteration in this node will destabilize the network.
The value of the local_nE decreases for those node connections that involve at least one gene that is not expressed in that condition, reflecting the fact that the interactions cannot take place (Eq. 1, cases 2 and 3). In these situations, the local_nE achieves its minimum values indicating network stability.
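The behaviour of Eq. 2 can be checked on a toy network. The sketch below assumes the network is stored as an adjacency dictionary and that the expression states and S values are given per node; all names and values are hypothetical.

```python
from typing import Dict, Set


def local_ne(node: str,
             neighbors: Dict[str, Set[str]],
             expressed: Dict[str, bool],
             s: Dict[str, float]) -> float:
    """Eq. 2: local_nE(i) = -sum_j W_ij * S_i * S_j over the direct neighbors j of i."""
    total = 0.0
    for j in neighbors.get(node, ()):
        w_ij = -1 if (expressed[node] and expressed[j]) else 1  # Eq. 1
        total += w_ij * s[node] * s[j]
    return -total


# Toy network (hypothetical): A interacts with B and C; C is not expressed.
toy_neighbors = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
toy_expressed = {"A": True, "B": True, "C": False}
toy_s = {"A": 0.9, "B": 0.8, "C": 0.7}

ne_a = local_ne("A", toy_neighbors, toy_expressed, toy_s)  # 0.72 - 0.63
ne_b = local_ne("B", toy_neighbors, toy_expressed, toy_s)  # one active connection
ne_c = local_ne("C", toy_neighbors, toy_expressed, toy_s)  # one inactive connection
```

In this toy case node B, whose only connection is active, has the highest local_nE, while the non-expressed node C has the lowest, in line with the interpretation given above.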
The local_nE function measures the stability of a single protein or node \(i\) as a function of its neighborhood, i.e., only with respect to its directly interacting partners and not within the entire network. The global nE value (Eq. 3), and therefore the stability of the entire network, is a consequence of the equilibrium between interactions among active nodes (corresponding to expressed genes) and inactive nodes (corresponding to non-expressed genes).
$$\begin{aligned} nE=\sum _i {local\_nE(i)} \end{aligned}$$
(3)
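Eq. 3 then reduces to a sum over all nodes. The following sketch repeats a compact version of Eq. 2 so that it is self-contained; the adjacency dictionary, the names and the toy values are hypothetical.

```python
from typing import Dict, Set


def local_ne(node: str, neighbors: Dict[str, Set[str]],
             expressed: Dict[str, bool], s: Dict[str, float]) -> float:
    """Eq. 2, with W_ij taken from Eq. 1."""
    return -sum(
        (-1 if (expressed[node] and expressed[j]) else 1) * s[node] * s[j]
        for j in neighbors[node]
    )


def global_ne(neighbors: Dict[str, Set[str]],
              expressed: Dict[str, bool], s: Dict[str, float]) -> float:
    """Eq. 3: nE is the sum of local_nE(i) over every node i in the network."""
    return sum(local_ne(i, neighbors, expressed, s) for i in neighbors)


# Toy network (hypothetical): A interacts with B and C; C is not expressed.
toy_neighbors = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
toy_expressed = {"A": True, "B": True, "C": False}
toy_s = {"A": 0.9, "B": 0.8, "C": 0.7}

total_ne = global_ne(toy_neighbors, toy_expressed, toy_s)
```

Since each interaction appears in the neighborhood of both of its endpoints, every edge contributes twice to the sum when the adjacency is symmetric, as it is in this toy example.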
Computation of network robustness
To assess the robustness of the system, we analyzed how the network structure changes as nodes are removed in accordance with previously defined procedures (Iyer et al. 2013). Changes in the network structure are evaluated in terms of the size of the largest connected component of the network. Networks in which the largest component decreases faster than that of the original network are considered to be less robust to perturbations. Thus, nodes were removed in decreasing order of their local_nE scores (Eq. 2), removing the proteins (or nodes) with higher local_nE values first (i.e., those with more active connections) and those with the lowest local_nE scores last (i.e., those less connected to their neighbors).
Network robustness was measured through the R-index in Eq. 4, where \(N\) is the number of nodes in the network and \({\upalpha }(i/N)\) corresponds to the size of the largest connected component within the network after the removal of \(i\) nodes.
$$\begin{aligned} R = \frac{1}{N} \sum _{i=1}^N {\alpha (i/N)} \end{aligned}$$
(4)
We computed the R-index for cancer and normal control samples at each step after the removal of nodes, following the order of their local_nE scores.
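The removal procedure and Eq. 4 can be sketched with a plain depth-first search. Several choices here are assumptions rather than details taken from the text: the network is an adjacency dictionary, the removal order is computed once from precomputed local_nE scores, and \({\upalpha }\) is expressed as the fraction of the original N nodes in the largest connected component. All names and toy values are hypothetical.

```python
from typing import Dict, Iterable, Set


def largest_component_size(neighbors: Dict[str, Set[str]], removed: Set[str]) -> int:
    """Size of the largest connected component once the removed nodes are ignored."""
    seen = set(removed)
    best = 0
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # iterative depth-first search
            node = stack.pop()
            size += 1
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best


def r_index(neighbors: Dict[str, Set[str]], removal_order: Iterable[str]) -> float:
    """Eq. 4, with alpha taken as a fraction of the original N nodes (assumption)."""
    n = len(neighbors)
    removed: Set[str] = set()
    total = 0.0
    for node in removal_order:
        removed.add(node)
        total += largest_component_size(neighbors, removed) / n
    return total / n


# Toy path network A-B-C-D with hypothetical local_nE scores.
toy_neighbors = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
toy_scores = {"A": 0.1, "B": 0.9, "C": 0.5, "D": 0.0}
order = sorted(toy_scores, key=toy_scores.get, reverse=True)  # highest local_nE first
toy_r = r_index(toy_neighbors, order)
```

In the toy network, the largest component shrinks to 2, 1, 1 and 0 nodes as B, C, A and D are removed, giving R = (0.5 + 0.25 + 0.25 + 0) / 4 = 0.25.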