Completing sparse and disconnected protein-protein network by deep learning
Abstract
Background
Protein-protein interaction (PPI) prediction remains a central task in systems biology for achieving a better and more holistic understanding of cellular and intracellular processes. Recently, an increasing number of computational methods have shifted from pair-wise prediction to network level prediction. Many of the existing network level methods predict PPIs under the assumption that the training network is connected. However, this assumption greatly limits their predictive power and applicability, because current golden standard PPI networks are usually very sparse and disconnected. How to effectively predict PPIs based on a sparse and disconnected training network therefore remains a challenge.
Results
In this work, we developed a novel PPI prediction method based on a deep learning neural network and the regularized Laplacian kernel. We use a neural network with an autoencoder-like architecture to implicitly simulate the evolutionary processes of a PPI network. Neurons of the output layer correspond to proteins and are labeled with values (1 for interaction, 0 otherwise) from the adjacency matrix of a sparse, disconnected training PPI network. Unlike an autoencoder, neurons at the input layer are given all-zero input, reflecting an assumption of no a priori knowledge about PPIs, and hidden layers of smaller sizes mimic the ancient interactome at different times during evolution. After the training step, an evolved PPI network, whose rows are outputs of the neural network, can be obtained. We then predict PPIs by applying the regularized Laplacian kernel to the transition matrix built upon the evolved PPI network. Cross-validation experiments show that PPI prediction accuracies, measured as AUC, are increased by up to 8.4% for yeast data and 14.9% for human data compared to the baseline. Moreover, the evolved PPI network can also help us leverage complementary information from the disconnected training network and multiple heterogeneous data sources. Tested on the yeast data with six heterogeneous feature kernels, our method further improves prediction performance by up to 2%, which is very close to an upper bound obtained by an Approximate Bayesian Computation based sampling method.
Conclusions
The proposed evolution neural network, coupled with the regularized Laplacian kernel, is an effective tool for completing sparse and disconnected PPI networks and for facilitating the integration of heterogeneous data sources.
Keywords
Disconnected protein interaction network · Neural network · Interaction prediction · Network evolution · Regularized Laplacian
Abbreviations
- ABCDEP
Approximate Bayesian computation and modified differential evolution sampling
- ADJ-RL
Adjacency matrix based regularized Laplacian kernel
- AUC
Area under the curve
- ENN
Evolution neural network
- ENN-RL
Evolution neural network based regularized Laplacian kernel
- PPI
Protein-protein interaction
- RL
Regularized Laplacian kernel
- ROC
Receiver operating characteristic
- WOLP
Weight optimization by linear programming
Background
Studying protein-protein interactions (PPIs) can help us better understand intracellular signaling pathways, model protein complex structures and elucidate various biochemical processes. To aid in discovering more de novo PPIs, many computational methods have been developed; they can generally be categorized into one of the following three types: (a) pair-wise biological similarity based approaches using sequence homology, gene co-expression, phylogenetic profiles, three-dimensional structural information, etc. [1, 2, 3, 4, 5, 6, 7]; (b) pair-wise topological feature based methods [8, 9, 10, 11]; and (c) whole network structure based methods [1, 12, 13, 14, 15, 16, 17, 18, 19, 20].
For pair-wise biological similarity based methods, which do not determine from first principles of physics and chemistry whether two given proteins will interact, predictive power is greatly affected by the features being used, which may be noisy or inconsistent. To circumvent the limitations of pair-wise biological similarity, network structure based methods are playing an increasing role in PPI prediction: they not only involve the whole network structure, with topological similarities implicitly included, but can also utilize pair-wise biological similarities as weights for the edges in the networks.
Along this line, variants of random walk [12, 13, 14, 15] have been developed. Given a PPI network with N proteins, the computational cost of these methods increases by a factor of N for all-against-all PPI prediction. Fouss et al. [16] systematically studied many kernel methods for link prediction, which can measure the similarities of all node pairs and make predictions at once. Compared to random walks, kernel methods are usually more efficient. However, neither random walk methods nor kernel methods perform very well in predicting interactions between faraway node pairs in networks [16]. Instead of utilizing network structure explicitly, many latent feature methods based on rank reduction and spectral analysis have also been used for prediction, such as geometric de-noising methods [1, 17], multi-way spectral clustering [18] and matrix factorization based methods [19, 20]. Note that the objective functions in these methods must be carefully designed to ensure fast convergence and to avoid getting stuck in local optima. An advantage of these methods is that biological features and network topological features can complement each other to improve prediction performance, for example by weighting network edges with pair-wise biological similarity scores [19, 20]. However, one limitation is that only the pair-wise features for existing edges in the PPI network are utilized, whereas from a PPI prediction perspective it is particularly useful to incorporate pair-wise features for node pairs that are not currently linked by a direct edge but may become linked. Recently, Huang et al. proposed a sampling method [21] and a linear programming method [22] to find optimal weights for multiple heterogeneous data sources, thereby building a weighted kernel fusion for all node pairs. These methods apply the regularized Laplacian kernel (RL) to the weighted kernel fusion to infer missing or new edges in the PPI network, and they improved PPI prediction performance, especially for detecting interactions between nodes that are far apart in the training network, while using only small training networks.
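Since the regularized Laplacian kernel recurs throughout this work, a minimal sketch may help fix ideas. Assuming the standard definition K = (I + αL)^{-1} with graph Laplacian L = D − A (the paper does not spell out its exact parameterization, so α and the Laplacian form here are assumptions), it can be computed as:

```python
import numpy as np

def regularized_laplacian_kernel(A, alpha=0.1):
    """Regularized Laplacian (RL) kernel of a graph.

    A     : (n, n) symmetric adjacency (or weighted) matrix
    alpha : diffusion parameter (illustrative value; I + alpha*L is
            invertible for any alpha > 0 since L is positive semidefinite)
    Returns K = (I + alpha * L)^(-1), where L = D - A is the Laplacian.
    """
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A          # combinatorial graph Laplacian
    return np.linalg.inv(np.eye(n) + alpha * L)

# Toy path graph 0-1-2-3: RL scores decay with distance, so the two-hop
# pair (0, 2) scores higher than the three-hop pair (0, 3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
K = regularized_laplacian_kernel(A)
assert K[0, 2] > K[0, 3] > 0
```

Ranking all non-edges by their kernel scores then yields candidate interactions; in the method proposed here, the kernel is applied to a transition matrix built from the evolved network rather than to the raw adjacency matrix.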
However, almost all the methods discussed above need the training network to be a single connected component in order to measure node-pair similarities, despite the fact that existing PPI networks are usually disconnected. Consequently, these traditional methods keep only the maximum connected component of the original PPI network as golden standard data, which is then divided into a connected training network and testing edges. That is to say, these methods cannot effectively predict interactions for proteins that are not located in the maximum connected component. Therefore, it would be of great interest and utility to infer a PPI network from a small number of interaction edges that do not need to form a connected network.
Building on our previous study of network evolutionary analysis [23], we designed a neural network based evolution model to implicitly simulate the evolution processes of PPI networks. Instead of simulating the evolution of the whole network structure with the growth of nodes and edges, as in the models discussed in Huang et al. [23], we focus only on edge evolution and assume all nodes already exist. We initialize the ancient PPI network as an all-zero adjacency matrix and use the disconnected training network, with its interaction edges, as labels: each row of the all-zero adjacency matrix and of the training matrix is used as the input and label of the neural network, respectively. We then train the model to simulate the evolution process of interactions. After the training step, we use the outputs of the last layer of the neural network as the rows of the evolved contact matrix. Finally, we apply the regularized Laplacian kernel to a transition matrix built upon the evolved contact matrix to infer new PPIs. The results show that our method can efficiently utilize an extremely sparse and disconnected training network, improving prediction performance by up to 8.4% for yeast and 14.9% for human PPI data.
Methods
Problem definition
The golden standard PPI network is represented as a graph G=(V,E), where i and j are two nodes in the node set V, and (i,j)∈E represents an edge between i and j. We divide the golden standard network into two parts: the training network G_{ tn }=(V,E_{ tn }) and the testing set G_{ tt }=(V_{ tt },E_{ tt }), such that E=E_{ tn }∪E_{ tt } and any edge in G belongs to exactly one of the two parts. The detailed process of dividing the golden standard network is shown in Algorithm 1. We set α (the preset ratio of |E_{ tn }| to |E|) to a small value to make G_{ tn } extremely sparse, with a large number of disconnected components.
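As a minimal illustration of this division step (the exact sampling scheme of Algorithm 1 is not reproduced here; uniform random sampling of edges is an assumption), the split might be sketched as:

```python
import random

def split_edges(edges, alpha, seed=0):
    """Randomly split golden standard edges E into a sparse training set
    E_tn (a fraction alpha of all edges) and a testing set E_tt.
    Unlike traditional splits, E_tn is NOT required to stay connected."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_train = int(alpha * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

edges = [(i, i + 1) for i in range(100)]        # toy edge list
E_tn, E_tt = split_edges(edges, alpha=0.25)
assert len(E_tn) == 25 and len(E_tn) + len(E_tt) == len(edges)
assert not set(E_tn) & set(E_tt)                # disjoint split
```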
Algorithm 2 describes the detailed training and prediction processes.
Evolution neural network
The structure of the evolution neural network is shown in Fig. 2; it contains five layers: the input layer, three hidden layers and the output layer. Sigmoid is adopted as the activation function for each neuron, and layers are connected with dropout. Dropout not only helps prevent over-fitting, but also models mutation events during the evolution process: which nodes (representing proteins) at a layer (corresponding to a time during evolution) may have evolved from nodes in the previous layer is indicated by the edges and corresponding weights connecting those nodes.
For the specific configuration of the neural network in our experiments, the number of neurons in the input and output layers depends on the network size m=|V| of the specific PPI data. Each protein is represented by the corresponding row of the adjacency matrix trainingAdj of G_{ tn }, which contains the interaction information for that protein with all other proteins in the proteome. We train the evolution neural network using each row of the blank adjacency matrix as the input and the corresponding row of trainingAdj as the label. A typical autoencoder structure is chosen for the three hidden layers, where the encoder and decoder correspond to the biological devolution and evolution processes respectively, and cross entropy is used as the loss function. Note that the correspondence of encoder/decoder to biological devolution/evolution is, at this stage, more of an analogy that helped with the design of the neural network structure than a real evolution model for PPI networks. It is also worth noting that, unlike a traditional autoencoder, we did not include layer-wise pretraining to initialize the weights of our neural network, since the inputs are all-zero vectors. The neural network is implemented with the TensorFlow library [25] and deployed on the Biomix cluster at the Delaware Biotechnology Institute.
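To make the architecture concrete, the following NumPy sketch builds the five-layer, sigmoid-activated forward pass. The hidden-layer sizes are illustrative assumptions, and the training machinery (dropout, cross-entropy loss, weight updates, done in TensorFlow in our experiments) is omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ENN:
    """Structural sketch of the evolution neural network: five layers
    (input, three hidden, output) with sigmoid activations throughout.
    Hidden sizes here are illustrative, not the paper's configuration."""

    def __init__(self, m, hidden=(256, 64, 256), seed=0):
        rng = np.random.default_rng(seed)
        sizes = [m, *hidden, m]
        self.weights = [rng.normal(0, 0.1, (a, b))
                        for a, b in zip(sizes, sizes[1:])]
        self.biases = [np.zeros(b) for b in sizes[1:]]

    def forward(self, x):
        for W, b in zip(self.weights, self.biases):
            x = sigmoid(x @ W + b)
        return x

m = 100                       # toy proteome size
enn = ENN(m)
zero_input = np.zeros(m)      # all-zero input: no a priori PPI knowledge
evolved_row = enn.forward(zero_input)   # one row of the evolved matrix
assert evolved_row.shape == (m,)
```

After training against the rows of trainingAdj, stacking these output rows yields the evolved contact matrix used downstream.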
Data
PPI network information
Species | Proteins | Interactions
---|---|---
Yeast | 5093 | 22,423
Human | 9617 | 37,039
Results
Experiments on yeast PPI data
Division of yeast golden standard PPI interactions
α | G_{ tn } | G_{ tn }(#C) | G_{ tn }(minC, avgC, maxC) | G_{ tt }
---|---|---|---|---
0.25 | 4061 | 2812 | (1, 1.81, 2152) | 18,362
0.125 | 1456 | 3915 | (1, 1.30, 1006) | 20,967
AUC summary of repetitions for yeast PPI data
Methods | Avg ± Std (α=0.25) | Avg ± Std (α=0.125)
---|---|---
ENN-RL | 0.8339 ± 0.0016 | 0.8195 ± 0.0023
ADJ-RL | 0.8104 ± 0.0039 | 0.7403 ± 0.0083
Experiments on human PPI data
Division of human golden standard PPI interactions
α | G_{ tn } | G_{ tn }(#C) | G_{ tn }(minC, avgC, maxC) | G_{ tt }
---|---|---|---|---
0.25 | 6567 | 5370 | (1, 1.79, 3970) | 30,472
0.125 | 2260 | 7667 | (1, 1.25, 1566) | 34,779
AUC summary of repetitions for human PPI data
Methods | Avg ± Std (α=0.25) | Avg ± Std (α=0.125)
---|---|---
ENN-RL | 0.8320 ± 0.0012 | 0.8140 ± 0.0013
ADJ-RL | 0.7795 ± 0.0047 | 0.6970 ± 0.0059
Optimize weights for heterogeneous feature kernels
Most recently, Huang et al. [22, 28] developed a method to infer de novo PPIs by applying the regularized Laplacian kernel to a kernel fusion based on optimally weighted heterogeneous feature kernels. To find the optimal weights, they proposed the weight optimization by linear programming (WOLP) method, which is based on random walks over a connected training network. First, they used the Barker algorithm and the training network to construct a transition matrix that constrains how a random walk traverses the training network. The optimal kernel fusion is then obtained by adjusting the weights to minimize the element-wise difference between the transition matrix and the weighted sum of kernels. The minimization problem is solved by linear programming.
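A schematic reconstruction of this weight-finding step, under the assumption that it minimizes the element-wise L1 difference between the target transition matrix T and a non-negatively weighted kernel sum (the authors' exact LP formulation may differ):

```python
import numpy as np
from scipy.optimize import linprog

def wolp_weights(T, kernels):
    """WOLP-style sketch: find non-negative weights w minimizing the
    element-wise L1 difference between target matrix T and the fusion
    sum_k w_k K_k, via a linear program with per-element slacks s."""
    n_k = len(kernels)
    F = np.stack([K.ravel() for K in kernels], axis=1)   # (n*n, n_k)
    t = T.ravel()
    n_e = t.size
    # variables: [w (n_k), s (n_e)]; objective: minimize sum of slacks
    c = np.concatenate([np.zeros(n_k), np.ones(n_e)])
    # |F w - t| <= s encoded as two inequality blocks
    A_ub = np.block([[F, -np.eye(n_e)],
                     [-F, -np.eye(n_e)]])
    b_ub = np.concatenate([t, -t])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (n_k + n_e))
    return res.x[:n_k]

# toy check: T is exactly 0.7*K1 + 0.3*K2, so the LP recovers the weights
rng = np.random.default_rng(1)
K1, K2 = rng.random((4, 4)), rng.random((4, 4))
w = wolp_weights(0.7 * K1 + 0.3 * K2, [K1, K2])
assert np.allclose(w, [0.7, 0.3], atol=1e-4)
```

The recovered weights then define the kernel fusion to which the regularized Laplacian kernel is applied.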
We tested this method on the yeast PPI network with the same settings as in Table 2; six feature kernels are included:
- G_{ tn }: the training network with α=0.25 or 0.125 in Table 2.
- K_{ Jaccard } [29]: measures the similarity of a protein pair i,j as \(\frac {|neighbors(i) \cap neighbors(j)|}{|neighbors(i) \cup neighbors(j)|}\).
- K_{ SN }: measures the total number of neighbors of proteins i and j, K_{ SN }=|neighbors(i)|+|neighbors(j)|.
- K_{ B } [30]: a sequence-based kernel matrix generated using BLAST [31].
- K_{ E } [30]: a gene co-expression kernel matrix constructed entirely from microarray gene expression measurements.
- K_{ Pfam } [30]: a similarity measure derived from Pfam HMMs [32].
All kernels are normalized to the range [0, 1] to avoid bias.
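The two topology-based kernels above can be computed directly from the training adjacency matrix. A sketch follows; using min-max normalization for K_SN is our reading of the normalization step, not a detail given in the text:

```python
import numpy as np

def jaccard_kernel(A):
    """K_Jaccard: |N(i) ∩ N(j)| / |N(i) ∪ N(j)| for every protein pair,
    computed from a 0/1 adjacency matrix A."""
    inter = A @ A.T                      # shared-neighbor counts
    deg = A.sum(axis=1)
    union = deg[:, None] + deg[None, :] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(union > 0, inter / union, 0.0)

def sn_kernel(A):
    """K_SN: total neighbor count |N(i)| + |N(j)|, min-max scaled
    to [0, 1] (assumed normalization)."""
    deg = A.sum(axis=1)
    K = deg[:, None] + deg[None, :]
    return K / K.max() if K.max() > 0 else K

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
KJ = jaccard_kernel(A)
# proteins 0 and 1 share neighbor 2; N(0) ∪ N(1) = {0, 1, 2}, so 1/3
assert np.isclose(KJ[0, 1], 1/3)
```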
Discussion
As shown in Fig. 9, ENNlp-RL benefits from the more complete and accurate target transition matrix provided by ENN-RL: it outperforms the other methods and comes very close to the upper bound of 0.8529 achieved by ABCDEP-RL. Similarly, in Fig. 10, although the maximum component of G_{ tn } is very sparse, with 1006 proteins and only 1456 training interactions, ENNlp-RL still improves on ENN-RL and comes very close to ABCDEP-RL. All these results indicate that the transition matrix T learned by our ENN model can further help downstream tools like WOLP leverage useful information from heterogeneous feature kernels.
Conclusions
In this work we developed a novel method based on a deep learning neural network and the regularized Laplacian kernel to predict de novo interactions for sparse and disconnected PPI networks. We built the neural network with a typical autoencoder structure to implicitly simulate the evolutionary processes of PPI networks. Through supervised learning that uses the rows of a sparse and disconnected training network as labels, we obtain an evolved PPI network as the output of the deep neural network, whose input layer is the same size as the output layer but receives all-zero input, and whose smaller hidden layers simulate an ancient interactome. We then predicted PPIs by applying the regularized Laplacian kernel to the transition matrix built upon that evolved PPI network. Tested on the DIP yeast PPI network and the HPRD human PPI network, our method exhibits competitive advantages over the traditional regularized Laplacian kernel based on the training network only: it achieved significant improvement in PPI prediction, as measured by AUC, of up to 8.4% over the baseline for yeast data and 14.9% for human data. Moreover, the transition matrix learned by our evolution neural network can also help build an optimized kernel fusion, which effectively overcomes the limitation of the traditional WOLP method of needing a relatively large and connected training network to obtain the optimal weights. Tested on the DIP yeast data with six feature kernels, the prediction results show that the AUC can be further improved, coming very close to the upper bound. Given that current golden standard PPI networks are usually disconnected and very sparse, we believe our model provides a promising tool that can effectively utilize disconnected networks to predict PPIs.
In this paper, we designed an autoencoder deep learning structure analogous to the evolution process of a PPI network. Although this structure should not be interpreted as a real evolution model of PPI networks, it is nonetheless worthwhile to explore further in future work. Meanwhile, we also plan to investigate other deep learning models for the PPI prediction problem.
Notes
Acknowledgements
The authors are grateful to the anonymous reviewers for their valuable comments and suggestions.
Funding
Publication of this article is funded by Delaware INBRE program, with grant from the National Institute of General Medical Sciences-NIGMS (8 P20 GM103446-12) from the National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Availability of data and materials
The code for this work can be downloaded from the project homepage: https://www.eecis.udel.edu/~lliao/enn/. The data were downloaded from reference [26] (http://dip.mbi.ucla.edu/dip/) and reference [27] (http://www.hprd.org/).
Authors’ contributions
LH designed the algorithm and experiments, and performed all calculations and analyses. LL and CHW aided in interpretation of the data and preparation of the manuscript. LH wrote the manuscript, LL and CHW revised it. LL and CHW conceived of this study. All authors have read and approved this manuscript.
Ethics approval and consent to participate
No human, animal or plant experiments were performed in this study, and ethics committee approval was therefore not required.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Kuchaiev O, Rašajski M, Higham DJ, Pržulj N. Geometric de-noising of protein-protein interaction networks. PLoS Comput Biol. 2009;5(8):e1000454.
- 2. Murakami Y, Mizuguchi K. Homology-based prediction of interactions between proteins using averaged one-dependence estimators. BMC Bioinformatics. 2014;15(1):213.
- 3. Salwinski L, Eisenberg D. Computational methods of analysis of protein–protein interactions. Curr Opin Struct Biol. 2003;13(3):377–82.
- 4. Craig R, Liao L. Phylogenetic tree information aids supervised learning for predicting protein-protein interaction based on distance matrices. BMC Bioinformatics. 2007;8(1):6.
- 5. Gonzalez A, Liao L. Predicting domain-domain interaction based on domain profiles with feature selection and support vector machines. BMC Bioinformatics. 2010;11(1):537.
- 6. Zhang QC, Petrey D, Deng L, Qiang L, Shi Y, Thu CA, Bisikirska B, Lefebvre C, Accili D, Hunter T, Maniatis T, Califano A, Honig B. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature. 2012;490(7421):556–60.
- 7. Singh R, Park D, Xu J, Hosur R, Berger B. Struct2Net: a web service to predict protein–protein interactions using a structure-based approach. Nucleic Acids Res. 2010;38(suppl 2):508–15.
- 8. Chen HH, Gou L, Zhang XL, Giles CL. Discovering missing links in networks using vertex similarity measures. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing (SAC ’12). New York: ACM; 2012. p. 138–43.
- 9. Lü L, Zhou T. Link prediction in complex networks: a survey. Physica A. 2011;390(6):1150–70.
- 10. Lei C, Ruan J. A novel link prediction algorithm for reconstructing protein-protein interaction networks by topological similarity. Bioinformatics. 2013;29(3):355–64.
- 11. Pržulj N. Protein-protein interactions: making sense of networks via graph-theoretic modeling. BioEssays. 2011;33(2):115–23.
- 12. Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: bringing order to the web. Technical Report 1999-66, Stanford InfoLab; 1999. http://ilpubs.stanford.edu:8090/422/.
- 13. Tong H, Faloutsos C, Pan JY. Random walk with restart: fast solutions and applications. Knowl Inf Syst. 2008;14(3):327–46.
- 14. Li RH, Yu JX, Liu J. Link prediction: the power of maximal entropy random walk. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM ’11). New York: ACM; 2011. p. 1147–56. https://doi.org/10.1145/2063576.2063741.
- 15. Backstrom L, Leskovec J. Supervised random walks: predicting and recommending links in social networks. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM ’11). New York: ACM; 2011. p. 635–44.
- 16. Fouss F, Francoisse K, Yen L, Pirotte A, Saerens M. An experimental investigation of kernels on graphs for collaborative recommendation and semisupervised classification. Neural Netw. 2012;31:53–72.
- 17. Cannistraci CV, Alanis-Lobato G, Ravasi T. Minimum curvilinearity to enhance topological prediction of protein interactions by network embedding. Bioinformatics. 2013;29(13):199–209.
- 18. Symeonidis P, Iakovidou N, Mantas N, Manolopoulos Y. From biological to social networks: link prediction based on multi-way spectral clustering. Data Knowl Eng. 2013;87:226–42.
- 19. Wang H, Huang H, Ding C, Nie F. Predicting protein–protein interactions from multimodal biological data sources via nonnegative matrix tri-factorization. J Comput Biol. 2013;20(4):344–58. https://doi.org/10.1089/cmb.2012.0273.
- 20. Menon AK, Elkan C. Link prediction via matrix factorization. In: Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases, Part II (ECML PKDD ’11). Berlin: Springer; 2011. p. 437–52.
- 21. Huang L, Liao L, Wu CH. Inference of protein-protein interaction networks from multiple heterogeneous data. EURASIP J Bioinforma Syst Biol. 2016;2016(1):1–9. https://doi.org/10.1186/s13637-016-0040-2.
- 22. Huang L, Liao L, Wu CH. Protein-protein interaction prediction based on multiple kernels and partial network with linear programming. BMC Syst Biol. 2016;10(2):45. https://doi.org/10.1186/s12918-016-0296-x.
- 23. Huang L, Liao L, Wu CH. Evolutionary model selection and parameter estimation for protein-protein interaction network based on differential evolution algorithm. IEEE/ACM Trans Comput Biol Bioinforma. 2015;12(3):622–31. https://doi.org/10.1109/TCBB.2014.2366748.
- 24. Bengio Y. Learning deep architectures for AI. Found Trends Mach Learn. 2009;2(1):1–127. https://doi.org/10.1561/2200000006.
- 25. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X. TensorFlow: large-scale machine learning on heterogeneous systems. 2015. Software available from http://tensorflow.org/.
- 26. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004;32(suppl 1):449–51.
- 27. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A. Human Protein Reference Database - 2009 update. Nucleic Acids Res. 2009;37(suppl 1):767–72.
- 28. Huang L, Liao L, Wu CH. Protein-protein interaction network inference from multiple kernels with optimization based on random walk by linear programming. In: Proceedings of the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Washington, DC: IEEE Computer Society; 2015. p. 201–7.
- 29. Jaccard P. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles. 1901;37:547–79.
- 30. Lanckriet GRG, De Bie T, Cristianini N, Jordan MI, Noble WS. A statistical framework for genomic data fusion. Bioinformatics. 2004;20(16):2626–35.
- 31. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
- 32. Sonnhammer ELL, Eddy SR, Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins Struct Funct Bioinforma. 1997;28(3):405–20.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.