A framework towards data analytics on host–pathogen protein–protein interactions

Abstract

With the rapid development of high-throughput technologies, systems biology is now embracing a great opportunity made possible by the increased accumulation of data available online. Biological data analytics is considered as a critical means to contribute to a better understanding on such data through extraction of the latent features, relationships and the associated mechanisms. Therefore, it is important to evaluate how to involve data analytics from both computational and biological perspectives in practice. This paper has investigated interaction relationships in the proteomics area, which provide insights of the critical molecular processes within infection mechanisms. Specifically, we focused on host–pathogen protein–protein interactions, which represented the primary challenges associated with infectious diseases and drug design. Accordingly, a novel framework based on data analytics and machine learning techniques is detailed for analyzing these areas and we will describe the analytical results from host–pathogen protein–protein interactions (HP-PPI). Based on this framework, which serves as a pipeline solution for extracting and learning from the raw proteomics data, we have firstly evaluated several models from literature using different analytic technologies and performance measurements. An unsupervised deep learning model based on stacked denoising autoencoders, is subsequently proposed to capture higher level feature regarding the sequence information in the framework. The achieved performance indicates a superior capability of the unsupervised deep learning model in dealing with the host–pathogen protein interactions scenario among all of these models. The results will further help to enrich a theoretical and technical foundation for analyzing HP-PPI networks.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

References

  1. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, et al (2015) Tensorflow: large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow. org, 1

  2. Akusok A, Björk K-M, Miche Y, Lendasse A (2015) High-performance extreme learning machines: a complete toolbox for big data applications. IEEE Access 3:1011–1025

    Article  Google Scholar 

  3. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M et al (2013) Ncbi geo: archive for functional genomics data sets–update. Nucleic Acids Res 41(D1):D991–D995

    Article  Google Scholar 

  4. Berg JM, Tymoczko JL, Stryer L (2002) Biochemistry. Freeman, New York. ISBN-10: 0-7167-3051-0

  5. Calderone A, Licata L, Cesareni G (2014) VirusMentha: a new resource for virus-host protein interactions. Nucleic Acids Res 43(D1):D588–D592

    Article  Google Scholar 

  6. Chang C-C, Lin C-J (2011) Libsvm: a library for support vector machines. ACM Trans Intell Syst Technol (TIST) 2(3):27

    Google Scholar 

  7. Chaudhari P, Agarwal H, Bhateja V (2019) Data augmentation for cancer classification in oncogenomics: an improved KNN based approach. Evol Intell. https://doi.org/10.1007/s12065-019-00283-w

    Article  Google Scholar 

  8. Chen H, Shen J, Wang L, Song J (2016) Towards data analytics of pathogen–host protein–protein interaction: a survey. In: 2016 IEEE International Congress on Big Data (BigData Congress), IEEE, pp 377–388

  9. Chen H, Shen J, Wang L, Song J (2017) Leveraging stacked denoising autoencoder in prediction of pathogen–host protein–protein interactions. In: 2017 IEEE International Congress on Big Data (BigData Congress), IEEE, pp 368–375

  10. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

    MATH  Google Scholar 

  11. Dagher GG, Machado AP, Davis EC, Green T, Martin J, Ferguson M (2019) Data storage in cellular DNA: contextualizing diverse encoding schemes. Evol Intell. https://doi.org/10.1007/s12065-019-00202-z

    Article  Google Scholar 

  12. Davies MN, Secker A, Freitas AA, Clark E, Timmis J, Flower DR (2008) Optimizing amino acid groupings for GPCR classification. Bioinformatics 24(18):1980–1986

    Article  Google Scholar 

  13. Du Z, Li L, Chen C-F, Philip SY, Wang JZ (2009) G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery. Nucleic Acids Res 37(Suppl_2):W345–W349

    Article  Google Scholar 

  14. Gao M, Zhou H, Skolnick J (2019) Destini: a deep-learning approach to contact-driven protein structure prediction. Sci Rep 9(1):3514

    Article  Google Scholar 

  15. Gene Ontology C et al (2015) Gene ontology consortium: going forward. Nucleic Acids Res 43(D1):D1049–D1056

    Article  Google Scholar 

  16. Goel R, Harsha H, Pandey A, Prasad TK (2012) Human protein reference database and human proteinpedia as resources for phosphoproteome analysis. Mol BioSyst 8(2):453–463

    Article  Google Scholar 

  17. Greene CS, Tan J, Ung M, Moore JH, Cheng C (2014) Big data bioinformatics. J Cell Physiol 229(12):1896–1900

    Article  Google Scholar 

  18. Guo Y, Yu L, Wen Z, Li M (2008) Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res 36(9):3025–3030

    Article  Google Scholar 

  19. Hilbe JM (2009) Logistic regression models. CRC Press, USA

    Google Scholar 

  20. Huang G-B, Zhu Q-Y, Siew C-K (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1):489–501

    Article  Google Scholar 

  21. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci 98(8):4569–4574

    Article  Google Scholar 

  22. Kshirsagar M, Carbonell J, Klein-Seetharaman J (2013a) Multisource transfer learning for host–pathogen protein interaction prediction in unlabeled tasks. NIPS Work Mach Learn Comput Biol 1:3–6

    Google Scholar 

  23. Kshirsagar M, Carbonell J, Klein-Seetharaman J (2013b) Multitask learning for host–pathogen protein interactions. Bioinformatics 29(13):i217–i226

    Article  Google Scholar 

  24. Kshirsagar M, Schleker S, Carbonell J, Klein-Seetharaman J (2015) Techniques for transferring host-pathogen protein interactions knowledge to new tasks. Front Microbiol 6:36

    Article  Google Scholar 

  25. Kumar R, Nanduri B (2010) Hpidb—a unified resource for host–pathogen interactions. BMC Bioinf 11(6):1

    Google Scholar 

  26. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

    Article  Google Scholar 

  27. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659

    Article  Google Scholar 

  28. Masood MMD, Manjula D, Sugumaran V (2018) Identification of new disease genes from protein–protein interaction network. J Ambient Intell Human Comput. https://doi.org/10.1007/s12652-018-0788-1

    Article  Google Scholar 

  29. Mei S, Zhu H (2015) A novel one-class svm based negative data sampling method for reconstructing proteome-wide htlv–human protein interaction networks. Sci Rep 5:8034

    Article  Google Scholar 

  30. Min S, Lee B, Yoon S (2017) Deep learning in bioinformatics. Brief Bioinf 18(5):851–869

    Google Scholar 

  31. Navratil V, de Chassey B, Meyniel L, Delmotte S, Gautier C, André P, Lotteau V, Rabourdin-Combe C (2009) Virhostnet: a knowledge base for the management and the analysis of proteome-wide virus–host interaction networks. Nucleic Acids Res 37(suppl 1):D661–D668

    Article  Google Scholar 

  32. Panda B, Majhi B (2018) A novel improved prediction of protein structural class using deep recurrent neural network. Evol Intell. https://doi.org/10.1007/s12065-018-0171-3

    Article  Google Scholar 

  33. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12(Oct):2825–2830

    MathSciNet  MATH  Google Scholar 

  34. Prabukumar M, Agilandeeswari L, Ganesan K (2019) An intelligent lung cancer diagnosis system using cuckoo search optimization and support vector machine classifier. J Ambient Intell Hum Comput 10(1):267–293

    Article  Google Scholar 

  35. Qi Y, Tastan O, Carbonell JG, Klein-Seetharaman J, Weston J (2010) Semi-supervised multi-task learning for predicting interactions between hiv-1 and human proteins. Bioinformatics 26(18):i645–i652

    Article  Google Scholar 

  36. Savage N (2014) Bioinformatics: big data versus the big c. Nature 509(7502):S66–S67

    Article  Google Scholar 

  37. Schleker S, Kshirsagar M, Klein-Seetharaman J (2015) Comparing human–Salmonella with plant–Salmonella protein–protein interaction predictions. Front Microbiol 6:45

    Article  Google Scholar 

  38. Sen R, Nayak L, De RK (2016) A review on host-pathogen interactions: classification and prediction. Eur J Clin Microbiol Infect Dis 35(10):1581–1599

    Article  Google Scholar 

  39. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H (2007) Predicting protein–protein interactions based only on sequences information. Proc Natl Acad Sci 104(11):4337–4341

    Article  Google Scholar 

  40. Soyemi J, Isewon I, Oyelade J, Adebiyi E (2018) Inter-species/host–parasite protein interaction predictions reviewed. Curr Bioinf 13(4):396–406

    Article  Google Scholar 

  41. Tekir SD, Çakır T, Ardıç E, Sayılırbaş AS, Konuk G, Konuk M, Sarıyer H, Uğurlu A, Karadeniz İ, Özgür A et al (2013) Phisto: pathogen–host interaction search tool. Bioinformatics 29(10):1357–1358

    Article  Google Scholar 

  42. Tomasiello S (2019) A granular functional network classifier for brain diseases analysis. Comput Methods Biomech Biomed Eng Imaging Vis. https://doi.org/10.1080/21681163.2019.1627910

    Article  Google Scholar 

  43. UniProt C et al (2008) The universal protein resource (uniprot). Nucleic Acids Res 36(suppl 1):D190–D195

    Google Scholar 

  44. Varadharajan R, Priyan MK, Panchatcharam P, Vivekanandan S, Gunasekaran M (2018) A new approach for prediction of lung carcinoma using back propogation neural network with decision tree classifiers. J Ambient Intell Human Comput. https://doi.org/10.1007/s12652-018-1066-y

    Article  Google Scholar 

  45. Vincent P, Larochelle H, Bengio Y, Manzagol P-A (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on Machine learning, ACM, pp 1096–1103

  46. Wang F, Liu S, Ni W, Xu Z, Qiu Z, Wan Z, Pan Z (2019) Imbalanced data classification algorithm with support vector machine kernel extensions. Evol Intell 12(3):341–347

    Article  Google Scholar 

  47. Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, Gillespie JJ, Gough R, Hix D, Kenyon R et al (2013) PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res 42(D1):D581–D591

    Article  Google Scholar 

  48. Wikipedia (2017) Decision tree. Accessed 12 Dec 2017

  49. Wikipedia (2017) Naive bayes classifier. Accessed 12 Dec 2017

  50. Yan C, Xie H, Yang D, Yin J, Zhang Y, Dai Q (2018) Supervised hash coding with deep neural network for environment perception of intelligent vehicles. IEEE Trans Intell Transp Syst 19(1):284–295

    Article  Google Scholar 

  51. You Z-H, Lei Y-K, Zhu L, Xia J, Wang B (2013) Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinf 14(8):1

    Google Scholar 

  52. You Z-H, Li S, Gao X, Luo X, Ji Z (2014) Large-scale protein-protein interactions detection by integrating big biosensing data with computational model. BioMed Res Int. https://doi.org/10.1155/2014/598129

    Article  Google Scholar 

  53. Zhang H (2004) The optimality of naive Bayes. AA 1(2):3

    Google Scholar 

  54. Zhang QC, Petrey D, Deng L, Qiang L, Shi Y, Thu CA, Bisikirska B, Lefebvre C, Accili D, Hunter T et al (2012) Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490(7421):556–560

    Article  Google Scholar 

Download references

Acknowledgements

This work was supported by a scholarship from the China Scholarship Council (CSC) while the first author pursues his PhD degree in University of Wollongong, Australia. The first and second authors were also supported by UGPN RCF 2019 to visit University of Surrey to strengthen the algorithmic part and we wish to extend our deepest gratitude to Prof. Yaochu Jin for his valuable suggestions and supports in this paper.

Author information

Affiliations

Authors

Corresponding authors

Correspondence to Huaming Chen or Jun Shen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Chen, H., Shen, J., Wang, L. et al. A framework towards data analytics on host–pathogen protein–protein interactions. J Ambient Intell Human Comput 11, 4667–4679 (2020). https://doi.org/10.1007/s12652-020-01715-7

Download citation

Keywords

  • Protein interactions networks
  • Deep learning
  • Data analytics