
Exploiting causality in gene network reconstruction based on graph embedding


Abstract

Gene network reconstruction is a bioinformatics task that aims at modelling the complex regulatory activities that may occur among genes. This task is typically solved by means of link prediction methods that analyze gene expression data. However, the reconstructed networks often suffer from a large number of false positive edges, which are actually the result of indirect regulation activities due to the presence of common cause and common effect phenomena or, in other terms, due to the fact that the adopted inductive methods do not take into account possible causality phenomena. This issue is accentuated even more by the inherent presence of a high amount of noise in gene expression data. Existing methods for the identification of a transitive reduction of a network or for the removal of (possibly) redundant edges suffer from limitations in the structure of the network or in the nature/length of the indirect regulation, and often require additional pre-processing steps to handle specific peculiarities of the networks (e.g., cycles). Moreover, they are not able to consider possible community structures and possible similar roles of the genes in the network (e.g., hub nodes), which may change the tendency of nodes to be highly connected (and with which nodes) in the network. In this paper, we propose the method INLOCANDA, which learns an inductive predictive model for gene network reconstruction and overcomes all the mentioned limitations. In particular, INLOCANDA is able to (i) identify and exploit indirect relationships of arbitrary length to remove edges due to common cause and common effect phenomena; (ii) take into account possible community structures and possible similar roles by means of graph embedding. Experiments performed along multiple dimensions of analysis on benchmark, real networks of two organisms (E. coli and S. cerevisiae) show a higher accuracy with respect to the competitors, as well as a higher robustness to the presence of noise in the data, also when a huge amount of (possibly false positive) interactions is removed. Availability: http://www.di.uniba.it/~gianvitopio/systems/inlocanda/
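
The general idea of removing edges that can be attributed to indirect regulation can be illustrated with a minimal sketch. This is not the INLOCANDA algorithm, whose details are not given in this abstract; it is a simplified, transitive-reduction-style heuristic under stated assumptions. The function name `prune_indirect_edges`, the confidence weights, and the rule "drop an edge if a detour of strictly stronger edges connects the same pair" are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def prune_indirect_edges(weights):
    """Drop each directed edge (u, v) that is explained by an indirect path.

    `weights` maps (source, target) pairs to a confidence score. An edge is
    dropped when some path u -> ... -> v with at least two edges exists in
    which every edge is strictly stronger than the direct edge. All decisions
    are taken against the original network, so cycles need no pre-processing.
    This is an illustrative heuristic, not the method proposed in the paper.
    """
    successors = defaultdict(set)
    for source, target in weights:
        successors[source].add(target)

    def has_stronger_detour(src, dst, floor):
        # First hop: any neighbour other than dst, via an edge stronger than floor.
        stack = [n for n in successors[src]
                 if n != dst and weights[(src, n)] > floor]
        seen = set(stack) | {src}
        # Depth-first search over edges stronger than floor; paths may have any length.
        while stack:
            node = stack.pop()
            for nxt in successors[node]:
                if weights[(node, nxt)] <= floor:
                    continue
                if nxt == dst:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return {edge: w for edge, w in weights.items()
            if not has_stronger_detour(edge[0], edge[1], w)}


# Hypothetical predicted network: A -> C is weak and is also reachable via B.
example = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.8,
    ("A", "C"): 0.3,
    ("C", "A"): 0.7,
}
pruned = prune_indirect_edges(example)
print(sorted(pruned))
```

In this toy network the weak edge A -> C is discarded because the path A -> B -> C consists of stronger edges, while the cycle-closing edge C -> A is kept. Requiring strictly stronger detours means that two edges can never justify each other's removal.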



Notes

  1. A formal definition of neighborhood will be introduced later.

  2. The fact that this dataset is relatively difficult to analyze is confirmed by the maximum AUPRC obtained by the method proposed in Marbach et al. (2012) and by GENERE (Ceci et al. 2015), which are 0.09 and 0.12, respectively.

References

  1. Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6(1), 37–66.

  2. Aho, A. V., Garey, M. R., & Ullman, J. D. (1972). The transitive reduction of a directed graph. SIAM Journal on Computing, 1(2), 131–137.

  3. Atias, N., & Sharan, R. (2012). Comparative analysis of protein networks: Hard problems, practical solutions. Communications of the ACM, 55(5), 88–97.

  4. Babu, M. M., Luscombe, N. M., Aravind, L., Gerstein, M., & Teichmann, S. A. (2004). Structure and evolution of transcriptional regulatory networks. Current Opinion in Structural Biology, 14(3), 283–291.

  5. Belkin, M., & Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems 14 (pp. 585–591). Cambridge: MIT Press.

  6. Berger, M. F., & Bulyk, M. L. (2009). Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nature Protocols, 4(3), 393–411.

  7. Blockeel, H., De Raedt, L., & Ramon, J. (1998). Top-down induction of clustering trees. In J. W. Shavlik (Ed.), ICML 1998 (pp. 55–63). Burlington: Morgan Kaufmann.

  8. Böck, M., Ogishima, S., Tanaka, H., Kramer, S., & Kaderali, L. (2012). Hub-centered gene network reconstruction using automatic relevance determination. PLOS ONE, 7(5), 1–17.

  9. Bošnački, D., Odenbrett, M. R., Wijs, A., Ligtenberg, W., & Hilbers, P. (2012). Efficient reconstruction of biological networks via transitive reduction on general purpose graphics processors. BMC Bioinformatics, 13(1), 281.

  10. Bulyk, M. L. (2005). Discovering DNA regulatory elements with bacteria. Nature Biotechnology, 23(8), 942–944.

  11. Ceci, M., Pio, G., Kuzmanovski, V., & Džeroski, S. (2015). Semi-supervised multi-view learning for gene network reconstruction. PLOS ONE, 10(12), 1–27.

  12. Cohen, W. W. (1995). Fast effective rule induction. In Proceedings of the twelfth international conference on machine learning, ICML’95 (pp. 115–123). San Francisco, CA: Morgan Kaufmann Publishers Inc.

  13. de Jong, H. (2002). Modeling and simulation of genetic regulatory systems: A literature review. Journal of Computational Biology, 9(1), 67–103.

  14. Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.

  15. Emmert-Streib, F., Glazko, G., De Matos Simoes, R., et al. (2012). Statistical inference and reverse engineering of gene regulatory networks from observational expression data. Bioinformatics and Computational Biology, 3, 8.

  16. Gallagher, B., & Eliassi-Rad, T. (2010). Leveraging label-independent features for classification in sparsely labeled networks: An empirical study. In L. Giles, M. Smith, J. Yen, & H. Zhang (Eds.), Advances in Social Network Mining and Analysis (pp. 1–19). Berlin: Springer.

  17. Geistlinger, L., Csaba, G., Dirmeier, S., Küffner, R., & Zimmer, R. (2013). A comprehensive gene regulatory network for the diauxic shift in Saccharomyces cerevisiae. Nucleic Acids Research, 41(18), 8452–8463. https://doi.org/10.1093/nar/gkt631.

  18. Grover, A., & Leskovec, J. (2016). Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’16 (pp. 855–864). New York, NY: ACM.

  19. Hase, T., Ghosh, S., Yamanaka, R., & Kitano, H. (2013). Harnessing diversity towards the reconstructing of large scale gene regulatory networks. PLoS Computational Biology, 9(11), e1003361.

  20. Hecker, M., Lambeck, S., Toepfer, S., Van Someren, E., & Guthke, R. (2009). Gene regulatory network inference: Data integration in dynamic models—A review. Biosystems, 96(1), 86–103.

  21. Hempel, S., Koseska, A., Nikoloski, Z., & Kurths, J. (2011). Unraveling gene regulatory networks from time-resolved gene expression data—A measures comparison study. BMC Bioinformatics, 12(1), 292.

  22. Henderson, K., Gallagher, B., Li, L., Akoglu, L., Eliassi-Rad, T., Tong, H., & Faloutsos, C. (2011). It’s who you know: Graph mining using recursive structural features. In Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’11 (pp. 663–671). New York: ACM.

  23. Hsu, H. T. (1975). An algorithm for finding a minimal equivalent graph of a digraph. Journal of the ACM, 22(1), 11–16.

  24. Ibarguren, I., Lasarguren, A., Pérez, J. M., Muguerza, J., Gurrutxaga, I., & Arbelaitz, O. (2016). Bfpart: Best-first part. Information Sciences, 367–368, 927–952.

  25. Itani, S., Ohannessian, M., Sachs, K., Nolan, G.P., & Dahleh, M.A. (2008). Structure learning in causal cyclic networks. In Proceedings of the international conference on causality: objectives and assessment—Vol. 6, COA’08 (pp. 165–176) JMLR.org.

  26. Korb, K. B., & Nicholson, A. E. (2010). Bayesian Artificial Intelligence (2nd ed.). Boca Raton, FL: CRC Press Inc.

  27. Li, J., & Xie, D. (2015). Rack1, a versatile hub in cancer. Oncogene, 34(15), 1890–1898.

  28. Lo, L., Wong, M., Lee, K., & Leung, K. (2015). Time delayed causal gene regulatory network inference with hidden common causes. PLOS ONE, 10(9), 1–47.

  29. Lü, L., & Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6), 1150–1170.

  30. Marbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., et al. (2012). Wisdom of crowds for robust gene network inference. Nature Methods, 9, 796–804.

  31. Margolin, A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R., et al. (2006). Aracne: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7(Suppl 1), S7.

  32. Markowetz, F., & Spang, R. (2007). Inferring cellular networks—A review. BMC Bioinformatics, 8(Suppl 6), S5.

  33. Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR arXiv:1301.3781

  34. Omranian, N., Eloundou-Mbebi, J. M. O., Mueller-Roeber, B., & Nikoloski, Z. (2016). Gene regulatory network inference using fused lasso on multiple data sets. Scientific Reports, 6, 20533.

  35. Park, P. J. (2009). ChIP-seq: Advantages and challenges of a maturing technology. Nature Reviews Genetics, 10(10), 669–680.

  36. Pearl, J. (2000). Causality: Models, reasoning, and inference. New York, NY: Cambridge University Press.

  37. Penfold, C. A., & Wild, D. L. (2011). How to infer gene networks from expression profiles, revisited. Interface Focus, 1(6), 857–870.

  38. Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’14 (pp. 701–710). New York, NY: ACM.

  39. Pinna, A., Soranzo, N., & de la Fuente, A. (2010). From knockouts to networks: Establishing direct cause-effect relationships through graph analysis. PLoS ONE, 5(10), e12912.

  40. Pio, G., Ceci, M., Malerba, D., & D’Elia, D. (2015). ComiRNet: A web-based system for the analysis of miRNA-gene regulatory networks. BMC Bioinformatics, 16(9), S7.

  41. Pio, G., Ceci, M., Prisciandaro, F., & Malerba, D. (2017). LOCANDA: Exploiting causality in the reconstruction of gene regulatory networks. In A. Yamamoto, T. Kida, T. Uno, & T. Kuboyama (Eds.), Discovery science 2017, Lecture notes in computer science (Vol. 10558, pp. 283–297). Berlin: Springer.

  42. Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.

  43. Selvanathan, S. P., Graham, G. T., Erkizan, H. V., Dirksen, U., Natarajan, T. G., Dakic, A., et al. (2015). Oncogenic fusion protein ews-fli1 is a network hub that regulates alternative splicing. Proceedings of the National Academy of Sciences, 112(11), E1307–E1316.

  44. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., & Mei, Q. (2015). Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, WWW ’15 (pp. 1067–1077). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland.

  45. Tenenbaum, J. B., Silva, V. D., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.

  46. Thattai, M., & van Oudenaarden, A. (2001). Intrinsic noise in gene regulatory networks. Proceedings of the National Academy of Sciences, 98(15), 8614–8619.

  47. Van den Bulcke, T., Van Leemput, K., Naudts, B., van Remortel, P., Ma, H., Verschoren, A., et al. (2006). SynTReN: A generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics, 7, 43.

  48. Vilalta, R., & Drissi, Y. (2002). A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2), 77–95.

  49. Yu, D., Lim, J., Wang, X., Liang, F., & Xiao, G. (2017). Enhanced construction of gene regulatory networks using hub gene information. BMC Bioinformatics, 18(1), 186.

  50. Zitnik, M., & Zupan, B. (2015). Data imputation in epistatic MAPs by network-guided matrix completion. Journal of Computational Biology, 22(6), 595–608.


Acknowledgements

We would like to acknowledge the support of the European Commission through the Projects MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (Grant Number ICT-2013-612944) and TOREADOR - Trustworthy Model-aware Analytics Data Platform (Grant Number H2020-688797). We would also like to thank Lynn Rudd for her help in reading and correcting the manuscript.

Author information

Correspondence to Gianvito Pio.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editors: Takuya Kida, Takeaki Uno, Tetsuji Kuboyama, Akihiro Yamamoto.

Appendices

Appendix 1: Results obtained by KNN classifier

See Figs. 19, 20, 21, 22 and 23.

Fig. 19

Box plots depicting the results obtained by KNN on SynTReN E. coli datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 20

Box plots depicting the results obtained by KNN on SynTReN E. coli datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 21

Box plots depicting the results obtained by KNN on SynTReN Yeast datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 22

Box plots depicting the results obtained by KNN on SynTReN Yeast datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 23

Box plots depicting the results obtained by KNN on the DREAM5 E. coli dataset, by varying the threshold on the weight of the original network edges

Appendix 2: Results obtained by JCHAID classifier

See Figs. 24, 25, 26, 27 and 28.

Fig. 24

Box plots depicting the results obtained by JCHAID on SynTReN E. coli datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 25

Box plots depicting the results obtained by JCHAID on SynTReN E. coli datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 26

Box plots depicting the results obtained by JCHAID on SynTReN Yeast datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 27

Box plots depicting the results obtained by JCHAID on SynTReN Yeast datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 28

Box plots depicting the results obtained by JCHAID on the DREAM5 E. coli dataset, by varying the threshold on the weight of the original network edges

Appendix 3: Results obtained by JRIP classifier

See Figs. 29, 30, 31, 32 and 33.

Fig. 29

Box plots depicting the results obtained by JRIP on SynTReN E. coli datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 30

Box plots depicting the results obtained by JRIP on SynTReN E. coli datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 31

Box plots depicting the results obtained by JRIP on SynTReN Yeast datasets with 100 nodes, by varying the threshold on the weight of the original network edges

Fig. 32

Box plots depicting the results obtained by JRIP on SynTReN Yeast datasets with 200 nodes, by varying the threshold on the weight of the original network edges

Fig. 33

Box plots depicting the results obtained by JRIP on the DREAM5 E. coli dataset, by varying the threshold on the weight of the original network edges


About this article


Cite this article

Pio, G., Ceci, M., Prisciandaro, F. et al. Exploiting causality in gene network reconstruction based on graph embedding. Mach Learn (2019) doi:10.1007/s10994-019-05861-8


Keywords

  • Causality
  • Bioinformatics
  • Network Reconstruction
  • Link prediction