Phenotype Inference from Text and Genomic Data

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10536)


We describe ProTraits, a machine learning pipeline that systematically annotates microbes with phenotypes using a large amount of textual data from scientific literature and other online resources, as well as genome sequencing data. Moreover, by relying on a multi-view non-negative matrix factorization approach, the ProTraits pipeline can also discover novel phenotypic concepts from unstructured text. We present the main components of the developed pipeline and outline challenges for its application to other fields.
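To make the idea of multi-view non-negative matrix factorization concrete, the following is a minimal sketch under stated assumptions, not the ProTraits implementation. It assumes one common formulation in which every view (for example, one text source per view) is a non-negative matrix with the same rows (organisms), all views share a single row-factor matrix `W`, and each view has its own basis `H_v`. The factors are fitted with standard multiplicative updates for the squared Frobenius error. The function names, the toy data, and the number of factors are hypothetical.

```python
import numpy as np


def multiview_nmf(views, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize each view X_v (n x d_v) as W @ H_v with a shared W (n x k).

    Illustrative assumption: a shared-W formulation with multiplicative
    updates; this is not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = views[0].shape[0]
    W = rng.random((n, k)) + eps
    Hs = [rng.random((k, X.shape[1])) + eps for X in views]
    for _ in range(n_iter):
        # Update each view-specific basis H_v with W held fixed.
        for i, X in enumerate(views):
            H = Hs[i]
            Hs[i] = H * (W.T @ X) / (W.T @ W @ H + eps)
        # Update the shared factor W using evidence pooled over all views.
        numerator = sum(X @ H.T for X, H in zip(views, Hs))
        denominator = W @ sum(H @ H.T for H in Hs) + eps
        W = W * numerator / denominator
    return W, Hs


def reconstruction_error(views, W, Hs):
    """Sum of squared Frobenius reconstruction errors over all views."""
    return sum(np.linalg.norm(X - W @ H) ** 2 for X, H in zip(views, Hs))


def make_toy_views(n=30, dims=(40, 25), k=3, seed=1):
    """Hypothetical toy data: two non-negative views sharing latent factors."""
    rng = np.random.default_rng(seed)
    W_true = rng.random((n, k))
    return [W_true @ rng.random((k, d)) for d in dims]


views = make_toy_views()
W, Hs = multiview_nmf(views, k=3)
# Each column of W scores every row (organism) on one latent concept;
# the large entries in a row of H_v show which features of view v define it.
top_features_view0 = np.argsort(Hs[0], axis=1)[:, -5:]
```

In a text setting, the highest-weighted features of each latent factor would be the candidate terms describing a concept, and agreement across views is what the shared `W` is meant to capture in this sketch.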


Keywords: Phenotypic trait, Microbes, Comparative genomics, Late fusion, Text mining, Non-negative matrix factorization



This work has been funded by the European Union FP7 grant ICT-2013-612944 (MAESTRA) and the Croatian Science Foundation grant HRZZ-9623.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Ruđer Bošković Institute, Zagreb, Croatia
  2. Mediterranean Institute of Life Sciences, Split, Croatia
  3. Centre for Genomic Regulation, Barcelona, Spain
