
Phenotype Prediction with Semi-supervised Classification Trees

  • Jurica Levatić
  • Maria Brbić
  • Tomaž Stepišnik Perdih
  • Dragi Kocev
  • Vedrana Vidulin
  • Tomislav Šmuc
  • Fran Supek
  • Sašo Džeroski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10785)

Abstract

In this work, we address the task of predicting phenotypic traits using semi-supervised learning methods. More specifically, we propose to use supervised and semi-supervised classification trees, as well as supervised and semi-supervised random forests of classification trees. We consider 114 datasets for different phenotypic traits, covering 997 microbial species. These datasets pose a challenge for existing machine learning methods: they are only partially labeled (annotated), and their class distribution is typically imbalanced. We investigate whether approaching phenotype prediction as a semi-supervised learning task can improve predictive performance. The results suggest that the semi-supervised methodology considered here is most helpful when using single trees, particularly when 20 to 40% of the data are labeled. Similar improvements are seen when the presence of the phenotype is highly imbalanced.

Keywords

Semi-supervised learning · Phenotype · Decision trees · Predictive clustering trees · Random forests · Binary classification

Notes

Acknowledgments

We acknowledge the financial support of the Slovenian Research Agency, via the grant P2-0103 and a young researcher grant to TSP; the Croatian Science Foundation, via the grant HRZZ-9623 (DescriptiveInduction); and the European Commission, via the grants ICT-2013-612944 MAESTRA and ICT-2013-604102 HBP. We would also like to acknowledge the joint support of the Republic of Slovenia and the European Union under the European Regional Development Fund (grant “Raziskovalci-2.0-FIŠ-52900”, implementation of the operation no. C3330-17-529008).

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Jurica Levatić (1, 2)
  • Maria Brbić (3)
  • Tomaž Stepišnik Perdih (1, 2)
  • Dragi Kocev (1, 2)
  • Vedrana Vidulin (1, 3, 4)
  • Tomislav Šmuc (3)
  • Fran Supek (3, 5)
  • Sašo Džeroski (1, 2)
  1. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
  2. Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
  3. Division of Electronics, Ruder Boskovic Institute, Zagreb, Croatia
  4. Faculty of Information Studies, Novo Mesto, Slovenia
  5. Center for Genomic Regulation, Barcelona, Spain
