Fast Target Set Reduction for Large-Scale Protein Function Prediction: A Multi-class Multi-label Machine Learning Approach

  • Thomas Lingner
  • Peter Meinicke
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5251)


Large-scale sequencing projects have produced a vast number of protein sequences, which have to be assigned to functional categories. Currently, profile hidden Markov models and kernel-based machine learning methods provide the most accurate results for protein classification. However, classifying new sequences with these approaches is computationally expensive. We present an approach for fast scoring of protein sequences by means of feature-based protein sequence representation and multi-class multi-label machine learning techniques. Using the Pfam database, we show that our method provides high computational efficiency and that the approach is well suited for pre-filtering of large sequence sets.
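The general idea of feature-based scoring followed by pre-filtering can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the dipeptide frequency features, the linear per-family weight vectors (imagined as trained offline in a one-vs-rest fashion), the score threshold, and the function names `dipeptide_features`, `candidate_families`, and `prefilter`.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Position of every dipeptide (2-mer) over the 20 standard amino acids.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=2))}


def dipeptide_features(seq):
    """Map a protein sequence to a 400-dimensional dipeptide frequency vector.

    Assumption: dipeptide composition stands in for whatever feature
    representation is actually used.
    """
    counts = [0.0] * len(KMER_INDEX)
    for i in range(len(seq) - 1):
        idx = KMER_INDEX.get(seq[i:i + 2])
        if idx is not None:  # skip 2-mers containing non-standard residues
            counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def candidate_families(seq, weights, threshold=0.5):
    """Return every family whose linear score reaches the threshold.

    `weights` maps a family name to a hypothetical weight vector. A sequence
    may be assigned to several families, which is the multi-label aspect.
    """
    x = dipeptide_features(seq)
    return {
        fam
        for fam, w in weights.items()
        if sum(wi * xi for wi, xi in zip(w, x)) >= threshold
    }


def prefilter(sequences, weights, threshold=0.5):
    """Keep only sequences with at least one candidate family.

    The survivors could then be passed to a slower, more accurate method.
    """
    kept = {}
    for seq in sequences:
        fams = candidate_families(seq, weights, threshold)
        if fams:
            kept[seq] = fams
    return kept


if __name__ == "__main__":
    # Toy weights: each made-up family responds to a single dipeptide.
    toy = {"famA": [0.0] * 400, "famB": [0.0] * 400}
    toy["famA"][KMER_INDEX["AA"]] = 1.0
    toy["famB"][KMER_INDEX["CC"]] = 1.0
    print(prefilter(["AAAA", "AACC", "GGGG"], toy, threshold=0.3))
```

Scoring a sequence costs one pass over the sequence plus one dot product per family, which is why a linear model of this kind is a plausible cheap first stage in front of profile HMM or kernel-based scoring.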


protein classification; large-scale; multi-class; multi-label; Pfam; homology search; metagenomics; target set reduction; protein function prediction; machine learning





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Thomas Lingner (1)
  • Peter Meinicke (1)
  1. Department of Bioinformatics, Institute for Microbiology and Genetics, University of Göttingen, Göttingen, Germany
