Efficient Discriminative Models for Proteomics with Simple and Optimized Features

  • Lionel Morgado
  • Carlos Pereira
  • Paula Veríssimo
  • António Dourado
Conference paper
Part of the Intelligent Systems, Control and Automation: Science and Engineering book series (ISCA, volume 61)


The broad diversity in biology offers a panoply of interesting categorization problems for the machine learning community. New challenges arise in modern subjects such as protein classification, where huge and complex datasets are common and demand accurate, fast classifiers that can retrieve meaningful biological traits in acceptable time. Although the Support Vector Machine (SVM) algorithm has played a significant role by offering the most precise solutions in diverse domains, the problem of protein classification is far from solved. Other successful kernel methods, such as the Relevance Vector Machine (RVM), and extensions that combine these algorithms with Recursive Feature Elimination (RFE) to perform feature selection, namely SVM-RFE and RVM-RFE, were tested in a benchmark environment and compared with other popular statistical models such as Nearest Neighbor, Random Forest, Artificial Neural Networks and Logistic Regression. The results show that SVM-RFE can create classifiers with the highest recognition ability, even when using a simple, compact feature set that is easily computed from protein primary structure. In addition, these models deliver predictions in a time reduced by orders of magnitude compared with the commonly used PSI-BLAST.
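As an illustration of what a simple feature set computed from protein primary structure could look like, the sketch below derives amino acid composition (the relative frequency of each of the 20 standard residues) from a sequence string. This is only an assumption made for illustration: the abstract does not state which features the authors used, and the names `AMINO_ACIDS` and `composition_features` are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every protein
# maps to a vector with the same layout (assumed feature definition).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Return the relative frequency of each standard amino acid.

    Characters outside the 20 standard residues (e.g. 'X') are ignored.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No recognizable residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, not taken from any benchmark.
features = composition_features("MKVLAAGIVK")
```

A fixed-length vector of this kind can be passed directly to any of the classifiers named in the abstract, and it is cheap to compute because it needs a single pass over the sequence.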


Protein Family Classification; Kernel Machines; Support Vector Machine; Relevance Vector Machine; Feature Selection; Recursive Feature Elimination



Area Under the Curve


False Negative


False Positive


False Positive Rate


Kernel Machine


Recursive Feature Elimination


Receiver Operating Characteristic


Relevance Vector Machine


Support Vector Machine


True Negative


True Positive


True Positive Rate
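The evaluation terms listed above are related by standard definitions: the True Positive Rate is TP / (TP + FN), the False Positive Rate is FP / (FP + TN), and the Area Under the (ROC) Curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The following minimal sketch shows these relations; the function names and the toy numbers are hypothetical and are not results from the paper.

```python
def rates(tp, fp, tn, fn):
    """Return (TPR, FPR) from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)  # fraction of positives correctly recognized
    fpr = fp / (fp + tn)  # fraction of negatives wrongly flagged
    return tpr, fpr


def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    Counts the positive/negative pairs in which the positive example
    receives the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy values chosen only to illustrate the definitions.
tpr, fpr = rates(tp=8, fp=1, tn=9, fn=2)
area = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

The pairwise form is quadratic in the number of examples, which is acceptable for a small illustration; a sort-based computation would be preferable on large benchmark sets.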



This work was executed under the project FCOMP-01-0124-FEDER-010160 (PTDC/EIA/71770/2006), designated BIOINK – Incremental Kernel Learning for Biological Data Analysis, supported by Fundação para a Ciência e Tecnologia and FEDER through Program COMPETE (QREN).



Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Lionel Morgado (1)
  • Carlos Pereira (1, 2)
  • Paula Veríssimo (3)
  • António Dourado (1)
  1. Center for Informatics and Systems of the University of Coimbra, Polo II – University of Coimbra, Coimbra, Portugal
  2. Department of Informatics Engineering and Systems, Coimbra Institute of Engineering – ISEC, Coimbra, Portugal
  3. Department of Biochemistry and Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
