Efficient Discriminative Models for Proteomics with Simple and Optimized Features

  • Conference paper
Computational Intelligence and Decision Making

Abstract

The broad diversity in biology offers a panoply of interesting categorization problems for the machine learning community. New challenges arise in modern subjects such as protein classification, where huge and complex datasets are common and demand classifiers that are both accurate and fast enough to retrieve meaningful biological traits in acceptable time. Although the Support Vector Machine (SVM) algorithm has played a significant role by offering the most precise solutions in diverse domains, the problem of protein classification is far from solved. Other successful Kernel Methods, such as the Relevance Vector Machine (RVM), and extensions that combine them with Recursive Feature Elimination (RFE) to perform feature selection, namely SVM-RFE and RVM-RFE, were tested in a benchmark environment and compared with other popular statistical models: Nearest Neighbor, Random Forest, Artificial Neural Networks and Logistic Regression. The results show that SVM-RFE can create the classifiers with the highest recognition ability, even when using a simple, compact feature set that is easily computed from protein primary structure. Moreover, these models deliver predictions in a time reduced by orders of magnitude compared with the commonly used PSI-BLAST.
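To make the SVM-RFE idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes amino acid composition as a stand-in for the "simple compact feature set" (the abstract does not name the features), and it trains a linear SVM with a Pegasos-style subgradient method instead of a library solver. The names `composition`, `train_linear_svm` and `svm_rfe` are hypothetical. Each RFE round retrains the classifier and discards the feature with the smallest squared weight.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Fraction of each standard amino acid in a sequence (assumed feature set)."""
    total = max(len(sequence), 1)
    return [sequence.count(aa) / total for aa in AMINO_ACIDS]


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 or -1. No bias term is fitted, to keep the sketch short.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Shrink the weights (regulariser), then correct margin violations.
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def svm_rfe(X, y, n_keep):
    """Recursive Feature Elimination with a linear SVM.

    Returns (indices of kept features, indices removed in elimination order).
    """
    active = list(range(len(X[0])))
    removed = []
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w = train_linear_svm(X_active, y)
        # Ranking criterion: the feature with the smallest w_j^2 is dropped.
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        removed.append(active.pop(worst))
    return active, removed


# Toy usage with made-up sequences and labels; this is not benchmark data.
sequences = ["AAAGAAKA", "AGAAAALA", "WWKWLWWG", "WLWWGWKW"]
labels = [1, 1, -1, -1]
features = [composition(s) for s in sequences]
kept, removed = svm_rfe(features, labels, n_keep=3)
print([AMINO_ACIDS[j] for j in kept])
```

Removing one feature per round is the simplest schedule. For larger feature sets, dropping several features per round would reduce the number of retraining passes at the cost of a coarser ranking.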

Abbreviations

AUC: Area Under the Curve

FN: False Negative

FP: False Positive

FPR: False Positive Rate

KM: Kernel Machine

RFE: Recursive Feature Elimination

ROC: Receiver Operating Characteristic

RVM: Relevance Vector Machine

SVM: Support Vector Machine

TN: True Negative

TP: True Positive

TPR: True Positive Rate
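The evaluation terms above are related by standard definitions: TPR = TP / (TP + FN), FPR = FP / (FP + TN), and the AUC is the area under the ROC curve of TPR against FPR. The sketch below is illustrative only, with hypothetical function names and made-up scores. It computes the AUC through the equivalent rank interpretation, the probability that a randomly chosen positive example scores higher than a randomly chosen negative one.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (TPR, FPR) from confusion counts."""
    return tp / (tp + fn), fp / (fp + tn)


def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for label, s in zip(y_true, scores) if label == 1]
    neg = [s for label, s in zip(y_true, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = positive, 0 = negative) and classifier scores.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(rates(*confusion_counts(y_true, scores, threshold=0.5)))  # (0.5, 0.5)
print(auc(y_true, scores))  # 0.75
```

The pairwise form is quadratic in the number of examples, which is fine for illustration. A sort-based computation would be preferable for benchmark-sized datasets.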

Acknowledgments

This work was executed under the project FCOMP-01-0124-FEDER-010160 (PTDC/EIA/71770/2006), designated BIOINK – Incremental Kernel Learning for Biological Data Analysis, supported by Fundação para a Ciência e Tecnologia and FEDER through Program COMPETE (QREN).

Author information

Corresponding author

Correspondence to Lionel Morgado.

Copyright information

© 2013 Springer Science+Business Media Dordrecht

About this paper

Cite this paper

Morgado, L., Pereira, C., Veríssimo, P., Dourado, A. (2013). Efficient Discriminative Models for Proteomics with Simple and Optimized Features. In: Madureira, A., Reis, C., Marques, V. (eds) Computational Intelligence and Decision Making. Intelligent Systems, Control and Automation: Science and Engineering, vol 61. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-4722-7_9

  • DOI: https://doi.org/10.1007/978-94-007-4722-7_9

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-007-4721-0

  • Online ISBN: 978-94-007-4722-7

  • eBook Packages: Engineering, Engineering (R0)
