Efficient Discriminative Models for Proteomics with Simple and Optimized Features

Morgado, Lionel; Pereira, Carlos; Veríssimo, Paula; Dourado, António

doi:10.1007/978-94-007-4722-7_9

Part of the book series: Intelligent Systems, Control and Automation: Science and Engineering (ISCA, volume 61)

Abstract

The broad diversity in biology offers a panoply of interesting categorization problems for the machine learning community. New challenges arise in modern subjects such as protein classification, where huge and complex datasets are common and demand classifiers that are both accurate and fast enough to retrieve meaningful biological traits in acceptable time. Although the Support Vector Machine algorithm has played a significant role by offering the most precise solutions in diverse domains, the problem of protein classification is far from solved. Other successful Kernel Methods, such as the Relevance Vector Machine, and extensions that incorporate Recursive Feature Elimination to perform feature selection, namely SVM-RFE and RVM-RFE, were tested in a benchmark environment and compared with other popular statistical models: Nearest Neighbor, Random Forest, Artificial Neural Networks and Logistic Regression. The results show that SVM-RFE can create classifiers with the highest recognition ability even when using a simple, compact feature set that is easily computed from protein primary structure. Moreover, these models produce predictions orders of magnitude faster than the commonly used PSI-BLAST.
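To make the feature-selection idea in the abstract concrete, the following is a minimal sketch of Recursive Feature Elimination wrapped around a linear SVM. It is not the chapter's implementation: the function names `train_linear_svm` and `svm_rfe` are hypothetical, the solver is a plain full-batch subgradient descent standing in for a real SVM library, the ranking uses the squared weight of each feature (a common criterion for linear SVM-RFE), and the toy data are invented so that only feature 0 carries label information.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    Stand-in solver for illustration only; labels must be +1 or -1.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # sample lies inside the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, n_keep):
    """Drop one feature per round: the one with the smallest squared weight."""
    active = list(range(len(X[0])))  # indices of surviving features
    eliminated = []                  # indices in order of removal
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_active, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        eliminated.append(active.pop(worst))
    return active, eliminated


# Invented toy data: feature 0 equals the label, features 1 and 2 do not help.
X = [[1.0, 0.5, 0.0], [1.0, -0.5, 0.0], [-1.0, 0.5, 0.0], [-1.0, -0.5, 0.0]]
y = [1, 1, -1, -1]
kept, eliminated = svm_rfe(X, y, n_keep=1)
print("kept:", kept, "eliminated (in order):", eliminated)
```

Retraining after every removal is what distinguishes RFE from a one-shot weight ranking: the weights of the surviving features are re-estimated once a competitor is gone. In practice one would presumably replace the stand-in solver with an established SVM library and remove features in larger steps.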

Abbreviations

AUC:: Area Under the Curve
FN:: False Negative
FP:: False Positive
FPR:: False Positive Rate
KM:: Kernel Machine
RFE:: Recursive Feature Elimination
ROC:: Receiver Operating Characteristic
RVM:: Relevance Vector Machine
SVM:: Support Vector Machine
TN:: True Negative
TP:: True Positive
TPR:: True Positive Rate
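
The evaluation terms in this list are related in a simple way: thresholding a classifier's scores yields TP, FP, TN and FN counts; TPR = TP / (TP + FN) and FPR = FP / (FP + TN) give one point on the ROC curve; and the AUC is the area under the curve traced as the threshold is lowered. The sketch below illustrates this with invented labels and scores; the function names are hypothetical, and it assumes both classes are present in the data.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold from high to low."""
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
        fpr = fp / (fp + tn)  # requires at least one negative example
        tpr = tp / (tp + fn)  # requires at least one positive example
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Invented example: 1 = positive class, 0 = negative class.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print("AUC:", auc(roc_points(labels, scores)))
```

For these six examples, 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9; an AUC of 1.0 corresponds to a perfect ranking and 0.5 to a random one.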

Acknowledgments

This work was executed under the project FCOMP-01-0124-FEDER-010160 (PTDC/EIA/71770/2006), designated BIOINK – Incremental Kernel Learning for Biological Data Analysis, supported by Fundação para a Ciência e Tecnologia and FEDER through Program COMPETE (QREN).

Author information

Authors and Affiliations

Center for Informatics and Systems of the University of Coimbra, Polo II – University of Coimbra, Coimbra, Portugal
Lionel Morgado, Carlos Pereira & António Dourado
Department of Informatics Engineering and Systems, Coimbra Institute of Engineering – ISEC, Quinta da Nora, 3030-199, Coimbra, Portugal
Carlos Pereira
Department of Biochemistry and Center for Neuroscience and Cell Biology, University of Coimbra, 3004-517, Coimbra, Portugal
Paula Veríssimo

Corresponding author

Correspondence to Lionel Morgado.

Editor information

Editors and Affiliations

Computer Science Department, Polytechnique Institute of Porto, Rua Dr. Bernardino de Almeida 431, Porto, 4200-072, Portugal
Ana Madureira
Electrical and Engineering Department, Polytechnique Institute of Porto, Rua Dr. Bernardino de Almeida 431, Porto, 4200-072, Portugal
Cecilia Reis
Department of Computer Science and Syste, Polytechnique Institute of Coimbra, Rua Pedro Nunes - Quinta da Nora, Coimbra, 3030-199, Portugal
Viriato Marques

About this paper

Cite this paper

Morgado, L., Pereira, C., Veríssimo, P., Dourado, A. (2013). Efficient Discriminative Models for Proteomics with Simple and Optimized Features. In: Madureira, A., Reis, C., Marques, V. (eds) Computational Intelligence and Decision Making. Intelligent Systems, Control and Automation: Science and Engineering, vol 61. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-4722-7_9

DOI: https://doi.org/10.1007/978-94-007-4722-7_9
Published: 01 August 2012
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-4721-0
Online ISBN: 978-94-007-4722-7
eBook Packages: Engineering, Engineering (R0)
