Abstract
The broad diversity of biology offers a panoply of interesting categorization problems for the machine learning community. New challenges arise in modern subjects such as protein classification, where huge and complex datasets are common and demand accurate, fast classifiers that can retrieve meaningful biological traits in acceptable time. Although the Support Vector Machine (SVM) algorithm has played a significant role by offering the most precise solutions in diverse domains, the problem of protein classification is far from solved. Other successful kernel methods, such as the Relevance Vector Machine (RVM), and extensions that add feature selection through Recursive Feature Elimination (SVM-RFE and RVM-RFE), were tested in a benchmark environment and compared with other popular statistical models: Nearest Neighbor, Random Forest, Artificial Neural Networks and Logistic Regression. The results show that SVM-RFE can create classifiers with the highest recognition ability, even when using a simple, compact feature set that is easily computed from protein primary structure. In addition, these models deliver predictions orders of magnitude faster than the widely used PSI-BLAST.
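To make the idea of Recursive Feature Elimination concrete, the following is a minimal sketch of the general scheme: train a linear SVM, rank features by squared weight, discard the lowest-ranked feature, and repeat. It is not the chapter's implementation. The function names `train_linear_svm` and `svm_rfe`, the subgradient-descent trainer, the hyperparameters and the toy data are all assumptions chosen for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Toy linear SVM: full-batch subgradient descent on the
    L2-regularised hinge loss, without a bias term (assumed setup)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the regulariser
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:  # sample violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def svm_rfe(X, y, n_keep=1):
    """Return (surviving feature indices, elimination order)."""
    active = list(range(len(X[0])))
    eliminated = []
    while len(active) > n_keep:
        # Retrain on the features that are still active.
        X_sub = [[row[j] for j in active] for row in X]
        w = train_linear_svm(X_sub, y)
        # Ranking criterion: the smallest squared weight is dropped.
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        eliminated.append(active.pop(worst))
    return active, eliminated


# Hypothetical toy data: feature 0 is strongly informative, feature 1
# weakly informative, feature 2 is uncorrelated with the label.
y = [1, 1, 1, 1, -1, -1, -1, -1]
X = [[1.0, 0.2, 0.5], [1.0, 0.2, -0.5], [1.0, 0.2, 0.5], [1.0, 0.2, -0.5],
     [-1.0, -0.2, 0.5], [-1.0, -0.2, -0.5], [-1.0, -0.2, 0.5], [-1.0, -0.2, -0.5]]

kept, order = svm_rfe(X, y, n_keep=1)
print("kept:", kept, "eliminated in order:", order)
```

On this toy data the uninformative feature is removed first and the strongly informative one survives. In practice an established SVM library would replace the toy trainer, and several features are often removed per iteration to save time.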
Abbreviations
- AUC: Area Under the Curve
- FN: False Negative
- FP: False Positive
- FPR: False Positive Rate
- KM: Kernel Machine
- RFE: Recursive Feature Elimination
- ROC: Receiver Operating Characteristic
- RVM: Relevance Vector Machine
- SVM: Support Vector Machine
- TN: True Negative
- TP: True Positive
- TPR: True Positive Rate
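The evaluation quantities abbreviated above relate to one another through standard definitions: TPR = TP / (TP + FN), FPR = FP / (FP + TN), and the AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below illustrates these definitions only. The function names and the four-sample data are hypothetical and are not taken from the chapter.

```python
def confusion_rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels coded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


def auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores.
    Ties count as half a win. Quadratic in the sample count, so this
    is suitable only for small illustrative inputs."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for four proteins (1 = class member).
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # one threshold = one ROC point

print(confusion_rates(y_true, y_pred))  # (0.5, 0.5)
print(auc(y_true, scores))              # 0.75
```

Each decision threshold yields one (FPR, TPR) point, and sweeping the threshold traces the ROC curve. The AUC summarizes that curve in a single number that does not depend on any particular threshold.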
Acknowledgments
This work was executed under the project FCOMP-01-0124-FEDER-010160 (PTDC/EIA/71770/2006), designated BIOINK – Incremental Kernel Learning for Biological Data Analysis, supported by Fundação para a Ciência e Tecnologia and FEDER through Program COMPETE (QREN).
Copyright information
© 2013 Springer Science+Business Media Dordrecht
About this paper
Cite this paper
Morgado, L., Pereira, C., Veríssimo, P., Dourado, A. (2013). Efficient Discriminative Models for Proteomics with Simple and Optimized Features. In: Madureira, A., Reis, C., Marques, V. (eds) Computational Intelligence and Decision Making. Intelligent Systems, Control and Automation: Science and Engineering, vol 61. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-4722-7_9
DOI: https://doi.org/10.1007/978-94-007-4722-7_9
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-4721-0
Online ISBN: 978-94-007-4722-7
eBook Packages: Engineering, Engineering (R0)