Abstract
This work focuses on the use of ensembles of classifiers for predicting HIV protease cleavage sites in proteins. Because of the complex relationships in biological data, several recent works show that ensembles of learning algorithms often outperform stand-alone methods. We show that fusing approaches based on different encoding models can improve performance on this classification problem. In particular, four different feature encodings for peptides are described and tested. An extensive evaluation on a large dataset, carried out according to a blind testing protocol, demonstrates how different feature extraction methods and classifiers can be combined to obtain a robust and reliable system. A comparison with stand-alone approaches quantifies the performance improvement obtained by the proposed ensembles.
Notes
Available at http://www.genome.jp/dbget/aaindex.html.
Implemented as in OSU toolbox. http://www.ece.osu.edu/~maj/osu_svm/.
Implemented as in GAOT (Genetic Algorithms for Optimization Toolbox) http://www.ie.ncsu.edu/mirage/GAToolBox/gaot/.
Implemented as in PRtools 3.1.7 toolbox http://130.161.42.18/prtools/.
AUC is implemented as in dd_tools 0.95 davidt@ph.tn.tudelft.nl.
Before the fusion the scores of the classifiers are normalized to mean 0 and standard deviation 1.
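The normalization described in the last note can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `zscore` and `fuse_sum_rule`, the example scores, and the choice of a plain average as the fusion rule are all assumptions made here for the example. In practice the mean and standard deviation would typically be estimated on training scores rather than on the scores being fused.

```python
import numpy as np

def zscore(scores):
    """Rescale a 1-D array of classifier scores to mean 0 and standard deviation 1."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        # Constant scores carry no ranking information; map them all to 0.
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std

def fuse_sum_rule(score_lists):
    """Average the z-scored outputs of several classifiers.

    The averaging (sum) rule is an assumption for this sketch; the note
    above only states that scores are normalized before fusion.
    """
    normalized = np.vstack([zscore(s) for s in score_lists])
    return normalized.mean(axis=0)

# Hypothetical scores from two classifiers on the same four peptides.
# The two classifiers use very different score ranges, which is why
# normalization is needed before combining them.
svm_scores = [0.2, 1.5, -0.7, 0.9]
other_scores = [10.0, 40.0, 5.0, 30.0]

fused = fuse_sum_rule([svm_scores, other_scores])
```

Without the normalization step, the classifier with the larger score range would dominate the average regardless of how reliable it is.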
Acknowledgments
The authors thank A. Kontijevskis of Uppsala University for sharing the UPPSALA dataset.
Cite this article
Nanni, L., Lumini, A. Using ensemble of classifiers for predicting HIV protease cleavage sites in proteins. Amino Acids 36, 409–416 (2009). https://doi.org/10.1007/s00726-008-0076-z
DOI: https://doi.org/10.1007/s00726-008-0076-z