Abstract
The last decade has seen an explosion in the collection of protein data. To actualize the potential offered by this wealth of data, it is important to develop machine systems capable of classifying and extracting features from proteins. Reliable machine systems for protein classification offer many benefits, including the promise of finding novel drugs and vaccines. In developing our system, we analyze and compare several feature extraction methods used in protein classification that are based on the calculation of texture descriptors starting from a wavelet representation of the protein. We then feed these texture-based representations of the protein into an AdaBoost ensemble of neural networks or a support vector machine classifier. In addition, we perform experiments that combine our feature extraction methods with a standard method that is based on Chou’s pseudo amino acid composition. Using several datasets, we show that our best approach outperforms standard methods. The Matlab code of the proposed protein descriptors is available at http://bias.csr.unibo.it/nanni/wave.rar.
Notes
Available at http://www.genome.jp/dbget/aaindex.html. We have not considered the properties where the amino acids have value 0 or 1.
The IDs of the properties are available at http://bias.csr.unibo.it/nanni/IDw.docx.
Implemented as in DDtool 0.95 Matlab Toolbox.
It is performed 10 times and the average results are reported.
For multi-class classification with two-class classifiers, the one-versus-one or one-versus-all approach should be used (Cristianini and Shawe-Taylor 2000).
Before the fusion, the scores of each method are normalized to mean 0 and standard deviation 1.
We tested both linear and Gaussian kernels; the parameters are estimated using a grid search on the training set.
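The score normalization mentioned in the notes above can be illustrated with a minimal sketch. It rescales the scores of each method to mean 0 and standard deviation 1 and then combines them. Several points are assumptions and are not stated in the notes: the fusion rule (a plain sum is used here), the use of the population standard deviation, and the function names `zscore` and `fuse_sum_rule`. The score values are made up for illustration only.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Rescale a list of classifier scores to mean 0 and standard deviation 1."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population standard deviation
    if sigma == 0:
        # All scores are identical, so they carry no ranking information.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_sum_rule(score_lists):
    """Fuse the scores of several methods, one list per method, same sample order.

    Assumption: the normalized scores are combined with the sum rule.
    """
    if len({len(s) for s in score_lists}) != 1:
        raise ValueError("every method must score the same number of samples")
    normalized = [zscore(s) for s in score_lists]
    # zip(*normalized) groups the normalized scores of each sample together.
    return [sum(per_sample) for per_sample in zip(*normalized)]


# Hypothetical scores from two methods on four samples, on different scales.
svm_scores = [-1.2, 0.3, 2.1, 0.8]
ensemble_scores = [0.10, 0.40, 0.90, 0.60]
fused = fuse_sum_rule([svm_scores, ensemble_scores])
```

Without the normalization step, the method with the larger score range would dominate the sum. Whether the mean and standard deviation should be estimated on the training scores or on the test scores is not specified in the notes, so the sketch simply normalizes whatever list it receives.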
References
Ahonen T et al (2009) Rotation invariant image description with local binary pattern histogram Fourier features, Image Analysis, SCIA 2009. Lect Notes Comp Sci 5575:61–70
Althaus IW et al (1993) Steady-state kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-87201E. J Biol Chem 268:6119–6124
Andraos J (2008) Kinetic plasticity and the determination of product ratios for kinetic schemes leading to multiple products without rate laws: new methods based on directed graphs. Can J Chem 86:342–357
Bairoch A, Apweiler R (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL. Nucl Acids Res 28:45–48
Ben-Gal I et al (2005) Identification of transcription factor binding sites with variable-order bayesian networks. Bioinformatics 21(11):2657–2666
Bock J, Gough D (2003) Whole-proteome interaction mining. Bioinformatics 19:125–135
Bulashevska A, Eils R (2006) Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains. BMC Bioinform 7:298
Chen YL, Li QZ (2007) Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo-amino acid composition. J Theor Biol 248:377–381
Chen L et al (2005) VFDB: a reference database for bacterial virulence factors. Nucl Acids Res 33:D325–D328
Chen C et al (2009) Prediction of protein secondary structure content by using the concept of Chou’s pseudo amino acid composition and support vector machine. Protein Peptide Lett 16:27–31
Chou KC (1985) Low-frequency motions in protein molecules: beta-sheet and beta-barrel. Biophys J 48:289–297
Chou KC (1988) Review: low-frequency collective motion in biomacromolecules and its biological functions. Biophys Chem 30:3–48
Chou KC (1989a) Graphic rules in steady and non-steady enzyme kinetics. J Biol Chem 264:12074–12079
Chou KC (1989b) Low-frequency resonance and cooperativity of hemoglobin. Trends Biochem Sci 14:212
Chou KC (1990) Review: applications of graph theory to enzyme kinetics and protein folding kinetics: steady and non-steady state systems. Biophys Chem 35:1–24
Chou KC (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct Funct Genet 43:246–255
Chou KC (2010) Graphic rule for drug metabolism systems. Curr Drug Metab 11:369–378
Chou KC (2011) Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). J Theor Biol 273:236–247
Chou KC, Shen HB (2007) Review: recent progresses in protein subcellular location prediction. Anal Biochem 370:1–16
Chou KC, Shen HB (2007b) MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem Biophys Res Commun 360:339–345
Chou KC, Shen HB (2009) Review: recent advances in developing web-servers for predicting protein attributes. Nat Sci 2:63–92. (openly accessible at http://www.scirp.org/journal/NS/)
Chou KC, Shen HB (2010a) Cell-PLoc 2.0: an improved package of web-servers for predicting subcellular localization of proteins in various organisms. Nat Sci 2:1090–1103
Chou KC, Shen HB (2010b) Plant-mPLoc: a top–down strategy to augment the power for predicting plant protein subcellular localization. PLoS ONE 5:e11335
Chou KC, Zhang CT (1995) Review: prediction of protein structural classes. Crit Rev Biochem Mol Biol 30:275–349
Chou KC, Kezdy FJ, Reusser F (1994) Review: steady-state inhibition kinetics of processive nucleic acid polymerases and nucleases. Anal Biochem 221:217–230
Chou KC, Zhang CT, Maggiora GM (1997) Disposition of amphiphilic helices in heteropolar environments. Proteins Struct Funct Genet 28:99–108
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge
Daras P et al (2006) Three-dimensional shape-structure comparison method for protein classification. IEEE Trans Comput Biol Bioinform 3(3):193–207
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Ding YS, Zhang TL (2008) Using Chou’s pseudo amino acid composition to predict subcellular localization of apoptosis proteins: an approach with immune genetic algorithm-based ensemble classifier. Pattern Recognit Lett 29:1887–1892
Ding H, Luo L, Lin H (2009) Prediction of cell wall lytic enzymes using Chou’s amphiphilic pseudo amino acid composition. Protein Peptide Lett 16:351–355
Du PF, Li YD (2006) Prediction of protein submitochondria locations by hybridizing pseudoamino acid composition with various physicochemical. BMC Bioinform 7:518
Du PF, Cao SJ, Li YD (2009a) SubChlo: predicting protein subchloroplast locations with pseudo-amino acid composition and the evidence-theoretic K-nearest neighbor (ET-KNN) algorithm. J Theor Biol 261:330–335
Du P, Cao S, Li Y (2009b) SubChlo: predicting protein subchloroplast locations with pseudo-amino acid composition and the evidence-theoretic K-nearest neighbor (ET-KNN) algorithm. J Theor Biol 261(2):330–335
Fang Y et al (2008) Predicting DNA-binding proteins: approached from Chou’s pseudo amino acid composition and other specific sequence features. Amino Acids 34(1):103–109
Fawcett T (2004) ROC graphs: notes and practical considerations for researchers. HP Laboratories, Palo Alto
Garg A, Gupta D (2008) VirulentPred: a SVM based prediction method for virulent proteins in bacterial pathogens. BMC Bioinform 9:62. doi:10.1186/1471-2105-9-62
Hayat M, Khan A (2011) Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition. J Theor Biol 271:10–17
Hu L et al (2011) Predicting functions of proteins in mouse based on weighted protein–protein interaction network and protein hybrid properties. PLoS ONE 6:e14556
Jaakkola T, Diekhans M, Haussler D (1999) Using the Fisher kernel method to detect remote protein homologies. In: Seventh international conference on intelligent systems for molecular biology. AAAI Press, Menlo Park, pp 149–158
Jiang X et al (2008) Using the concept of Chou’s pseudo amino acid composition to predict apoptosis proteins subcellular location: an approach by approximate entropy. Protein Peptide Lett 15:392–396
Kandaswamy KK et al (2011) AFP-Pred: a random forest approach for predicting antifreeze proteins from sequence-derived properties. J Theor Biol 270:56–62
Kawashima S, Kanehisa M (2000) AAindex: amino acid index database. Nucl Acids Res 20:1
Lei Z, Dai Y (2005) An SVM-based system for predicting protein subnuclear localizations. BMC Bioinform 6:291
Leslie CS et al (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20:467–476
Li FM, Li QZ (2008) Predicting protein subcellular location using Chou’s pseudo amino acid composition and improved hybrid approach. Protein Peptide Lett 15:612–616
Liao S, Law MWK, Chung ACS (2009) Dominant local binary patterns for texture classification. IEEE Trans Image Process 18(5):1107–1118
Lin H (2008) The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou’s pseudo amino acid composition. J Theor Biol 252:350–356
Lin MT, Beal MF (2006) Mitochondrial dysfunction and oxidative stress in neurodegenerative diseases. Nature 443:787–795
Lin H et al (2008) Predicting subcellular localization of mycobacterial proteins by using Chou’s pseudo amino acid composition. Protein Peptide Lett 15:739–744
Lowell BB, Shulman GI (2005) Mitochondrial dysfunction and type 2 diabetes. Science 307:384–387
Masso M, Vaisman II (2010) Knowledge-based computational mutagenesis for predicting the disease potential of human non-synonymous single nucleotide polymorphisms. J Theor Biol 266:560–568
Mohabatkar H (2010) Prediction of cyclin proteins using Chou’s pseudo amino acid composition. Protein Peptide Lett 17:1207–1214
Nanni L, Lumini A (2006) An ensemble of K-local hyperplane for predicting protein–protein interactions. Bioinformatics 22(10):1207–1210
Nanni L, Lumini A (2008a) Genetic programming for creating Chou’s pseudo amino acid based features for submitochondria localization. Amino Acids 34(4):653–660
Nanni L, Lumini A (2008b) Genetic programming for creating Chou’s pseudoamino acid based features for submitochondria localization. Amino Acids 34(4):653–660
Nanni L, Lumini A (2010) A high performance set of descriptors extracted from the amino acid sequence for protein classification. J Theor Biol 266(1):1–10
Niu B et al (2006) Predicting protein structural class with AdaBoost learner. Protein Peptide Lett 13:489–492
Ojansivu V, Heikkila J (2008) Blur insensitive texture classification using local phase quantization. In: ICISP
Qin ZC (2006) ROC analysis for predictions made by probabilistic classifiers. In: Fourth international conference on machine learning and cybernetics, pp 3119–3124
Qiu JD et al (2009) Prediction of G-protein-coupled receptor classes based on the concept of Chou’s pseudo amino acid composition: an approach from discrete wavelet transform. Anal Biochem 390:68–73
Rahtu E, Salo M, Heikkila J (2005) Affine invariant pattern recognition using multi-scale autoconvolution. IEEE Trans Pattern Anal Machine Intell 27(6):908–918
Saigo H et al (2004) Protein homology detection using string alignment kernels. Bioinformatics 20(11):1682–1689
Shen H-B, Chou K-C (2007) Gpos-PLoc: an ensemble classifier for predicting subcellular localization of Gram-positive bacterial proteins. Protein Eng Design Select 20:39–46
Shi SP et al (2011) Identify submitochondria and subchloroplast locations with pseudo amino acid composition: Approach from the strategy of discrete wavelet transform feature extraction. Biochim Biophys Acta 1813:424–430
Tan X, Triggs B (2007) Enhanced local texture feature sets for face recognition under difficult lighting conditions. Analysis and modelling of faces and gestures. In: LNCS, vol 4778, pp 168–182
Wen ZN, Wang KL, Li ML, Nie FS, Yang Y (2005) Analyzing functional similarity of protein sequences with discrete wavelet transform. Comput Biol Chem 29:220–228
Wolfram S (1984) Cellular automata as models of complexity. Nature 311:419–424
Xiao X, Chou KC (2007) Digital coding of amino acids based on hydrophobic index. Protein Peptide Lett 14:871–875
Xiao X et al (2005a) An application of gene comparative image for predicting the effect on replication ratio by HBV virus gene missense mutation. J Theor Biol 235:555–565
Xiao X et al (2005b) Using cellular automata to generate image representation for biological sequences. Amino Acids 28:29–35
Xiao X, Shao SH, Chou KC (2006a) A probability cellular automaton model for hepatitis B viral infections. Biochem Biophys Res Commun 342:605–610
Xiao X et al (2006b) Using cellular automata images and pseudo amino acid composition to predict protein subcellular location. Amino Acids 30:49–54
Xiao X, Wang P, Chou KC (2009) GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes. J Comput Chem 30(9):1414–1423
Xiao X, Wang P, Chou KC (2011a) Quat-2L: a web-server for predicting protein quaternary structural attributes. Mol Divers 15:149–155
Xiao X, Wang P, Chou KC (2011b) GPCR-2L: predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions. Mol Biosyst 7:911–919
Yang ZR, Thomson R (2005) Bio-basis function neural network for prediction of protease cleavage sites in proteins. IEEE Trans Neural Netw 16:263–274
Zeng YH et al (2009) Using the augmented Chou’s pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. J Theor Biol 259(2):366–372
Zhou GP (2011) The disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein–protein interaction mechanism. J Theor Biol 284:142–148
Zhou GP, Deng MH (1984) An extension of Chou’s graphical rules for deriving enzyme kinetic equations to system involving parallel reaction pathways. Biochem J 222:169–176
Zhou XB et al (2007) Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. J Theor Biol 248:546–551
Acknowledgments
We wish to thank Ojansivu and Heikkila for sharing their LPQ code; Rahtu, Salo and Heikkila for sharing their MSAhist code; Ahonen, Matas, He and Pietikäinen for sharing their LBP-HF code.
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Nanni, L., Brahnam, S. & Lumini, A. Wavelet images and Chou’s pseudo amino acid composition for protein classification. Amino Acids 43, 657–665 (2012). https://doi.org/10.1007/s00726-011-1114-9
DOI: https://doi.org/10.1007/s00726-011-1114-9