Amino Acids, Volume 44, Issue 3, pp 887–901

An empirical study on the matrix-based protein representations and their combination with sequence-based approaches

Original Article

Abstract

Many domains have a stake in the development of reliable systems for automatic protein classification. Recent studies have focused in particular on new methods for extracting features from a protein that enhance classification for specific problems. These methods have proven very useful in one or two domains, but they have failed to generalize well across several domains (i.e., classification problems). In this paper, we evaluate several feature extraction approaches for representing proteins with the aim of sequence-based protein classification. The representations evaluated start from: the position-specific scoring matrix (PSSM) of the protein; the amino-acid sequence; and a matrix representation of the protein, of dimension (length of the protein) × 20, obtained by using substitution matrices to represent each amino acid as a vector. A valuable result is that a texture descriptor can be extracted from the PSSM representation that improves on the performance of standard PSSM-based descriptors. Experimentally, we develop our systems by comparing several protein descriptors on nine different datasets. Each descriptor is used to train a support vector machine (SVM) or an ensemble of SVMs. Although different stand-alone descriptors work well on some datasets (but not on others), we find that fusing classifiers trained on different descriptors performs well across all the tested datasets. The Matlab code and datasets used in this paper are available at http://www.bias.csr.unibo.it\nanni\PSSM.rar.
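
As an illustration of the (length of the protein) × 20 matrix representation mentioned in the abstract, the following is a minimal Python sketch, not the authors' Matlab code. It assumes a fixed ordering of the 20 standard amino acids and uses a toy placeholder substitution matrix (1 on the diagonal, 0 elsewhere) in place of a real one such as BLOSUM or PAM. The names `AMINO_ACIDS`, `toy_substitution_matrix` and `sequence_to_matrix` are hypothetical and introduced here only for the example.

```python
# Column order for the 20 standard amino acids (an assumed convention).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_substitution_matrix(match=1, mismatch=0):
    """Placeholder 20x20 matrix, NOT a real BLOSUM/PAM matrix:
    `match` on the diagonal and `mismatch` everywhere else."""
    return {
        a: {b: (match if a == b else mismatch) for b in AMINO_ACIDS}
        for a in AMINO_ACIDS
    }


def sequence_to_matrix(sequence, sub_matrix):
    """Replace each residue by its 20-value substitution row,
    giving a matrix with len(sequence) rows and 20 columns."""
    rows = []
    for residue in sequence.upper():
        if residue not in sub_matrix:
            raise ValueError(f"unsupported residue: {residue!r}")
        rows.append([sub_matrix[residue][col] for col in AMINO_ACIDS])
    return rows


# A three-residue example sequence yields a 3 x 20 matrix.
matrix = sequence_to_matrix("MKV", toy_substitution_matrix())
print(len(matrix), len(matrix[0]))
```

With a real substitution matrix loaded into the same nested-dictionary shape, each row would carry the substitution scores of that residue against all 20 amino acids, and descriptors could then be computed from the resulting matrix.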

Keywords

Proteins classification; Machine learning; Ensemble of classifiers; Support vector machines

Copyright information

© Springer-Verlag Wien 2012

Authors and Affiliations

  • Loris Nanni (1)
  • Alessandra Lumini (2)
  • Sheryl Brahnam (3)
  1. DEI, University of Padua, Padua, Italy
  2. DEIS, Università di Bologna, Cesena, Italy
  3. Computer Information Systems, Missouri State University, Springfield, USA
