An empirical study on the matrix-based protein representations and their combination with sequence-based approaches

Abstract

Many domains have a stake in the development of reliable systems for automatic protein classification. Recent studies have focused on new methods for extracting features from a protein that enhance classification for specific problems. These methods have proven very useful in one or two domains, but they have failed to generalize well across several domains (i.e., classification problems). In this paper, we evaluate several feature extraction approaches for representing proteins, with the aim of sequence-based protein classification. The evaluated representations start from: the position specific scoring matrix (PSSM) of the protein; the amino-acid sequence; and a matrix representation of the protein, of dimension (length of the protein) × 20, obtained by using substitution matrices to represent each amino acid as a vector. A valuable result is that a texture descriptor extracted from the PSSM representation of the protein improves on the performance of standard descriptors based on the same representation. Experimentally, we develop our systems by comparing several protein descriptors on nine different datasets. Each descriptor is used to train a support vector machine (SVM) or an ensemble of SVMs. Although individual stand-alone descriptors work well on some datasets but not on others, we find that fusing classifiers trained on different descriptors performs well across all the tested datasets. The Matlab code and datasets used in this paper are available at http://www.bias.csr.unibo.it\nanni\PSSM.rar.
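As a rough illustration of the (length of the protein) × 20 matrix representation mentioned in the abstract, the following Python sketch maps each residue to a 20-value row of a substitution matrix. It is not the authors' Matlab implementation: the function names are hypothetical, and `toy_substitution_matrix` builds a placeholder matrix (one value on the diagonal, another elsewhere) that merely stands in for a real substitution matrix.

```python
# Illustrative sketch only. The matrix below is a placeholder, NOT a real
# substitution matrix such as those used in the paper.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard residues, one-letter codes


def toy_substitution_matrix(match=4, mismatch=-1):
    """Build a hypothetical 20x20 matrix: `match` on the diagonal, `mismatch` elsewhere."""
    return {
        a: [match if a == b else mismatch for b in AMINO_ACIDS]
        for a in AMINO_ACIDS
    }


def matrix_representation(sequence, substitution):
    """Map each residue to its 20-value row, giving a (len(sequence) x 20) matrix."""
    rows = []
    for residue in sequence.upper():
        if residue not in substitution:
            raise ValueError(f"unsupported residue: {residue!r}")
        rows.append(list(substitution[residue]))
    return rows


if __name__ == "__main__":
    rep = matrix_representation("MKV", toy_substitution_matrix())
    print(len(rep), len(rep[0]))  # number of rows and columns of the representation
```

In practice the placeholder would be swapped for a real substitution matrix; the resulting matrix always has one row per residue and 20 columns, which is what allows matrix-based descriptors to be applied to it.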

Notes

  1. SVM is implemented as in the LibSVM toolbox http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  2. Matlab code: http://bias.csr.unibo.it/nanni/QRcouple2.zip.

  3. Available at http://www.genome.jp/dbget/aaindex.html. We have not considered the properties where the amino acids have value 0 or 1.

  4. Matlab code: http://bias.csr.unibo.it/nanni/EstraggoFeaturesAC.rar.

  5. Extracted using the Matlab code shared by the original authors.

  6. To extract the PSSM, install PSI-BLAST and then call it from Matlab with system('blastpgp.exe -i input.txt -d D:\PSI-BLAST\swissprot -Q PSSM.txt -j 3'); where input.txt contains the protein sequence and PSSM.txt receives the PSSM matrix.

  7. A “substitution matrix” describes the rate at which one character in a protein sequence changes to other character states over time. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where the similarity between sequences depends on their divergence time and the substitution rates as represented in the matrix. (http://en.wikipedia.org/wiki/Substitution_matrix, Accessed 07/14/2012).

  8. Matlab code: http://bias.csr.unibo.it/nanni/SMR.rar.

  9. Matlab code: http://bias.csr.unibo.it/nanni/SA1.rar.

  10. Matlab code: http://bias.csr.unibo.it/nanni/PP.rar.

  11. Matlab code: http://www.cse.oulu.fi/CMV/Downloads/LPQMatlab.

  12. LTP is implemented starting from the Matlab LBP code available at http://www.cse.oulu.fi/CMV/Downloads/LBPMatlab.

  13. Using the Matlab code for LPQ available at http://www.ee.oulu.fi/mvg/download/lpq/.

  14. Tables 3 and 4 with standard deviations, for the datasets where a cross-validation testing protocol is used, are available at: https://www.dropbox.com/s/k9szugy2at896j1/tabelleConSTD.docx.

  15. Only results from methods that use the same testing protocol are reported (the results obtained by leave-one-out cross validation are not considered).
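The PSSM extraction step in note 6 can also be sketched outside Matlab. The Python snippet below builds the same `blastpgp.exe` command line (flags and file names taken from note 6) and reads the per-residue scores back from the output file. The helper names are hypothetical, and the parser assumes a particular row layout for the ASCII PSSM file (position index, residue letter, then at least 20 integer scores); the source does not describe that layout, so it should be checked against real output before use.

```python
import subprocess


def blastpgp_command(query="input.txt", database=r"D:\PSI-BLAST\swissprot",
                     pssm_out="PSSM.txt", iterations=3):
    """Return the argument list for the blastpgp call quoted in note 6."""
    return ["blastpgp.exe", "-i", query, "-d", database,
            "-Q", pssm_out, "-j", str(iterations)]


def run_blastpgp(**kwargs):
    """Run PSI-BLAST; needs blastpgp.exe on PATH and a formatted database."""
    subprocess.run(blastpgp_command(**kwargs), check=True)


def parse_pssm(text):
    """Extract 20 scores per residue from ASCII PSSM text.

    Assumed layout (not confirmed by the source): each data row starts with a
    position index and a one-letter residue, followed by at least 20 integers.
    Header and footer lines do not match this pattern and are skipped.
    """
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        is_data_row = (
            len(tokens) >= 22
            and tokens[0].isdigit()
            and len(tokens[1]) == 1
            and tokens[1].isalpha()
        )
        if is_data_row:
            rows.append([int(t) for t in tokens[2:22]])
    return rows
```

A typical use would be to call `run_blastpgp()` and then pass the contents of `PSSM.txt` to `parse_pssm`, which returns one list of 20 scores per residue, i.e. a (length of the protein) × 20 matrix.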

Conflict of interest

The authors declare that they have no conflict of interest.

Author information

Corresponding author

Correspondence to Loris Nanni.

About this article

Cite this article

Nanni, L., Lumini, A. & Brahnam, S. An empirical study on the matrix-based protein representations and their combination with sequence-based approaches. Amino Acids 44, 887–901 (2013). https://doi.org/10.1007/s00726-012-1416-6

  • DOI: https://doi.org/10.1007/s00726-012-1416-6

Keywords

  • Proteins classification
  • Machine learning
  • Ensemble of classifiers
  • Support vector machines