Abstract
Many domains have a stake in the development of reliable systems for automatic protein classification. Recent studies have focused in particular on new methods for extracting features from a protein that enhance classification for specific problems. These methods have proven very useful in one or two domains, but they have failed to generalize well across several domains (i.e., classification problems). In this paper, we evaluate several feature extraction approaches for representing proteins for sequence-based protein classification. The evaluated representations start from: the position-specific scoring matrix (PSSM) of the protein; the amino-acid sequence; and a matrix representation of the protein, of dimension (length of the protein) × 20, obtained by using substitution matrices to represent each amino acid as a vector. A valuable result is that a texture descriptor can be extracted from the PSSM protein representation that improves on the performance of standard descriptors based on the PSSM representation. Experimentally, we develop our systems by comparing several protein descriptors on nine different datasets. Each descriptor is used to train a support vector machine (SVM) or an ensemble of SVMs. Although different stand-alone descriptors work well on some datasets (but not on others), we have found that fusing classifiers trained on different descriptors performs well across all the tested datasets. The Matlab code and datasets used in this paper are available at http://www.bias.csr.unibo.it/nanni/PSSM.rar.
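As a minimal sketch of the third representation listed above, the following Python snippet maps each residue to its row of a substitution matrix, giving a (length of the protein) × 20 matrix. It is written in Python for brevity rather than in the authors' Matlab. The names `sequence_to_matrix` and `toy_substitution_matrix` are hypothetical, and the match/mismatch matrix is only a placeholder for a real substitution matrix, which would be supplied in practice.

```python
# Illustrative sketch only; not the authors' Matlab implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_substitution_matrix(match=1, mismatch=0):
    """Hypothetical placeholder; a real substitution matrix would be loaded instead."""
    return {a: {b: (match if a == b else mismatch) for b in AMINO_ACIDS}
            for a in AMINO_ACIDS}


def sequence_to_matrix(sequence, sub_matrix):
    """Represent each residue by its substitution-matrix row (one 20-vector per residue)."""
    rows = []
    for residue in sequence.upper():
        if residue not in sub_matrix:
            raise ValueError(f"unsupported residue: {residue!r}")
        rows.append([sub_matrix[residue][b] for b in AMINO_ACIDS])
    return rows


# Example: a three-residue sequence yields a 3 x 20 matrix.
matrix = sequence_to_matrix("MKV", toy_substitution_matrix())
print(len(matrix), len(matrix[0]))
```

Because the number of rows follows the protein length, a fixed-length descriptor still has to be extracted from this matrix before a classifier such as an SVM can be trained on it.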

Notes
SVM is implemented as in the LibSVM toolbox http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
Matlab code: http://bias.csr.unibo.it/nanni/QRcouple2.zip.
Available at http://www.genome.jp/dbget/aaindex.html. We have not considered the properties where the amino acids have value 0 or 1.
Matlab code: http://bias.csr.unibo.it/nanni/EstraggoFeaturesAC.rar.
Extracted using the Matlab code shared by the original authors.
To extract the PSSM from Matlab after installing PSI-BLAST, use system('blastpgp.exe -i input.txt -d D:\PSI-BLAST\swissprot -Q PSSM.txt -j 3'); where input.txt contains the protein sequence and PSSM.txt receives the PSSM.
A “substitution matrix” describes the rate at which one character in a protein sequence changes to other character states over time. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where the similarity between sequences depends on their divergence time and the substitution rates as represented in the matrix. (http://en.wikipedia.org/wiki/Substitution_matrix, Accessed 07/14/2012).
Matlab code: http://bias.csr.unibo.it/nanni/SMR.rar.
Matlab code: http://bias.csr.unibo.it/nanni/SA1.rar.
Matlab code: http://bias.csr.unibo.it/nanni/PP.rar.
Matlab code: http://www.cse.oulu.fi/CMV/Downloads/LPQMatlab.
LTP is implemented starting from the Matlab LBP code available at http://www.cse.oulu.fi/CMV/Downloads/LBPMatlab.
Using the Matlab LPQ code available at http://www.ee.oulu.fi/mvg/download/lpq/.
Tables 3 and 4 with standard deviations, for the datasets where a cross-validation testing protocol is used, are available at: https://www.dropbox.com/s/k9szugy2at896j1/tabelleConSTD.docx.
Only results from methods that use the same testing protocol are reported (the results obtained by leave-one-out cross validation are not considered).
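The PSI-BLAST call quoted in the notes above can also be assembled programmatically. The following Python sketch only builds the argument list, using the same `-i`, `-d`, `-Q` and `-j` flags as the quoted command. The function names `build_blastpgp_command` and `run_blastpgp` are hypothetical, and actually running the command assumes a local installation of the legacy `blastpgp` executable and a formatted database.

```python
# Illustrative sketch only; the flags mirror the blastpgp call quoted in the notes.
import subprocess


def build_blastpgp_command(input_path, database, pssm_path, iterations=3,
                           executable="blastpgp.exe"):
    """Return the argument list for the PSI-BLAST call described in the notes."""
    return [
        executable,
        "-i", input_path,        # file holding the protein sequence
        "-d", database,          # database to search, as in the note
        "-Q", pssm_path,         # output file that will contain the PSSM
        "-j", str(iterations),   # number of PSI-BLAST passes (3 in the note)
    ]


def run_blastpgp(command):
    """Run the command; requires a local PSI-BLAST installation (assumption)."""
    return subprocess.run(command, check=True)


command = build_blastpgp_command("input.txt", r"D:\PSI-BLAST\swissprot", "PSSM.txt")
print(" ".join(command))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.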
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Nanni, L., Lumini, A. & Brahnam, S. An empirical study on the matrix-based protein representations and their combination with sequence-based approaches. Amino Acids 44, 887–901 (2013). https://doi.org/10.1007/s00726-012-1416-6
DOI: https://doi.org/10.1007/s00726-012-1416-6
Keywords
- Proteins classification
- Machine learning
- Ensemble of classifiers
- Support vector machines