Moment Vector Encoding of Protein Sequences for Supervised Classification

  • Haneen Altartouri
  • Tobias Glasmachers
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1005)


Automated prediction of biological attributes of protein sequences with machine learning methods depends on a well-suited protein representation. A central challenge is to represent variable-length sequences as fixed-length feature vectors. In this paper we introduce a new approach that represents a protein sequence as a fixed-length vector of statistical moments computed directly from the physicochemical property values of its amino acids. On four benchmarks, this encoding yields higher prediction accuracy than previous approaches that apply moments to complex descriptors extracted from the physicochemical properties, and it also outperforms the PseAAC encoding method. The best results are achieved when highly correlated features are removed with principal component analysis.
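The idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify which properties or which moment definitions are used, so the sketch assumes a single property (the Kyte-Doolittle hydropathy scale), the mean plus central moments up to order 4, and a plain SVD-based principal component projection. The names `HYDROPATHY`, `moment_vector`, and `pca_project` and the example sequences are hypothetical.

```python
import numpy as np

# Kyte-Doolittle hydropathy values, used here only as an illustrative property.
# Assumption: the paper's actual property set is not given in the abstract.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moment_vector(sequence, scale=HYDROPATHY, order=4):
    """Encode a sequence as [mean, central moments 2..order] of one property.

    Assumption: plain central moments; the paper may define moments differently.
    """
    values = np.array([scale[aa] for aa in sequence], dtype=float)
    mean = values.mean()
    centered = values - mean
    higher = [np.mean(centered ** k) for k in range(2, order + 1)]
    return np.array([mean, *higher])


def pca_project(features, n_components):
    """Project row-wise feature vectors onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Sequences of different lengths map to vectors of the same length.
sequences = ["MKTAYIAKQR", "GAVLIPFMW", "DEKRHSTNQCY"]
X = np.vstack([moment_vector(s) for s in sequences])
Z = pca_project(X, n_components=2)
print(X.shape, Z.shape)
```

With several properties, one such moment vector per property could be concatenated before the projection; the resulting principal component scores are mutually uncorrelated, which is the sense in which the projection removes highly correlated features.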


Moment vector, Protein sequences, Physicochemical properties



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Institute for Neural Computation, Ruhr-University Bochum, Bochum, Germany
