Abstract
Protein function prediction is an important but challenging task because protein structural data are large and complex. As the volume of available biological data grows, so does the demand for computational methods that annotate these data and help make sense of them. Here we propose a model and a data-mining-based strategy for protein structural classification, with a particular focus on hierarchical classification schemes. To evaluate the proposed strategy, we conduct three experiments that take as input protein structural data from three biological databases (CATH, SCOPe and BRENDA). Each dataset is associated with a well-known hierarchical classification scheme (CATH, SCOP and EC number, respectively). Our model's accuracy ranges from 86% to 95% when predicting the levels of the CATH, SCOP and EC number hierarchies. To the best of our knowledge, ours is the first work to reach such high accuracy on very large data sets.
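As a rough illustration of what accuracy per hierarchy level can mean, the sketch below scores predictions of dotted hierarchical labels (the notation used by EC numbers and CATH codes) one level at a time. It is not the authors' evaluation code: the function names `level_prefixes` and `per_level_accuracy`, the rule that a prediction counts as correct at a level only if it matches the true label on every level up to that one, and the toy labels are all assumptions made for this example.

```python
from collections import defaultdict


def level_prefixes(label):
    """Split a dotted label such as '1.1.1.1' into cumulative per-level prefixes."""
    parts = label.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def per_level_accuracy(true_labels, predicted_labels):
    """Return {level: accuracy}, where a hit requires a match on all levels so far."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        true_prefixes = level_prefixes(truth)
        pred_prefixes = level_prefixes(pred)
        for depth, true_prefix in enumerate(true_prefixes, start=1):
            totals[depth] += 1
            # A shorter prediction cannot be correct at deeper levels.
            if depth <= len(pred_prefixes) and pred_prefixes[depth - 1] == true_prefix:
                hits[depth] += 1
    return {depth: hits[depth] / totals[depth] for depth in sorted(totals)}


# Toy EC-number-style labels, invented for illustration only (not real results).
truth = ["1.1.1.1", "1.1.1.2", "2.7.11.1", "3.4.21.4"]
preds = ["1.1.1.1", "1.1.2.2", "2.7.11.1", "4.4.21.4"]
accuracy_by_level = per_level_accuracy(truth, preds)
print(accuracy_by_level)
```

Under this assumed scoring rule, accuracy can only stay the same or drop as the level gets deeper, because an error at a coarse level propagates to every finer level below it.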
Notes
- 1.
- 2. We intend to make the code and details related to the models publicly available upon publication of this manuscript.
- 3. Using the scikit-learn implementation for Python (http://scikit-learn.org).
Funding
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Mendes, V.F., Monteiro, C.R., Comarela, G.V., Silveira, S.A. (2019). A Hierarchical and Scalable Strategy for Protein Structural Classification. In: Rojas, I., Valenzuela, O., Rojas, F., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2019. Lecture Notes in Computer Science, vol 11465. Springer, Cham. https://doi.org/10.1007/978-3-030-17938-0_34
DOI: https://doi.org/10.1007/978-3-030-17938-0_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17937-3
Online ISBN: 978-3-030-17938-0
eBook Packages: Computer Science, Computer Science (R0)