Automatic classification of protein structures using physicochemical parameters

Mohan, Abhilash; Divya Rao, M.; Sunderrajan, Shruthi; Pennathur, Gautam

doi:10.1007/s12539-013-0199-0

Published: 11 September 2014

Volume 6, pages 176–186, (2014)

Interdisciplinary Sciences: Computational Life Sciences

Abstract

Protein classification is the first step to functional annotation; the SCOP and Pfam databases are currently the most relevant protein classification schemes. However, the disproportion between the number of three-dimensional (3D) protein structures generated and the number classified into relevant superfamilies/families emphasizes the need for automated classification schemes. Predicting the function of novel proteins based on sequence information alone has proven to be a major challenge.

The present study focuses on the use of physicochemical parameters in conjunction with machine learning algorithms (Naive Bayes, Decision Trees, Random Forest and Support Vector Machines) to classify proteins into their respective SCOP superfamily/Pfam family using sequence-derived information. Spectrophores™, a 1D descriptor of the 3D molecular field surrounding a structure, was used as a benchmark against which the performance of the physicochemical parameters was compared. The machine learning algorithms were modified to select features based on information gain for each SCOP superfamily/Pfam family. The effect of combining physicochemical parameters and spectrophores on classification accuracy (CA) was also studied.
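As an illustration of per-class feature selection by information gain, the following is a minimal Python sketch rather than the authors' implementation. The abstract does not state which physicochemical parameters were used, how continuous values were discretized, or which toolkit was employed, so the feature names (`aliphatic_index`, `gravy`), the split at the column mean, the toy values, and the function names are all assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting labels at values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


def rank_features_for_class(rows, labels, target, feature_names):
    """Rank features by information gain for one class against all others.

    Assumption: each continuous feature is split once, at its column mean.
    """
    one_vs_rest = [y == target for y in labels]
    scores = []
    for j, name in enumerate(feature_names):
        column = [row[j] for row in rows]
        threshold = sum(column) / len(column)
        scores.append((name, information_gain(column, one_vs_rest, threshold)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical values, not taken from the paper).
feature_names = ["aliphatic_index", "gravy"]
rows = [
    [90.0, 0.1],
    [95.0, -0.4],
    [60.0, 0.2],
    [55.0, -0.3],
]
labels = ["family_a", "family_a", "family_b", "family_b"]

ranked = rank_features_for_class(rows, labels, "family_a", feature_names)
for name, gain in ranked:
    print(f"{name}: {gain:.3f}")
```

In this toy set the first feature separates `family_a` from the rest perfectly and the second does not separate it at all, so a selector that keeps only the top-ranked features for each class would retain the first and discard the second.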

Machine learning algorithms trained with the physicochemical parameters consistently classified SCOP superfamilies and Pfam families with a classification accuracy above 90%, while spectrophores achieved a CA of around 85%. Feature selection improved classification accuracy for machine learning algorithms based on either physicochemical parameters or spectrophores. Combining both sets of attributes resulted in a marginal loss of performance. Physicochemical parameters were able to classify proteins from both schemes with classification accuracy ranging from 90% to 96%. These results suggest the usefulness of this method in classifying proteins from amino acid sequences.

Author information

Authors and Affiliations

The Center for Biotechnology, Anna University, Chennai, 600025, Tamilnadu, India
Abhilash Mohan, M. Divya Rao, Shruthi Sunderrajan & Gautam Pennathur

Corresponding author

Correspondence to Gautam Pennathur.

Additional information

Equal contribution

About this article

Cite this article

Mohan, A., Divya Rao, M., Sunderrajan, S. et al. Automatic classification of protein structures using physicochemical parameters. Interdiscip Sci Comput Life Sci 6, 176–186 (2014). https://doi.org/10.1007/s12539-013-0199-0

Received: 03 May 2013
Revised: 12 November 2013
Accepted: 05 December 2013
Published: 11 September 2014
Issue Date: September 2014
DOI: https://doi.org/10.1007/s12539-013-0199-0
