From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets

Ali, Sarwan; Chourasia, Prakash; Patterson, Murray

doi:10.1007/s11517-024-03074-3

Original Article
Published: 16 April 2024

Medical & Biological Engineering & Computing

Abstract

Understanding protein structures is crucial for many areas of bioinformatics research, including drug discovery, disease diagnosis, and evolutionary studies. Protein structure classification is a critical aspect of structural biology, in which supervised machine learning algorithms classify structures based on data from databases such as the Protein Data Bank (PDB). The challenge lies in designing numerical embeddings for protein structures without losing essential information. Although some effort has been made in the literature, to the best of our knowledge researchers have not effectively and rigorously combined structural and sequence-based features for efficient protein classification. To this end, we propose numerical embeddings that extract relevant features from protein sequences fetched from PDB structures in popular datasets such as PDB Bind and STCRDAB. The features are physicochemical properties: aromaticity, instability index, flexibility, Grand Average of Hydropathy (GRAVY), isoelectric point, charge at pH, secondary structure fraction, molar extinction coefficient, and molecular weight. We also incorporate scaling features computed over sliding windows (e.g., k-mers), which include the Kyte and Doolittle (KD) hydropathy scale, the Eisenberg hydrophobicity scale, the hydrophilicity scale, the flexibility of the amino acids, and the hydropathy scale. Selecting multiple features aims to improve the accuracy of protein classification models. The results show that the selected features significantly improve the predictive performance of existing embeddings.
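To make the two feature families in the abstract concrete, the following is a minimal Python sketch that computes two of the named global properties (GRAVY and aromaticity) and a sliding-window profile on the Kyte and Doolittle scale. The function names (`gravy`, `aromaticity`, `window_profile`, `embed`), the window size `k=3`, and the min/max/mean summary of the window profile are illustrative assumptions, not the authors' published construction; the remaining properties and scales listed in the abstract are omitted.

```python
from statistics import mean

# Kyte-Doolittle hydropathy values per amino acid (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FWY")  # Phe, Trp, Tyr


def gravy(seq):
    """Grand Average of Hydropathy: mean KD value over all residues."""
    return mean(KD[aa] for aa in seq)


def aromaticity(seq):
    """Relative frequency of aromatic residues (F, W, Y)."""
    return sum(aa in AROMATIC for aa in seq) / len(seq)


def window_profile(seq, scale, k=3):
    """Mean scale value of every length-k window (k-mer) in seq."""
    return [
        mean(scale[aa] for aa in seq[i:i + k])
        for i in range(len(seq) - k + 1)
    ]


def embed(seq, k=3):
    """Hypothetical fixed-length vector: global features plus a
    min/max/mean summary of the KD sliding-window profile.
    The summary step is an assumption made for this sketch."""
    # Drop residues outside the 20 standard amino acids (e.g. 'X').
    seq = "".join(aa for aa in seq.upper() if aa in KD)
    if len(seq) < k:
        raise ValueError("sequence shorter than window size k")
    profile = window_profile(seq, KD, k)
    return [gravy(seq), aromaticity(seq),
            min(profile), max(profile), mean(profile)]


if __name__ == "__main__":
    print(embed("MKTAYIAKQRQISFVKSHFSRQ"))
```

A vector of this kind could be concatenated with an existing sequence embedding before it is passed to a classifier; whether the paper combines features in exactly this way is not stated in the abstract.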

Data availability

Available in the published version

Notes

Available in the published version.


Author information

Authors and Affiliations

Georgia State University, Atlanta, GA, USA
Sarwan Ali, Prakash Chourasia & Murray Patterson

Authors

Sarwan Ali
Prakash Chourasia
Murray Patterson

Corresponding author

Correspondence to Sarwan Ali.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ali, S., Chourasia, P. & Patterson, M. From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets. Med Biol Eng Comput (2024). https://doi.org/10.1007/s11517-024-03074-3

Received: 20 October 2023
Accepted: 13 March 2024
Published: 16 April 2024
DOI: https://doi.org/10.1007/s11517-024-03074-3
