From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets

  • Original Article
Medical & Biological Engineering & Computing

Abstract

Understanding protein structures is crucial for many areas of bioinformatics research, including drug discovery, disease diagnosis, and evolutionary studies. Protein structure classification is a critical aspect of structural biology, where supervised machine learning algorithms classify structures based on data from databases such as the Protein Data Bank (PDB). The challenge lies in designing numerical embeddings for protein structures without losing essential information. Although some effort has been made in the literature, to the best of our knowledge researchers have not yet effectively and rigorously combined structural and sequence-based features for efficient protein classification. To this end, we propose numerical embeddings that extract relevant features from protein sequences fetched from the PDB structures in popular datasets such as PDB Bind and STCRDAB. The features are physicochemical properties: aromaticity, instability index, flexibility, Grand Average of Hydropathy (GRAVY), isoelectric point, charge at a given pH, secondary structure fraction, molar extinction coefficient, and molecular weight. We also incorporate scale-based features computed over sliding windows (e.g., k-mers), which include the Kyte and Doolittle (KD) hydropathy scale, the Eisenberg hydrophobicity scale, the hydrophilicity scale, the flexibility of the amino acids, and the hydropathy scale. Selecting among these multiple features aims to improve the accuracy of protein classification models. The results showed that the selected features significantly improved the predictive performance of existing embeddings.
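To make the kind of feature described above concrete, the following is a minimal sketch, not the authors' implementation. It computes two of the listed physicochemical properties (GRAVY and aromaticity) and a sliding-window profile on the Kyte and Doolittle hydropathy scale for a single sequence. The function names, the window size `k=3`, and the choice to summarize the window profile by its minimum, maximum, and mean are assumptions made purely for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FWY")  # residues counted toward aromaticity


def gravy(seq):
    """Grand Average of Hydropathy: mean KD value over all residues."""
    return sum(KD[a] for a in seq) / len(seq)


def aromaticity(seq):
    """Relative frequency of Phe, Trp, and Tyr in the sequence."""
    return sum(a in AROMATIC for a in seq) / len(seq)


def window_profile(seq, k=3):
    """Mean KD hydropathy of every length-k window (k-mer) in seq."""
    return [
        sum(KD[a] for a in seq[i:i + k]) / k
        for i in range(len(seq) - k + 1)
    ]


def embed(seq, k=3):
    """Hypothetical fixed-length embedding (assumes at least k standard residues).

    Non-standard residue codes are dropped first; the window profile is
    reduced to min/max/mean so that sequences of any length map to a
    vector of the same size (an illustrative choice, not the paper's).
    """
    seq = "".join(a for a in seq.upper() if a in KD)
    profile = window_profile(seq, k)
    return [
        gravy(seq),
        aromaticity(seq),
        min(profile),
        max(profile),
        sum(profile) / len(profile),
    ]
```

Calling `embed("MKTAYIAKQR")` returns a five-element list that could be passed to any standard classifier. Mature libraries provide the remaining properties (isoelectric point, instability index, and so on), so in practice a sketch like this would mainly be useful for understanding what each feature measures.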

Data availability

Available in the published version

Notes

  1. Available in the published version.

Author information

Corresponding author

Correspondence to Sarwan Ali.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

About this article

Cite this article

Ali, S., Chourasia, P. & Patterson, M. From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets. Med Biol Eng Comput (2024). https://doi.org/10.1007/s11517-024-03074-3
