Criteria to Extract High-Quality Protein Data Bank Subsets for Structure Users

  • Protocol
Data Mining Techniques for the Life Sciences

Part of the book series: Methods in Molecular Biology (MIMB, volume 1415)

Abstract

It is often necessary to build subsets of the Protein Data Bank to extract structural trends and average values. For this purpose, the subsets must be non-redundant and of high quality. The first requirement can be met relatively easily at the sequence level or at the structural level. The second, by contrast, needs special attention. It is not sufficient to consider the crystallographic resolution alone; other features must be taken into account: the absence of strings of residues from the electron density maps and from the files deposited in the Protein Data Bank; the B-factor values; the appropriate validation of the structural models; the quality of the electron density maps, which is not uniform; and the temperature of the diffraction experiments. More stringent criteria produce smaller subsets, which can be enlarged by applying more tolerant selection criteria. The continued growth of the Protein Data Bank, and especially of the number of high-resolution structures, allows the use of more stringent selection criteria, with a consequent improvement in the quality of the subsets.
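As a rough illustration of how such criteria might be combined, the sketch below filters a list of per-structure summaries and then keeps one representative per redundancy cluster. It is not the authors' procedure: the `Entry` record, its field names, the cluster labels, the entry identifiers, and every default threshold are assumptions chosen only for the example. Electron density map quality is omitted because it does not reduce to a single obvious number.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Entry:
    """Hypothetical per-structure summary; all field names are illustrative."""
    entry_id: str
    resolution: float             # crystallographic resolution, in angstroms
    missing_residues: int         # residues absent from the deposited model
    mean_b_factor: float          # average B-factor, in square angstroms
    r_free: Optional[float]       # validation statistic; None if not reported
    temperature: Optional[float]  # data-collection temperature, in kelvin
    cluster_id: str               # label from some prior redundancy clustering


def passes_quality(
    e: Entry,
    *,
    max_resolution: float = 1.5,      # assumed threshold, not from the chapter
    max_missing: int = 0,             # assumed threshold
    max_mean_b: float = 30.0,         # assumed threshold
    max_r_free: float = 0.25,         # assumed threshold
    temperature_range: Tuple[float, float] = (90.0, 110.0),  # assumed range
) -> bool:
    """Return True if the entry satisfies every quality criterion."""
    if e.resolution > max_resolution:
        return False
    if e.missing_residues > max_missing:
        return False
    if e.mean_b_factor > max_mean_b:
        return False
    # Entries with unreported statistics are rejected rather than guessed at.
    if e.r_free is None or e.r_free > max_r_free:
        return False
    low, high = temperature_range
    if e.temperature is None or not (low <= e.temperature <= high):
        return False
    return True


def select_subset(entries: Iterable[Entry], **thresholds) -> List[Entry]:
    """Keep the best-resolution passing entry from each redundancy cluster."""
    best: Dict[str, Entry] = {}
    for e in entries:
        if not passes_quality(e, **thresholds):
            continue
        current = best.get(e.cluster_id)
        if current is None or e.resolution < current.resolution:
            best[e.cluster_id] = e
    return sorted(best.values(), key=lambda e: e.entry_id)


# Made-up records, used only to exercise the functions above.
entries = [
    Entry("aaaa", 1.2, 0, 18.0, 0.18, 100.0, "c1"),
    Entry("bbbb", 1.0, 0, 15.0, 0.17, 100.0, "c1"),
    Entry("cccc", 1.4, 3, 20.0, 0.20, 100.0, "c2"),
    Entry("dddd", 1.9, 0, 25.0, 0.22, 100.0, "c3"),
]
strict = select_subset(entries)
tolerant = select_subset(entries, max_resolution=2.0, max_missing=5)
```

With these made-up records, the default thresholds retain fewer entries than the relaxed call, which mirrors the trade-off between stringency and subset size described in the abstract.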

Author information

Corresponding author

Correspondence to Oliviero Carugo.

Copyright information

© 2016 Springer Science+Business Media New York

About this protocol

Cite this protocol

Carugo, O., Djinović-Carugo, K. (2016). Criteria to Extract High-Quality Protein Data Bank Subsets for Structure Users. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_7

  • DOI: https://doi.org/10.1007/978-1-4939-3572-7_7

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-3570-3

  • Online ISBN: 978-1-4939-3572-7

  • eBook Packages: Springer Protocols
