Three-dimensional Structure Databases of Biological Macromolecules

Waman, Vaishali P.; Orengo, Christine; Kleywegt, Gerard J.; Lesk, Arthur M.

doi:10.1007/978-1-0716-2095-3_3

Part of the book series: Methods in Molecular Biology (MIMB, volume 2449)

Abstract

Databases of three-dimensional structures of proteins (and their associated molecules) provide:

(a) Curated repositories of coordinates of experimentally determined structures, including extensive metadata; for instance, information about provenance, details about data collection and interpretation, and validation of results.
(b) Information-retrieval tools that allow users to search for entries of interest and to access them.
(c) Links among databases, especially to databases of amino-acid and genetic sequences, and of protein function; and links to software for analysis of amino-acid sequence and protein structure, and for structure prediction.
(d) Collections of predicted three-dimensional structures of proteins. These will become increasingly important following the breakthrough in structure prediction achieved by AlphaFold2.

The single global archive of experimentally determined biomacromolecular structures is the Protein Data Bank (PDB). It is managed by wwPDB, a consortium of five partner institutions: the Protein Data Bank in Europe (PDBe), the Research Collaboratory for Structural Bioinformatics (RCSB), the Protein Data Bank Japan (PDBj), the BioMagResBank (BMRB), and the Electron Microscopy Data Bank (EMDB). In addition to jointly managing the PDB repository, the individual wwPDB partners offer many tools for analysis of protein and nucleic acid structures and their complexes, including providing computer-graphic representations. Their collective and individual websites serve as hubs of the community of structural biologists, offering newsletters, reports from Task Forces, training courses, and “helpdesks,” as well as links to external software.

Many specialized projects are based on the information contained in the PDB. Especially important are SCOP, CATH, and ECOD, which present classifications of protein domains.

Notes

1. A (very rough) back-of-the-envelope calculation suggests that to store ∼50 PDB entries on punched cards in mmCIF format rather than old PDB format would cost one 8-inch-diameter tree.
2. In this example, the subgraph shown could be regarded as a tree, starting with the family node as the root. (Indeed, having the root at the bottom would more accurately reflect the botanical metaphor.) However, in many cases the ancestry subgraph is a directed acyclic graph but not a tree.
3. Sydney Brenner used to chaff Aaron Klug, saying: “Why don’t you crystallize E. coli?”
4. Many readers will recall that Paul Dirac famously made a similar-sounding claim about chemistry, in 1929, but this has not happened.

Acknowledgements

We thank A.G. Murzin and A. Andreeva for helpful advice.

Author information

Authors and Affiliations

Institute of Structural and Molecular Biology, University College London, London, UK
Vaishali P. Waman & Christine Orengo
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
Gerard J. Kleywegt
Department of Biochemistry and Molecular Biology and Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, PA, USA
Arthur M. Lesk

Corresponding author

Correspondence to Arthur M. Lesk.

Editor information

Editors and Affiliations

University of Pavia, Pavia, Italy
Oliviero Carugo
Genome Institute and Bioinformatics Institute, Singapore, Singapore
Frank Eisenhaber

Cite this protocol

Waman, V.P., Orengo, C., Kleywegt, G.J., Lesk, A.M. (2022). Three-dimensional Structure Databases of Biological Macromolecules. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 2449. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2095-3_3

DOI: https://doi.org/10.1007/978-1-0716-2095-3_3
Published: 29 October 2021
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-2094-6
Online ISBN: 978-1-0716-2095-3
eBook Packages: Springer Protocols
