Data Management in the Modern Structural Biology and Biomedical Research Environment

Zimmerman, Matthew D.; Grabowski, Marek; Domagalski, Marcin J.; MacLean, Elizabeth M.; Chruszcz, Maksymilian; Minor, Wladek

doi:10.1007/978-1-4939-0354-2_1

Protocol
First Online: 01 January 2014

Part of the book series: Methods in Molecular Biology (MIMB, volume 1140)

Abstract

Modern high-throughput structural biology laboratories produce vast amounts of raw experimental data. The traditional method of data reduction is very simple—results are summarized in peer-reviewed publications, which are hopefully published in high-impact journals. By their nature, publications include only the most important results derived from experiments that may have been performed over the course of many years. The main content of the published paper is a concise compilation of these data, an interpretation of the experimental results, and a comparison of these results with those obtained by other scientists.

Due to an avalanche of structural biology manuscripts submitted to scientific journals, descriptions of experimental methodology (and sometimes even experimental results) are in many recent cases pushed to supplementary materials that are published only online and may not be reviewed as thoroughly as the main body of a manuscript. Trouble may arise when experimental results contradict those obtained by other scientists, which requires (in the best case) reexamination of the original raw data or independent repetition of the experiment according to its published description. There are reports that a significant fraction of experiments performed in academic laboratories cannot be repeated in an industrial environment (Begley CG & Ellis LM, Nature 483(7391):531–533, 2012). This is not an indication of scientific fraud but rather reflects the inadequate description of experiments performed on different equipment and on biological samples produced with disparate methods. For that reason, the goal of a modern data management system is not only to replace the laboratory notebook with an electronic one but also to create a sophisticated, internally consistent, scalable system that combines data obtained from a variety of experiments performed by various individuals on diverse equipment. All data should be stored in a core database that custom applications can use to prepare internal reports, compute statistics, and perform other functions specific to the research pursued in a particular laboratory.

This chapter presents a general overview of the methods of data management and analysis used by structural genomics (SG) programs. In addition to reviewing the existing literature on the subject, it presents experience gained in developing two SG data management systems, UniTrack and LabDB. The description is targeted to a general audience, as some technical details have been (or will be) published elsewhere. The focus is on “data management,” meaning the process of gathering, organizing, and storing data, but “data mining,” the process of analysis ideally leading to an understanding of the data, is also briefly discussed. In other words, data mining is the conversion of data into information. Clearly, effective data management is a precondition for any useful data mining. If done properly, gathering details on millions of experiments on thousands of proteins and making them publicly available for analysis, even after the projects themselves have ended, may turn out to be one of the most important benefits of SG programs.
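As a rough illustration of the distinction drawn above, the following minimal sketch stores experiment records from different operators and instruments in one core database ("data management") and then summarizes them per pipeline stage ("data mining"). It assumes Python's standard sqlite3 module; the table names, column names, and sample records are hypothetical and are not taken from UniTrack or LabDB.

```python
import sqlite3


def build_core_db():
    """Create an in-memory core database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE target (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE experiment (
            id         INTEGER PRIMARY KEY,
            target_id  INTEGER NOT NULL REFERENCES target(id),
            stage      TEXT NOT NULL,     -- e.g. expression, crystallization
            operator   TEXT NOT NULL,     -- who performed the experiment
            instrument TEXT NOT NULL,     -- which equipment was used
            succeeded  INTEGER NOT NULL CHECK (succeeded IN (0, 1))
        );
        """
    )
    return conn


def record_experiment(conn, target, stage, operator, instrument, succeeded):
    """Data management: store one experiment, creating the target if needed."""
    conn.execute("INSERT OR IGNORE INTO target (name) VALUES (?)", (target,))
    (target_id,) = conn.execute(
        "SELECT id FROM target WHERE name = ?", (target,)
    ).fetchone()
    conn.execute(
        "INSERT INTO experiment (target_id, stage, operator, instrument, succeeded)"
        " VALUES (?, ?, ?, ?, ?)",
        (target_id, stage, operator, instrument, int(succeeded)),
    )


def success_rate_by_stage(conn):
    """Data mining: map each stage to (number of experiments, success rate)."""
    rows = conn.execute(
        "SELECT stage, COUNT(*), SUM(succeeded)"
        " FROM experiment GROUP BY stage ORDER BY stage"
    ).fetchall()
    return {stage: (n, ok / n) for stage, n, ok in rows}


if __name__ == "__main__":
    conn = build_core_db()
    # Hypothetical records; real systems would capture far more detail.
    record_experiment(conn, "T0001", "expression", "alice", "shaker-1", True)
    record_experiment(conn, "T0001", "crystallization", "bob", "robot-A", False)
    record_experiment(conn, "T0002", "crystallization", "alice", "robot-B", True)
    print(success_rate_by_stage(conn))
```

Keeping operator and instrument on every record is the point of the sketch: a summary such as the per-stage success rate can then be broken down further when two laboratories obtain contradictory results.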

Matthew D. Zimmerman and Marek Grabowski have contributed equally to this work.

References

Begley CG, Ellis LM (2012) Drug development: Raise standards for preclinical cancer research. Nature 483(7391):531–533
Minor W et al (2006) HKL-3000: the integration of data reduction and structure solution—from diffraction images to an initial model in minutes. Acta Crystallogr D Biol Crystallogr 62(Pt 8):859–866
Berman HM et al (2000) The Protein Data Bank. Nucleic Acids Res 28(1):235–242
Berman H, Henrick K, Nakamura H (2003) Announcing the worldwide Protein Data Bank. Nat Struct Biol 10(12):980
Peat TS, Christopher JA, Newman J (2005) Tapping the Protein Data Bank for crystallization information. Acta Crystallogr D Biol Crystallogr 61(Pt 12):1662–1669
Wlodawer A et al (2008) Protein crystallography for non-crystallographers, or how to get the best (but not more) from published macromolecular structures. FEBS J 275(1):1–21
Hooft RW et al (1996) Errors in protein structures. Nature 381(6580):272
Koclega KD et al (2009) ‘Hot’ macromolecular crystals. Cryst Growth Des 10(2):580
SBKB P-N PSI impact: ex-cited use of PSI structures
Gabanyi MJ et al (2011) The Structural Biology Knowledgebase: a portal to protein structures, sequences, functions, and methods. J Struct Funct Genomics 12(2):45–54
Chen L et al (2004) TargetDB: a target registration database for structural genomics projects. Bioinformatics 20(16):2860–2862
Edwards A (2008) Open-source science to enable drug discovery. Drug Discov Today 13(17–18):731–733
O’Toole N et al (2004) The structural genomics experimental pipeline: insights from global target lists. Proteins 56(2):201–210
Goh CS et al (2004) Mining the structural genomics pipeline: identification of protein properties that affect high-throughput experimental analysis. J Mol Biol 336(1):115–130
Kouranov A et al (2006) The RCSB PDB information portal for structural genomics. Nucleic Acids Res 34(Database issue):D302–D305
Berman HM et al (2009) The protein structure initiative structural genomics knowledgebase. Nucleic Acids Res 37(Database issue):D365–D368
Westbrook J et al (2003) The Protein Data Bank and structural genomics. Nucleic Acids Res 31(1):489–491
Pajon A et al (2005) Design of a data model for developing laboratory information management and analysis systems for protein production. Proteins 58(2):278–284
Prilusky J et al (2005) HalX: an open-source LIMS (Laboratory Information Management System) for small- to large-scale laboratories. Acta Crystallogr D Biol Crystallogr 61(Pt 6):671–678
Morris C et al (2011) The Protein Information Management System (PiMS): a generic tool for any structural biology research laboratory. Acta Crystallogr D Biol Crystallogr 67(Pt 4):249–260
Goh CS et al (2003) SPINE 2: a system for collaborative structural proteomics within a federated database framework. Nucleic Acids Res 31(11):2833–2838
Zolnai Z et al (2003) Project management system for structural and functional proteomics: sesame. J Struct Funct Genomics 4(1):11–23
Raymond S, O’Toole N, Cygler M (2004) A data management system for structural genomics. Proteome Sci 2(1):4
JCSG web portal. http://www.jcsg.org/. Accessed 4 Mar 2013
Benson DA et al (2013) GenBank. Nucleic Acids Res 41(Database issue):D36–D42
Apweiler R, Bairoch A, Wu CH (2004) Protein sequence databases. Curr Opin Chem Biol 8(1):76–80
Cymborowski M et al (2010) To automate or not to automate: this is the question. J Struct Funct Genomics 11(3):211–221
Nair R et al (2009) Structural genomics is the largest contributor of novel structural leverage. J Struct Funct Genomics 10(2):181–191
Liu J, Montelione GT, Rost B (2007) Novel leverage of structural genomics. Nat Biotechnol 25(8):849–851
Bucher MH, Evdokimov AG, Waugh DS (2002) Differential effects of short affinity tags on the crystallization of Pyrococcus furiosus maltodextrin-binding protein. Acta Crystallogr D Biol Crystallogr 58(Pt 3):392–397
Koth CM et al (2003) Use of limited proteolysis to identify protein domains suitable for structural analysis. Methods Enzymol 368:77–84
Kim Y et al (2008) Large-scale evaluation of protein reductive methylation for improving protein crystallization. Nat Methods 5(10):853–854
Cormier CY et al (2011) PSI:Biology-materials repository: a biologist’s resource for protein expression plasmids. J Struct Funct Genomics 12(2):55–62
Cormier CY et al (2010) Protein structure initiative material repository: an open shared public resource of structural genomics plasmids for the biological community. Nucleic Acids Res 38(Database issue):D743–D749
Baker R, Peacock S (2008) BEI Resources: supporting antiviral research. Antiviral Res 80(2):102–106
Chruszcz M, Wlodawer A, Minor W (2008) Determination of protein structures—a series of fortunate events. Biophys J 95(1):1–9
Page R et al (2003) Shotgun crystallization strategy for structural genomics: an optimized two-tiered crystallization screen against the Thermotoga maritima proteome. Acta Crystallogr D Biol Crystallogr 59(Pt 6):1028–1037
Babnigg G, Joachimiak A (2010) Predicting protein crystallization propensity from protein sequence. J Struct Funct Genomics 11(1):71–80
Kimber MS et al (2003) Data mining crystallization databases: knowledge-based approaches to optimize protein crystal screens. Proteins 51(4):562–568
Newman J et al (2005) Towards rationalization of crystallization screening for small- to medium-sized academic laboratories: the PACT/JCSG+ strategy. Acta Crystallogr D Biol Crystallogr 61(Pt 10):1426–1431
Zheng H et al (2008) Data mining of metal ion environments present in protein structures. J Inorg Biochem 102(9):1765–1776
Weekes D et al (2010) TOPSAN: a collaborative annotation environment for structural genomics. BMC Bioinforma 11:426
Hodis E et al (2008) Proteopedia—a scientific ‘wiki’ bridging the rift between three-dimensional structure and function of biomacromolecules. Genome Biol 9(8):R121
Lee WH et al (2009) SGC—structural biology and human health: a new approach to publishing structural biology results. PLoS One 4(10):e7675
Raush E et al (2009) A new method for publishing three-dimensional content. PLoS One 4(10):e7394
Hubert R (2001) Convergent architecture: building model-driven J2EE systems with UML. Wiley, New York
Howe D et al (2008) Big data: the future of biocuration. Nature 455(7209):47–50
Bateman A (2010) Curators of the world unite: the International Society of Biocuration. Bioinformatics 26(8):991
Chayen NE, Saridakis E (2008) Protein crystallization: from purified protein to diffraction-quality crystal. Nat Methods 5(2):147–153

Acknowledgments

The authors would like to thank Alex Wlodawer, Tom Terwilliger, Heidi Imker, Steve Almo, Wayne Anderson, Andrzej Joachimiak, Rachel Vigour, and Zbyszek Dauter for valuable comments on the manuscript. This work was supported by PSI:Biology grants U54 GM094585 and U54 GM094662 as well as grants R01 GM053163 and U54 GM093342. This work was also supported with federal funds from the NIAID, NIH, Department of Health and Human Services, under Contract Nos. HHSN272200700058C and HHSN272201200026C.

Author information

Authors and Affiliations

Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
Matthew D. Zimmerman, Marek Grabowski, Marcin J. Domagalski, Elizabeth M. MacLean & Wladek Minor
Center for Structural Genomics of Infectious Diseases (CSGID), University of Virginia, Charlottesville, VA, USA
Matthew D. Zimmerman, Marek Grabowski, Marcin J. Domagalski, Elizabeth M. MacLean & Wladek Minor
Midwest Center for Structural Genomics (MCSG), University of Virginia, Charlottesville, VA, USA
Matthew D. Zimmerman, Marek Grabowski, Marcin J. Domagalski, Elizabeth M. MacLean & Wladek Minor
New York Structural Genomics Research Consortium (NYSGRC), University of Virginia, Charlottesville, VA, USA
Matthew D. Zimmerman, Marek Grabowski, Marcin J. Domagalski, Elizabeth M. MacLean & Wladek Minor
Enzyme Function Initiative (EFI), University of Virginia, Charlottesville, VA, USA
Matthew D. Zimmerman, Marek Grabowski, Marcin J. Domagalski, Elizabeth M. MacLean & Wladek Minor
Department of Chemistry and Biochemistry, University of South Carolina, Columbia, SC, USA
Maksymilian Chruszcz

Editor information

Editors and Affiliations

Center for Structural Genomics of Infectious Diseases, Midwest Center for Structural Genomics, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
Wayne F. Anderson

About this protocol

Cite this protocol

Zimmerman, M.D., Grabowski, M., Domagalski, M.J., MacLean, E.M., Chruszcz, M., Minor, W. (2014). Data Management in the Modern Structural Biology and Biomedical Research Environment. In: Anderson, W.F. (eds) Structural Genomics and Drug Discovery. Methods in Molecular Biology, vol 1140. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-0354-2_1

DOI: https://doi.org/10.1007/978-1-4939-0354-2_1
Published: 08 February 2014
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-0353-5
Online ISBN: 978-1-4939-0354-2
eBook Packages: Springer Protocols
