Abstract
The development of secure and reliable systems to collect, store, utilise, and share data on study participants plays a critical role in large population health studies. Contemporary prospective biobank studies typically involve hundreds of thousands of participants and collect a wide range of data through questionnaires, physical measurements, sample assays, and linkage with external data sources over an extended period. Careful planning and management of a central data repository are required to ensure the privacy, security, accessibility, flexibility, consistency, and accuracy of the data collected and generated in the study. This chapter outlines some of the key concepts and principles underlying the design and development of data storage infrastructures, database architecture, and management systems in large biobank studies. It also describes practical considerations for each step from initial data collection from study participants to delivery of research-ready datasets: data import, cleaning, and integration; quality checks, standardisation, and validation; and, finally, preparation of datasets for bona fide researchers. The general principles and approaches described should be applicable to a wide variety of population health studies in different settings.
Abbreviations
- API: Application programming interface
- CKB: China Kadoorie Biobank
- DAG: Data access governance
- DBMS: Database management system
- ICD: International classification of diseases
- ID: Identifier
- IT: Information technology
- RDBMS: Relational database management system
- SQL: Structured query language
- SOP: Standard operating procedures
- WHO: World Health Organisation
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Sansome, G., Hacker, A. (2020). Management and Curation of Multi-Dimensional Data in Biobank Studies. In: Chen, Z. (eds) Population Biobank Studies: A Practical Guide. Springer, Singapore. https://doi.org/10.1007/978-981-15-7666-9_8
DOI: https://doi.org/10.1007/978-981-15-7666-9_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-7665-2
Online ISBN: 978-981-15-7666-9
eBook Packages: Biomedical and Life Sciences (R0)