Abstract
Biobanks store and catalog human biological material that is increasingly being digitized using next-generation sequencing (NGS). There is, however, a computational bottleneck: existing software systems are neither scalable nor secure enough to store and process the incoming wave of genomic data from NGS machines. In the BiobankCloud project, we are building a Hadoop-based platform for the secure storage, sharing, and parallel processing of genomic data. We extended Hadoop to support multi-tenant studies, reduced storage requirements with erasure coding, and added support for extensible and consistent metadata. On top of Hadoop, we built a scalable scientific workflow engine that features a proper workflow definition language focused on simple integration and chaining of existing tools, adaptive scheduling on Apache YARN, and support for iterative dataflows. Our platform also supports the secure sharing of data across different, distributed Hadoop clusters. The software is easily installed and comes with a user-friendly web interface for running, managing, and accessing data sets behind secure two-factor authentication. Initial tests have shown that the engine scales well to dozens of nodes. The entire system is open source and includes predefined workflows for popular tasks in biomedical data analysis, such as variant identification, differential transcriptome analysis using RNA-Seq, and analysis of miRNA-Seq and ChIP-Seq data.
Notes
- 1.
- 2.
- 3. Yubikey Manual, http://www.yubico.com/
- 4. The concrete roles should be seen as implementations of the European Data Protection Directive.
Acknowledgements
This work is funded by the EU FP7 project “Scalable, Secure Storage and Analysis of Biobank Data” under Grant Agreement no. 317871.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Bessani, A. et al. (2016). BiobankCloud: A Platform for the Secure Storage, Sharing, and Processing of Large Biomedical Data Sets. In: Wang, F., Luo, G., Weng, C., Khan, A., Mitra, P., Yu, C. (eds) Biomedical Data Management and Graph Online Querying. Big-O(Q) 2015, DMAH 2015. Lecture Notes in Computer Science, vol 9579. Springer, Cham. https://doi.org/10.1007/978-3-319-41576-5_7
DOI: https://doi.org/10.1007/978-3-319-41576-5_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41575-8
Online ISBN: 978-3-319-41576-5
eBook Packages: Computer Science, Computer Science (R0)