BiobankCloud: A Platform for the Secure Storage, Sharing, and Processing of Large Biomedical Data Sets

  • Conference paper
Biomedical Data Management and Graph Online Querying (Big-O(Q) 2015, DMAH 2015)

Abstract

Biobanks store and catalog human biological material that is increasingly being digitized using next-generation sequencing (NGS). There is, however, a computational bottleneck, as existing software systems are neither scalable nor secure enough to store and process the incoming wave of genomic data from NGS machines. In the BiobankCloud project, we are building a Hadoop-based platform for the secure storage, sharing, and parallel processing of genomic data. We extended Hadoop to include support for multi-tenant studies, reduced storage requirements with erasure coding, and added support for extensible and consistent metadata. On top of Hadoop, we built a scalable scientific workflow engine featuring a proper workflow definition language that focuses on simple integration and chaining of existing tools, adaptive scheduling on Apache YARN, and support for iterative dataflows. Our platform also supports the secure sharing of data across different, distributed Hadoop clusters. The software is easily installed and comes with a user-friendly web interface for running, managing, and accessing data sets behind secure two-factor authentication. Initial tests have shown that the engine scales well to dozens of nodes. The entire system is open-source and includes pre-defined workflows for popular tasks in biomedical data analysis, such as variant identification, differential transcriptome analysis using RNA-Seq, and analysis of miRNA-Seq and ChIP-Seq data.

Notes

  1. See http://www.genomicsengland.co.uk/

  2. http://www.karamel.io/

  3. Yubikey Manual, http://www.yubico.com/

  4. The concrete roles should be seen as implementations of the European Data Protection Directive.

Acknowledgements

This work is funded by the EU FP7 project “Scalable, Secure Storage and Analysis of Biobank Data” under Grant Agreement no. 317871.

Author information

Corresponding author

Correspondence to Ulf Leser.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Bessani, A. et al. (2016). BiobankCloud: A Platform for the Secure Storage, Sharing, and Processing of Large Biomedical Data Sets. In: Wang, F., Luo, G., Weng, C., Khan, A., Mitra, P., Yu, C. (eds) Biomedical Data Management and Graph Online Querying. Big-O(Q) 2015, DMAH 2015. Lecture Notes in Computer Science, vol 9579. Springer, Cham. https://doi.org/10.1007/978-3-319-41576-5_7

  • DOI: https://doi.org/10.1007/978-3-319-41576-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41575-8

  • Online ISBN: 978-3-319-41576-5

  • eBook Packages: Computer Science, Computer Science (R0)
