BiobankCloud: A Platform for the Secure Storage, Sharing, and Processing of Large Biomedical Data Sets

  • Alysson Bessani
  • Jörgen Brandt
  • Marc Bux
  • Vinicius Cogo
  • Lora Dimitrova
  • Jim Dowling
  • Ali Gholami
  • Kamal Hakimzadeh
  • Micheal Hummel
  • Mahmoud Ismail
  • Erwin Laure
  • Ulf Leser
  • Jan-Eric Litton
  • Roxanna Martinez
  • Salman Niazi
  • Jane Reichel
  • Karin Zimmermann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9579)

Abstract

Biobanks store and catalog human biological material that is increasingly being digitized using next-generation sequencing (NGS). There is, however, a computational bottleneck: existing software systems are neither scalable nor secure enough to store and process the incoming wave of genomic data from NGS machines. In the BiobankCloud project, we are building a Hadoop-based platform for the secure storage, sharing, and parallel processing of genomic data. We extended Hadoop to include support for multi-tenant studies, reduced storage requirements with erasure coding, and added support for extensible and consistent metadata. On top of Hadoop, we built a scalable scientific workflow engine featuring a proper workflow definition language that focuses on simple integration and chaining of existing tools, adaptive scheduling on Apache YARN, and support for iterative dataflows. Our platform also supports the secure sharing of data across different, distributed Hadoop clusters. The software is easily installed and comes with a user-friendly web interface for running, managing, and accessing data sets behind secure two-factor authentication. Initial tests have shown that the engine scales well to dozens of nodes. The entire system is open source and includes pre-defined workflows for popular tasks in biomedical data analysis, such as variant identification, differential transcriptome analysis using RNA-Seq, and analysis of miRNA-Seq and ChIP-Seq data.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Alysson Bessani (5)
  • Jörgen Brandt (2)
  • Marc Bux (2)
  • Vinicius Cogo (5)
  • Lora Dimitrova (4)
  • Jim Dowling (1)
  • Ali Gholami (1)
  • Kamal Hakimzadeh (1)
  • Micheal Hummel (4)
  • Mahmoud Ismail (1)
  • Erwin Laure (1)
  • Ulf Leser (2)
  • Jan-Eric Litton (3)
  • Roxanna Martinez (3)
  • Salman Niazi (1)
  • Jane Reichel (6)
  • Karin Zimmermann (4)
  1. KTH - Royal Institute of Technology, Stockholm, Sweden
  2. Humboldt-Universität zu Berlin, Berlin, Germany
  3. Karolinska Institute, Solna, Sweden
  4. Charite, Berlin, Germany
  5. LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
  6. Uppsala University, Uppsala, Sweden
