Cloud Storage-Management Techniques for NGS Data

Abstract

Current advances in both computer science and the biological sciences are opening new opportunities for interdisciplinary research. On the one hand, computing hardware and cloud-based big-data analytics tools are developing rapidly, extending processing capability from terabyte-scale data sets to petabytes and beyond. On the other hand, advances in molecular biology experiments are producing huge amounts of data on genome and RNA sequences, protein and metabolite abundance, protein–protein interactions, gene expression, and so on. In most cases, these biological data form large, diverse, and complex networks.

Corresponding author

Correspondence to Evangelos Theodoridis.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Theodoridis, E. (2017). Cloud Storage-Management Techniques for NGS Data. In: Elloumi, M. (ed.) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_5

  • DOI: https://doi.org/10.1007/978-3-319-59826-0_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59824-6

  • Online ISBN: 978-3-319-59826-0

  • eBook Packages: Computer Science; Computer Science (R0)
