Abstract
Current scientific advancements in both computer and biological sciences are bringing new opportunities to interdisciplinary research topics. On the one hand, computers and cloud-based big-data analytics tools are being developed rapidly, increasing processing capability from terabyte-scale data sets to petabytes and beyond. On the other hand, advances in molecular biology experiments are producing huge amounts of data on genome and RNA sequences, protein and metabolite abundance, protein–protein interactions, gene expression, and so on. In most cases, biological data form large, varied, complex networks.
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Theodoridis, E. (2017). Cloud Storage-Management Techniques for NGS Data. In: Elloumi, M. (eds) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_5
DOI: https://doi.org/10.1007/978-3-319-59826-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59824-6
Online ISBN: 978-3-319-59826-0
eBook Packages: Computer Science (R0)