Cloud Storage-Management Techniques for NGS Data

Theodoridis, Evangelos

doi:10.1007/978-3-319-59826-0_5

Chapter
First Online: 19 September 2017

Abstract

Current scientific advancements in both computer and biological sciences are bringing new opportunities to interdisciplinary research topics. On one hand, computing and cloud-based big-data analytics software tools are being developed rapidly, increasing processing capability from terabyte data sets to petabytes and beyond. On the other hand, advances in molecular biology experiments are producing huge amounts of data related to genome and RNA sequences, protein and metabolite abundance, protein–protein interactions, gene expression, and so on. In most cases, biological data form big, versatile, complex networks.

Author information

Authors and Affiliations

Computer Technology Institute, Patras, Greece
Evangelos Theodoridis

Corresponding author

Correspondence to Evangelos Theodoridis.

Editor information

Editors and Affiliations

LaTICE, Tunis, Tunisia
Mourad Elloumi

About this chapter

Cite this chapter

Theodoridis, E. (2017). Cloud Storage-Management Techniques for NGS Data. In: Elloumi, M. (ed.) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_5

DOI: https://doi.org/10.1007/978-3-319-59826-0_5
Published: 19 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59824-6
Online ISBN: 978-3-319-59826-0
eBook Packages: Computer Science, Computer Science (R0)
