Next-generation sequencing data analysis on cloud computing

Kwon, Taesoo; Yoo, Won Gi; Lee, Won-Ja; Kim, Won; Kim, Dae-Won

doi:10.1007/s13258-015-0280-7

Review
Published: 04 April 2015

Genes & Genomics, volume 37, pages 489–501 (2015)

Taesoo Kwon¹,², Won Gi Yoo², Won-Ja Lee³, Won Kim¹ & Dae-Won Kim²

Abstract

With the advent of next-generation sequencing (NGS) applications, including whole genome sequencing (WGS), RNA sequencing (RNA-seq), and chromatin immunoprecipitation followed by sequencing (ChIP-seq), many biologists and computer scientists have highlighted the urgent need for computing power, storage, and bioinformatics software to analyze large quantities of sequence data. Building and maintaining the computational infrastructure required for massive data processing is now among the most important tasks. However, technology platforms that handle large amounts of information pose multiple challenges for data access and processing. To overcome these challenges, cloud computing technologies are emerging as a possible infrastructure for meeting the intensive demands on computing power and communication resources in NGS data analysis. In this review, we explain the concepts and key technologies of cloud computing, such as Map and Reduce, and discuss the problem of data transfer. To assess the performance and usefulness of these technologies, we analyzed NGS data on cloud platforms and compared the results with those from a local cluster. From the benchmark results, we conclude that cloud computing is still more expensive than a local cluster, but it provides reasonable performance for NGS data analysis at an acceptable price and could be a good alternative to local cluster systems.

Acknowledgments

This work was supported by the Research of Korea Centers for Disease Control and Prevention [4847-300-210-13]. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflict of interest

The authors declare that they have no conflict of interest.

Author information

Authors and Affiliations

School of Biological Sciences, Seoul National University, Seoul, 151-742, Republic of Korea
Taesoo Kwon & Won Kim
Division of Biosafety Evaluation and Control, Korea Centers for Disease Control and Prevention, Korea National Institute of Health, Chungbuk, 363-951, Republic of Korea
Taesoo Kwon, Won Gi Yoo & Dae-Won Kim
Division of Arboviruses, Korea Centers for Disease Control and Prevention, Korea National Institute of Health, Chungbuk, 363-951, Republic of Korea
Won-Ja Lee

Corresponding authors

Correspondence to Won Kim or Dae-Won Kim.

About this article

Cite this article

Kwon, T., Yoo, W.G., Lee, WJ. et al. Next-generation sequencing data analysis on cloud computing. Genes Genom 37, 489–501 (2015). https://doi.org/10.1007/s13258-015-0280-7

Received: 06 November 2014
Accepted: 24 March 2015
Published: 04 April 2015
Issue Date: June 2015
DOI: https://doi.org/10.1007/s13258-015-0280-7
