European Conference on Parallel Processing

Euro-Par 2011: Parallel Processing Workshops, pp 467–476

Large-Scale DNA Sequence Analysis in the Cloud: A Stream-Based Approach

  • Romeo Kienzler,
  • Rémy Bruggmann,
  • Anand Ranganathan &
  • Nesime Tatbul

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 7156)

Abstract

Cloud computing technologies have made it possible to analyze big data sets in scalable and cost-effective ways. DNA sequence analysis, where very large data sets are now generated at reduced cost using Next-Generation Sequencing (NGS) methods, is an area that can greatly benefit from cloud-based infrastructures. Although existing solutions show nearly linear scalability, they pose significant limitations in terms of data transfer latencies and cloud storage costs. In this paper, we propose a streaming data management architecture to tackle the performance problems that arise from having to transfer large amounts of data between clients and the cloud. Our approach provides an incremental data processing model that can hide data transfer latencies while maintaining linear scalability. We present an initial implementation and evaluation of this approach for SHRiMP, a well-known software package for NGS read alignment, based on the IBM InfoSphere Streams computing platform deployed on Amazon EC2.
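The incremental processing idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, which is built on IBM InfoSphere Streams and SHRiMP. All names here (`chunk_reads`, `toy_align`, `stream_align`) are hypothetical, and the aligner is a placeholder that only finds exact substring matches. The point is the structure: reads are handed to the aligner chunk by chunk while later chunks are still arriving, instead of waiting for the whole data set to be transferred first.

```python
import queue
import threading
from typing import Iterable, Iterator, List, Tuple


def chunk_reads(reads: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group an incoming stream of reads into chunks of at most chunk_size."""
    chunk: List[str] = []
    for read in reads:
        chunk.append(read)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def toy_align(read: str, reference: str) -> int:
    """Placeholder aligner (assumption): first exact match position, or -1."""
    return reference.find(read)


def stream_align(reads: Iterable[str], reference: str,
                 chunk_size: int = 2) -> List[Tuple[str, int]]:
    """Align reads incrementally while the producer is still sending them."""
    buffer: "queue.Queue" = queue.Queue(maxsize=4)  # bounded: applies backpressure
    results: List[Tuple[str, int]] = []

    def consumer() -> None:
        # Stands in for the cloud side: process each chunk as soon as it arrives.
        while True:
            chunk = buffer.get()
            if chunk is None:  # sentinel marks the end of the stream
                break
            results.extend((r, toy_align(r, reference)) for r in chunk)

    worker = threading.Thread(target=consumer)
    worker.start()
    # Stands in for the client side: send chunks without waiting for the full set.
    for chunk in chunk_reads(reads, chunk_size):
        buffer.put(chunk)
    buffer.put(None)
    worker.join()
    return results


if __name__ == "__main__":
    print(stream_align(["ACGT", "TTAG", "GGGG"], "ACGTACGTTAGC"))
```

With a single consumer the results keep the input order. A real deployment would run several alignment workers in parallel and would have to merge their outputs, which this sketch does not attempt.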

Keywords

  • DNA sequence analysis
  • Next-Generation Sequencing (NGS)
  • NGS read alignment
  • cloud computing
  • data stream processing
  • incremental data processing

Author information

Authors and Affiliations

  1. Department of Computer Science, ETH Zurich, Switzerland

    Romeo Kienzler & Nesime Tatbul

  2. Bioinformatics, Department of Biology, University of Berne, Switzerland

    Rémy Bruggmann

  3. IBM T.J. Watson Research Center, NY, USA

    Anand Ranganathan

Editor information

Editors and Affiliations

  1. Scilytics, Koellnerhofgasse 3/15A, 1010, Vienna, Austria

    Michael Alexander

  2. ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy

    Pasqua D’Ambra

  3. University of Amsterdam, 1090, Amsterdam, Netherlands

    Adam Belloum

  4. Innovative Computing Laboratory, The University of Tennessee, US

    George Bosilca

  5. Department of Experimental Medicine and Clinic, University Magna Græcia, 88100, Catanzaro, Italy

    Mario Cannataro

  6. Computer Science Department, University of Pisa, Italy

    Marco Danelutto

  7. Second University of Naples, Italy

    Beniamino Di Martino

  8. TU München, Boltzmannstr. 3, 85748, Garching, Germany

    Michael Gerndt

  9. Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Emmanuel Jeannot & Raymond Namyst

  10. Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Jean Roman

  11. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831-6164, Oak Ridge, TN, USA

    Stephen L. Scott

  12. Department of Scientific Computing, University of Vienna, Nordbergstr. 15/3C, 1090, Vienna, Austria

    Jesper Larsson Träff

  13. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831, Oak Ridge, TN, USA

    Geoffroy Vallée

  14. Technische Universität München, Germany

    Josef Weidendorfer

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kienzler, R., Bruggmann, R., Ranganathan, A., Tatbul, N. (2012). Large-Scale DNA Sequence Analysis in the Cloud: A Stream-Based Approach. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_52

  • DOI: https://doi.org/10.1007/978-3-642-29740-3_52

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29739-7

  • Online ISBN: 978-3-642-29740-3

  • eBook Packages: Computer Science, Computer Science (R0)
