Abstract
Cloud computing technologies have made it possible to analyze big data sets in scalable and cost-effective ways. DNA sequence analysis, where very large data sets are now generated at reduced cost using Next-Generation Sequencing (NGS) methods, is an area that can greatly benefit from cloud-based infrastructures. Although existing solutions show nearly linear scalability, they are significantly limited by data transfer latencies and cloud storage costs. In this paper, we propose a streaming data management architecture to tackle the performance problems that arise from having to transfer large amounts of data between clients and the cloud. Our approach provides an incremental data processing model that can hide data transfer latencies while maintaining linear scalability. We present an initial implementation and evaluation of this approach for SHRiMP, a well-known software package for NGS read alignment, based on the IBM InfoSphere Streams computing platform deployed on Amazon EC2.
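The idea of hiding transfer latency behind incremental processing can be illustrated with a minimal Python sketch. It is not the paper's implementation and uses neither SHRiMP nor InfoSphere Streams: the function names (`stream_reads`, `toy_align`, `process_incrementally`), the chunk size, and the exact-match "aligner" are all hypothetical stand-ins chosen for illustration. A producer thread plays the role of a client uploading reads in chunks, while the consumer aligns each chunk as soon as it arrives instead of waiting for the full data set.

```python
import queue
import threading


def stream_reads(reads, chunk_size, out_queue):
    """Producer: stands in for a client uploading reads chunk by chunk."""
    for i in range(0, len(reads), chunk_size):
        out_queue.put(reads[i:i + chunk_size])
    out_queue.put(None)  # sentinel: no more data will arrive


def toy_align(read, reference):
    """Hypothetical stand-in for a real aligner: exact substring match only.

    Returns the first matching position in the reference, or -1 if none.
    """
    return reference.find(read)


def process_incrementally(reads, reference, chunk_size=2):
    """Consumer: aligns each chunk as it arrives, overlapping with transfer."""
    # Bounded queue, so the producer cannot run arbitrarily far ahead.
    chunks = queue.Queue(maxsize=4)
    producer = threading.Thread(
        target=stream_reads, args=(reads, chunk_size, chunks)
    )
    producer.start()

    results = []
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        results.extend((read, toy_align(read, reference)) for read in chunk)

    producer.join()
    return results


reference = "ACGTACGGTCA"
reads = ["ACGT", "CGGT", "TTTT", "GTCA", "TACG"]
alignments = process_incrementally(reads, reference)
```

With a single producer and a single consumer on a FIFO queue, results come back in the order the reads were sent. In a real deployment the producer would be bounded by network bandwidth, so alignment of early chunks would proceed while later chunks are still in flight.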
Keywords
- DNA sequence analysis
- Next-Generation Sequencing (NGS)
- NGS read alignment
- cloud computing
- data stream processing
- incremental data processing
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kienzler, R., Bruggmann, R., Ranganathan, A., Tatbul, N. (2012). Large-Scale DNA Sequence Analysis in the Cloud: A Stream-Based Approach. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_52
DOI: https://doi.org/10.1007/978-3-642-29740-3_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29739-7
Online ISBN: 978-3-642-29740-3
