K-mulus: Strategies for BLAST in the Cloud

Hill, Christopher M.; Albach, Carl H.; Angel, Sebastian G.; Pop, Mihai

doi:10.1007/978-3-642-55195-6_22

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8385)

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics

Abstract

With the increased availability of next-generation sequencing technologies, researchers are gathering more data than they are able to process and analyze. One of the most widely performed analyses is identifying regions of similarity between DNA or protein sequences using the Basic Local Alignment Search Tool, or BLAST. Due to the large amount of sequencing data produced, parallel implementations of BLAST are needed to process the data in a timely manner. While these implementations have been designed for researchers with access to computing grids, recent web-based services, such as Amazon's Elastic Compute Cloud, now offer scalable, pay-as-you-go computing. In this paper, we present K-mulus, an application that performs distributed BLAST queries via Hadoop MapReduce using a collection of established parallelization strategies. In addition, we provide a method to speed up BLAST by clustering the sequence database to reduce the search space for a given query. Our results show that users must take into account the size of the BLAST database and the memory of the underlying hardware to efficiently carry out BLAST queries in parallel. Finally, we show that while our database clustering and indexing approach offers a significant theoretical speedup, in practice the distribution of protein sequences prevents this potential from being realized.
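
The idea of clustering the database to shrink the search space can be pictured with a small sketch. This is an illustration only, not the K-mulus implementation: the abstract does not specify how clusters are indexed, so the k-mer set index, the function names, the choice of k = 3, and the `min_shared` threshold below are all assumptions made for the example. Each cluster is summarized by the set of k-mers its sequences contain, and a query is aligned only against clusters that share enough k-mers with it.

```python
def kmers(seq, k=3):
    """Return the set of length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_cluster_index(clusters, k=3):
    """Map each cluster id to the union of the k-mers of its sequences.

    `clusters` is a dict {cluster_id: [sequence, ...]}; how the clusters
    were formed (e.g. by k-means) is outside the scope of this sketch.
    """
    return {cid: set().union(*(kmers(s, k) for s in seqs))
            for cid, seqs in clusters.items()}


def candidate_clusters(query, index, k=3, min_shared=1):
    """Return ids of clusters sharing at least min_shared k-mers with query."""
    q = kmers(query, k)
    return sorted(cid for cid, center in index.items()
                  if len(q & center) >= min_shared)


# Toy protein database split into two hypothetical clusters.
clusters = {
    "c0": ["MKVLAAGIV", "MKVLSAGIV"],
    "c1": ["PPGWWTTRE", "PPGWYTTRE"],
}
index = build_cluster_index(clusters)

# This query shares k-mers only with c0, so c1 could be skipped.
selective = candidate_clusters("AAGIVK", index)

# This query shares k-mers with both clusters, so nothing is pruned.
unselective = candidate_clusters("TTREMKV", index)
```

In this toy setup the first query would be searched against one cluster out of two, while the second query touches both. The second case hints at the limitation the abstract reports: when the k-mers of real protein sequences are spread across many clusters, few clusters can be excluded and the theoretical speedup shrinks.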

Acknowledgments

We would like to thank Mohammadreza Ghodsi for advice on clustering, Daniel Sommer for advice on Hadoop, Lee Mendelowitz for manuscript feedback, Katherine Fenstermacher for the name K-mulus, and the other members of the Pop lab for valuable discussions on all aspects of our work.

This work is supported in part by the National Science Foundation, grant IIS-0844494 to MP.

Author information

Authors and Affiliations

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
Christopher M. Hill, Carl H. Albach & Mihai Pop
University of Texas, Austin, TX, USA
Sebastian G. Angel

Corresponding author

Correspondence to Christopher M. Hill.

Editor information

Editors and Affiliations

Institute of Computer and Information Science, Czestochowa University of Technology, Czestochowa, Poland
Roman Wyrzykowski
University of Tennessee, Department of Computer Science, Knoxville, Tennessee, USA
Jack Dongarra
Institute of Computer and Information Science, Czestochowa University of Technology, Czestochowa, Poland
Konrad Karczewski
Technical University of Denmark Informatics and Mathematical Modelling, Kongens Lyngby, Denmark
Jerzy Waśniewski

About this paper

Cite this paper

Hill, C.M., Albach, C.H., Angel, S.G., Pop, M. (2014). K-mulus: Strategies for BLAST in the Cloud. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2013. Lecture Notes in Computer Science, vol 8385. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55195-6_22

DOI: https://doi.org/10.1007/978-3-642-55195-6_22
Published: 08 May 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-55194-9
Online ISBN: 978-3-642-55195-6
eBook Packages: Computer Science, Computer Science (R0)
