K-mulus: Strategies for BLAST in the Cloud

  • Christopher M. Hill
  • Carl H. Albach
  • Sebastian G. Angel
  • Mihai Pop
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8385)

Abstract

With the increased availability of next-generation sequencing technologies, researchers are gathering more data than they are able to process and analyze. One of the most widely performed analyses is identifying regions of similarity between DNA or protein sequences using the Basic Local Alignment Search Tool, or BLAST. Due to the large amount of sequencing data produced, parallel implementations of BLAST are needed to process the data in a timely manner. While these implementations have been designed for researchers with access to computing grids, recent web-based services, such as Amazon’s Elastic Compute Cloud, now offer scalable, pay-as-you-go computing. In this paper, we present K-mulus, an application that performs distributed BLAST queries via Hadoop MapReduce using a collection of established parallelization strategies. In addition, we provide a method to speed up BLAST by clustering the sequence database to reduce the search space for a given query. Our results show that users must take into account the size of the BLAST database and the memory of the underlying hardware to carry out BLAST queries efficiently in parallel. Finally, we show that while our database clustering and indexing approach offers a significant theoretical speedup, in practice the distribution of protein sequences prevents this potential from being realized.
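The idea of clustering the database to shrink the search space can be illustrated with a minimal sketch. The abstract does not specify how clusters are built or indexed, so everything below is an assumption made for illustration: the clusters are given by hand, the index maps each k-mer to the clusters containing it, and the helper names (`kmers`, `build_cluster_index`, `candidate_clusters`) and the choice of k = 3 are hypothetical. It is not the K-mulus implementation.

```python
from collections import defaultdict


def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_cluster_index(clusters, k=3):
    """Map each k-mer to the ids of the clusters that contain it.

    `clusters` is a dict of cluster id -> list of sequences. How the
    clusters are formed is outside this sketch (assumed precomputed).
    """
    index = defaultdict(set)
    for cluster_id, sequences in clusters.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                index[kmer].add(cluster_id)
    return index


def candidate_clusters(query, index, k=3):
    """Return ids of clusters sharing at least one k-mer with the query.

    Only these clusters would need to be searched with BLAST; the rest
    of the database is skipped.
    """
    hits = set()
    for kmer in kmers(query, k):
        hits |= index.get(kmer, set())
    return hits


# Toy protein-like sequences, invented for this example only.
clusters = {
    "c0": ["MKVLAAGIV", "MKVLSAGIV"],
    "c1": ["PPWQRRTTE", "PPWQKRTTE"],
}
index = build_cluster_index(clusters)
print(sorted(candidate_clusters("AAGIVMK", index)))
```

In this toy case the query shares k-mers with only one of the two clusters, so half of the database is pruned. When k-mers are spread across most clusters, almost every cluster becomes a candidate and little is pruned, which is consistent with the abstract's observation that the distribution of protein sequences limits the practical speedup.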

Keywords

Bioinformatics, Cloud computing, Sequence alignment, Hadoop

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Christopher M. Hill (1)
  • Carl H. Albach (1)
  • Sebastian G. Angel (2)
  • Mihai Pop (1)

  1. Center for Bioinformatics and Computational Biology, University of Maryland, College Park, USA
  2. University of Texas, Austin, USA
