Parallel Processing and Applied Mathematics

Volume 8385 of the series Lecture Notes in Computer Science pp 237-246

K-mulus: Strategies for BLAST in the Cloud

  • Christopher M. Hill (corresponding author): Dept. Computer Sciences, Lancaster University; Center for Bioinformatics and Computational Biology, University of Maryland
  • Carl H. Albach: Dept. Computer Sciences, Lancaster University; Center for Bioinformatics and Computational Biology, University of Maryland
  • Sebastian G. Angel: Robotics Institute, Carnegie Mellon University; University of Texas
  • Mihai Pop: Dept. Computer Sciences, Lancaster University; Center for Bioinformatics and Computational Biology, University of Maryland

Abstract

With the increased availability of next-generation sequencing technologies, researchers are gathering more data than they are able to process and analyze. One of the most widely performed analyses is identifying regions of similarity between DNA or protein sequences using the Basic Local Alignment Search Tool, or BLAST. Due to the large amount of sequencing data produced, parallel implementations of BLAST are needed to process the data in a timely manner. While these implementations have been designed for researchers with access to computing grids, recent web-based services, such as Amazon’s Elastic Compute Cloud, now offer scalable, pay-as-you-go computing. In this paper, we present K-mulus, an application that performs distributed BLAST queries via Hadoop MapReduce using a collection of established parallelization strategies. In addition, we provide a method to speed up BLAST by clustering the sequence database to reduce the search space for a given query. Our results show that users must take into account the size of the BLAST database and the memory of the underlying hardware to carry out BLAST queries efficiently in parallel. Finally, we show that while our database clustering and indexing approach offers a significant theoretical speedup, in practice the distribution of protein sequences prevents this potential from being realized.
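
The abstract does not say how the clustered database is indexed, so the following Python sketch is only an illustration of the general idea of reducing the search space. It assumes a simple k-mer index over clusters; the function names (`kmers`, `build_cluster_index`, `candidate_clusters`), the toy sequences, and the choice of k = 3 are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_cluster_index(clusters, k):
    """Map each k-mer to the ids of the clusters that contain it.

    `clusters` is a dict of cluster id -> list of sequences (assumed layout).
    """
    index = defaultdict(set)
    for cluster_id, sequences in clusters.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                index[kmer].add(cluster_id)
    return index


def candidate_clusters(query, index, k):
    """Return the clusters that share at least one k-mer with the query."""
    hits = set()
    for kmer in kmers(query, k):
        hits |= index.get(kmer, set())
    return hits


# Toy database of three clusters (made-up protein-like strings).
clusters = {
    "c0": ["MKVLAAGIVG", "MKVLSAGIVG"],
    "c1": ["PQRSTWYHHE", "PQRSTWYHDE"],
    "c2": ["GGGGSGGGGS"],
}
index = build_cluster_index(clusters, k=3)

# Only the selected clusters would be handed to BLAST for this query.
selected = candidate_clusters("AAGIVGPQ", index, k=3)
print(selected)
```

In this toy case only one of the three clusters shares a k-mer with the query, so the other two could be skipped. If most clusters share some k-mer with most queries, a filter of this kind prunes very little, which is one way to read the abstract's remark that the distribution of protein sequences limits the speedup in practice.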

Keywords

Bioinformatics, Cloud computing, Sequence alignment, Hadoop