Abstract
Next-generation sequencing (NGS) technologies are poised to be the next big revolution in personalized healthcare and have caused the amount of available sequencing data to grow exponentially. While NGS data processing has become a major challenge for individual genomic research, commodity computers, as a cost-effective platform for distributed and parallel processing in laboratories, can help process such huge volumes of data. To deploy sequence-processing methods on these platforms, we present a parallel computational model for BLAST on commodity clusters that works in a data-parallel manner. The model follows the master-worker paradigm. The master temporarily stores incoming requests and splits the database into chunks according to the number of available workers. Each worker pulls and formats a unique chunk of the database and searches the queries against it. To evaluate the model, we searched queries of different lengths against a small database (UniProtKB/Swiss-Prot) and a large database (UniProtKB/TrEMBL). The results were identical to the output of the gold-standard method (NCBI BLAST), and our model achieved 25% higher performance than the most popular distributed form of BLAST (mpiBLAST).
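The master's splitting step can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation described in the paper: the function names `parse_fasta` and `split_database` are hypothetical, the database is assumed to be FASTA text held in memory, and the greedy balancing by residue count is one plausible policy that the abstract does not specify.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without '>', concatenated sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (header, sequence) records."""
    records: List[Record] = []
    header = None
    seq_parts: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(seq_parts)))
            header, seq_parts = line[1:], []
        elif header is not None:
            seq_parts.append(line)
    if header is not None:
        records.append((header, "".join(seq_parts)))
    return records


def split_database(records: List[Record], n_workers: int) -> List[List[Record]]:
    """Distribute records over n_workers chunks, one chunk per worker.

    Assumption: chunks are balanced by total residue count using a greedy
    longest-first heuristic; the paper may use a different policy.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    chunks: List[List[Record]] = [[] for _ in range(n_workers)]
    sizes = [0] * n_workers
    for rec in sorted(records, key=lambda r: len(r[1]), reverse=True):
        target = sizes.index(min(sizes))  # currently smallest chunk
        chunks[target].append(rec)
        sizes[target] += len(rec[1])
    return chunks


if __name__ == "__main__":
    demo = ">a\nMKV\n>b\nMKVLAA\nGG\n>c\nM\n"
    for worker_id, chunk in enumerate(split_database(parse_fasta(demo), 2)):
        print(worker_id, [header for header, _ in chunk])
```

In such a scheme each worker would format and search only its own chunk, and the per-chunk hits would then be merged; how the model merges results is not stated in the abstract.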
Notes
- 1. GenomeTools. Available at: http://genometools.org
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hajibaba, M., Sharifi, M., Gorgin, S. (2019). Data-Parallel Computational Model for Next Generation Sequencing on Commodity Clusters. In: Malyshkin, V. (eds) Parallel Computing Technologies. PaCT 2019. Lecture Notes in Computer Science, vol 11657. Springer, Cham. https://doi.org/10.1007/978-3-030-25636-4_22
DOI: https://doi.org/10.1007/978-3-030-25636-4_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-25635-7
Online ISBN: 978-3-030-25636-4
eBook Packages: Computer Science (R0)