Data-Parallel Computational Model for Next Generation Sequencing on Commodity Clusters

  • Majid Hajibaba
  • Mohsen Sharifi
  • Saeid Gorgin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11657)

Abstract

Next generation sequencing (NGS) technologies are poised to be the next big revolution in personalized healthcare and have caused the amount of available sequencing data to grow exponentially. While NGS data processing has become a major challenge for individual genomic research, commodity computers, as a cost-effective platform for distributed and parallel processing in laboratories, can help process such huge volumes of data. To deploy sequence-processing methods on these platforms, in this paper we present a parallel computational model for BLAST on commodity clusters that works in a data-parallel manner. The suggested model follows a master-worker paradigm. The master temporarily stores incoming requests and splits the database into chunks according to the number of available workers. Each worker pulls, formats, and searches queries against a unique chunk of the database. To show that our model works well, we used queries of different lengths to search against a small database (UniProtKB/SWISS-PROT) and a large database (UniProtKB/TrEMBL). The results were identical to the output of the gold-standard method (NCBI BLAST), and our model outperformed the most popular distributed form of BLAST (mpiBLAST), achieving 25% higher performance.
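
The master-worker flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (parse_fasta, split_database, search_chunk, master) are hypothetical, the greedy size-balanced split is an assumed partitioning strategy, threads stand in for cluster nodes, and an exact substring match stands in for a real BLAST search.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Return a list of (header, sequence) records from FASTA text."""
    records, header, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def split_database(records, num_workers):
    """Assumed strategy: assign each record (longest first) to the
    chunk that currently holds the fewest residues."""
    chunks = [[] for _ in range(num_workers)]
    sizes = [0] * num_workers
    for record in sorted(records, key=lambda r: len(r[1]), reverse=True):
        i = sizes.index(min(sizes))
        chunks[i].append(record)
        sizes[i] += len(record[1])
    return chunks


def search_chunk(chunk, queries):
    """Toy stand-in for a BLAST search: exact substring match."""
    return [(q, header) for q in queries for header, seq in chunk if q in seq]


def master(fasta_text, queries, num_workers):
    """Split the database, let each worker search its own chunk,
    then merge the partial hit lists."""
    chunks = split_database(parse_fasta(fasta_text), num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial = pool.map(search_chunk, chunks, [queries] * num_workers)
    return sorted(hit for hits in partial for hit in hits)


if __name__ == "__main__":
    db = ">a\nMKTAYIAK\n>b\nGGGMKT\n>c\nAAAA\n"
    print(master(db, ["MKT", "AAA"], num_workers=2))
```

Because every database record lands in exactly one chunk, the merged hits equal what a single search over the whole database would return in this toy setting. A real deployment would additionally have to account for alignment statistics that depend on the total database size.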

Keywords

Distributed systems; Next generation sequencing; Parallel computational models; Parallel programming paradigm; Commodity clusters

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Electrical and Information Technology, Iranian Research Organization for Science and Technology, Tehran, Iran
  2. Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
