International Journal of Parallel Programming

, Volume 46, Issue 4, pp 762–775 | Cite as

Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

  • Han LinEmail author
  • Zhichao Su
  • Xiandong Meng
  • Xu Jin
  • Zhong Wang
  • Wenting Han
  • Hong An
  • Mengxian Chi
  • Zheng Wu
Part of the following topical collections:
  1. Special issue on Network and Parallel Computing for New Architectures and Applications


Metagenomics, the study of all microbial species cohabitants in an environment, often produces large amount of sequence data varying from several GBs to a few TBs. Analyzing metagenomics data includes both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions the shortgun metagenomics sequences based on their species of origin. Our solution combines MapReduce-based BioPig analytic toolkit with MPI to provide scalability in respective to both data and compute. We also made some improvements to the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations leads up to 193\(\times \) speedup for the computing-intensive step and 9.6\(\times \) speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results suggest integrating heterogeneous technologies such as Hadoop and MPI is quite efficient to solve large genomics problems that are both data-intensive and compute-intensive.


Metagenomics Hadoop MPI Optimization Pig Latin BioPig Big data Data-intensive Compute-intensive 



The work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000403). Xiandong Meng and Zhong Wang’s work was supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.


  1. 1.
    Anderson, M., Smith, S., Sundaram, N., Capotă, M., Zhao, Z., Dulloor, S., Satish, N., Willke, T.L.: Bridging the gap between hpc and big data frameworks. Proc. VLDB Endow. 10(8), 901–912 (2017)CrossRefGoogle Scholar
  2. 2.
    Dagum, L., Menon, R.: Openmp: an industry standard api for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)CrossRefGoogle Scholar
  3. 3.
    Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRefGoogle Scholar
  4. 4.
    Fox, G.C., Qiu, J., Kamburugamuve, S., Jha, S., Luckow, A.: Hpc-abds high performance computing enhanced apache big data stack. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 1057–1066. IEEE (2015)Google Scholar
  5. 5.
    Gittens, A., Devarakonda, A., Racah, E., Ringenburg, M., Gerhardt, L., Kottalam, J., Liu, J., Maschhoff, K., Canon, S., Chhugani, J., et al.: Matrix factorization at scale: a comparison of scientific data analytics in spark and c+ mpi using three case studies (2016). arXiv preprint arXiv:1607.01335
  6. 6.
    Gropp, W., Lusk, E., Doss, N., Skjellum, A.: A high-performance, portable implementation of the mpi message passing interface standard. Parallel Comput. 22(6), 789–828 (1996)CrossRefzbMATHGoogle Scholar
  7. 7.
    Guo, X., Yu, N., Ding, X., Wang, J., Pan, Y.: Dime: a novel framework for de novo metagenomic sequence assembly. J. Comput. Biol. 22(2), 159–177 (2015)CrossRefGoogle Scholar
  8. 8.
    Heger, D.: Hadoop performance tuning-a pragmatic & iterative approach. CMG J. 4, 97–113 (2013)Google Scholar
  9. 9.
    Hess, M., Sczyrba, A., Egan, R., Kim, T.W., Chokhawala, H., Schroth, G., Luo, S., Clark, D.S., Chen, F., Zhang, T., et al.: Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331(6016), 463–467 (2011)CrossRefGoogle Scholar
  10. 10.
    Joshi, S.B.: Apache hadoop performance-tuning methodologies and best practices. In: Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, pp. 241–242. ACM (2012)Google Scholar
  11. 11.
    Kiveris, R., Lattanzi, S., Mirrokni, V., Rastogi, V., Vassilvitskii, S.: Connected components in mapreduce and beyond. In: Proceedings of the ACM Symposium on Cloud Computing, pp. 1–13. ACM (2014)Google Scholar
  12. 12.
    Li, M., Zeng, L., Meng, S., Tan, J., Zhang, L., Butt, A.R., Fuller, N.: Mronline: Mapreduce online performance tuning. In: Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 165–176. ACM (2014)Google Scholar
  13. 13.
    Lu, X., Liang, F., Wang, B., Zha, L., Xu, Z.: Datampi: extending mpi to hadoop-like big data computing. In: 2014 IEEE 28th International Symposium on Parallel and Distributed Processing, pp. 829–838. IEEE (2014)Google Scholar
  14. 14.
    Metzker, M.L.: Sequencing technologies—the next generation. Nat. Rev. Genet. 11(1), 31–46 (2010)CrossRefGoogle Scholar
  15. 15.
    Nordberg, H., Bhatia, K., Wang, K., Wang, Z.: Biopig: a hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics 29(23), 3014–3019 (2013)Google Scholar
  16. 16.
    Nvidia, C.: Compute Unified Device Architecture Programming Guide (2007).
  17. 17.
    Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099–1110. ACM (2008)Google Scholar
  18. 18.
    Qiu, J., Jha, S., Luckow, A., Fox, G.C.: Towards hpc-abds: an initial high-performance big data stack. Build. Robust Big Data Ecosyst. ISO/IEC JTC 1, 18–21 (2014)Google Scholar
  19. 19.
    Rasheed, Z., Rangwala, H.: A map-reduce framework for clustering metagenomes. In: Parallel and Distributed Processing Symposium Workshops and Ph.D. Forum (IPDPSW), 2013 IEEE 27th International, pp. 549–558. IEEE (2013)Google Scholar
  20. 20.
    Reyes-Ortiz, J.L., Oneto, L., Anguita, D.: Big data analytics in the cloud: spark on hadoop vs mpi/openmp on beowulf. Proc. Comput. Sci. 53, 121–130 (2015)CrossRefGoogle Scholar
  21. 21.
    Schmidt, B., Hildebrandt, A.: Next-generation sequencing: big data meets high performance computing. Drug Discovery Today 4(4), 712–717 (2017)Google Scholar
  22. 22.
    Shi, L., Wang, Z., Yu, W., Meng, X.: Performance evaluation and tuning of biopig for genomic analysis. In: Proceedings of the 2015 International Workshop on Data-Intensive Scalable Computing Systems, p. 9. ACM (2015)Google Scholar
  23. 23.
    Tarjan, R.E.: Efficiency of a good but not linear set union algorithm. J. ACM (JACM) 22(2), 215–225 (1975)MathSciNetCrossRefzbMATHGoogle Scholar
  24. 24.
    Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., et al.: Apache hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, p. 5. ACM (2013)Google Scholar
  25. 25.
    Website: Apache hadoop.
  26. 26.
    Website: Apache pig.
  27. 27.
    Website: Apache tez.
  28. 28.
    Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10(10–10), 95 (2010)Google Scholar

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. 1.University of Science and Technology of ChinaHefeiChina
  2. 2.DOE Joint Genome Institute and Lawrence Berkeley National LaboratoryBerkeleyUSA

Personalised recommendations