Performance Characterization of De Novo Genome Assembly on Leading Parallel Systems

  • Marquita Ellis (email author)
  • Evangelos Georganas
  • Rob Egan
  • Steven Hofmeyr
  • Aydın Buluç
  • Brandon Cook
  • Leonid Oliker
  • Katherine Yelick
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10417)

Abstract

De novo genome assembly is one of the most important and challenging computational problems in modern genomics; further, it shares algorithms and communication patterns important to other graph analytic and irregular applications. Unlike simulations, it has no floating-point arithmetic and is dominated by small memory transactions within and between computing nodes. In this work, we focus on the highly scalable HipMer assembler, identify its dominant algorithms and communication patterns, and use microbenchmarks to capture the workload. We evaluate HipMer on a variety of platforms, from the latest HPC systems to Ethernet clusters. HipMer performs well on all single-node systems, including the Xeon Phi manycore architecture. Given large enough problems, it also demonstrates excellent scaling across nodes in an HPC system, but requires a high-speed network with low overhead and high injection rates. Our results shed light on the architectural features that are most important for achieving good parallel efficiency on this and related problems.

Notes

Acknowledgments

All authors at Lawrence Berkeley National Laboratory (LBNL) were supported by the Department of Energy (DOE) Offices of Advanced Scientific Computing Research (ASCR) and Biological and Environmental Research (BER), both under contract number DE-AC02-05CH11231. This includes funding to BER’s Joint Genome Institute, the ASCR-funded Exascale Computing Project, and the ASCR Mathematics and Computer Science Research Programs. This work used resources of ASCR’s National Energy Research Scientific Computing Center (NERSC) under the same LBNL contract and ASCR’s Oak Ridge Leadership Computing Facility (OLCF) under Contract No. DE-AC05-00OR22725.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Marquita Ellis (1, 2), email author
  • Evangelos Georganas (1, 2, 5)
  • Rob Egan (3)
  • Steven Hofmeyr (2)
  • Aydın Buluç (1, 2)
  • Brandon Cook (4)
  • Leonid Oliker (2)
  • Katherine Yelick (1, 2)

  1. EECS Department, University of California, Berkeley, USA
  2. Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, USA
  3. Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, USA
  4. National Energy Research Scientific Computing Center, Berkeley, USA
  5. Parallel Computing Lab, Intel Corp., Santa Clara, USA
