Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

  • Alan Kuhnle
  • Taher Mun
  • Christina Boucher
  • Travis Gagie
  • Ben Langmead
  • Giovanni Manzini
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11467)


While short read aligners, which predominantly use the FM-index, can easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the two main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (\({\mathsf{BWT}}\)) of the string, which allows us to find the interval in the string’s suffix array (\({\mathsf{SA}}\)) containing pointers to the starting positions of occurrences of a given pattern; second, a sample of the \({\mathsf{SA}}\) that, when used with the rank data structure, allows us to access the \({\mathsf{SA}}\). The rank data structure can be kept small even for large genomic databases by run-length compressing the \({\mathsf{BWT}}\), but until recently there was no known way to keep the \({\mathsf{SA}}\) sample small without greatly slowing down access to the \({\mathsf{SA}}\). Now that Gagie et al. (SODA 2018) have defined an \({\mathsf{SA}}\) sample that takes about the same space as the run-length compressed \({\mathsf{BWT}}\), we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the \({\mathsf{BWT}}\) of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.’s \({\mathsf{SA}}\) sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the \({\mathsf{SA}}\) sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes and show that it improves over Bowtie with respect to both memory and time.
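The two FM-index components described above can be illustrated with a toy Python sketch. It is not the authors' implementation: the function names are hypothetical, the suffix array is built by naive sorting, and rank is answered by a linear scan, all of which a real index would replace with succinct data structures. The last function assumes, beyond what the abstract states, that the \({\mathsf{SA}}\) sample keeps the \({\mathsf{SA}}\) values at the first and last position of each \({\mathsf{BWT}}\) run, which is why its size tracks the number of runs.

```python
def suffix_array(text):
    """Naive suffix array by sorting suffixes; only suitable for toy inputs."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT[i] is the character preceding the i-th smallest suffix (cyclically)."""
    return "".join(text[i - 1] for i in sa)


def count_runs(bwt):
    """Number of maximal runs of equal characters in the BWT."""
    return sum(1 for i in range(len(bwt)) if i == 0 or bwt[i] != bwt[i - 1])


def backward_search(bwt, pattern):
    """Return the half-open SA interval [lo, hi) of suffixes starting with pattern."""
    # C[c] = number of characters in the text that are strictly smaller than c.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def rank(c, i):
        # Occurrences of c in bwt[:i]; a linear scan stands in for a rank structure.
        return bwt.count(c, 0, i)

    lo, hi = 0, len(bwt)
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in C:
            return (0, 0)
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return (0, 0)
    return (lo, hi)


def run_boundary_samples(bwt, sa):
    """Toy SA sample (assumption): SA values at the start and end of each BWT run."""
    n = len(bwt)
    samples = {}
    for i in range(n):
        starts_run = i == 0 or bwt[i] != bwt[i - 1]
        ends_run = i == n - 1 or bwt[i] != bwt[i + 1]
        if starts_run or ends_run:
            samples[i] = sa[i]
    return samples


text = "banana$"  # '$' is a unique terminator smaller than every other character
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
lo, hi = backward_search(bwt, "ana")
print(bwt)                 # annb$aa
print(sorted(sa[lo:hi]))   # [1, 3], the starting positions of "ana"
```

On a repetitive input such as several concatenated copies of the same sequence, `count_runs` grows much more slowly than the text length, which is the property that makes the run-length compressed \({\mathsf{BWT}}\) and a run-bounded \({\mathsf{SA}}\) sample attractive for genomic databases.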

Availability: The implementations of our methods can be found at (BWT and SA sample construction) and at (indexing).



AK and CB were supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (1R01AI141810-01) and NSF-IIS (1618814). TM and BL were supported by the National Institutes of Health (R01GM118568) and NSF-IIS (1349906). TG was supported by FONDECYT grant 1171058 Compression-aware algorithmics. GM was partially supported by PRIN grant 201534HNXC and by INdAM-GNCS Project 2019 Innovative methods for the solution of medical and biological big data.


References

  1. Bannai, H., Gagie, T., I, T.: Online LZ77 parsing and matching statistics with RLBWTs. In: Proceedings of the 29th Annual Symposium on Combinatorial Pattern Matching, (CPM), vol. 105, pp. 7:1–7:12 (2018)
  2. Boucher, C., Gagie, T., Kuhnle, A., Manzini, G.: Prefix-free parsing for building big BWTs. In: Proceedings of 18th International Workshop on Algorithms in Bioinformatics, WABI, vol. 113, pp. 2:1–2:16 (2018)
  3. Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical report 124. Digital Equipment Corporation (1994)
  4. The 1000 Genomes Project Consortium: A global reference for human genetic variation. Nature 526(7571), 68–74 (2015)
  5. Danek, A., Deorowicz, S., Grabowski, S.: Indexes of large genome collections on a PC. PLoS ONE 9(10), e109384 (2014)
  6. Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569–1576 (2015)
  7. Garrison, E., et al.: Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36(9), 875–879 (2018)
  8. Ferrada, H., Gagie, T., Hirvola, T., Puglisi, S.J.: Hybrid indexes for repetitive datasets. Philos. Trans. Roy. Soc. A: Math. Phys. Eng. Sci. 372(2016), 1–9 (2014)
  9. Ferrada, H., Kempa, D., Puglisi, S.J.: Hybrid indexing revisited. In: Proceedings of the 21st Algorithm Engineering and Experiments, ALENEX, pp. 1–8 (2018)
  10. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, FOCS, pp. 390–398 (2000)
  11. Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the 29th Annual Symposium on Discrete Algorithms, SODA, pp. 1459–1477 (2018)
  12. Gagie, T., Puglisi, S.J.: Searching and indexing genomic databases via kernelization. Front. Bioeng. Biotechnol. 3, 10–13 (2015)
  13. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25, 1754–1760 (2009)
  14. Huang, L., Popic, V., Batzoglou, S.: Short read alignment with populations of genomes. Bioinformatics 29(13), i361–i370 (2013)
  15. Jain, M., et al.: Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36(4), 338–345 (2018)
  16. Jeong-Sun, S., et al.: De novo assembly and phasing of a Korean human genome. Nature 538(7624), 243–247 (2016)
  17. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Parallel external memory suffix sorting. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 329–342. Springer, Cham (2015)
  18. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4), 357 (2012)
  19. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2008)
  20. Levy, S., et al.: The diploid genome sequence of an individual human. PLoS Biol. 5(10), e254 (2007)
  21. Li, R., et al.: SOAP2: an improved tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)
  22. Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio], March 2013
  23. Maciuca, S., del Ojo Elias, C., McVean, G., Iqbal, Z.: A natural encoding of genetic variation in a Burrows-Wheeler Transform to enable mapping and genome inference. In: Frith, M., Storm Pedersen, C.N. (eds.) WABI 2016. LNCS, vol. 9838, pp. 222–233. Springer, Cham (2016)
  24. Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)
  25. Peng, Y., Leung, H.C.M., Yiu, S.M., Chin, F.Y.L.: IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11), 1420–1428 (2012)
  26. Policriti, A., Prezza, N.: LZ77 computation based on the run-length encoded BWT. Algorithmica 80(7), 1986–2011 (2018)
  27. Schneeberger, K., et al.: Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10(9), R98 (2009)
  28. Shi, L., et al.: Long-read sequencing and de novo assembly of a Chinese genome. Nat. Commun. 7, 12065 (2016)
  29. Sirén, J., Välimäki, N., Mäkinen, V.: Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans. Comput. Biol. Bioinform. 11(2), 375–388 (2014)
  30. Steinberg, K.M., et al.: Single haplotype assembly of the human genome from a hydatidiform mole. Genome Res. 24, 2066–2076 (2014). p. gr.180893.114
  31. Stevens, E.L., et al.: The public health impact of a publically available, environmental database of microbial genomes. Front. Microbiol. 8, 808 (2017)
  32. Valenzuela, D., Norri, T., Välimäki, N., Pitkänen, E., Mäkinen, V.: Towards pan-genome read alignment to improve variation calling. BMC Genomics 19(2), 87 (2018)
  33. Valenzuela, D., Mäkinen, V.: CHIC: a short read aligner for pan-genomic references. Technical report, (2017)
  34. Wandelt, S., Starlinger, J., Bux, M., Leser, U.: RCSI: scalable similarity search in thousand(s) of genomes. Proc. VLDB Endow. 6(13), 1534–1545 (2013)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Alan Kuhnle (1, corresponding author)
  • Taher Mun (2)
  • Christina Boucher (1)
  • Travis Gagie (3)
  • Ben Langmead (2)
  • Giovanni Manzini (4)

  1. Department of Computer and Information Science and Engineering, University of Florida, Gainesville, USA
  2. Department of Computer Science, Johns Hopkins University, Baltimore, USA
  3. School of Computer Science and Telecommunications, Universidad Diego Portales and CeBiB, Santiago, Chile
  4. Department of Science and Technological Innovation, University of Eastern Piedmont, Alessandria, Italy
