Advertisement

Algorithmica

, Volume 81, Issue 4, pp 1370–1391 | Cite as

Fixed Block Compression Boosting in FM-Indexes: Theory and Practice

  • Simon Gog
  • Juha Kärkkäinen
  • Dominik KempaEmail author
  • Matthias Petri
  • Simon J. Puglisi
Article
  • 88 Downloads

Abstract

The FM index (Ferragina and Manzini in J ACM 52(4):552–581, 2005) is a widely-used compressed data structure that stores a string T in a compressed form and also supports fast pattern matching queries. In this paper, we describe new FM-index variants that combine nice theoretical properties, simple implementation and improved practical performance. Our main theoretical result is a new technique called fixed block compression boosting, which is a simpler and faster alternative to optimal compression boosting and implicit compression boosting used in previous FM-indexes. We also describe several new techniques for implementing fixed-block boosting efficiently, including a new, fast, and space-efficient implementation of wavelet trees. Our extensive experiments show the new indexes to be consistently fast and small relative to the state-of-the-art, and thus they make a good “off-the-shelf” choice for many applications.

Keywords

Text indexing Wavelet tree FM-index Compression boosting Suffix array Pattern matching 

Notes

Acknowledgements

Preliminary versions of this article appeared as J. Kärkkäinen and S. J. Puglisi, “Fixed block compression boosting in FM-indexes”, Proc. SPIRE 2011 [19], and S. Gog, J. Kärkkäinen, D. Kempa, M. Petri, and S. J. Puglisi, “Faster, minuter”, Proc. DCC 2016 [13].

References

  1. 1.
    Belazzougui, D., Gagie, T., Mäkinen, V., Previtali, M., Puglisi, S.J.: Bidirectional variable-order de Bruijn graphs. In: 12th Latin American Theoretical Informatics Symposium (LATIN 2016), Ensenada, 11–15 Apr 2016. LNCS, vol. 9644, pp. 164–178. Springer (2016)Google Scholar
  2. 2.
    Boucher, C., Bowe, A., Gagie, T., Puglisi, S.J., Sadakane, K.: Variable-order de Bruijn graphs. In: 2015 Data Compression Conference (DCC 2015), Snowbird, 7–9 Apr 2015, pp. 383–392. IEEE (2015)Google Scholar
  3. 3.
    Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn graphs. In: 12th Workshop on Algorithms in Bioinformatics (WABI 2012), Ljubljana, 10–12 Sept 2012. LNCS, vol. 7534, pp. 225–235. Springer (2012)Google Scholar
  4. 4.
    Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation, Palo Alto, California (1994)Google Scholar
  5. 5.
    Claude, F., Navarro, G., Pereira, A.O.: The wavelet matrix: an efficient wavelet tree for large alphabets. Inf. Syst. 47, 15–32 (2015)CrossRefGoogle Scholar
  6. 6.
    Ferragina, P., Giancarlo, R., Manzini, G., Sciortino, M.: Boosting textual compression in optimal linear time. J. ACM 52(4), 688–713 (2005)MathSciNetCrossRefzbMATHGoogle Scholar
  7. 7.
    Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: from theory to practice. ACM J. Exp. Algorithmics 13, 1.12–1.31 (2009)MathSciNetzbMATHGoogle Scholar
  8. 8.
    Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)MathSciNetCrossRefzbMATHGoogle Scholar
  9. 9.
    Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: An alphabet-friendly FM-index. In: 11th International Symposium on String Processing and Information Retrieval (SPIRE 2004), Padova, 5–8 Oct 2004. LNCS, vol. 3246, pp. 150–160. Springer (2004)Google Scholar
  10. 10.
    Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), Article 20 (2007)Google Scholar
  11. 11.
    Ferragina, P., Nitto, I., Venturini, R.: On optimally partitioning a text to improve its compression. Algorithmica 61(1), 51–74 (2011)MathSciNetCrossRefzbMATHGoogle Scholar
  12. 12.
    Flicek, P., Birney, E.: Sense from sequence reads: methods for alignment and assembly. Nat. Methods 6(11 Suppl), S6–S12 (2009)CrossRefGoogle Scholar
  13. 13.
    Gog, S., Kärkkäinen, J., Kempa, D., Petri, M., Puglisi, S.J.: Faster, minuter. In: 2016 Data Compression Conference (DCC 2016), Snowbird, 30 Mar–1 Apr, 2016, pp. 53–62. IEEE (2016)Google Scholar
  14. 14.
    Gog, S., Petri, M.: Optimized succinct data structures for massive data. Softw. Pract. Exp. 44(11), 1287–1314 (2014)CrossRefGoogle Scholar
  15. 15.
    Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: 14th Annual Symposium on Discrete Algorithms (SODA 2003), Baltimore, 12–14 Jan 2003, pp. 841–850. SIAM (2003)Google Scholar
  16. 16.
    Grossi, R., Gupta, A., Vitter, J.S.: When indexing equals compression: experiments with compressing suffix arrays and applications. In: 15th Annual Symposium on Discrete Algorithms (SODA 2004), New Orleans, 11–14 Jan 2004, pp. 636–645. SIAM (2004)Google Scholar
  17. 17.
    Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Hybrid compression of bitvectors for the FM-index. In: 2014 Data Compression Conference (DCC 2014), Snowbird, 26–28 Mar 2014, pp. 302–311. IEEE (2014)Google Scholar
  18. 18.
    Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lempel–Ziv parsing in external memory. In: 2014 Data Compression Conference (DCC 2014), Snowbird, 26–28 Mar 2014, pp. 153–162. IEEE (2014)Google Scholar
  19. 19.
    Kärkkäinen, J., Puglisi, S.J.: Fixed block compression boosting in FM-indexes. In: 18th International Symposium on String Processing and Information Retrieval (SPIRE 2011), Pisa, 17–21 Oct 2011. LNCS, vol. 7024, pp. 174–184. Springer (2011)Google Scholar
  20. 20.
    Kempa, D., Puglisi, S.J.: Lempel-Ziv factorization: simple, fast, practical. In: 15th Meeting on Algorithm Engineering and Experiments (ALENEX 2013), New Orleans, 7 Jan 2013, pp. 103–112. SIAM (2013)Google Scholar
  21. 21.
    Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K., Wang, J.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)CrossRefGoogle Scholar
  22. 22.
    Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I.: Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, Cambridge (2015)CrossRefGoogle Scholar
  23. 23.
    Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nord. J. Comput. 12(1), 40–66 (2005)MathSciNetzbMATHGoogle Scholar
  24. 24.
    Mäkinen, V., Navarro, G.: Implicit compression boosting with applications to self-indexing. In: 14th International Symposium on String Processing and Information Retrieval (SPIRE 2007), Santiago, 29–31 Oct 2007. LNCS, vol. 4726, pp. 229–241. Springer (2007)Google Scholar
  25. 25.
    Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)MathSciNetCrossRefGoogle Scholar
  26. 26.
    Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)MathSciNetCrossRefzbMATHGoogle Scholar
  27. 27.
    Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: 11th International Symposium on Experimental Algorithms (SEA 2012), Bordeaux, 7–9 June 2012. LNCS, vol. 7276, pp. 295–306. Springer (2012)Google Scholar
  28. 28.
    Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: 9th Meeting on Algorithm Engineering and Experiments (ALENEX 2007), New Orleans, 6 Jan 2007, pp. 60–70. SIAM (2007)Google Scholar
  29. 29.
    Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), 43 (2007)MathSciNetCrossRefzbMATHGoogle Scholar
  30. 30.
    Shareghi, E., Petri, M., Haffari, G., Cohn, T.: Compact, efficient and unlimited capacity: language modeling with compressed suffix trees. In: 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, 17–21 Sept 2015, pp. 2409–2418. The Association for Computational Linguistics (2015)Google Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Simon Gog
    • 1
  • Juha Kärkkäinen
    • 2
  • Dominik Kempa
    • 3
    Email author
  • Matthias Petri
    • 4
  • Simon J. Puglisi
    • 3
  1. 1.Institute for Theoretical InformaticsKarlsruhe Institute of TechnologyKarlsruheGermany
  2. 2.Department of Computer ScienceUniversity of HelsinkiHelsinkiFinland
  3. 3.Department of Computer Science, Helsinki Institute for Information TechnologyUniversity of HelsinkiHelsinkiFinland
  4. 4.School of Computing and Information SystemsUniversity of MelbourneMelbourneAustralia

Personalised recommendations