LZ77-Based Self-indexing with Faster Pattern Matching

  • Travis Gagie
  • Paweł Gawrychowski
  • Juha Kärkkäinen
  • Yakov Nekrich
  • Simon J. Puglisi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8392)

Abstract

To store and search genomic databases efficiently, researchers have recently started building self-indexes based on LZ77. As the name suggests, a self-index for a string supports both random access and pattern matching queries. In this paper we show how, given a string S[1..n] whose LZ77 parse consists of z phrases, we can store a self-index for S in \(\mathcal{O}({z \log (n / z)})\) space that supports two queries: first, given a position i and a length ℓ, we can extract S[i..i + ℓ − 1] in \(\mathcal{O}({\ell + \log n})\) time; second, given a pattern P[1..m], we can list the occ occurrences of P in S in \(\mathcal{O}({m \log m + occ \log \log n})\) time.
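To make the quantities in the abstract concrete, the following is a minimal sketch of what z and the two queries mean. It is not the paper's index: it uses a naive quadratic-time greedy parser and answers the queries by scanning the plain string. The function names (`lz77_parse`, `decode`, `extract`, `occurrences`) are hypothetical, and the parse variant (each phrase is either a single new character or the longest earlier-occurring prefix, with overlapping copies allowed) is an assumption, since LZ77 has several variants in the literature.

```python
def lz77_parse(s):
    """Greedy LZ77-style parse (assumed variant, naive O(n^2) search).

    Each phrase is either ("lit", c) for a character with no earlier match,
    or ("copy", src, length) for the longest prefix of the remaining text
    that also starts at an earlier 0-based position src (may self-overlap).
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # try every earlier start position as a source
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            phrases.append(("lit", s[i]))
            i += 1
        else:
            phrases.append(("copy", best_src, best_len))
            i += best_len
    return phrases


def decode(phrases):
    """Rebuild the string from its phrases, copying one character at a time
    so that self-overlapping copies work."""
    out = []
    for p in phrases:
        if p[0] == "lit":
            out.append(p[1])
        else:
            _, src, length = p
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)


def extract(s, i, length):
    """Random access query S[i..i + length - 1], with 1-based i as in the abstract."""
    return s[i - 1:i - 1 + length]


def occurrences(s, p):
    """Pattern matching query: 1-based starting positions of P in S (naive scan)."""
    return [i + 1 for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]


if __name__ == "__main__":
    text = "abracadabra"
    parse = lz77_parse(text)
    print("z =", len(parse), "phrases for n =", len(text))
    print(extract(text, 8, 4), occurrences(text, "abra"))
```

On a highly repetitive input such as a collection of similar genomes, z is typically much smaller than n, which is why a space bound expressed in terms of z is attractive. The sketch only illustrates the query semantics; achieving the stated time bounds requires the data structures described in the paper.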

Keywords

Pattern Match; Parse Tree; Primary Substrings; Phrase Boundary; Range Reporting
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Travis Gagie (1)
  • Paweł Gawrychowski (2)
  • Juha Kärkkäinen (1)
  • Yakov Nekrich (3)
  • Simon J. Puglisi (1)

  1. University of Helsinki, Finland
  2. Max Planck Institute, Germany
  3. University of Kansas, United States
