Improved Grammar-Based Compressed Indexes

  • Francisco Claude
  • Gonzalo Navarro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7608)

Abstract

We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u] that is represented by a (context-free) grammar of n (terminal and nonterminal) symbols and size N (measured as the sum of the lengths of the right-hand sides of the rules), a basic grammar-based representation of T takes \(N\lg n\) bits of space. Our representation requires \(2N\lg n + N\lg u + \epsilon\, n\lg n + o(N\lg n)\) bits of space, for any \(0 < \epsilon \le 1\). It can find the positions of the occ occurrences of a pattern of length m in T in \(O\left((m^2/\epsilon)\lg \left(\frac{\lg u}{\lg n}\right) + (m+occ)\lg n\right)\) time, and extract any substring of length ℓ of T in time \(O(\ell+h\lg(N/h))\), where h is the height of the grammar tree.
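To make the quantities n, N and u concrete, here is a minimal Python sketch on a toy grammar. The grammar `RULES`, the helper names, and the 0-based offsets are assumptions made for illustration (the abstract indexes T from 1). The `extract` function is a plain top-down descent that skips subtrees outside the requested range; it is not the paper's data structure and does not achieve the \(O(\ell+h\lg(N/h))\) bound.

```python
# Toy grammar (hypothetical): keys are nonterminals, any symbol that is
# not a key is treated as a terminal. It generates T = "ababcababcab".
RULES = {
    "S": ["B", "B", "A"],
    "B": ["A", "A", "c"],
    "A": ["a", "b"],
}
START = "S"


def grammar_size(rules):
    """N: sum of the lengths of the right-hand sides of the rules."""
    return sum(len(rhs) for rhs in rules.values())


def symbol_count(rules):
    """n: number of distinct terminal and nonterminal symbols."""
    symbols = set(rules)
    for rhs in rules.values():
        symbols.update(rhs)
    return len(symbols)


def expansion_lengths(rules):
    """Map each nonterminal to the length of the string it expands to."""
    lengths = {}

    def length(sym):
        if sym not in rules:          # terminal
            return 1
        if sym not in lengths:
            lengths[sym] = sum(length(s) for s in rules[sym])
        return lengths[sym]

    for sym in rules:
        length(sym)
    return lengths


def extract(rules, lengths, sym, start, count):
    """Return `count` characters of the expansion of `sym`,
    beginning at 0-based offset `start`, without expanding all of T."""
    out = []

    def walk(sym, start, count):
        if sym not in rules:          # terminal: emit it
            out.append(sym)
            return
        pos = 0                       # offset of the current child
        for child in rules[sym]:
            clen = lengths.get(child, 1)
            lo = max(start, pos)
            hi = min(start + count, pos + clen)
            if lo < hi:               # child overlaps the requested range
                walk(child, lo - pos, hi - lo)
            pos += clen
            if pos >= start + count:  # nothing further to the right is needed
                break

    if count > 0:
        walk(sym, start, count)
    return "".join(out)


LENGTHS = expansion_lengths(RULES)
print(grammar_size(RULES), symbol_count(RULES), LENGTHS[START])
print(extract(RULES, LENGTHS, START, 3, 5))
```

For this toy grammar, N = 8, n = 6 and u = 12, so the text is longer than the grammar that generates it; the gap grows with the repetitiveness of the text, which is the setting the index targets.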

Keywords

Binary Relation, Binary Search, Parse Tree, Context Free Grammar, Terminal Symbol
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Francisco Claude (1)
  • Gonzalo Navarro (2)
  1. David R. Cheriton School of Computer Science, University of Waterloo, Canada
  2. Department of Computer Science, University of Chile, Chile
