Skip to main content

Speeding Up q-Gram Mining on Grammar-Based Compressed Texts

  • Conference paper
Combinatorial Pattern Matching (CPM 2012)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 7354))

Included in the following conference series:

Abstract

We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP \(\mathcal{T}\) of size n that represents string T, the algorithm computes the occurrence frequencies of all q-grams in T, by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size \(m = |T|-\mathit{dup}(q,\mathcal{T})\), where \(\mathit{dup}(q,\mathcal{T})\) is a quantity that represents the amount of redundancy that the SLP captures with respect to q-grams. The reduced problem can be solved in linear time. Since m = O(qn), the running time of our algorithm is \(O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})\), improving our previous O(qn) algorithm when q = Ω(|T|/n).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Bender, M.A., Farach-Colton, M.: The level ancestor problem simplified. Theor. Comput. Sci. 321(1), 5–12 (2004)

    Article  MathSciNet  MATH  Google Scholar 

  2. Berkman, O., Vishkin, U.: Finding level-ancestors in trees. J. Comput. System Sci. 48(2), 214–230 (1994)

    Article  MathSciNet  MATH  Google Scholar 

  3. Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundamenta Informaticae 111(3), 313–337 (2011)

    MATH  Google Scholar 

  4. Crochemore, M., Landau, G.M., Ziv-Ukelson, M.: A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput. 32(6), 1654–1673 (2003)

    Article  MathSciNet  MATH  Google Scholar 

  5. Dietz, P.: Finding Level-Ancestors in Dynamic Trees. In: Dehne, F., Sack, J.-R., Santoro, N. (eds.) WADS 1991. LNCS, vol. 519, pp. 32–40. Springer, Heidelberg (1991)

    Chapter  Google Scholar 

  6. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1) (2009)

    Google Scholar 

  7. Gawrychowski, P.: Pattern Matching in Lempel-Ziv Compressed Strings: Fast, Simple, and Deterministic. In: Demetrescu, C., Halldórsson, M.M. (eds.) ESA 2011. LNCS, vol. 6942, pp. 421–432. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  8. Gąsieniec, L., Kolpakov, R., Potapov, I., Sant, P.: Real-time traversal in grammar-based compressed files. In: Proc. DCC 2005, p. 458 (2005)

    Google Scholar 

  9. Goto, K., Bannai, H., Inenaga, S., Takeda, M.: Fast q-gram Mining on SLP Compressed Strings. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 278–289. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  10. Hermelin, D., Landau, G.M., Landau, S., Weimann, O.: A unified algorithm for accelerating edit-distance computation via text-compression. In: Proc. STACS 2009, pp. 529–540 (2009)

    Google Scholar 

  11. Inenaga, S., Bannai, H.: Finding characteristic substrings from compressed texts. International Journal of Foundations of Computer Science 23(2), 261–280 (2012); a preliminary version appeared in PSC 2009

    Article  Google Scholar 

  12. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. Journal of the ACM 53(6), 918–936 (2006)

    Article  MathSciNet  Google Scholar 

  13. Karpinski, M., Rytter, W., Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions. Nordic Journal of Computing 4, 172–186 (1997)

    MathSciNet  MATH  Google Scholar 

  14. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)

    Chapter  Google Scholar 

  15. Kimura, D., Kashima, H.: A linear time subpath kernel for trees. IEICE Technical Report, IBISML2011-85, pp. 291–298 (2011)

    Google Scholar 

  16. Larsson, N.J., Moffat, A.: Offline dictionary-based compression. In: Proc. DCC 1999, pp. 296–305. IEEE Computer Society (1999)

    Google Scholar 

  17. Nevill-Manning, C.G., Witten, I.H., Maulsby, D.L.: Compression by induction of hierarchical grammars. In: Proc. DCC 1994, pp. 244–253 (1994)

    Google Scholar 

  18. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1-3), 211–222 (2003)

    Article  MathSciNet  MATH  Google Scholar 

  19. Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding Up Pattern Matching by Text Compression. In: Bongiovanni, G., Petreschi, R., Gambosi, G. (eds.) CIAC 2000. LNCS, vol. 1767, pp. 306–315. Springer, Heidelberg (2000)

    Chapter  Google Scholar 

  20. Shibuya, T.: Constructing the suffix tree of a tree with a large alphabet. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E86-A(5), 1061–1066 (2003)

    Google Scholar 

  21. Storer, J., Szymanski, T.: Data compression via textual substitution. Journal of the ACM 29(4), 928–951 (1982)

    Article  MathSciNet  MATH  Google Scholar 

  22. Welch, T.A.: A technique for high performance data compression. IEEE Computer 17, 8–19 (1984)

    Article  Google Scholar 

  23. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory IT-23(3), 337–349 (1977)

    Article  MathSciNet  Google Scholar 

  24. Ziv, J., Lempel, A.: Compression of individual sequences via variable-length coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Goto, K., Bannai, H., Inenaga, S., Takeda, M. (2012). Speeding Up q-Gram Mining on Grammar-Based Compressed Texts. In: Kärkkäinen, J., Stoye, J. (eds) Combinatorial Pattern Matching. CPM 2012. Lecture Notes in Computer Science, vol 7354. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31265-6_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-31265-6_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31264-9

  • Online ISBN: 978-3-642-31265-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics