Abstract
We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP \(\mathcal{T}\) of size n that represents string T, the algorithm computes the occurrence frequencies of all q-grams in T, by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size \(m = |T|-\mathit{dup}(q,\mathcal{T})\), where \(\mathit{dup}(q,\mathcal{T})\) is a quantity that represents the amount of redundancy that the SLP captures with respect to q-grams. The reduced problem can be solved in linear time. Since m = O(qn), the running time of our algorithm is \(O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})\), improving our previous O(qn) algorithm when q = Ω(|T|/n).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Bender, M.A., Farach-Colton, M.: The level ancestor problem simplified. Theor. Comput. Sci. 321(1), 5–12 (2004)
Berkman, O., Vishkin, U.: Finding level-ancestors in trees. J. Comput. System Sci. 48(2), 214–230 (1994)
Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundamenta Informaticae 111(3), 313–337 (2011)
Crochemore, M., Landau, G.M., Ziv-Ukelson, M.: A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput. 32(6), 1654–1673 (2003)
Dietz, P.: Finding Level-Ancestors in Dynamic Trees. In: Dehne, F., Sack, J.-R., Santoro, N. (eds.) WADS 1991. LNCS, vol. 519, pp. 32–40. Springer, Heidelberg (1991)
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1) (2009)
Gawrychowski, P.: Pattern Matching in Lempel-Ziv Compressed Strings: Fast, Simple, and Deterministic. In: Demetrescu, C., Halldórsson, M.M. (eds.) ESA 2011. LNCS, vol. 6942, pp. 421–432. Springer, Heidelberg (2011)
Gąsieniec, L., Kolpakov, R., Potapov, I., Sant, P.: Real-time traversal in grammar-based compressed files. In: Proc. DCC 2005, p. 458 (2005)
Goto, K., Bannai, H., Inenaga, S., Takeda, M.: Fast q-gram Mining on SLP Compressed Strings. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 278–289. Springer, Heidelberg (2011)
Hermelin, D., Landau, G.M., Landau, S., Weimann, O.: A unified algorithm for accelerating edit-distance computation via text-compression. In: Proc. STACS 2009, pp. 529–540 (2009)
Inenaga, S., Bannai, H.: Finding characteristic substrings from compressed texts. International Journal of Foundations of Computer Science 23(2), 261–280 (2012); a preliminary version appeared in PSC 2009
Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. Journal of the ACM 53(6), 918–936 (2006)
Karpinski, M., Rytter, W., Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions. Nordic Journal of Computing 4, 172–186 (1997)
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)
Kimura, D., Kashima, H.: A linear time subpath kernel for trees. IEICE Technical Report, IBISML2011-85, pp. 291–298 (2011)
Larsson, N.J., Moffat, A.: Offline dictionary-based compression. In: Proc. DCC 1999, pp. 296–305. IEEE Computer Society (1999)
Nevill-Manning, C.G., Witten, I.H., Maulsby, D.L.: Compression by induction of hierarchical grammars. In: Proc. DCC 1994, pp. 244–253 (1994)
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1-3), 211–222 (2003)
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding Up Pattern Matching by Text Compression. In: Bongiovanni, G., Petreschi, R., Gambosi, G. (eds.) CIAC 2000. LNCS, vol. 1767, pp. 306–315. Springer, Heidelberg (2000)
Shibuya, T.: Constructing the suffix tree of a tree with a large alphabet. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E86-A(5), 1061–1066 (2003)
Storer, J., Szymanski, T.: Data compression via textual substitution. Journal of the ACM 29(4), 928–951 (1982)
Welch, T.A.: A technique for high performance data compression. IEEE Computer 17, 8–19 (1984)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory IT-23(3), 337–349 (1977)
Ziv, J., Lempel, A.: Compression of individual sequences via variable-length coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Goto, K., Bannai, H., Inenaga, S., Takeda, M. (2012). Speeding Up q-Gram Mining on Grammar-Based Compressed Texts. In: Kärkkäinen, J., Stoye, J. (eds) Combinatorial Pattern Matching. CPM 2012. Lecture Notes in Computer Science, vol 7354. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31265-6_18
Download citation
DOI: https://doi.org/10.1007/978-3-642-31265-6_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31264-9
Online ISBN: 978-3-642-31265-6
eBook Packages: Computer ScienceComputer Science (R0)