Speeding Up q-Gram Mining on Grammar-Based Compressed Texts

Goto, Keisuke; Bannai, Hideo; Inenaga, Shunsuke; Takeda, Masayuki

doi:10.1007/978-3-642-31265-6_18

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7354)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP \(\mathcal{T}\) of size n that represents string T, the algorithm computes the occurrence frequencies of all q-grams in T, by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size \(m = |T|-\mathit{dup}(q,\mathcal{T})\), where \(\mathit{dup}(q,\mathcal{T})\) is a quantity that represents the amount of redundancy that the SLP captures with respect to q-grams. The reduced problem can be solved in linear time. Since m = O(qn), the running time of our algorithm is \(O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})\), improving our previous O(qn) algorithm when q = Ω(|T|/n).
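
To make the problem setting concrete, the following is a minimal Python sketch of q-gram counting directly on an SLP, without decompressing T. It is not the algorithm of this paper: it follows the general idea of the earlier O(qn) approach as commonly described (each occurrence of a q-gram is charged to the one variable whose two children it straddles, weighted by how often that variable appears in the derivation tree), and it uses a hash table instead of the suffix-array machinery a linear-time implementation would need. The rule encoding (`rules` as a list of single characters or index pairs) and the function name `slp_qgram_frequencies` are assumptions made for this illustration.

```python
from collections import Counter


def slp_qgram_frequencies(rules, q):
    """Count all q-grams of the string derived by the last rule of an SLP.

    Assumed encoding (hypothetical, for this sketch only): rules[i] is either
    a single character (X_i -> a) or a pair (l, r) with l, r < i
    (X_i -> X_l X_r). The list must be non-empty.
    """
    if q < 1:
        raise ValueError("q must be at least 1")
    n, k = len(rules), q - 1

    # pre[i] / suf[i]: prefix / suffix of length min(q-1, |X_i|) of X_i.
    pre = [""] * n
    suf = [""] * n
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i] = suf[i] = rule if k else ""
        else:
            l, r = rule
            pre[i] = (pre[l] + pre[r])[:k]
            suf[i] = (suf[l] + suf[r])[-k:] if k else ""

    # occ[i]: number of nodes labelled X_i in the derivation tree of T.
    occ = [0] * n
    occ[-1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            l, r = rules[i]
            occ[l] += occ[i]
            occ[r] += occ[i]

    counts = Counter()
    for i, rule in enumerate(rules):
        if occ[i] == 0:
            continue  # variable not reachable from the start symbol
        if isinstance(rule, str):
            if q == 1:
                counts[rule] += occ[i]  # 1-grams live at the leaves
            continue
        l, r = rule
        # Every length-q window of t straddles the X_l / X_r boundary,
        # because each side contributes at most q-1 characters.
        t = suf[l] + pre[r]
        for j in range(len(t) - q + 1):
            counts[t[j:j + q]] += occ[i]
    return dict(counts)


# SLP for T = "abababab": X0=a, X1=b, X2=X0X1, X3=X2X2, X4=X3X3
rules = ["a", "b", (0, 1), (2, 2), (3, 3)]
print(slp_qgram_frequencies(rules, 2))  # {'ab': 4, 'ba': 3}
```

Each variable contributes a string of length at most 2(q-1), so the total amount of text scanned is O(qn). This is the quantity that can exceed |T| for large q, which is the regime the abstract's \(|T|-\mathit{dup}(q,\mathcal{T})\) bound targets.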

Author information

Authors and Affiliations

Department of Informatics, Kyushu University, Japan
Keisuke Goto, Hideo Bannai, Shunsuke Inenaga & Masayuki Takeda

Editor information

Editors and Affiliations

Department of Computer Science, University of Helsinki, Gustaf Hällström Katu 2b, P.O. Box 68, 00014, Helsinki, Finland
Juha Kärkkäinen
Faculty of Technology, University of Bielefeld, Universitätsstraße 25, 33615, Bielefeld, Germany
Jens Stoye

Cite this paper

Goto, K., Bannai, H., Inenaga, S., Takeda, M. (2012). Speeding Up q-Gram Mining on Grammar-Based Compressed Texts. In: Kärkkäinen, J., Stoye, J. (eds) Combinatorial Pattern Matching. CPM 2012. Lecture Notes in Computer Science, vol 7354. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31265-6_18

DOI: https://doi.org/10.1007/978-3-642-31265-6_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31264-9
Online ISBN: 978-3-642-31265-6
eBook Packages: Computer Science, Computer Science (R0)