Algorithmica, Volume 72, Issue 2, pp 515–538

Compressing Dictionary Matching Index via Sparsification Technique

  • Wing-Kai Hon
  • Tsung-Han Ku
  • Tak-Wah Lam
  • Rahul Shah
  • Siu-Lung Tam
  • Sharma V. Thankachan
  • Jeffrey Scott Vitter

Abstract

Given a set \(\mathcal{D}\) of patterns of total length n, the dictionary matching problem is to index \(\mathcal{D}\) such that for any query text T, we can efficiently locate all occurrences of the patterns within T. This problem can be solved in optimal O(|T|+occ) time by the classical AC automaton (Aho and Corasick in Commun. ACM 18(6):333–340, 1975), where occ denotes the number of occurrences. Its space requirement, however, is O(n) words, which is still far from optimal. In this paper, we show that in many cases the sparsification technique can be applied to improve the space requirements of indexes for dictionary matching and its related problems. First, we give a compressed index for dictionary matching and show that such an index can be generalized to handle dynamic updates of \(\mathcal{D}\). We also give a compressed index for approximate dictionary matching with one error. In each case, the query time is slowed down by only a polylogarithmic factor compared with that achieved by the best O(n)-word counterparts.
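
To make the problem statement concrete, the following is a minimal Python sketch of the classical AC automaton that the abstract uses as its baseline. It is not the compressed index proposed in the paper, and it makes no attempt at the space bounds discussed above. The function names `build_automaton` and `find_occurrences` are illustrative choices, not taken from the source.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie edges, failure links and output lists (illustrative names)."""
    goto = [{}]   # goto[s][c] -> child state reached from s on character c
    fail = [0]    # fail[s] -> state for the longest proper suffix of s in the trie
    out = [[]]    # out[s] -> indices of patterns that end at state s
    for idx, pat in enumerate(patterns):
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(idx)

    # Breadth-first traversal: a state's failure link depends only on
    # shallower states, which have already been processed.
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Copying output lists keeps the sketch short; it is not
            # space-efficient and is an assumption of this example only.
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def find_occurrences(patterns, text):
    """Return (start_position, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    s = 0
    hits = []
    for i, ch in enumerate(text):
        # Follow failure links until the current character can be consumed.
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for idx in out[s]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits
```

Calling `find_occurrences(["he", "she", "his", "hers"], "ushers")` scans the text once and reports each pattern occurrence together with its starting position, which is the query that the indexes in this paper answer in less space.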

Keywords

  • Data compression
  • Dictionary matching
  • Text indexing
  • Sparsification technique

References

  1. Aho, A., Corasick, M.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6), 333–340 (1975)
  2. Alstrup, S., Husfeldt, T., Rauhe, T.: Marked ancestor problems. In: Proceedings of IEEE Symposium on Foundations of Computer Science (FOCS’98), pp. 534–544 (1998)
  3. Amir, A., Farach, M.: Adaptive dictionary matching. In: Proceedings of IEEE Symposium on Foundations of Computer Science (FOCS’91), pp. 760–766 (1991)
  4. Amir, A., Farach, M., Galil, Z., Giancarlo, R., Park, K.: Dynamic dictionary matching. J. Comput. Syst. Sci. 49(2), 208–222 (1994)
  5. Amir, A., Farach, M., Idury, R., Poutre, A.L., Schaffer, A.: Improved dynamic dictionary matching. Inf. Comput. 119(2), 258–282 (1995)
  6. Amir, A., Keselman, D., Landau, G.M., Lewenstein, M., Lewenstein, N., Rodeh, M.: Text indexing and dictionary matching with one error. J. Algorithms 37(2), 309–325 (2000)
  7. Arge, L., Vitter, J.S.: Optimal external memory interval management. SIAM J. Comput. 32(6), 1488–1508 (2003)
  8. Belazzougui, D.: Succinct dictionary matching with no slowdown. In: Proceedings of Symposium on Combinatorial Pattern Matching (CPM’10), pp. 88–100 (2010)
  9. Bender, M.A., Cole, R., Demaine, E.D., Farach-Colton, M., Zito, J.: Two simplified algorithms for maintaining order in a list. In: Proceedings of European Symposium on Algorithms (ESA’02), pp. 152–164 (2002)
  10. Bender, M.A., Farach-Colton, M., Pemmasani, G., Skiena, S., Sumazin, P.: Lowest common ancestors in trees and directed acyclic graphs. J. Algorithms 57(2), 75–94 (2005)
  11. Chan, H.L., Hon, W.K., Lam, T.W., Sadakane, K.: Compressed indexes for dynamic text collections. ACM Trans. Algorithms 3(2) (2007)
  12. Chien, Y.F., Hon, W.K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler transform: linking range searching and text indexing. In: Proceedings of IEEE Data Compression Conference (DCC’08), pp. 252–261 (2008)
  13. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of ACM Symposium on Theory of Computing (STOC’04), pp. 91–100 (2004)
  14. Dietz, P.F., Sleator, D.D.: Two algorithms for maintaining order in a list. In: Proceedings of ACM Symposium on Theory of Computing (STOC’87), pp. 365–372 (1987)
  15. Ferragina, P., Grossi, R.: The string B-tree: a new data structure for string search in external memory and its applications. J. ACM 46(2), 236–280 (1999)
  16. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
  17. Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 372(1), 115–121 (2007)
  18. Ferragina, P., Muthukrishnan, S., de Berg, M.: Multi-method dispatching: a geometric approach with applications to string matching problems. In: Proceedings of ACM Symposium on Theory of Computing (STOC’99), pp. 483–491 (1999)
  19. Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)
  20. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)
  21. Hagerup, T., Miltersen, P.B., Pagh, R.: Deterministic dictionaries. J. Algorithms 41(1), 69–85 (2001)
  22. Hon, W.K., Lam, T.W., Shah, R., Tam, S.L., Vitter, J.S.: Compressed index for dictionary matching. In: Proceedings of IEEE Data Compression Conference (DCC’08), pp. 23–32 (2008)
  23. Hon, W.K., Shah, R., Thankachan, S.V., Vitter, J.S.: On entropy-compressed text indexing in external memory. In: Proceedings of International Symposium on String Processing and Information Retrieval (SPIRE’09), pp. 75–89 (2009)
  24. Hon, W.K., Ku, T.H., Shah, R., Thankachan, S.V., Vitter, J.S.: Faster compressed dictionary matching. In: Proceedings of International Symposium on String Processing and Information Retrieval (SPIRE’10), pp. 191–200 (2010)
  25. Kärkkäinen, J., Ukkonen, E.: Sparse suffix trees. In: Proceedings of International Conference on Computing and Combinatorics (COCOON’96), pp. 219–230 (1996)
  26. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
  27. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
  28. McCreight, E.M.: Priority search trees. SIAM J. Comput. 14(2), 257–276 (1985)
  29. Overmars, M.H.: Efficient data structures for range searching on a grid. J. Algorithms 9(2), 254–275 (1988)
  30. Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)
  31. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of Symposium on Switching and Automata Theory, pp. 1–11 (1973)
  32. Willard, D.E.: Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett. 17(2), 81–84 (1983)

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Wing-Kai Hon (1)
  • Tsung-Han Ku (1)
  • Tak-Wah Lam (2)
  • Rahul Shah (3)
  • Siu-Lung Tam (2)
  • Sharma V. Thankachan (3)
  • Jeffrey Scott Vitter (4)

  1. National Tsing Hua University, Hsinchu City, Taiwan, Republic of China
  2. The University of Hong Kong, Hong Kong, Hong Kong SAR
  3. Louisiana State University, Baton Rouge, USA
  4. The University of Kansas, Lawrence, USA
