Strategic Pattern Search in Factor-Compressed Text

  • Simon Gog
  • Alistair Moffat
  • Matthias Petri
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8799)

Abstract

We consider the problem of pattern search in compressed text in a context in which: (a) the text is stored as a sequence of factors against a static phrase-book; (b) decoding of factors is from right to left; and (c) extraction of each symbol in each factor requires Θ(log σ) time, where σ is the size of the original alphabet. To determine possible alignments given information about decoded characters, we introduce two Boyer-Moore-like searching mechanisms, including one that makes use of a suffix array constructed over the pattern. The new mechanisms decode fewer than half the symbols that are required by a sequential left-to-right search such as the Knuth-Morris-Pratt approach, a saving that translates directly into improved execution time. Experiments with a two-level suffix array index structure for 4 GB of English text demonstrate the usefulness of the new techniques.
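To illustrate why inspecting symbols from right to left can reduce the number of symbols that must be decoded, the following is a minimal sketch of the classic Boyer-Moore-Horspool search. It is not either of the two mechanisms introduced in the paper, and it works on an uncompressed string rather than on factors. The function name `horspool_search` and the set `decoded`, which stands in for "symbols that would have to be extracted", are assumptions made for this illustration only.

```python
def horspool_search(text, pattern):
    """Return (match positions, number of distinct text positions inspected).

    Illustrative Boyer-Moore-Horspool search; the inspected-position count
    is a stand-in for the number of symbols a decoder would have to extract.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], 0

    # Shift table: for each symbol in pattern[:-1], the distance from its
    # last occurrence to the final pattern position.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}

    matches = []
    decoded = set()  # text positions that were actually looked at
    pos = 0          # text position aligned with pattern[0]
    while pos <= n - m:
        # Compare right to left within the current alignment.
        j = m - 1
        while j >= 0:
            decoded.add(pos + j)
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            matches.append(pos)
        # Shift according to the text symbol under the last pattern position;
        # symbols absent from the table allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return matches, len(decoded)


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog"
    matches, decoded = horspool_search(text, "lazy")
    print(matches, decoded, len(text))
```

A left-to-right scan such as Knuth-Morris-Pratt looks at every text symbol up to the end of the last match, whereas here many alignments are rejected after a single inspection, so large parts of the text are never touched. How much is saved depends on the pattern length and the alphabet.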

Keywords

string search; pattern matching; suffix array; Burrows-Wheeler transform; succinct data structure; disk-based algorithm; experimental evaluation

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Simon Gog (1, 2)
  • Alistair Moffat (1)
  • Matthias Petri (1)
  1. Department of Computing and Information Systems, The University of Melbourne, Australia
  2. Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Germany
