A bit-parallel approach to suffix automata: Fast extended string matching

  • Gonzalo Navarro
  • Mathieu Raffinot
Session I
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1448)

Abstract

We present a new algorithm for string matching. The algorithm, called BNDM, is the bit-parallel simulation of a known (but recent) algorithm called BDM. BDM skips characters using a “suffix automaton” which is made deterministic in the preprocessing. BNDM instead simulates the nondeterministic version using bit-parallelism. This algorithm is 20%–25% faster than BDM, 2–3 times faster than other bit-parallel algorithms, and 10%–40% faster than the entire Boyer-Moore family. This makes it the fastest algorithm in all cases except for very short or very long patterns (e.g. on English text it is the fastest for patterns between 5 and 110 characters long). Moreover, the algorithm is very simple, which makes it easy to implement other variants of BDM that are extremely complex in their original formulation. We show that, like other bit-parallel algorithms, BNDM can be extended to handle classes of characters in the pattern and in the text, to search for multiple patterns, and to allow errors in the pattern or in the text, combining simplicity, efficiency and flexibility. We also generalize the suffix automaton definition to handle classes of characters. To the best of our knowledge, this extension has not been studied before.
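
The abstract does not spell out the algorithm, so the following is only a minimal Python sketch of BNDM as it is commonly described in the string-matching literature, not the authors' code. The function name `bndm`, the table name `B`, and the convention that each pattern position is given as an iterable of accepted characters (so a plain string is a pattern of single characters and a list such as `["a", "bc", "d"]` uses a class at position 2) are assumptions made for illustration. Python integers are unbounded, so the sketch masks the state explicitly; a real implementation would keep the state in one machine word, which limits the pattern length to the word size.

```python
def bndm(pattern, text):
    """Return the start positions of all occurrences of pattern in text.

    pattern: a str, or a sequence whose items are iterables of accepted
             characters (an assumed encoding of character classes).
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    mask = (1 << m) - 1
    top = 1 << (m - 1)

    # B[c] has bit (m - 1 - i) set when position i of the pattern accepts c.
    B = {}
    for i, cls in enumerate(pattern):
        for c in cls:
            B[c] = B.get(c, 0) | (1 << (m - 1 - i))

    matches = []
    pos = 0
    while pos <= n - m:
        j = m        # number of window characters still unread
        last = m     # shift to apply once the window is abandoned
        D = mask     # every state of the nondeterministic automaton is active
        while D:
            # Read the window right to left; D tracks which pattern
            # factors are still compatible with the suffix read so far.
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & top:
                if j > 0:
                    last = j             # a pattern prefix starts at pos + j
                else:
                    matches.append(pos)  # the whole window matched
                    break
            D = (D << 1) & mask
        pos += last
    return matches


print(bndm("abra", "abracadabra"))
print(bndm(["a", "bc", "d"], "abd acd axd"))
```

The window is shifted by `last`, the position of the longest pattern prefix recognized inside it, which is what lets the search skip characters when the automaton state `D` dies early.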

Keywords

String Match, Suffix Tree, Random Text, Extended Pattern, Deterministic Automaton
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Gonzalo Navarro (1)
  • Mathieu Raffinot (2)
  1. Dept. of Computer Science, University of Chile, Santiago, Chile
  2. Institut Gaspard Monge, Cité Descartes, Champs-sur-Marne, Marne-la-Vallée Cedex 2, France