Optimal Exact String Matching Based on Suffix Arrays
Using the suffix tree of a string S, decision queries of the type “Is P a substring of S?” can be answered in O(|P|) time, and enumeration queries of the type “Where are all z occurrences of P in S?” can be answered in O(|P|+z) time, independent of the size of S. However, in large-scale applications such as genome analysis, the space requirements of the suffix tree are a severe drawback. The suffix array is a more space-economical index structure. Using it and an additional table, Manber and Myers (1993) showed that decision queries and enumeration queries can be answered in O(|P|+log|S|) and O(|P|+log|S|+z) time, respectively, but no optimal-time algorithms are known. In this paper, we show how to achieve the optimal O(|P|) and O(|P|+z) time bounds for the suffix array. Our approach is not confined to exact pattern matching. In fact, it can be used to efficiently solve all problems that are usually solved by a top-down traversal of the suffix tree. Experiments show that our method is not only of theoretical interest but also of practical relevance.
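To make the two query types concrete, the following is a minimal Python sketch of decision and enumeration queries on a plain suffix array. It is not the optimal method of this paper, nor the Manber and Myers algorithm with its additional table: it uses a naive construction and an ordinary binary search costing O(|P| log|S|) per query. The function names `build_suffix_array`, `find_interval`, `contains`, and `occurrences` are hypothetical and chosen only for illustration.

```python
def build_suffix_array(s):
    # Naive construction by sorting all suffixes; for illustration only.
    # Practical indexes use much faster suffix sorting algorithms.
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_interval(s, sa, p):
    # Return (left, right) such that sa[left:right] holds exactly the
    # start positions of suffixes having p as a prefix.
    m = len(p)

    # First suffix whose length-m prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # First suffix whose length-m prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


def contains(s, sa, p):
    # Decision query: "Is p a substring of s?"
    left, right = find_interval(s, sa, p)
    return left < right


def occurrences(s, sa, p):
    # Enumeration query: all start positions of p in s, in text order.
    left, right = find_interval(s, sa, p)
    return sorted(sa[left:right])


s = "banana"
sa = build_suffix_array(s)       # [5, 3, 1, 0, 4, 2]
print(contains(s, sa, "nan"))    # True
print(contains(s, sa, "nab"))    # False
print(occurrences(s, sa, "ana")) # [1, 3]
```

Both searches rely on the fact that truncating lexicographically sorted suffixes to their first |P| characters keeps them in non-decreasing order, so all matches form one contiguous interval of the suffix array. The log|S| factor in this sketch is the cost that the additional table of Manber and Myers moves out of the product, and that the approach described above removes entirely.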
Keywords: Space Requirement, Suffix Tree, Input String, Alphabet Size, Additional Table