Text Indexing with Errors

  • Moritz G. Maaß
  • Johannes Nowak
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3537)

Abstract

In this paper we address the problem of constructing an index for a text document or a collection of documents to answer various questions about the occurrences of a pattern when allowing a constant number of errors. In particular, our index can be built to report all occurrences, all positions, or all documents where a pattern occurs in time linear in the size of the query string and the number of results. This improves over previous work where the lookup time is not linear or depends upon the size of the document corpus. Our data structure has size \(O(n \log^k n)\) on average and with high probability for input size \(n\) and queries with up to \(k\) errors. Additionally, we present a trade-off between query time and index complexity that achieves worst-case bounded index size and preprocessing time with linear lookup time on average.
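
To make the query semantics concrete, here is a minimal brute-force sketch of the problem the index answers. It is not the data structure from the paper: it scans every document on every query, so its running time depends on the corpus size, which is exactly what the index avoids. It also assumes, for simplicity, that an "error" is a single-character mismatch (Hamming distance); the function names `hamming_within`, `occurrences`, and `matching_documents` are hypothetical and chosen only for this illustration.

```python
from typing import List, Set, Tuple


def hamming_within(a: str, b: str, k: int) -> bool:
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False  # stop early once the error budget is exceeded
    return True


def occurrences(docs: List[str], pattern: str, k: int) -> List[Tuple[int, int]]:
    """Report all (document id, start position) pairs where the pattern
    matches a substring of the document with at most k mismatches.

    Assumption: errors are mismatches only (no insertions or deletions).
    """
    m = len(pattern)
    hits = []
    for d, text in enumerate(docs):
        # Slide a window of length m over the document.
        for i in range(len(text) - m + 1):
            if hamming_within(text[i:i + m], pattern, k):
                hits.append((d, i))
    return hits


def matching_documents(docs: List[str], pattern: str, k: int) -> Set[int]:
    """Report only the ids of documents containing at least one match."""
    return {d for d, _ in occurrences(docs, pattern, k)}


if __name__ == "__main__":
    corpus = ["banana", "bandana", "cabana"]
    print(occurrences(corpus, "bana", 1))
    print(matching_documents(corpus, "bana", 0))
```

The two reporting functions correspond to two of the query types named in the abstract: listing every match position and listing only the documents that contain a match. The scan costs time proportional to the total corpus length times the pattern length, whereas the abstract states lookup time linear in the pattern length and the number of results.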

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Moritz G. Maaß (1)
  • Johannes Nowak (1)
  1. Fakultät für Informatik, Technische Universität München, Garching, Germany