Forty Years of Text Indexing

  • Alberto Apostolico
  • Maxime Crochemore
  • Martin Farach-Colton
  • Zvi Galil
  • S. Muthukrishnan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7922)

Abstract

This paper reviews the first 40 years in the life of textual inverted indexes, their many incarnations, and their applications. The paper is non-technical and assumes some familiarity with the structures and constructions discussed. It is not meant to be exhaustive. It is meant to be a tribute to a ubiquitous tool of string matching — the suffix tree and its variants — and one of the most persistent subjects of study in the theory of algorithms.

Keywords

pattern matching; string searching; bi-tree; suffix tree; DAWG; suffix automaton; factor automaton; suffix array; FM-index; wavelet tree

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Alberto Apostolico (1)
  • Maxime Crochemore (2, 3)
  • Martin Farach-Colton (4)
  • Zvi Galil (1)
  • S. Muthukrishnan (4)

  1. College of Computing, Georgia Institute of Technology, Atlanta, USA
  2. King’s College London, London, UK
  3. Institut Gaspard-Monge, Université Paris-Est, Marne-la-Vallée Cedex 2, France
  4. Department of Computer Science, Rutgers University, Piscataway, USA
