Suffix Arrays on Words

  • Paolo Ferragina
  • Johannes Fischer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4580)

Abstract

Surprisingly enough, it is not yet known how to directly build a suffix array that indexes just the k positions at word boundaries of a text T[1,n], taking O(n) time and O(k) space in addition to T. We propose a class-note solution to this problem that achieves these optimal time and space bounds. Word-based versions of indexes achieving the same time/space bounds were already known for suffix trees [1,2] and (compact) DAWGs [3,4]. Compared with these other word indexes, our solution inherits the simplicity and efficiency of suffix arrays, and we therefore foresee applications in word-based approaches to data compression [5] and computational linguistics [6]. To support this, we have run a large set of experiments showing that word-based suffix arrays may be constructed twice as fast as their full-text counterparts, and with a working space as low as 20%. The smaller size of the final word-based suffix array also improves query time (i.e., fewer random-access binary-search steps), making queries faster by a factor of up to 3.
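
To make the indexed object concrete, here is a minimal sketch of a word-based suffix array and of a binary-search query over it. It is not the construction proposed in the paper: it sorts the word-start suffixes with a plain comparison sort, so it does not meet the O(n) time bound. The function names, the 0-based positions, and the definition of a word as a maximal run of non-whitespace characters are assumptions made for illustration only.

```python
import re


def word_suffix_array(text):
    """Return the word-start positions of `text`, sorted by the suffix starting there."""
    # Assumption: a word is a maximal run of non-whitespace characters.
    starts = [m.start() for m in re.finditer(r"\S+", text)]
    # Naive comparison sort on whole suffixes: simple, but not linear time.
    return sorted(starts, key=lambda i: text[i:])


def find_occurrences(text, wsa, pattern):
    """Return the word-aligned positions where `pattern` occurs, in text order."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(wsa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[wsa[mid]:wsa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(wsa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[wsa[mid]:wsa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(wsa[first:lo])


text = "to be or not to be"
wsa = word_suffix_array(text)
print(wsa)
print(find_occurrences(text, wsa, "to be"))
```

The array holds one entry per word (k entries) rather than one per character (n entries), which is why each query needs fewer binary-search steps than on a full-text suffix array. Occurrences that start inside a word are deliberately not reported.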

Keywords

Query Time, Construction Time, Word Boundary, Pattern Length, Suffix Array
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Andersson, A., Larsson, N.J., Swanson, K.: Suffix Trees on Words. Algorithmica 23(3), 246–260 (1999)
  2. Inenaga, S., Takeda, M.: On-Line Linear-Time Construction of Word Suffix Trees. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 60–71. Springer, Heidelberg (2006)
  3. Inenaga, S., Takeda, M.: Sparse Directed Acyclic Word Graphs. In: Crestani, F., Ferragina, P., Sanderson, M. (eds.) SPIRE 2006. LNCS, vol. 4209, pp. 61–73. Springer, Heidelberg (2006)
  4. Inenaga, S., Takeda, M.: Sparse compact directed acyclic word graphs. In: Stringology, pp. 197–211 (2006)
  5. Yugo, R., Isal, K., Moffat, A.: Word-based block-sorting text compression. In: Australasian Conference on Computer Science, pp. 92–99. IEEE Press, New York (2001)
  6. Yamamoto, M., Church, K.W.: Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics 27(1), 1–30 (2001)
  7. Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge (1997)
  8. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys (to appear). Preliminary version available at http://www.dcc.uchile.cl/~gnavarro/ps/acmcs06.ps.gz
  9. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, San Francisco (1999)
  10. Zobel, J., Moffat, A., Ramamohanarao, K.: Guidelines for Presentation and Comparison of Indexing Techniques. SIGMOD Record 25(3), 10–15 (1996)
  11. Hon, W.K., Sadakane, K., Sung, W.K.: Breaking a Time-and-Space Barrier in Constructing Full-Text Indices. In: Proc. FOCS, pp. 251–260. IEEE Computer Society, Los Alamitos (2003)
  12. Ukkonen, E.: On-line Construction of Suffix Trees. Algorithmica 14(3), 249–260 (1995)
  13. Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T., Seiferas, J.I.: The Smallest Automaton Recognizing the Subwords of a Text. Theor. Comput. Sci. 40, 31–55 (1985)
  14. Inenaga, S., Hoshino, H., Shinohara, A., Takeda, M., Arikawa, S., Mauri, G., Pavesi, G.: On-line construction of compact directed acyclic word graphs. Discrete Applied Mathematics 146(2), 156–179 (2005)
  15. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear Work Suffix Array Construction. J. ACM 53(6), 1–19 (2006)
  16. Inenaga, S.: personal communication (December 2006)
  17. Kärkkäinen, J., Ukkonen, E.: Sparse Suffix Trees. In: Cai, J.-Y., Wong, C.K. (eds.) COCOON 1996. LNCS, vol. 1090, pp. 219–230. Springer, Heidelberg (1996)
  18. Ferragina, P., Venturini, R.: A Simple Storage Scheme for Strings Achieving Entropy Bounds. Theoretical Computer Science 372(1), 115–121 (2007)
  19. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing Suffix Trees with Enhanced Suffix Arrays. J. Discrete Algorithms 2(1), 53–86 (2004)
  20. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)
  21. Aluru, S.: Handbook of Computational Molecular Biology. Chapman & Hall/CRC, Sydney, Australia (2006)
  22. Manber, U., Myers, E.W.: Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput. 22(5), 935–948 (1993)
  23. Fischer, J., Heun, V.: A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In: Proc. ESCAPE. LNCS (to appear, 2007)
  24. Larsson, N.J., Sadakane, K.: Faster suffix sorting. Technical Report LU-CS-TR:99-214, LUNDFD6/(NFCS-3140)/1–20/(1999), Department of Computer Science, Lund University, Sweden (May 1999)
  25. Alstrup, S., Gavoille, C., Kaplan, H., Rauhe, T.: Nearest Common Ancestors: A Survey and a New Distributed Algorithm. In: Proc. SPAA, pp. 258–264. ACM Press, New York (2002)
  26. Ferragina, P., Navarro, G.: The Pizza & Chili Corpus. Available at http://pizzachili.di.unipi.it, http://pizzachili.dcc.uchile.cl
  27. Università degli Studi di Milano, Laboratory for Web Algorithmics: URLs from the .eu domain. Available at http://law.dsi.unimi.it/index.php
  28. Maniscalco, M.A., Puglisi, S.J.: An efficient, versatile approach to suffix sorting. ACM Journal of Experimental Algorithmics (to appear). Available at http://www.michael-maniscalco.com/msufsort.htm
  29. Manzini, G., Ferragina, P.: Engineering a lightweight suffix array construction algorithm. Algorithmica 40(1), 33–50 (2004). Available at http://www.mfn.unipmn.it/~manzini/lightweight

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Paolo Ferragina (1)
  • Johannes Fischer (2)

  1. Dipartimento di Informatica, University of Pisa
  2. Institut für Informatik, Ludwig-Maximilians-Universität München