ESA 1999: Algorithms - ESA’ 99, pp. 224-235

On Constructing Suffix Arrays in External Memory

  • Andreas Crauser
  • Paolo Ferragina
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1643)

Abstract

The construction of full-text indexes on very large text collections is currently an active problem. The suffix array [16] is one of the most attractive full-text indexing data structures because of its simplicity, its space efficiency, and the powerful, fast search operations it supports. In this paper we analyze, both theoretically and experimentally, the I/O complexity and the working space of six algorithms for constructing large suffix arrays. Additionally, we design a new external-memory algorithm that follows the basic philosophy underlying the algorithm in [13] but in a significantly different manner, thus combining its good practical qualities with efficient worst-case performance. To the best of our knowledge, this is the first study that provides a wide spectrum of possible approaches to the construction of suffix arrays in external memory, and thus it should be helpful to anyone interested in building full-text indexes on very large text collections.

Keywords

External Memory; Construction Algorithm; Internal Memory; Suffix Array; Text Collection
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. L. Arge, P. Ferragina, R. Grossi and J. S. Vitter. On sorting Strings in External Memory. In ACM Symp. on Theory of Computing, pp. 540–548, 1997.
  2. R. Ahuja, K. Mehlhorn, J. B. Orlin and R. E. Tarjan. Faster Algorithms for the Shortest Path Problem. Journal of the ACM (2), pp. 213–223, 1990.
  3. A. Andersson and S. Nilsson. Efficient implementation of Suffix Trees. Software Practice and Experience, 2(25): 129–141, 1995.
  4. S. Burkhard, A. Crauser, P. Ferragina, H. Lenhof, E. Rivals and M. Vingron. q-gram based database searching using a suffix array (QUASAR). International Conference on Computational Molecular Biology, 1999.
  5. D. R. Clark and J. I. Munro. Efficient Suffix Trees on Secondary Storage. In ACM-SIAM Symp. on Discrete Algorithms, pp. 383–391, 1996.
  6. A. Crauser, P. Ferragina and U. Meyer. Practical and Efficient Priority Queues for External Memory. Technical Report MPI, see WEB pages of the authors.
  7. A. Crauser and K. Mehlhorn. LEDA-SM: A Library Prototype for Computing in Secondary Memory. Technical Report MPI, see WEB pages of the authors.
  8. C. Faloutsos. Access Methods for text. ACM Computing Surveys, 17, pp. 49–74, March 1985.
  9. M. Farach. Optimal suffix tree construction with large alphabets. In IEEE Foundations of Computer Science, pp. 137–143, 1997.
  10. M. Farach, P. Ferragina and S. Muthukrishnan. Overcoming the Memory Bottleneck in Suffix Tree Construction. In IEEE Foundations of Computer Science, 1998.
  11. C. L. Feng. PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval. ACM SIGIR, pp. 50–58, 1997.
  12. P. Ferragina and R. Grossi. A Fully-Dynamic Data Structure for External Substring Search. In ACM Symp. Theory of Computing, pp. 693–702, 1995. Also Journal of the ACM (to appear).
  13. G. H. Gonnet, R. A. Baeza-Yates and T. Snider. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Editors, pp. 66–82, Prentice-Hall, 1992.
  14. D. E. Knuth. The Art of Computer Programming: Sorting and Searching. Vol. 3, Addison-Wesley Publishing Co., 1973.
  15. S. Kurtz. Reducing the Space Requirement of Suffix Trees. Technical Report 98-03, University of Bielefeld, 1998.
  16. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal of Computing 22, 5, pp. 935–948, 1993.
  17. E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM 23, 2, pp. 262–272, 1976.
  18. G. Navarro, J. P. Kitajima, B. A. Ribeiro-Neto and N. Ziviani. Distributed Generation of Suffix Arrays. In Combinatorial Pattern Matching Conference, pp. 103–115, 1997.
  19. S. Näher and K. Mehlhorn. LEDA: A Platform for Combinatorial and Geometric Computing. Communications of the ACM (38), 1995.
  20. C. Ruemmler and J. Wilkes. An introduction to disk drive modeling. IEEE Computer, 27(3):17–29, 1994.
  21. E. A. Shriver and J. S. Vitter. Algorithms for parallel memory I: two-level memories. Algorithmica, 12(2-3), pp. 110–147, 1994.
  22. D. E. Vengroff and J. S. Vitter. I/O-efficient scientific computing using TPIE. In IEEE Symposium on Parallel and Distributed Computing, 1995.
  23. J. Vitter. External memory algorithms. Invited Tutorial in 17th Ann. ACM Symp. on Principles of Database Systems (PODS’ 98), 1998. Also Invited Paper in European Symposium on Algorithms (ESA’ 98), 1998.
  24. J. Zobel, A. Moffat and K. Ramamohanarao. Guidelines for presentation and comparison of indexing techniques. SIGMOD Record 25, 3:10–15, 1996.

Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Andreas Crauser (1)
  • Paolo Ferragina (2)
  1. Max-Planck-Institut für Informatik, Saarbrücken, Germany
  2. Dipartimento di Informatica, Università di Pisa, Italy
