ESA 1999: Algorithms - ESA’ 99 pp 224-235
On Constructing Suffix Arrays in External Memory
Abstract
The construction of full-text indexes on very large text collections is nowadays a hot problem. The suffix array [16] is one of the most attractive full-text indexing data structures due to its simplicity, space efficiency, and the powerful and fast search operations it supports. In this paper we analyze, both theoretically and experimentally, the I/O complexity and the working space of six algorithms for constructing large suffix arrays. Additionally, we design a new external-memory algorithm that follows the basic philosophy underlying the algorithm in [13] but in a significantly different manner, thus combining its good practical qualities with efficient worst-case performance. To the best of our knowledge, this is the first study that provides a wide spectrum of possible approaches to the construction of suffix arrays in external memory, and thus it should be helpful to anyone interested in building full-text indexes on very large text collections.
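For readers unfamiliar with the data structure, the following is a minimal in-memory sketch of what a suffix array is and how it supports substring search. It is not one of the six external-memory algorithms studied in the paper, nor the new algorithm it proposes: it simply sorts all suffixes in main memory, which is exactly what becomes infeasible on very large texts. The function names `build_suffix_array` and `find_occurrences` are hypothetical and chosen only for illustration.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, sorted lexicographically.

    Naive illustration only: materializes every suffix as a sort key,
    so it needs quadratic space in the worst case and assumes the whole
    text fits in internal memory.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of `pattern` in `text`, in increasing order.

    Uses two binary searches over the suffix array: suffixes that start
    with `pattern` form one contiguous range in `sa`.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
```

Each search needs a number of string comparisons logarithmic in the text length, which is the "fast search" property the abstract refers to; the construction step is the part whose I/O cost the paper is concerned with.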
Keywords
External Memory, Construction Algorithm, Internal Memory, Suffix Array, Text Collection
References
- 1. L. Arge, P. Ferragina, R. Grossi and J. S. Vitter. On sorting Strings in External Memory. In ACM Symp. on Theory of Computing, pp. 540–548, 1997.
- 2. R. Ahuja, K. Mehlhorn, J. B. Orlin and R. E. Tarjan. Faster Algorithms for the Shortest Path Problem. Journal of the ACM (2), pp. 213–223, 1990.
- 3. A. Andersson and S. Nilsson. Efficient implementation of Suffix Trees. Software Practice and Experience, 2(25): 129–141, 1995.
- 4. S. Burkhard, A. Crauser, P. Ferragina, H. Lenhof, E. Rivals and M. Vingron. q-gram based database searching using a suffix array (QUASAR). International Conference on Computational Molecular Biology, 1999.
- 5. D. R. Clark and J. I. Munro. Efficient Suffix Trees on Secondary Storage. In ACM-SIAM Symp. on Discrete Algorithms, pp. 383–391, 1996.
- 6. A. Crauser, P. Ferragina and U. Meyer. Practical and Efficient Priority Queues for External Memory. Technical Report MPI, see WEB pages of the authors.
- 7. A. Crauser and K. Mehlhorn. LEDA-SM: A Library Prototype for Computing in Secondary Memory. Technical Report MPI, see WEB pages of the authors.
- 8. C. Faloutsos. Access Methods for text. ACM Computing Surveys, 17, pp. 49–74, March 1985.
- 9. M. Farach. Optimal suffix tree construction with large alphabets. In IEEE Foundations of Computer Science, pp. 137–143, 1997.
- 10. M. Farach, P. Ferragina and S. Muthukrishnan. Overcoming the Memory Bottleneck in Suffix Tree Construction. In IEEE Foundations of Computer Science, 1998.
- 11. C. L. Feng. PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval. ACM SIGIR, pp. 50–58, 1997.
- 12. P. Ferragina and R. Grossi. A Fully-Dynamic Data Structure for External Substring Search. In ACM Symp. Theory of Computing, pp. 693–702, 1995. Also Journal of the ACM (to appear).
- 13. G. H. Gonnet, R. A. Baeza-Yates and T. Snider. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates Editors, pp. 66–82, Prentice-Hall, 1992.
- 14. D. E. Knuth. The Art of Computer Programming: Sorting and Searching. Vol. 3, Addison-Wesley Publishing Co. 1973.
- 15. S. Kurtz. Reducing the Space Requirement of Suffix Trees. Technical Report 98-03, University of Bielefeld, 1998.
- 16. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 5, pp. 935–948, 1993.
- 17. E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM 23, 2, pp. 262–272, 1976.
- 18. G. Navarro, J. P. Kitajima, B. A. Ribeiro-Neto and N. Ziviani. Distributed Generation of Suffix Arrays. In Combinatorial Pattern Matching Conference, pp. 103–115, 1997.
- 19. S. Näher and K. Mehlhorn. LEDA: A Platform for Combinatorial and Geometric Computing. Communications of the ACM (38), 1995.
- 20. C. Ruemmler and J. Wilkes. An introduction to disk drive modeling. IEEE Computer, 27(3):17–29, 1994.
- 21. E. A. Shriver and J. S. Vitter. Algorithms for parallel memory I: two-level memories. Algorithmica, 12(2-3), pp. 110–147, 1994.
- 22. D. E. Vengroff and J. S. Vitter. I/O-efficient scientific computing using TPIE. In IEEE Symposium on Parallel and Distributed Computing, 1995.
- 23. J. Vitter. External memory algorithms. Invited Tutorial in 17th Ann. ACM Symp. on Principles of Database Systems (PODS’ 98), 1998. Also Invited Paper in European Symposium on Algorithms (ESA’ 98), 1998.
- 24. J. Zobel, A. Moffat and K. Ramamohanarao. Guidelines for presentation and comparison of indexing techniques. SIGMOD Record 25, 3:10–15, 1996.