Distributed generation of suffix arrays

  • Gonzalo Navarro
  • João Paulo Kitajima
  • Berthier A. Ribeiro-Neto
  • Nivio Ziviani
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1264)

Abstract

An algorithm for the distributed computation of suffix arrays for large texts is presented. The parallelism model is that of a set of sequential tasks which execute in parallel and exchange messages among them. The underlying architecture is that of a high bandwidth network of processors. Our algorithm builds the suffix array by quickly assigning an independent subproblem to each processor and completing the process with a final local sorting. We demonstrate that the algorithm has time complexity of O(b log n) computation and O(b) communication in the average case, where b corresponds to the local text size on each processor (i.e., text size n divided by r, the number of processors). This is faster than the best known sequential algorithm and improves over previous parallel algorithms to build suffix arrays, both in time complexity and scaling factor.
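To make the terms in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It builds a suffix array naively, then sequentially simulates the general idea of giving each of r workers an independent lexicographic range of suffixes that is finished with a local sort. The splitter selection, the function names, and the single-process simulation are all assumptions made for illustration; the abstract does not specify how subproblems are assigned or how messages are exchanged.

```python
import bisect


def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def partitioned_suffix_array(text, r):
    """Simulate r workers, each sorting one lexicographic range of suffixes.

    Assumption: ranges are delimited by r - 1 splitter suffixes drawn from a
    small sorted sample. This is a generic sample-sort style partition, not
    the scheme described in the paper.
    """
    n = len(text)
    if n == 0:
        return []
    # Hypothetical sampling step: take roughly 4 * r evenly spaced suffixes.
    step = max(1, n // (4 * r))
    sample = sorted(text[i:] for i in range(0, n, step))
    splitters = [sample[(k * len(sample)) // r] for k in range(1, r)]

    # Assign every suffix to the range its splitters enclose.
    buckets = [[] for _ in range(r)]
    for i in range(n):
        buckets[bisect.bisect_right(splitters, text[i:])].append(i)

    # Final local sort per simulated worker; ranges are already ordered,
    # so concatenating them yields the global suffix array.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket, key=lambda i: text[i:]))
    return result


if __name__ == "__main__":
    text = "abracadabra"
    print(suffix_array(text))
    print(partitioned_suffix_array(text, 3))
```

In this sketch each bucket plays the role of one processor's subproblem; with well-chosen splitters its size is close to b = n / r, the local text size named in the abstract. The naive comparisons used here do not achieve the complexity bounds stated above.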

Keywords

Parallel Algorithm; Parallel Machine; Binary Search; Sequential Algorithm; Index Point
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 1997

Authors and Affiliations

  • Gonzalo Navarro (1)
  • João Paulo Kitajima (2)
  • Berthier A. Ribeiro-Neto (2)
  • Nivio Ziviani (2)

  1. Dept. of Computer Science, University of Chile, Chile
  2. Dept. of Computer Science, Federal University of Minas Gerais, Brazil
