Distributed and Paged Suffix Trees for Large Genetic Databases

  • Conference paper
Combinatorial Pattern Matching (CPM 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2676)

Abstract

We present two new variants of the suffix tree that allow much larger genome sequence databases to be handled efficiently. The method is based on a new linear-time construction algorithm for “sparse” suffix trees, which are subtrees of the whole suffix tree. The new data structures are called the paged suffix tree (PST) and the distributed suffix tree (DST). Both tackle the memory bottleneck by constructing subtrees of the full suffix tree independently; the PST is designed for single-processor environments and the DST for distributed-memory parallel computing environments (e.g. Beowulf clusters). The standard operations on suffix trees of biological importance are shown to be easily translatable to these new data structures. None of these operations on the DST requires interprocess communication, and many have optimal expected parallel running times.
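
The idea of building subtrees of the full suffix tree independently can be illustrated with a small sketch. It is not the paper's linear-time construction: it assumes, hypothetically, that the subtrees are selected by a fixed-length prefix `k` (the abstract does not say how they are chosen), and it builds each one as a naive, quadratic, uncompressed trie. All function names are illustrative. What it is meant to show is that each subtree depends only on the text and its own suffix positions, so subtrees could be built and queried one at a time (paged) or on separate machines (distributed).

```python
from collections import defaultdict


def partition_suffixes(text, k):
    """Group suffix start positions by the first k characters of each suffix."""
    groups = defaultdict(list)
    for i in range(len(text)):
        groups[text[i:i + k]].append(i)
    return dict(groups)


def build_sparse_trie(text, positions):
    """Naive uncompressed trie over the selected suffixes only.

    Stand-in for a sparse suffix tree; quadratic time and space,
    unlike the linear-time algorithm the abstract refers to.
    """
    root = {}
    for i in positions:
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i  # None marks the end of the suffix starting at i
    return root


def occurrences(trie, pattern):
    """Start positions of suffixes in this trie that begin with pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:  # collect every suffix that ends below the matched node
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def build_index(text, k):
    """One independent trie per k-character prefix (assumed partitioning)."""
    return {
        prefix: build_sparse_trie(text, positions)
        for prefix, positions in partition_suffixes(text, k).items()
    }


def search(index, pattern, k):
    """Route a query to the trie(s) that can contain it; tries never interact."""
    if len(pattern) >= k:
        prefix = pattern[:k]
        tries = [index[prefix]] if prefix in index else []
    else:
        tries = [t for prefix, t in index.items() if prefix.startswith(pattern)]
    hits = []
    for trie in tries:
        hits.extend(occurrences(trie, pattern))
    return sorted(hits)


if __name__ == "__main__":
    text = "GATTACAGATTACA"
    index = build_index(text, k=2)
    print(search(index, "TTA", k=2))
    print(search(index, "A", k=2))
```

In this sketch a query at least `k` characters long touches exactly one subtree, which is the property that lets a paged variant keep a single subtree in memory and a distributed variant answer without communication between workers. Shorter queries fan out to every subtree whose prefix is compatible with the pattern.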

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Clifford, R., Sergot, M. (2003). Distributed and Paged Suffix Trees for Large Genetic Databases. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds) Combinatorial Pattern Matching. CPM 2003. Lecture Notes in Computer Science, vol 2676. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44888-8_6

  • DOI: https://doi.org/10.1007/3-540-44888-8_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40311-1

  • Online ISBN: 978-3-540-44888-4

  • eBook Packages: Springer Book Archive
