Abstract
We present two new variants of the suffix tree that allow much larger genome sequence databases to be handled efficiently. The method is based on a new linear-time construction algorithm for “sparse” suffix trees, which are subtrees of the full suffix tree. The new data structures are called the paged suffix tree (PST) and the distributed suffix tree (DST). Both tackle the memory bottleneck by constructing subtrees of the full suffix tree independently; the PST is designed for single-processor environments and the DST for distributed-memory parallel computing environments (e.g., Beowulf clusters). The standard operations on suffix trees of biological importance are shown to translate easily to these new data structures. None of these operations on the DST requires interprocess communication, and many have optimal expected parallel running times.
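The idea of building subtrees independently can be illustrated with a small sketch. It is not the linear-time construction described in the paper: it uses a naive, uncompressed suffix trie with quadratic cost, and it assumes that the suffixes are split into groups by their first `k` characters, which the abstract does not state. All names (`Node`, `build_subtrie`, `build_partitions`, `find_all`) and the `$` terminator are hypothetical choices made for this example.

```python
from collections import defaultdict


class Node:
    """One node of a naive, uncompressed suffix trie (illustrative only)."""

    def __init__(self):
        self.children = {}
        self.start = None  # start position of the suffix that ends at this node


def build_subtrie(text, starts):
    """Insert only the suffixes beginning at `starts`; no other group is consulted."""
    root = Node()
    for s in starts:
        node = root
        for ch in text[s:]:
            node = node.children.setdefault(ch, Node())
        node.start = s
    return root


def build_partitions(text, k=2, terminator="$"):
    """Group suffixes by their first k characters (an assumed partitioning rule)
    and build one subtrie per group. The terminator must not occur in `text`."""
    text += terminator
    groups = defaultdict(list)
    for i in range(len(text)):
        groups[text[i:i + k]].append(i)
    # Each subtrie could be built on a separate page or a separate processor.
    return {prefix: build_subtrie(text, starts) for prefix, starts in groups.items()}


def search_subtrie(root, pattern):
    """Return the start positions of all suffixes in this subtrie that begin with `pattern`."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.start is not None:
            found.append(current.start)
        stack.extend(current.children.values())
    return found


def find_all(partitions, pattern, k=2):
    """Route the query to the relevant subtries; the subtries never exchange data.
    `k` must match the value passed to build_partitions."""
    if len(pattern) >= k:
        keys = [pattern[:k]] if pattern[:k] in partitions else []
    else:
        keys = [prefix for prefix in partitions if prefix.startswith(pattern)]
    hits = []
    for key in keys:
        hits.extend(search_subtrie(partitions[key], pattern))
    return sorted(hits)
```

For example, `find_all(build_partitions("banana"), "ana")` visits only the subtrie for the prefix `an`. A pattern shorter than `k` is answered by every subtrie whose prefix begins with it, and each of those lookups is independent of the others, which is the property that makes the subtrees suitable for paging or distribution.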
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Clifford, R., Sergot, M. (2003). Distributed and Paged Suffix Trees for Large Genetic Databases. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds) Combinatorial Pattern Matching. CPM 2003. Lecture Notes in Computer Science, vol 2676. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44888-8_6
DOI: https://doi.org/10.1007/3-540-44888-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40311-1
Online ISBN: 978-3-540-44888-4
eBook Packages: Springer Book Archive