The Enhanced Suffix Array and Its Applications to Genome Analysis

Abouelhoda, Mohamed Ibrahim; Kurtz, Stefan; Ohlebusch, Enno

doi:10.1007/3-540-45784-4_35

Mohamed Ibrahim Abouelhoda⁶,
Stefan Kurtz⁶ &
Enno Ohlebusch⁶

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2452))

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

1234 Accesses
41 Citations
3 Altmetric

Abstract

In large scale applications as computational genome analysis, the space requirement of the suffix tree is a severe drawback. In this paper, we present a uniform framework that enables us to systematically replace every string processing algorithm that is based on a bottomup traversal of a suffix tree by a corresponding algorithm based on an enhanced suffix array (a suffix array enhanced with the lcp-table). In this framework, we will show how maximal, supermaximal, and tandem repeats, as well as maximal unique matches can be efficiently computed. Because enhanced suffix arrays require much less space than suffix trees, very large genomes can now be indexed and analyzed, a task which was not feasible before. Experimental results demonstrate that our programs require not only less space but also much less time than other programs developed for the same tasks.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

M.I. Abouelhoda, E. Ohlebusch, and S. Kurtz. Optimal Exact String Matching Based on Suffix Arrays. In Proceedings of the Ninth International Symposium on String Processing and Information Retrieval. Springer-Verlag, Lecture Notes in Computer Science, 2002.
Google Scholar
A. Apostolico. The Myriad Virtues of Subword Trees. In Combinatorial Algorithms on Words, Springer-Verlag, pages 85–96, 1985.
Google Scholar
J. Bentley and R. Sedgewick. Fast Algorithms for Sorting and Searching Strings. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 360–369, 1997.
Google Scholar
M. Burrows and D.J. Wheeler. A Block-Sorting Lossless Data Compression Algorithm. Research Report 124, Digital Systems Research Center, 1994.
Google Scholar
A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of Whole Genomes. Nucleic Acids Res., 27:2369–2376, 1999.
Article Google Scholar
J. A. Eisen, J. F. Heidelberg, O. White, and S.L. Salzberg. Evidence for Symmetric Chromosomal Inversions Around the Replication Origin in Bacteria. Genome Biology, 1(6):1–9, 2000.
Article Google Scholar
D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, New York, 1997.
MATH Google Scholar
D. Gusfield and J. Stoye. Linear Time Algorithms for Finding and Representing all the Tandem Repeats in a String. Report CSE-98-4, Computer Science Division, University of California, Davis, 1998.
Google Scholar
T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and its Applications. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, pages 181–192. Lecture Notes in Computer Science 2089, Springer-Verlag, 2001.
Google Scholar
J. Knight, D. Gusfield, and J. Stoye. The Strmat Software-Package, 1998. http://www.cs.ucdavis.edu/ gus.eld/strmat.tar.gz.
R. Kolpakov and G. Kucherov. Finding Maximal Repetitions in a Word in Linear Time. In Symposium on Foundations of Computer Science, pages 596–604. IEEE Computer Society, 1999.
Google Scholar
S. Kurtz. Reducing the Space Requirement of Suffix Trees. Software—Practice and Experience, 29(13):1149–1171, 1999.
Article Google Scholar
S. Kurtz, J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale. Nucleic Acids Res., 29(22):4633–4642, 2001.
Article Google Scholar
E.S. Lander, L.M. Linton, B. Birren, C. Nusbaum, M.C. Zody, J. Baldwin, K. Devon, and K. Dewar, et. al. Initial Sequencing and Analysis of the Human Genome. Nature, 409:860–921, 2001.
Article Google Scholar
N.J. Larsson and K. Sadakane. Faster Suffix Sorting. Technical Report LU-CSTR: 99-214, Dept. of Computer Science, Lund University, 1999.
Google Scholar
U. Manber and E.W. Myers. Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935–948, 1993.
Article MATH MathSciNet Google Scholar
C. O’Keefe and E. Eichler. The Pathological Consequences and Evolutionary Implications of Recent Human Genomic Duplications. In Comparative Genomics, pages 29–46. Kluwer Press, 2000.
Google Scholar
J. Stoye and D. Gusffield. Simple and Flexible Detection of Contiguous Repeats Using a Suffix Tree. Theoretical Computer Science, 270(1–2):843–856, 2002.
Article MATH MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Faculty of Technology, University of Bielefeld, P.O. Box 10 01 31, 33501, Bielefeld, Germany
Mohamed Ibrahim Abouelhoda, Stefan Kurtz & Enno Ohlebusch

Authors

Mohamed Ibrahim Abouelhoda
View author publications
You can also search for this author in PubMed Google Scholar
Stefan Kurtz
View author publications
You can also search for this author in PubMed Google Scholar
Enno Ohlebusch
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

IMIM-UPF-CRG, Dr. Aiguader 80, 08003, Barcelona, Spain
Roderic Guigó
Department of Computer Science, University of California, 95616, Davis, CA, USA
Dan Gusfield

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Abouelhoda, M.I., Kurtz, S., Ohlebusch, E. (2002). The Enhanced Suffix Array and Its Applications to Genome Analysis. In: Guigó, R., Gusfield, D. (eds) Algorithms in Bioinformatics. WABI 2002. Lecture Notes in Computer Science, vol 2452. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45784-4_35

Download citation

DOI: https://doi.org/10.1007/3-540-45784-4_35
Published: 10 October 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44211-0
Online ISBN: 978-3-540-45784-8
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics