Suffix Tree of Alignment: An Efficient Index for Similar Data

Na, Joong Chae; Park, Heejin; Crochemore, Maxime; Holub, Jan; Iliopoulos, Costas S.; Mouchard, Laurent; Park, Kunsoo

doi:10.1007/978-3-642-45278-9_29

Joong Chae Na¹⁷,
Heejin Park¹⁸,
Maxime Crochemore¹⁹,
Jan Holub²⁰,
Costas S. Iliopoulos¹⁹,
Laurent Mouchard²¹ &
…
Kunsoo Park²²

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 8288))

Included in the following conference series:

International Workshop on Combinatorial Algorithms

907 Accesses
10 Citations

Abstract

We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings A and B is a compacted trie representing all suffixes in A and B. It has |A| + |B| leaves and can be constructed in O(|A| + |B|) time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of A and B.

In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of A and B has |A| + l _d + l ₁ leaves where l _d is the sum of the lengths of all parts of B different from A and l ₁ is the sum of the lengths of some common parts of A and B. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern P in O(|P| + occ) time where occ is the number of occurrences of P in A and B. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires O(|A| + l _d + l ₁ + l ₂) time where l ₂ is the sum of the lengths of other common substrings of A and B. When the suffix tree of A is already given, it requires O(l _d + l ₁ + l ₂) time.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing 467(7319), 1061–1073 (2010)
Google Scholar
Amir, A., Farach, M., Galil, Z., Giancarlo, R., Park, K.: Dynamic dictionary matching. J. Comput. Syst. Sci. 49, 208–222 (1994)
Article MathSciNet MATH Google Scholar
Baeza-Yates, R.A., Gonnet, G.H.: Fast text searching for regular expressions or automaton searching on tries. J. ACM 43(6), 915–936 (1996)
MathSciNet MATH Google Scholar
Bille, P., Gørtz, I.L.: Substring range reporting. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661, pp. 299–308. Springer, Heidelberg (2011)
Chapter Google Scholar
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, California (1994)
Google Scholar
Crochemore, M., Rytter, W.: Jewels of Stringology. World Scientific Publishing, Singapore (2002)
Book Google Scholar
Do, H.H., Jansson, J., Sadakane, K., Sung, W.-K.: Fast relative lempel-ziv self-index for similar sequences. In: Snoeyink, J., Lu, P., Su, K., Wang, L. (eds.) FAW-AAIM 2012. LNCS, vol. 7285, pp. 291–302. Springer, Heidelberg (2012)
Chapter Google Scholar
Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. ACM 47(6), 987–1011 (2000)
MathSciNet MATH Google Scholar
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)
Article MathSciNet MATH Google Scholar
Gusfield, D.: Algorithms on Strings, Tree, and Sequences. Cambridge University Press, Cambridge (1997)
Book Google Scholar
Huang, S., Lam, T.W., Sung, W.K., Tam, S.L., Yiu, S.M.: Indexing similar dna sequences. In: Chen, B. (ed.) AAIM 2010. LNCS, vol. 6124, pp. 180–190. Springer, Heidelberg (2010)
Chapter Google Scholar
Karlin, S., Ghandour, G., Ost, F., Tavare, S., Korn, L.J.: New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. 80(18), 5660–5664 (1983)
Article MATH Google Scholar
Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. (to appear)
Google Scholar
Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative lempel-ziv compression of genomes for large-scale storage and retrieval. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 201–206. Springer, Heidelberg (2010)
Chapter Google Scholar
Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
MathSciNet Google Scholar
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of individual genomes. In: Batzoglou, S. (ed.) RECOMB 2009. LNCS, vol. 5541, pp. 121–137. Springer, Heidelberg (2009)
Chapter Google Scholar
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Bio. 17(3), 281–308 (2010)
Article Google Scholar
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
MathSciNet MATH Google Scholar
Navarro, G.: Indexing highly repetitive collections. In: Arumugam, S., Smyth, B. (eds.) IWOCA 2012. LNCS, vol. 7643, pp. 274–279. Springer, Heidelberg (2012)
Google Scholar
Sadakane, K.: Compressed suffix trees with full functionality. Theor. Comput. Sci. 41(4), 589–607 (2007)
MathSciNet MATH Google Scholar
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
Article MathSciNet MATH Google Scholar
Weiner, P.: Linear pattern matching algorithms. In: Proc. of the 14th IEEE Symp. on Switching and Automata Theory, pp. 1–11 (1973)
Google Scholar
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. on Information Theory 23(3), 337–343 (1977)
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Sejong University, Korea
Joong Chae Na
Hanyang University, Korea
Heejin Park
King’s College London, UK
Maxime Crochemore & Costas S. Iliopoulos
Czech Technical University in Prague, Czech Republic
Jan Holub
University of Rouen, France
Laurent Mouchard
Seoul National University, Korea
Kunsoo Park

Authors

Joong Chae Na
View author publications
You can also search for this author in PubMed Google Scholar
Heejin Park
View author publications
You can also search for this author in PubMed Google Scholar
Maxime Crochemore
View author publications
You can also search for this author in PubMed Google Scholar
Jan Holub
View author publications
You can also search for this author in PubMed Google Scholar
Costas S. Iliopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Laurent Mouchard
View author publications
You can also search for this author in PubMed Google Scholar
Kunsoo Park
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

UFR des Sciences et Techniques, LITIS, Université de Rouen, 76821, EA 4180, Mont Saint Aignan Cedex, France
Thierry Lecroq & Laurent Mouchard &

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Na, J.C. et al. (2013). Suffix Tree of Alignment: An Efficient Index for Similar Data. In: Lecroq, T., Mouchard, L. (eds) Combinatorial Algorithms. IWOCA 2013. Lecture Notes in Computer Science, vol 8288. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45278-9_29

Download citation

DOI: https://doi.org/10.1007/978-3-642-45278-9_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45277-2
Online ISBN: 978-3-642-45278-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics