Suffix Tree of Alignment: An Efficient Index for Similar Data

  • Joong Chae Na
  • Heejin Park
  • Maxime Crochemore
  • Jan Holub
  • Costas S. Iliopoulos
  • Laurent Mouchard
  • Kunsoo Park
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8288)


We consider an index data structure for similar strings. One solution is the generalized suffix tree: for two strings A and B, it is a compacted trie representing all suffixes of A and B. It has |A| + |B| leaves and can be constructed in O(|A| + |B|) time. However, when the two strings are similar, the generalized suffix tree is inefficient because it does not exploit the similarity, which is usually represented as an alignment of A and B.
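To make the leaf count concrete, here is a minimal sketch of a generalized suffix structure for two strings. It is a naive, uncompacted suffix trie built in quadratic time, not the linear-time compacted tree referred to above; the leaf count is the same in both. The function names and the terminator characters `#` and `$` are assumptions made for this illustration, and the inputs are assumed not to contain those characters.

```python
def build_generalized_suffix_trie(a, b):
    """Naive (uncompacted) trie of all suffixes of a + '#' and b + '$'.

    Illustrative only: this takes quadratic time and space, unlike a
    linear-time compacted suffix tree. Assumes '#' and '$' do not occur
    in a or b.
    """
    root = {}
    for text, terminator in ((a, "#"), (b, "$")):
        s = text + terminator
        for i in range(len(text)):  # skip the terminator-only suffix
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Count the leaves (nodes without children) below a node."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def contains(root, pattern):
    """Return True if pattern occurs as a substring of a or b."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


if __name__ == "__main__":
    a, b = "banana", "bandana"
    trie = build_generalized_suffix_trie(a, b)
    print(count_leaves(trie) == len(a) + len(b))
    print(contains(trie, "nda"), contains(trie, "nab"))
```

Because every suffix ends in a terminator that appears nowhere else in its string, no suffix is a prefix of another, so each of the |A| + |B| suffixes ends at its own leaf.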

In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of A and B has |A| + l_d + l_1 leaves, where l_d is the sum of the lengths of all parts of B different from A and l_1 is the sum of the lengths of some common parts of A and B. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern P in O(|P| + occ) time, where occ is the number of occurrences of P in A and B. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires O(|A| + l_d + l_1 + l_2) time, where l_2 is the sum of the lengths of other common substrings of A and B. When the suffix tree of A is already given, it requires O(l_d + l_1 + l_2) time.
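The leaf-count comparison can be sketched with a small calculation. The alignment representation below (a list of `("common", s)` and `("diff", a_part, b_part)` segments) and the function names are assumptions made for this illustration, not the paper's notation. The abstract does not say which common parts contribute to l_1, so the sketch takes l_1 as an input rather than computing it.

```python
def alignment_stats(segments):
    """Reconstruct A and B from an alignment and compute l_d.

    segments is a hypothetical representation: a list of
    ("common", s) or ("diff", a_part, b_part) tuples.
    l_d is the total length of the parts of B that differ from A.
    """
    a_parts, b_parts = [], []
    l_d = 0
    for seg in segments:
        if seg[0] == "common":
            a_parts.append(seg[1])
            b_parts.append(seg[1])
        else:
            _, a_diff, b_diff = seg
            a_parts.append(a_diff)
            b_parts.append(b_diff)
            l_d += len(b_diff)
    return "".join(a_parts), "".join(b_parts), l_d


def leaf_counts(segments, l_1):
    """Return (generalized suffix tree leaves, suffix tree of alignment leaves).

    l_1 is supplied by the caller because its exact definition is not
    given in the abstract.
    """
    a, b, l_d = alignment_stats(segments)
    return len(a) + len(b), len(a) + l_d + l_1


if __name__ == "__main__":
    example = [("common", "aaatcaaa"), ("diff", "cac", "tt"), ("common", "gaat")]
    a, b, l_d = alignment_stats(example)
    print(a, b, l_d)
    print(leaf_counts(example, l_1=3))  # l_1 = 3 is an arbitrary placeholder
```

When B differs from A in only a few short parts, l_d is small, so the second count stays close to |A| as long as l_1 is also small, whereas the generalized suffix tree always has |A| + |B| leaves.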


Keywords: Indexes for similar data, suffix trees, alignments





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Joong Chae Na (1)
  • Heejin Park (2)
  • Maxime Crochemore (3)
  • Jan Holub (4)
  • Costas S. Iliopoulos (3)
  • Laurent Mouchard (5)
  • Kunsoo Park (6)

  1. Sejong University, Korea
  2. Hanyang University, Korea
  3. King’s College London, UK
  4. Czech Technical University in Prague, Czech Republic
  5. University of Rouen, France
  6. Seoul National University, Korea
