Suffix Tree of Alignment: An Efficient Index for Similar Data

  • Joong Chae Na
  • Heejin Park
  • Maxime Crochemore
  • Jan Holub
  • Costas S. Iliopoulos
  • Laurent Mouchard
  • Kunsoo Park
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8288)

Abstract

We consider index data structures for similar strings. One candidate is the generalized suffix tree: for two strings A and B, it is a compacted trie representing all suffixes of A and B, has |A| + |B| leaves, and can be constructed in O(|A| + |B|) time. However, when the two strings are similar, the generalized suffix tree is inefficient because it does not exploit the similarity, which is usually represented as an alignment of A and B.
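To make the baseline concrete, the following is a minimal sketch of a generalized suffix structure for two strings. It is a naive, uncompacted trie built in quadratic time, not the linear-time compacted suffix tree the abstract refers to. The function names and the terminator characters `#` and `$` are assumptions made for illustration and are assumed not to occur in A, B, or the pattern.

```python
def build_generalized_suffix_trie(a, b):
    """Naive uncompacted trie of all non-empty suffixes of a and b.

    Quadratic time and space; for illustration only. The terminators
    '#' and '$' are assumed not to occur in a or b.
    """
    root = {}
    for text_id, text in ((0, a + "#"), (1, b + "$")):
        # Skip the final suffix, which consists of the terminator alone.
        for start in range(len(text) - 1):
            node = root
            for ch in text[start:]:
                node = node.setdefault(ch, {})
            # The key None marks a leaf: (which string, start position).
            node[None] = (text_id, start)
    return root


def collect_leaves(node):
    """Return the leaf labels stored in the subtree rooted at node."""
    found = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def find_occurrences(trie, pattern):
    """Return (string id, position) for every occurrence of pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    return collect_leaves(node)


a, b = "abcab", "abxab"
trie = build_generalized_suffix_trie(a, b)
leaf_count = len(collect_leaves(trie))  # one leaf per suffix: |A| + |B|
hits = find_occurrences(trie, "ab")     # occurrences in both strings
```

Even though `a` and `b` share most of their characters, every suffix of both strings gets its own leaf, which is the redundancy described above.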

In this paper we propose a space- and time-efficient suffix tree of alignment that exploits the similarity captured by an alignment. Our suffix tree for an alignment of A and B has |A| + l_d + l_1 leaves, where l_d is the sum of the lengths of all parts of B that differ from A and l_1 is the sum of the lengths of some common parts of A and B. The space reduction does not compromise pattern search: our suffix tree can be searched for a pattern P in O(|P| + occ) time, where occ is the number of occurrences of P in A and B. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires O(|A| + l_d + l_1 + l_2) time, where l_2 is the sum of the lengths of other common substrings of A and B. When the suffix tree of A is already given, it requires O(l_d + l_1 + l_2) time.
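As a small illustration of the quantities in the leaf bound, the sketch below expands an alignment into A and B and computes l_d, the total length of the parts of B that differ from A. The segment representation (`"common"` and `"diff"` tuples) and the function name are hypothetical, chosen only for this example. The term l_1 is not computed, because the abstract does not say which common parts it covers.

```python
def expand_alignment(segments):
    """Expand an alignment into (A, B, l_d).

    segments is a list of tuples (a hypothetical representation):
      ("common", s)            s appears in both A and B
      ("diff", a_part, b_part) a_part appears in A, b_part in B
    l_d is the total length of the parts of B that differ from A.
    """
    a_parts, b_parts = [], []
    l_d = 0
    for segment in segments:
        if segment[0] == "common":
            a_parts.append(segment[1])
            b_parts.append(segment[1])
        else:
            _, a_diff, b_diff = segment
            a_parts.append(a_diff)
            b_parts.append(b_diff)
            l_d += len(b_diff)
    return "".join(a_parts), "".join(b_parts), l_d


alignment = [
    ("common", "aaatcaaa"),
    ("diff", "cg", "t"),
    ("common", "atctaa"),
]
a, b, l_d = expand_alignment(alignment)
generalized_leaves = len(a) + len(b)  # leaves of the generalized suffix tree
alignment_leaves_without_l1 = len(a) + l_d  # the stated bound adds l_1 to this
```

For this made-up input the generalized suffix tree has 31 leaves, while |A| + l_d is 17. The bound stated above adds l_1 on top of that figure.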

Keywords

Indexes for similar data, suffix trees, alignments

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Joong Chae Na (1)
  • Heejin Park (2)
  • Maxime Crochemore (3)
  • Jan Holub (4)
  • Costas S. Iliopoulos (3)
  • Laurent Mouchard (5)
  • Kunsoo Park (6)

  1. Sejong University, Korea
  2. Hanyang University, Korea
  3. King’s College London, UK
  4. Czech Technical University in Prague, Czech Republic
  5. University of Rouen, France
  6. Seoul National University, Korea
