Engineering a Compressed Suffix Tree Implementation

  • Niko Välimäki
  • Wolfgang Gerlach
  • Kashyap Dixit
  • Veli Mäkinen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4525)

Abstract

The suffix tree is one of the most important data structures in string algorithms and biological sequence analysis. Unfortunately, when these algorithms are implemented and applied to real genomic sequences, main memory size often becomes the bottleneck. The reason is simple: a DNA sequence of length n over the alphabet Σ = {A,C,G,T} can be stored in n log|Σ| = 2n bits, whereas its suffix tree occupies O(n log n) bits. In practice, the size difference easily reaches a factor of 50.
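As a rough illustration of this gap, the following minimal sketch compares the packed text size with the size of a single array of n text positions. The function names are hypothetical and the numbers are plain arithmetic, not measurements from the paper. A pointer-based suffix tree stores several such fields per node, which is how the difference grows well beyond the ratio computed here.

```python
def packed_text_bits(n: int, sigma: int) -> int:
    """Bits needed to store n symbols at ceil(log2(sigma)) bits each."""
    bits_per_symbol = (sigma - 1).bit_length()  # ceil(log2(sigma)) for sigma >= 1
    return n * bits_per_symbol


def pointer_array_bits(n: int) -> int:
    """Bits needed for one array of n text positions, ceil(log2(n)) bits each."""
    bits_per_pointer = (n - 1).bit_length()  # ceil(log2(n)) for n >= 1
    return n * bits_per_pointer


# Assumption for illustration: a DNA sequence of 10 * 2^20 symbols.
n = 10 * 2**20
text_bits = packed_text_bits(n, 4)    # 2n bits
array_bits = pointer_array_bits(n)    # n * 24 bits for this n
ratio = array_bits / text_bits

print(f"packed text:       {text_bits // 8} bytes")
print(f"one pointer array: {array_bits // 8} bytes")
print(f"ratio:             {ratio:.1f}")
```

Under these assumptions a single array of positions is already an order of magnitude larger than the packed text, before counting child pointers, suffix links, or string depths.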

We report on an implementation of the compressed suffix tree very recently proposed by Sadakane (Theory of Computing Systems, in press). The compressed suffix tree occupies space proportional to the text size, i.e. O(n log|Σ|) bits, and supports all typical suffix tree operations with at most a log n factor slowdown. Our experiments show that, for example, on a 10 MB DNA sequence the compressed suffix tree takes 10% of the space of a normal suffix tree. At the same time, a representative algorithm is slowed down by a factor of 30.

Our implementation follows the original proposal in spirit, but some internal parts are tailored towards practical implementation. Our construction algorithm runs in O(n log n log|Σ|) time and uses nearly the same space as the final structure while constructing it: on the 10 MB DNA sequence, the maximum space usage during construction is only 1.4 times the size of the final structure.

References

  1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 53–86 (2004)
  2. Apostolico, A.: The myriad virtues of subword trees. In: Combinatorial Algorithms on Words, NATO ISI Series, pp. 85–96. Springer, Heidelberg (1985)
  3. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
  4. Chan, W.-L., Hon, W.-K., Lam, T.-W.: Compressed index for a dynamic collection of texts. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 445–456. Springer, Heidelberg (2004)
  5. Cheung, C.-F., Yu, J.X., Lu, H.: Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering 17(1), 90–105 (2005)
  6. Crochemore, M., Rytter, W.: Jewels of Stringology. World Scientific, Singapore (2002)
  7. Elias, P.: Universal codeword sets and representation of the integers. IEEE Transactions on Information Theory 21(2), 194–200 (1975)
  8. Farach-Colton, M., Bender, M.A.: The LCA problem revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg (2000)
  9. Ferragina, P., Manzini, G.: Indexing compressed texts. Journal of the ACM 52(4), 552–581 (2005)
  10. González, R., Grabowski, Sz., Mäkinen, V., Navarro, G.: Practical implementation of rank and select queries. In: Nikoletseas, S.E. (ed.) WEA 2005. LNCS, vol. 3503, pp. 27–38. Springer, Heidelberg (2005)
  11. Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proc. SODA’03, pp. 841–850 (2003)
  12. Grossi, R., Vitter, J.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35(2), 378–407 (2006)
  13. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
  14. Hon, W.-K.: On the Construction and Application of Compressed Text Indexes. PhD thesis, University of Hong Kong (2004)
  15. Hon, W.-K., Sadakane, K.: Space-economical algorithms for finding maximal unique matches. In: Apostolico, A., Takeda, M. (eds.) CPM 2002. LNCS, vol. 2373, pp. 144–152. Springer, Heidelberg (2002)
  16. Hon, W.-K., Sadakane, K., Sung, W.-K.: Breaking a time-and-space barrier in constructing full-text indices. In: Proc. FOCS’03, p. 251 (2003)
  17. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)
  18. Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12(1), 40–66 (2005)
  19. Mäkinen, V., Navarro, G.: Dynamic entropy compressed sequences and full-text indexes. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 306–317. Springer, Heidelberg (2006)
  20. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 935–948 (1993)
  21. Munro, I.: Tables. In: Chandru, V., Vinay, V. (eds.) Foundations of Software Technology and Theoretical Computer Science (FSTTCS’96). LNCS, vol. 1180, pp. 37–42. Springer, Heidelberg (1996)
  22. Munro, I., Raman, V., Rao, S.: Space efficient suffix trees. Journal of Algorithms 39(2), 205–222 (2001)
  23. Navarro, G.: Indexing text using the Ziv-Lempel trie. Journal of Discrete Algorithms (JDA) 2(1), 87–114 (2004)
  24. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys (To appear 2007), preliminary version available at ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/survcompr2.ps.gz
  25. Sadakane, K.: Compressed suffix trees with full functionality. Theory of Computing Systems (To appear 2007), preliminary version available at http://tcslab.csce.kyushu-u.ac.jp/~sada/papers/cst.ps
  26. Schürmann, K.-B., Stoye, J.: An incomplex algorithm for fast suffix array construction. In: Proc. ALENEX/ANALCO, pp. 77–85 (2005)

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Niko Välimäki (1)
  • Wolfgang Gerlach (2)
  • Kashyap Dixit (3)
  • Veli Mäkinen (1)

  1. Department of Computer Science, University of Helsinki, Finland
  2. Technische Fakultät, Universität Bielefeld, Germany
  3. Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India
