Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

  • Jouni Sirén
  • Niko Välimäki
  • Veli Mäkinen
  • Gonzalo Navarro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5280)

Abstract

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper studies ways to store massive sets of highly repetitive sequence collections in a space-efficient manner, so that both retrieval of the content and queries on the content of the sequences can be supported time-efficiently. We show that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. We engineer some new structures that use run-length encoding and give empirical evidence that these structures are superior to the current structures.
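To give a feel for why run-length encoding suits such collections, the following minimal sketch builds a toy collection from a random base sequence and counts the runs in its Burrows-Wheeler transform. It is an illustration only, not the structures engineered in the paper. The helper names (`bwt`, `count_runs`, `mutate`), the DNA alphabet, the use of substitutions as the only edit operation, and the concatenation of copies without separators are all assumptions made for this example. The BWT is computed naively by sorting rotations, which is only practical for small inputs.

```python
import random


def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform: sort all rotations of text + '$'."""
    assert "$" not in text  # '$' is reserved as the unique terminator
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


def mutate(base: str, edits: int, rng: random.Random) -> str:
    """Copy of base with a few random single-character substitutions."""
    chars = list(base)
    for _ in range(edits):
        pos = rng.randrange(len(chars))
        chars[pos] = rng.choice("ACGT")
    return "".join(chars)


rng = random.Random(0)  # fixed seed so the toy data is reproducible

# Base sequence of length n = 200, repeated 20 times with small variations,
# giving a collection of total length N = 4000.
base = "".join(rng.choice("ACGT") for _ in range(200))
collection = "".join(mutate(base, 2, rng) for _ in range(20))

# A random text of the same length, for comparison.
random_text = "".join(rng.choice("ACGT") for _ in range(len(collection)))

runs_repetitive = count_runs(bwt(collection))
runs_random = count_runs(bwt(random_text))
print("runs in BWT of repetitive collection:", runs_repetitive)
print("runs in BWT of random text:          ", runs_random)
```

In this toy setting the repetitive collection yields far fewer runs than the random text of equal length, because suffixes that start at the same offset in different copies sort next to each other and are mostly preceded by the same character. A run-length encoding of the transformed text therefore grows with the number of runs rather than with the total length N.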

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Jouni Sirén (1)
  • Niko Välimäki (1)
  • Veli Mäkinen (1)
  • Gonzalo Navarro (2)

  1. Dept. of Computer Science, Univ. of Helsinki, Finland
  2. Dept. of Computer Science, Univ. of Chile, Chile
