Storage and Retrieval of Individual Genomes

  • Veli Mäkinen
  • Gonzalo Navarro
  • Jouni Sirén
  • Niko Välimäki
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5541)


A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically huge collection is plausible using suffix trees. However, suffix tree occupies O(N logN) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of suffix tree to O(N logσ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on suffix tree of Human Genome. However, this reduction factor remains constant when more sequences are added to the collection.

We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N/n.

We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.


Comparative genomics full-text indexing suffix tree compressed data structures 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Blanford, D., Blelloch, G.: Compact representations of ordered sets. In: Proc. 15th SODA, pp. 11–19 (2004)Google Scholar
  2. 2.
    Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report Technical Report 124, Digital Equipment Corporation (1994)Google Scholar
  3. 3.
    Church, G.M.: Genomes for all. Scientific American 294(1), 47–54 (2006)CrossRefGoogle Scholar
  4. 4.
    Ferragina, P., Manzini, G.: Indexing compressed texts. Journal of the ACM 52(4), 552–581 (2005)CrossRefGoogle Scholar
  5. 5.
    Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG) 3(2) article 20 (2007)Google Scholar
  6. 6.
    Fischer, J., Mäkinen, V., Navarro, G.: An(other) entropy-bounded compressed suffix tree. In: Ferragina, P., Landau, G.M. (eds.) CPM 2008. LNCS, vol. 5029, pp. 152–165. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  7. 7.
    Grossi, R., Vitter, J.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35(2), 378–407 (2006)CrossRefGoogle Scholar
  8. 8.
    Gupta, A., Hon, W.-K., Shah, R., Vitter, J.S.: Compressed data structures: Dictionaries and data-aware measures. In: DCC 2006: Proceedings of the Data Compression Conference (DCC 2006), pp. 213–222 (2006)Google Scholar
  9. 9.
    Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)CrossRefGoogle Scholar
  10. 10.
    Hall, N.: Advanced sequencing technologies and their wider impact in microbiology. The Journal of Experimental Biology 209, 1518–1525 (2007)CrossRefGoogle Scholar
  11. 11.
    Kaplan, H.: Persistent Data Structures. In: Mehta, D.P., Sahni, S. (eds.) Handbook of Data Structures and Applications, vol. 31. Chapman & Hall, Boca Raton (2005)Google Scholar
  12. 12.
    Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12(1), 40–66 (2005)Google Scholar
  13. 13.
    Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)CrossRefGoogle Scholar
  14. 14.
    Manzini, G.: An analysis of the Burrows-Wheeler transform. Journal of the ACM 48(3), 407–430 (2001)CrossRefGoogle Scholar
  15. 15.
    Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) article 2 (2007)Google Scholar
  16. 16.
    Overmars, M.H.: Searching in the past, i. Technical Report Technical Report RUU-CS-81-7, Department of Computer Science, University of Utrecht, Utrecht, Netherlands (1981)Google Scholar
  17. 17.
    Pennisi, E.: Breakthrough of the year: Human genetic variation. Science 21, 1842–1843 (2007)CrossRefGoogle Scholar
  18. 18.
    Russo, L., Navarro, G., Oliveira, A.: Dynamic fully-compressed suffix trees. In: Ferragina, P., Landau, G.M. (eds.) CPM 2008. LNCS, vol. 5029, pp. 191–203. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  19. 19.
    Russo, L., Navarro, G., Oliveira, A.: Fully-compressed suffix trees. In: Laber, E.S., Bornstein, C., Nogueira, L.T., Faria, L. (eds.) LATIN 2008. LNCS, vol. 4957, pp. 362–373. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  20. 20.
    Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms 48(2), 294–313 (2003)CrossRefGoogle Scholar
  21. 21.
    Sadakane, K.: Compressed suffix trees with full functionality. Theory of Computing Systems 41(4), 589–607 (2007)CrossRefGoogle Scholar
  22. 22.
    Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008)Google Scholar
  23. 23.
    Waterman, M.S.: Introduction to Computational Biology. Chapman & Hall, University Press (1995)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Veli Mäkinen
    • 1
  • Gonzalo Navarro
    • 2
  • Jouni Sirén
    • 1
  • Niko Välimäki
    • 1
  1. 1.Department of Computer ScienceUniversity of HelsinkiFinland
  2. 2.Department of Computer ScienceUniversity of ChileChile

Personalised recommendations