Storage and Retrieval of Individual Genomes

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2009)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5541)

Abstract

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of a suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the human genome. However, this reduction factor remains constant when more sequences are added to the collection.
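
As a rough illustration of the two bounds above, the following sketch compares N log N bits with N log σ bits while ignoring the constants hidden by the O-notation. The function names and the figure of about 3 × 10^9 bases for a human genome are assumptions made for this example; real suffix trees and self-indexes carry additional overheads, so the printed factor is only indicative.

```python
import math

def pointer_bits(N):
    """Bits for N entries of ceil(log2 N) bits each (the N log N term)."""
    return N * math.ceil(math.log2(N))

def symbol_bits(N, sigma):
    """Bits for N symbols of ceil(log2 sigma) bits each (the N log sigma term)."""
    return N * math.ceil(math.log2(sigma))

def reduction_factor(N, sigma):
    """Ratio of the two terms, with all hidden constants ignored."""
    return pointer_bits(N) / symbol_bits(N, sigma)

if __name__ == "__main__":
    N = 3_000_000_000  # assumed approximate length of one human genome
    sigma = 4          # DNA alphabet A, C, G, T
    print(f"reduction factor: {reduction_factor(N, sigma):.1f}")
```

The ratio depends only on log N and log σ, which is consistent with the observation that the factor does not improve when further near-identical sequences are added.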

We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N/n.
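
One way to see why space can depend on n and s rather than on N is to count runs of equal symbols in the Burrows-Wheeler transform of a repetitive collection, the quantity exploited by run-length compressed indexes. The sketch below is an illustration under stated assumptions, not the construction of this paper: the helper names (`bwt`, `count_runs`, `mutate`, `demo`), the use of substitutions as the only edit operation, and the parameter values are all chosen for this example, and the naive rotation sort is suitable for small inputs only.

```python
import random

ALPHABET = "ACGT"

def bwt(text):
    """Burrows-Wheeler transform by naive rotation sorting (small inputs only)."""
    text += "$"  # unique terminator, sorts before A, C, G, T
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    return "".join(text[i - 1] for i in order)

def count_runs(s):
    """Number of maximal runs of equal symbols in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])

def mutate(base, num_edits, rng):
    """Copy of base with num_edits random single-symbol substitutions."""
    seq = list(base)
    for pos in rng.sample(range(len(seq)), num_edits):
        seq[pos] = rng.choice([c for c in ALPHABET if c != seq[pos]])
    return "".join(seq)

def demo(n=100, copies=20, edits_per_copy=1, seed=0):
    """Return (N, BWT runs of a repetitive collection, BWT runs of unrelated text)."""
    rng = random.Random(seed)
    base = "".join(rng.choice(ALPHABET) for _ in range(n))
    variants = (mutate(base, edits_per_copy, rng) for _ in range(copies - 1))
    collection = base + "".join(variants)
    unrelated = "".join(rng.choice(ALPHABET) for _ in range(len(collection)))
    return len(collection), count_runs(bwt(collection)), count_runs(bwt(unrelated))

if __name__ == "__main__":
    N, rep_runs, rand_runs = demo()
    print(f"N={N}  runs(repetitive)={rep_runs}  runs(unrelated)={rand_runs}")
```

Running the demo with more copies increases N, while the run count of the repetitive collection is expected to grow mainly with the number of edits; the unrelated text of the same length serves as a baseline with no repetition to exploit.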

We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in sequencing technologies.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N. (2009). Storage and Retrieval of Individual Genomes. In: Batzoglou, S. (eds) Research in Computational Molecular Biology. RECOMB 2009. Lecture Notes in Computer Science, vol 5541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02008-7_9

  • DOI: https://doi.org/10.1007/978-3-642-02008-7_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02007-0

  • Online ISBN: 978-3-642-02008-7

  • eBook Packages: Computer Science (R0)
