Storage and Retrieval of Individual Genomes

Mäkinen, Veli; Navarro, Gonzalo; Sirén, Jouni; Välimäki, Niko

doi:10.1007/978-3-642-02008-7_9

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5541)

Included in the following conference series:

Annual International Conference on Research in Computational Molecular Biology

Abstract

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the human genome. However, this reduction factor remains constant when more sequences are added to the collection.
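As a rough illustration of the gap between the two bounds, the sketch below compares the bits needed for one text pointer (about log N) with the bits needed for one symbol (about log σ). It ignores all hidden constants and lower-order terms, and the values N = 3 × 10^9 (an approximate human genome length) and σ = 4 are assumptions chosen for the example, not figures from the abstract.

```python
import math

def pointer_bits(N):
    """Bits needed to address one of N text positions (the log N factor)."""
    return math.ceil(math.log2(N))

def symbol_bits(sigma):
    """Bits needed to store one symbol of an alphabet of size sigma (the log sigma factor)."""
    return math.ceil(math.log2(sigma))

# Assumed example values: roughly the human genome length and the DNA alphabet.
N = 3_000_000_000
sigma = 4

# Ratio of the per-position costs, ignoring hidden constants in both bounds.
ratio = pointer_bits(N) / symbol_bits(sigma)
print(pointer_bits(N), symbol_bits(sigma), ratio)
```

The ratio depends only on N and σ, which is consistent with the observation that the reduction factor stays essentially constant as more sequences are added.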

We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N/n.
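One way to see why space could depend on n and s rather than on N is to count runs of equal symbols in the Burrows-Wheeler transform of a repetitive collection. The sketch below is only an illustration under that assumption: the abstract does not state which technique the indexes use, and the helper names (`bwt`, `count_runs`, `make_collection`), the separator symbols, and the parameter values are all hypothetical. The suffix sorting is naive and suitable only for small inputs.

```python
import random

def bwt(text):
    """Burrows-Wheeler transform via naive suffix sorting (small inputs only)."""
    text = text + "\0"  # terminator that sorts before every other symbol
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT symbol of a suffix is the character that precedes it in the text.
    return "".join(text[i - 1] for i in order)

def count_runs(s):
    """Number of maximal runs of equal symbols in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])

def make_collection(n, copies, edits_per_copy, seed=0):
    """Random DNA base sequence plus `copies` variants with a few substitutions each."""
    rng = random.Random(seed)
    base = "".join(rng.choice("ACGT") for _ in range(n))
    seqs = [base]
    for _ in range(copies):
        seq = list(base)
        for _ in range(edits_per_copy):
            seq[rng.randrange(n)] = rng.choice("ACGT")
        seqs.append("".join(seq))
    return "$".join(seqs)  # "$" separates the sequences

collection = make_collection(n=300, copies=10, edits_per_copy=3)
runs = count_runs(bwt(collection))
print(len(collection), runs)
```

On inputs like this, the run count is typically much smaller than the total length N, because copies of the same context sort next to each other and tend to be preceded by the same symbol. A run-length encoding of the transform would then grow mainly with the base sequence and the edits.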

We believe the structures developed in this work will provide a fundamental basis for the storage and retrieval of individual genomes as they become available through rapid progress in sequencing technologies.

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Finland
Veli Mäkinen, Jouni Sirén & Niko Välimäki
Department of Computer Science, University of Chile, Chile
Gonzalo Navarro

Editor information

Editors and Affiliations

Computer Science Department, James H. Clark Center, 318 Campus Drive, RM S266, Stanford, CA 94305-5428, USA
Serafim Batzoglou

Cite this paper

Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N. (2009). Storage and Retrieval of Individual Genomes. In: Batzoglou, S. (ed.) Research in Computational Molecular Biology. RECOMB 2009. Lecture Notes in Computer Science, vol 5541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02008-7_9

DOI: https://doi.org/10.1007/978-3-642-02008-7_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02007-0
Online ISBN: 978-3-642-02008-7
eBook Packages: Computer Science, Computer Science (R0)
