Bloom Filter Trie – A Data Structure for Pan-Genome Storage

Holley, Guillaume; Wittler, Roland; Stoye, Jens

doi:10.1007/978-3-662-48221-6_16

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9289)

Included in the following conference series: International Workshop on Algorithms in Bioinformatics

Abstract

High-throughput sequencing technologies have become fast and cheap in recent years. As a result, large-scale projects have started to sequence tens to several thousands of genomes per species, producing a large number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie. The data structure makes it possible to store and compress a set of colored k-mers, and also to traverse the graph efficiently. Experimental results show better performance than another state-of-the-art data structure.
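The notion of colored k-mers described in the abstract can be illustrated with a minimal Python sketch. It is not the Bloom Filter Trie itself: it uses a plain dictionary, and the function names `colored_kmers` and `successors`, the genome labels, and the toy sequences are all hypothetical choices made for illustration. It only shows how each k-mer can be associated with the set of genomes ("colors") containing it, and how graph traversal can follow overlaps of length k-1.

```python
from collections import defaultdict

def colored_kmers(genomes, k):
    """Map each k-mer to the set of genome names (colors) that contain it."""
    colors = defaultdict(set)
    for name, seq in genomes.items():
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)

def successors(kmer, colors):
    """Return the k-mers present in the index that overlap `kmer` by k-1 characters."""
    suffix = kmer[1:]
    return [suffix + c for c in "ACGT" if suffix + c in colors]

# Toy input: two very similar "genomes" (illustrative data only).
genomes = {"g1": "ACGTAC", "g2": "ACGTTC"}
index = colored_kmers(genomes, 3)

print(index["ACG"])              # k-mer shared by both genomes
print(index["GTA"])              # k-mer specific to g1
print(successors("CGT", index))  # branching point between the two genomes
```

A dictionary of sets like this does not scale to thousands of genomes, which is the storage problem a compressed structure such as the one presented in the paper is designed to address.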

Notes

1. https://github.com/lh3/wgsim
2. http://www.7-zip.org/

Acknowledgments

The authors wish to thank the anonymous reviewers and the authors of SBT for helpful comments. GH and RW are funded by the International DFG Research Training Group GRK 1906/1.

Author information

Authors and Affiliations

Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany
Guillaume Holley, Roland Wittler & Jens Stoye
Center for Biotechnology, Bielefeld University, Bielefeld, Germany
Guillaume Holley, Roland Wittler & Jens Stoye
International Research Training Group 1906, Bielefeld University, Bielefeld, Germany
Guillaume Holley, Roland Wittler & Jens Stoye

Corresponding author

Correspondence to Guillaume Holley.

Editor information

Editors and Affiliations

University of Maryland, College Park, Maryland, USA
Mihai Pop
University of Lille, Lille, France
Hélène Touzet

About this paper

Cite this paper

Holley, G., Wittler, R., Stoye, J. (2015). Bloom Filter Trie – A Data Structure for Pan-Genome Storage. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science, vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_16

DOI: https://doi.org/10.1007/978-3-662-48221-6_16
Published: 28 August 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48220-9
Online ISBN: 978-3-662-48221-6
eBook Packages: Computer Science, Computer Science (R0)
