Abstract
High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de-Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie. The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. Experimental results prove better performance compared to another state-of-the-art data structure.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
References
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Comm. ACM 13(7), 422–426 (1970)
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Digital SRC Research Report 124 (1994)
Chikhi, R., Medvedev, P.: Informed and automated k-mer size selection for genome assembly. Bioinformatics 30(1), 31–37 (2014)
Cox, A.J., Bauer, M.J., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 28(11), 1415–1419 (2012)
Crusoe, M., Edvenson, G., Fish, J., Howe, A., McDonald, E., Nahum, J., Nanlohy, K., Ortiz-Zuazaga, H., Pell, J., Simpson, J., Scott, C., Srinivasan, R.R., Zhang, Q., Brown, C.T.: The khmer software package: enabling efficient sequence analysis figshare (2014)
Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569–1576 (2015)
DePristo, M.A., Banks, E., Poplin, R., Garimella, K.V., Maguire, J.R., Hartl, C., Philippakis, A.A., del Angel, G., Rivas, M.A., Hanna, M., et al.: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43(5), 491–498 (2011)
Ernst, C., Rahmann, S.: PanCake: A data structure for pangenomes. In: Proceedings of the German Conference on Bioinformatics, vol. 34, pp. 35–45 (2013)
Ferragina, P., Manzini, G.: An experimental study of an opportunistic index. In: Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms, vol. 1, pp. 269–278 (2001)
Fredking, E.: Trie memory. Comm. ACM 3(9), 490–499 (1960)
Heinz, S., Zobel, J., Williams, H.E.: Burst tries: a fast, efficient data structure for string keys. ACM Trans. Inf. Syst. 20(2), 192–223 (2002)
Huang, L., Popic, V., Batzoglou, S.: Short read alignment with populations of genomes. Bioinformatics 29(13), i361–i370 (2013)
Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., McVean, G.: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44(2), 226–232 (2012)
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)
Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011)
Marcus, S., Lee, H., Schatz, M.C.: SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics 30(24), 3476–3483 (2014)
Myers, E.W.: The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005)
Nguyen, N., Hickey, G., Zerbino, D.R., Raney, B., Earl, D., Armstrong, J., Haussler, D., Paten, B.: Building a pangenome reference for a population. J. Comput. Biol. 22(5), 387–401 (2015)
Paten, B., Diekhans, M., Earl, D., St. John, J., Ma, J., Suh, B., Haussler, D.: Cactus graphs for genome comparisons. J. Comput. Biol. 18(3), 469–481 (2011)
Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J.M., Brown, C.T.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl. Acad. Sci. U.S.A. 109(33), 13272–13277 (2012)
Solomon, B., Kingsford, C.: Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees. bioRxiv, 017087 (2015)
Wandelt, S., Starlinger, J., Bux, M., Leser, U.: RCSI: Scalable similarity search in thousand(s) of genomes. Proc. VLDB Endow. 6(13), 1534–1545 (2013)
Acknowledgments
The authors wish to thank the anonymous reviewers and the authors of SBT for helpful comments. GH and RW are funded by the International DFG Research Training Group GRK 1906/1.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Holley, G., Wittler, R., Stoye, J. (2015). Bloom Filter Trie – A Data Structure for Pan-Genome Storage. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science(), vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_16
Download citation
DOI: https://doi.org/10.1007/978-3-662-48221-6_16
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48220-9
Online ISBN: 978-3-662-48221-6
eBook Packages: Computer ScienceComputer Science (R0)