Bloom Filter Trie – A Data Structure for Pan-Genome Storage

  • Conference paper
Algorithms in Bioinformatics (WABI 2015)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9289)

Abstract

High-throughput sequencing technologies have become fast and cheap in recent years. As a result, large-scale projects have started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts all colored k-mers (strings of length k) from the sequences and stores them in vertices. In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie. The data structure allows storing and compressing a set of colored k-mers, as well as efficiently traversing the graph. Experimental results show better performance compared to another state-of-the-art data structure.
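
To make the notion of colored k-mers concrete, the following is a minimal sketch of a naive, dictionary-based colored de Bruijn graph. It is not the Bloom Filter Trie and makes no attempt at compression; the function names `colored_kmers` and `successors`, the genome labels, and the toy sequences are all assumptions introduced here for illustration only.

```python
from collections import defaultdict


def colored_kmers(genomes, k):
    """Map each k-mer to the set of genome names (colors) that contain it.

    `genomes` is a hypothetical dict of name -> DNA string.
    """
    colors = defaultdict(set)
    for name, seq in genomes.items():
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


def successors(kmer, colors):
    """Return the stored k-mers that overlap `kmer` by its last k-1 characters."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in colors]


# Toy input: two short "genomes" that differ in a single position.
genomes = {"g1": "ACGTAC", "g2": "ACGTTC"}
graph = colored_kmers(genomes, k=3)

for kmer in sorted(graph):
    print(kmer, sorted(graph[kmer]))
print("successors of CGT:", successors("CGT", graph))
```

In this toy input, k-mers shared by both sequences carry both colors, and the graph branches where the sequences diverge. A plain dictionary stores every k-mer and color set explicitly, which is the kind of redundancy a compressed structure would aim to reduce.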

Notes

  1. https://github.com/lh3/wgsim

  2. http://www.7-zip.org/

Acknowledgments

The authors wish to thank the anonymous reviewers and the authors of SBT for helpful comments. GH and RW are funded by the International DFG Research Training Group GRK 1906/1.

Author information

Corresponding author

Correspondence to Guillaume Holley.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Holley, G., Wittler, R., Stoye, J. (2015). Bloom Filter Trie – A Data Structure for Pan-Genome Storage. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science, vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_16

  • DOI: https://doi.org/10.1007/978-3-662-48221-6_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-48220-9

  • Online ISBN: 978-3-662-48221-6

  • eBook Packages: Computer Science (R0)
