Skip to main content

An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 11467))

Abstract

The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure.

In this paper, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes—patterns of color occurrence—present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e. samples or references) grows into the thousands.

We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved more than \(11\times \) better compression compared to RRR.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   59.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   74.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

  1. 1.

    The nodes of the de Bruijn graph are typical stored implicitly, because the node set is simply a function of E.

References

  1. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., McVean, G.: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44, 226–232 (2012). https://doi.org/10.1038/ng.102810.1038/ng.1028

    Article  Google Scholar 

  2. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. National Acad. Sci. 98(17), 9748–9753 (2001)

    Article  MathSciNet  Google Scholar 

  3. Pevzner, P.A., Tang, H.: Fragment assembly with double-barreled data. Bioinformatics 17(Suppl. 1), s225–s233 (2001)

    Article  Google Scholar 

  4. Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de bruijn graphs. In: Sharan, R. (ed.) RECOMB 2014. LNCS, vol. 8394, pp. 35–55. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-05269-4_4

    Chapter  Google Scholar 

  5. Prashant, P., Fatemeh, A., Bender, M.A., Ferdman, M., Johnson, R., Patro, R.: Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst. 7(2), 201–207.e4 (2018). https://doi.org/10.1016/j.cels.2018.05.021

  6. Solomon, B., Kingsford, C.: Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34(3), 300–302 (2016)

    Article  Google Scholar 

  7. Solomon, B., Kingsford, C.: Improved search of large transcriptomic sequencing databases using split sequence bloom trees. In: Sahinalp, S.C. (ed.) RECOMB 2017. LNCS, vol. 10229, pp. 257–271. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56970-3_16

    Chapter  Google Scholar 

  8. Sun, C., Harris, R.S., Chikhi, R., Medvedev, P.: AllSome sequence bloom trees. In: Sahinalp, S.C. (ed.) RECOMB 2017. LNCS, vol. 10229, pp. 272–286. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56970-3_17

    Chapter  Google Scholar 

  9. Bradley, P., den Bakker, H., Rocha, E., McVean, G., Iqbal, Z.: Real-time search of all bacterial and viral genomic data. BioRxiv, p. 234955 (2017)

    Google Scholar 

  10. Muggli, M.D., et al.: Succinct colored de bruijn graphs. Bioinformatics 33, 3181–3187 (2017)

    Article  Google Scholar 

  11. Holley, G., Wittler, R., Stoye, J.: Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol. Biol. 11(1), 3 (2016)

    Article  Google Scholar 

  12. Almodaresi, F., Pandey, P., Patro, R.: Rainbowfish: a succinct colored de Bruijn graph representation. In: LIPIcs-Leibniz International Proceedings in Informatics, vol. 88. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2017)

    Google Scholar 

  13. Liu, B., Guo, H., Brudno, M., Wang, Y.: deBGA: read alignment with de Bruijn graph-based seed and extension. Bioinformatics 32(21), 3224–3232 (2016a)

    Article  Google Scholar 

  14. Chikhi, R., Rizk, G.: Space-efficient and exact de bruijn graph representation based on a bloom filter. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 236–248. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_19

    Chapter  Google Scholar 

  15. Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve the memory usage for de brujin graphs. Algorithms Mol. Biol. 9(1), 2 (2014)

    Article  Google Scholar 

  16. Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de bruijn graphs. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_18

    Chapter  Google Scholar 

  17. Crawford, V., Kuhnle, A., Boucher, C., Chikhi, R., Gagie, T., Hancock, J.: Practical dynamic de bruijn graphs. Bioinformatics 34, 4189–4195 (2018)

    Article  Google Scholar 

  18. Pandey, P., Bender, M.A., Johnson, R., Patro, R.: deBGR: an efficient and near-exact representation of the weighted de bruijn graph. Bioinformatics 33(14), i133–i141 (2017)

    Article  Google Scholar 

  19. Mustafa, H., Schilken, I., Karasikov, M., Eickhoff, C., Rätsch, G., Kahles, A.: Dynamic compression schemes for graph coloring. Bioinformatics, p. bty632 (2018). https://doi.org/10.1093/bioinformatics/bty632

  20. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm (1994)

    Google Scholar 

  21. Raman, R., Raman, V., Srinivasa Rao, S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 233–242. Society for Industrial and Applied Mathematics (2002)

    Google Scholar 

  22. Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM (JACM) 21(2), 246–260 (1974)

    Article  MathSciNet  Google Scholar 

  23. Raidl, G.R.: Exact and heuristic approaches for solving the bounded diameter minimum spanning tree problem. Ph.D. thesis (2008)

    Google Scholar 

  24. Althaus, E., Funke, S., Har-Peled, S., Könemann, J., Ramos, E.A., Skutella, M.: Approximating k-hop minimum-spanning trees. Oper. Res. Lett. 33(2):115–120 (2005). https://doi.org/10.1016/j.orl.2004.05.005. http://www.sciencedirect.com/science/article/pii/S0167637704000719. ISSN 0167–6377

  25. Manyem, P., Stallmann, M.F.M.: Some approximation results in multicasting. Technical report, Raleigh, NC, USA (1996)

    Google Scholar 

  26. Khuller, S., Raghavachari, B., Young, N.E.: Balancing minimum spanning and shortest path trees. CoRR, cs.DS/0205045 (2002). http://arxiv.org/abs/cs.DS/0205045

  27. Marathe, M.V., Ravi, R., Sundaram, R., Ravi, S.S., Rosenkrantz, D.J., Hunt III, H.B.: Bicriteria network design problems. CoRR, cs.CC/9809103 (1998). http://arxiv.org/abs/cs.CC/9809103

  28. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J.M., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)

    Article  Google Scholar 

  29. Schulz, M.H., Zerbino, D.R., Vingron, M., Birney, E.: Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28(8), 1086–1092 (2012)

    Article  Google Scholar 

  30. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)

    Article  Google Scholar 

  31. Grabherr, M.G., et al.: Full-length transcriptome assembly from RNA-seq data without a reference genome. Nature Biotechnol. 29(7), 644–652 (2011)

    Article  Google Scholar 

  32. Chang, Z., et al.: Bridger: a new framework for de novo transcriptome assembly using RNA-seq data. Genome Biol. 16(1), 30 (2015)

    Article  Google Scholar 

  33. Liu, J., et al.: Binpacker: packing-based de novo transcriptome assembly from RNA-seq data. PLOS Comput. Biol. 12(2), e1004772 (2016b)

    Article  Google Scholar 

  34. Almodaresi, F., Sarkar, H., Srivastava, A., Patro, R.: A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 34(13), i169–i177 (2018)

    Article  Google Scholar 

  35. Turner, I., Garimella, K.V., Iqbal, Z., McVean, G.: Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 34(15), 2556–2565 (2018). https://doi.org/10.1093/bioinformatics/bty157

    Article  Google Scholar 

  36. Alipanahi, B., Muggli, M.D., Jundi, M., Noyes, N., Boucher, C.: Resistome SNP calling via read colored de Bruijn graphs. bioRxiv, p. 156174 (2018)

    Google Scholar 

  37. Alipanahi, B., Kuhnle, A., Boucher, C.: Recoloring the colored de Bruijn graph. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds.) SPIRE 2018. LNCS, vol. 11147, pp. 1–11. Springer, Cham (2018b). https://doi.org/10.1007/978-3-030-00479-8_1

    Chapter  Google Scholar 

  38. Pandey, P., Bender, M.A., Johnson, R., Patro, R.: A general-purpose counting filter: making every bit count. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 775–787. ACM (2017)

    Google Scholar 

  39. Yu, Y., et al.: SeqOthello: querying RNA-seq experiments at scale. Genome Biol. 19(1), 167 (2018). https://doi.org/10.1186/s13059-018-1535-9. ISSN 1474–760X

    Article  Google Scholar 

  40. Ottaviano, G., Venturini, R.: Partitioned Elias-Fano Indexes. In: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 273–282. ACM (2014)

    Google Scholar 

  41. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)

    Article  MathSciNet  Google Scholar 

  42. Bookstein, A., Klein, S.T.: Compression of correlated bit-vectors. Inf. Syst. 16(4), 387–400 (1991)

    Article  Google Scholar 

  43. Pandey, P., Bender, M.A., Johnson, R., Patro, R.: Squeakr: an exact and approximate k-mer counting system. Bioinformatics, btx636 (2017). https://doi.org/10.1093/bioinformatics/btx636

  44. NIH. SRA (2017). https://www.ebi.ac.uk/ena/browse. Accessed 06 Nov 2017

  45. O’Leary, N.A., et al.: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. gkv1189 (2015)

    Google Scholar 

  46. Yu, Y.W., Daniels, N.M., Danko, D.C., Berger, B.: Entropy-scaling search of massive biological data. Cell systems 1(2), 130–140 (2015)

    Article  Google Scholar 

Download references

Acknowledgments and Declarations

This work was supported by the US National Science Foundation grants BIO-1564917, CCF-1439084, CCF-1716252, CNS-1408695, National Institutes of Health grant R01HG009937. The experiments were conducted with equipment purchased through NSF CISE Research Infrastructure Grant Number 1405641. RP is a co-founder of Ocean Genomics.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Prashant Pandey .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Almodaresi, F., Pandey, P., Ferdman, M., Johnson, R., Patro, R. (2019). An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science(), vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_1

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-17083-7_1

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-17082-0

  • Online ISBN: 978-3-030-17083-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics