An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search

Almodaresi, Fatemeh; Pandey, Prashant; Ferdman, Michael; Johnson, Rob; Patro, Rob

doi:10.1007/978-3-030-17083-7_1

An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search

Fatemeh Almodaresi¹⁵,
Prashant Pandey¹⁵,
Michael Ferdman¹⁵,
Rob Johnson^15,16 &
…
Rob Patro¹⁵

Conference paper
First Online: 02 April 2019

2157 Accesses
5 Citations
5 Altmetric

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 11467))

Abstract

The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure.

In this paper, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes—patterns of color occurrence—present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e. samples or references) grows into the thousands.

We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved more than \(11\times \) better compression compared to RRR.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 74.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

1.
The nodes of the de Bruijn graph are typical stored implicitly, because the node set is simply a function of E.

References

Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., McVean, G.: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44, 226–232 (2012). https://doi.org/10.1038/ng.102810.1038/ng.1028
Article Google Scholar
Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. National Acad. Sci. 98(17), 9748–9753 (2001)
Article MathSciNet Google Scholar
Pevzner, P.A., Tang, H.: Fragment assembly with double-barreled data. Bioinformatics 17(Suppl. 1), s225–s233 (2001)
Article Google Scholar
Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de bruijn graphs. In: Sharan, R. (ed.) RECOMB 2014. LNCS, vol. 8394, pp. 35–55. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-05269-4_4
Chapter Google Scholar
Prashant, P., Fatemeh, A., Bender, M.A., Ferdman, M., Johnson, R., Patro, R.: Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst. 7(2), 201–207.e4 (2018). https://doi.org/10.1016/j.cels.2018.05.021
Solomon, B., Kingsford, C.: Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34(3), 300–302 (2016)
Article Google Scholar
Solomon, B., Kingsford, C.: Improved search of large transcriptomic sequencing databases using split sequence bloom trees. In: Sahinalp, S.C. (ed.) RECOMB 2017. LNCS, vol. 10229, pp. 257–271. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56970-3_16
Chapter Google Scholar
Sun, C., Harris, R.S., Chikhi, R., Medvedev, P.: AllSome sequence bloom trees. In: Sahinalp, S.C. (ed.) RECOMB 2017. LNCS, vol. 10229, pp. 272–286. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56970-3_17
Chapter Google Scholar
Bradley, P., den Bakker, H., Rocha, E., McVean, G., Iqbal, Z.: Real-time search of all bacterial and viral genomic data. BioRxiv, p. 234955 (2017)
Google Scholar
Muggli, M.D., et al.: Succinct colored de bruijn graphs. Bioinformatics 33, 3181–3187 (2017)
Article Google Scholar
Holley, G., Wittler, R., Stoye, J.: Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol. Biol. 11(1), 3 (2016)
Article Google Scholar
Almodaresi, F., Pandey, P., Patro, R.: Rainbowfish: a succinct colored de Bruijn graph representation. In: LIPIcs-Leibniz International Proceedings in Informatics, vol. 88. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2017)
Google Scholar
Liu, B., Guo, H., Brudno, M., Wang, Y.: deBGA: read alignment with de Bruijn graph-based seed and extension. Bioinformatics 32(21), 3224–3232 (2016a)
Article Google Scholar
Chikhi, R., Rizk, G.: Space-efficient and exact de bruijn graph representation based on a bloom filter. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 236–248. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_19
Chapter Google Scholar
Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve the memory usage for de brujin graphs. Algorithms Mol. Biol. 9(1), 2 (2014)
Article Google Scholar
Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de bruijn graphs. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_18
Chapter Google Scholar
Crawford, V., Kuhnle, A., Boucher, C., Chikhi, R., Gagie, T., Hancock, J.: Practical dynamic de bruijn graphs. Bioinformatics 34, 4189–4195 (2018)
Article Google Scholar
Pandey, P., Bender, M.A., Johnson, R., Patro, R.: deBGR: an efficient and near-exact representation of the weighted de bruijn graph. Bioinformatics 33(14), i133–i141 (2017)
Article Google Scholar
Mustafa, H., Schilken, I., Karasikov, M., Eickhoff, C., Rätsch, G., Kahles, A.: Dynamic compression schemes for graph coloring. Bioinformatics, p. bty632 (2018). https://doi.org/10.1093/bioinformatics/bty632
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm (1994)
Google Scholar
Raman, R., Raman, V., Srinivasa Rao, S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 233–242. Society for Industrial and Applied Mathematics (2002)
Google Scholar
Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM (JACM) 21(2), 246–260 (1974)
Article MathSciNet Google Scholar
Raidl, G.R.: Exact and heuristic approaches for solving the bounded diameter minimum spanning tree problem. Ph.D. thesis (2008)
Google Scholar
Althaus, E., Funke, S., Har-Peled, S., Könemann, J., Ramos, E.A., Skutella, M.: Approximating k-hop minimum-spanning trees. Oper. Res. Lett. 33(2):115–120 (2005). https://doi.org/10.1016/j.orl.2004.05.005. http://www.sciencedirect.com/science/article/pii/S0167637704000719. ISSN 0167–6377
Manyem, P., Stallmann, M.F.M.: Some approximation results in multicasting. Technical report, Raleigh, NC, USA (1996)
Google Scholar
Khuller, S., Raghavachari, B., Young, N.E.: Balancing minimum spanning and shortest path trees. CoRR, cs.DS/0205045 (2002). http://arxiv.org/abs/cs.DS/0205045
Marathe, M.V., Ravi, R., Sundaram, R., Ravi, S.S., Rosenkrantz, D.J., Hunt III, H.B.: Bicriteria network design problems. CoRR, cs.CC/9809103 (1998). http://arxiv.org/abs/cs.CC/9809103
Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J.M., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)
Article Google Scholar
Schulz, M.H., Zerbino, D.R., Vingron, M., Birney, E.: Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28(8), 1086–1092 (2012)
Article Google Scholar
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)
Article Google Scholar
Grabherr, M.G., et al.: Full-length transcriptome assembly from RNA-seq data without a reference genome. Nature Biotechnol. 29(7), 644–652 (2011)
Article Google Scholar
Chang, Z., et al.: Bridger: a new framework for de novo transcriptome assembly using RNA-seq data. Genome Biol. 16(1), 30 (2015)
Article Google Scholar
Liu, J., et al.: Binpacker: packing-based de novo transcriptome assembly from RNA-seq data. PLOS Comput. Biol. 12(2), e1004772 (2016b)
Article Google Scholar
Almodaresi, F., Sarkar, H., Srivastava, A., Patro, R.: A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 34(13), i169–i177 (2018)
Article Google Scholar
Turner, I., Garimella, K.V., Iqbal, Z., McVean, G.: Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 34(15), 2556–2565 (2018). https://doi.org/10.1093/bioinformatics/bty157
Article Google Scholar
Alipanahi, B., Muggli, M.D., Jundi, M., Noyes, N., Boucher, C.: Resistome SNP calling via read colored de Bruijn graphs. bioRxiv, p. 156174 (2018)
Google Scholar
Alipanahi, B., Kuhnle, A., Boucher, C.: Recoloring the colored de Bruijn graph. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds.) SPIRE 2018. LNCS, vol. 11147, pp. 1–11. Springer, Cham (2018b). https://doi.org/10.1007/978-3-030-00479-8_1
Chapter Google Scholar
Pandey, P., Bender, M.A., Johnson, R., Patro, R.: A general-purpose counting filter: making every bit count. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 775–787. ACM (2017)
Google Scholar
Yu, Y., et al.: SeqOthello: querying RNA-seq experiments at scale. Genome Biol. 19(1), 167 (2018). https://doi.org/10.1186/s13059-018-1535-9. ISSN 1474–760X
Article Google Scholar
Ottaviano, G., Venturini, R.: Partitioned Elias-Fano Indexes. In: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 273–282. ACM (2014)
Google Scholar
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)
Article MathSciNet Google Scholar
Bookstein, A., Klein, S.T.: Compression of correlated bit-vectors. Inf. Syst. 16(4), 387–400 (1991)
Article Google Scholar
Pandey, P., Bender, M.A., Johnson, R., Patro, R.: Squeakr: an exact and approximate k-mer counting system. Bioinformatics, btx636 (2017). https://doi.org/10.1093/bioinformatics/btx636
NIH. SRA (2017). https://www.ebi.ac.uk/ena/browse. Accessed 06 Nov 2017
O’Leary, N.A., et al.: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. gkv1189 (2015)
Google Scholar
Yu, Y.W., Daniels, N.M., Danko, D.C., Berger, B.: Entropy-scaling search of massive biological data. Cell systems 1(2), 130–140 (2015)
Article Google Scholar

Download references

Acknowledgments and Declarations

This work was supported by the US National Science Foundation grants BIO-1564917, CCF-1439084, CCF-1716252, CNS-1408695, National Institutes of Health grant R01HG009937. The experiments were conducted with equipment purchased through NSF CISE Research Infrastructure Grant Number 1405641. RP is a co-founder of Ocean Genomics.

Author information

Authors and Affiliations

Computer Science Department, Stony Brook University, Stony Brook, USA
Fatemeh Almodaresi, Prashant Pandey, Michael Ferdman, Rob Johnson & Rob Patro
VMware Research, Palo Alto, USA
Rob Johnson

Authors

Fatemeh Almodaresi
View author publications
You can also search for this author in PubMed Google Scholar
Prashant Pandey
View author publications
You can also search for this author in PubMed Google Scholar
Michael Ferdman
View author publications
You can also search for this author in PubMed Google Scholar
Rob Johnson
View author publications
You can also search for this author in PubMed Google Scholar
Rob Patro
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Prashant Pandey .

Editor information

Editors and Affiliations

Tufts University, Cambridge, MA, USA
Lenore J. Cowen

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Almodaresi, F., Pandey, P., Ferdman, M., Johnson, R., Patro, R. (2019). An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science(), vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_1

Download citation

DOI: https://doi.org/10.1007/978-3-030-17083-7_1
Published: 02 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17082-0
Online ISBN: 978-3-030-17083-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics