Abstract
The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure.
In this paper, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes (patterns of color occurrence) present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into the thousands.
We apply this encoding in the context of two different applications: the implicit cdbg used for a large-scale sequence search index, Mantis, and the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of the representation of the color information. In our experiment on 10,000 samples, we achieved more than \(11\times\) better compression compared to RRR.
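The idea of encoding each color class relative to a nearby one, with the dbg supplying the candidate neighbors, can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the abstract: color classes are modeled as integer bit masks (one bit per sample), candidate neighbors come only from successor k-mers (reverse complements are ignored), the hierarchy is a minimum spanning tree rooted at a virtual empty color class, and all names (`candidate_edges`, `build_parents`, `encode`, `decode`, `ROOT`) are hypothetical.

```python
import heapq

ROOT = -1  # hypothetical virtual node standing for the empty color class (all zeros)

def hamming(a, b):
    """Number of samples on which two color classes (bit masks) disagree."""
    return bin(a ^ b).count("1")

def candidate_edges(kmer_to_class):
    """Pairs of distinct color-class ids observed on adjacent k-mers of the dbg."""
    edges = set()
    for kmer, cls in kmer_to_class.items():
        for base in "ACGT":
            succ = kmer[1:] + base  # successor k-mer, overlapping by k-1 characters
            other = kmer_to_class.get(succ)
            if other is not None and other != cls:
                edges.add((min(cls, other), max(cls, other)))
    return edges

def build_parents(classes, edges):
    """Prim's algorithm from ROOT; edge weight is the Hamming distance."""
    adj = {i: [] for i in range(len(classes))}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # ROOT can reach every class at a cost equal to the number of set bits.
    heap = [(hamming(0, c), i, ROOT) for i, c in enumerate(classes)]
    heapq.heapify(heap)
    parent = {}
    while heap:
        _, node, par = heapq.heappop(heap)
        if node in parent:
            continue
        parent[node] = par
        for nxt in adj[node]:
            if nxt not in parent:
                weight = hamming(classes[node], classes[nxt])
                heapq.heappush(heap, (weight, nxt, node))
    return parent

def encode(classes, parent):
    """Store each class as the bit positions where it differs from its parent."""
    deltas = {}
    for i, par in parent.items():
        diff = classes[i] ^ (0 if par == ROOT else classes[par])
        deltas[i] = [b for b in range(diff.bit_length()) if (diff >> b) & 1]
    return deltas

def decode(i, parent, deltas):
    """Rebuild a class exactly by flipping delta bits along the path to ROOT."""
    vec = 0
    while i != ROOT:
        for b in deltas[i]:
            vec ^= 1 << b
        i = parent[i]
    return vec

# Toy input: four 3-mers labeled with ids of three color classes over six samples.
classes = [0b111101, 0b111100, 0b011100]
kmer_to_class = {"ACG": 0, "CGT": 1, "GTA": 2, "TAC": 0}
parent = build_parents(classes, candidate_edges(kmer_to_class))
deltas = encode(classes, parent)
```

In this toy input the three classes have 12 set bits in total, while the deltas store 5 positions, and `decode(i, parent, deltas)` recovers `classes[i]` exactly for every `i`. Decoding cost grows with the depth of the tree, which is a trade-off any scheme of this kind has to manage.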
Notes
1. The nodes of the de Bruijn graph are typically stored implicitly, because the node set is simply a function of the edge set E.
Acknowledgments and Declarations
This work was supported by the US National Science Foundation grants BIO-1564917, CCF-1439084, CCF-1716252, and CNS-1408695, and by National Institutes of Health grant R01HG009937. The experiments were conducted with equipment purchased through NSF CISE Research Infrastructure Grant Number 1405641. RP is a co-founder of Ocean Genomics.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Almodaresi, F., Pandey, P., Ferdman, M., Johnson, R., Patro, R. (2019). An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search. In: Cowen, L. (ed.) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science, vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_1
DOI: https://doi.org/10.1007/978-3-030-17083-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17082-0
Online ISBN: 978-3-030-17083-7
eBook Packages: Computer Science (R0)