Sparse Binary Relation Representations for Genome Graph Annotation

Karasikov, Mikhail; Mustafa, Harun; Joudaki, Amir; Javadzadeh-No, Sara; Rätsch, Gunnar; Kahles, André

doi:10.1007/978-3-030-17083-7_8

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 11467))

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

2034 Accesses
1 Citations

Abstract

High-throughput DNA sequencing data is accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. While there has been good progress towards representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this work, we present a new compression approach, Multi-BRWT, which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 74.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

UK10K Project (2015). Accessed 2 Nov 2018. https://www.uk10k.org
Agarwala, R., et al.: Database resources of the national center for biotechnology information. Nucleic Acids Res. (2017). https://doi.org/10.1093/nar/gkw1071
Alipanahi, B., Muggli, M.D., Jundi, M., Noyes, N., Boucher, C.: Resistome SNP Calling via Read Colored de Bruijn Graphs. bioRxiv (2018). https://doi.org/10.1101/156174
Almodaresi, F., Pandey, P., Patro, R.: Rainbowfish: a succinct colored de Bruijn graph representation. In: Schwartz, R., Reinert, K. (eds.) 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), vol. 88, pp. 18:1–18:15. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl (2017). https://doi.org/10.4230/LIPIcs.WABI.2017.18
Álvarez-García, S., Brisaboa, N.: Compressed k2-triples for full-in-memory RDF engines. arXiv preprint arXiv:1105.4004 (2011)
Baraniuk, R., Davenport, M., DeVore, R., Wakin, M.: A simple proof of the restricted isometry property for random matrices. Constr. Approximation (2008). https://doi.org/10.1007/s00365-007-9003-x
Barbay, J., Claude, F., Navarro, G.: Compact binary relation representations with rich functionality. Inf. Comput. (2013). https://doi.org/10.1016/j.ic.2013.10.003
Brisaboa, N.R., Ladra, S., Navarro, G.: k²-trees for compact web graph representation. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 18–30. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03784-9_3
Chapter Google Scholar
Church, D.M., et al.: Modernizing reference genome assemblies. PLoS Biol. 9(7), e1001,091 (2011). https://doi.org/10.1371/journal.pbio.1001091
Ghosh, M., Gupta, I., Gupta, S., Kumar, N.: Fast compaction algorithms for NoSQL databases. In: Proceedings of International Conference on Distributed Computing Systems (2015). https://doi.org/10.1109/ICDCS.2015.53
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_28
Chapter Google Scholar
Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., McVean, G.: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. (2012). https://doi.org/10.1038/ng.1028
Kokot, M., Długosz, M., Deorowicz, S.: KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33(17), 2759–2761 (2017). https://doi.org/10.1093/bioinformatics/btx304
Muggli, M.D., et al.: Succinct colored de Bruijn graphs. Bioinformatics (2017). https://doi.org/10.1093/bioinformatics/btx067
Mustafa, H., et al.: Dynamic compression schemes for graph coloring. Bioinformatics (2018). https://doi.org/10.1093/bioinformatics/bty632
Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: Klasing, R. (ed.) SEA 2012. LNCS, vol. 7276, pp. 295–306. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30850-5_26
Chapter Google Scholar
Novak, A., Paten, B.: A graph extension of the positional Burrows Wheeler transform and its applications. Algorithms Mol. Biol. (2016). https://doi.org/10.1101/051409
Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proceedings of the Meeting on Algorithm Engineering & Expermiments, pp. 60–70. Society for Industrial and Applied Mathematics (2007). https://dl.acm.org/citation.cfm?id=2791194
O’Leary, N.A., et al.: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. (2016). https://doi.org/10.1093/nar/gkv1189
Pandey, P., Almodaresi, F., Bender, M.A., Ferdman, M., Johnson, R., Patro, R.: Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst. (2018). https://doi.org/10.1016/j.cels.2018.05.021
Raman, R., Raman, V., Satti, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4) (2007). https://doi.org/10.1145/1290672.1290680
Solomon, B., Kingsford, C.: Improved search of large transcriptomic sequencing databases using split sequence bloom trees. J. Comput. Biol. 25(7), 755–765 (2018). https://doi.org/10.1089/cmb.2017.0265
Stephens, Z.D., et al.: Big data: astronomical or genomical? PLoS Biol. (2015). https://doi.org/10.1371/journal.pbio.1002195
Weigel, D., Mott, R.: The 1001 genomes project for Arabidopsis thaliana. Genome Biol. 10(5), 107 (2009). https://doi.org/10.1186/gb-2009-10-5-107
Zhang, G.: Bird sequencing project takes off. Nature 522(7554), 34–34 (2015). https://doi.org/10.1038/522034d, http://www.nature.com/articles/522034d

Download references

Acknowledgements

We would like to thank the members of the Biomedical Informatics group for fruitful discussions and critical questions, and Torsten Hoefler and Mario Stanke for constructive feedback on the graph setup. Harun Mustafa and Mikhail Karasikov are funded by the Swiss National Science Foundation grant #407540_167331 “Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation” as part of Swiss National Research Programme (NRP) 75 “Big Data”.

Author information

Authors and Affiliations

Department of Computer Science, ETH Zurich, 8092, Zurich, Switzerland
Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch & André Kahles
University Hospital Zurich, Biomedical Informatics Research, 8091, Zurich, Switzerland
Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Gunnar Rätsch & André Kahles
SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
Mikhail Karasikov, Harun Mustafa, Gunnar Rätsch & André Kahles

Authors

Mikhail Karasikov
View author publications
You can also search for this author in PubMed Google Scholar
Harun Mustafa
View author publications
You can also search for this author in PubMed Google Scholar
Amir Joudaki
View author publications
You can also search for this author in PubMed Google Scholar
Sara Javadzadeh-No
View author publications
You can also search for this author in PubMed Google Scholar
Gunnar Rätsch
View author publications
You can also search for this author in PubMed Google Scholar
André Kahles
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Gunnar Rätsch or André Kahles .

Editor information

Editors and Affiliations

Tufts University, Cambridge, MA, USA
Lenore J. Cowen

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Karasikov, M., Mustafa, H., Joudaki, A., Javadzadeh-No, S., Rätsch, G., Kahles, A. (2019). Sparse Binary Relation Representations for Genome Graph Annotation. In: Cowen, L. (eds) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science(), vol 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_8

Download citation

DOI: https://doi.org/10.1007/978-3-030-17083-7_8
Published: 02 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17082-0
Online ISBN: 978-3-030-17083-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics