Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees

Solomon, Brad; Kingsford, Carl

doi:10.1007/978-3-319-56970-3_16

Brad Solomon¹⁴ &
Carl Kingsford¹⁵

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 10229))

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

2056 Accesses
11 Citations
6 Altmetric

Abstract

Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequencing Read Archive (SRA) are now available. These databases could answer many questions about the condition-specific expression or population variation, and this resource is only going to grow over time. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. While some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called Split Sequence Bloom Tree (SSBT) to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the SBT [1] data structure for the same task. We apply SSBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2,652 publicly available RNA-seq experiments contained in the NIH for the breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in under 4 min using a single thread and can be stored in just 39 GB, a five-fold improvement in search and storage costs compared to SBT.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Solomon, B., Kingsford, C.: Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300–302 (2016)
Article Google Scholar
Leinonen, R., Sugawara, H., Shumway, M., The International Nucleotide Sequence Database Collaboration: The sequence read archive. Nucleic Acids Res. 39(Database issue), D19–D21 (2011)
Article Google Scholar
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L.: BLAST+: architecture and applications. BMC Bioinform. 10(1), 421 (2009)
Article Google Scholar
Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994)
Google Scholar
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Article MathSciNet MATH Google Scholar
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)
Article MathSciNet MATH Google Scholar
Grossi, R., Vitter, J.S., Xu, B.: Wavelet trees: from theory to practice. In: 2011 First International Conference on Data Compression, Communications and Processing (CCP), pp. 210–221. IEEE (2011)
Google Scholar
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39, 2 (2007)
Article MATH Google Scholar
Ziviani, N., Moura, E., Navarro, G., Baeza-Yates, R.: Compression: a key for next-generation text retrieval systems. IEEE Comput. 33, 37–44 (2000)
Article Google Scholar
Navarro, G., Moura, E., Neubert, M., Ziviani, N., Baeza-Yates, R.: Adding compression to block addressing inverted indexes. Inf. Retrieval 3, 49–77 (2000)
Article Google Scholar
Loh, P.-R., Baym, M., Berger, B.: Compressive genomics. Nat. Biotechnol. 30(7), 627–630 (2012)
Article Google Scholar
Daniels, N.M., Gallant, A., Peng, J., Cowen, L.J., Baym, M., Berger, B.: Compressive genomics for protein databases. Bioinformatics 29(13), i283–i290 (2013)
Article Google Scholar
Yu, Y.W., Daniels, N.M., Danko, D.C., Berger, B.: Entropy-scaling search of massive biological data. Cell Syst. 1(2), 130–140 (2015)
Article Google Scholar
Holley, G., Wittler, R., Stoye, J.: Bloom filter trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol. Biol. 11(1), 1 (2016)
Article Google Scholar
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Article MATH Google Scholar
Broder, A., Mitzenmacher, M.: Network applications of bloom filters: a survey. Internet Math. 1(4), 485–509 (2005)
Article MathSciNet MATH Google Scholar
Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002, Philadelphia, PA, USA, pp. 233–242. Society for Industrial and Applied Mathematics (2002)
Google Scholar
Vigna, S.: Broadword implementation of rank/select queries. In: McGeoch, C.C. (ed.) WEA 2008. LNCS, vol. 5038, pp. 154–168. Springer, Heidelberg (2008). doi:10.1007/978-3-540-68552-4_12
Chapter Google Scholar
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). doi:10.1007/978-3-319-07959-2_28
Google Scholar
Rasmussen, K., Stoye, J., Myers, E.: Efficient q-gram filters for finding all \(\epsilon \)-matches over a given length. J. Comput. Biol. 13(2), 296–308 (2006)
Article MathSciNet MATH Google Scholar
Philippe, N., Salson, M., Commes, T., Rivals, E.: CRAC: an integrated approach to the analysis of RNA-seq reads. Genome Biol. 14(3), R30 (2013)
Article Google Scholar
Zhang, Q., Pell, J., Canino-Koning, R., Howe, A.C., Brown, C.T.: These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS ONE 9(7), e101271 (2014)
Article Google Scholar
Brown, T., Howe, A., Zhang, Q., Pyrkosz, A., Brom, T.: A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv:1203.4802 [q-bio.GN]
Patro, R., Mount, S.M., Kingsford, C.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32, 462–464 (2014)
Article Google Scholar

Download references

Acknowledgements

We would like to thank Hao Wang, Natalie Sauerwald, Cong Ma, Tim Wall, Mingfu Sho, and especially Guillaume Marçais, Dan DeBlasio, and Heewook Lee for valuable discussions and comments on the manuscript. This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to Carl Kingsford. It is partially funded by the US National Science Foundation (CCF-1256087, CCF-1319998) and the US National Institutes of Health (R21HG006913, R01HG007104). C.K. received support as an Alfred P. Sloan Research Fellow. B.S. is a predoctoral trainee supported by US National Institutes of Health training grant T32 EB009403 as part of the Howard Hughes Medical Institute (HHMI)-National Institute of Biomedical Imaging and Bioengineering (NIBIB) Interfaces Initiative.

Author information

Authors and Affiliations

Computational Biology Department, School of Computer Science, Joint Carnegie Mellon University–University of Pittsburgh Ph.D. Program in Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Brad Solomon
Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, Pennsylvania, USA
Carl Kingsford

Authors

Brad Solomon
View author publications
You can also search for this author in PubMed Google Scholar
Carl Kingsford
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Carl Kingsford .

Editor information

Editors and Affiliations

Indiana University Bloomington, Bloomington, Indiana, USA
S. Cenk Sahinalp

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Solomon, B., Kingsford, C. (2017). Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees. In: Sahinalp, S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science(), vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_16

Download citation

DOI: https://doi.org/10.1007/978-3-319-56970-3_16
Published: 12 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56969-7
Online ISBN: 978-3-319-56970-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics