Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2017)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10229)

Abstract

Enormous databases of short-read RNA-seq experiments, such as the NIH Sequence Read Archive (SRA), are now available. These databases could answer many questions about condition-specific expression or population variation, and this resource is only going to grow over time. However, these collections remain difficult to use because there is no practical way to search them for a particular expressed sequence. While some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called Split Sequence Bloom Tree (SSBT) to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the Sequence Bloom Tree (SBT) [1] data structure for the same task. We apply SSBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2,652 publicly available RNA-seq experiments from the SRA for breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in under 4 min using a single thread and can be stored in just 39 GB, a five-fold improvement in search and storage costs compared to SBT.
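The abstract does not spell out how the index is organized, so the following is only a minimal sketch of the general Sequence Bloom Tree idea from [1] that SSBT builds on: each experiment's k-mers go into a Bloom filter at a leaf, internal nodes hold the union of their children, and a query descends only into subtrees that contain at least a fraction `theta` of the query's k-mers. The split-filter refinement that distinguishes SSBT is not shown. All names, the filter size, the hash choice, `k`, and `theta` are illustrative assumptions, not the authors' implementation or parameters.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over strings; a Python int serves as the bit vector."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):  # sizes are assumptions
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # One salted BLAKE2b digest per hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


class Node:
    """Tree node; leaves carry an experiment name, internal nodes do not."""

    def __init__(self, bloom, name=None, left=None, right=None):
        self.bloom, self.name, self.left, self.right = bloom, name, left, right


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_tree(experiments, k):
    """Build a binary tree from a non-empty dict: experiment name -> reads."""
    nodes = []
    for name, reads in experiments.items():
        bloom = BloomFilter()
        for read in reads:
            for kmer in kmers(read, k):
                bloom.add(kmer)
        nodes.append(Node(bloom, name=name))
    while len(nodes) > 1:  # pair adjacent nodes; a real index would cluster them
        paired = [
            Node(a.bloom.union(b.bloom), left=a, right=b)
            for a, b in zip(nodes[0::2], nodes[1::2])
        ]
        if len(nodes) % 2:
            paired.append(nodes[-1])
        nodes = paired
    return nodes[0]


def query(node, query_kmers, theta):
    """Return names of experiments holding >= theta of the query k-mers."""
    hits = sum(1 for kmer in query_kmers if kmer in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: no leaf below this node can pass the threshold
    if node.name is not None:
        return [node.name]
    return query(node.left, query_kmers, theta) + query(node.right, query_kmers, theta)


# Hypothetical toy data standing in for sequencing experiments.
experiments = {
    "A": ["ACGTACGTTAGCCGATAGCTAGGCTA"],
    "B": ["TTTTGGGGCCCCAAAATTTTGGGGCC"],
    "C": ["GATTACAGATTACAGATTACACCGGT"],
}
tree = build_tree(experiments, k=8)
print(query(tree, kmers("ACGTTAGCCGATAGC", 8), theta=0.9))
```

Because Bloom filters have no false negatives, an experiment that truly contains the query k-mers is never pruned; false positives can only add extra candidates, which is why the threshold `theta` is applied at every level.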

References

  1. Solomon, B., Kingsford, C.: Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300–302 (2016)

  2. Leinonen, R., Sugawara, H., Shumway, M., The International Nucleotide Sequence Database Collaboration: The sequence read archive. Nucleic Acids Res. 39(Database issue), D19–D21 (2011)

  3. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L.: BLAST+: architecture and applications. BMC Bioinform. 10(1), 421 (2009)

  4. Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994)

  5. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

  6. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)

  7. Grossi, R., Vitter, J.S., Xu, B.: Wavelet trees: from theory to practice. In: 2011 First International Conference on Data Compression, Communications and Processing (CCP), pp. 210–221. IEEE (2011)

  8. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39, 2 (2007)

  9. Ziviani, N., Moura, E., Navarro, G., Baeza-Yates, R.: Compression: a key for next-generation text retrieval systems. IEEE Comput. 33, 37–44 (2000)

  10. Navarro, G., Moura, E., Neubert, M., Ziviani, N., Baeza-Yates, R.: Adding compression to block addressing inverted indexes. Inf. Retrieval 3, 49–77 (2000)

  11. Loh, P.-R., Baym, M., Berger, B.: Compressive genomics. Nat. Biotechnol. 30(7), 627–630 (2012)

  12. Daniels, N.M., Gallant, A., Peng, J., Cowen, L.J., Baym, M., Berger, B.: Compressive genomics for protein databases. Bioinformatics 29(13), i283–i290 (2013)

  13. Yu, Y.W., Daniels, N.M., Danko, D.C., Berger, B.: Entropy-scaling search of massive biological data. Cell Syst. 1(2), 130–140 (2015)

  14. Holley, G., Wittler, R., Stoye, J.: Bloom filter trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol. Biol. 11(1), 1 (2016)

  15. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

  16. Broder, A., Mitzenmacher, M.: Network applications of bloom filters: a survey. Internet Math. 1(4), 485–509 (2005)

  17. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002, Philadelphia, PA, USA, pp. 233–242. Society for Industrial and Applied Mathematics (2002)

  18. Vigna, S.: Broadword implementation of rank/select queries. In: McGeoch, C.C. (ed.) WEA 2008. LNCS, vol. 5038, pp. 154–168. Springer, Heidelberg (2008). doi:10.1007/978-3-540-68552-4_12

  19. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). doi:10.1007/978-3-319-07959-2_28

  20. Rasmussen, K., Stoye, J., Myers, E.: Efficient q-gram filters for finding all ε-matches over a given length. J. Comput. Biol. 13(2), 296–308 (2006)

  21. Philippe, N., Salson, M., Commes, T., Rivals, E.: CRAC: an integrated approach to the analysis of RNA-seq reads. Genome Biol. 14(3), R30 (2013)

  22. Zhang, Q., Pell, J., Canino-Koning, R., Howe, A.C., Brown, C.T.: These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS ONE 9(7), e101271 (2014)

  23. Brown, T., Howe, A., Zhang, Q., Pyrkosz, A., Brom, T.: A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv:1203.4802 [q-bio.GN]

  24. Patro, R., Mount, S.M., Kingsford, C.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32, 462–464 (2014)

Acknowledgements

We would like to thank Hao Wang, Natalie Sauerwald, Cong Ma, Tim Wall, Mingfu Sho, and especially Guillaume Marçais, Dan DeBlasio, and Heewook Lee for valuable discussions and comments on the manuscript. This research is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to Carl Kingsford. It is partially funded by the US National Science Foundation (CCF-1256087, CCF-1319998) and the US National Institutes of Health (R21HG006913, R01HG007104). C.K. received support as an Alfred P. Sloan Research Fellow. B.S. is a predoctoral trainee supported by US National Institutes of Health training grant T32 EB009403 as part of the Howard Hughes Medical Institute (HHMI)-National Institute of Biomedical Imaging and Bioengineering (NIBIB) Interfaces Initiative.

Author information

Corresponding author

Correspondence to Carl Kingsford.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Solomon, B., Kingsford, C. (2017). Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees. In: Sahinalp, S. (ed.) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_16

  • DOI: https://doi.org/10.1007/978-3-319-56970-3_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56969-7

  • Online ISBN: 978-3-319-56970-3

  • eBook Packages: Computer Science (R0)
