AllSome Sequence Bloom Trees

  • Chen Sun
  • Robert S. Harris
  • Rayan Chikhi
  • Paul Medvedev
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10229)


The ubiquity of next-generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples in which a transcript of interest is potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39–85%, at the cost of up to 3x memory consumption during queries. Notably, it can process a batch of 198,074 queries in under 8 h (compared to around two days previously) and a whole set of \(k\)-mers from a sequencing experiment (about 27 million \(k\)-mers) in under 11 min.


Keywords: Sequence Bloom Trees, Bloom filters, RNA-seq, Data structures, Algorithms, Bioinformatics



This work has been supported in part by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to PM.


References

  1. SBT-SK software and data. Accessed 01 July 2016
  2. Baier, U., Beller, T., Ohlebusch, E.: Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics 32, 497–504 (2015)
  3. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
  4. Bray, N.L., Pimentel, H., Melsted, P., Pachter, L.: Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34(5), 525–527 (2016)
  5. Chambi, S., Lemire, D., Kaser, O., Godin, R.: Better bitmap performance with roaring bitmaps. Softw. Pract. Exp. 46(5), 709–719 (2015)
  6. Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a bloom filter. Algorithms Mol. Biol. 8(1), 1 (2013)
  7. Consortium, C.P.G., et al.: Computational pan-genomics: status, promises and challenges. Brief. Bioinform. bbw089 (2016)
  8. Crainiceanu, A., Lemire, D.: Bloofi: multidimensional bloom filters. Inf. Syst. 54, 311–324 (2015)
  9. Dolle, D.D., Liu, Z., Cotten, M.L., Simpson, J.T., Iqbal, Z., Durbin, R., McCarthy, S., Keane, T.: Using reference-free compressed data structures to analyse sequencing reads from thousands of human genomes. Genome Res. 27, 300–309 (2016)
  10. Ernst, C., Rahmann, S.: PanCake: a data structure for pangenomes. Ger. Conf. Bioinform. 34, 35–45 (2013)
  11. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). doi: 10.1007/978-3-319-07959-2_28
  12. Heo, Y., Wu, X.L., Chen, D., Ma, J., Hwu, W.M.: BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics 30, 1354–1362 (2014)
  13. Holley, G., Wittler, R., Stoye, J.: Bloom filter trie – a data structure for pan-genome storage. In: Pop, M., Touzet, H. (eds.) WABI 2015. LNCS, vol. 9289, pp. 217–230. Springer, Heidelberg (2015). doi: 10.1007/978-3-662-48221-6_16
  14. de Hoon, M.J., Imoto, S., Nolan, J., Miyano, S.: Open source clustering software. Bioinformatics 20(9), 1453–1454 (2004)
  15. Leinonen, R., Sugawara, H., Shumway, M.: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2010)
  16. Liu, B., Zhu, D., Wang, Y.: deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding. Bioinformatics 32(12), i174–i182 (2016)
  17. Loh, P.R., Baym, M., Berger, B.: Compressive genomics. Nat. Biotechnol. 30(7), 627–630 (2012)
  18. Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I.: Genome-Scale Algorithm Design. Cambridge University Press, Cambridge (2015)
  19. Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011)
  20. Marchet, C., Limasset, A., Bittner, L., Peterlongo, P.: A resource-frugal probabilistic dictionary and applications in (meta) genomics (2016). arXiv preprint: arXiv:1605.08319
  21. Marcus, S., Lee, H., Schatz, M.C.: SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics 30(24), 3476–3483 (2014)
  22. Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinform. 12(1), 333 (2011)
  23. Minkin, I., Pham, S., Medvedev, P.: TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics btw609 (2016)
  24. Murray, K.D., Webers, C., Ong, C.S., Borevitz, J.O., Warthmann, N.: kWIP: the k-mer weighted inner product, a de novo estimator of genetic similarity (2016). bioRxiv: 075481
  25. Navarro, G., De Moura, E.S., Neubert, M., Ziviani, N., Baeza-Yates, R.: Adding compression to block addressing inverted indexes. Inf. Retr. 3(1), 49–77 (2000)
  26. Nellore, A., Collado-Torres, L., Jaffe, A.E., Alquicira-Hernández, J., Wilks, C., Pritt, J., Morton, J., Leek, J.T., Langmead, B.: Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics btw575 (2016)
  27. Patro, R., Mount, S.M., Kingsford, C.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32(5), 462–464 (2014)
  28. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 233–242. Society for Industrial and Applied Mathematics (2002)
  29. Rozov, R., Shamir, R., Halperin, E.: Fast lossless compression via cascading bloom filters. BMC Bioinform. 15(9), 1 (2014)
  30. Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve the memory usage for de Brujin graphs. In: Darling, A., Stoye, J. (eds.) WABI 2013. LNCS, vol. 8126, pp. 364–376. Springer, Heidelberg (2013). doi: 10.1007/978-3-642-40453-5_28
  31. Solomon, B., Kingsford, C.: Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34(3), 300–302 (2016)
  32. Stranneheim, H., Käller, M., Allander, T., Andersson, B., Arvestad, L., Lundeberg, J.: Classification of DNA sequences using bloom filters. Bioinformatics 26(13), 1595–1600 (2010)
  33. Sun, C., Harris, R.S., Chikhi, R., Medvedev, P.: Allsome sequence bloom trees. bioRxiv (2016)
  34. Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., Pachter, L.: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7(3), 562–578 (2012)
  35. Yu, Y.W., Daniels, N.M., Danko, D.C., Berger, B.: Entropy-scaling search of massive biological data. Cell Syst. 1(2), 130–140 (2015)
  36. Ziviani, N., de Moura, E.S., Navarro, G., Baeza-Yates, R.: Compression: a key for next-generation text retrieval systems. IEEE Comput. 33(11), 37–44 (2000)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Chen Sun (1)
  • Robert S. Harris (2)
  • Rayan Chikhi (3)
  • Paul Medvedev (1, 4, 5)

  1. Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA
  2. Department of Biology, The Pennsylvania State University, University Park, USA
  3. CNRS, CRIStAL, University of Lille, Lille, France
  4. Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, USA
  5. Genome Sciences Institute of the Huck, The Pennsylvania State University, University Park, USA
