AllSome Sequence Bloom Trees
The ubiquity of next-generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples in which a transcript of interest is potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39–85%, at the cost of up to 3x the memory consumption during queries. Notably, it can process a batch of 198,074 queries in under 8 h (compared to around two days previously) and a whole set of \(k\)-mers from a sequencing experiment (about 27 million \(k\)-mers) in under 11 min.
Keywords: Sequence Bloom Trees, Bloom filters, RNA-seq, Data structures, Algorithms, Bioinformatics
This work has been supported in part by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to PM.
- 1. SBT-SK software and data. http://www.cs.cmu.edu/%7Eckingsf/software/bloomtree/. Accessed 01 July 2016
- 7. Consortium, C.P.G., et al.: Computational pan-genomics: status, promises and challenges. Brief. Bioinform. bbw089 (2016)
- 20. Marchet, C., Limasset, A., Bittner, L., Peterlongo, P.: A resource-frugal probabilistic dictionary and applications in (meta) genomics (2016). arXiv preprint: arXiv:1605.08319
- 23. Minkin, I., Pham, S., Medvedev, P.: TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics btw609 (2016)
- 24. Murray, K.D., Webers, C., Ong, C.S., Borevitz, J.O., Warthmann, N.: kWIP: the k-mer weighted inner product, a de novo estimator of genetic similarity (2016). bioRxiv: 075481
- 26. Nellore, A., Collado-Torres, L., Jaffe, A.E., Alquicira-Hernández, J., Wilks, C., Pritt, J., Morton, J., Leek, J.T., Langmead, B.: Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics btw575 (2016)
- 28. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 233–242. Society for Industrial and Applied Mathematics (2002)
- 29. Rozov, R., Shamir, R., Halperin, E.: Fast lossless compression via cascading bloom filters. BMC Bioinform. 15(9), 1 (2014)
- 33. Sun, C., Harris, R.S., Chikhi, R., Medvedev, P.: Allsome sequence bloom trees. bioRxiv (2016). http://biorxiv.org/content/early/2016/12/02/090464