EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices

  • Christopher Pockrandt
  • Marcel Ehrhardt
  • Knut Reinert
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10229)

Abstract

The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a pattern by extending an infix of the pattern arbitrarily to the left or right. If \(\sigma\) is the size of the alphabet, the method of Lam et al. can conduct one step in time \(\mathcal{O}(\sigma)\) while needing space \(\mathcal{O}(\sigma \cdot n)\), using constant-time rank queries on bit vectors. Schnattinger and colleagues improved this time to \(\mathcal{O}(\log \sigma)\) while using \(\mathcal{O}(\log \sigma \cdot n)\) bits of space for both the FM and the 2FM index. This is achieved by the use of binary wavelet trees.
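To make the notion of a search step concrete, the following is a minimal sketch of unidirectional backward search in Python. It is not the implementation discussed in the paper: the function names `bwt` and `count_occurrences` are hypothetical, the BWT is built naively from sorted rotations, and the rank query simply scans the BWT, so each step here costs linear time rather than the bounds quoted above.

```python
def bwt(text):
    """Burrows-Wheeler transform of `text`, which must end in a unique '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(text, pattern):
    """Count occurrences of `pattern` in `text` by backward search."""
    last = bwt(text)

    # C[c] = number of characters in `text` that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)

    def rank(c, i):
        # Naive rank: occurrences of c in last[:i]. A real index answers
        # this with a rank dictionary instead of scanning.
        return last[:i].count(c)

    lo, hi = 0, len(text)  # half-open interval of matching suffixes
    for c in reversed(pattern):  # one search step per pattern character
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Each loop iteration is one search step and issues two rank queries, which is why the cost of a rank query determines the cost of a step.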

In this paper we introduce a new, practical method for conducting an exact search in uni- and bidirectional FM indices in \(\mathcal{O}(1)\) time per step while using \(\mathcal{O}(\log \sigma \cdot n) + o(\log \sigma \cdot \sigma \cdot n)\) bits of space. This is done by replacing the binary wavelet tree with a new data structure, the Enhanced Prefixsum Rank dictionary (EPR-dictionary).
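The abstract does not describe the internals of the EPR-dictionary, so the following Python sketch only illustrates the general idea suggested by its name: answering rank queries from prefix sums over the alphabet. Everything here is an assumption made for illustration, including the class name `PrefixSumRank`, the integer alphabet `0..sigma-1`, the block size, and the in-block scan, which stands in for whatever word-level operations a practical implementation would use.

```python
class PrefixSumRank:
    """Illustrative prefix-sum rank structure over symbols 0..sigma-1."""

    def __init__(self, seq, sigma, block=4):
        self.seq = list(seq)
        self.block = block
        self.samples = []      # samples[b][c] = #symbols <= c in seq[:b*block]
        counts = [0] * sigma   # running count of symbols <= c, for every c
        for i in range(len(self.seq) + 1):
            if i % block == 0:
                self.samples.append(counts.copy())
            if i < len(self.seq):
                for c in range(self.seq[i], sigma):
                    counts[c] += 1

    def prefix_rank(self, c, i):
        """Number of symbols <= c in seq[:i]."""
        if c < 0:
            return 0
        b = i // self.block
        start = b * self.block
        # Short scan inside one block; bounded by the block size.
        in_block = sum(1 for s in self.seq[start:i] if s <= c)
        return self.samples[b][c] + in_block

    def rank(self, c, i):
        """Number of symbols equal to c in seq[:i]."""
        return self.prefix_rank(c, i) - self.prefix_rank(c - 1, i)

    def count_smaller(self, c, lo, hi):
        """Number of symbols < c in seq[lo:hi]."""
        return self.prefix_rank(c - 1, hi) - self.prefix_rank(c - 1, lo)
```

With this layout an ordinary rank query is the difference of two prefix ranks. The number of symbols smaller than `c` inside an interval, a quantity that bidirectional search needs when it updates the interval of the reversed text, is likewise a difference of two prefix ranks. Neither depends on the alphabet size once the block size is fixed.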

We implemented this method in the SeqAn C++ library and experimentally validated our theoretical results. In addition, we compared our implementation with other freely available implementations of bidirectional indices and show that ours is approximately 2.2 to 4.2 times faster. This will have a large impact on many bioinformatics applications that rely on practical implementations of (2)FM indices, e.g., for read mapping. To our knowledge, this is the first implementation of a constant-time method for a search step in 2FM indices.

Keywords

FM index · Bidirectional · BWT · Bit vector · Rank queries · Read mapping

Notes

Acknowledgments

We would like to acknowledge Enrico Siragusa for his previous implementations of the FM index in SeqAn. The first author acknowledges the support of the International Max-Planck Research School for Computational Biology and Scientific Computing (IMPRS-CBSC). We also thank Veli Mäkinen and Simon Gog for very helpful remarks on a previous version of this manuscript during the Dagstuhl seminar 16351 “Next Generation Sequencing - Algorithms, and Software For Biomedical Applications”.

References

  1. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows-Wheeler transform. In: Bodlaender, H.L., Italiano, G.F. (eds.) ESA 2013. LNCS, vol. 8125, pp. 133–144. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40450-4_12
  2. Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms 11, 31 (2015)
  3. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report (1994)
  4. Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinform. 9, 11 (2008). https://doi.org/10.1186/1471-2105-9-11
  5. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Annual Symposium on Foundations of Computer Science (2000). https://doi.org/10.1109/SFCS.2000.892127
  6. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms (TALG) 3, 20 (2007)
  7. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_28
  8. Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2003)
  9. Hauswedell, H., Singer, J., Reinert, K.: Lambda: the local aligner for massive biological data. Bioinformatics (Oxford, England) 30, i349–i355 (2014). https://doi.org/10.1093/bioinformatics/btu439
  10. Jacobson, G.J.: Succinct static data structures (1988)
  11. Lam, T., Li, R., Tam, A., Wong, S., Wu, E.: High throughput short read alignment via bi-directional BWT. In: Proceedings of BIBM, pp. 31–36 (2009). https://doi.org/10.1109/BIBM.2009.42
  12. Lam, T., Sung, W., Tam, S., Wong, C., Yiu, S.: Compressed indexing and local alignment of DNA. Bioinformatics 24, 791–797 (2008). https://doi.org/10.1093/bioinformatics/btn032
  13. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012)
  14. Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (2013)
  15. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009). https://doi.org/10.1093/bioinformatics/btp324
  16. Meyer, F., Kurtz, S., Backofen, R., Will, S., Beckstette, M.: Structator: fast index-based search for RNA sequence-structure patterns. BMC Bioinform. 12, 214 (2011). https://doi.org/10.1186/1471-2105-12-214
  17. Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: International Symposium on Experimental Algorithms (2012). https://doi.org/10.1007/978-3-642-30850-5_26
  18. Santiago, M., Sammeth, M., Guigo, R., Ribeca, P.: The GEM mapper: fast, accurate and versatile alignment by filtration. Nat. Methods 9, 1185–1188 (2012). https://doi.org/10.1038/nmeth.2221
  19. Schnattinger, T., Ohlebusch, E., Gog, S.: Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Inf. Comput. 213, 13–22 (2012). https://doi.org/10.1016/j.ic.2011.03.007
  20. Siragusa, E.: Approximate string matching for high-throughput sequencing. Ph.D. thesis, Freie Universität Berlin (2015)
  21. Siragusa, E., Weese, D., Reinert, K.: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 41, e78–e78 (2013). https://doi.org/10.1093/nar/gkt005
  22. Ye, Y., Choi, J.-H., Tang, H.: Rapsearch: a fast protein similarity search tool for short reads. BMC Bioinform. 12, 1 (2011)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Christopher Pockrandt (1, 2)
  • Marcel Ehrhardt (1)
  • Knut Reinert (1)
  1. Department of Computer Science and Mathematics, Freie Universität Berlin, Berlin, Germany
  2. International Max Planck Research School for Computational Biology and Scientific Computation, Berlin, Germany
