# EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices

## Abstract

The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a pattern by extending an infix of the pattern arbitrarily to the left or right. If \(\sigma\) is the size of the alphabet, then the method of Lam et al. can conduct one step in time \(\mathcal{O}(\sigma)\) while needing space \(\mathcal{O}(\sigma \cdot n)\), using constant time rank queries on bit vectors. Schnattinger and colleagues improved this time to \(\mathcal{O}(\log \sigma)\) while using \(\mathcal{O}(\log \sigma \cdot n)\) bits of space for both the FM and the 2FM index. This is achieved by the use of binary wavelet trees.
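To make the notion of a search step concrete, the following is a minimal sketch of backward search in a unidirectional FM index. It is a toy under stated assumptions, not the SeqAn implementation: the names `ToyFMIndex`, `extendLeft` and `count` are hypothetical, the text is assumed to end in a unique sentinel `$`, the suffix array is built by naive sorting, and `rank` is a linear scan rather than a constant time structure.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Toy FM index (hypothetical helper type, for illustration only).
struct ToyFMIndex {
    using Range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

    std::string bwt;                   // Burrows-Wheeler transform of the text
    std::array<std::size_t, 256> C{};  // C[c] = number of text characters smaller than c

    explicit ToyFMIndex(const std::string& text) {
        assert(!text.empty() && text.back() == '$');  // assumed sentinel
        const std::size_t n = text.size();

        // Naive suffix array: sort suffix start positions lexicographically.
        std::vector<std::size_t> sa(n);
        std::iota(sa.begin(), sa.end(), std::size_t{0});
        std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
            return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
        });

        // BWT: the character preceding each sorted suffix (cyclically).
        bwt.resize(n);
        for (std::size_t i = 0; i < n; ++i)
            bwt[i] = text[(sa[i] + n - 1) % n];

        // Cumulative character counts.
        std::array<std::size_t, 256> freq{};
        for (unsigned char ch : text) ++freq[ch];
        std::size_t sum = 0;
        for (std::size_t c = 0; c < 256; ++c) { C[c] = sum; sum += freq[c]; }
    }

    // Occurrences of c in bwt[0, i). A linear scan stands in for a real rank structure.
    std::size_t rank(unsigned char c, std::size_t i) const {
        assert(i <= bwt.size());
        return static_cast<std::size_t>(std::count(
            bwt.begin(), bwt.begin() + static_cast<std::ptrdiff_t>(i), static_cast<char>(c)));
    }

    // One search step: map the suffix array range of P to the range of cP.
    Range extendLeft(Range r, unsigned char c) const {
        return {C[c] + rank(c, r.first), C[c] + rank(c, r.second)};
    }

    // Number of occurrences of pattern, found by extending right to left.
    std::size_t count(const std::string& pattern) const {
        Range r{0, bwt.size()};
        for (auto it = pattern.rbegin(); it != pattern.rend() && r.first < r.second; ++it)
            r = extendLeft(r, static_cast<unsigned char>(*it));
        return r.second - r.first;
    }
};
```

Each call to `extendLeft` issues two rank queries, so the cost of one search step is determined by the cost of `rank`. For example, `ToyFMIndex("mississippi$").count("ssi")` evaluates to 2.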

In this paper, we introduce a new, practical method for conducting an exact search in a uni- and bidirectional FM index in \(\mathcal{O}(1)\) time per step while using \(\mathcal{O}(\log \sigma \cdot n) + o(\log \sigma \cdot \sigma \cdot n)\) bits of space. This is done by replacing the binary wavelet tree with a new data structure, the *Enhanced Prefixsum Rank dictionary* (EPR-dictionary).
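The abstract does not describe the layout of the EPR-dictionary, so the following only illustrates the general idea of a prefix-sum rank query, which is an assumption drawn from the name. The class `NaivePrefixSumRank` and its methods are hypothetical. It stores a full table of counts and therefore uses \(\mathcal{O}(\sigma \cdot n)\) machine words, far more than the space bound stated above. What it does show is that a single table lookup can answer "how many symbols \(\le c\) occur before position \(i\)", and that both an ordinary rank and the count of smaller symbols in a range follow from at most two such lookups, independent of \(\sigma\).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive prefix-sum rank table over symbols in [0, sigma).
// Hypothetical illustration only, not the data structure from the paper.
class NaivePrefixSumRank {
public:
    NaivePrefixSumRank(const std::vector<unsigned>& seq, unsigned sigma)
        : sigma_(sigma), table_((seq.size() + 1) * sigma, 0) {
        assert(sigma > 0);
        for (std::size_t i = 0; i < seq.size(); ++i) {
            assert(seq[i] < sigma);
            // Row i + 1 is row i plus one for every symbol c >= seq[i].
            for (unsigned c = 0; c < sigma; ++c)
                table_[(i + 1) * sigma + c] =
                    table_[i * sigma + c] + (seq[i] <= c ? 1 : 0);
        }
    }

    // Number of positions j < i with seq[j] <= c.
    std::size_t prefixRank(unsigned c, std::size_t i) const {
        return table_[i * sigma_ + c];
    }

    // Ordinary rank: number of positions j < i with seq[j] == c.
    std::size_t rank(unsigned c, std::size_t i) const {
        return prefixRank(c, i) - (c == 0 ? 0 : prefixRank(c - 1, i));
    }

    // Number of positions j in [lo, hi) with seq[j] < c.
    std::size_t smallerInRange(unsigned c, std::size_t lo, std::size_t hi) const {
        if (c == 0) return 0;
        return prefixRank(c - 1, hi) - prefixRank(c - 1, lo);
    }

private:
    unsigned sigma_;
    std::vector<std::size_t> table_;  // (n + 1) rows of sigma prefix-sum counts
};
```

A count of smaller symbols within a range is the kind of quantity a bidirectional step needs in order to keep the range of the reversed index in sync. With ordinary per-symbol rank queries it takes up to \(\sigma\) lookups, whereas the prefix-sum form needs two.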

We implemented this method in the SeqAn C++ library and experimentally validated our theoretical results. In addition, we compared our implementation with other freely available implementations of bidirectional indices and show that ours is faster by a factor of \(\approx 2.2\) to \(4.2\). This will have a large impact on many bioinformatics applications that rely on practical implementations of (2)FM indices, e.g., for read mapping. To our knowledge, this is the first implementation of a constant time method for a search step in 2FM indices.

### Keywords

FM index, Bidirectional, BWT, Bit vector, Rank queries, Read mapping

## Notes

### Acknowledgments

We would like to acknowledge Enrico Siragusa for his previous implementations of the FM index in SeqAn. The first author acknowledges the support of the International Max-Planck Research School for Computational Biology and Scientific Computing (IMPRS-CBSC). We also thank Veli Mäkinen and Simon Gog for very helpful remarks on a previous version of this manuscript during the Dagstuhl seminar 16351 “Next Generation Sequencing - Algorithms, and Software For Biomedical Applications”.

### References

- 1. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows-Wheeler transform. In: Bodlaender, H.L., Italiano, G.F. (eds.) ESA 2013. LNCS, vol. 8125, pp. 133–144. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40450-4_12
- 2. Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms **11**, 31 (2015)
- 3. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report (1994)
- 4. Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinform. **9**, 11 (2008). https://doi.org/10.1186/1471-2105-9-11
- 5. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Annual Symposium on Foundations of Computer Science (2000). https://doi.org/10.1109/SFCS.2000.892127
- 6. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms (TALG) **3**, 20 (2007)
- 7. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_28
- 8. Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2003)
- 9. Hauswedell, H., Singer, J., Reinert, K.: Lambda: the local aligner for massive biological data. Bioinformatics (Oxford, England) **30**, i349–i355 (2014). https://doi.org/10.1093/bioinformatics/btu439
- 10. Jacobson, G.J.: Succinct static data structures (1988)
- 11. Lam, T., Li, R., Tam, A., Wong, S., Wu, E.: High throughput short read alignment via bi-directional BWT. In: Proceedings of BIBM, pp. 31–36 (2009). https://doi.org/10.1109/BIBM.2009.42
- 12. Lam, T., Sung, W., Tam, S., Wong, C., Yiu, S.: Compressed indexing and local alignment of DNA. Bioinformatics **24**, 791–797 (2008). https://doi.org/10.1093/bioinformatics/btn032
- 13. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods **9**, 357–359 (2012)
- 14. Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (2013)
- 15. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics **25**, 1754–1760 (2009). https://doi.org/10.1093/bioinformatics/btp324
- 16. Meyer, F., Kurtz, S., Backofen, R., Will, S., Beckstette, M.: Structator: fast index-based search for RNA sequence-structure patterns. BMC Bioinform. **12**, 214 (2011). https://doi.org/10.1186/1471-2105-12-214
- 17. Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: International Symposium on Experimental Algorithms (2012). https://doi.org/10.1007/978-3-642-30850-5_26
- 18. Santiago, M., Sammeth, M., Guigo, R., Ribeca, P.: The GEM mapper: fast, accurate and versatile alignment by filtration. Nat. Methods **9**, 1185–1188 (2012). https://doi.org/10.1038/nmeth.2221
- 19. Schnattinger, T., Ohlebusch, E., Gog, S.: Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Inf. Comput. **213**, 13–22 (2012). https://doi.org/10.1016/j.ic.2011.03.007
- 20. Siragusa, E.: Approximate string matching for high-throughput sequencing. Ph.D. thesis, Freie Universität Berlin (2015)
- 21. Siragusa, E., Weese, D., Reinert, K.: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. **41**, e78–e78 (2013). https://doi.org/10.1093/nar/gkt005
- 22. Ye, Y., Choi, J.-H., Tang, H.: Rapsearch: a fast protein similarity search tool for short reads. BMC Bioinform. **12**, 1 (2011)