Fragmented BWT: An Extended BWT for Full-Text Indexing

  • Masaru Ito
  • Hiroshi Inoue
  • Kenjiro Taura
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9954)

Abstract

This paper proposes the Fragmented Burrows-Wheeler Transform (FBWT), an extension of the well-known BWT structure for full-text indexing and searching. An FBWT consists of a number of BWT fragments, each covering only a subset of the suffixes of the original string. Because constructing an FBWT does not entail building the BWT of the whole string, it is faster than constructing a BWT. On the other hand, searching with an FBWT can be more costly than searching with a BWT, since every fragment must be searched: the amount of work is \(O(dp + {\textit{occ}}\log ^{1+\epsilon }n)\), as opposed to \(O(p + {\textit{occ}}\log ^{1+\epsilon }n)\) for a regular BWT, where p is the length of the query string, n the length of the original text, occ the number of occurrences of the query string, and d the number of fragments. To compensate for this search cost, FBWT search can be accelerated with SIMD instructions by searching multiple fragments in parallel. Experiments show that building an FBWT is about twice as fast as building a BWT with a state-of-the-art algorithm (SA-IS), and that FBWT's search performance relative to BWT's depends on the number of occurrences, ranging from four times slower (when there are few occurrences) to twice as fast (when there are many).
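
To make the two cost terms concrete, the following is a minimal Python sketch, not the authors' implementation. `bwt_index` and `bwt_count` show the textbook backward search over a regular BWT, which performs one rank step per query character (the \(p\) term). `fragment_count` is a purely hypothetical toy: it assumes the suffixes are split into d groups by start position modulo d and searches each group separately with binary search, which illustrates why searching every fragment multiplies the pattern-matching work by d. The abstract does not say how FBWT assigns suffixes to fragments, so that partition, the function names, and the naive suffix sorting are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def bwt_index(text):
    """Build a naive BWT index (quadratic-time sorting, for illustration only)."""
    s = text + "\0"  # sentinel assumed smaller than every character in text
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # character preceding each sorted suffix

    # C[c]: number of characters in s that are strictly smaller than c
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return C, occ, len(s)


def bwt_count(pattern, C, occ, n):
    """Backward search: one rank step per pattern character."""
    lo, hi = 0, n  # half-open range of sorted suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


def fragment_count(text, pattern, d):
    """Hypothetical toy: search d disjoint groups of suffixes one by one.

    Group r holds the suffixes starting at positions r, r + d, r + 2d, ...
    This partition is an assumption, not the FBWT construction.
    """
    p = len(pattern)
    total = 0
    for r in range(d):
        # Suffixes of this group, truncated to the pattern length and sorted
        keys = sorted(text[i:i + p] for i in range(r, len(text), d))
        total += bisect_right(keys, pattern) - bisect_left(keys, pattern)
    return total
```

For the text `abracadabra`, both `bwt_count("abra", *bwt_index("abracadabra"))` and `fragment_count("abracadabra", "abra", 4)` count 2 occurrences; the second reaches that answer only after visiting all 4 groups.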

Keywords

  • Suffix array
  • Burrows-Wheeler transform
  • Full-text indexing

Notes

Acknowledgement

This work was in part supported by Grant-in-Aid for Scientific Research (A) 16H01715.

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
  2. IBM Research, Tokyo, Japan
