Memory-Aware BWT by Segmenting Sequences to Support Subsequence Search

  • Jiaying Wang
  • Xiaochun Yang
  • Bin Wang
  • Huaijie Zhu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7235)

Abstract

The Burrows-Wheeler transform (BWT) has received significant attention in academia as a tool for subsequence matching. Although the BWT was designed to transform a sequence into a new sequence that is “easy to compress”, it can also serve as a full-text index. Building a traditional BWT index requires n log n + n log σ bits for a sequence of n characters, where σ is the size of the alphabet, so building one for a long sequence on a PC with limited memory is a major challenge. To address this problem, we propose a variant of the BWT index named S-BWT, which splits the source sequence into k segments and reduces the memory cost to n(log σ + log n − log k)/k bits. However, querying each segment separately with existing approaches risks missing some results. In this paper, we propose two query methods based on S-BWT that are guaranteed to find all subsequence occurrences. Our methods not only require little memory but are also faster than the state-of-the-art BWT backward search method on long sequences.
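The quantities in the abstract can be made concrete with a small Python sketch. It is not the authors' S-BWT or either of their two query methods. It is a naive, textbook-style BWT index with backward search, plus a deliberately simple per-segment search that illustrates how occurrences can be missed when segments are queried independently. All function names (`build_index`, `backward_search`, `naive_segment_search`, `bwt_bits`, `sbwt_bits`) are hypothetical, the logarithms are assumed to be base 2, and `$` is assumed not to occur in the text. Note that the S-BWT bound equals the traditional bound applied to a single segment of length n/k.

```python
import math

def build_index(text):
    """Naive BWT index: suffix array, BWT string, and C table.
    Assumes '$' does not occur in text and sorts before every character."""
    s = text + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # simple, not linear time
    bwt = "".join(s[i - 1] for i in sa)               # s[-1] is '$' when i == 0
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total                           # characters smaller than ch
        total += counts[ch]
    return sa, bwt, c_table

def backward_search(pattern, index):
    """Return sorted start positions of pattern, matching it right to left."""
    sa, bwt, c_table = index
    lo, hi = 0, len(bwt)                              # half-open interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + bwt[:lo].count(ch)         # Occ(ch, lo), computed naively
        hi = c_table[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

def naive_segment_search(text, pattern, k):
    """Index and query each of k segments on its own (illustration only).
    Occurrences that span a segment boundary are missed."""
    seg_len = math.ceil(len(text) / k)
    hits = []
    for start in range(0, len(text), seg_len):
        index = build_index(text[start:start + seg_len])
        hits.extend(start + p for p in backward_search(pattern, index))
    return sorted(hits)

def bwt_bits(n, sigma):
    """n log n + n log sigma, with log assumed to be base 2."""
    return n * math.log2(n) + n * math.log2(sigma)

def sbwt_bits(n, sigma, k):
    """n (log sigma + log n - log k) / k, with log assumed to be base 2."""
    return n * (math.log2(sigma) + math.log2(n) - math.log2(k)) / k

text = "abracadabra"
print(backward_search("cad", build_index(text)))      # [4]
print(naive_segment_search(text, "cad", 2))           # [] because "cad" spans the cut
print(bwt_bits(1024, 4), sbwt_bits(1024, 4, 4))       # 12288.0 2560.0
```

With k = 2 the text is cut into "abraca" and "dabra", so the occurrence of "cad" at position 4 straddles the boundary and neither segment index reports it. This is the kind of lost result that a segment-aware query method has to recover.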

Keywords

BWT, subsequence matching, full text index

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Jiaying Wang (1)
  • Xiaochun Yang (1)
  • Bin Wang (1)
  • Huaijie Zhu (1)
  1. College of Information Science and Engineering, Northeastern University, China
