Faster Approximate Pattern Matching in Compressed Repetitive Texts

  • Travis Gagie
  • Paweł Gawrychowski
  • Simon J. Puglisi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7074)

Abstract

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction, and pattern matching on the underlying string(s).

Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an \(\mathcal{O}(r)\)-word data structure that allows us to extract any substring s[i..j] in \(\mathcal{O}(\log n + j - i)\) time. They also showed how, given a pattern p of length m and an edit distance k ≤ m, their data structure supports finding all occ approximate matches to p in s in \(\mathcal{O}(r (\min(m k, k^4 + m) + \log n) + \mathsf{occ})\) time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with \(\mathcal{O}(z \log n)\) rules. In this paper we give a simple \(\mathcal{O}(z \log n)\)-word data structure that takes the same time for substring extraction but only \(\mathcal{O}(z \min(m k, k^4 + m) + \mathsf{occ})\) time for approximate pattern matching.
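
For readers unfamiliar with straight-line programs, the following minimal Python sketch shows one way to represent such a grammar and to extract a substring from it without decompressing the whole string. It is an illustration only, not the data structure of this paper or of Bille et al.: the names `rule_lengths`, `char_at` and `extract`, the list-based rule encoding, and the example grammar are all assumptions made here, and the descent costs time proportional to the grammar height per character rather than the \(\mathcal{O}(\log n + j - i)\) bound quoted above.

```python
def rule_lengths(rules):
    """Return the expansion length of every rule.

    Assumed encoding: rules[i] is either a single character (terminal)
    or a pair (left, right) of indices of earlier rules.
    """
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def char_at(rules, lengths, root, i):
    """Return character i (0-based) of the expansion of rule `root`."""
    node = root
    while not isinstance(rules[node], str):
        left, right = rules[node]
        if i < lengths[left]:
            node = left            # position falls in the left child
        else:
            i -= lengths[left]     # skip the left child's expansion
            node = right
    return rules[node]


def extract(rules, lengths, root, i, j):
    """Return s[i..j] (inclusive), naively one character at a time."""
    return "".join(char_at(rules, lengths, root, pos) for pos in range(i, j + 1))


# Hypothetical example grammar: six rules generating "abaababa".
rules = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
lengths = rule_lengths(rules)
root = len(rules) - 1
print(extract(rules, lengths, root, 2, 5))
```

Storing only the expansion length of each rule is enough to steer the descent, which is why the sketch uses r words of extra space for r rules.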

Keywords

Internal Node, Edit Distance, Approximate Match, Block Graph, Phrase Boundary

References

  1. Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel-Ziv based compressed text indexing. Algorithmica (to appear)
  2. Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings. In: Proceedings of the 22nd Symposium on Discrete Algorithms, SODA (2011)
  3. Cole, R., Hariharan, R.: Approximate string matching: A simpler faster algorithm. SIAM Journal on Computing 31(6), 1761–1782 (2002)
  4. Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science 372(1), 115–121 (2007)
  5. Ferragina, P., Manzini, G.: On compressing the textual web. In: WSDM 2010: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 391–400. ACM, New York (2010)
  6. Gagie, T., Gawrychowski, P.: Grammar-Based Compression in a Streaming Model. In: Dediu, A.-H., Fernau, H., Martín-Vide, C. (eds.) LATA 2010. LNCS, vol. 6031, pp. 273–284. Springer, Heidelberg (2010)
  7. Genome 10K Community of Scientists: A proposal to obtain whole-genome sequence for 10,000 vertebrate species. Journal of Heredity 100, 659–674 (2009)
  8. González, R., Navarro, G.: Compressed Text Indexes with Fast Locate. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 216–227. Springer, Heidelberg (2007)
  9. Kreft, S., Navarro, G.: LZ77-like compression with fast random access. In: Proceedings of the Data Compression Conference, DCC (2010)
  10. Kreft, S., Navarro, G.: Self-Indexing Based on LZ77. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661, pp. 41–54. Springer, Heidelberg (2011)
  11. Landau, G.M., Vishkin, U.: Fast parallel and serial approximate string matching. Journal of Algorithms 10(2), 157–169 (1989)
  12. Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology 17(3), 281–308 (2010)
  13. Manzini, G.: An analysis of the Burrows-Wheeler transform. Journal of the ACM 48(3), 407–430 (2001)
  14. Durbin, R., et al.: 1000 genomes (2010), http://www.1000genomes.org/
  15. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science 302(1-3), 211–222 (2003)
  16. Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-Length Compressed Indexes are Superior for Highly Repetitive Sequence Collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008)
  17. Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. Journal of the ACM 29(4), 928–951 (1982)
  18. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), 337–343 (1977)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Travis Gagie (1)
  • Paweł Gawrychowski (2)
  • Simon J. Puglisi (3)

  1. Department of Computer Science, Aalto University, Espoo, Finland
  2. Department of Computer Science, University of Wrocław, Wrocław, Poland
  3. Department of Informatics, King’s College London, London, United Kingdom
