Compressed Text Indexing with Wildcards

  • Wing-Kai Hon
  • Tsung-Han Ku
  • Rahul Shah
  • Sharma V. Thankachan
  • Jeffrey Scott Vitter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7024)

Abstract

Let \(T=T_1\phi^{k_1}T_2\phi^{k_2}\cdots\phi^{k_d}T_{d+1}\) be a text of total length \(n\), where the characters of each \(T_i\) are chosen from an alphabet \(\Sigma\) of size \(\sigma\), and \(\phi\) denotes a wildcard symbol. The text indexing with wildcards problem is to index \(T\) such that, given a query pattern \(P\), we can locate the occurrences of \(P\) in \(T\) efficiently. This problem has been applied to indexing genomic sequences that contain single-nucleotide polymorphisms (SNPs), because SNPs can be modeled as wildcards. Recently, Tam et al. (2009) and Thachuk (2011) proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only \(nH_h + o(n\log\sigma) + O(d\log n)\) bits of space, where \(H_h\) is the \(h\)th-order empirical entropy (\(h = o(\log_\sigma n)\)) of \(T\).
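
To make the problem statement concrete, the following is a minimal brute-force sketch of the matching semantics only, not the compressed index presented in the paper. It assumes that a wildcard in the text matches any single pattern character, and it uses `?` as a stand-in for the wildcard symbol; the function name `locate` and the sample strings are illustrative assumptions.

```python
from typing import List

WILDCARD = "?"  # assumed stand-in for the wildcard symbol in the text


def locate(text: str, pattern: str, wildcard: str = WILDCARD) -> List[int]:
    """Return all 0-based positions where `pattern` occurs in `text`.

    A wildcard in the text is assumed to match any single pattern
    character. This scans every alignment in O(n * |P|) time and is
    meant only as a reference for what an index must report.
    """
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        window = text[i:i + m]
        # An alignment matches if every text character is either a
        # wildcard or equal to the corresponding pattern character.
        if all(t == wildcard or t == p for t, p in zip(window, pattern)):
            hits.append(i)
    return hits


if __name__ == "__main__":
    sample = "AC?GT??ACG"  # three text segments separated by wildcard runs
    print(locate(sample, "CAG"))
    print(locate(sample, "TA"))
```

A pattern may match entirely inside one text segment, or span one or more wildcard runs, as "TA" does in the sample above. An index for this problem has to report both kinds of occurrence without scanning the whole text.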

References

  1. Belazzougui, D.: Succinct Dictionary Matching with No Slowdown. In: Amir, A., Parida, L. (eds.) CPM 2010. LNCS, vol. 6129, pp. 88–100. Springer, Heidelberg (2010)
  2. Burrows, M., Wheeler, D.J.: A Block-sorting Lossless Data Compression Algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA, USA (1994)
  3. Chien, Y.F., Hon, W.K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler Transform: Linking Range Searching and Text Indexing. In: DCC, pp. 252–261 (2008)
  4. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary Matching and Indexing with Errors and Don’t Cares. In: STOC, pp. 91–100 (2004)
  5. Ferragina, P., Manzini, G.: Indexing Compressed Text. Journal of the ACM 52(4), 552–581 (2005)
  6. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed Representations of Sequences and Full-Text Indexes. ACM Transactions on Algorithms 3(2) (2007)
  7. Ferragina, P., Venturini, R.: A Simple Storage Scheme for Strings Achieving Entropy Bounds. Theoretical Computer Science 372(1), 115–121 (2007)
  8. Grossi, R., Vitter, J.S.: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing 35(2), 378–407 (2005)
  9. Hon, W.-K., Ku, T.-H., Shah, R., Thankachan, S.V., Vitter, J.S.: Faster Compressed Dictionary Matching. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 191–200. Springer, Heidelberg (2010)
  10. Hon, W.K., Lam, T.W., Shah, R., Tam, S.L., Vitter, J.S.: Compressed Index for Dictionary Matching. In: DCC, pp. 23–32 (2008)
  11. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: On Entropy-Compressed Text Indexing in External Memory. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 75–89. Springer, Heidelberg (2009)
  12. Kärkkäinen, J., Ukkonen, E.: Sparse Suffix Trees. In: COCOON, pp. 219–230 (1996)
  13. Manber, U., Myers, G.: Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing 22(5), 935–948 (1993)
  14. McCreight, E.M.: A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM 23(2), 262–272 (1976)
  15. Nekrich, Y.: Orthogonal Range Searching in Linear and Almost-Linear Space. Computational Geometry 42(4), 342–351 (2009)
  16. Raman, R., Raman, V., Rao, S.S.: Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets. ACM Transactions on Algorithms 3(4) (2007)
  17. Lam, T.-W., Sung, W.-K., Tam, S.-L., Yiu, S.-M.: Space Efficient Indexes for String Matching with Don’t Cares. In: Tokuyama, T. (ed.) ISAAC 2007. LNCS, vol. 4835, pp. 846–857. Springer, Heidelberg (2007)
  18. Tam, A., Wu, E., Lam, T.-W., Yiu, S.-M.: Succinct Text Indexing with Wildcards. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 39–50. Springer, Heidelberg (2009)
  19. Thachuk, C.: Succincter Text Indexing with Wildcards. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661, pp. 27–40. Springer, Heidelberg (2011)
  20. Weiner, P.: Linear Pattern Matching Algorithms. In: FOCS, pp. 1–11 (1973)
  21. Ziv, J., Lempel, A.: Compression of Individual Sequences via Variable-Rate Coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Wing-Kai Hon (1)
  • Tsung-Han Ku (1)
  • Rahul Shah (2)
  • Sharma V. Thankachan (2)
  • Jeffrey Scott Vitter (3)

  1. National Tsing Hua University, Taiwan
  2. Louisiana State University, USA
  3. The University of Kansas, USA
