
Compressed Indexes for Aligned Pattern Matching

  • Sharma V. Thankachan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7024)

Abstract

In many applications, such as protein analysis, the primary protein sequence is associated with secondary structure labels [6]. Such data can be treated as two sequences aligned character by character. Many DNA and RNA sequences likewise involve linkages that are aligned within the same strand or across different strands. In this paper, we consider the most natural characterization of aligned string data.

The aligned pattern matching problem is to index two input texts \(T_1[1...n]\) and \(T_2[1...n]\), each having n characters drawn from an alphabet Σ of size σ = polylog(n), such that the following query can be answered efficiently: given two query patterns \(P_1\) and \(P_2\), find all text positions i such that \(P_1\) matches \(T_1[i...(i+|P_1|-1)]\) and \(P_2\) matches \(T_2[i...(i+|P_2|-1)]\). Our objective is to design a compressed-space index for this problem, and we obtain the following main results. When the query patterns are sufficiently long (\(|P_1|, |P_2| > \alpha = \Theta(\log^{2+2\epsilon} n)\), where \(\epsilon > 0\)), we can design an index that takes \(nH_k + nH'_k + o(n\log\sigma)\) bits of space and \(O(|P_1|+|P_2|+\log^{4+4\epsilon} n+t)\) query time, where \(H_k\) and \(H'_k\) denote the empirical kth-order entropy (\(k = o(\log_\sigma n)\)) of \(T_1\) and \(T_2\) respectively, and t is the number of outputs. Further, we show that designing a compressed/succinct-space index with poly-logarithmic query time that works for query patterns of all lengths is at least as hard as designing a linear-space index for 3-dimensional orthogonal range reporting with poly-logarithmic query time. However, we introduce another compressed index, requiring \(nH_k + nH'_k + O(n) + o(n\log\sigma)\) bits of space, with a query time of \(O(|P_1|+|P_2|+\sqrt{nt}\log^{2+\epsilon} n)\), which works without any restriction on the length of the patterns.
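To make the query definition concrete, the following is a minimal brute-force sketch of what an aligned pattern matching query returns. It is not the compressed index described in the abstract: it scans both texts directly in roughly O(n(|P 1| + |P 2|)) time and uses no index at all. The function name `aligned_matches` and the sample strings are illustrative assumptions, not taken from the paper.

```python
def aligned_matches(t1: str, t2: str, p1: str, p2: str) -> list[int]:
    """Naive reference for the aligned pattern matching query.

    Returns every 1-based position i such that p1 occurs in t1 starting
    at i AND p2 occurs in t2 starting at the same position i.
    This is a brute-force baseline, not the paper's compressed index.
    """
    if len(t1) != len(t2):
        raise ValueError("T1 and T2 must have the same length n")
    hits = []
    for i in range(len(t1)):
        # str.startswith(prefix, start) tests for a match at offset i
        # and is False when the pattern would run past the end.
        if t1.startswith(p1, i) and t2.startswith(p2, i):
            hits.append(i + 1)  # report positions 1-based, as in T[1...n]
    return hits


# Hypothetical example: a primary sequence aligned character by
# character with made-up secondary structure labels.
primary = "MKVLAAGIVKVL"
labels = "CCHHHHHHCCEE"
print(aligned_matches(primary, labels, "KVL", "CHH"))
```

Here "KVL" occurs in the first text at positions 2 and 10, but only position 2 also starts an occurrence of "CHH" in the second text, so only that position is reported. A baseline like this is mainly useful for checking the output of a real index on small inputs.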

Keywords

Query Time; Lower Common Ancestor; Marked Node; Valid Output; Orthogonal Range
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Alstrup, S., Brodal, G.S., Rauhe, T.: New data structures for orthogonal range searching. In: FOCS, pp. 198–207 (2000)
  2. Chazelle, B.: Lower bounds for orthogonal range searching: I. The reporting case. JACM 37, 200–212 (1990)
  3. Bender, M.A., Farach-Colton, M.: The LCA Problem Revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg (2000)
  4. Burrows, M., Wheeler, D.J.: A Block-Sorting Lossless Data Compression Algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA, USA (1994)
  5. Chien, Y.F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler Transform: Linking Range Searching and Text Indexing. In: DCC 2008, pp. 252–261 (2008)
  6. Eltabakh, M.Y., Hon, W.-K., Shah, R., Aref, W.G., Vitter, J.S.: The SBC-tree: an index for run-length compressed sequences. In: EDBT, pp. 523–534 (2008)
  7. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. TALG 3(2) (2007)
  8. Ferragina, P., Manzini, G.: Indexing Compressed Text. JACM 52(4), 552–581 (2005); a preliminary version appears in FOCS 2000
  9. Grossi, R., Vitter, J.S.: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing 35(2), 378–407 (2005); a preliminary version appears in STOC 2000
  10. Grossi, R., Gupta, A., Vitter, J.S.: High-Order Entropy-Compressed Text Indexes. In: SODA, pp. 841–850 (2003)
  11. Hon, W.-K., Shah, R., Vitter, J.S.: Ordered Pattern Matching: Towards Full-Text Retrieval. Technical Report TR-06-008, Purdue University (March 2006)
  12. Hon, W.-K., Shah, R., Vitter, J.S.: Space-Efficient Framework for Top-k String Retrieval Problems. In: FOCS 2009, pp. 713–722 (2009)
  13. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: String Retrieval for Multi-pattern Queries. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 55–66. Springer, Heidelberg (2010)
  14. Manber, U., Myers, G.: Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing 22(5), 935–948 (1993)
  15. Munro, J.I., Raman, V.: Succinct Representation of Balanced Parentheses and Static Trees. SICOMP 31(3), 762–776 (2001)
  16. Navarro, G., Mäkinen, V.: Compressed Full-Text Indexes. ACM Computing Surveys 39(1) (2007)
  17. Raman, R., Raman, V., Rao, S.S.: Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets. TALG 3(4) (2007)
  18. Sadakane, K.: Compressed Suffix Trees with Full Functionality. In: TCS, pp. 589–607 (2007)
  19. Weiner, P.: Linear Pattern Matching Algorithms. In: Proc. Switching and Automata Theory, pp. 1–11 (1973)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Sharma V. Thankachan (1)
  1. Department of CS, Louisiana State University, USA
