Suffix Tree of Alignment: An Efficient Index for Similar Data
We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings A and B is a compacted trie representing all suffixes in A and B. It has |A| + |B| leaves and can be constructed in O(|A| + |B|) time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of A and B.
In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of A and B has |A| + l d + l 1 leaves where l d is the sum of the lengths of all parts of B different from A and l 1 is the sum of the lengths of some common parts of A and B. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern P in O(|P| + occ) time where occ is the number of occurrences of P in A and B. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires O(|A| + l d + l 1 + l 2) time where l 2 is the sum of the lengths of other common substrings of A and B. When the suffix tree of A is already given, it requires O(l d + l 1 + l 2) time.
KeywordsIndexes for similar data suffix trees alignments
Unable to display preview. Download preview PDF.
- 1.The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing 467(7319), 1061–1073 (2010)Google Scholar
- 5.Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, California (1994)Google Scholar
- 13.Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. (to appear)Google Scholar
- 19.Navarro, G.: Indexing highly repetitive collections. In: Arumugam, S., Smyth, B. (eds.) IWOCA 2012. LNCS, vol. 7643, pp. 274–279. Springer, Heidelberg (2012)Google Scholar
- 22.Weiner, P.: Linear pattern matching algorithms. In: Proc. of the 14th IEEE Symp. on Switching and Automata Theory, pp. 1–11 (1973)Google Scholar